Tayzwi

Should be reading more and writing less, but well...

Wednesday, November 17, 2010

 

So much to learn, ponder about, and do....

And so much time to do it.

Scott Aaronson's "rant" on naysayers of complexity theory got me pondering.

Over the last few months, my general scientific pondering has suffered at the hands of real work, startup life routine, system building, and a general state of being occupied by the trite. It's fascinating, but not inspiring. The days are exciting, but not ecstatic. The nights are cathartic, but not liberating.

I miss my aimless pondering days.

I was missing this life then.

So it goes.



Monday, June 22, 2009

 

Genius

The Feynman Method

Richard Feynman was fond of giving the following advice on how to be a genius. You have to keep a dozen of your favorite problems constantly present in your mind, although by and large they will lie in a dormant state. Every time you hear or read a new trick or a new result, test it against each of your twelve problems to see whether it helps. Every once in a while there will be a hit, and people will say, “How did he do it? He must be a genius!”



Thursday, May 03, 2007

 

Counter Example

Finding counter examples to conjectures can be notoriously hard (pun? I think not). This is an area of creativity that mostly goes unappreciated.

Here's a personal anecdote: my father once came up with an algorithm to solve a hugely constrained version of the traveling salesman problem. The greedy proof was slightly hand-wavy, and I felt it would be easier to find a counter example - an input on which his algorithm wouldn't find the optimal tour. Of course, I was just trying to tell him that he couldn't have solved TSP (or even approximated it). I learnt two lessons that day.

Lesson One: One must always speak sweetly, because one underestimates the number of times one will have to eat one's own words.

Lesson Two: Finding counter examples can get quite tricky - and if I may, I would admit that it's not just tricky, it's quite hard - it requires tons of patience and a deep understanding of both the problem and the algorithm we are out to disprove. I had learnt a similar lesson earlier in Sundar's Approximation Algorithms class. Sundar let us spend one hour finding a counter example to show that a minimum spanning tree over the terminal vertices need not be a minimum Steiner tree. The counter example used a 'construct' that was quite simple, and took a few minutes of dedicated thought to find. But what amazed me today was this fact about Tait's conjecture:
Tait's conjecture states that "Every polyhedron has a Hamiltonian cycle (along the edges) through all its vertices". It was proposed in 1886 by P. G. Tait and disproved in 1946, when W. T. Tutte constructed a counterexample with 25 faces, 69 edges and 46 vertices.
Imagine the patience, creativity, deep understanding of the problem, and [least appreciated of them all] the ability to borrow from related problems and areas that it takes to come up with such a counter example. And I keep wondering: why research?
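Going back to Sundar's Steiner exercise: the standard construction (which may or may not be the exact one we found in class) can be checked with a few lines of arithmetic. Here is a minimal sketch; the coordinates and the single Steiner point are my own choices for illustration.

```python
# Three terminals at the corners of a unit equilateral triangle. The minimum
# spanning tree over the terminals alone has length 2, but adding one Steiner
# point (the centre of the triangle) gives a tree of length sqrt(3) ~ 1.732,
# so an MST over the terminal set is not a minimum Steiner tree.
import math

terminals = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
steiner_point = (0.5, math.sqrt(3) / 6)  # centroid, which is also the Fermat point here

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

mst_over_terminals = dist(terminals[0], terminals[1]) + dist(terminals[1], terminals[2])
steiner_tree = sum(dist(t, steiner_point) for t in terminals)

print(mst_over_terminals)  # 2.0
print(steiner_tree)        # ~1.732
```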



Thursday, November 30, 2006

 

Eigen

The inimitable quip-fuckin'-master forced a comment which is worth expanding a little more.

Ever since I found that the PageRank vector of the web-graph is the dominant eigenvector of its matrix representation, I have been meaning to get to the bottom of this eigenvector-eigenvalue concept. I am still snorkeling; long time to scuba dive.

Most of us studied concepts like simultaneous equations in our high school algebra classes, but never really wondered about them deeply, much less felt that they were difficult. Problems like - 3 oranges and 4 apples cost 17 rupees; 2 oranges and 3 apples cost 12 rupees; how much does each apple cost? - never seemed that difficult. We knew that the equations representing these statements were 'fluid': they need not apply to just oranges and apples, and an overall tool was being mastered.
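Just to make the apples-and-oranges problem concrete: the whole thing is two equations in two unknowns, which any linear algebra library will dispatch in one call. A throwaway sketch (numpy is my choice here, purely for illustration):

```python
import numpy as np

# 3 oranges + 4 apples = 17 rupees; 2 oranges + 3 apples = 12 rupees
A = np.array([[3.0, 4.0],
              [2.0, 3.0]])
b = np.array([17.0, 12.0])

orange, apple = np.linalg.solve(A, b)
print(orange, apple)  # 3.0 2.0 -- an orange costs 3 rupees, an apple 2
```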

Somehow, the same never happened to tools like primal-dual, or eigenvalues-eigenvectors, or other tools from linear algebra, statistics, calculus, etc. Somewhere, the learning process got fucked because of an emphasis on just the concept, and a palpable lack of intuitive visualization. When people talk about a matrix, I never think in terms of transformations. If I did, I could see that some transformations would leave some 'victims' unchanged. Those victims are the eigenvectors of that transformation. Maybe they could change in terms of scale, but not in terms of what they fundamentally are. Here is an example from the Wikipedia entry on Eigenvalue:

As the Earth rotates, every arrow pointing outward from the center of the Earth also rotates, except those arrows that lie on the axis of rotation. Consider the transformation of the Earth after one hour of rotation: An arrow from the center of the Earth to the Geographic South Pole would be an eigenvector of this transformation, but an arrow from the center of the Earth to anywhere on the equator would not be an eigenvector. Since the arrow pointing at the pole is not stretched by the rotation of the Earth, its eigenvalue is 1.
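The Earth example is easy to check numerically. A minimal numpy sketch (one hour of rotation is 15 degrees about the z-axis; the two vectors are my own stand-ins for the arrows):

```python
import numpy as np

theta = 2 * np.pi / 24  # one hour of the Earth's rotation, about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

pole = np.array([0.0, 0.0, -1.0])     # centre-to-South-Pole arrow
equator = np.array([1.0, 0.0, 0.0])   # centre-to-equator arrow

print(R @ pole)     # unchanged: an eigenvector, with eigenvalue 1
print(R @ equator)  # rotated away from itself: not an eigenvector
```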

So, was Neo an Eigenvector of The Matrix? I wonder...

As for what it means when they say that PageRank is the dominant eigenvector of the web-graph, we have to visualize the web-graph's matrix representation as a transformation. The matrix representation has all web-pages in rows and columns, and if the web-page in row i has a hyperlink to the web-page in column j, the [i, j] entry is 1, else 0 (in practice the entries are normalized by out-degree and damped, but the 0/1 picture is enough here). What does it mean to say that this matrix is a transformation? And what does it mean to say that it can act on 'victims' to change them? And if the 'victim' happens to be the vector of the pages' PageRank values, the transformation leaves it unchanged. What does that say about PageRank, and how is that related to our intuitive perception of PageRank as the importance of each page on a global scale?
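One way to poke at these questions is to just run the transformation on a toy web and watch it converge. A minimal power-iteration sketch, assuming the usual out-degree normalization and a 0.85 damping factor (the three-page graph is made up):

```python
import numpy as np

# Toy web: page 0 links to 1 and 2, page 1 links to 2, page 2 links back to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
n, d = 3, 0.85                       # d is the usual damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

G = d * M + (1 - d) / n              # the damped "Google matrix"
r = np.full(n, 1.0 / n)              # start from a uniform rank vector
for _ in range(100):                 # apply the transformation over and over
    r = G @ r

print(r)  # the fixed point: the dominant eigenvector, i.e. the PageRank values
```

The vector that the repeated transformation stops changing is exactly the 'victim' the matrix leaves alone, which is all the eigenvector statement is saying.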

We will get to primal-dual some other time. My head hurts. It hurts.

ps: And these are all just tools; albeit mental in nature. If we don't master them, well, it's ok I guess. We can always go back to intellectual stone age, or whatever was there before that. I mean, I always fancied myself as a Cro-Magnon.



Tuesday, January 03, 2006

 

Quantum Brain?!?!

In my last post on heuristics and approximation, I asked the question - Given realistic scale and scope, how does the human brain solve such problems (like question-answering)? Does the brain map everything to combinatorial optimization? Can its modus operandi be applied to automated systems so that they solve our problems.... - which leads to further interesting exercises in intellectual masturbation.

Combinatorial optimization is about finding the optimal (best/worst) candidate (based on certain acceptable parameters) among all possible options. A large category of problems can be mapped to combinatorial optimization, and most of these are NP-hard. This means that most of them have no known efficient solutions.

But we see humans, with some training and experience, solve many combinatorial optimization problems efficiently using heuristics and instinct. Chess, Go, Sudoku, etc., come to mind. In these games, the human has to search the solution space efficiently to find the correct board/grid position. And we know that humans do not do a brute force search. There is some internal process that short-circuits the search.....somehow.

Now, this somehow-situation could be explained by the following possibilities:
  1. The human brain has a clever technique to narrow the search space and come up with exact solutions. This would mean that P==NP, and computer scientists still haven't found a proof for this. A proof would simply be an efficient way of solving one of the NP-hard problems (which, according to this bullet, the human brain is doing right now). We just need to know this purported technique formally to claim P==NP. This situation is highly unlikely because computer scientists have been trying since the 1960s to find such a procedure, in vain. Of course, their fastidious nature is also making them look for a proof that such an efficient procedure cannot exist (P!=NP). Whether the latter will be proven, or whether the question itself is beyond answer - only time will tell. But the former is very unlikely to be true.
  2. The human brain has come up with incredibly clever heuristics that approximate the search for the exact solution, and give us only approximate solutions. But these are so good that they look exact almost always. The fact that Kasparov got defeated by a computer shows that his chess heuristics were not infallible, and the computer's brute force search was much better (after all, chess is played on an 8x8 board). Go and Bridge players still kick the machine's ass because of their games' bigger search spaces. This seems to be our Winston Wolfe (see Pulp Fiction) for the somehow-situation.
  3. The outlandish situation that my romantic mind wants to use to model itself is a trifle too....er....outlandish. Let me get there. Factoring a large number into its prime factors is believed to be a computationally hard problem. It's unlikely that a clever technique exists to find the exact factors without going through brute force division. But Peter Shor, in 1994, proved that quantum computers can solve the Factoring problem efficiently! Now, quantum computers are still theoretical models which might never get built. Shor and co. coming up with algorithms that work on them motivates physicists and engineers to hopefully come up with a working model of the quantum computer in the future. Now, my outlandish conjecture is that the human brain already has a quantum-computer-like mechanism built into it, which enables it to solve certain combinatorial problems quickly. Modern physics accepts the reality of quantum mechanical concepts like superposition in nature. Why is it not possible that such features already physically exist in our brains, and this quantum computing brain is solving problems using quantum algorithms? Is the human brain a realistic quantum computer? Ummmm.....
It's quite possible that a combination of these situations exists in reality, and there might be more such parameters that actually govern the way we think. Will we ever truly understand ourselves scientifically? Or, in the end, is it going to be just spiritual enlightenment about the Self? How anti-climactic would that be.

Anti-climax won't do. We need 'em real. That was a long session. Now for the money shot. [insert appropriate onomatopoeia here, in uppercase]



 

Heuristics vs. Provability

Given a problem, we can:
  1. Give efficient methods to find exact solutions. When this is possible, all's well. But alas, this is not possible all the time.
  2. Give inefficient methods to find exact solutions (brute force), and bemoan that P != NP (mostly).
  3. Propose intuitively appealing heuristics that are quick, and give "mostly good" results in practice, and hope that the malicious worst case input which can tear the heuristic apart never comes about.
  4. Prove theoretical approximation guarantees about these heuristics so that even in the worst case, the heuristic won't go bad beyond some known bound.
Personally, I am fascinated by #4. Most current large scale problems rely on #3, which hopefully will get elevated to #4 in a few years' time, and theoreticians will tell the practitioners - "Yeah, your heuristic will never give a solution that is off the optimal by more than 200%." As a budding researcher, is it better to build simulations, concrete systems, models, etc., and show empirically that certain heuristics work most of the time? Or is it better to take some small heuristic and prove beyond doubt its effectiveness under all circumstances?

The latter appeals to me because of its no-nonsense, matter-of-fact, precise nature. But I can also see how we cannot do without the former. The former (heuristics) gets built to tackle real world problems. Theoreticians follow up later by giving provable approximation guarantees for these heuristics. Combinatorial optimization is rife with this pattern. What about other heuristics though?
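As a concrete instance of that pattern (before getting to heuristics outside combinatorial optimization), here is the textbook example: the maximal-matching heuristic for minimum vertex cover, which is dead simple and still comes with a provable factor-2 guarantee. A toy sketch of my own:

```python
def vertex_cover_2approx(edges):
    """Greedy maximal-matching heuristic for vertex cover: provably within 2x of optimal."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))   # take both endpoints of an uncovered edge
    return cover

# The chosen edges share no endpoints, and any cover (including the optimal one)
# must contain at least one endpoint of each chosen edge, so |cover| <= 2 * OPT.
print(vertex_cover_2approx([(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]))
```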

Take the everyday example of Question Answering. Given a question, and an essay that might contain the answer to the given question, a reasonably educated person can extract an answer without much difficulty. But this simple task is extremely hard for an automated system. It can manage questions of the type "how much," "what," and "when," but will get flummoxed by questions of the type "why" and "how." Practitioners in the area of Natural Language Processing come up with heuristics that attempt to answer these tough questions, but there is no provable guarantee that these heuristics can answer all questions correctly all the time.

Can problems from the "Human Domain" be modeled fully as combinatorial problems so that we can hopefully give some approximation guarantee? Given realistic scale and scope, how does the human brain solve such problems (like question-answering)? Does the brain map everything to combinatorial optimization? Can its modus operandi be applied to automated systems so that they solve our problems while we attack other problems that right now are complicated even for the human mind?



Saturday, April 02, 2005

 

Traveling Salesman

The traveling salesman problem (TSP) is a very simple combinatorial optimization problem. Given a set of cities and the distances between each pair, the problem is to find the tour of least total distance that visits each city exactly once. Simple - as in simple to state. But from what I have seen, this seems to be the hardest of the hard problems to solve.

To give some background, there are these famous hard problems in Computer Science that do not have known "efficient" solutions. Brute force solutions are easy to write for them, but take an eternity to run. In spite of 40 years of extensive research by some very smart people, nothing fundamentally better than brute force has been found. For example, the TSP can be solved by listing all possible tours covering the cities once, and picking the one with the least length. This takes an eternity, as there are n! such tours, which grows even faster than exponentially. These problems are the (in)famous NP-hard problems. Stephen Cook at the University of Toronto proved that if one of them can be solved efficiently, all of them can be!!! A very intuitively pleasing fact.
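To make the brute force claim concrete, here is a minimal sketch (my own toy code, with a made-up distance matrix). Fixing the starting city still leaves (n-1)! tours to try, which is hopeless well before n reaches 20 (19! is about 1.2e17):

```python
from itertools import permutations

def brute_force_tsp(dist):
    """Try every tour that starts and ends at city 0; (n-1)! candidates."""
    n = len(dist)
    best_tour, best_len = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

# A toy 4-city instance (symmetric distances).
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(brute_force_tsp(dist))  # ((0, 1, 3, 2, 0), 18)
```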

Tragically, most combinatorial optimization problems in nature turn out to be NP-hard. As 40 years of research hasn't found a good solution to these problems, people have begun to approximate the solutions. Say, for TSP, if the best tour is around 100 KM, a procedure that finds a tour of 150 KM would be considered good. Of course, if the best tour is around 5000 KM, the same procedure should return a tour of length at most 7500 KM. This means that an approximation procedure for TSP should return a solution that is always within some fixed factor of the best solution. Most of the NP-hard problems have reasonable approximation algorithms; the general TSP does not!!

It is proven that (unless P = NP) you cannot invent a procedure that takes an arbitrary set of cities and always gives a tour within some fixed factor of the optimal solution. TSP seems to be very very very very hard. But of course, most of this theory would collapse if someone comes up with a good way to solve one NP-hard problem.

In spite of my being really really really bad at this academically, I am completely enamored of the theory of NP-completeness. Somehow this notion of possible-but-not-feasible makes it intriguing. It is incredibly fascinating to see how all NP-hard problems are closely related to each other, and solving one solves them all, but approximating one doesn't approximate them all. Seductive!!

Reminds me of Dijkstra's famous quote - "Computer Science is as much about Computers as Astronomy is about Telescopes."



Saturday, February 26, 2005

 

the What?

I have never heard the Who's music; and scientists have always wondered about the Why's; Science mostly explains the How's; Today's question is the What.

On first glance, the What appears to be tame compared to its illustrious peers. While beginning Jaynes's tome on probability, I came across this quote, said around 1948 and attributed to von Neumann - "For those who believe that computers cannot do all that we humans can; please explain in finite, precise steps, what it is that you can do; I will make the computer do precisely that."

This is precisely the importance of the What. What is it that we are doing here? What is it that we want? What is it that happens when we breathe? What is emotion? What is thought? What is pride? What is light? What is everything....

A What can mostly be answered by enumerating all the members of the answer set, and hopefully some abstraction will come out of it so that the next time, we use the abstraction instead of the enumeration. But the first time, or until the abstraction comes out, we are stuck with the enumeration of all members of the answer set. This is precisely where things might get out of hand. We might not be able to enumerate everything that we think contributes to the answer to a single What. There might be just too many of them.

Let me elaborate on that with an example. If someone asks me - What is poetry? - I can either come up with an abstract answer which encapsulates all forms of poetry, all poems ever written, all poems that will be written in the future, all poems' purposes, all this, for all languages. The abstract encapsulation of these needs to be simple, concise, and should give a precise description of poetry.
OR
I can simply enumerate ALL possible poems there are, all of them, in a set, and give the enumeration set as the answer to "What is poetry?".

It seems that for all the NP-complete problems, we are stuck with the enumeration till the abstraction for all enumerations is found. We don't even know if such an abstraction exists. This is the famous P vs NP question, and well, there is a chance that this question itself is beyond answer.

My next post will elaborate on my current thoughts on the relevance of the P vs NP question in the eternal human-vs-machine debate.

ps - while writing about that poetry example, I remembered that there was something called Information Complexity which I had read about somewhere, and here it is. It goes by the name of Kolmogorov Complexity.
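A small aside on Kolmogorov complexity: it is uncomputable in general, but compressed length is the usual poor man's proxy for it, and it illustrates the abstraction-versus-enumeration idea nicely on strings rather than poems. A toy sketch (zlib here is just a convenient stand-in, not the real measure):

```python
import os, zlib

patterned = b"ab" * 500        # has a short description: "'ab' repeated 500 times"
random_ish = os.urandom(1000)  # almost certainly has no description shorter than itself

print(len(zlib.compress(patterned)))   # tiny: the abstraction was found
print(len(zlib.compress(random_ish)))  # ~1000 or more: stuck with the enumeration
```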



Thursday, February 17, 2005

 

Anchor Text and Focused Crawling

It's been a while since I blogged anything technical.

These days, I am working on the open source search engine, Nutch. Before I get into what I am doing, let me explain why, in the last sentence, I put the phrase "open source search engine" inside the link's anchor text. Search engines use anchor text extensively to figure out what a page is about. For example, the home page of Tejaswi doesn't have the phrase "home page" anywhere. So, by looking at the anchor text of all the in-links to a page, the search engine figures out what the content of the page might be about. This is a latent way of identifying the content of a page: by looking at what the in-links call it. Now, when I say "the open source search engine Nutch" in the anchor text and link to nutch.org, that phrase gets associated with the site, and helps someone who is searching for an open source search engine but has no clue about Nutch itself.

Currently, I am working on the crawler part of the search engine. The crawler/spider is an offline process that goes all over the web and gets pages for the search engine to index. The idea is to start the crawler with a set of seed pages. The crawler then indexes the textual content of each page, and recursively crawls each page's out-links. This goes on ad infinitum. This part is pretty standard, and is already implemented. My job is to ensure that the crawl is not ad-hoc, i.e., not all out-links are crawled. I am trying to "focus" the crawl so that only pages pertinent to certain topics get crawled, and subsequently indexed. Topics like "cycling", "art cinema", "photography", "BDSM", etc. Why do we need to focus a crawl?

Google currently claims that it indexes 8 billion webpages. According to recent estimates, un-indexed pages outnumber indexed pages by a factor of 4-5. This means that there are at least 32 billion pages out there that Google could index, but is not indexing. Why not? Well, for one, more pages doesn't necessarily mean better search results. A good number of pages representing a broad range of topics means better search results. This is where a focused crawl might be preferred over an ad-hoc crawl. If you are really interested, take a look at my advisor's Focused Crawling page for more information.
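Mechanically, the focusing is just a pruned version of the standard crawl loop described above. Here is a toy sketch of the shape of it; this has nothing to do with Nutch's actual code, and fetch_outlinks and is_on_topic are hypothetical helpers (a fetcher/parser and a topic classifier) standing in for the real components:

```python
from collections import deque

def focused_crawl(seeds, fetch_outlinks, is_on_topic, limit=1000):
    """Breadth-first crawl that only indexes and expands pages deemed on-topic.

    fetch_outlinks(url) -> (page_text, list_of_outlink_urls)   # hypothetical fetcher/parser
    is_on_topic(page_text) -> bool                             # hypothetical topic classifier
    """
    seen, frontier, indexed = set(seeds), deque(seeds), []
    while frontier and len(indexed) < limit:
        url = frontier.popleft()
        text, outlinks = fetch_outlinks(url)
        if not is_on_topic(text):
            continue                      # prune: don't index it, don't follow its links
        indexed.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed
```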

In other news, read Jeremy Zawodny's post on Mark Jen to know about the Google employee who got fired for blogging some company internals. All corporate bloggers out there....you reading this?


