Sunday, 6 April 2014

Heads up tips on Boolean Retrieval

Here's an excellent page to get started: http://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html

Now, here are some heads up tips:

1. In Fig 1.1, we see a natural boolean representation of the postings list. Each row contains a binary 1-0 sequence, with '1' telling us that the term is present in the document and 0 suggesting that it is not present.

As we typically work on billions of documents on the web, it is a great idea to employ memory optimized solutions. One such solution is to store only the documents that contain the term in the postings list. 


2. When we evaluate a query, for example, Brutus AND Caesar AND Calpurnia,  it is a great idea to evaluate Brutus AND Calpurnia first. As we can see, the documents containing all three keywords will be best determined by the one that occurs least frequently. This way, we can eliminate irrelevant documents rather smartly in the first step itself.

3. How about evaluating (Brutus OR Calpurnia)?

An OR operation deserves some special attention to itself. Of course, it simply evaluates a Union of the two postings lists. Here is a smart way of achieving the same result.

Set either of the two postings lists as the solution,  in technical terms, initialize the solution as any one of the two postings lists. Now, you could use a set type for storing the solution. Then traverse the other posting list from start to end and add it to the solution set. When you will have traversed the entire second list, you get the solution set. The use of set type eliminates the possibility of storing the same document id more than once. 

The above method of computing Unions also finds an interesting application in the polynomial addition.