Friday, March 30, 2012

Stemmers and Lemmatizers in Search Algorithms

Searching / Search Engines are an essential part of every content specific software or web product. The essential part of a good search engine might involve indexing of the content to be searched so that the content can be stored like a hash table to access the content at O(1) complexity.

How do we achieve this indexing? There are two conventional ways to achieve this, which are Forward and Reverse Indexing. While Forward Indexing involves storing the indexes on a sample to token hash table format, Reverse Indexing stores the indexes as token to sample format.

For example if the sentence to be searched is - 'what is love?', the tokens would be 'what', 'is' and 'love'. Now the entry as per forward indexing would store the key as 'what is love?' and the value as a list of tuples 'what', 'is' and 'love', while as per reverse indexing, we would append the keys 'what', 'is' and 'love' withe the sample 'what is love?'.

One major problem to address while indexing is the generation of correct tokens. Finding the root word of each tuple is quite important as it reduces the index size drastically, say equal, equals and equally can be just stored in one root word equal. Here Stemmers and Lemmatizers come into picture.

Stemming strips down a word into a root word using standard english patterns and does it's job pretty efficiently most of the times. It is fast and involves less computation and is widely used in search algorithms.

Lemmatizers are used where a more precise result is required while finding out the root words, where the  number of keys and data size is a constraint and we need more realistic keys. Lemmatizers are a common concept or branch of Natural Language Processing which intelligently find out the root words of the tokens using heuristics and stored language dictionaries. There are various dictionaries being built for various languages which are pretty efficient.

One of the good implementations I experienced which is still at a nacent stage in building it's English dictionary is the Lemmagen. Extremely efficient, pretty good algorithms used and easily portable via platform invokes / c-library. Completely written in native C++ without using any external libraries. I have written a python-usable exposed makefile and other stuff which creates a shared object of this project when complied and exposes its lemmatization functions to the python interpreter. The code for which can be found HERE.

Θ Ω Sushant ♂

No comments:

Post a Comment

Startup founders cheatsheet (Chief product officer)

Define your goals  The basic definition of "mission" and "vision" of the company is critical when we've past the...