Friday, March 30, 2012

Stemmers and Lemmatizers in Search Algorithms

Searching / Search Engines are an essential part of every content specific software or web product. The essential part of a good search engine might involve indexing of the content to be searched so that the content can be stored like a hash table to access the content at O(1) complexity.

How do we achieve this indexing? There are two conventional ways to achieve this, which are Forward and Reverse Indexing. While Forward Indexing involves storing the indexes on a sample to token hash table format, Reverse Indexing stores the indexes as token to sample format.

For example if the sentence to be searched is - 'what is love?', the tokens would be 'what', 'is' and 'love'. Now the entry as per forward indexing would store the key as 'what is love?' and the value as a list of tuples 'what', 'is' and 'love', while as per reverse indexing, we would append the keys 'what', 'is' and 'love' withe the sample 'what is love?'.

One major problem to address while indexing is the generation of correct tokens. Finding the root word of each tuple is quite important as it reduces the index size drastically, say equal, equals and equally can be just stored in one root word equal. Here Stemmers and Lemmatizers come into picture.

Stemming strips down a word into a root word using standard english patterns and does it's job pretty efficiently most of the times. It is fast and involves less computation and is widely used in search algorithms.

Lemmatizers are used where a more precise result is required while finding out the root words, where the  number of keys and data size is a constraint and we need more realistic keys. Lemmatizers are a common concept or branch of Natural Language Processing which intelligently find out the root words of the tokens using heuristics and stored language dictionaries. There are various dictionaries being built for various languages which are pretty efficient.

One of the good implementations I experienced which is still at a nacent stage in building it's English dictionary is the Lemmagen. Extremely efficient, pretty good algorithms used and easily portable via platform invokes / c-library. Completely written in native C++ without using any external libraries. I have written a python-usable exposed makefile and other stuff which creates a shared object of this project when complied and exposes its lemmatization functions to the python interpreter. The code for which can be found HERE.

Θ Ω Sushant ♂


  1. When we are used a search engine, we should have an idea about its rules. Thank you telling us about its regulation. Similarly, I am giving you USA SEO Company service to you. You can rely on our experts.

  2. I appreciated the level of detail and the real-world examples used in this article, making it easy to understand and apply the information. financial education services

  3. Assuming you have previously dominated every one of the techniques that make up SEM and have a refined framework through which to convey them, you may not require a SEM master. RyanCameron


Startup founders cheatsheet (Chief product officer)

Define your goals  The basic definition of "mission" and "vision" of the company is critical when we've past the...