Skip to main content

Stemmers and Lemmatizers in Search Algorithms

Searching / Search Engines are an essential part of every content specific software or web product. The essential part of a good search engine might involve indexing of the content to be searched so that the content can be stored like a hash table to access the content at O(1) complexity.

How do we achieve this indexing? There are two conventional ways to achieve this, which are Forward and Reverse Indexing. While Forward Indexing involves storing the indexes on a sample to token hash table format, Reverse Indexing stores the indexes as token to sample format.

For example if the sentence to be searched is - 'what is love?', the tokens would be 'what', 'is' and 'love'. Now the entry as per forward indexing would store the key as 'what is love?' and the value as a list of tuples 'what', 'is' and 'love', while as per reverse indexing, we would append the keys 'what', 'is' and 'love' withe the sample 'what is love?'.

One major problem to address while indexing is the generation of correct tokens. Finding the root word of each tuple is quite important as it reduces the index size drastically, say equal, equals and equally can be just stored in one root word equal. Here Stemmers and Lemmatizers come into picture.

Stemming strips down a word into a root word using standard english patterns and does it's job pretty efficiently most of the times. It is fast and involves less computation and is widely used in search algorithms.

Lemmatizers are used where a more precise result is required while finding out the root words, where the  number of keys and data size is a constraint and we need more realistic keys. Lemmatizers are a common concept or branch of Natural Language Processing which intelligently find out the root words of the tokens using heuristics and stored language dictionaries. There are various dictionaries being built for various languages which are pretty efficient.

One of the good implementations I experienced which is still at a nacent stage in building it's English dictionary is the Lemmagen. Extremely efficient, pretty good algorithms used and easily portable via platform invokes / c-library. Completely written in native C++ without using any external libraries. I have written a python-usable exposed makefile and other stuff which creates a shared object of this project when complied and exposes its lemmatization functions to the python interpreter. The code for which can be found HERE.

Θ Ω Sushant ♂


Popular posts from this blog

Startup founders cheatsheet (Chief product officer)

Define your goals  The basic definition of "mission" and "vision" of the company is critical when we've past the stage of experimenting with the startup's model and helps you prioritise better. From the Expedia page it looks like this: "Our Mission is to Revolutionise Travel Through the Power of Technology", in this case, it is also helping the company know that leveraging and scaling with "technology" is imperative for the company (along with operations). 
Set 2-3 basic targets for the next 2-3 months (possibly 6 months) These might be pretty standard and should align with the mission / vision of the company. If your company wants to be the #1 company in Asia for travel, these basic targets might be: Scale to 100,000 app downloads across platformsIncrease revenue by 15%Increase daily unique visitors to 3 times the current value 
Image Source: Mind the product
Goals help define epics and create a huge backlog Now brainstorming how the targe…