Latent semantic indexing (LSI) has been getting more discussion lately, including the assertion that it is not, never was, and never will be in use at the major search engines. I disagree - to a point - and I believe Google itself shed some light on this just a few weeks ago.
I’m referring to this announcement from Google on March 24, 2009 with the headline “Google search gets semantic”. I hope you’ll read it. A few snippets that might interest you:
“Google on Tuesday rolled out semantic search capabilities in 37 languages.” and
“We’re deploying a new technology that can better understand associations and concepts related to your search”.
Microsoft has also been testing a semantic search engine.
Before I go any further I’d like to say that I think part of what leads to disagreement is the term ‘LSI’ itself. It may be a matter of…well…semantics. LSI has been around for decades in relation to document analysis and information retrieval. It was not developed with search engines in mind. When we use the term LSI in relation to SEO we are really referring to artificial intelligence technologies that search engines are developing in order to put some of the principles of LSI into action within their algorithms. Whether you call it LSI, themes, NLP (natural language processing) or something else, it is coming into play.
One of the arguments I recently read is that LSI is not workable with a database the size of Google’s. True enough, but I haven’t heard anyone say that Google uses the original LSI technology and nothing else. The claim is rather that some concepts of LSI are being employed as part of the algorithm, along with other factors such as internal and external link analysis. Referential integrity has been called the real secret factor you need to know. In this context it appears to be a recap of important aspects of search algorithms, in particular link reputation, and how to take advantage of them. Very useful information, but nothing new. (Referential integrity in its original meaning didn’t have anything to do with SEO either.)
Another argument: Google doesn’t return the same search results for singular and plural terms, which it should recognize as having the same meaning, so it must not be using LSI. That’s an oversimplification that doesn’t reflect the depth of this type of artificial intelligence. Let’s look at a better example, this one from Google: “A Google search in English for ‘principles of physics’ triggers suggestions to inquire about ‘big bang’ and ‘quantum mechanics.’” Clearly the point is not whether Google knows that principle and principles are related, but rather that it knows (or more accurately appears to know) that quantum mechanics and physics are related.
I have no doubt that an understanding of semantic search technologies will help you in your search engine marketing efforts. But it is not a silver bullet and will not, in and of itself, lead to top rankings. It is a part of the puzzle, and one that I believe will increase in importance. But you need to know a lot of other things too. Many variables are examined by each search engine and factored into the ranking process.
We’ll be watching the developments in LSI/NLP-type technologies as they relate to search engines and will continue to provide our advanced SEM students with information on this and related concepts.
As for a new name for LSI as it relates to search engines, how about Conceptual Search Technology or Semantic Search Intelligence? Someone should run a contest.
Tags: Latent Semantic Indexing, LSI

Herbert Roitblat
April 24, 2009
Google started using LSI a few years ago on a rather small scale. The demonstrations that used to work (a search for phone~) don’t seem to work now.
It is true that LSI cannot possibly scale to Google’s full scope, but it does not have to in order to be useful. At the time they were using it for a rather small (by Google standards) subset of vocabulary.
What LSI does is take advantage of the fact that words are not randomly distributed across documents. A document with the word baseball in it is much more likely to have words like strike, field, pitch, batter, etc. in it than one that has the word butterfly in it. LSI uses a mathematical technique to reduce (project is probably a better word) words onto meaning scales that can then be used to retrieve documents, even if they don’t have the key word in them, but do have a lot of the other words.
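To make the idea concrete, here is a minimal sketch in Python, assuming NumPy is available. It is a textbook-style illustration of LSI (a truncated singular value decomposition of a term-document matrix), not a description of what Google or Truevert actually runs. The four toy documents, the choice of two "meaning scales" (k = 2), and the helper names term_vector, project, and cosine are all made up for this example.

```python
import numpy as np

# Toy corpus (hypothetical). The second document never uses the word "baseball".
docs = [
    "baseball pitch batter strike field",
    "pitch batter strike field inning",
    "butterfly wing flower garden nectar",
    "butterfly flower garden caterpillar",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def term_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-document matrix: one row per word, one column per document.
A = np.column_stack([term_vector(d) for d in docs])

# Truncated SVD: keep only the k strongest "meaning scales".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumption: two topics are enough for this toy corpus
Uk, sk = U[:, :k], s[:k]

def project(text):
    """Fold a text into the k-dimensional latent space."""
    return (Uk.T @ term_vector(text)) / sk

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

query = "baseball"
q_raw, q_lat = term_vector(query), project(query)
doc_latent = [project(d) for d in docs]

# Plain keyword matching versus matching in the latent space.
raw_scores = [cosine(q_raw, A[:, j]) for j in range(len(docs))]
lsi_scores = [cosine(q_lat, doc_latent[j]) for j in range(len(docs))]

for d, r, l in zip(docs, raw_scores, lsi_scores):
    print(f"raw={r:.2f}  lsi={l:.2f}  {d}")
```

With plain keyword matching, the second document scores zero for the query "baseball" because the word never appears in it. In the latent space it scores close to 1, because it shares pitch, batter, strike, and field with the document that does contain the word, while the butterfly documents stay near zero.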
LSI does not scale well, does not generally work with continuous updates, and is computationally very expensive. Since the introduction of LSI in the 1980s there have been a number of advances, including probabilistic LSI and other tools, but it still suffers from some practical limits.
We have used related technology that avoids many of these limitations as the basis for Truevert (truevert.com). Truevert is a semantic search engine. This version of Truevert is focused on green things and interprets queries from a green perspective. Like LSI, it is largely statistical and takes advantage of the nonrandom distribution of words, but unlike LSI, it is very fast and can be continuously updated.
Our approach is to focus on verticals, where the power of redundancy is maximized. If you search on Truevert for CFL, it gives you pages about compact fluorescent light bulbs, not the Canadian Football League.