SMB 'Find me in Google' Tip: Key in on Phrases of Keywords

Monday, October 30, 2006
Posted by Brawlin Melgar

SEO - Phrase Based Optimization

by David Harry

The main goal of this document is to give SEO enthusiasts a stronger grasp of how phrases are handled by search engines, to help you better target and optimize your web sites. The theories and information apply to keyword/phrase research and content creation and, to a lesser extent, to developing back link text.

The crux of the piece is based on an analysis of an existing Google patent on ‘Phrase based searching’ (see Resources at the end). That is as far as I shall go on the original patent, since dwelling on it can lead to assumptions about what may or may not be used in their indexing and retrieval processes (algorithms). Just because they filed the patent doesn’t necessarily mean they have implemented it. I feel the main point here is to get a better idea of HOW search engineers think and WHAT may be in place now, or in future search technologies.

Why Phrase based indexing and retrieval is important

The problem facing search engines is that direct "Boolean" matching of query terms is known to have its limitations. One problem is that it doesn’t identify documents that lack the query terms but contain related words. It is a very tightly defined result. A search on "Florida Snakes" doesn’t return results related to local species (Black Pine, for example). Conversely, it is likely to retrieve and highly rank documents that relate only to ‘Florida’ rather than to the desired or intended query.
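To make the limitation concrete, here is a minimal sketch of Boolean term matching over a toy corpus. The documents, their wording, and the function names (boolean_and, boolean_or) are invented for illustration; no real search engine is this simple.

```python
# Toy corpus; the documents and their wording are invented for illustration.
DOCS = {
    "d1": "florida snakes are common near wetlands",
    "d2": "the black pine snake is a nonvenomous species",
    "d3": "florida hotels and florida beaches",
}


def _tokens(text):
    """Split text into a set of lowercase whitespace-separated terms."""
    return set(text.lower().split())


def boolean_and(query, docs):
    """Return ids of documents that contain every query term."""
    terms = _tokens(query)
    return sorted(d for d, text in docs.items() if terms <= _tokens(text))


def boolean_or(query, docs):
    """Return ids of documents that contain at least one query term."""
    terms = _tokens(query)
    return sorted(d for d, text in docs.items() if terms & _tokens(text))


strict_hits = boolean_and("Florida Snakes", DOCS)
loose_hits = boolean_or("Florida Snakes", DOCS)
```

In this sketch the strict match finds only d1, and the loose match adds the Florida travel page (d3). The page about a related species (d2) is never found, because it uses neither exact term.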

Creating better clusters

The answer is a methodology that uses phrases to index, search, rank, and create descriptions for websites. It looks to identify phrases that have frequent and/or distinguished/unique usage. Using this methodology, phrases of four, five, or more terms can be identified. To identify phrases that are related to one another, the system uses a ‘prediction measure’ that relates the actual usage of two phrases to their ‘expected usage’. In essence, the more ‘expected’ related phrasings there are within a document, the higher the score will be.
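As a rough illustration of relating actual usage to expected usage, the sketch below divides how often two phrases really appear together by how often they would appear together if they were unrelated. The ratio form, the function name prediction_measure, and the sample documents are all assumptions made for this example; the text above does not say how the measure is actually computed.

```python
def prediction_measure(docs, phrase_a, phrase_b):
    """Ratio of actual to expected co-occurrence of two phrases.

    Assumption for this sketch: 'expected' means the co-occurrence rate
    we would see if the two phrases appeared independently of each other.
    A value well above 1.0 suggests the phrases are related.
    """
    n = len(docs)
    count_a = sum(phrase_a in d for d in docs)
    count_b = sum(phrase_b in d for d in docs)
    count_both = sum(phrase_a in d and phrase_b in d for d in docs)
    if count_a == 0 or count_b == 0:
        return 0.0
    # actual = count_both / n, expected = (count_a / n) * (count_b / n);
    # the ratio simplifies to the expression below.
    return count_both * n / (count_a * count_b)


# Invented sample documents, lowercased for simple substring matching.
SAMPLE = [
    "the president of the united states spoke at the white house",
    "the white house said the president of the united states will travel",
    "a white house press briefing",
    "weather forecast for the weekend",
    "stock markets closed higher",
    "recipe for apple pie",
]

related = prediction_measure(SAMPLE, "president of the united states", "white house")
unrelated = prediction_measure(SAMPLE, "apple pie", "white house")
```

Here the two political phrases appear together twice as often as chance would predict, while "apple pie" and "white house" never co-occur and score zero.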

Related phrases are those that are commonly used to discuss or describe a topic or concept, such as "President of the United States" and "White House". Under a Boolean system these are semantic unknowns, yet they are obviously related to each other. Phrase based indexing and retrieval systems help alleviate this problem.

Multiple purpose phrase relevance

For each phrase, the system (indexing and retrieval) identifies the pages that contain the phrase. In addition, for a given phrase, a second list stores data that shows which of its related phrases are also present in the pages containing that phrase. The system can then identify which pages have which phrases, as well as which pages also contain phrases related to the query phrases. This enables much tighter scoring of the results for a phrase query.
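The two lists described above can be pictured with a small sketch. The structure names (postings, related_present), the functions build_index and rank, and the scoring rule (count the related phrases present on the page) are hypothetical choices for illustration, not the engine's actual design.

```python
from collections import defaultdict


def build_index(docs, phrases, related):
    """Build two lists per phrase (hypothetical layout).

    postings[phrase]         -> ids of pages containing the phrase
    related_present[phrase]  -> {page id: related phrases also on that page}
    """
    postings = defaultdict(list)
    related_present = defaultdict(dict)
    for doc_id, text in docs.items():
        present = [p for p in phrases if p in text]
        for p in present:
            postings[p].append(doc_id)
            related_present[p][doc_id] = [
                r for r in related.get(p, []) if r in present
            ]
    return dict(postings), dict(related_present)


def rank(query_phrase, postings, related_present):
    """Order pages by how many related phrases they also contain."""
    pages = postings.get(query_phrase, [])
    return sorted(
        pages,
        key=lambda d: (-len(related_present[query_phrase][d]), d),
    )


# Invented pages and an invented related-phrase table.
PAGES = {
    "d1": "a history of every president of the united states",
    "d2": "the president of the united states returned to the white house",
}
PHRASES = ["president of the united states", "white house"]
RELATED = {
    "president of the united states": ["white house"],
    "white house": ["president of the united states"],
}

postings, related_present = build_index(PAGES, PHRASES, RELATED)
ranking = rank("president of the united states", postings, related_present)
```

Both pages contain the query phrase, but d2 ranks first in this sketch because it also contains a related phrase.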

Using such a methodology creates a variety of clusters of related phrases, which “represent semantically meaningful groupings of phrases”. A cluster is formed from phrases that have a high prediction measure between all of the phrases in the cluster. These clusters can then be used to organize, score, and rank the results, as well as to eliminate documents from the search results.
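One simple way to picture such clusters is a greedy grouping in which a phrase joins a cluster only when its prediction measure with every existing member clears a threshold. The greedy strategy, the threshold value, the function names, and the scores in the table are all invented for this sketch; the source text does not describe how clusters are actually built.

```python
def cluster_phrases(phrases, measure, threshold):
    """Greedily group phrases so every pair in a cluster scores above threshold."""
    clusters = []
    for phrase in phrases:
        for cluster in clusters:
            if all(measure(phrase, member) > threshold for member in cluster):
                cluster.append(phrase)
                break
        else:
            # No existing cluster accepts the phrase, so start a new one.
            clusters.append([phrase])
    return clusters


# Invented pairwise prediction measures; unlisted pairs score 0.0.
SCORES = {
    frozenset(("president of the united states", "white house")): 2.0,
    frozenset(("president of the united states", "oval office")): 1.9,
    frozenset(("white house", "oval office")): 1.8,
}


def toy_measure(a, b):
    """Look up the invented score for an unordered pair of phrases."""
    return SCORES.get(frozenset((a, b)), 0.0)


clusters = cluster_phrases(
    ["president of the united states", "white house", "oval office", "apple pie"],
    toy_measure,
    threshold=1.5,
)
```

With these made-up scores, the three political phrases end up in one cluster and "apple pie" is left on its own.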

Query Processing and Phrase Extensions

The system uses the phrases when searching for pages in response to a query. It identifies any phrases that are present in the query, so that it can look up the related ‘lists’ and phrase information for those query phrases. It can also be used in instances of an incomplete phrase in a search query; thes