RankBrain - The Word2Vec Patent from Google

Is it part of Google’s RankBrain? Probably worth learning more about.

Last fall Google started talking publicly about their new “RankBrain” approach, and to this day the SEO world seems somewhat confused about what exactly it is. At its core, it’s a system that represents words as vectors to better understand queries and predict the results they serve up for them. We also learned that it is most likely closely related to the Word2Vec project.

And with that in mind, there was a patent awarded last year whose authors were many of the same people that worked on the Word2Vec project (Greg Corrado is also listed on the RankBrain work). I thought it was worth covering to hopefully get a little more insight into the concepts.

Ok so let’s take a crash course on how Google might use word vectors to better understand things. Let’s look at some simple word vector examples.

vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’)

vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’)

And vector(‘san_francisco’) would sit close to the vectors for ‘los_angeles’, ‘golden_gate’, ‘california’, ‘oakland’ and ‘san_diego’.

In another example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).
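If you want to see the arithmetic in action, here is a minimal sketch in Python. The three-dimensional vectors are numbers I made up by hand so that the example works (real Word2Vec vectors are learned from text and have hundreds of dimensions), and the helper names cosine and analogy are my own, not anything from Google.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only.
# The dimensions loosely mean: [royalty, maleness, femaleness].
EMBEDDINGS = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.7, 0.8, 0.1]),
    "apple":  np.array([0.0, 0.1, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    # Leave out the three input words so they cannot win trivially.
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

result = analogy("king", "man", "woman")
print(result)
```

With these hand-picked numbers the closest remaining word is "queen". The point is only that "closest vector" is a plain similarity lookup once the vectors exist.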

You get the idea…now let’s move onto the patent in question.

The Patent

Computing numeric representations of words in a high-dimensional space

Assigned: Google Inc.
Filed: March 15, 2013
Awarded: May 19, 2015

Authors: Tomas Mikolov; Kai Chen; Gregory Corrado; Jeffrey Dean

The system receives a group of words that surround an unknown word in a sentence/query and then maps those words into a numeric representation. From there, a word score is generated for each of the words. Additionally, there is a set of training data made up of sequences of words, which is used to train a plurality of classifiers.
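To make that flow a bit more concrete, here is a rough sketch of the forward pass as I read it: context words go in, a numeric representation comes out, and a classifier turns it into one score per word. The tiny vocabulary, the random (untrained) weights, the choice to average the context vectors and the softmax at the end are all my assumptions for illustration, not details confirmed by the patent.

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat"]
WORD_TO_ID = {w: i for i, w in enumerate(VOCAB)}
EMBED_DIM = 4

rng = np.random.default_rng(0)
# Embedding function: one row of numbers per vocabulary word (random, untrained).
embedding_table = rng.normal(size=(len(VOCAB), EMBED_DIM))
# Classifier weights: map a numeric representation to one score per word.
classifier_weights = rng.normal(size=(EMBED_DIM, len(VOCAB)))

def word_scores(context_words):
    """Score every vocabulary word as a candidate for the unknown position."""
    ids = [WORD_TO_ID[w] for w in context_words]
    # Assumption: combine the context vectors by averaging them.
    representation = embedding_table[ids].mean(axis=0)
    logits = representation @ classifier_weights
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# The unknown word sits between "cat" and "on".
scores = word_scores(["the", "cat", "on", "mat"])
for word, score in zip(VOCAB, scores):
    print(f"{word}: {score:.3f}")
```

Because the weights are random, the scores here are meaningless. Training is what would push the score for "sat" up for this context.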

Some papers noted a success rate relative to the size of the corpus. When using it over a 33 billion word set they achieved a 73% relevance success rate. When they ran it over only 6 billion words, that dropped to 65%.

Some of the advantages of the system, as stated in the patent:

Unknown words in sequences of words can be effectively predicted if the surrounding words are known.
Words surrounding a known word in a sequence of words can be effectively predicted.
Numerical representations of words in a vocabulary of words can be easily and effectively generated.
The numerical representations can reveal semantic and syntactic similarities and relationships between the words that they represent.
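The first two advantages map onto the two training setups usually associated with Word2Vec: predicting a word from its neighbours (CBOW) and predicting the neighbours from a word (skip-gram). Below is a small sketch of how training examples for each could be pulled out of a sentence. The function names and the window parameter are mine, and the patent does not necessarily build its examples this way.

```python
def cbow_examples(tokens, window=2):
    """(surrounding words, unknown word) examples: predict a word from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

def skipgram_pairs(tokens, window=2):
    """(known word, surrounding word) pairs: predict the context from a word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
for context, target in cbow_examples(sentence, window=1):
    print(context, "->", target)
for center, neighbour in skipgram_pairs(sentence, window=1):
    print(center, "->", neighbour)
```

Same sentence, same window, just a different direction of prediction.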

They also note that working on large corpora, on the order of 200 billion words, is achievable through a two-layer prediction system (potentially skip-gram and CBOW?). This process can also produce high-quality numeric representations from smaller corpora. All of that potentially lowers the training time.

There is also mention that “… input words are tokenized before being received by the system, e.g., so that known compounds, e.g., “New York City” and other entity names, are treated as a single word by the system.” Which I thought was interesting.
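Here is one way that kind of pre-tokenization could look. This is only a sketch: the phrase list is hypothetical, the greedy longest-match strategy is my own choice, and the patent does not say how the compounds are actually detected.

```python
# Hypothetical list of known compounds, stored as lowercase word tuples.
KNOWN_PHRASES = {
    ("new", "york", "city"),
    ("new", "york"),
    ("san", "francisco"),
}

def merge_known_phrases(tokens, phrases=KNOWN_PHRASES):
    """Greedily join the longest known compound at each position into one token."""
    max_len = max((len(p) for p in phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two words.
        for length in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + length])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += length
                break
        else:
            # No compound starts here, so keep the single word.
            merged.append(tokens[i].lower())
            i += 1
    return merged

print(merge_known_phrases("hotels in New York City".split()))
```

After this step "new_york_city" gets its own vector, rather than being pieced together from "new", "york" and "city".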

Another aspect of the system is a classifier that takes the numeric representations and then predicts values. This seems to be an important part of the process and further leads us towards how it is likely implemented out in the wild.

The Machine Learning Component

As for the actual training processes, they tend to talk about parallelized approaches, something they have been looking at as far back as 2007 and beyond. And hey, anyone remember Caffeine and the parallel indexing approaches in 2010?

They mention:

“For example, the training process can be parallelized using one or more of the techniques for parallelizing the training of a machine learning model described in “Large Scale Distributed Deep Networks,” (pdf) Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Neural Information Processing Systems Conference, 2012.”
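For a feel of what parallelizing training means at its simplest, here is a toy sketch of data parallelism: split the data into shards, let each "worker" compute a gradient on its shard, then average. To be clear, this is not the method from that paper, which as far as I know relies on asynchronous updates against shared parameter servers. It is just the simpler synchronous idea, on a made-up linear model with random data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 toy training examples, 3 features each
y = rng.normal(size=8)
weights = np.zeros(3)

def gradient(X_part, y_part, w):
    """Gradient of mean squared error for a linear model on one shard of data."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)

# Split the data into 4 equal shards, as if each went to a separate worker.
shards = zip(np.split(X, 4), np.split(y, 4))
worker_grads = [gradient(Xs, ys, weights) for Xs, ys in shards]
averaged = np.mean(worker_grads, axis=0)

# With equal-sized shards, the average matches the single-machine gradient.
full = gradient(X, y, weights)
print(np.allclose(averaged, full))
```

The averaged gradient only matches exactly because the shards are the same size. The appeal is that each shard can be processed on a different machine at the same time.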

There is also some further mention of some of the examples I gave in my article on RankBrain, which I originally found in a paper authored by the same Googlers.

The example given was:

“…semantic similarities, e.g., showing that the word “queen” is similar to the words “king” and “prince.” Furthermore, because of the encoded regularities, the numeric representations may show that the word “king” is similar to the word “queen” in the same sense as the word “prince” is similar to the word “princess,” and alternatively that the word “king” is similar to the word “prince” as the word “queen” is similar to the word “princess.””

So, when looking to identify the similarity/relationship of word A to word B to word C, they describe an operation such as:

vector(A)-vector(B)+vector(C).

For example, the operation vector(“King”)-vector(“Man”)+vector(“Woman”) may result in a vector that is closest to the vector representation of the word “Queen.”

All of that being said, the training element also produces scores for how likely those relationships are. In other words, how trustworthy its predictions are.

Does it rank?

The next part I am going to share is something I had reservations about doing. In fact, I spoke to more than a few of my peers about including it as it has the word “rank” in it, which obviously will send many an SEO blogger into fits.

Please understand, this part of the patent is merely about other possible ways the system might be implemented. It by no means implies that it is being used as such.

Here’s the snippet:

“In some implementations, instead of the classifier, the concept term scoring system can include a ranking function that orders the words based on the numeric representation generated by the embedding function i.e., in order of predicted likelihood of being the word at position t. The ranking function may be, e.g., a hinge-loss ranking function, a pairwise ranking function, and so on. Once generated, the word score vectors can be stored in a predicted word store or used for some immediate purpose.”

That is NOT the same as actually ranking results independently folks… ok? Tread lightly. It’s more of a re-ranking element.

“In some implementations, the system can process the numeric representation of the input words using a ranking function instead of a classifier to predict a ranking of the words according to the predicted likelihood that each of the words is the unknown word in the sequence.”

Getting it now? Instead of using the classification system, it might rank the candidate words by score to establish which one is most likely the unknown word in the set of words (phrase or sentence). That being said, if the query is better understood and changed, it stands to reason that the rankings are going to change. That doesn’t mean it’s a direct scoring element.
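To show how modest that kind of "ranking" is, here is a small sketch of ordering candidate words by score, along with a hinge-style loss like the one the patent name-checks. The scores are invented, and the function names and the margin value are my own assumptions. This is one common way to write a pairwise hinge loss, not necessarily the patent’s formulation.

```python
import numpy as np

def rank_words(vocab, scores):
    """Order candidate words from most to least likely for the unknown position."""
    order = np.argsort(scores)[::-1]
    return [vocab[i] for i in order]

def pairwise_hinge_loss(scores, true_index, margin=1.0):
    """Penalise every wrong word that scores within `margin` of the true word."""
    true_score = scores[true_index]
    losses = np.maximum(0.0, margin - true_score + scores)
    losses[true_index] = 0.0  # the true word is not compared with itself
    return float(losses.sum())

vocab = ["sat", "mat", "cat", "on"]
scores = np.array([2.5, 0.3, 1.9, -1.0])  # hypothetical model outputs

print(rank_words(vocab, scores))
print(pairwise_hinge_loss(scores, true_index=0))
```

Notice that what gets ordered here is words competing for a slot in a sentence, not web pages competing for a slot in the results.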

Is This RankBrain?

Who knows, right? But I can’t help but think that it’s a part of it. If we look at the Google engineer interviewed in the original outing of RankBrain, it was Greg Corrado, who is named in this patent along with the other folks that have worked on related papers and research projects around Word2Vec.

Let’s also consider the “Google Brain” deep learning research project whose team members included Jeff Dean, Geoffrey Hinton and Greg Corrado… more connections to RankBrain as well as Word2Vec.

Add to that the fact that most of what Google has said about RankBrain is that it deals with query refinements and classifications, as well as using words as vectors. It’s not a huge leap to believe that this patent is, at least in part, a glimpse into the workings of RankBrain.

My take? RankBrain is part Google Brain, part Word2Vec, part this patent, all loosely wrapped in Hummingbird.

Related reading/watching

Investigating Google RankBrain and Query Term Substitutions

Google Hummingbird Patent (SEO by the Sea)