Monday 30 December 2013

Google "WordRank"

One of the more interesting things I learned about while I was a graduate student was Google's PageRank algorithm (to this day, it remains one of my go-to applications when I want to illustrate the real-life utility of certain concepts in linear algebra classes, although as each year passes, the likelihood that a student can remember life before Google gets smaller).  What made it particularly interesting to me is that it could be stated largely in the terminology of algebraic graph theory, my own field of research.  I had seen similar ideas before, but I had never seen anything like it put into practice (well, I had seen Google, but I wasn't aware that it was putting these ideas into practice).

The basic idea behind Google's algorithm is that good websites are websites that are linked to by good websites.  If you think that sounds awkwardly circular, you wouldn't be wrong.  Nevertheless, there's an elegantly simple way to express the idea as a matrix equation, and solutions to such equations are well studied.  Things don't always work out as well in practice as they do in the abstract, but Google's search engine was better than any search engine that came before it.  In this case, the abstract and the practice lined up quite well.  In fact, the only one of those older search engines that's still around to this day in any noticeable way is Yahoo, and it has a much smaller share of the search engine market than Google does.

Now, one of my side interests is languages (or linguistics maybe, but I'm not sure how much my forays take me into actual linguistics. Favourite websites/blogs by people who know what they're talking about include Dialect Blog and Language Log).  While pursuing this interest, one of the things I've come across is lists of the most common words in a given language.  I wonder if something like Google's PageRank algorithm could be used to give an alternate ranking of words.  Instead of using links, you could say, in the same awkwardly circular fashion, that important words are words that are used with important words, set it up algebraically, then find the solution.  The result would assign a "WordRank" to each word.

What "used with" means is not well defined, mind you.  We could say that a word $A$ is used with a word $B$ if $A$ and $B$ are used in the same sentence, or perhaps used within a certain number of words of each other, regardless of whether or not they reside in the same sentence.  Maybe there is some other suitable choice.

While I could easily solve the algebra part, setting up the equation would take quite some work, and it might be beyond my abilities.  For example, if I chose to say words are used together when they are in the same sentence, I would need to decide whether a period in a given text indicates an abbreviation or the end of a sentence.  It's easy enough to do if I'm reading, but I would need to program a computer to do this for a sufficiently large quantity of text.  That strikes me as a nontrivial task.
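As a quick illustration of why this is nontrivial, here is what happens if a program naively treats every period followed by a space as the end of a sentence (the text is just a made-up example):

```python
# Naive sentence splitting: treat ". " as a sentence boundary.
# The abbreviation "Dr." trips it up.
text = "Dr. Smith arrived. He sat down."

naive_sentences = text.split(". ")
# The split produces three pieces instead of the two actual sentences,
# because the period in "Dr." is mistaken for a sentence boundary.
print(naive_sentences)
```

Real tokenizers handle this with abbreviation lists or statistical models, which is exactly the kind of machinery I'd have to build or borrow.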

That wouldn't be a problem if I chose to say that words are used together when they appear within a certain number of words of each other.  Suppose this number is $n$.  The question arises as to what $n$ should be. The answer is not obvious to me.
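To make the window idea concrete, here is a minimal sketch of counting co-occurrences within a window (the sample text and the choice $n = 2$ are placeholders, not recommendations):

```python
from collections import Counter

def cooccurrence_counts(words, n):
    """Count, for each unordered pair of words, how often they appear
    within n positions of each other in the word sequence."""
    counts = Counter()
    for i, v in enumerate(words):
        for j in range(i + 1, min(i + n + 1, len(words))):
            # Store each pair under a canonical (sorted) key so the
            # counts are symmetric: (v, w) and (w, v) are the same pair.
            counts[tuple(sorted((v, words[j])))] += 1
    return counts

words = "the cat sat on the mat".split()
counts = cooccurrence_counts(words, 2)
```

With $n = 2$ on this toy text, "sat" and "the" co-occur twice (once on each side of "sat"), while "cat" and "the" co-occur only once.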

Another complication is that spelling alone is not enough to determine which word is being used.  For one thing, words can have different forms.  Nouns can be plural or singular, verbs can be present tense or past tense, and so on, and spellings change as a result.  Also, different words can have the same spelling.  There might be other issues (how to compensate for typos, maybe).  It would take some effort to sort these out, and I wouldn't know where to start.  These complications crop up in producing lists based purely on frequency, though, and those lists already exist.  So I would assume somebody already knows how to deal with them.

Once these obstacles are overcome, we could make a matrix $M$.  The rows and columns of $M$ are both indexed by words (so the matrix is square).  If $v$ and $w$ are words, then the $(v,w)$-entry of $M$ is the number of times $v$ appears with $w$ in some corpus.  Since $v$ is used with $w$ if and only if $w$ is used with $v$, $M$ is also symmetric.  Following the example of the PageRank algorithm, we find an eigenvector ${\bf x}$ of $M$ corresponding to the largest eigenvalue of $M$.  Provided that $M$ satisfies certain conditions, ${\bf x}$ will be unique up to a scalar multiple, and we can assume that its entries are all non-negative.  (These are consequences of the Perron-Frobenius Theorem.)  The vector ${\bf x}$ is called the principal eigenvector of $M$.  The entries of ${\bf x}$ can then be ranked by size, largest first, and the WordRank of a word $w$ is the rank of entry $w$ of ${\bf x}$.
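As a sketch of the computation, the principal eigenvector can be found by power iteration: repeatedly multiply by $M$ and renormalize.  The tiny matrix below is invented purely for illustration; a real $M$ would be enormous and sparse.

```python
def principal_eigenvector(M, iterations=1000):
    """Power iteration: repeatedly apply M and renormalize.  For a
    nonnegative matrix satisfying the Perron-Frobenius conditions,
    this converges to the principal eigenvector."""
    n = len(M)
    x = [1.0] * n
    for _ in range(iterations):
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    return x

# A made-up symmetric co-occurrence matrix for four words.
words = ["the", "cat", "sat", "mat"]
M = [[0, 3, 2, 2],
     [3, 0, 1, 0],
     [2, 1, 0, 1],
     [2, 0, 1, 0]]

x = principal_eigenvector(M)
# Rank words by the size of their entry in x, largest first.
ranking = sorted(words, key=lambda w: -x[words.index(w)])
```

In this made-up example, "the" co-occurs heavily with everything and lands at the top of the ranking, as one would expect.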

The total number of occurrences of the word $w$ is the sum of row $w$ (equivalently, column $w$) of $M$; the vector of these row sums is $M{\bf j}$, where ${\bf j}$ is the all-ones vector.  Suppose that $D$ is the diagonal matrix whose diagonal entry $w$ is the number of occurrences of word $w$.  Then the column sums of $MD^{-1}$ are all 1.  Since $M$ and $D^{-1}$ are both nonnegative, $MD^{-1}$ is as well.  Thus, $MD^{-1}$ is a stochastic matrix, and therefore its principal eigenvalue is 1.  As an alternative to the scheme described in the previous paragraph, we could use the principal eigenvector of $MD^{-1}$ instead of $M$ to determine a ranking.  The PageRank algorithm is based on the website analogue of this matrix.
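In code, forming $MD^{-1}$ just means dividing each column of $M$ by the corresponding word's total count (the matrix below is again a made-up example):

```python
# A made-up symmetric co-occurrence matrix.  Diagonal entry w of D is
# the total count of word w, which equals the sum of column w of M.
M = [[0, 3, 2, 2],
     [3, 0, 1, 0],
     [2, 1, 0, 1],
     [2, 0, 1, 0]]
n = len(M)

totals = [sum(M[i][j] for i in range(n)) for j in range(n)]  # column sums

# M D^{-1}: divide column j of M by totals[j].
P = [[M[i][j] / totals[j] for j in range(n)] for i in range(n)]

# Every column of P now sums to 1, so P is column stochastic and its
# principal eigenvalue is 1.
```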

In devising the PageRank algorithm, the Google founders found it necessary to modify this matrix further.   Perhaps some similar modifications to $M$ or $MD^{-1}$ would be necessary here.

One reason I think this could be interesting is that it's not just based on how common a word is, but how much each word "works" with other words.  The top $m$ words using WordRank could provide a more coherent list of words.  Of course, a word will get used more if it works well with other words, so any algorithm worthy of the name WordRank could end up producing more or less the same thing as ranking by frequency.

This idea could also be nonsense.  However, in the course of learning more about the PageRank algorithm, I came across a paper that generalized the algorithm.  In that paper, the authors applied a variation of PageRank to the problem of synonym extraction, with at least modest success.  So using these ideas in the realm of languages does not seem particularly absurd.

In any case, I think it would be interesting to see the results.

Possibly there are other things that could be done with the matrices $M$ and $MD^{-1}$, or maybe some other related matrix.  When designing language learning resources, it's important to come up with vocabulary lists.  We could use frequency or WordRank, or some combination of the two, to compose these lists.  But these lists probably need to be divided into smaller chunks for individual lessons.  One would hope that each chunk consists of a coherent set of words, for example, words centred on a theme, like food related words, occupation related words, and so forth.

One could do this by hand, of course, and maybe that's the best way once you decide which words to use.  But there are techniques based on eigenvalues and eigenvectors (for example, using the Fiedler vector) that are used to subdivide networks into smaller coherent components.  Perhaps some variation of these techniques could be employed here.
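As a sketch of that idea, here is a Fiedler-vector partition of a made-up six-node graph: two tight clusters (imagine two groups of thematically related words) joined by a single edge.  The Fiedler vector is approximated by power iteration on a shifted Laplacian, projecting out the all-ones direction; the signs of its entries split the graph into its two natural halves.

```python
# Edge list of a made-up graph: two triangles {0,1,2} and {3,4,5}
# joined by the single bridge edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

# Graph Laplacian L = D - A.
L = [[0.0] * n for _ in range(n)]
for u, v in edges:
    L[u][u] += 1
    L[v][v] += 1
    L[u][v] -= 1
    L[v][u] -= 1

def fiedler_vector(L, shift=10.0, iterations=2000):
    """Power iteration on (shift*I - L), repeatedly projecting out the
    all-ones vector (the eigenvector for eigenvalue 0).  For shift
    larger than the largest eigenvalue of L, this converges to the
    eigenvector of the second-smallest eigenvalue: the Fiedler vector."""
    n = len(L)
    x = [float(i) for i in range(n)]  # generic starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the all-ones direction
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = max(abs(v) for v in y) or 1.0
        x = [v / norm for v in y]
    return x

f = fiedler_vector(L)
# The sign of each entry assigns its node to one of the two clusters.
groups = [f[i] > 0 for i in range(n)]
```

On this toy graph the sign pattern recovers the two triangles exactly; whether anything so clean happens on a real word co-occurrence network is exactly the open question.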