
Monday 30 December 2013

Google "WordRank"

One of the more interesting things I learned about while I was a graduate student was Google's PageRank algorithm (to this day, it remains one of my go-to applications when I want to illustrate the real-life utility of certain concepts in linear algebra classes, although as each year passes, the likelihood that a student can remember life before Google gets smaller).  What made it particularly interesting to me is that it could be stated largely in the terminology of algebraic graph theory, my own field of research.  I had seen similar ideas before, but I had never seen anything like it put into practice (well, I had seen Google, but I wasn't aware that it was putting these ideas into practice).

The basic idea behind Google's algorithm is that good websites are websites that are linked to by good websites.  If you think that sounds awkwardly circular, you wouldn't be wrong.  Nevertheless, there's an elegantly simple way to express the idea as a matrix equation, and solutions to this equation are well studied.  Things don't always work out as well in practice as they do in the abstract, but Google's search engine was better than any search engine that came before it; in this case, the abstract and the practice lined up quite well.  In fact, the only one of those older search engines that's still around to this day in any noticeable way is Yahoo, and it has a much smaller share of the search engine market than Google does.

Now, one of my side interests is languages (or perhaps linguistics, though I'm not sure how far my forays take me into actual linguistics; favourite websites/blogs by people who know what they're talking about include Dialect Blog and Language Log).  While pursuing this interest, one of the things I've come across is lists of the most common words in a given language.  I wonder if something like Google's PageRank algorithm could be used to produce an alternative ranking of words.  Instead of using links, you could say, in the same awkwardly circular fashion, that important words are words that are used with important words, set it up algebraically, then find the solution.  The result would assign a "WordRank" to each word.

What "used with" means is not well defined, mind you.  We could say that a word $A$ is used with a word $B$ if $A$ and $B$ are used in the same sentence, or perhaps used within a certain number of words of each other, regardless of whether or not they reside in the same sentence.  Maybe there is some other suitable choice.

While I could easily solve the algebra part, setting up the equation would take quite some work, and it might be beyond my abilities.  For example, if I chose to say words are used together when they are in the same sentence, I would need to decide whether a period in a given text indicates an abbreviation or the end of a sentence.  It's easy enough to do if I'm reading, but I would need to program a computer to do this for a sufficiently large quantity of text.  That strikes me as a nontrivial task.

That wouldn't be a problem if I chose to say that words are used together when they appear within a certain number of words of each other.  Suppose this number is $n$.  The question arises as to what $n$ should be. The answer is not obvious to me.
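Whatever $n$ turns out to be, the counting step itself is simple to sketch.  Here is a minimal example, assuming a deliberately naive tokenizer (splitting on whitespace) and a placeholder window size:

```python
from collections import Counter

def cooccurrence_counts(text, n=5):
    """Count unordered pairs of words appearing within n words of each other."""
    words = text.lower().split()  # naive tokenizer; real text needs far more care
    counts = Counter()
    for i, w in enumerate(words):
        for v in words[i + 1 : i + n + 1]:  # the next n words after w
            counts[tuple(sorted((w, v)))] += 1
    return counts

print(cooccurrence_counts("the cat sat on the mat", n=2))
```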

Another complication is that spelling alone is not enough to determine what word is being used.  For one thing, words can have different forms.  Nouns can be plural or singular, verbs can be present tense or past tense, etc., and spellings change as a result.  Also, different words can have the same spelling.  There might be other issues (how to compensate for typos, maybe).  It would take some effort to sort these out, and I wouldn't know where to start.  These complications would crop up in producing lists based purely on frequency, though, and those lists already exist.  So I would assume somebody already knows how to deal with them.

Once these obstacles are overcome, we could make a matrix $M$.  The rows and columns of $M$ are both indexed by words (so the matrix is square).  If $v$ and $w$ are words, then the $(v,w)$-entry of $M$ is the number of times $v$ appears with $w$ in some corpus.  Since $v$ is used with $w$ if and only if $w$ is used with $v$, $M$ is also symmetric.  Following the example of the PageRank algorithm, we find an eigenvector ${\bf x}$ of $M$ corresponding to the largest eigenvalue of $M$.  Provided that $M$ satisfies certain conditions, ${\bf x}$ will be unique up to a scalar multiple, and we can assume that the entries of ${\bf x}$ are all non-negative.  (These are consequences of the Perron-Frobenius Theorem.)  The vector ${\bf x}$ is called the principal eigenvector of $M$.  The entries of ${\bf x}$ can be ranked according to their size, with larger entries ranked higher.  The WordRank of word $w$ is then the rank of entry $w$ of ${\bf x}$.
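To make the ranking step concrete, here is a sketch that approximates the principal eigenvector by power iteration; the word list and matrix are made up purely for illustration:

```python
import numpy as np

def principal_eigenvector(M, iters=1000):
    """Approximate the principal eigenvector of a nonnegative matrix by power iteration."""
    x = np.ones(M.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)  # renormalize so the entries don't blow up
    return x

words = ["a", "b", "c"]
M = np.array([[0, 4, 1],   # toy symmetric co-occurrence counts
              [4, 0, 2],
              [1, 2, 0]], dtype=float)

x = principal_eigenvector(M)
print(sorted(words, key=lambda w: -x[words.index(w)]))  # highest WordRank first
```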

The total number of occurrences of the word $w$ is the sum of row $w$ (equivalently, column $w$) of $M$; the vector of all these row sums is $M{\bf j}$, where ${\bf j}$ is the all-ones vector.  Suppose that $D$ is the diagonal matrix whose diagonal entry $w$ is the number of occurrences of word $w$.  Then the column sums of $MD^{-1}$ are all 1.  Since $M$ and $D^{-1}$ are both nonnegative, $MD^{-1}$ is as well.  Thus, $MD^{-1}$ is a stochastic matrix, and therefore its principal eigenvalue is 1.  As an alternative to the scheme described in the previous paragraph, we could use the principal eigenvector of $MD^{-1}$ instead of that of $M$ to determine a ranking.  The PageRank algorithm is based on the website analogue of this matrix.
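In the same toy setting, the normalization is a one-liner, and the column sums confirm that the result is stochastic:

```python
import numpy as np

M = np.array([[0, 4, 1],
              [4, 0, 2],
              [1, 2, 0]], dtype=float)  # the same made-up counts as above
S = M @ np.diag(1.0 / M.sum(axis=0))    # M D^{-1}: divide each column by its sum
print(S.sum(axis=0))                    # [1. 1. 1.], so S is column-stochastic
```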

In devising the PageRank algorithm, the Google founders found it necessary to modify this matrix further.  Perhaps some other modifications to $M$ or $MD^{-1}$ would be necessary here.

One reason I think this could be interesting is that it's not just based on how common a word is, but how much each word "works" with other words.  The top $m$ words using WordRank could provide a more coherent list of words.  Of course, a word will get used more if it works well with other words, so any algorithm worthy of the name WordRank could end up producing more or less the same thing as ranking by frequency.

This idea could also be nonsense.  However, in the course of learning more about the PageRank algorithm, I came across a paper that generalized the algorithm.  In the paper, they applied a variation of PageRank to the problem of synonym extraction, with at least modest success.  So using these ideas in the realm of languages does not seem particularly absurd.

In any case, I think it would be interesting to see the results.

Possibly there are other things that could be done with the matrices $M$, $MD^{-1}$, or maybe some other related matrix.  When designing language learning resources, it's important to come up with vocabulary lists.  We could use frequency or WordRank, or some combination, to compose these lists.  But these lists probably need to be divided into smaller chunks for individual lessons.  One would hope that these chunks consist of a coherent set of words, for example, words centred on a theme, like food-related words, occupation-related words, and so forth.

One could do this by hand, of course, and maybe that's the best way once you decide which words to use.  But there are techniques based on eigenvalues and eigenvectors (for example, using the Fiedler vector) that are used to subdivide networks into smaller coherent components.  Perhaps some variation of these techniques could be employed here.
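As a rough sketch of the spectral idea, with a toy adjacency matrix standing in for a real vocabulary network: compute the graph Laplacian, take the eigenvector for the second-smallest eigenvalue (the Fiedler vector), and split the vertices by the signs of their entries.

```python
import numpy as np

def fiedler_partition(A):
    """Split a graph in two by the sign of the Fiedler vector.
    A is a symmetric adjacency matrix."""
    L = np.diag(A.sum(axis=1)) - A   # graph Laplacian
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1] >= 0           # sign pattern of the Fiedler vector

# Two triangles (vertices 0-2 and 3-5) joined by the single edge {2,3}.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(fiedler_partition(A))  # the two triangles land on opposite sides
```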

Tuesday 26 November 2013

A graph labelling problem

I've been doing some work using the shapefiles for the current Canadian federal electoral districts.  Each shape in the shapefile corresponds to a riding (or, in a few cases, a part of a riding).  One thing I have noticed while working with them is that there is no apparent structure to the order in which the shapes appear.

For example, Shape 1 is the boundary of the Vegreville--Wainwright riding in Alberta, while Shape 2 is Bruce--Grey--Owen Sound in Ontario.  Since Alberta and Ontario do not even share a border, these two districts are far apart from each other.  Meanwhile, the ridings adjacent to Vegreville--Wainwright are numbered 5, 21, 144, 183, 195, 217, and 240.  Considering there are only 312 shapes [1] in total, 5 and 21 are the only numbers remotely close to 1.  If there is any reasoning behind the numbering scheme, it's not easy to tell what it is by looking at the numbers alone.  One assumes it's arbitrary.

I use the numbers frequently, so it would be helpful if I could tell a little bit more about how close two ridings are based on their numbers.

The best numbering system would have consecutive numbers mapped to adjacent ridings, so that one riding was adjacent to another if and only if their labels differed by exactly 1.  This is impossible to do for every riding, however, since it would require each riding to be adjacent to only two other ridings, but most are adjacent to three or more others.

Instead, we could number the ridings so that the numbers of "close" ridings are also close.

There are a couple ways we could do this.  First, let's put the problem in terms of graph theory.  Let $G$ be the graph whose vertices are the electoral districts.  Define two vertices to be adjacent if their corresponding districts share a border (excluding the case where two districts share a point).  Suppose we label each vertex $v$ with a number $n(v)$ between 1 and the number of vertices, using each number for exactly one vertex.
  1. Ensure adjacent vertices have close numbers.  For each pair of adjacent vertices, compute the difference between their labels, then sum these differences over all adjacent pairs.  That is, compute
    $$\sum_{uv\in E}|n(u)-n(v)|$$
    Define the best labellings to be those that minimize this sum.
  2. Approximate the distances between vertices by the differences between their respective labels.  To this end, let $d(u,v)$ be the distance between two vertices in the graph and let $D(u,v)=|n(u)-n(v)|$.  The difference between the labels approximates the distance between the vertices if $|d(u,v)-D(u,v)|$ is small.  Thus, define the best labellings to be those that minimize the sum
    $$\sum_{\{u,v\}\subseteq V}|d(u,v)-D(u,v)|$$
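For a concrete sense of how these two objectives behave, here is a brute-force sketch that scores every labelling of a small graph under both criteria.  This is only feasible for tiny graphs, since there are $|V|!$ labellings; the 4-cycle and its hand-computed distance matrix are just an illustration.

```python
from itertools import permutations

def edge_sum(labels, edges):
    """Objective 1: total |n(u) - n(v)| over the edges."""
    return sum(abs(labels[u] - labels[v]) for u, v in edges)

def distance_sum(labels, dist):
    """Objective 2: total |d(u,v) - |n(u) - n(v)|| over all pairs of vertices."""
    n = len(labels)
    return sum(abs(dist[u][v] - abs(labels[u] - labels[v]))
               for u in range(n) for v in range(u + 1, n))

# A 4-cycle on vertices 0..3, with its graph distances written out by hand.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
dist = [[0, 1, 2, 1],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [1, 2, 1, 0]]

best1 = min(permutations(range(1, 5)), key=lambda p: edge_sum(p, edges))
best2 = min(permutations(range(1, 5)), key=lambda p: distance_sum(p, dist))
print(best1, best2)  # a best labelling under each criterion
```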
The first approach bears some similarity to the graph bandwidth problem.   In the bandwidth problem, the goal is to minimize the maximum value of $|n(u)-n(v)|$.  If this value is $m$, then $m|E|$ is an upper bound on the sum that appears in the first approach.  It's possible, however, that the sum and the maximum are minimized by different labellings. 

The fact that the graph bandwidth problem is NP-hard suggests that it will also be difficult to minimize the sum.  I also suspect that the sum in the second approach will be of comparable difficulty.

Edit: There is a way of interpreting the graph bandwidth that is not obvious from the name (the name derives from matrices associated with the graph).  If we think of the labels $n(v)$ as locations of vertices along the line from 1 to $|V|$, then $|n(u)-n(v)|$ is the length of the edge $uv$.  So the graph bandwidth is the smallest possible value for the length of the longest edge.  We can say something similar about the first approach: the goal in that case is to minimize the total length of the edges or, equivalently, the average edge length.

[1] The astute reader will notice that there are 308 ridings in total for the last election, and there will be 338 in the next.  So where does this 312 come from?  Two ridings include islands that are too far from the mainland to be included in the same shape, so are included as separate shapes, and one riding includes 3 mainland components that are separated from each other by another riding, yielding 4 extra shapes.

Monday 25 November 2013

Antiderivative vs. Indefinite integral

I spent some time today getting ready for my class for the next term.  In particular, I was reading through the sections on antiderivatives and indefinite integrals.  I had always taken these to be distinct concepts.  Given a function $f$, an antiderivative of $f$ is any function whose derivative is $f$.  The indefinite integral is the set of all antiderivatives.  Since the indefinite integral is an infinite set of functions, while an antiderivative is a single function related to $f$ by differentiation, these are not the same thing.

Or at least, that's what I thought until I read the aforementioned sections.  The definitions I read today did not seem to make that distinction.  I consulted the textbook that I used in my own first-year calculus class many years ago.  That textbook does make the distinction.

I suspect the difference between the two ideas is not appreciated by a lot of students.  We don't dwell on it much once we give the definitions.  After that, we usually teach them techniques for finding indefinite integrals, and as long as they remember to include the constant of integration, we're happy.

The textbook that doesn't distinguish the two is one of the most widely used introductory calculus texts.  It's quite an old version, though.  Perhaps newer versions do distinguish.

In any case, it made me wonder how important, if at all, the distinction is.  Was a serious error made by leaving it out?  Am I being too picky by insisting my students understand the difference?

One thing I do find somewhat bothersome about erasing the distinction is the abuse of notation that results.  If $F$ is an antiderivative of $f$, then one writes
$$\int f(x) dx = F(x)$$
The constant of integration is omitted because there is no distinction between the two terms.  This is an abuse of notation, because there are actually infinitely many functions that could appear on the right, all differing from $F$ by a constant.  So the $=$ sign doesn't belong there, because it implies uniqueness.  Of course, even writing the constant of integration
$$\int f(x) dx = F(x)+C$$
can be seen as an abuse of notation, but at least it captures the idea that there are infinitely many antiderivatives of $f$ and each of them has the form on the right.  I suppose, to capture the essence of the idea, we could instead write
$$\int f(x) dx = \{F(x)|F'(x)=f(x)\}=\{G(x)+C|C\in{\mathbb R}\}$$
where $G$ is a known antiderivative of $f$, but I've never seen it written this way.
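Incidentally, computer algebra systems commit the same abuse of notation: sympy's integrate function, for example, returns a single antiderivative with no constant at all.

```python
from sympy import symbols, integrate

x = symbols('x')
print(integrate(x**2, x))  # prints x**3/3: one antiderivative, no "+ C"
```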

Monday 4 November 2013

Expansion Minimization

The determinant of a square matrix $A$ is typically defined in terms of cofactor expansion along the first row of $A$.  Then cofactor expansion along other rows or along columns is defined, and shown (or stated) to be equal to the determinant.  Following that, there is usually an example of a matrix that contains at least one zero.  This example is used to illustrate that sometimes, if we choose the appropriate row or column to expand along (namely, one with zeros), we can reduce the work needed to compute the determinant using cofactor expansion.

It is certainly easy to tell which row(s) or column(s) to expand along to minimize the number of cofactors it is necessary to compute at each step.  Simply expand along a row or column that has the most zeros.

But what row or column should we expand along to minimize the total number of cofactors to compute overall?  Or, to put it in more practical terms, what row or column should we expand along to minimize the amount of writing necessary?

Something to consider is that there are two ways to get a zero in the cofactor expansion.  If $B$ is some square submatrix encountered in the cofactor expansion of $A$, then, for the $(i,j)$ cofactor term to vanish, we need either $b_{ij}=0$ or $\det B_{ij}=0$ ($B_{ij}$ being the $(i,j)$-minor of $B$).  Since determining whether $\det B_{ij}=0$ requires a determinant computation, we would probably need the answer to our question in order to answer it, which leads to circular reasoning.  So we'll assume that we're basing our expansion minimization strategy purely on which entries of $A$ are 0.
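As a sketch of the easy per-step rule, here is a recursive expansion that greedily picks whichever row or column of the current submatrix has the most zeros.  (This implements the local rule, not the global optimum asked about below.)

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion, greedily expanding along the
    row or column with the most zeros at each step."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    row_zeros = (A == 0).sum(axis=1)
    col_zeros = (A == 0).sum(axis=0)
    if row_zeros.max() >= col_zeros.max():
        i = int(row_zeros.argmax())  # expand along row i
        return sum((-1) ** (i + j) * A[i, j]
                   * det_cofactor(np.delete(np.delete(A, i, 0), j, 1))
                   for j in range(n) if A[i, j] != 0)
    j = int(col_zeros.argmax())      # expand along column j
    return sum((-1) ** (i + j) * A[i, j]
               * det_cofactor(np.delete(np.delete(A, i, 0), j, 1))
               for i in range(n) if A[i, j] != 0)

A = np.array([[1, 0, 2], [3, 0, 4], [5, 6, 7]], dtype=float)
print(det_cofactor(A), np.linalg.det(A))  # the two values should agree
```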

Of course, the answer to the question posed here is mostly useless for practical purposes, since nobody is going to compute the determinant through cofactor expansion for anything but the smallest matrices.


But since we frequently present these examples and pose these problems to students, with the expectation that they'll know what the best row or column is to expand along, I wonder if there's actually some way to tell what the best path to the determinant is from the matrix itself.

Thursday 19 September 2013

A Few Questions Related to Planar Eulerian Graphs

Let $G$ be an Eulerian graph.  Suppose that $G$ has an embedding in the plane.  When does $G$ have a "non-crossing" Eulerian cycle?  That is to say, given a planar drawing of the graph, when can we find an Eulerian cycle in the graph that can be drawn along the edges of the drawing without crossing itself?

More generally, suppose that $G$ has $2t$ odd-degree vertices (that is, $t$ pairs).  It is known that the edges of $G$ can be decomposed into $t$ paths.  When is there a set of $t$ paths that decompose the edge set of the graph and that can be drawn without crossing themselves or each other?

Even more generally, suppose that $G$ has an embedding into a surface of genus $g$ (or non-orientable genus $g$).  When is there a set of $t$ paths that decompose the edge set of the graph and that can be drawn without crossing themselves or each other?

Thursday 12 September 2013

Planar Graph Drawing Problem

Suppose that $G$ is a planar graph.  Define a convex representation of a graph to be an assignment of each vertex $v$ in $V(G)$ to a convex polygon $p(v)$ so that two vertices $u$ and $v$ are adjacent if and only if $p(u)$ and $p(v)$ intersect in at least one edge of their boundaries (i.e. so that the polygons share more than just isolated points, but the area of their intersection is zero).  Furthermore, suppose that each vertex $v$ in $V(G)$ is assigned some positive number $f(v)$.

Does there exist a convex representation of $G$ so that the area of polygon $p(v)$ is equal to $f(v)$ for all $v$ in $V(G)$?

If $G$ is not 3-connected, then the answer is no, since a graph has a convex representation if and only if it is 3-connected.  An algorithm due to Tutte, applied to the dual graph, will find a convex representation, though it might not be aesthetically pleasing.

Based on the number of area cartograms that have popped up lately, it seems that representation of the graph by polygons with specified areas is also possible, if not all of the time, then most of the time (I've never seen a theorem to that effect, though it does seem plausible).

What I'm asking for, then, is the intersection of these two ideas.  Like the Tutte embedding, the representation should be convex.  Like the area cartogram, each polygon should have its specified area.

Sunday 4 August 2013

Variable x, Ex-variable.

Is the variable $x$ still a variable once a value has been assigned to it?  Because then, $x$ is fixed at the assigned value and can no longer vary, and that which cannot vary is not variable.

Saturday 3 August 2013

Coefficient Operation

The trace of a matrix satisfies the following equation
$$tr(A+B)=tr(A)+tr(B)$$
The trace is also the coefficient of the second-highest power of $\lambda$ in the characteristic polynomial (up to a factor of $-1$).

The determinant of a matrix satisfies the following equation
$$det(AB)=det(A)det(B)$$
The determinant is also the constant term of the characteristic polynomial (up to a factor of $-1$).

Thus, for both the trace and the determinant functions, there is a binary operation that is defined for both matrices and numbers, and the functions distribute over the corresponding binary operations.  Furthermore, both functions appear in the characteristic polynomial.

If the matrix is 3x3 or larger, then there are other coefficients between the constant term and that of the second-highest power.  We can think of these as generalizations of the trace and determinant.

That is, suppose that $f_k(A)$ is the coefficient of $\lambda^k$ in the characteristic polynomial $p(\lambda)$ of the $n\times n$ matrix $A$.  Then $tr(A)=f_{n-1}(A)$ and $det(A)=f_0(A)$.

That leads to the following question.  For each such $k$, does there exist a binary operation, say $\circ_k$, defined for both square matrices and numbers, such that $f_k(A\circ_k B)=f_k(A)\circ_k f_k(B)$ (up to a factor of $-1$)?  Note that we want $A\circ_k B$ to be a matrix and $a\circ_k b$ to be a number.

We can take $\circ_{n-1}$ to be addition and $\circ_0$ to be multiplication (both of which are defined for matrices and for numbers alike), so the answer is yes for $k=n-1$ and $k=0$.
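A quick numerical sanity check of both base cases, using numpy's convention that the characteristic polynomial is $\det(\lambda I-A)$, so that $f_{n-1}(A)=-tr(A)$ and $f_0(A)=(-1)^n det(A)$:

```python
import numpy as np

def f(k, A):
    """Coefficient of lambda^k in det(lambda*I - A)."""
    coeffs = np.poly(A)  # coefficients from lambda^n down to the constant term
    return coeffs[A.shape[0] - k]

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 1.]])

# f_{n-1} distributes over addition (it equals -trace):
print(np.isclose(f(1, A + B), f(1, A) + f(1, B)))  # True
# f_0 distributes over multiplication (det up to sign; exact here since n = 2):
print(np.isclose(f(0, A @ B), f(0, A) * f(0, B)))  # True
```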

The coefficient of $\lambda^n$ is always $\pm 1$.  Define $\circ$ by $a\circ b=1$ and $A\circ B=I$.  Then $f_n(A\circ B)=f_n(I)=1$ and $f_n(A)\circ f_n(B)=1\circ 1=1$.  Thus,
$$f_n(A\circ B)=f_n(A)\circ f_n(B)$$
So we could take $\circ_n$ to be the binary operation $\circ$ defined here.  So the answer is yes for $k=n$ as well.  Note that $a\circ b$ must be defined to be 1 in order to satisfy the desired property, but $A\circ B$ could be defined to be any matrix, since $f_n$ is always $\pm 1$. 

What can be said about $\circ_k$ for $k\in\{1,\ldots,n-2\}$?

There was some freedom in how we defined $\circ_n$.  It is not unique.  Is this true about the other $\circ_k$, if they even exist?

Sunday 28 July 2013

Two Videos

A friend sent me the following video today, inspired by a child's toy.

I've come across similar statements about things being "countless", and also cocked an eyebrow when I heard them.  For example, I once read that postal codes were unlimited as well, when there are, in fact, only 17,576,000 different possibilities, roughly half of the number of people in Canada.  In any case, it never occurred to me that such dubious claims (well, dubious to the mathematically trained) could be turned into lessons.  It's quite remarkable just how interested the students got in the problem.

A second video, enticingly titled "998,001 and its Mysterious Recurring Decimals", showed up as one of the recommended videos in the sidebar.  The topic of this video would make an interesting diversion in a course on infinite series, and like the previous video, it has something of a combinatorial flavour.

Wednesday 17 July 2013

David Gregory

[Updated: see below]

On Thursday afternoon, I received an email from one of my colleagues telling me that my former supervisor David Gregory's cancer had progressed and that he was in palliative care.  I originally planned on visiting him the next day, Friday.  However, I ended up spending most of the day finishing a job application that was due shortly, and the tone of the email did not suggest that the situation was that dire.  I had commitments on the weekend, so I put off the visit until Monday.  On Monday morning, though, I received an email from Kevin Vander Meulen, a friend, fellow collaborator, and also a former student of David's, who had found his obituary:
GREGORY, David - David Gregory of Kingston, Ontario passed away peacefully on Friday July 12, 2013 with friends and family by his side. He is predeceased by his parents Alan and Edith (Towell) Gregory. He is survived by his sister Betty Jane Lerer (Gregory) and his brother-in-law Harvey Lerer. As well he is survived by his nephew Michael Lerer, his niece Tara Lerer (Middleton) and great-nephew David Lerer. The family wants to thank his many friends and colleagues for their care and sympathy throughout his illness. A private funeral is arranged. A memorial celebration of David's life will be held in the near future. In lieu of flowers, donations to The Canadian Cancer Society or the University Hospitals Kingston Foundation would be preferred.
Even if I had gone on Friday, it might have already been too late.

The first time I had seen David speak was when he gave a guest lecture on the Good Will Hunting Problem to my undergraduate graph theory class.  Although I would not meet him until a few years later, I worked with David for the first time in the spring/summer of 1999 between the 3rd and 4th years of my undergraduate studies on the problem of addressing the Petersen Graph, the graph after which this blog is named.  I would collaborate with him again the following summer.

It wasn't until the fall of 2001, when I had started my master's degree at Queen's University, that I met him for the first time.  He was already retired then, after his first bout with cancer.  Although he was still clearly recovering from his treatment, he would still regularly come into his office to work on his research.

After that, I didn't speak with him again until my original supervisor, Dom de Caen, had passed away in the spring of 2002.  Even though he was already retired, David happily accepted me as his master's student, and later as his PhD student.

David was very dedicated as a supervisor.  We met at least once a week, sometimes briefly, and sometimes for a whole afternoon.  If I had questions outside of our regularly scheduled meetings, he was always happy to make time to discuss them with me, even if I just showed up at his door.  When I reflect on the quality of the first draft of my master's project, I can't help but conclude that he was a very patient man.  He helped me to turn what was a very rough draft into the final draft, without ever letting on that I had burdened him in any way.  He demonstrated the same patience and dedication throughout my PhD studies as well, being available early in the morning as well as sending me helpful emails late in the evening.

David was a regular participant in the Discrete Math Seminar held jointly between the respective departments at Queen's University and the Royal Military College, which I frequently attended.  His talks were always interesting, well organized, and easy to follow.

In addition to his role as a supervisor, he was also a good friend, giving me advice on personal matters and helping in other ways.

He will be missed.

-----

A memorial page has been set up by Bryan Shader, one of David's friends and collaborators.

Thursday 6 June 2013

Why iPod Shuffle isn't Random (Updated)

(Updated, see below)

The idea for this post came from a book that I picked up and put down at a Chapters/Indigo store some time ago.  I don't recall the title or author of the book, unfortunately.  I've gone to both locations in town to find it again, but couldn't.  So instead I'm going to try to recreate the argument that I didn't have time to read in full when I first saw it.

One thing I do recall from the book is that Apple got a bit of bad press because people didn't think their randomization algorithm was truly random.  The time it took to hear a repeat was too short according to listeners.  Suppose, for example, that I have 100 tracks on my iPod.  Then shouldn't I hear each track once in every 100 tracks?  On average, yes.  If the selection is truly random and I listen to enough tracks consecutively, then the average gap between repeats of a given track will be 100 (or 99, depending on whether or not you count the second occurrence towards the gap length).

However, the average doesn't describe what always happens, just what happens overall.  It's possible, although unlikely, for the first two tracks to be the same, so we get a repeat right away.  It's also possible, although very unlikely, to be able to listen to all 100 tracks straight, without a single repeat, so we don't get our first repeat until the 101st track.

Note that we must have a repeat by track 101.  For suppose that we listened to 100 tracks straight with no repeat.  Then we have heard all 100 tracks.  Thus, the 101st track must be a repeat.  This is an application of the  Pigeonhole Principle, where each track is a "pigeonhole" and there is one "pigeon" in it for each time we hear the track.  Therefore, we only need to concern ourselves with the chance of a repeat in the first 101 tracks.

We'll answer the question of the average time before a repeat in stages.  First we'll determine the probabilities that the next track is or isn't a repeat, assuming that no repeats have occurred.  Next we'll determine the probabilities that the first repeat occurs at track 1, track 2, etc.  Finally, we'll use these to calculate the average track number at which the first repeat occurs.

Probability that the next track is a repeat
First, we consider the probabilities that a given track is or isn't a repeat, assuming that no repeat has occurred before it.  Obviously, there is a $0\%$ chance that the first track is a repeat, because there are no tracks before it that it can be a repeat of.

After listening to the first randomly chosen track, there are 99 other tracks that we haven't heard.  Since each track is equally likely to occur, the chance that the second track is same as the first is $\frac{1}{100}$, and the chance that it is different is $\frac{99}{100}$.

Suppose that the first two tracks are different, i.e. that the second track was not a repeat.  There are now 2 tracks we have heard, and 98 that we haven't.  Thus, there is a $\frac{2}{100}$ chance that the next track is a repeat and a $\frac{98}{100}$ chance that it is not.

Similarly, if there is no repeat by the 3rd track, then there is a $\frac{3}{100}$ chance that the next track is a repeat and a $\frac{97}{100}$ chance that it is not.  If there is no repeat by the 4th track, then there is a $\frac{4}{100}$ chance that the next track is a repeat and a $\frac{96}{100}$ chance that it is not, and so on.

In general, if we have listened to $k$ tracks without a repeat, then the chance that the next track is a repeat is $\frac{k}{100}$ and the chance that it is not is $\frac{100-k}{100}$.

Probability that a specific track is the first repeat

The next things we need to know are the chances that we will hear the first repeat at track 1, track 2, track 3, etc.

The chance of hearing the first repeat at track 1 is 0, as explained above.  The chance of hearing the first repeat at track 2 is $\frac{1}{100}$.

In order to hear the first repeat at track 3, tracks 1 and 2 must be distinct, and track 3 must be the same as either track 1 or 2.  As explained above, the probability that 1 and 2 are distinct is $\frac{99}{100}$ and the probability that 3 is the same as 1 or 2, assuming 1 and 2 are distinct, is $\frac{2}{100}$.  Thus, the probability that the first two tracks are distinct but the third is the same as one of the first two is
$$\frac{99}{100}\frac{2}{100}$$
In order to hear the first repeat at track 4, tracks 1 through 3 must be distinct.  The probability that the second track is different from the first is $99/100$ and, assuming the first two are distinct, the probability that the third is also distinct is $98/100$.  Thus, the probability that the first 3 tracks are distinct is
$$\frac{99}{100}\frac{98}{100}$$
Since the first three tracks are distinct, the probability that the fourth track is a repeat is $3/100$, so the probability that track 4 is the first repeat is
$$\frac{99}{100}\frac{98}{100}\frac{3}{100}$$
Similarly, the probability that track 5 is the first repeat is
$$\frac{99}{100}\frac{98}{100}\frac{97}{100}\frac{4}{100}$$
In general, the probability that the first repeat happens at track $k$ is
$$\frac{99}{100}\frac{98}{100}\cdots \frac{(100-k+2)}{100}\frac{k-1}{100}=\frac{(k-1)99!}{100^{k-1}(100-k+1)!}$$

where $n!=n(n-1)(n-2)\cdots (2)(1)$ (read "$n$ factorial").

Average track of the first repeat

To get the average track number at which the first repeat happens, we multiply each number $k$ between 1 and 100 by the probability that the track $k$ is the first repeat, then add the results.  That is, the average track number for the first repeat is
$$0(1)+\frac{1}{100}(2)+\frac{99}{100}\frac{2}{100}(3)+\frac{99}{100}\frac{98}{100}\frac{3}{100}(4)+\cdots+\frac{(101-1)99!}{100^{101-1}(100-101+1)!}(101)$$
It's far too much work to calculate this number by hand, but plugging it into a computer gives track 13.21 (rounded to two decimal places).  So we can expect to get through only a small fraction of the collection before we start to hear repeats.
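For anyone who wants to check the 13.21 figure, here is a short computation of the whole distribution and its mean, using exact fractions until the final conversion:

```python
from fractions import Fraction

def first_repeat_distribution(n):
    """P(first repeat occurs at track k) for an n-track collection, k = 2..n+1."""
    probs = {}
    p_distinct = Fraction(1)                        # P(no repeat in first k-1 tracks)
    for k in range(2, n + 2):
        probs[k] = p_distinct * Fraction(k - 1, n)  # k-th track repeats one of the k-1 heard
        p_distinct *= Fraction(n - (k - 1), n)      # or the k-th track is new
    return probs

probs = first_repeat_distribution(100)
mean = sum(k * p for k, p in probs.items())
print(float(mean))  # about 13.21
```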

If we use the average length of a pop song of between 4 and 4.5 minutes, that means we can expect to hear the first repeat somewhere between 52.8 minutes and 59.4 minutes.

The other averages
Typically, when we talk about the average of some values, we mean the mean, i.e. the sum of the values divided by the number of values.  There are other types of averages, though.  The median and the mode are probably the most common.  You probably learned about them in school, but forgot about them because they're not as widely used, although there are sometimes situations where one of them might be more suitable.

The median, for example, is the middle value (if the numbers are ordered).  For example, suppose our values are $1,2,2,3,4$.  There are 5 values in total, so the third value is the middle value.  Therefore, the median of this list of values is 2.  The median is preferable to the mean when talking about average incomes, because the very small number of people with very large incomes drives up the mean, giving a skewed perception of the typical income.  The median, on the other hand, is more stable.  For example, the $4$ could be changed to $40$ or $4,000$, but the medians in both cases would still be 2, while the means change to 9.6 and 801.6 respectively.

The mode is the most frequently occurring value.  In each of the lists of values in the preceding paragraph, the number $2$ appears twice, while each other number only appears once, so the mode is 2.  The mode is useful when our values aren't numbers.  For example, you could take a poll on people's favourite colour.  The mode is the colour that is picked by the most people.  The median and the mean are not defined in this case, because colours are not numbers (they can be described numerically, but the numerical description would not be of much use in this case).

A more detailed discussion on mean, median, and mode can be found here.

The median number of tracks before a repeat in this case is 12 or 13.  This means that roughly half of the time, the first repeat will occur on or before the 12th track, and roughly half of the time, the first repeat will occur on or after the 13th track.

The mode in this case is 11.  That means that the first repeat happens at track 11 more often than it happens at any other track.  The probability that it will happen at track 11 is quite small at only 6.28%.  However, the probability that it will happen at any other track is even smaller (and the further from 11 we get in either direction, the lower the probability is).
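Both figures can be read off the same distribution; here is a minimal sketch (the distribution function is repeated so the snippet runs on its own, and the median convention picks the first track where the cumulative probability reaches one half):

```python
from fractions import Fraction

def first_repeat_distribution(n):
    """P(first repeat at track k), as in the previous snippet."""
    probs, p_distinct = {}, Fraction(1)
    for k in range(2, n + 2):
        probs[k] = p_distinct * Fraction(k - 1, n)
        p_distinct *= Fraction(n - (k - 1), n)
    return probs

probs = first_repeat_distribution(100)

mode = max(probs, key=probs.get)          # the single most likely first-repeat track
cumulative, median = Fraction(0), None
for k in sorted(probs):
    cumulative += probs[k]
    if cumulative >= Fraction(1, 2):      # half of the probability mass reached
        median = k
        break

print(mode, float(probs[mode]))  # 11, about 0.0628
print(median)                    # 13; the mass splits almost evenly at 12/13
```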

Conclusion

All three of these averages are pretty close to each other, relative to the size of the music collection, and they all represent a pretty small fraction of the collection.  To make matters worse, more than half the time, a repeat is going to happen before the average.

Now, a music collection of only 100 tracks is rather small.  My own currently has around 400 tracks, and I'm not even an avid purchaser of music.  The mean track number of the first repeat for a 400 track collection is 25.73, the median is 24, and the mode is 21.  Thus, the different averages just barely double (if that), even though the music collection is 4 times as large.

If we multiply by 4 again, for a collection of 1,600 tracks, the mean is 50.8, the median 48, and the mode 41, again, roughly only doubling the number of tracks before a repeat.

Thus, even for large collections, no matter how much Apple made sure that tracks were chosen randomly, we would be bound to hear repeats after only a small proportion of the tracks.  So we don't actually want true randomness.  Apple solves this problem by choosing a fixed ordering of the tracks each time the iPod (or iPhone, iTunes, etc) is started, much like shuffling a deck of cards, thus the name "shuffle".

Admittedly, I'm using a bit of verbal sleight of hand here, taking advantage of the ambiguity of the word "random".  There is an element of randomness in the shuffle too.  After all, the category of games of chance includes games played with a shuffled deck of cards.  To form the list, the first track can be chosen at random from the collection, and each subsequent track can be chosen randomly from the tracks that haven't been chosen yet, but this is not typically what's meant by the word "random" when it appears without any qualifiers.

Despite Apple's attempts to fix the problem, however, there are still skeptics.

Update.  It's been about a year since I first wrote this.  I never did find the book that I referred to above.  However, I did find a related problem in another book called Duelling Idiots and Other Probability Puzzlers by Paul Nahin.  In one of the later chapters, the author discusses the NBA draft, where a similar problem shows up.

Wednesday 29 May 2013

Test

$\sqrt{x}$
\(\sqrt{x}\)

Welcome to the Petersen Graph

This is not my first attempt at a math blog.  I have an older one on Wordpress called Idiot String Theories.  I chose Wordpress at the time because it was the only one I was aware of that would let me use LaTeX commands to write math notation.  I neglected the blog because other things took up my time and also because I never got used to the Wordpress LaTeX syntax, which was limited and somewhat different from actual LaTeX.

After a while, I also started to feel that the name Idiot String Theories might sound a bit harsh.  It wasn't meant to be.  "Idiot strings" refers to strings that are tied to gloves or mittens and threaded through the sleeves to prevent children from losing them.  The "theory" part was added as a reference to "string theory", something about which I actually know very little.  Maybe somebody reading it might have concluded that I thought string theory was idiotic, but I don't.  It was just meant to be playful.

Thanks to Christian Perfect, I have discovered a new, better way to include math in blogs called MathJax.  I haven't used it yet -- that's what this blog is for -- but it appears to work more like LaTeX itself does.  Since it's not platform specific, like Wordpress's math support was, I can use my Blogger account, which I've had since before I started using Wordpress.  I suspect it still won't be as good as using LaTeX itself, but from what I can tell, it is an improvement.  We'll see.

The function of this blog will also be different.  The main purpose of the old blog was to make a public catalogue of questions that I'd come up with while doing research, but that I either didn't have time to investigate or that were too far beyond my area of expertise.  I might still do that here, but I also want to write more about other topics, and in particular topics that I am better versed in.  Broadly speaking, I am interested in discrete math.  Specifically, my research area is algebraic graph theory.

The blog is named after the Petersen Graph, one of the most commonly used examples in graph theory.  A picture of the Petersen Graph can be seen in the top right-hand corner of the algebraic graph theory link above.

I may also comment on teaching mathematics, as well as different ways I've noticed where mathematics can be applied to so-called "real life" situations.