
Saturday 7 June 2014

Friends in high density places

Last August, I found the map below.  2% of the Australian population lives in the yellow area and the remaining 98% lives in the white area.



Like Australia, Canada is geographically large, but sparsely populated.  Canada's population density is a little higher, though not by much.  On average, there are 3.56 people per square kilometre in Canada compared to Australia's 3.06.  Also, like Australia, our population is centred around a small number of urban areas.  The three most populous metropolitan areas, Toronto, Montreal, and Vancouver, alone make up more than a third of our population, the top 10 more than a half, and the top 100 nearly 80% (mind you, most of the cities in that top 100 list have fewer than 100,000 people, and some, such as Timmins and Cape Breton, are quite large in area).  (Population data culled from various Wikipedia pages.)

So I wondered what a similar map for Canada would look like.

Unfortunately, there doesn't seem to be much information out there about how the map was created.  Within each of the yellow and white areas, we see other borders drawn in.  I'm assuming that these represent some national subdivisions, perhaps municipal boundaries or census regions.  So whoever made the map made a choice of which regions to colour yellow and which regions to colour white.

There are many choices that would split the population between 2% and 98%, though.  For example, to construct a region containing 98% of the population, like the one coloured white here, one could start with the sparsely populated inland regions and add increasingly denser regions until 98% is reached.  The result would not be very interesting.  Australia's population is roughly 23.5 million.  2% of that is 470,000, only a bit smaller than the population of Tasmania, the country's smallest state, and roughly 10% of the population of Sydney, the country's largest city.

So we could create a map where all of Australia other than some parts of Tasmania is coloured white, or where all but a few urban specks of Sydney are coloured white.  In the latter case, the yellow parts would probably be so small that it would be difficult or impossible to see them.  The map would hardly be different from a plain map of Australia.  What makes the existing map interesting, though, is how small the area containing 98% of the population is.

To make the map as interesting as possible, then, we would want the white part to be as small as possible.  As I said above, however, there is no information about how the map was created.  Is the white region as small as it could be, based on the subdivisions used, or is there some other choice that would lead to an even smaller region?  Did the creator simply pick the $m$ sparsest or least populous subdivisions, or was something more sophisticated used?

I found it through Twisted Sifter, which refers back to a Reddit thread.  This thread links to an image hosted on imgur, which I embedded above.  After that the trail goes cold.  One of the comments from the original poster suggests that he or she is the one who created it, but there is no follow-up, and the user's online profile doesn't offer any direction either.

Without knowing, then, we'd have to come up with our own method.  One simple method is to rank the subdivisions by population, and choose the first $m$ that give 98% of the population.  Another would be to rank by area, and choose the first $m$ that give 98% of the population.  A third option would be to rank by density and choose the first $m$ that give 98% of the population.  None of these is guaranteed to give the smallest area that includes 98% of the population (or, equivalently, the largest area that includes 2% of the population).
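To make those three methods concrete, here is a minimal sketch in Python.  The data format, a list of (name, area, population) tuples, is purely hypothetical, and the direction of each ranking is just one plausible reading of the descriptions above.

```python
# A sketch of the three greedy rankings described above.
# Each subdivision is a hypothetical (name, area, population) tuple.

def greedy_selection(subdivisions, key, target=0.98):
    """Add subdivisions in the order given by `key` until the chosen
    set contains at least `target` of the total population."""
    total_pop = sum(pop for _, _, pop in subdivisions)
    chosen, chosen_pop = [], 0
    for name, area, pop in sorted(subdivisions, key=key):
        if chosen_pop >= target * total_pop:
            break
        chosen.append(name)
        chosen_pop += pop
    return chosen

# Rank by population (most populous first), by area (smallest first),
# or by density (densest first); none is guaranteed to minimize the area.
by_population = lambda s: -s[2]
by_area = lambda s: s[1]
by_density = lambda s: -s[2] / s[1]
```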

Another less obvious way to solve the problem is through a mathematical technique known as linear programming.  We create a function that gives us the area if a certain set of subdivisions is chosen, then we try to maximize the area under the condition that the percentage of the population that lives in the chosen set of regions is less than or equal to 2%. Ideally, we'd be able to insist that the population is equal to 2%, but this might not be possible.  For our purposes that shouldn't be a problem, though, since a smaller number than 2% would only further illustrate the sparseness.

Now let's write this in mathematical notation.  Suppose that ${\bf c}$ is a vector where entry $i$ of ${\bf c}$ is the area of subdivision $i$.  Let $A$ be a row matrix where column $i$ of the only row of $A$ is the percentage of the population in subdivision $i$.  Let ${\bf x}$ be a vector of variables where entry $i$ of ${\bf x}$ is the variable $x_i$.  If $x_i=1$ when subdivision $i$ is chosen and $0$ otherwise, then ${\bf c}\cdot {\bf x}$ gives the total area for that choice of subdivisions and $A{\bf x}$ is the corresponding percentage of the population.  Then we want the following:
  • maximize $f({\bf x})={\bf c}\cdot {\bf x}$ such that
  • $A{\bf x}\leq 0.02(=2\%)$ and
  • $x_i$ is 0 or 1 for each $i$
This is not exactly a linear programming problem.  The third condition makes it a binary integer programming problem, because each $x_i$ can be one of only two integer values, 0 or 1.  To make it a linear programming problem we replace the third condition with the following:
  • $x_i\geq 0$ for each $i$.
 This could allow for the possibility that a region is partially included.  That is, only a certain percentage of a region's population counts towards the total percentage.  That means that we also want $x_i\leq 1$, since we can't include more than 100% of a region's population.  In that case, the problem is
  • maximize $f({\bf x})={\bf c}\cdot {\bf x}$ such that
  • $A{\bf x}\leq 0.02$ and
  • $0\leq x_i\leq 1$ for each $i$.
As stated, this is not quite in the standard form of a linear programming problem.  However, we can easily put it into that form.  Suppose that there are $n$ subdivisions in total.  Let $B$ be the $(n+1)\times n$ matrix whose first row is $A$ and whose remaining $n$ rows form the $n\times n$ identity matrix.  Let ${\bf b}$ be the vector with $n+1$ components whose first entry is $0.02$ and whose remaining entries are all 1.  Then we can rewrite the above as
  • maximize $f({\bf x})={\bf c}\cdot {\bf x}$ such that
  • $B{\bf x}\leq {\bf b}$ and
  • $x_i\geq 0$ for each $i$.
This is now a linear programming problem.  Unfortunately, the solution to the linear programming problem is not necessarily the solution we want, although the problem has the upside of being easier to solve.  In general, solving a binary integer programming problem is not easy.  So we may be able to solve the linear programming problem and use that to get a good solution to our problem (for example, by rounding), even though it might not be the optimal one.  Note that if we restricted to integer solutions, we would get the binary integer programming problem back.  There are some special cases where integer linear problems are easy to solve.  Perhaps this is one.  (Who knows? Maybe one of the three simple methods I gave above will turn out to be the optimal solution, though I'm sceptical any of them would be.)
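For what it's worth, here is a hedged sketch of how the relaxed problem could be handed to an off-the-shelf solver such as scipy.optimize.linprog.  The areas and population shares below are made-up placeholder numbers, and the bounds argument plays the role of the identity rows of $B$.

```python
# Sketch of the LP relaxation using scipy; the data is made up.
import numpy as np
from scipy.optimize import linprog

areas = np.array([1200.0, 350.0, 90.0, 40.0])           # hypothetical areas
pop_fractions = np.array([0.005, 0.010, 0.300, 0.685])  # hypothetical population shares

# linprog minimizes, so negate the areas to maximize c . x.
# The single row A x <= 0.02 caps the chosen population at 2%;
# the bounds 0 <= x_i <= 1 stand in for the identity rows of B.
result = linprog(
    c=-areas,
    A_ub=pop_fractions.reshape(1, -1),
    b_ub=[0.02],
    bounds=[(0.0, 1.0)] * len(areas),
)

# Rounding the relaxed solution gives a 0/1 choice, which should be
# checked against the 2% constraint and need not be optimal.
x = np.round(result.x)
print(x, pop_fractions @ x, areas @ x)
```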

Seeing a version of this map for Canada would be most interesting to me, because that's where I'm from and because of the geographical similarities.  But maps of this sort could be interesting for other countries, or cities, provinces or states, continents, or even the whole inhabitable world.  And of course, we need not insist on a split between 2% and 98%.  Perhaps a 50/50 split would be interesting.  Maybe there is some other interesting percentage, such as a percentage $X$ for which $X$ percent of the population lives on $100-X$ percent of the land (if that number even exists).

If the binary integer programming problem could be solved, it would be a simple (though possibly tedious) matter of collecting the required geographic data and plugging it into some map making software.

Update: After finishing this post, I emailed a colleague of mine who has much more experience in linear programming than me to ask him if he had any insights.  He told me that this is a knapsack problem.  I had heard of this problem before, but only informally, so I didn't clue in that it was a special case of integer linear programming.  Oddly, neither the linear programming nor the integer linear programming Wikipedia pages refer to the knapsack problem, nor does the knapsack page refer to either of the other two.

In any case, this is sort of good news, but only good in the sense of "maybe less bad."  The general binary integer programming problem is NP-hard, whereas the knapsack problem is NP-complete.  NP-hard problems are at least as hard to solve as NP-complete problems.  The reason it's only "maybe less bad" is that it is not currently known whether or not there are efficient algorithms for solving NP-complete problems.  According to the Wikipedia page on the knapsack problem, though, there are efficient algorithms for finding approximate solutions.  Maybe these would be good enough.  Indeed, our eyes may not be able to discern between the optimal solution and an approximate one.  In that case, I guess it's more or less good news.
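As a taste of what an approximate solution could look like, here is a sketch of the classic greedy heuristic for the 0/1 knapsack problem, with the subdivisions as items, area as the value, population share as the weight, and 2% as the capacity.  Taking the better of the greedy bundle and the single best item that fits is known to achieve at least half of the optimal value; whether that is close enough for a convincing map is another question.

```python
# Greedy 1/2-approximation for 0/1 knapsack, applied to the map problem.
# items: list of (value, weight) pairs, e.g. (area, population share).

def greedy_knapsack(items, capacity):
    """Return an approximately value-maximizing selection (list of indices)
    whose total weight does not exceed `capacity`."""
    # Take items greedily in order of value per unit weight.
    order = sorted(range(len(items)),
                   key=lambda i: items[i][0] / items[i][1],
                   reverse=True)
    chosen, weight, value = [], 0.0, 0.0
    for i in order:
        v, w = items[i]
        if weight + w <= capacity:
            chosen.append(i)
            weight += w
            value += v
    # Comparing against the single most valuable item that fits on its own
    # is what gives the 1/2-approximation guarantee.
    fitting = [i for i, (v, w) in enumerate(items) if w <= capacity]
    if fitting:
        best_single = max(fitting, key=lambda i: items[i][0])
        if items[best_single][0] > value:
            return [best_single]
    return chosen
```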

Notes:
  • For more on linear programming, see here and here or, better yet, pick up a book, such as this one, which was what I first learned about linear programming from.
  • I don't currently have the resources at my disposal to do any of the above, but the population density map below from Statistics Canada probably gives a rough idea of what the result would be.

Thursday 22 May 2014

Teaching Math with PowerPoint - a post mortem

I taught a second year linear algebra course in the fall term of the 2013 - 2014 school year.  The first few lectures were review of the prerequisite course.  My normal teaching practice consists almost exclusively of writing on the board when I'm not directly interacting with students.  But I figured that since my students had learned the material already, they didn't need to take notes and I could run through it a little faster than if they were seeing it for the first time.

So I put all of the review notes on PDF slides (not actually PowerPoint, despite what I wrote in the title), using the Beamer package for LaTeX.  The first week went fairly well, so instead of switching back to my usual practice at the end of the review stage, I decided to continue using slides.  I didn't intend on using them for the rest of the term, but in the end, that's what I did.

Overall, it was well received by the students.  Direct feedback was positive, and I received no complaints on course evaluations.  One complaint that could have been levelled at me is that the slides were not always perfect.  Sometimes there were typos, for example.  Perhaps the students were content with the corrected versions that I later posted on the course website.

Student performance was consistent with other courses I've taught.  A couple of students failed, but one was a student who was also doing poorly in other classes, and there were extenuating circumstances to explain the other student's failure.

It would be irresponsible to draw conclusions about the effectiveness of the practice at this stage, mind you.  It was the first and only time I've taught this course and the first time I've used slides to this extent.  Nevertheless, it went well enough that I plan on doing it again in the upcoming year.

Although the course was far more heavily oriented towards slides than any other course I've taught, I still relied heavily on the blackboard.  I believe it's better to work out examples with the students, so I would write only the examples on the slides, and work out the solutions on the blackboard (often putting the final answer on the slides too, though not always).

I also made heavy use of Beamer's "pause" command.  That way, I could present the text of a given slide gradually, one line at a time or even one word at a time if I wanted to, rather than all at once.  To some extent, this mimics writing on the board, where the content is revealed chalk stroke by chalk stroke.  I think this makes it easier to absorb.  Rather than trying to take in a whole slide's worth of information, only one line needs to be comprehended at a time.  I also think it helps to keep the pace of the lecture from getting too fast, an easy temptation to give in to with slides (or overheads, in the olden days).

Switching back and forth between the blackboard and the screen was a bit frustrating at times, since it seems that most classrooms equipped with projectors (which is most classrooms, if not all, these days) are not designed to mix the two methods. 

I taught in three different classrooms for the class.  Each room had a different layout with different upsides and downsides.

One of the classrooms had the projector screen covering up one panel of the chalkboard.  This was okay as long as I could fit everything in the other panel.  Occasionally, I needed both, which meant rolling up the screen and turning off the projector, then, when I was done with the other panel, pulling the screen back down and waiting for the projector to turn back on.  None of these are onerous tasks, but they interrupt the flow of the lecture and use up time.

The other two classrooms had their screen covering the middle of the blackboard, with the edges of the boards still exposed to write on.  In one room, the exposed edges were quite large, about the size of the uncovered panel in the other classroom.  In the other, however, they were fairly narrow.  The annoyance in both cases was that if an example took up more than one board, I had to finish it on the other side where it was hard to see what I had written on the first side because the screen was in the way.

Another issue I had is that screen real estate is quite limited.  Some things could not be fit onto a single slide.  Something on one slide might refer to something on another slide.  On a blackboard, one can use two different panels for the different things, but there is only one screen.

The other problem I ran into was malfunctioning technology.  In one room, the projector bulb burnt out.  In another, I could not access the network to log in to the computer.  Fortunately, computer services was quick to replace the bulb in the former case, and there was another open classroom with a working projector in the latter case.


If I had my way, each room would be equipped with two, or even three, projector screens, one for the current slide, and the other(s) for showing preceding slides or calling up older results or examples.  There would also be a chalkboard with at least two panels to write on, even when all screens are in use.  Of course, this wouldn't be feasible in all but the largest classrooms, and even if it were, it could end up being the classroom version of The Homer.  I'd settle for an effort on the part of classroom designers to consider that someone might want to use both blackboard and projector at the same time.

Despite the space taken up by the above complaints, they were all fairly minor in the grand scheme of the course.  They don't add up to enough to discourage me from doing it again, looking for places to improve, and maybe thinking about trying it for other courses.

Wednesday 30 April 2014

Pythagorean Statistics

Suppose that $X_1$ and $X_2$ are independent random variables such that $X_1+X_2$ is defined.  Let $\sigma_1$, $\sigma_2$, and $\sigma$ be the standard deviations of $X_1$, $X_2,$ and $X_1+X_2$ respectively.  Then $\sigma^2=\sigma_1^2+\sigma_2^2$ (the numbers $\sigma_1^2,$ $\sigma_2^2$, and $\sigma^2$ are called the variances of $X_1,$ $X_2$, and $X_1+X_2$ respectively).  This looks very much like the Pythagorean Theorem (PT).  In fact, at least one person calls it the Pythagorean Theorem of Statistics (PToS).
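The identity is easy to check numerically.  Here is a quick simulation; the particular distributions are arbitrary choices, picked only because they are easy to generate.

```python
# Numerical check that var(X1 + X2) = var(X1) + var(X2) for independent X1, X2.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.exponential(scale=2.0, size=1_000_000)  # an arbitrary distribution
x2 = rng.uniform(-1.0, 1.0, size=1_000_000)      # another arbitrary distribution

lhs = np.var(x1 + x2)           # sigma^2
rhs = np.var(x1) + np.var(x2)   # sigma_1^2 + sigma_2^2
print(lhs, rhs)                 # the two agree up to sampling error
```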

In the link, the author gives a proof of the PToS, but the proof doesn't look much like the proof for the PT.  Nevertheless, the similarity is hard to ignore.  So I wonder, is the PToS just the PT dressed up in statistical clothing, or is it merely a coincidence that the similarity exists?

I suspect it's the former, but I don't quite see the connection yet.  The PT is about right triangles in a plane, and I don't see what that plane would be for the PToS, nor what the triangle is, nor why the third side of that triangle should be related to $X_1+X_2$.  The other author doesn't seem to be aware of a connection either, since none of the reasons he gives for calling it the PToS are "because it is the Pythagorean Theorem."

Update:

My initial instinct was to represent $X_1$ and $X_2$ with a pair of orthogonal axes using $(\mu_1,\mu_2)$ as the origin, where $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$ respectively.  If we let $\vec{x}_1=(\sigma_1,0)$ and $\vec{x}_2=(0,\sigma_2)$, then we could represent $X_1+X_2$ with the line through $(\mu_1,\mu_2)$ with the direction vector $\vec{x}_1+\vec{x}_2$.  The length of $\vec{x}_1+\vec{x}_2$ is $\sigma=\sqrt{\sigma_1^2+\sigma_2^2}.$  Therefore, $\sigma^2=\sigma_1^2+\sigma_2^2.$  So we get the Pythagorean identity.

This isn't a geometric proof of the Pythagorean Theorem of Statistics, though.  At best, it is an illustration of the Pythagorean Theorem of Statistics by the Pythagorean Theorem from geometry.  It is perhaps natural to represent $X_1$ and $X_2$ by orthogonal axes.  Representing $X_1+X_2$ by the line in the direction of $\vec{x}_1+\vec{x}_2$ was forced to make the geometry work.  It's not as natural.  The more natural thing to do is represent $X_1+X_2$ by a third axis orthogonal to both $X_1$ and $X_2$.  Also, I do not see how the statistical theorem would follow from the geometrical illustration.

Sunday 9 March 2014

Competition and the Logistic Distribution

In my recent post on the other bell curves, I wondered if there was some familiar process underlying the logistic distribution that would parallel coin-flipping for the normal distribution.  I noted there that the logistic distribution works better for the ELO rating system than the normal distribution does.

I also noted that the curve that describes the logistic distribution arises from population growth models that account for competition for resources.  The ELO rating system is also used in the context of competition, but for games.  I wondered whether there was something to this.  Was it a mere linguistic coincidence, or was there a deeper connection?

Well, I still don't have a complete answer, but I can see at least one connection between the two scenarios.  In both cases, there is competition for a finite resource.  For the population growth model it is food supply, and for games, it is wins. 

The parallel is not exact, however.  In particular, the logistic model is used to describe the size of a population, but for a game, the logistic distribution is only used to compute the probable outcomes within a set of competitors, not the number of competitors. So the connection is not complete.

Thursday 6 March 2014

Understanding Self-Information

I've never taken a course in information theory, but recently I've been trying to learn more about it on my own.  I reached a satisfactory understanding of most of the concepts, but an acceptable understanding of self-information proved elusive for quite a while.  If it were just another term of many, I might have been content to ignore it, but the definitions of all of the other terms rest on this concept, so I didn't consider ignoring it to be an option.

Most introductory texts on information theory that I've seen take the definition as axiomatic.  If an event occurs with probability $p$, then the self-information of that event is defined to be $-\log p$, where the logarithm is to some chosen base (often 2).  What I wanted to understand better is why the self-information is defined that way.  In my experience so far, this is rarely explained and never explained well.

It did make some intuitive sense, as various authors claim.  One property of this definition is that the self-information of an event increases as the probability decreases.  This is a reasonable property (I would even say necessary) for any self-information function.  Suppose, for example, that we are playing a game of hangman.  The letter "z" is far less frequent in English than the letter "s", so if a "z" is showing, then it is easier to narrow down the word than it would be if an "s" were showing.  Thus, the rare event "contains the letter 'z'" gives more information than the common event "contains the letter 's'."

But $-\log p$ is not the only function that works like that.  We could also use $1-p,$ which is far simpler and avoids the dreaded logarithms.  Simple is always preferred to complicated, so there must be a reason why $-\log p$ is the function of choice.

Well, in order for information to be transmitted, there must be some way to represent it.  Spoken languages use sounds.  Written languages use words made of letters from some alphabet.  Computers use strings of 0s and 1s, which we could think of as words from a two letter alphabet.

The number of possible strings (words) of length $k$ from an alphabet with $n$ symbols is $n^k$.  Let's represent this number by $w$, so $w=n^k.$  We could also write this relationship between $w,$ $n$, and $k$ with the equation $k=\log_n w$.  Although this equation is equivalent to $w=n^k$, we can interpret it differently as follows.  If we want to assign unique labels to $w$ different objects using strings of the same length from an alphabet of size $n$ (that's a mouthful!), then $k$ is the smallest possible length of the strings.

Now, if these $w$ strings are chosen randomly with equal probability, say $p$, then $p=1/w$.  Therefore, $w=1/p,$ and so $k=\log_n \frac{1}{p}=-\log_n p$.  Aside from the base appearing explicitly in this expression for $k$, it is identical to the self-information.  So the self-information of an event with a given probability $p$ can be interpreted as the length of a string that is needed to represent that event if it is one of many events that all occur with equal probability.  Of course, it is possible that some other event would have a different probability, but that doesn't change the amount of information provided by the occurrence of the event in question.
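Here is the argument in miniature, for a binary alphabet and eight equally likely events.

```python
# With n = 2 symbols and w = 8 equally likely events (p = 1/8),
# the labels must have length k = log_2(8) = 3, which equals -log_2(p).
import math

n, w = 2, 8
p = 1 / w

k = math.log(w, n)            # label length: 3
self_info = -math.log(p, n)   # self-information in bits: also 3
print(k, self_info)
```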

I don't know if this (rather long-winded) explanation makes the definition any more sensible for you, the reader.  But once I arrived at this realization, I finally felt like I had understood why self-information is defined the way it is.

-----------------

Another difficulty that I had with the definition is the probabilistic model through which it is usually explained.  I can see why using this model is sensible from the perspective of the working information theorist.  But as someone trying to learn about it, I couldn't see how a randomly transmitted symbol could provide any information at all.  If I flip a coin and it shows up heads, so what, even if it's an unfair coin and heads is rare?  And information is not usually transmitted randomly anyway.

Tuesday 25 February 2014

The other bell curves

After lines, conic sections (circles, parabolas, etc.), and graphs of trigonometric functions, the most familiar curve has to be the bell curve.  Perhaps it is even more famous than some or all of those other curves, especially amongst students who just finished an exam that they don't feel good about.

Explaining exactly why we use the bell curve takes some work, but a rough explanation can be given in terms of coin flips.  If you flip a coin, say, 20 times, then you can expect to get 20 heads with a certain probability, 19 heads and 1 tail with a certain probability, 18 heads and 2 tails with a certain probability, and so on.  This is an example of a binomial distribution.  If you plot these probabilities, you get a shape that looks roughly like the bell curve.  For 30 coin flips, the shape looks even more like the bell curve.

In fact, you can make the shape look as close to the bell curve as you want by considering a large enough number of coin flips.  The coin doesn't even need to be fair.  That is, it could favour one of heads or tails more than the other, and with enough coin flips, you could still make the shape look as much like the bell curve as you like.
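If you want to see this for yourself, a few lines of Python will plot the binomial probabilities for 20 flips alongside the matching normal curve (the plotting details are incidental).

```python
# Binomial probabilities for 20 fair coin flips vs. the normal approximation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

n, p = 20, 0.5
k = np.arange(n + 1)
plt.bar(k, binom.pmf(k, n, p), alpha=0.5, label="binomial probabilities")

x = np.linspace(0, n, 200)
plt.plot(x, norm.pdf(x, loc=n * p, scale=np.sqrt(n * p * (1 - p))),
         label="normal curve")
plt.legend()
plt.show()
```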

So lurking behind many things that give a shape that looks like the bell curve, there is a process something like a coin flip going on.  This is an oversimplification, of course.  There are other ways to reach the bell curve besides coin flips.  The whole story would take a while to explain, however.  The point is that we can arrive at the bell curve through a simple, familiar process.

The most famous bell curve is not the only bell curve, though.  The one I'm talking about above is known as the Normal Curve or Gaussian Curve [1], but there are other curves that have a bell shape yet aren't Normal Curves.

One curve comes from the logistic distribution.  Another curve comes from the Cauchy distribution.  What I wonder about these curves is this.  Is there some way to arrive at either of these distributions from some simple process in the same way that we can arrive at the normal distribution through the process of coin flips?

One place where the logistic distribution crops up is in the ELO rating system.  This system was developed by a physicist named Arpad Elo (for some reason, his name gets capitalized, as if it's an acronym, when it comes to the rating system) to rate chess players, though it can be used for any type of competition, from individual games like chess to team sports, like soccer.  Each competitor has a rating, and the probability that one competitor will beat another can be calculated from their ratings.  The calculation depends on a probability distribution.  In the original scheme, a normal distribution was used, but it was discovered that a logistic distribution gave more accurate predictions.
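For the curious, the expected-score formula used in the Elo system is a logistic curve in the rating difference, with 400 as the conventional scale factor.

```python
# Standard Elo expected score: a logistic function of the rating difference.
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(expected_score(1800, 1600))  # a 200-point favourite scores about 0.76
```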

In fact, it was this application of the logistic distribution that led me to my question above.  The empirical evidence suggests that it's the better choice, but I want to know if there is a reason why.  Is there some sort of logistic equivalent to coin flipping for the normal distribution that explains why predictions are better with the logistic distribution?


Logistic curves also pop up in some population growth models.  There are continuous and discrete versions of these models, which in a sense parallel the normal curve (continuous) and the coin flips (discrete).  Perhaps the answer to my question lies here, but I can't see the connection between population growth and games.  (Both do involve competition, I suppose, but it's not clear to me whether or not this is a meaningful connection, or just a linguistic coincidence.)

The Wikipedia article for the Cauchy distribution says that "it has the distribution of a random variable that is the ratio of two independent standard normal random variables."  That suggests that ratios of two independent binomial random variables (coin flips, for example) could be an avenue to explore.
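That claim is easy to check with a quick simulation: generate a large sample of ratios of independent standard normals and compare its empirical distribution with the Cauchy distribution at a few points.

```python
# The ratio of two independent standard normals follows a standard Cauchy distribution.
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(1)
ratio = rng.standard_normal(1_000_000) / rng.standard_normal(1_000_000)

for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(t, np.mean(ratio <= t), cauchy.cdf(t))  # empirical vs. Cauchy CDF
```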

It's also not hard to create other bell curves.  Take any function $f(x)$ that is positive, symmetric about some vertical axis, and concave up.  Then $1/f(x)$ will give a bell curve.  I expect that most of these aren't interesting, though.  Are there any others, besides the 3 mentioned above, that are?

[1] It's actually a class of curves that are related to each other by stretching and shifting vertically and horizontally.

Sunday 19 January 2014

Paul Revere and Graph Theory

Constructing a graph based on historical data, Kieran Healy explains how we can use the graph to "find" Paul Revere.
I have been asked by my superiors to give a brief demonstration of the surprising effectiveness of even the simplest techniques of the new-fangled Social Networke Analysis in the pursuit of those who would seek to undermine the liberty enjoyed by His Majesty’s subjects. This is in connection with the discussion of the role of “metadata” in certain recent events and the assurances of various respectable parties that the government was merely “sifting through this so-called metadata” and that the “information acquired does not include the content of any communications”. I will show how we can use this “metadata” to find key persons involved in terrorist groups operating within the Colonies at the present time. I shall also endeavour to show how these methods work in what might be called a relational manner.

Saturday 4 January 2014

When to set office hours

This past term, I was assisting as a tutor for one section of a math course (I am teaching this course in the present term).  The students in the class wanted to have some extra tutorial time.  One of the students collected information about when everyone was available.  It turned out that there was no common period where all interested students could come, so there ended up being no extra tutorial.

This was between the professor and the students.  I didn't find out about it until after the fact.  I assume from the conversation, though, that there was going to be at most one extra tutorial, because the proposal was nixed once it was revealed that there was no common period.

But suppose that the professor was willing to offer a number of tutorials limited only by the periods in which either he or I would have been available to lead one (assuming that the tutorials would be held during the day).  Then it might still be possible to find a way to accommodate everyone.  In fact, it's pretty easy to tell whether or not it is.

We already have a list of the periods in which students are available.  Since at least one of the available tutors (in this case, me and the professor) must be at the tutorial, we can strike from that list all periods where neither of us is available.  If everyone is available in at least one of the remaining periods, then all students can be accommodated.  Otherwise, they can't be.

One way to do this is simply to hold a tutorial in each of those remaining periods.  However, professors have many responsibilities besides teaching ("publish or perish" and all that), so it's desirable to use the fewest tutorial periods possible.  So how can we do that?

It turns out that this is an application of a well-studied problem in combinatorics, known as the set cover problem.  Suppose we are given a collection of sets, say $S_1,S_2,\ldots, S_k$.  Let $U=\cup_{i=1}^k S_i.$  Suppose that $I$ is a subset of $\{1,2,\ldots,k\}$ and $U=\cup_{i\in I}S_i$.  Then $I$ is called a set cover of $U$.  The set cover problem is to minimize the size of $I$.

Now, suppose that the periods are numbered $1,2,\ldots, k$.  For each $i$, let $S_i$ be the set of students who are available during period $i$.  Then we can find the fewest tutorial periods needed by finding the smallest set cover of $U$.

Unfortunately, the set cover problem is NP-complete.  The formal definition of what this means takes a fair bit of work to explain, so I won't get into it here (if you're interested, follow the link to the topic in the Wikipedia article on the set cover problem).  In practical terms, though, it means that there is no currently known way to find the smallest set cover efficiently (and there probably won't ever be).  Perhaps the number of students and the number of periods are small enough that the problem is still manageable by computer, or there is some approximation algorithm that will give a near-optimal solution.
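One such approximation is the standard greedy heuristic: keep scheduling a tutorial in whichever period covers the most students who aren't yet covered.  Here is a sketch with made-up availability data; the greedy answer is only guaranteed to be within a logarithmic factor of the optimum, but for a single class it would likely be optimal or very close.

```python
# Greedy approximation for set cover: periods are sets of available students.

def greedy_set_cover(periods):
    """periods: dict mapping period -> set of available students.
    Returns a list of periods whose union covers every student."""
    uncovered = set().union(*periods.values())
    chosen = []
    while uncovered:
        # Pick the period covering the most not-yet-covered students.
        best = max(periods, key=lambda p: len(periods[p] & uncovered))
        chosen.append(best)
        uncovered -= periods[best]
    return chosen

# Hypothetical availability data.
availability = {
    "Mon 10:00": {"Ann", "Bob"},
    "Tue 14:00": {"Bob", "Cal", "Dee"},
    "Fri 09:00": {"Ann", "Dee"},
}
print(greedy_set_cover(availability))  # e.g. ['Tue 14:00', 'Mon 10:00']
```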