Pages

Saturday, 7 June 2014

Friends in high density places

Last August, I found the map below.  2% of the Australian population lives in the yellow area and the remaining 98% lives in the white area.



Like Australia, Canada is geographically large, but sparsely populated.  Canada's population density is a little higher, though not by much.  On average, there are 3.56 people per square kilometre in Canada compared to Australia's 3.06.  Also, like Australia, our population is centred around a small number of urban areas.  The three most populous metropolitan areas, Toronto, Montreal, and Vancouver, alone make up more than a third of our population, the top 10 more than a half, and the top 100 nearly 80% (mind you, most of the cities in that top 100 list have less than 100,000 people, and some, such as Timmins and Cape Breton are quite large in area).  (Population data culled from various Wikipedia pages).

So I wondered what a similar map for Canada would look like.

Unfortunately, there doesn't seem to be much information out there about how the map was created.  Within each of the yellow and white areas, we see other borders drawn in.  I'm assuming that these represent some national subdivisions, perhaps municipal boundaries or census regions.  So whoever made the map made a choice of which regions to colour yellow and which regions to colour white.

There are many choices that would split the population between 2% and 98%, though.  For example, to construct a region containing 98% of the population, like the one coloured white here, one could start with the sparsely populated inland regions and add increasingly denser regions until 98% is reached.  The result would not be very interesting.  Australia's population is roughly 23.5 million.  2% of that is 470,000, only a bit smaller than the population of Tasmania, the country's smallest province, and roughly 10% of the population of Sydney's, the country's largest city.

So we could create a map where all of Australia other than some parts of Tasmania are coloured white, or where all but a few urban specks of Sydney are coloured white.  In the latter case, the yellow parts would probably so small that it would be difficult or impossible to see them.  The map would hardly be different from a plain map of Australia.  What makes the existing map interesting, though, is how small the area containing 98% of the population is.

To make the map as interesting as possible, then, we would want the white part to be as small as possible.  As I said above, however, there is no information about how the map was created.  Is the white region as small as it could be, based on the subdivisions used, or is there something other choice that would lead to an even smaller region?  Did the creator simply pick the $m$ sparsest or least populous subdivisions, or was something more sophisticated used?

I found it through Twisted Sifter, which refers back to a Reddit thread.  This thread links to an image hosted on imgur, which I embedded above.  After that the trail goes cold.  One of comments from the original poster suggest that he or she is the one who created it, but there is no follow up, and the user's online profile doesn't offer any direction either.

Without knowing, then, we'd have to come up with our own method.  One simple method is to rank the subdivisions by population, and choose the first $m$ that give 98% of the population.  Another would be to rank by area, and choose the first $m$ that give 98% of the population.  A third option would be to rank by density and choose the first $m$ that give 98% percent of the population.  None of these are guaranteed to give the smallest area that includes 98% percent of the population (or, equivalently, the largest area that includes 2% of the population).

Another less obvious way to solve the problem is through a mathematical technique known as linear programming.  We create a function that gives us the area if a certain set of subdivisions is chosen, then we try to maximize the area under the condition that the percentage of the population that lives in the chosen set of regions is less than or equal to 2%. Ideally, we'd be able to insist that the population is equal to 2%, but this might not be possible.  For our purposes that shouldn't be a problem, though, since a smaller number than 2% would only further illustrate the sparseness.

Now let's write this in mathematical notation.  Suppose that ${\bf c}$ is a vector where entry $i$ of ${\bf c}$ is the area of subdivision $i$.  Let $A$ be a row matrix where column $i$ of the only row of $A$ is the percentage of the population in subdivision $i$.  Let ${\bf x}$ be a vector of variables where entry $i$ of ${\bf x}$ is the variable $x_i$.  If $x_i=1$ when subdivision $i$ is chosen and $0$ otherwise, then $f({\bf x})$ gives the total area for that choice of subdivisions and $A{\bf x}$ is the corresponding percentage of the population.  Then we want the following
  • maximize $f({\bf x})={\bf c}\cdot {\bf x}$ such that
  • $A{\bf x}\leq 0.02(=2\%)$ and
  • $x_i$ is 0 or 1 for each $i$
This is not exactly an linear programming problem.  The third condition makes it a binary integer programming problem, because each $x_i$ can be one of only two integer values, 0 or 1.  To make it a linear programming problem we replace the third condition with the following
  • $x_i\geq 0$ for each $i$.
 This could allow for the possibility that a region is partially included.  That is, only a certain percentage of a region's population counts towards the total percentage.  That means that we also want $x_i\leq 1$, since we can't include more than 100% of a region's population.  In that case, the problem is
  • maximize $f({\bf x})={\bf c}\cdot {\bf x}$ such that
  • $A{\bf x}\leq 0.02$ and
  • $0\leq x_i\leq 1$ for each $i$.
As stated, this is not a linear programming problem.  However, we can easily turn it into one.  Suppose that there are $n$ subdivisions in total.  Let $B$ be the $(n+1)\times n$ matrix whose first row is $A$ and whose remaining $n$ rows is the $n\times n$ identity matrix.  Let ${\bf b}$ be the vector with $n+1$ components whose first entry is $0.02$ and whose remaining entries are all 1.  Then we can rewrite the above as
  • maximize $f({\bf x})={\bf c}\cdot {\bf x}$ such that
  • $B{\bf x}\leq {\bf b}$ and
  • $x_i\geq 0$ for each $i$.
This is now a linear programming problem.  Unfortunately, the solution to the linear programming problem is not what we want, although it has the upside of being easier to solve.  In general, solving a binary integer programming problem is not easy.  So we may be able to solve the linear programming problem and use that to get a good solution to our problem (for example, by rounding), even though it might not be the optimal one.  Note that if we restricted to integer solutions, then we get the binary integer programming problem back.  There are some special cases where integer linear problems are easy to solve.  Perhaps this is one.  (Who knows? Maybe one of the three simple methods I gave above will turn out to be the optimal solution, though I'm sceptical any of them would be.)

Seeing a version of this map for Canada would be most interesting to me, because that's where I'm from and because of the geographical similarities.  But maps of this sort could be interesting for other countries, or cities, provinces or states, continents, or even the whole inhabitable world.  And of course, we need not insist on a split between 2% and 98%.  Perhaps a 50/50 split would be interesting.  Maybe there is some other interesting percentage, such as a percentage $X$ for which $X$ percent of the population lives on $100-X$ of the land (if that number even exists).

If the binary integer programming problem could be solved, it would be a simple (though possibly tedious) matter of collecting the required geographic data and plugging it into some map making software.

Update: After finishing this post, I emailed a colleague of mine who has much more experience in linear programming than me to ask him if he had any insights.  He told me that this is a knapsack problem.  I had heard of this problem before, but only informally, so I didn't clue in that it was a special case of integer linear programming.  Oddly, neither the linear programming nor the integer linear programming Wikipedia pages refer to the knapsack problem, nor does the knapsack page refer to either of the other two.

In any case, this is sort of good news, but only good in the sense of "maybe less bad."  The general binary linear programming problem is NP-hard, whereas the knapsack problem is NP-complete.  NP-hard problems are at least as hard to solve as NP-complete problems.  The reason it's only "maybe less bad" is that there is not currently known whether or not there are efficient algorithms for solving NP-complete problems.  According to the Wiki page on the knapsack problem, though, there are efficient algorithms for finding approximate solutions.  Maybe these would be good enough.  Indeed, our eyes may not be able to discern between the optimal solution and an approximate one.  In that case, I guess it's more or less good news.

Notes:
  • For more on linear programming, see here and here or, better yet, pick up a book, such as this one, which was what I first learned about linear programming from.
  • I don't currently have the resources at my disposal to do any of the above, but the population density below from Statistics Canada probably gives a rough idea of what the result would be.