## The WikiGame and Graphs

Today, I had an urge to play the WikiGame. I haven’t played this game since middle school, but, hey, it’s winter break so why not? The objective of the game is given a random starting article on Wikipedia, try to arrive at some other predetermined article’s page only by clicking links within the article. A player may race against another player. Whoever reaches the destination article with the fewest clicks wins. (Other variations of the game may have other victory conditions.) Neither player may use Control + F at any time during the game.

In the true spirit of the game, the articles are generated randomly by using Wikipedia’s Random Article link. A example play of the game is as follows. Given Ysé Tardan-Masquelier as the starting article and Omaha, Nebraska as the ending article, the following series of clicks is a correct solution.

Some interpretations of the rules forbid using countries or cities as links but for simplicity of example, I’ll pretend this rule doesn’t exist. A shorter path may exist between the two articles. I highly suspect I can go directly from France to the United States but given the “no Control + F” rule, I must read the entire article to find the link. As tempting as it is to learn all about France, I decided to skip ahead to “head of state.”

It’s obviously very interesting to come upon esoteric Wikipedia articles during the course of the game. But all my rambling about this game does have some other purpose than another bout of nostalgia, an algorithmic purpose!

Shortest Paths

If one imagines that all articles on Wikipedia represent nodes in a graph and links represent directed edges from one node to another, then the WikiGame (as presented here) easily reduces to the problem of finding the shortest path (all edges have cost 1) between two nodes. A simple breadth-first search can solve this problem. To see a python implementation of a WikiGame solver using breadth-first search, visit this.

Diameter of WikiGame

The trickier, and I think, more interesting question is what is the longest shortest path–the diameter–between any two pages (excluding pages that have no paths between them). Or in other words, given $N$ pages and let $d(p_i, p_j)$ represent the length of the shortest path from page $p_i$ to page $p_j$, compute $\max_{0\leq i\neq j\leq N-1} \left\{d(p_i, p_j)\right\}$. The problem shouldn’t be theoretically “hard” with respect to the total number of links and articles. For a given graph, one may find the diameter by using any polynomial time all-pairs shortest path algorithm such as the Floyd-Warshall algorithm and finding the maximum of all returned shortest paths. The algorithm is polytime in the number of nodes and edges in the input graph. By the most recent statistics, a graph created from English Wikipedia articles and links would have ~5 million nodes and so at most ~50 million directed edges. A pretty sizable graph but I think searchable given enough motivation.

Largest Strongly Connected Component

Turns out there’s been quite a few people who wondered about how many degrees of separation exists between two pages on Wikipedia. The Six Degrees of Wikipedia page provides surprising examples of articles that are separated by at most six degrees (i.e. the shortest path consists of at most 6 clicks). Also, it appears my question of the longest shortest path has been investigated before here. The article also presents the largest strongly connected component (one may reach an article from any other article in the component). Dolan, the author, found that the “center” of Wikipedia (disregarding dates and years) is an article on the “United Kingdom.” He defines the center as an article in the largest strongly connected component from which it takes the fewest average number of clicks to get to any other article in the component. From “United Kingdom,” it takes an average of 3.45 clicks to any other article in the component. The data he used in his study was taken on March 3, 2008 so it’s possible that some of these statistics have changed since then.

Other Graph Characteristics and Questions

The results presented were very interesting, but now I wonder what other information we can gain from articles on Wikipedia by considering other graph characteristics like maximum cliques in the graph (maximum number of pages separated by at most one degree from each other), max independent set (maximum number of pages that are separated by greater than one degree of separation), maximum spanning tree (not sure what the Wikipedia interpretation is here), or TSP tour (whether every article may be visited once starting with an article using a minimum number of clicks). Admittedly, some of the problems I presented are NP-hard, but I think approximations may also be interesting, especially for the TSP tour of Wikipedia (does it exist? what is the approximate minimum length of such a tour?) If such a tour exists then we find out something interesting about how information is organized on Wikipedia, that is, stated very vaguely, every piece of information is “connected” in some way to every other piece of information.

The WikiGame turned out to be a very nice late evening diversion but I also think it presents an interesting problem in the graph theoretic sense. Here’s a list of the links I used in my article in case you want to read more (in order of what I personally found most interesting):