My work often takes me through a jumble of ideas related to statistics, data, and data science; and I find myself learning about subjects I never expected to look at.
The work I've done on Natural Language Processing got me making frequency-rank plots (Zipf's law) for words in data collected from Twitter and from canonical corpora, such as the Brown Corpus. The work I've done searching for methods to create influence rankings on Twitter led me to learn about PageRank and the stochastic exploration of social networks.
Here's an observation: both in the token-to-type relationship from corpus linguistics (i.e., Zipfian frequency-rank plots) and in the relationship between PageRank value and PageRank ranking in internet datasets, we see a power law. We know, from Wentian Li, that a random word-creation process with the Markov property produces the power law in the frequency-rank relationship for natural languages. We know from the founders of Google that PageRank is the equilibrium state of a stochastic exploration of the internet, following links from page to page, with a teleportation action that occurs from time to time (and whenever the stochastic explorer hits a dead end).
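To make the PageRank side concrete, here is a minimal power-iteration sketch of that equilibrium computation. The toy 4-page link matrix, the teleportation probability of 0.15, and the dead-end handling (teleport uniformly) are my illustrative choices, not anything from a particular dataset.

```python
import numpy as np

def pagerank(links, alpha=0.15, tol=1e-10):
    """Power iteration for PageRank on a small adjacency matrix.

    links[i][j] = 1 if page i links to page j. alpha is the
    teleportation probability; dead-end rows teleport uniformly.
    """
    A = np.asarray(links, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1)
    # Row-stochastic transition matrix; dead-end pages jump uniformly.
    P = np.where(out[:, None] > 0, A / np.maximum(out[:, None], 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_new = alpha / n + (1 - alpha) * (r @ P)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# A tiny hypothetical 4-page web; page 3 is a dead end.
links = [[0, 1, 1, 0],
         [0, 0, 1, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0]]
r = pagerank(links)   # the equilibrium (stationary) rank vector; sums to 1
```

The returned vector is the equilibrium distribution of the random surfer: iterate the teleport-plus-follow-a-link step until the rank vector stops changing.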
Why are these similar? After a little thought during some driving around, it came to me: word composition is also a stochastic exploration of a state space with a Markov property. In this case, the states are 'a' through 'z' plus whitespace (the delimiter), and the teleportation comes from breaking the symbol stream into words with the delimiter symbol. That is, take Brin and Page's random surfer and have them "explore" the alphabet with some Markov chain transition matrix P. With some probability, α, they abandon their search, generate a delimiter, and start composing a new word.
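That surfer-over-the-alphabet picture can be simulated directly. The sketch below uses the simplest possible transition matrix, uniform over letters, which is Li's memoryless case; the α of 0.2 and the stream length are arbitrary illustrative values. Any row-stochastic P would do in place of the uniform choice.

```python
import random
from collections import Counter

def pseudowords(n_symbols, alpha=0.2,
                alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    """Compose words via a 'random surfer' over the alphabet: with
    probability alpha emit the delimiter (the teleportation step,
    which ends the current word), otherwise emit a letter drawn from
    a uniform transition matrix (the simplest Markov chain)."""
    rng = random.Random(seed)
    words, current = [], []
    for _ in range(n_symbols):
        if rng.random() < alpha:
            if current:                      # teleport: close the word
                words.append("".join(current))
                current = []
        else:
            current.append(rng.choice(alphabet))
    return words

counts = Counter(pseudowords(200_000))
freqs = sorted(counts.values(), reverse=True)
# Plotting rank vs. freqs on log-log axes gives the Zipf-like power law.
```

Short words dominate the top ranks and the tail of rare long words stretches out, which is exactly the frequency-rank shape the analogy predicts.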
Thus there is a direct link between the word generation process that leads to Zipf's law and the PageRank vector for a stochastic exploration process. They are conceptually the same thing, and that's why they have the same empirical properties.
Why is this interesting? Well, if one examines the Brown Corpus (or any other) in detail, one finds negative curvature at the extreme right-hand side of the frequency-rank plot, but Markov-property pseudoword-generating processes create positive curvature at the right-hand side. This suggests, to me, that one must break the Markov property to properly account for the actual distribution of words in real prose.
So what if I also broke the Markov property for stochastic exploration of the internet? How would PageRanks change if the teleportation vector were not constant but history dependent? That is, what if my choice of link to follow depended upon where I'd been, or how long I'd been searching? What if Brin and Page's α was actually α(t)?
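One way to explore that question: once α depends on history there is no fixed transition matrix to find an eigenvector of, but the visit frequencies can still be estimated by simulation. Below is a Monte Carlo sketch of a surfer whose teleport probability grows with the time t since the last teleport, α(t) = 1 − (1 − a₀)·decayᵗ; the base rate a₀ = 0.15, the decay of 0.99, and the toy link structure are all my own illustrative assumptions.

```python
import random

def surf(links, n_steps=200_000, seed=0):
    """Monte Carlo random surfer with a history-dependent teleport
    probability alpha(t) = 1 - (1 - 0.15) * 0.99**t, where t counts
    steps since the last teleport. Because alpha varies with history,
    we estimate visit frequencies by simulation rather than solving
    for a stationary eigenvector."""
    rng = random.Random(seed)
    n = len(links)
    visits = [0] * n
    page, t = rng.randrange(n), 0
    for _ in range(n_steps):
        visits[page] += 1
        alpha_t = 1 - (1 - 0.15) * (0.99 ** t)   # grows toward 1 with t
        if not links[page] or rng.random() < alpha_t:
            page, t = rng.randrange(n), 0        # teleport: history resets
        else:
            page, t = rng.choice(links[page]), t + 1
    total = sum(visits)
    return [v / total for v in visits]

links = [[1, 2], [2], [0], []]   # adjacency lists; page 3 dead-ends
ranks = surf(links)              # empirical long-run visit fractions
```

Comparing these empirical ranks against the constant-α PageRank on the same graph would show directly how a time-varying teleportation rate reshapes the equilibrium.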