On the Relationship between Word Frequencies and PageRanks

by Graham Giller April 17, 2013 23:32

My work often takes me through a jumble of ideas related to statistics, data, and data science; and I find myself learning about subjects I never expected to look at.

The work I've done on Natural Language Processing got me making Frequency-Rank plots (Zipf's laws) for words in data collected from twitter and from canonical sets, such as the Brown Corpus. The work I've done on searching for methods for creating influence ranks in twitter lead me to learn about PageRank and the stochastic exploration of social networks.

Here's an observation: both in the tokens to types relationship from corpus linguistics (i.e. Zipfian Frequency-Rank plots) and PageRank Value to PageRank ranking from internet datasets we see a power law. We know, from Wentian Li, that a random word creation process with a Markov property produces the power law in the Frequency-Rank relationship for natural languages. We know from the founders of Google that the PageRank is the equalibrium state of a stochastic exploration of the internet, following link-to-link, with a teleportation action that occurs from time-to-time (and when the stochastic explorer is dead ended).

Why are these similar? After a little thought during some driving around, it came to me: word composition is also a stochastic exploration of a state-space with a Markov property. In this case, the states are 'a' to 'z'  and whitespace (the delimiter) and the teleportation comes from the breaking of the symbol stream into words with the delimiter symbol. That is take Brin and Page's random surfer and have them "explore" the alphabet with some Markov Chain transition matrix P. With some probability, α, they abandon their search, generate a delimiter, and start composing a new word.

Thus there is a direct link between the word generation process that leads to Zipf's laws and the PageRank vector for a stochastic exploration process. They are conceptually the same thing and that's why they have the same empirical properties.

Why is this interesting? Well, if one examines the Brown Corpus (or any other) in detail one finds negative curvature at the extreme right hand side of the Frequency-Rank plot — but Markov property pseudoword generating processes create positive curvature at the right hand side suggesting, to mean, that one must break the Markov property to properly account for the actual distributions of words in real prose.

So what if I also broke the Markov property for stochastic exploration of the internet? How would PageRanks change if the teleportation vector was not constant but history dependent? i.e. If my choice of link to follow depended upon where I'd been or how long I'd been searching? What if Brin and Page's alpha was actually α(t)?

 

Elemetary Analytics of Twitter for #oscars

by Graham Giller February 24, 2013 20:38

Here's the pretty but not totally informative word cloud for #oscars influencers...

And a bit more quantitative, a Zipf-Mandelbrot law for the influencer's ranks...
  
 

The Zipf-Mandelbrot Law for Tweets and an Archetype for Social Media

by Graham Giller February 20, 2013 23:36

Actually, this is way more frivolous than the title implies. I'm currently working on twitter, again, and spent some time on tokenization of the tweet corpus collected from the streaming API. A nice check that everything is working correctly is to plot the Zipf-Mandelbrot law (that the frequency of a token in the corpus scales as a power law function of the token rank in the frequency plot). This is illustrated below, after tokenization on whitespace and punctuation and the removal of stop words.

 

 
Sadly, this merely confirms what ad hoc inspection discovers: the social discourse on twitter is essentially the endless repitition of the archetypal phrase “@justinbieber I love you. [Please] follow me.” Wonder what @patrickcarney has to say about that?
 

A Trivial SQLCMD Like Client for SQL Server Using Python

by Graham Giller January 23, 2013 10:47

For a long time I've wanted a command line client for SQL Server that I can run from the unix command shell in Mac OS X (I work on an iMac, but use a big-iron SQL Server instance for some of my analytical work). I've tried various versions from the internet, but none were really satisfactory. So here is what I ended up with, a Python script that uses pyodbc and is almost completely trivial. In fact 1/3 of the code is the GPL, 1/3 parses the command line, and 1/3 does the actual work, and it's only 90 odd lines long!

Usage: echo "sp_who2" | ./sqlcmd.py 

sqlcmd.py (6.08 kb)

 

“Bayesian” or “Frequentist” and Do We Really Care Anymore

by Graham Giller November 21, 2012 11:41

I am currently reading and enjoying Nate Silver's The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. However, in one chapter he, oddly I think, takes time out to join in the currently vogueish bashing of R.A. Fisher and Egon Person's  “Frequentist” hypothesis testing approach to statistics. As somebody who is an empirical scientist and has learned statistics autodidactically rather than in formal training, I think it's time to stop making points about what I feel are non-issues. Perhaps Fisher's personality was, it is reported, not the most smooth. But he's been dead for a long time now, and he did make incredible contributions to statistical science and to genetics.

Statistics is a science that has much too teach us and is not a religion wrought into two conflicting sets. I feel that myself, and most other people who practically apply statistics in their work on a daily basis, probably see no distinctions. I see a broad set of tools and I choose to apply the best ones for each case. If a problem solves naturally in a Bayesian or Information Theoretic framework I'll do that. If there is a well defined hypothesis to test, I'll follow that route. I don't forbid myself from one or other method because I feel that they are both expressions of the same behaviour of the universe.

I feel that Fisher's great work on maximum likelihood is a Bayesian method. Bayes' formula is a piece of undeniable mathematical logic, and it says: posterior probability is a combination of the likelihood of observations arising from each candidate model and the prior probabilities of the models. I have always felt that Bayes' formula is the best justification for using maximum likelihood as a method to understand the universe. Maximizing the posterior probability of the model is equal to maximizing the likelihood under the simplifying assumption that the answer will not be strongly biased by the prior. There are clearly circumstances when that doesn't work, but there are also many in which it does!

I also feel that Nyman and Pearson's framework for hypothesis testing is a Bayesian method. I think it's reasonable to ask, given a prior assumption that the data I've captured is random noise, how likely my results are to have arisen from the overactive application of the human mind to the organization of random data. The idea that anything with a p-Value less than 0.001 is TRUE! and anthing with a larger value is FALSE! is belied by the year's I've spent working on forecasting financial time-series. Data misbehaves, and assessing the likelihood of that misbehavour is useful information about whether my conclusions are trustworthy.

So my summary is to ask writers to stop propagating this non-existent scism. It is no longer the 1950's and in the real world we use all the tools that work.

 

Tags:

Theory

Sidetrack Continued: The Nth Ranked Item in an Aggregation

by Graham Giller August 12, 2012 22:42

Building on the prior post, it's natural to as for the LASTRANKED() item in an aggregation, and trivial to code up, but also to ask for the Nth ranked item in an aggregation. This is a little more complicated as we have to maintain an ordered list of items, have the ability to merge and prune the sublists prepared by separate threads of an execution plan, do that with reasonable efficiency, and yet keep the memory footprint under control. However, not impossible so here is my SQL Server user defined aggregate:

CREATE AGGREGATE dbo.NthRanked(@Value AS FLOAT,@Rank AS FLOAT,@Order AS INT) RETURNS FLOAT

For compactness I decided to code it so that if a call with ranking order N is positive it returns the items based on the ordering from least to greatest and if it is negative based on the ordering from greatest to least. i.e. order +1 is the first item and −1 is the last item. This is somewhat similar to the function NTH() provided by Google's Big Query system, but actually I believe more general in specification.

A note. This code does not make any attempts to keep it's internal data structures within a single database page in size, so if you ask for the millionth ranked item you might generate an exception in the run time. However, this approach allows such limits to be imposed from the outside rather than as a design constraint, so I prefer it that way!

Here's a zipfile will all of the functions:  FirstRanked.zip (5.50 kb)

 

Powered by BlogEngine.NET 1.4.5.0
Theme by Mads Kristensen | Modified by Mooglegiant



RecentComments

Comment RSS

About the Author

Graham Giller - Headshot GRAHAM GILLER
Dr. Giller holds a doctorate from Oxford University in experimental elementary particle physics. His field of research was statistical astronomy using high energy cosmic rays. After leaving Oxford, he worked in the Process Driven Trading Group at Morgan Stanley, as a strategy researcher and portfolio manager. He then ran a CTA/CPO firm which concentrated on trading eurodollar futures using statistical models. From 2004, he has managed a private family investment office. In 2009, he joined a California based hedge fund startup, concentrating on high frequency alpha and volatility forecasting. My updated resume is on LinkedIn.

Pages


Disclaimer

Nothing on this site should be construed as a reccommendation to buy or sell any specific security nor as a solicitation of an order to buy or sell any specific security. Before making any trade for any reason you should consult your own financial advisor. The author may hold long or short positions in any of the securities discussed either before or after publication of an article mentioning such a security.

Copyright Notice

All post on this blog are © Copyright property of Giller Investments (New Jersey), LLC. All comments are the property of their respective authors and neither the author or this blog nor any entity associated with him are responsible for or accept any responsibility for their content. Offensive comments and spam may be removed at the authors discretion.

Data provided on this blog or through links to this blog are either property of Giller Investments (New Jersey), LLC or publicly available or derived from data that is publically available. Any data that is proprietary to Giller Investments (New Jersey), LLC is published here for the public interest and may be reproduced for private research or in public forums provided that suitable attribution and acknowledgement of ownership is made.

Privacy Policy

We use third-party advertising companies to serve ads when you visit our website. These companies may use information (not including your name, address, email address, or telephone number) about your visits to this and other websites in order to provide advertisements about goods and services of interest to you. If you would like more information about this practice and to know your choices about not having this information used by these companies, click here.