I have a few days' capture of the tweet stream for users who mention S&P 400 stocks (the API will only accept 400 individual keywords) and who include the $ symbol to indicate a ticker, e.g. $AAPL rather than AAPL. This is important for excluding ticker symbols that are substrings of regular words. Let's start by looking at the users:
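To illustrate why the $ prefix matters, here is a minimal sketch of a cashtag matcher. The function name `find_cashtags` and the one-to-five-letter ticker length are assumptions made for this example, not rules taken from the API.

```python
import re

# A cashtag here is "$" followed by 1-5 letters (an assumed ticker length),
# not preceded by a word character or another "$", and not followed by a letter.
CASHTAG = re.compile(r"(?<![\w$])\$([A-Za-z]{1,5})(?![A-Za-z])")


def find_cashtags(text):
    """Return the upper-cased ticker symbols written as cashtags in text."""
    return [symbol.upper() for symbol in CASHTAG.findall(text)]
```

With this pattern, "Long $AAPL today" yields AAPL, while a bare "CAT" inside "CATALOG" is ignored because it has no $ prefix.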
This graph is a histogram of the age of the tweeting account (meaning the number of days from the account creation time to the tweet timestamp), shown in years with a zoomed view in days. We see three prominent features:
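The age calculation behind the histogram can be sketched as follows. The timestamp layout is an assumption (the style "Wed Aug 27 13:08:45 +0000 2008"), and the helper names `account_age_days` and `age_histogram` are made up for this illustration.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def account_age_days(account_created_at, tweet_created_at):
    """Days between account creation and the tweet, as a float."""
    created = datetime.strptime(account_created_at, TWITTER_TIME)
    tweeted = datetime.strptime(tweet_created_at, TWITTER_TIME)
    return (tweeted - created).total_seconds() / 86400.0


def age_histogram(ages_days, bin_width_days):
    """Count ages per bin; bin k covers [k * width, (k + 1) * width)."""
    counts = {}
    for age in ages_days:
        bin_index = int(age // bin_width_days)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return dict(sorted(counts.items()))
```

A bin width of 365.25 gives the view in years, and a bin width of 1 over the youngest accounts gives the zoomed view in days.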
- A sharp peak around 1 day;
- A broad peak around 3 years; and,
- An apparently triangular underlying distribution.
The first and last of these features are easy to explain.
For the first: Twitter is a system plagued by spam. In a way you could think of it as a white list for spam since, as a user, when I follow someone I am telling them to send me information that they think is interesting to me. Clearly, a simple, naive spambot model is to create a new account and immediately send a lot of spam tweets from it. Presumably, Twitter Inc.'s response to this is to identify the spambot account via well-known techniques, such as naive Bayes classifiers, and close down the account. This would lead to an excess of accounts with very short ages, the maximum age being set by the interval between sweeps of Twitter's anti-spambot surveillance. Formally, we should do more analysis to investigate this hypothesis, but it seems fairly clear.
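A toy simulation of the spambot hypothesis above makes the age cutoff concrete. Everything in it is an assumption for illustration: spam accounts are created at uniformly random times, a sweep closes every one of them once per `sweep_period_days`, and each account tweets once before it is closed.

```python
import random


def simulate_spam_ages(n_accounts, sweep_period_days, seed=0):
    """Ages (in days) at tweet time for spam accounts closed at the next sweep."""
    rng = random.Random(seed)
    ages = []
    for _ in range(n_accounts):
        # Creation time measured from the previous sweep; the next sweep
        # happens at sweep_period_days on the same clock.
        created = rng.uniform(0.0, sweep_period_days)
        # One tweet at a uniformly random moment before the account is closed.
        tweeted = rng.uniform(created, sweep_period_days)
        ages.append(tweeted - created)
    return ages
```

Under these assumptions no simulated age can exceed the sweep period, so a sweep of roughly a day would pile spam accounts up below one day of age.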
The third feature can be explained by the following stochastic model:
- New users create accounts at an approximately constant rate λ;
- Each user creates tweets at another apparently constant rate μ.
It is straightforward to see with this dynamic that the distribution of tweeter account ages that appear in this sample will naturally be of a right triangular form, just as we see.
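The two-rate model can be checked with a small simulation. With constant λ and μ, every (creation time, tweet time) pair with creation before tweet is equally likely, so the sketch samples such pairs by rejection. The function names, the history length, and the observation window length are all assumptions of this sketch; in particular, the shape it produces depends on how long the window is relative to the range of ages.

```python
import random


def simulate_tweet_ages(n_tweets, history_days, window_days, seed=0):
    """Account ages for tweets observed in the last window_days of the history."""
    rng = random.Random(seed)
    window_start = history_days - window_days
    ages = []
    while len(ages) < n_tweets:
        # Constant creation rate: creation times are uniform over the history.
        created = rng.uniform(0.0, history_days)
        # Constant tweet rate: tweet times are uniform over the window.
        tweeted = rng.uniform(window_start, history_days)
        # An account can only tweet after it has been created.
        if tweeted >= created:
            ages.append(tweeted - created)
    return ages


def fraction_younger_than(ages, cutoff_days):
    """Share of sampled ages strictly below cutoff_days."""
    return sum(a < cutoff_days for a in ages) / len(ages)
```

When the window covers the whole history of length T, the ages follow a density proportional to (T - a), which is the right triangle, and about three quarters of them fall below T/2. As the window shrinks relative to T the simulated distribution flattens, so the window length is worth checking against the real capture.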
This leaves an apparent strong peak of users who tweet about the S&P 400 and have done so for around three years. So what happened three years ago?