This post concerns an analytical method that can be thought of as a half-way house between the standard linear models that work at lower frequencies and the non-linear models that we hypothesize are necessary to deal with highly quantized high-frequency data. Like our prior post, on computing bias in Presidential approval rating opinion polling, it uses a political rather than a financial data set. However, that changes only the meaning of the analysis, not the method. (It actually represents a piece of work I did prior to the recent Presidential election, which happened to use this method, so I thought I'd include it here as it is entertaining.)
What I'm discussing are methods for dealing with what statisticians now call discrete dependent variables. To my knowledge this methodology was pioneered by Chester Bliss in his analysis of insecticide toxicity, for which he coined the term Probit (a contraction of “probability unit”). Faced with a binary outcome (either the insects die, or they don't), he found a standard linear regression unsuitable for estimating the probability of that outcome. Instead, Bliss decided to estimate the probability via what we now call a link function.
$$\Pr(y = 1 \mid x) = \Phi(\alpha + \beta x)$$
Here Φ is the c.d.f. for the Normal (or other suitable) distribution. We note that this formalism is identical to that of the Perceptron, in which the discriminant output of the system is a sigmoid (i.e. “S” shaped) function of a linear function of the data. However, unlike a typical Neural Net, which generally just describes a functional relationship of some kind, we have a specific probabilistic interpretation of the output of our system. Thus we can apply the entire toolkit developed for maximum likelihood analysis to this problem. Specifically, if our models do not involve degenerate specifications, we can use the maximum likelihood ratio test to evaluate the utility of various composite hypotheses.
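To make the link function and the likelihood ratio test concrete, here is a minimal sketch in R. It uses simulated data rather than the election data set, and the variable names, sample size and coefficient values (-0.25 and 0.8) are assumptions chosen purely for illustration.

```r
# Minimal sketch on simulated data (not the election data set).
set.seed(42)
n <- 200
x <- rnorm(n)                        # hypothetical regressor
p <- pnorm(-0.25 + 0.8 * x)          # true probability: Normal c.d.f. of a linear function
y <- rbinom(n, size = 1, prob = p)   # binary outcome drawn with that probability

# Null model (intercept only) and alternative (intercept plus slope),
# both fitted by maximum likelihood with a probit link.
fit0 <- glm(y ~ 1, family = binomial(link = "probit"))
fit1 <- glm(y ~ x, family = binomial(link = "probit"))

# Likelihood ratio statistic: twice the gain in log-likelihood.
# Under the null it is asymptotically chi-squared with 1 degree of freedom.
lr <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
p_value <- pchisq(lr, df = 1, lower.tail = FALSE)

# The fitted probabilities are the Normal c.d.f. of the estimated linear predictor.
phat <- pnorm(coef(fit1)[1] + coef(fit1)[2] * x)
```

A small p_value here would indicate that adding the regressor improves the fit by more than chance alone would explain; the same comparison extends to any pair of nested specifications.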
At this point I'm going to describe our data briefly, and then present our analysis in a later post. The data comes from the 6 October 2008 edition of the New York Times, specifically an article entitled The Measure of a President (the currently linked-to document contains corrections printed the next day). I have tabulated this data and it is available from this website. I have also included the outcome of the 2008 election, which was unknown at the time the article was printed (and when my analysis was originally done). You can download it and read it into the analysis programme of your choice. For example, the following will load the data directly into R and then create the needed auxiliary variables.
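# Load the tabulated data, skipping the first six lines of the file.
presdata <- read.table(
  "http://www.gillerinvestments.com/Downloader/Files/Election.txt",
  skip = 6,
  col.names = c("YEAR", "D_FT", "D_IN", "D_LB", "R_FT", "R_IN", "R_LB", "PRES"))
# Candidate heights in decimal feet.
presdata$D_HT <- presdata$D_FT + presdata$D_IN / 12
presdata$R_HT <- presdata$R_FT + presdata$R_IN / 12
# Republican minus Democrat differences in height (feet) and weight (pounds).
presdata$HEIGHT <- presdata$R_HT - presdata$D_HT
presdata$WEIGHT <- presdata$R_LB - presdata$D_LB
# First and second lags of the party indicator; the leading values are supplied by hand.
n <- nrow(presdata)
presdata$INCUMBENCY <- c(0, presdata$PRES[1:(n - 1)])
presdata$CHANGE <- c(1, 0, presdata$PRES[1:(n - 2)])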
This data presents the heights and weights of Presidential candidates since 1896, together with the binary outcome (“0” for a Democrat and “1” for a Republican; this coding is alphabetical and does not reflect any kind of hidden preference). The light-hearted goal of the article was to investigate whether the public prefers the lighter, taller candidate over the shorter, heavier alternative. We are going to augment this data in two ways: by computing the Body Mass Index, to ask whether the public preferred the “healthier” candidate, and by adding the first two lags of the Presidential party indicator, which we will call incumbency and change respectively. We will therefore seek to estimate the probability of a Republican President from this data via Probit Analysis. Results to follow.
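Since the data records height in feet and inches and weight in pounds, the Body Mass Index can be computed with the usual imperial-unit formula, 703 × weight (lb) / height (in)². The sketch below is one way to do this; the function name bmi and the example candidate are hypothetical and are not taken from the data.

```r
# Body Mass Index from imperial units: 703 * weight (lb) / height (in)^2.
bmi <- function(feet, inches, pounds) {
  total_inches <- 12 * feet + inches   # convert height to inches
  703 * pounds / total_inches^2
}

# Hypothetical candidate: 6 ft 1 in, 180 lb (illustrative values only).
example_bmi <- bmi(6, 1, 180)
```

Because the function is vectorized, it could be applied to the columns of the data frame, for example as bmi(presdata$R_FT, presdata$R_IN, presdata$R_LB) - bmi(presdata$D_FT, presdata$D_IN, presdata$D_LB), mirroring how HEIGHT and WEIGHT are built. Whether a Republican-minus-Democrat difference is the right way to enter BMI into the model is an assumption made here for illustration.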