Continuing the recent theme on the application of Machine Learning and Interior Analysis, here we investigate the utility of Support Vector Regression methods versus Ordinary Least Squares. The job is to predict the daily range of the top ranked stock in the Compact Model Portfolio from the prior value of that metric. By Daily Range we mean the ratio of the difference between the Closing Price and Opening Price to the difference between the Highest Price and Lowest Price. I chose this particular metric because it is an interior metric but it doesn't use any microstructure information. It is also frequently discussed in non-academic literature.
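
To make the definition concrete, here is a minimal R sketch of the metric. The helper name `daily_range` and the toy prices are assumptions made for illustration and are not part of the original analysis. A day whose High equals its Low is returned as NA rather than as a division by zero.

```r
# Hypothetical helper: (Close - Open) / (High - Low).
# Open and Close both lie between Low and High, so the result is in [-1, 1].
daily_range <- function(open, high, low, close) {
  width <- high - low
  # Treat a zero-width day as missing instead of NaN.
  ifelse(width > 0, (close - open) / width, NA_real_)
}

# Toy prices, invented for illustration only.
open  <- c(10.0, 10.5, 11.0)
high  <- c(11.0, 10.8, 11.0)
low   <- c( 9.0, 10.2, 11.0)
close <- c(10.5, 10.2, 11.0)

dr <- daily_range(open, high, low, close)
print(dr)
```

A positive value means the stock closed above its open, a negative value means it closed below, and the magnitude says how much of the day's high-low span that move covered.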

The data used is the daily range computed for the top ranked stock in the Compact Model Portfolio, with the period from 01/03/2001 to 12/31/2009 used as the training data and the period from 01/03/2010 to 08/24/2010 used as the testing data. This division is a simple two-way (hold-out) split of the sample, the most basic form of cross-validation. The analysis was performed in R, and the code is appended to this post.
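
The appended code reads the two samples from separate files, so the split itself is not shown. As a rough sketch of how such a date-based split could be made from a single series, assuming a data frame with a `MarkDate` column stored as a Date (the toy dates, values, and the names `daily`, `train_set` and `test_set` are invented here):

```r
# Toy daily series (assumption): dates and values invented for illustration.
set.seed(3)
dates <- seq(as.Date("2009-12-01"), as.Date("2010-01-31"), by = "day")
daily <- data.frame(MarkDate = dates,
                    DailyRange = runif(length(dates), -1, 1))

# Rows up to the cutoff train the model; rows after it are held out for testing.
cutoff    <- as.Date("2009-12-31")
train_set <- daily[daily$MarkDate <= cutoff, ]
test_set  <- daily[daily$MarkDate >  cutoff, ]

print(c(training = nrow(train_set), testing = nrow(test_set)))
```

Splitting on time rather than at random keeps the test strictly in the future of the training period, which is what matters for a forecasting exercise.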
Before discussing the chart above, which exhibits many interesting features, let's talk about the methods. On the training set, the ksvm procedure was used to execute an ε-insensitive regression and allowed to use its default methods and tolerances. The OLS procedure lm was similarly run without user tuning. I then used both models to predict responses in the testing set and used an OLS regression of the response onto the forecasts as a simple methodology for evaluating the quality of the systems. Notable differences were found. The out-of-sample β was established to be 2.01417 ± 1.30812 for the OLS model and 1.04366 ± 0.43193 for the SVR model. The R² values were 0.01526 and 0.03676, respectively. The SVR β is close to 1 and about 2.4 standard errors from zero, against about 1.5 standard errors for the OLS β, and the SVR R² is more than double that of the OLS model. Thus the SVR model is the more accurate performer out-of-sample, as it is advertised to be.
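
The evaluation step, regressing the realized response onto the forecast and reading off β, its standard error and R², can be sketched in base R as follows. The data here are synthetic stand-ins (an assumption, not the Compact Model Portfolio data), so the printed numbers have nothing to do with the results quoted above.

```r
# Synthetic stand-in data (assumption): a weak forecast plus a lot of noise.
set.seed(1)
n        <- 200
forecast <- rnorm(n, sd = 0.1)
response <- forecast + rnorm(n, sd = 0.5)

# Regress the realized response onto the forecast.
fit   <- lm(response ~ forecast)
coefs <- summary(fit)$coefficients

beta    <- coefs["forecast", "Estimate"]    # slope of response on forecast
beta_se <- coefs["forecast", "Std. Error"]  # its standard error
r2      <- summary(fit)$r.squared           # share of variance explained

cat(sprintf("beta = %.5f +/- %.5f, R^2 = %.5f\n", beta, beta_se, r2))
```

A β near 1 indicates that the forecast is correctly scaled, while β divided by its standard error indicates how distinguishable the relationship is from no relationship at all.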
The chart exhibits the out-of-sample predictor and response data as well as the OLS regression line and the SVR model. We see that the SVR model contains numerous wriggles and kinks, yet my instinct is to reject the information content of these features, on the assumption that they indicate a need to tune the kernel used by the system. However, intuition is not necessarily truth, so we are in need of a procedure to establish where the superior predictive power of this model comes from. Does it come from some simple non-linearity in response that the algorithm has picked up, or is it actually due to the more funky nature of the model? One way to establish this would be to see if we can create some kind of piecewise linear model that does as well as the SVR.
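
As one possible shape for that piecewise linear test, the sketch below adds a single hinge term to the OLS model and scores both models out-of-sample in the same way as before. Everything specific in it is an assumption: the data are synthetic with a kink built in, the knot at 0 is a guess rather than something estimated, and the names `make_data`, `lin_fit`, `pw_fit` and `oos_r2` are hypothetical.

```r
# Synthetic data with a kink at zero (assumption, not the CMP data).
set.seed(2)
make_data <- function(n) {
  x <- runif(n, -1, 1)
  y <- 0.1 * x + 0.4 * pmax(x, 0) + rnorm(n, sd = 0.3)
  data.frame(PriorDailyRange = x, DailyRange = y)
}
train <- make_data(500)
test  <- make_data(200)

# Plain OLS versus OLS with one hinge term; the knot at 0 is a guess.
lin_fit <- lm(DailyRange ~ PriorDailyRange, data = train)
pw_fit  <- lm(DailyRange ~ PriorDailyRange + pmax(PriorDailyRange, 0),
              data = train)

# Out-of-sample R^2 from regressing the response onto each model's forecast.
oos_r2 <- function(fit) {
  forecast <- predict(fit, newdata = test)
  summary(lm(test$DailyRange ~ forecast))$r.squared
}
r2_lin <- oos_r2(lin_fit)
r2_pw  <- oos_r2(pw_fit)
print(c(linear = r2_lin, piecewise = r2_pw))
```

If a model of this kind matched the SVR on the real data, that would point to a simple non-linearity as the source of the improvement; if it fell short, the finer structure of the SVR fit would deserve a closer look.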
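# kernlab provides ksvm(); library() stops with an error if the package is missing
library(kernlab)
# load the training and testing samples
training<-read.table("CMP_Training.txt",header=TRUE)
testing<-read.table("CMP_Testing.txt",header=TRUE)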
training$DailyRange<-(training$ClosingPrice-training$OpeningPrice)/(training$HighestPrice-training$LowestPrice)
training$PriorDailyRange<-(training$PriorClosingPrice-training$PriorOpeningPrice)/(training$PriorHighestPrice-training$PriorLowestPrice)
testing$DailyRange<-(testing$ClosingPrice-testing$OpeningPrice)/(testing$HighestPrice-testing$LowestPrice)
testing$PriorDailyRange<-(testing$PriorClosingPrice-testing$PriorOpeningPrice)/(testing$PriorHighestPrice-testing$PriorLowestPrice)
names(training)
training$sample<-!(is.na(training$DailyRange)|is.na(training$PriorDailyRange))
testing$sample<-!(is.na(testing$DailyRange)|is.na(testing$PriorDailyRange))
summary(linmod<-lm(DailyRange~PriorDailyRange,data=training,subset=training$sample))
summary(lm(testing$DailyRange~predict(linmod,newdata=testing)))
print(svrmod<-ksvm(DailyRange~PriorDailyRange,type='eps-svr',data=training))
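# rest() is not defined in base R or kernlab; the complete rows flagged by testing$sample are assumed to be what was meant
summary(lm(testing$DailyRange[testing$sample]~predict(svrmod,newdata=testing[testing$sample,])))
isort<-order(testing$PriorDailyRange[testing$sample])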
plot(testing$PriorDailyRange[testing$sample],testing$DailyRange[testing$sample],axes=TRUE,
xlab='Prior Daily Range',ylab='Daily Range',main='Comparison of SVR and OLS Models for Daily Range',
sub=paste('Data: Compact Model Portfolio, Top Ranked Stock; Resolution: Daily; Training:',
training$MarkDate[1],'--',training$MarkDate[length(training$MarkDate)],'Testing:',testing$MarkDate[1],
'--',testing$MarkDate[length(testing$MarkDate)]))
lines(c(-1.1,1.1),c(0,0),col='gray')
lines(c(0,0),c(-1.1,1.1),col='gray')
abline(linmod,col='red')
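# draw the SVR forecast over the complete testing rows, sorted by the predictor (assumed replacement for the undefined rest())
lines(testing$PriorDailyRange[testing$sample][isort],predict(svrmod,newdata=testing[testing$sample,])[isort],col='blue')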