This article describes how to reach roughly 75% accuracy when predicting the direction of the next day's average price change. The target magnitude is the 2-day simple moving average, because forecasts on unsmoothed daily prices are much harder to obtain. Two days is the minimum possible smoothing window, so it alters the actual prices as little as possible.

I picked a company from the New York Stock Exchange at random: “CNH Industrial NV”. There is no particular reason for the choice; it was drawn from a couple of thousand files I generated from either Yahoo! or Google Finance (I do not remember which). The files are uploaded here: https://drive.google.com/open?id=18DkJeCqpibKdR8ezwk9hGjdHYSGwovWH.

The method is valid for any financial data as long as it has the same structure. I have also tested it with Forex data, getting similar accuracy levels with currency pairs such as EURUSD, GBPUSD or USDJPY. The interesting point of forecasting those quotes is that, by examining where the forecast fails, I think you will improve your price action trading skills and your understanding of the market and of what matters.

## Data Collection and Variable Configuration

There are millions of candidate variables that may seem worth analyzing. And which target value should we aim for? I like to think of price as any other object subject to physical laws: it reacts to market forces and it has inertia, velocity, acceleration, etc.

Volume may act as a force; price may have a potential energy depending on whether it is very high or very low; the rate of change may be important, and so on. There are many other factors we could analyze, such as gaps, breakouts, technical patterns, candlestick analysis or price distribution within space, to mention just a few. For this example we will focus only on **price action** and **volume**.

I have the files saved in CSV format to be used with Excel, so let’s start by loading the CSV file into a DataFrame object using **Python**.

    # Importing all the libraries that we will use.
    import pandas as pd
    import matplotlib.pyplot as plt
    import xgboost as xgb
    from sklearn.metrics import accuracy_score

    # Load the data from a csv file (tab-separated, comma as decimal mark).
    CNHI = {"stock_name": "CNH Industrial NV",
            "data": pd.read_csv("./data/CNHI_excel.csv", sep="\t", header=0, decimal=',')}
    # Drop the unused "Adj Close" column and index the rows by date.
    CNHI["data"] = CNHI["data"].drop(columns="Adj Close").set_index("Date")

After loading the file, the previous code removes a column that won’t be used (“Adj Close”) and creates an index from the “Date” column. The date is not a variable we can use for forecasting, so there is no need to keep it as a column of the dataset.
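To see what the `sep` and `decimal` arguments do, here is a minimal sketch that parses a small tab-separated snippet with comma decimals from an in-memory string instead of the real file. The rows are illustrative values in the same layout, and the `Adj Close` figures are made up. The sketch uses `drop(columns=...)`, the keyword form that current pandas versions expect.

```python
import io
import pandas as pd

# Illustrative tab-separated content with comma decimals (not the real file).
# The "Adj Close" values are invented for this example.
raw = (
    "Date\tOpen\tHigh\tLow\tClose\tAdj Close\tVolume\n"
    "2013-10-01\t12,76\t13,16\t12,75\t12,92\t12,50\t1477900\n"
    "2013-10-02\t13,02\t13,08\t12,87\t12,90\t12,48\t1631900\n"
)

# decimal="," makes pandas read "12,76" as the float 12.76.
sample = pd.read_csv(io.StringIO(raw), sep="\t", header=0, decimal=",")
sample = sample.drop(columns="Adj Close").set_index("Date")

print(sample.dtypes)
print(sample)
```

Without `decimal=","` the price columns would be read as text, and later arithmetic such as `rolling(...).mean()` would fail.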

The data now has the typical structure of financial data: Date, Open, High, Low, Close and Volume. The first three rows are shown in the next table:

| Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2013-09-30 | 12.75 | 13.08 | 12.5 | 12.5 | 352800 |
| 2013-10-01 | 12.76 | 13.16 | 12.75 | 12.92 | 1477900 |
| 2013-10-02 | 13.02 | 13.08 | 12.87 | 12.9 | 1631900 |

### Predictors

We are going to omit High, Low and Close, using only **Open** and **Volume** for the study. Let’s start preparing the data for the analysis. The predictors (**X** variables) used to predict the target magnitude (**y** variable) will be the following:

**Two day simple moving average (SMA2)**. The formula is $(C_t + C_{t-1})/2$, where $C_t$ is the current day’s open price and $C_{t-1}$ is the previous day’s open price. This formula is applied to each row of the data set.

    Predictors = pd.DataFrame({"sma2": CNHI["data"].Open.rolling(window=2).mean()})
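As a quick check, the rolling mean with `window=2` can be compared with the average of each open price and the previous one. This is only a sketch: the three open prices are taken from the sample rows in this article, and the names `opens`, `sma2_rolling` and `sma2_manual` are introduced here for illustration.

```python
import pandas as pd

# Open prices for 2013-10-01, 2013-10-02 and 2013-10-03 from the sample rows.
opens = pd.Series([12.76, 13.02, 12.77],
                  index=["2013-10-01", "2013-10-02", "2013-10-03"])

# Rolling two-day mean, as used for the "sma2" predictor.
sma2_rolling = opens.rolling(window=2).mean()

# The same quantity written out explicitly: (C_t + C_{t-1}) / 2.
sma2_manual = (opens + opens.shift(1)) / 2

print(pd.DataFrame({"rolling": sma2_rolling, "manual": sma2_manual}))
```

The first row is `NaN` in both columns because there is no previous day to average with, which is why the rows with nulls are dropped later on.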

**1 day window SMA2.** The previous day’s SMA2 value.

    Predictors["sma2_1"] = Predictors.sma2.shift(1)
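Since the direction of `shift` is easy to mix up, here is a small sketch of what it does when the rows are sorted by ascending date, which is the order of the tables in this article. The SMA2 values are the ones shown in the sample rows; the column names `previous_day` and `next_day` are only for this illustration.

```python
import pandas as pd

sma2 = pd.Series(
    [12.89, 12.895, 12.765],
    index=pd.to_datetime(["2013-10-02", "2013-10-03", "2013-10-04"]),
    name="sma2",
)

frame = pd.DataFrame({
    "sma2": sma2,
    "previous_day": sma2.shift(1),   # value from the row above (the past)
    "next_day": sma2.shift(-1),      # value from the row below (the future)
})

print(frame)
```

With ascending dates, `shift(1)` places the previous day's value on each row, so it is safe to use as a predictor, while `shift(-1)` places the next day's value on each row and belongs only in the target.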

And the other predictors will be:

- **Current day SMA2 increment.** $\mathrm{SMA2}_t - \mathrm{SMA2}_{t-1}$
- **1 day window SMA2 increment.** $\mathrm{SMA2}_{t-1} - \mathrm{SMA2}_{t-2}$
- **Current day volume increment.** $\mathrm{Vol}_t - \mathrm{Vol}_{t-1}$
- **Current day volume rate of change.** $(\mathrm{Vol}_t - \mathrm{Vol}_{t-1})/\mathrm{Vol}_t$
- **1 day window open price.** $C_{t-1}$
- **Current day open price increment.** $C_t - C_{t-1}$
- **Current day open price.** $C_t$

    Predictors["sma2_increment"] = Predictors.sma2.diff()
    Predictors["sma2_1_increment"] = Predictors.sma2_1.diff()
    Predictors["vol_increment"] = CNHI["data"].Volume.diff()
    Predictors["vol_rel_increment"] = CNHI["data"].Volume.diff() / CNHI["data"].Volume
    Predictors["open_1"] = CNHI["data"].Open.shift(1)
    Predictors["open_incr"] = CNHI["data"].Open - CNHI["data"].Open.shift(1)
    Predictors["open"] = CNHI["data"].Open

    # The rows with nulls generated by rolling values will be removed.
    Predictors = Predictors.dropna()

A sample of the first 5 rows:

| Date | sma2 | sma2_1 | sma2_increment | sma2_1_increment | vol_increment | vol_rel_increment | open_1 | open_incr | open |
|---|---|---|---|---|---|---|---|---|---|
| 2013-10-03 | 12.895 | 12.89 | 0.005 | 0.135 | -495500 | -0.436026047 | 13.02 | -0.25 | 12.77 |
| 2013-10-04 | 12.765 | 12.895 | -0.13 | 0.005 | -21800 | -0.019558586 | 12.77 | -0.01 | 12.76 |
| 2013-10-07 | 12.59 | 12.765 | -0.175 | -0.13 | -400 | -0.000359002 | 12.76 | -0.34 | 12.42 |
| 2013-10-08 | 12.42 | 12.59 | -0.17 | -0.175 | 104600 | 0.08582212 | 12.42 | 0 | 12.42 |
| 2013-10-09 | 12.5 | 12.42 | 0.08 | -0.17 | -232400 | -0.235604217 | 12.42 | 0.16 | 12.58 |

### Target Variable

This will be a classification variable: whether the average price will go up or down the next day. The quantity to forecast is the difference between tomorrow’s average price (which is unknown) and today’s average price.

    target = pd.DataFrame({"value": Predictors.sma2.shift(-1) - Predictors.sma2}).dropna()

After calculating the data to predict, the first three rows look like this:

| Date | value |
|---|---|
| 2013-10-03 | -0.13 |
| 2013-10-04 | -0.175 |
| 2013-10-07 | -0.17 |
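It may help to write the target in terms of the open prices. Taking $C_t$ as the open price on day $t$ and SMA2 as the two-day mean computed by `rolling(window=2).mean()`, the following is plain algebra on those definitions rather than an additional result:

$$
\begin{aligned}
y_t &= \mathrm{SMA2}_{t+1} - \mathrm{SMA2}_t \\
    &= \frac{C_{t+1} + C_t}{2} - \frac{C_t + C_{t-1}}{2} \\
    &= \frac{C_{t+1} - C_{t-1}}{2}
     = \frac{(C_{t+1} - C_t) + (C_t - C_{t-1})}{2}
\end{aligned}
$$

The term $C_t - C_{t-1}$ is the `open_incr` predictor and is already known on day $t$; only $C_{t+1} - C_t$ is unknown when the prediction is made. As a check against the table, the value for 2013-10-03 is $(12.76 - 13.02)/2 = -0.13$.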

Finally, we will match predictors and target values by date and remove the rows that have no counterpart in the other table.

    X = pd.merge(Predictors, target, left_index=True, right_index=True)[Predictors.columns]
    y = pd.merge(Predictors, target, left_index=True, right_index=True)[target.columns]

**X** now contains the predictors and **y** the target values. The table contains 1,059 records at this point.
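The effect of the merge is easiest to see on a tiny example. In the sketch below the predictor values come from the sample rows, while the target deliberately lacks a row for the last date, which is the situation the inner join is meant to handle. The lower-case names `predictors`, `target` and `merged` are local to this illustration.

```python
import pandas as pd

predictors = pd.DataFrame(
    {"sma2": [12.895, 12.765, 12.59]},
    index=["2013-10-03", "2013-10-04", "2013-10-07"],
)
# The last date is left out on purpose, as happens to the final row
# after shift(-1) followed by dropna().
target = pd.DataFrame(
    {"value": [-0.13, -0.175]},
    index=["2013-10-03", "2013-10-04"],
)

# pd.merge defaults to an inner join: only dates present in both frames survive.
merged = pd.merge(predictors, target, left_index=True, right_index=True)
X = merged[predictors.columns]
y = merged[target.columns]

print(len(predictors), len(X), len(y))  # 3 2 2
```

Merging once and selecting the columns afterwards gives the same `X` and `y` as calling `pd.merge` twice, and guarantees that both share exactly the same index.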

### Extreme Gradient Boosting prediction

Extreme gradient boosting is an exceptional machine learning technique for many reasons. It is based on decision trees and it has nice features such as residuals analysis, non-linear regression, feature selection tools, overfitting avoidance and many more. Other machine learning techniques commonly used for this type of analysis are *Support Vector Machines*, *Neural Networks* and *Random Forest*. I have used all of those for predicting market prices, and Extreme Gradient Boosting is always my first choice.

We will set up the regression model using the first 65% of the data; the remaining 35% will be used to predict future values. This simulates the actual scenario in which we have past data to train our model and want to predict a future datum with the data currently on hand. The data will be split into two sets: the **training** set, used to fit the model, and the **testing** set, which is not used to build the model but only to test whether it works as expected with new data.

    train_samples = int(X.shape[0] * 0.65)

    X_train = X.iloc[:train_samples]
    X_test = X.iloc[train_samples:]

    y_train = y.iloc[:train_samples]
    y_test = y.iloc[train_samples:]

After splitting, the two data sets contain:

- Train records: 688.
- Test records: 371.
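The record counts follow directly from the 65% cut on 1,059 rows. The sketch below reproduces the split on a stand-in frame with the same number of rows; the feature values and the business-day dates are placeholders, not the real data or the real trading calendar.

```python
import pandas as pd

# Stand-in frame with the article's row count; values and dates are placeholders.
X = pd.DataFrame(
    {"feature": range(1059)},
    index=pd.date_range("2013-10-03", periods=1059, freq="B"),
)

train_samples = int(X.shape[0] * 0.65)
X_train = X.iloc[:train_samples]
X_test = X.iloc[train_samples:]

print(len(X_train), len(X_test))
# Every training date precedes every test date, so the split is chronological.
print(X_train.index.max() < X_test.index.min())
```

Because `iloc` slices by position on date-sorted rows, no shuffling takes place and the test set lies entirely after the training set in time.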

The target variables will be transformed for binary classification. A positive change in the value of prices will be classified as **1** and a non-positive change as **0**.

    def getBinary(val):
        if val > 0:
            return 1
        else:
            return 0

    # The transformation is applied to the test data for later use.
    # The train data will be transformed while it is being fit.
    y_test_binary = pd.DataFrame(y_test["value"].apply(getBinary))

Next, the model is trained and the test data is predicted to verify the accuracy of the system:
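For reference, the same labelling can be written as a vectorized comparison. This is just an equivalent sketch on a few example values (two taken from the target table plus a zero and a positive change), not a change to the method; `changes`, `by_apply` and `vectorized` are names used only here.

```python
import pandas as pd

def getBinary(val):
    return 1 if val > 0 else 0

# Example changes: two negative, one exactly zero, one positive.
changes = pd.Series([-0.13, -0.175, 0.0, 0.08], name="value")

by_apply = changes.apply(getBinary)
vectorized = (changes > 0).astype(int)

print(vectorized.tolist())  # [0, 0, 0, 1]
```

Note that a change of exactly zero falls into class 0 in both versions, matching the "non-positive" rule described above.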

    regressor = xgb.XGBRegressor(gamma=0.0, n_estimators=150, base_score=0.7,
                                 colsample_bytree=1, learning_rate=0.01)
    xgbModel = regressor.fit(X_train, y_train.value.apply(getBinary))
    y_predicted = xgbModel.predict(X_test)
    y_predicted_binary = [1 if yp >= 0.5 else 0 for yp in y_predicted]
    print(accuracy_score(y_test_binary, y_predicted_binary))

    Out: 0.76010781671159033

So, the initial **accuracy** without optimizing the model is **76%** when **predicting the direction of the daily average price change for each of the next 371 trading days**.

The model can be optimized further; I have only set a few parameters to avoid overfitting the training data and to adjust the learning rate.
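An accuracy figure is easier to judge next to a naive baseline. The sketch below uses made-up labels and predictions (not the results of this article) to compute the accuracy, the accuracy of always predicting the most frequent class, and the four confusion-matrix counts, using only NumPy.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])

accuracy = float(np.mean(y_true == y_pred))

# Baseline: always predict the most frequent class in y_true.
majority_class = int(np.mean(y_true) >= 0.5)
baseline_accuracy = float(np.mean(y_true == majority_class))

# Confusion-matrix counts.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

print(f"accuracy={accuracy:.2f} baseline={baseline_accuracy:.2f}")
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

The same lines could be run with `y_test_binary["value"].to_numpy()` and `np.array(y_predicted_binary)` in place of the made-up arrays to see how far the model is from the always-up or always-down baseline.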

The features used should also be analyzed to avoid using redundant variables and to discard those with no correlation. New features should be added to try improved approaches and, to sum up, there is a lot of work that could be done around this basic model.

XGBoost also has ways to study features. Let’s take a look at their importance:

    fig = plt.figure(figsize=(8, 8))
    plt.xticks(rotation='vertical')
    plt.bar([i for i in range(len(xgbModel.feature_importances_))],
            xgbModel.feature_importances_.tolist(),
            tick_label=X_test.columns, color="chocolate")
    plt.show()

It is clear that this field is vast and especially interesting.

Excellent article. Thank you for publishing. I may be missing something, but it seems that there is a future leak in the model: it seems to predict the directional change in the open price based on tomorrow’s values (e.g. all the “.shift(1)” references refer to the next day’s prices), so any prediction today is based on knowing tomorrow’s data. I may be overlooking or misinterpreting something, but after writing the arrays out to Excel and stepping through the logic I can’t convince myself that the predictions are not using data values from the future.

Could you clarify if the prediction is not based on future data that in any out of sample prediction would not be known? Thank you.

Perhaps the issue is that you are assuming the data is sorted in the opposite order to the one I am using. Could that be the case? In any case, the window is always built with data from past periods.

It is also important to understand that this is a machine learning exercise, not a trading model. This prediction has no application in real trading.

Thank you for your reply. I accounted for the sort order and it definitely seems that the model is fitted to predict the change in open price for each day using the data for the same day — sort of a classifier to see if it can predict a known value. Interesting exercise nonetheless. Thank you.

In the “Predictors” paragraph there is a table that shows the window in which next day’s value is put alongside previous day’s data. Therefore, the prediction model is trying to predict future data using past data.

If, with shift(1), you are getting today’s data compared against tomorrow’s data, you can change it by using shift(-1). The models I have prepared use past data to predict future data as well.

Thank you. Can you pick any day and create a prediction for the next day without having any data for the next day at the time the prediction is run? I.e. if the current date was 2013-10-04 and you did not have any data for 2013-10-07 could you create a prediction for 2013-10-07?

Also, I’m fairly new to Python and am unable to appreciate the purpose of the pandas merge statements:

    X = pd.merge(Predictors, target,left_index=True,right_index=True)[Predictors.columns]
    y = pd.merge(Predictors, target,left_index=True,right_index=True)[target.columns]

After the merge, X seems to be a copy of the Predictors dataframe and y is a single column of the target dataframe. Does the merge do something more than simply using:

    X=Predictors
    y=target

Thank you very much for your kind advice and for sharing your expertise.

The idea is that you predict “y” for day n using X from previous periods <n, so you use past data to predict future data. The default merge function is equivalent to SQL’s INNER JOIN. In this case the common keys are the indexes (which should be the dates). The purpose of merging is to remove any row that does not have an equivalent in the other set (INNER JOIN). When shifting and creating columns with past dates you will always end up with records that have no matching data for certain columns.

Great article. Thank you for sharing your work.

Just to clarify, have the current market price and volume for a particular time frame or day been used to predict the same day’s Up or Down call?

Shouldn’t it be shift(1) and shift(2) to predict the future Up or Down call?

Is it possible to see the developed XGB model or regression for the related work in Python (e.g. Y=mX + C)?

Thank You!

Hi.

The values used to calculate the data for the predicted period are in fact always shifted.

The XGB is not using a linear regression function, but something closer to a logistic regression, as shown in the article:

    y_predicted_binary = [1 if yp >=0.5 else 0 for yp in y_predicted]

Because we are not looking for the best fit to the line but the highest probability.

Please correct me if I’m wrong, I see that you used Accuracy score for model evaluation. Accuracy score is a flawed metric. You need to compare the performance using a confusion matrix. What if I only predict UP? Since APPL’s trend is up, only predicting up could give me 75% accuracy already.

Hi Tim.

Very good question you bring here.

In fact, at the time I wrote this article my concern was specifically to study how LSTMs could perform with raw time series data. I wrote these articles about neural networks, XGBoost and time series to show a few examples of how we should prepare data for analytics using neural networks and machine learning. And what we get here is precisely the obvious stuff: «hey, this is trending up». Regarding a confusion matrix, what I usually do is just adjust the threshold (in binary representation it could be 0.5/0.5 or 0.7/0.3, etc.). It depends on how much sensitivity and specificity you want.
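A minimal sketch of that threshold adjustment, using made-up labels and scores rather than the model's output, shows how sensitivity and specificity move in opposite directions as the cut-off changes. The helper `sensitivity_specificity` is a hypothetical function written for this example.

```python
import numpy as np

# Hypothetical true labels and model scores, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.6, 0.55, 0.8, 0.3, 0.65, 0.4, 0.2])

def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when predicting 1 for score >= threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return float(tp / (tp + fn)), float(tn / (tn + fp))

for threshold in (0.5, 0.7):
    sens, spec = sensitivity_specificity(y_true, scores, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On these example values, raising the threshold from 0.5 to 0.7 lowers sensitivity from 0.75 to 0.50 and raises specificity from 0.50 to 1.00.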

For financial analysis, there are a few other things you need to take into account, such as open markets, news releases, spreads and commissions, and a bunch of other stuff like equity, leverage, risk, stop loss; one could go mad.

Besides, knowing with some accuracy whether a market will go up or down is not enough. What about trade management, which involves an uncertain number of future time steps where there is no way to know what each subsequent bar will look like?

What I mean is that the financial time series problem is very complex from the machine learning perspective, and in my opinion we need to drastically reduce the complexity of the data upfront. So I think we need to think about financial data structures more than about particular events like open, high, low and close.

And then, predicting or classifying is only a description of a certain state. But, as I say, it does not give us an optimal way to handle that information. How much to buy, how much to sell, how long to stay in the trade…?

Coming back to your initial question, in trading the important thing is how much you win on average vs how much you lose. You can make a profit by winning 30% of the time and lose all your money in a week even if your winning ratio is 95%.
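The arithmetic behind that statement is the expectancy per trade: win rate times average win, minus loss rate times average loss. The figures below are hypothetical and only illustrate the point; `expectancy` is a helper defined for this example.

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Average profit per trade: win_rate * avg_win - (1 - win_rate) * avg_loss."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

# Hypothetical figures, in arbitrary currency units per trade.
low_win_rate = expectancy(win_rate=0.30, avg_win=300.0, avg_loss=100.0)
high_win_rate = expectancy(win_rate=0.95, avg_win=10.0, avg_loss=400.0)

print(f"30% winners: {low_win_rate:+.2f} per trade")
print(f"95% winners: {high_win_rate:+.2f} per trade")
```

With these numbers the 30% strategy earns about 20 units per trade on average, while the 95% strategy loses about 10.5 units per trade, because its rare losses are much larger than its frequent wins.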

Regards.