Detecting directional changes in Python

import numpy as np
import pandas as pd


class fractals:

    def getFractals(self, data, column_mode, depth):
        # Number of bars (rows) available in the DataFrame.
        Bars = data.shape[0]

        # Buffers start as NaN; the price of each detected extreme is written at its bar position.
        peaksBuffer = np.full(Bars, np.nan)
        valleysBuffer = np.full(Bars, np.nan)

        i = depth

        while i < Bars - depth - 1:

            is_upper_fractal = False
            is_lower_fractal = False

            lower_range_pos = i - depth
            upper_range_pos = i + depth + 1
            N = lower_range_pos + depth

            lower_range_values = data.iloc[lower_range_pos:N][column_mode].values
            upper_range_values = data.iloc[N + 1:upper_range_pos][column_mode].values
            N_value = data.iloc[N][column_mode]

            # Basic Fractal:
            # Peaks: the central value is the maximum of the whole window.
            if np.append([N_value], lower_range_values).argmax() == 0 and \
                    np.append([N_value], upper_range_values).argmax() == 0:

                if N_value not in lower_range_values:
                    is_upper_fractal = True
                    peaksBuffer[N] = N_value

            # Valleys: the central value is the minimum of the whole window.
            if not is_upper_fractal:
                if np.append([N_value], lower_range_values).argmin() == 0 and \
                        np.append([N_value], upper_range_values).argmin() == 0:

                    if N_value not in lower_range_values:
                        is_lower_fractal = True
                        valleysBuffer[N] = N_value
            i += 1

        peaksBufferSeries = pd.Series(peaksBuffer, name="peaks", index=data.index).dropna()
        valleysBufferSeries = pd.Series(valleysBuffer, name="valleys", index=data.index).dropna()

        self.__data = data
        self.__peaks = peaksBufferSeries
        self.__valleys = valleysBufferSeries
        self.__column_mode = column_mode

        return pd.merge(
            peaksBufferSeries, valleysBufferSeries,
            left_index=True, right_index=True, how="outer"
        )

How is this used? You need to pass it a pandas DataFrame whose index is of datetime type and sorted in ascending order.

The data

Suppose we have the following DataFrame:

date open high low close
2003-05-28 08:40:00 1.1782 1.17968 1.17619 1.1795
2003-05-28 08:45:00 1.17967 1.18012 1.17967 1.18004
2003-05-28 08:50:00 1.17996 1.18007 1.17939 1.17939
2003-05-28 08:55:00 1.17932 1.17944 1.17691 1.17695
2003-05-28 09:00:00 1.17702 1.17796 1.17702 1.1779
2003-05-28 09:05:00 1.17795 1.17823 1.17759 1.17759
2003-05-28 09:10:00 1.17768 1.17805 1.17756 1.17802
2003-05-28 09:15:00 1.17791 1.17802 1.1778 1.17787
2003-05-28 09:20:00 1.17796 1.17923 1.17773 1.17923
2003-05-28 09:25:00 1.17933 1.17935 1.17786 1.17789
2003-05-28 09:30:00 1.17793 1.17832 1.17762 1.17793
2003-05-28 09:35:00 1.17794 1.17839 1.17789 1.17802
2003-05-28 09:40:00 1.17808 1.17837 1.177 1.17703
2003-05-28 09:45:00 1.17673 1.17827 1.17666 1.17827
2003-05-28 09:50:00 1.17819 1.1783 1.17779 1.17788
2003-05-28 09:55:00 1.17784 1.17872 1.17784 1.17872
2003-05-28 10:00:00 1.17847 1.17867 1.17833 1.17844
2003-05-28 10:05:00 1.17863 1.17886 1.17852 1.17862
2003-05-28 10:10:00 1.17865 1.17872 1.17771 1.17772
2003-05-28 10:15:00 1.1775 1.1787 1.17744 1.17839
2003-05-28 10:20:00 1.1784 1.17883 1.17837 1.17859
2003-05-28 10:25:00 1.17859 1.17869 1.17842 1.17855
2003-05-28 10:30:00 1.17859 1.1787 1.17839 1.17851
2003-05-28 10:35:00 1.17845 1.17874 1.17831 1.17867
2003-05-28 10:40:00 1.17866 1.17977 1.17818 1.17861
2003-05-28 10:45:00 1.17852 1.17866 1.17822 1.17856
2003-05-28 10:50:00 1.1787 1.17893 1.1786 1.1786
2003-05-28 10:55:00 1.17855 1.17855 1.17786 1.17786
2003-05-28 11:00:00 1.17769 1.17783 1.17426 1.17429
2003-05-28 11:05:00 1.17429 1.17474 1.17411 1.17456
2003-05-28 11:10:00 1.17467 1.17516 1.17462 1.17484
2003-05-28 11:15:00 1.17472 1.17517 1.17217 1.17517
2003-05-28 11:20:00 1.17518 1.17537 1.17493 1.17493
2003-05-28 11:25:00 1.17493 1.17532 1.17469 1.17511
2003-05-28 11:30:00 1.17504 1.17504 1.17436 1.17469
2003-05-28 11:35:00 1.17471 1.17563 1.17471 1.17558
2003-05-28 11:40:00 1.17559 1.17793 1.17528 1.17529
2003-05-28 11:45:00 1.17519 1.17559 1.17506 1.17547
2003-05-28 11:50:00 1.17558 1.17572 1.17533 1.17533
2003-05-28 11:55:00 1.17536 1.17536 1.17307 1.17307
2003-05-28 12:00:00 1.17268 1.17268 1.17166 1.17167
2003-05-28 12:05:00 1.17168 1.17237 1.17147 1.17224
2003-05-28 12:10:00 1.17214 1.17228 1.17194 1.17217
2003-05-28 12:15:00 1.17224 1.17237 1.17187 1.17209
2003-05-28 12:20:00 1.17205 1.17229 1.17153 1.17153
2003-05-28 12:25:00 1.17121 1.17226 1.17047 1.17204
2003-05-28 12:30:00 1.17208 1.1726 1.17181 1.17259
2003-05-28 12:35:00 1.17261 1.17264 1.17182 1.17192
2003-05-28 12:40:00 1.17203 1.17337 1.17189 1.17337
2003-05-28 12:45:00 1.17351 1.1736 1.1714 1.17153

If the "date" field is a column and the DataFrame is called "data", for example, we would have to write the following to use it as an argument, since the index must be a datetime index:

data = data.set_index("date")
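
If the "date" column was loaded as plain strings, a couple of extra lines (a minimal sketch, assuming pandas is imported as pd) make sure the index meets both requirements:

# Convert the index to datetime and enforce ascending order.
data.index = pd.to_datetime(data.index)
data = data.sort_index()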

Of course, the code of the class above can be modified to work with NumPy or any other library.

The application

If we want to obtain the local maxima and minima of the "high" values of our data series, using 6 surrounding values (3 on each side) to compute the peaks and valleys, it is as simple as doing this:

#"data" es un DataFrame que contiene los datos de la tabla de arriba.
fr = fractals()
high_fractals = fr.getFractals(data=data,column_mode="high", depth=3)

The result is loaded into the high_fractals variable: a DataFrame containing only the dates where an extreme was detected, with the price stored in the "peaks" column for maxima and in the "valleys" column for minima (NaN in the other column).

And to extract the maxima, for example, a single line is enough.
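
A minimal sketch, consistent with how high_fractals is used in the plotting code below:

# Keep only the rows where a peak was detected; the variable name is arbitrary.
peaks_only = high_fractals["peaks"].dropna()
print(peaks_only)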

With a depth of 2, this would be the result for our data:

import matplotlib.pyplot as plt

# "data" is a DataFrame containing the data from the table above.
fr = fractals()
high_fractals = fr.getFractals(data=data, column_mode="high", depth=2)

fig, ax = plt.subplots()
fig.set_size_inches(10, 5)
plt.plot(high_fractals["valleys"].dropna(), marker='^', linestyle="none", markersize=7, color="red")
plt.plot(high_fractals["peaks"].dropna(), marker='v', linestyle="none", markersize=7, color="green")
plt.plot(data.high, color="black")
plt.show()

The result is more than satisfactory for most scenarios. In my case, I am going to improve it by also adding identification of movements above a certain threshold (percentage, pips, etc.).
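
As a rough, hypothetical sketch of that idea (not part of the original class): starting from the merged output of getFractals, one could discard any extreme whose distance from the previously kept extreme is below a threshold expressed in pips.

# Hypothetical helper, for illustration only.
def filter_small_moves(fractals_df, min_pips=10, pip_size=0.0001):
    # One chronological series with every detected extreme (peaks and valleys mixed).
    extremes = fractals_df["peaks"].combine_first(fractals_df["valleys"]).dropna().sort_index()
    kept = [extremes.index[0]]
    for ts, value in extremes.iloc[1:].items():
        if abs(value - extremes[kept[-1]]) / pip_size >= min_pips:
            kept.append(ts)
    return extremes[kept]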

But, in general, this simple algorithm meets all the expectations I personally have for marking directional changes in time series. Less is more.

And what can be done with these points?

With two points we can draw a line, for example.

data = high_fractals["peaks"].dropna()
TF_seconds = 300  # 5-minute timeframe, in seconds.
peaks = []

# Build one segment for every pair of consecutive peaks.
for i in range(1, data.shape[0]):
    peaks.append({
        "type": "H",
        "x1": data.index[i - 1], "y1": data.iloc[i - 1],
        "x2": data.index[i], "y2": data.iloc[i],
    })

# Direction vector of each segment: X increment in timeframe units
# and slope expressed in pips per unit.
for point in peaks:
    difference = (point["x2"] - point["x1"]).seconds
    units = difference / TF_seconds
    slope = (point["y2"] - point["y1"]) / units
    slope = slope / 0.0001  # price difference converted to pips.
    point["slope"] = slope
    point["units"] = units

And there you have the direction vectors of all the lines joining every pair of consecutive peaks, with meta-information about their slope and their X increment in scalar form (besides the datetime). The image below illustrates the structure of the complete implementation, where the slope is calculated in pips:

Imagination is the limit.

 

Predicting Stock Exchange Prices with Machine Learning

This article describes how to get an average 75% prediction accuracy for the next day's average price change. The target magnitude is the 2-day simple moving average. The reason is that, without some smoothing of the daily prices, the forecasts are much harder to get right. Two days is the minimum possible smoothing, so that will be the target: altering the actual prices as little as possible.

I randomly selected a company from the New York Stock Exchange: "CNH Industrial NV". There was no particular reason for it; it was a completely random pick among a couple of thousand files I had generated from either Yahoo! or Google Finance (I do not remember the source). The files are uploaded here: https://drive.google.com/open?id=18DkJeCqpibKdR8ezwk9hGjdHYSGwovWH.

The method is valid for any financial data as long as it has the same structure. I have also tested it with Forex data, getting similar accuracy levels with currency pairs such as EURUSD, GBPUSD or USDJPY. The interesting point of forecasting these quotes is that, by examining where the model fails, I think you will improve your price-action trading skills and your understanding of the market and of what matters.

Data Collection and Variable Configuration

There are millions of possible variable candidates that may seem worth analyzing. And which target value will we try to predict? I like to think of price as any other object subject to physical laws: it reacts to market forces, it has inertia, velocity, acceleration, and so on.

The force may be volume; the price may have potential energy depending on whether it is very high or very low; the rate of change may matter, and so on. There are many other factors we could analyze, such as gaps, breakouts, technical patterns, candlestick analysis or the distribution of price within space, just to mention a few. For this example we will focus only on price action and volume.

I have the files saved in csv format to be used with Excel, so let’s start loading the csv file into a DataFrame object using Python.

# Importing all the libraries that we will use.
 
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import accuracy_score
 
# Load the data from a csv file.
CNHI = {
    "stock_name": "CNH Industrial NV",
    "data": pd.read_csv("./data/CNHI_excel.csv", sep="\t", header=0, decimal=','),
}

CNHI["data"] = CNHI["data"].drop(columns="Adj Close").set_index("Date")

After loading the data, the previous code removes a column that won't be used ("Adj Close") and creates an index using the "Date" column. The date is not a variable we may use for forecasting, so there is no need to keep it as a column of the dataset.

The data now has the typical structure of financial data: Date, Open, High, Low, Close and Volume. The first three rows are shown in the next table:

Date Open High Low Close Volume
2013-09-30 2.75 13.08 12.5 12.5 352800
2013-10-01 12.76 13.16 12.75 12.92 1477900
2013-10-02 13.02 13.08 12.87 12.9 1631900

Predictors

We are going to omit High, Low and Close, using only Open and Volume for the study. Let's start preparing the data for the analysis. The predictors (X variables) used to predict the target magnitude (y variable) will be the following ones:

  • Two-day simple moving average (SMA2). The formula is (Ct + Ct-1) / 2, where Ct is the current day's open price and Ct-1 is the previous day's open price. This formula is applied to each row of the data set.
Predictors = pd.DataFrame({"sma2":CNHI["data"].Open.rolling(window=2).mean()})
  • 1 day window SMA2. The previous day’s SMA2 value.
Predictors["sma2_1"] = Predictors.sma2.shift(1)

And the other predictors will be:

  • Current day SMA2 increment. (SMA2t – SMA2t-1).
  • 1 day window SMA2 increment. (SMA2t-1 – SMA2t-2).
  • Current day volume increment. (Volt – Volt-1).
  • Current day volume rate of change. (Volt – Volt-1)/Volt
  • 1 day window open price. (Ct-1)
  • Current day open price increment. Ct – Ct-1
  • Current day open price. Ct.
Predictors["sma2_increment"] = Predictors.sma2.diff()  
 
Predictors["sma2_1_increment"] = Predictors.sma2_1.diff()  
 
Predictors["vol_increment"] = CNHI["data"].Volume.diff()
 
Predictors["vol_rel_increment"] = CNHI["data"].Volume.diff() / CNHI["data"].Volume
 
Predictors["open_1"] = CNHI["data"].Open.shift(1)
 
Predictors["open_incr"] = CNHI["data"].Open - CNHI["data"].Open.shift(1)
 
Predictors["open"] = CNHI["data"].Open
 
# The rows with nulls generated by rolling values will be removed.
Predictors = Predictors.dropna()

A sample of the first 5 rows:

Date sma2 sma2_1 sma2_increment sma2_1_increment vol_increment vol_rel_increment open_1 open_incr open
2013-10-03 12.895 12.89 0.005 0.135 -495500 -0.436026047 13.02 -0.25 12.77
2013-10-04 12.765 12.895 -0.13 0.005 -21800 -0.019558586 12.77 -0.01 12.76
2013-10-07 12.59 12.765 -0.175 -0.13 -400 -0.000359002 12.76 -0.34 12.42
2013-10-08 12.42 12.59 -0.17 -0.175 104600 0.08582212 12.42 0 12.42
2013-10-09 12.5 12.42 0.08 -0.17 -232400 -0.235604217 12.42 0.16 12.58

 

Target Variable

This will be a classification variable: whether the average price will go up or down the next day. The target to forecast is the difference between tomorrow's smoothed price and today's (tomorrow's value being, of course, unknown).

target = pd.DataFrame({"value":Predictors.sma2.shift(-1) - Predictors.sma2}).dropna()

After calculating the data to predict, the three first rows look like this:

Date value
2013-10-03 -0.13
2013-10-04 -0.175
2013-10-07 -0.17

Finally, we will match predictors and target values by date and remove the rows without a counterpart in the other table.

X = pd.merge(Predictors, target,left_index=True,right_index=True)[Predictors.columns]
y = pd.merge(Predictors, target,left_index=True,right_index=True)[target.columns]

X now contains the predictors and y the target values. The table contains 1,059 records at this moment.
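
As a quick sanity check (this print is not in the original article), the shapes should match those figures: nine predictor columns and 1,059 aligned rows:

# Should print (1059, 9) (1059, 1), matching the counts above.
print(X.shape, y.shape)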

Extreme Gradient Boosting prediction

Extreme gradient boosting is an exceptional machine learning technique for many reasons. It is based on decision trees and has nice features such as residual analysis, non-linear regression, feature selection tools, overfitting avoidance and many more. Other machine learning techniques commonly used for this type of analysis are Support Vector Machines, Neural Networks and Random Forests. I have used all of them for predicting market prices, and Extreme Gradient Boosting is always my first choice.

We will set up the regression model using 65% of the data and, with that model, the remaining 35% will be used to predict future values. This simulates the actual scenario in which we have past data to train our model and want to predict how a future, unseen datum will behave. The data will be split into two sets: the training set to fit the model, and the testing set, which won't be used to build the model but only to check whether it works as expected on new data.

train_samples = int(X.shape[0] * 0.65)
 
X_train = X.iloc[:train_samples]
X_test = X.iloc[train_samples:]
 
y_train = y.iloc[:train_samples]
y_test = y.iloc[train_samples:]

After applying the data splitting, the test data set contains:

  • Train records: 688.
  • Test records: 371.

The target variables will be transformed for binary classification. A positive change in the value of prices will be classified as 1 and a non-positive change as 0.

def getBinary(val):
    if val>0:
        return 1
    else:
        return 0
 
# and the transformation is applied on the test data for later use.
# The train data will be transformed while it is being fit.
y_test_binary = pd.DataFrame(y_test["value"].apply(getBinary))

And next, the model is trained and the test data predicted to verify the accuracy of the system:

regressor = xgb.XGBRegressor(gamma=0.0,n_estimators=150,base_score=0.7,colsample_bytree=1,learning_rate=0.01)
 
xgbModel = regressor.fit(X_train,y_train.value.apply(getBinary))
 
y_predicted = xgbModel.predict(X_test)
y_predicted_binary = [1 if yp >=0.5 else 0 for yp in y_predicted]
 
print (accuracy_score(y_test_binary,y_predicted_binary))
 
 
Out: 0.76010781671159033

So, the initial accuracy, without optimizing the model, is 76% when predicting the daily average price change for each of the next 371 trading days.
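
Accuracy alone can hide an imbalance between up and down days; a confusion matrix (not shown in the original article) makes the per-class behaviour visible:

from sklearn.metrics import confusion_matrix

# Rows: actual class (0 = down, 1 = up); columns: predicted class.
print(confusion_matrix(y_test_binary, y_predicted_binary))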

The model can be optimized further; here I have only set a few parameters to avoid overfitting the training data and to adjust the learning rate.
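
As an illustration of that tuning step, a small grid search around those parameters could look like the sketch below; the grid values are arbitrary examples, not taken from the article:

from sklearn.model_selection import GridSearchCV

# Small, illustrative grid around the values used above.
param_grid = {
    "n_estimators": [100, 150, 300],
    "learning_rate": [0.005, 0.01, 0.05],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    xgb.XGBRegressor(gamma=0.0, base_score=0.7, colsample_bytree=1),
    param_grid,
    cv=3,
)
search.fit(X_train, y_train.value.apply(getBinary))
print(search.best_params_)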

The features used should also be analyzed, to avoid redundant variables and to discard those with no correlation with the target. New features should be added to try improved approaches and, to sum up, there is a lot of work that could be done around this basic model.
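
A quick, minimal way to look for redundant predictors (not part of the original code) is the correlation matrix of the X variables:

# Highly correlated pairs of predictors are candidates for removal.
print(X_train.corr().round(2))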

XGBoost also has ways to study the features. Let's take a look at their importance:

fig = plt.figure(figsize=(8,8))
plt.xticks(rotation='vertical')
plt.bar(
    [i for i in range(len(xgbModel.feature_importances_))],
    xgbModel.feature_importances_.tolist(),
    tick_label=X_test.columns,
    color="chocolate",
)
plt.show()

It is obvious that this field is huge and especially interesting.