Artificial Intelligence Environments for Trading and Tick Simulation

By now, if you have been trying to design machine learning or artificial intelligence models for trading, you will have realized that it is not as simple as normalizing a few historical Forex or stock-market datasets, feeding them into a neural network and hoping that your {hold, buy, sell} output comes out balanced enough for the network to trade on your behalf.

In fact, in my past experience, architectures such as XGBoost gave better predictions on data smoothed with 2- or 3-sample moving averages than perceptrons or LSTMs did.

It all depends on how much time you have available. If it is little, as in my case, it is best to go first for the safest option and then keep refining the work, so that you can approve or discard a methodology as quickly as possible. "As quickly as possible" is, of course, a rather relative concept.

In my case, "as quickly as possible" has meant about five years of intense work plus a significant financial effort. As I have mentioned in other posts, I have studied almost every trading technique, tested and validated them, one after another, for years, both manual and algorithmic trading. In other words, I have not only studied trading or paid to learn it; I have also had to research algorithms, data structures, languages, statistics (including a master's degree in Data Science), probability, machine learning, artificial intelligence and other relevant subjects.

In that sense, you could say I have built a reasonably broad general picture of the field. I am not writing all this to show off, but so that whoever reads it understands that trading is not something you walk into and immediately win at. In the best case it takes 3 to 5 years to start understanding it, and the failure rate may be around 95%; that is, out of every 100 people who try trading, only 5 will manage to cover their costs, and an even smaller percentage will earn something. And it does not only depend on how good the technique is: 80% of the factors lie outside the mastery of any particular trading technique. In fact, I do not claim any medals, because I never teach my techniques for free. There is a mention somewhere of a 5-day trading seminar for 12,000 euros, in case anyone is curious about the principles and techniques I use.

What I am going to show now is a custom implementation of a virtual data provider, which is simpler to program than a communication channel with my broker, so that I can run simulations and studies exactly as if I were connected, but with much less effort. Still quite a lot of effort, but much less.

At this point it already works reasonably well; it is a draft of the final version, but it has enough complexity to be worth showing as an example. I have also programmed a virtual broker, because I like modularity. This broker can receive data from different sources. For that I use an interface or abstract class (depending on the language) that defines the same methods and properties regardless of the data source.

One of the data sources I use is csv files in tick format (date, ask, bid). I have converted these into OHLC for Ask and OHLC for Bid in 1-minute and 5-minute timeframes. That is enough for me, but anyone who wants more granularity and accuracy can use the ask-bid data directly, or resample the tick data to 30 seconds, 10 seconds, and so on (a conversion sketch follows the list below). To make it clearer, I have:

  1. A single csv file with all the OHLC data in the 5-minute timeframe between 2003 and 2019. This file is used to draw the OHLC chart, train a statistical model, or whatever else is needed.
  2. One file per year with OHLC data in the 1-minute timeframe. These files are used to extract the ticks that will occur until the next OHLC bar has to be drawn. They play the role of the data arriving from the broker in real time and show how the instrument's price changes.
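
As a reference, this is roughly how the tick-to-OHLC conversion can be done with pandas. It is a minimal sketch under assumptions of mine: the column names (date, ask, bid) and the file names are placeholders, not the exact files used here.

import pandas as pd

# Load raw ticks: one row per tick with ask and bid quotes.
ticks = pd.read_csv("EURUSD_ticks_2019.csv", parse_dates=["date"]).set_index("date")

# Resample each quote stream into 1-minute OHLC bars (use "5min" for the 5-minute files).
ohlc_ask = ticks["ask"].resample("1min").ohlc()
ohlc_bid = ticks["bid"].resample("1min").ohlc()

# Drop minutes without ticks and save the result.
ohlc_ask.dropna().to_csv("EURUSD_1M_ASK_2019.csv")
ohlc_bid.dropna().to_csv("EURUSD_1M_BID_2019.csv")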

As a side note, I considered using ASK/BID tick data, but I decided that much precision was not necessary and that I would instead use a noise generator to simulate the spread, either drawing random numbers from a Student's t distribution or averaging the High-Low of my historical sample. But that is another story. Remember that we buy at the Ask price but sell at the Bid price. The commission per trade also has to be taken into account, for example.
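
A possible sketch of that spread simulation, under assumptions of mine: the degrees of freedom, scale and base spread below are illustrative numbers, not calibrated values.

import numpy as np

rng = np.random.default_rng(42)

def simulated_spread(n, df=3, scale=0.00005, base=0.0001):
    # Base spread plus heavy-tailed noise drawn from a Student's t distribution.
    noise = np.abs(rng.standard_t(df, size=n)) * scale
    return base + noise

def average_range_spread(ohlc, fraction=0.1):
    # Alternative: take a fraction of the mean High-Low range of the historical sample.
    return (ohlc["High"] - ohlc["Low"]).mean() * fraction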

Let's lay out the system:

Historical data in the 5-minute timeframe (last row shown, number of rows = n; YMD = year-month-day):

Date Open High Low Close
YMD 08:00:00 On Hn Ln Cn

Note that the 08:00 candle in the 5m timeframe contains all the prices between 08:00:00 and 08:04:59.

Therefore, the "real-time" signals have to start at 08:05:00.

"Real-time" data in the 1-minute timeframe (all the ticks from the close of the historical candle until the next historical candle opens):

Date Open High Low Close
YMD 08:05:00 O1 H1 L1 C1
YMD 08:06:00 O2 H2 L2 C2
YMD 08:07:00 O3 H3 L3 C3
YMD 08:08:00 O4 H4 L4 C4
YMD 08:09:00 O5 H5 L5 C5

Let's start with the code:

# Since this is csv-oriented, I will not include information about connections via sockets, ports, executables, etc.
# We define an abstract class that serves as an interface for the providers that will follow.
 
class Provider_abstract():
 
    def __init__(self):
        raise NotImplementedError
 
    def getRTPricesNext(self):
        raise NotImplementedError
 
    def getHistPricesNext(self):
        raise NotImplementedError
 
    def setTickDataPath(self):
        raise NotImplementedError
 
    def setHistoricalFile(self):
        raise NotImplementedError

The methods have the following roles:

  • getRTPricesNext: returns the DataFrame with the ticks.
  • getHistPricesNext: returns a price history.
  • setTickDataPath: sets the directory where the tick data files are stored.
  • setHistoricalFile: sets the path to the file with the historical data.

And now we proceed with the implementation; let's look at the complete code first.

from gym_gspfx.gspProviders.provider import Provider_abstract
import pandas as pd
 
# Reads TF data and generates tick data from the same or another data source.
# For instance, it reads 5m data and generates ticks from 1m data.
 
class Provider(Provider_abstract):
 
    def __init__(self, historical_file, tick_data_path, start_row=0, number_of_samples=300, \
                 tick_file_prefix="EURUSD_1M_"):
 
        self.min_year=None
        self.max_year=None
        self.__historical_file = ""
        self.__tick_data_path = ""
        self.setHistoricalFile(historical_file)
        self.setTickDataPath(tick_data_path)
        self.current_start_row_index = None
        self.number_of_samples = number_of_samples
        self.next_future_start_row = start_row
        self.last_closed_row = None
        self.next_future_row_to_deliver_index = 0
        self.next_two_future_rows_to_deliver = None
        self.__tick_data_files = {"EMPTY": None}
        self.tick_file_prefix=tick_file_prefix
 
 
    def setHistoricalFile(self,fileName):
        self.__historical_file = pd.read_csv(fileName, parse_dates=["date"]).set_index("date")
 
    def setTickDataPath(self, tick_data_path):
        self.__tick_data_path=tick_data_path
 
 
    def getHistPricesNext(self):
        self.current_start_row_index = self.next_future_row_to_deliver_index
        self.next_future_row_to_deliver_index += 1
 
        __df = self.__historical_file.iloc[self.current_start_row_index:self.current_start_row_index + self.number_of_samples]
 
        self.last_closed_row = __df.tail(1)
 
        self.next_two_future_rows_to_deliver = pd.DataFrame( \
            self.__historical_file.iloc[self.current_start_row_index + self.number_of_samples:\
                                        self.next_future_row_to_deliver_index + self.number_of_samples + 1])
        return __df.copy()
 
 
    def getRTPricesNext(self):
 
        if self.last_closed_row is None:
            return None
 
        if "EMPTY" in self.__tick_data_files.keys() and self.__historical_file.shape[0]>0:
            del self.__tick_data_files["EMPTY"]
 
        self.min_year = self.last_closed_row.index.year.values[0]
        self.max_year = self.next_two_future_rows_to_deliver.head(1).index.year.values[0]
 
        if self.max_year not in self.__tick_data_files.keys():
            self.__tick_data_files[self.max_year] = \
                pd.read_csv(self.__tick_data_path + self.tick_file_prefix + str(self.max_year) + ".csv", parse_dates=["date"])
 
            if self.min_year in self.__tick_data_files.keys() and self.min_year != self.max_year:
                del self.__tick_data_files[self.min_year]
 
        initial_date = self.next_two_future_rows_to_deliver.head(1).index.values[0]
        end_date = self.next_two_future_rows_to_deliver.tail(1).index.values[0]
 
        __df = self.__tick_data_files[self.max_year]
        __df = __df[(__df["date"] >= initial_date) & (__df["date"] < end_date)]
 
        return __df.copy()

The __init__ method sets the initial parameters of the class, such as the row of the historical data from which extraction will start and the number of historical rows extracted per call. The full path to the historical file and the folder containing the tick files must also be provided when instantiating the class.

For practical reasons I keep the tick files in a dictionary, loading them as they are needed. I have chosen to use 1-minute files, but the data could have another structure, so I preferred to allow for the possibility of using multiple tick files. All of this is defined in the constructor, together with the attributes that the class will use.

    def __init__(self, historical_file, tick_data_path, start_row=0, number_of_samples=300, \
                 tick_file_prefix="EURUSD_1M_"):
 
        self.min_year=None
        self.max_year=None
        self.__historical_file = ""
        self.__tick_data_path = ""
        self.setHistoricalFile(historical_file)
        self.setTickDataPath(tick_data_path)
        self.current_start_row_index = None
        self.number_of_samples = number_of_samples
        self.next_future_start_row = start_row
        self.last_closed_row = None
        self.next_future_row_to_deliver_index = 0
        self.next_two_future_rows_to_deliver = None
        self.__tick_data_files = {"EMPTY": None}
        self.tick_file_prefix=tick_file_prefix

After that come the methods that load the historical file and configure the path to the tick files.
Note that I use the pandas library and that I assume the data already comes sorted by date in ascending order.

    def setHistoricalFile(self,fileName):
        self.__historical_file = pd.read_csv(fileName, parse_dates=["date"]).set_index("date")
 
    def setTickDataPath(self, tick_data_path):
        self.__tick_data_path=tick_data_path
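
If the csv could ever arrive unsorted, a defensive sort would cover that case. This is an addition of mine, not part of the original class:

    def setHistoricalFile(self, fileName):
        # sort_index() is only needed when the file is not already sorted by date.
        self.__historical_file = pd.read_csv(fileName, parse_dates=["date"]).set_index("date").sort_index()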

Every time we want to advance one candle, we have to call this procedure.
It returns a DataFrame with the number of rows set in the constructor (or later modified through the number_of_samples attribute).
Each call advances one row.

    def getHistPricesNext(self):
        self.current_start_row_index = self.next_future_row_to_deliver_index
        self.next_future_row_to_deliver_index += 1
 
        __df = self.__historical_file.iloc[self.current_start_row_index:self.current_start_row_index + self.number_of_samples]
 
        self.last_closed_row = __df.tail(1)
 
        self.next_two_future_rows_to_deliver = pd.DataFrame( \
            self.__historical_file.iloc[self.current_start_row_index + self.number_of_samples:\
                                        self.next_future_row_to_deliver_index + self.number_of_samples + 1])
        return __df.copy()

Finally, we have the method that returns all the ticks that will occur from the close of the last historical candle until just before the next one opens.

    def getRTPricesNext(self):
 
        if self.last_closed_row is None:
            return None
 
        if "EMPTY" in self.__tick_data_files.keys() and self.__historical_file.shape[0]>0:
            del self.__tick_data_files["EMPTY"]
 
        self.min_year = self.last_closed_row.index.year.values[0]
        self.max_year = self.next_two_future_rows_to_deliver.head(1).index.year.values[0]
 
        if self.max_year not in self.__tick_data_files.keys():
            self.__tick_data_files[self.max_year] = \
                pd.read_csv(self.__tick_data_path + self.tick_file_prefix + str(self.max_year) + ".csv", parse_dates=["date"])
 
            if self.min_year in self.__tick_data_files.keys() and self.min_year != self.max_year:
                del self.__tick_data_files[self.min_year]
 
        initial_date = self.next_two_future_rows_to_deliver.head(1).index.values[0]
        end_date = self.next_two_future_rows_to_deliver.tail(1).index.values[0]
 
        __df = self.__tick_data_files[self.max_year]
        __df = __df[(__df["date"] >= initial_date) & (__df["date"] < end_date)]
 
        return __df.copy()

By studying the code above you can get an idea of how to build a custom provider of ticks and financial data, and how to adapt it to your own needs. One possible application is a trading simulator, for example. I use it for my artificial intelligence projects, so I do not need to be connected to any online price provider to simulate time series and financial markets.

The next image shows how to use the class with a real example.

[Figure: historical and tick data generator in Python]
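
For reference, here is a minimal usage sketch of the class. The module path of the concrete Provider and the file names are placeholders of mine; only the constructor arguments and method names come from the code above.

from gym_gspfx.gspProviders.csv_provider import Provider  # assumed module; adjust to your project layout

provider = Provider(historical_file="./data/EURUSD_5M_2003_2019.csv",
                    tick_data_path="./data/ticks/",
                    start_row=0,
                    number_of_samples=300)

hist = provider.getHistPricesNext()   # 300-row OHLC window; each call advances one candle
ticks = provider.getRTPricesNext()    # 1-minute bars until the next 5-minute candle opens

print(hist.tail(1))
print(ticks)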

And if we run the methods again, we get the next candle together with its history and the corresponding ticks.

If you are not familiar with the pandas library and do not understand some of the expressions in the code, you can look them up in the official documentation at the following link: http://pandas.pydata.org/pandas-docs/stable/

And some questions to ask ourselves:

  • How can you make anything from trading other than by developing a technique that is right often enough to cover the cost of the failures?
  • How can a business be sustained without knowing the associated risk and applying sound financial and risk management accordingly?
  • How could a human brain be better at computing technical and numerical aspects than a complex computer?

Predicting Stock Exchange Prices with Machine Learning

This article describes how to get an average 75% prediction accuracy for the next day's average price change. The target magnitude is the 2-day simple moving average. The reason is that, if we do not apply smoothing to daily prices, the forecasts are much harder to get. The minimum possible smoothing is two days, and that will be the target: altering actual prices as little as possible.

I randomly selected a company from the New York Stock Exchange: «CNH Industrial NV». There was no particular reason; it was a completely random choice among a couple of thousand files I generated from either Yahoo! or Google Finance, I do not remember the source. The files are uploaded here: https://drive.google.com/open?id=18DkJeCqpibKdR8ezwk9hGjdHYSGwovWH.

The method is valid for any financial data as long as it has the same structure. I have also tested it with Forex data, getting similar accuracy levels with pairs such as EURUSD, GBPUSD or USDJPY. The interesting point of forecasting these quotes is that, by examining where the model fails, I think you will improve your price action trading skills and your understanding of the market and of what matters.

Data Collection and Variable Configuration

There are millions of possible variable candidates that may seem valid for analysis. And which target value will we try to predict? I like to think of price as an object subject to physical laws: it reacts to market forces, it has inertia, velocity, acceleration, and so on.

The forces may be volume; price may have a potential energy depending on whether it is very high or very low; the rate of change may be important, and so on. There are many other factors we could analyze, such as gaps, breakouts, technical patterns, candlestick analysis or the distribution of price in space, just to mention a few. For this example we will focus only on price action and volume.

I have the files saved in csv format to be used with Excel, so let's start by loading the csv file into a DataFrame object using Python.

# Importing all the libraries that we will use.
 
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import accuracy_score
 
#Load the data from a csv file.
CNHI = {"stock_name":"CNH Industrial NV", "data": pd.read_csv("./data/CNHI_excel.csv",sep="\t",header=0,decimal=',')}
 
CNHI["data"] = CNHI["data"].drop("Adj Close", axis=1).set_index("Date")

The previous code loads the data, removes a column that won't be used («Adj Close») and creates an index using the «Date» column. The date is not a variable we can use for forecasting, so there is no need to keep it as a column of the dataset.

The data now has the typical structure of financial data: Date, Open, High, Low, Close and Volume. The first three rows are shown in the next table:

Date Open High Low Close Volume
2013-09-30 2.75 13.08 12.5 12.5 352800
2013-10-01 12.76 13.16 12.75 12.92 1477900
2013-10-02 13.02 13.08 12.87 12.9 1631900

Predictors

We are going to omit High, Low and Close, using only Open and Volume for this study. Let's start preparing the data for the analysis. The predictors (X variables) used to predict the target magnitude (y variable) will be the following:

  • Two-day simple moving average (SMA2). The formula is (Ct + Ct-1)/2, where Ct is the current day's open price and Ct-1 the previous day's open price. This formula is applied to each row of the data set.
Predictors = pd.DataFrame({"sma2":CNHI["data"].Open.rolling(window=2).mean()})
  • 1 day window SMA2. The previous day’s SMA2 value.
Predictors["sma2_1"] = Predictors.sma2.shift(1)

And the other predictors will be:

  • Current day SMA2 increment. (SMA2t – SMA2t-1).
  • 1 day window SMA2 increment. (SMA2t-1 – SMA2t-2).
  • Current day volume increment. (Volt – Volt-1).
  • Current day volume rate of change. (Volt – Volt-1)/Volt
  • 1 day window open price. (Ct-1)
  • Current day open price increment. Ct – Ct-1
  • Current day open price. Ct.
Predictors["sma2_increment"] = Predictors.sma2.diff()  
 
Predictors["sma2_1_increment"] = Predictors.sma2_1.diff()  
 
Predictors["vol_increment"] = CNHI["data"].Volume.diff()
 
Predictors["vol_rel_increment"] = CNHI["data"].Volume.diff() / CNHI["data"].Volume
 
Predictors["open_1"] = CNHI["data"].Open.shift(1)
 
Predictors["open_incr"] = CNHI["data"].Open - CNHI["data"].Open.shift(1)
 
Predictors["open"] = CNHI["data"].Open
 
# The rows with nulls generated by rolling values will be removed.
Predictors = Predictors.dropna()

A sample of the first 5 rows:

Date sma2 sma2_1 sma2_increment sma2_1_increment vol_increment vol_rel_increment open_1 open_incr open
2013-10-03 12.895 12.89 0.005 0.135 -495500 -0.436026047 13.02 -0.25 12.77
2013-10-04 12.765 12.895 -0.13 0.005 -21800 -0.019558586 12.77 -0.01 12.76
2013-10-07 12.59 12.765 -0.175 -0.13 -400 -0.000359002 12.76 -0.34 12.42
2013-10-08 12.42 12.59 -0.17 -0.175 104600 0.08582212 12.42 0 12.42
2013-10-09 12.5 12.42 0.08 -0.17 -232400 -0.235604217 12.42 0.16 12.58

 

Target Variable

This will be a classification variable: whether the average price will go up or down the next day. The target is the difference between tomorrow's SMA2 and today's SMA2 (which is unknown at prediction time).

target = pd.DataFrame({"value":Predictors.sma2.shift(-1) - Predictors.sma2}).dropna()

After calculating the data to predict, the three first rows look like this:

Date value
2013-10-03 -0.13
2013-10-04 -0.175
2013-10-07 -0.17

Finally, we will match predictors and target values by date and remove the rows without a counterpart in the other table.

X = pd.merge(Predictors, target,left_index=True,right_index=True)[Predictors.columns]
y = pd.merge(Predictors, target,left_index=True,right_index=True)[target.columns]

X now contains the predictors and y the target values. The table contains 1,059 records at this moment.

Extreme Gradient Boosting Prediction

Extreme gradient boosting is an exceptional machine learning technique for many reasons. It is based on decision trees and it has nice features such as residuals analysis, non-linear regression, feature selection tools, overfitting avoidance and many more. Other machine learning techniques commonly used for this type of analysis are Support Vector Machines, Neural Networks and Random Forests. I have used all of them for predicting market prices, and Extreme Gradient Boosting is always my first choice.

We will set up the regression model using 65% of the data, and with that model the remaining 35% of the data will be used to predict future values. This simulates the real scenario in which we have past data to train our model and we want to predict how a future datum will behave with the data we currently have on hand. The data will be split into two sets: the training set to configure the model and the testing set, which won't be used to build the model, but only to test whether it works as expected with new data.

train_samples = int(X.shape[0] * 0.65)
 
X_train = X.iloc[:train_samples]
X_test = X.iloc[train_samples:]
 
y_train = y.iloc[:train_samples]
y_test = y.iloc[train_samples:]

After applying the data split, the two data sets contain:

  • Train records: 688.
  • Test records: 371.

The target variable will be transformed for binary classification: a positive change in price will be classified as 1 and a non-positive change as 0.

def getBinary(val):
    if val>0:
        return 1
    else:
        return 0
 
# and the transformation is applied on the test data for later use.
# The train data will be transformed while it is being fit.
y_test_binary = pd.DataFrame(y_test["value"].apply(getBinary))

And next, the model is trained and the test data predicted to verify the accuracy of the system:

regressor = xgb.XGBRegressor(gamma=0.0,n_estimators=150,base_score=0.7,colsample_bytree=1,learning_rate=0.01)
 
xgbModel = regressor.fit(X_train,y_train.value.apply(getBinary))
 
y_predicted = xgbModel.predict(X_test)
y_predicted_binary = [1 if yp >=0.5 else 0 for yp in y_predicted]
 
print (accuracy_score(y_test_binary,y_predicted_binary))
 
 
Out: 0.76010781671159033

So the initial accuracy, without optimizing the model, is 76% when predicting the daily average price change for each of the next 371 trading days.

The model can be optimized; here I have only used a few parameters to avoid overfitting the training data and to adjust the learning rate.
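
One way to push that optimization further would be a small hyperparameter search. This is only a sketch of mine, not the tuning actually used here: the grid values are arbitrary, and TimeSeriesSplit is chosen so that each fold is validated on data that comes after its training data.

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {"n_estimators": [100, 150, 300],
              "learning_rate": [0.005, 0.01, 0.05],
              "max_depth": [2, 3, 4]}

# Time-ordered cross-validation over the training set.
search = GridSearchCV(xgb.XGBRegressor(), param_grid, cv=TimeSeriesSplit(n_splits=3))
search.fit(X_train, y_train.value.apply(getBinary))
print(search.best_params_)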

The features used should also be analyzed to avoid redundant variables and to discard those with no correlation with the target. New features should be added to try improved approaches; in short, there is a lot of work that could be done around this basic model.
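
A quick way to spot redundant predictors is to look at their pairwise correlations; the 0.95 threshold below is an arbitrary choice of mine, not a value from this study.

# Pairs of predictors whose absolute correlation exceeds the threshold.
corr = X_train.corr().abs()
redundant = [(a, b) for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:] if corr.loc[a, b] > 0.95]
print(redundant)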

XGBoost also provides ways to study the features. Let's take a look at their importance:

fig = plt.figure(figsize=(8,8))
plt.xticks(rotation='vertical')
plt.bar([i for i in range(len(xgbModel.feature_importances_))], xgbModel.feature_importances_.tolist(), tick_label=X_test.columns, color="chocolate")
plt.show()
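
The xgboost package also ships a built-in helper that produces a similar chart directly from the fitted model:

# Built-in importance plot from the xgboost library.
xgb.plot_importance(xgbModel)
plt.show()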

It is obvious that the scope of this field is huge and especially interesting.

Deep Learning Nonlinear Regression

In this article we put a perceptron to work on a high-difficulty nonlinear regression problem. The data has been generated using an exponential function with this shape:

[Figure: Eckerle4 dataset, exponential function graph]

The graph above corresponds to the values of the dataset that can be downloaded from the Statistical Reference Datasets of the Information Technology Laboratory of the United States at this link: http://www.itl.nist.gov/div898/strd/nls/data/eckerle4.shtml

Neural networks are especially appropriate for learning patterns and remembering shapes. Perceptrons are very basic yet very powerful neural network types. Their structure is basically an array of weighted values that is recalculated and balanced iteratively. They can implement activation layers or functions to modify the output within a certain range or list of values.
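
In compact form, each dense layer computes something like output = activation(W · input + b), where W are the weights that get adjusted iteratively during training and b is a bias term; the Keras layers used below implement exactly this kind of transformation.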

In order to create the neural network we are going to use Keras, one of the most popular Python libraries. The code is as follows:

The first thing to do is to import the elements that we will use. We will not use aliases for the purpose of clarity:

# Numeric Python Library.
import numpy
# Python Data Analysis Library.
import pandas
# Scikit-learn Machine Learning Python Library modules.
#   Preprocessing utilities.
from sklearn import preprocessing
#   Cross-validation utilities.
from sklearn import cross_validation
# Python graphical library
from matplotlib import pyplot
 
# Keras perceptron neuron layer implementation.
from keras.layers import Dense
# Keras Dropout layer implementation.
from keras.layers import Dropout
# Keras Activation Function layer implementation.
from keras.layers import Activation
# Keras Model object.
from keras.models import Sequential

In the previous code we have imported the numpy and pandas libraries to manage the data structures and perform operations with matrices. The two scikit-learn modules will be used to scale the data and to prepare the test and train data sets.

The matplotlib package will be used to render the graphs.

From Keras, the Sequential model is loaded; it is the structure the Artificial Neural Network model will be built upon. Three types of layers will be used:

  1. Dense: these are the basic layers, made of weighted neurons, that form the perceptron. An entire perceptron could be built with this type of layer.
  2. Activation: activation functions transform the output data from other layers.
  3. Dropout: this is a special type of layer used to avoid over-fitting by leaving a number of neurons out of the learning process.

First we load the dataset already formatted as csv.

# Preparing the dataset.
# Imports csv into pandas DataFrame object.
Eckerle4_df = pandas.read_csv("Eckerle4.csv", header=0)
 
# Converts dataframes into numpy objects.
Eckerle4_dataset = Eckerle4_df.values.astype("float32")
# Slicing all rows, second column...
X = Eckerle4_dataset[:,1]
# Slicing all rows, first column...
y = Eckerle4_dataset[:,0]
 
# Data Scaling from 0 to 1, X and y originally have very different scales.
X_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
y_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_scaled = ( X_scaler.fit_transform(X.reshape(-1,1)))
y_scaled = (y_scaler.fit_transform(y.reshape(-1,1)))
 
# Preparing test and train data: 60% training, 40% testing.
X_train, X_test, y_train, y_test = cross_validation.train_test_split( \
    X_scaled, y_scaled, test_size=0.40, random_state=3)

The predictor variable is saved in variable X and the dependent variable in y. The two variables have values that differ by several orders of magnitude, and neural networks work better with values close to zero. For those two reasons the variables are scaled to remove their original magnitudes and put them on the same scale. Their values are proportionally transformed to lie between 0 and 1.

The data is divided into two sets: one will be used to train the neural network, using 60% of all the samples, and the other will contain the remaining 40% of the data, which will be used to test whether the model works well with out-of-sample data.

Now we are going to define the neural network. It will consist of an input layer to receive the data, several intermediate layers to process the weights, and a final output layer to return the prediction (regression) results.

The objective is for the network to learn from the training data so that it can finally reproduce the original function with only 60% of the data. It could be less, it could be more; I have chosen 60% arbitrarily. To verify that the network has learnt the function, we will ask it to predict the response for the test data that was not used to create the model.

Now let's think about the neural network topology. If we study the chart, there are three areas that differ considerably: the left tail, up to the 440 mark; a peak between the 440 and 465 marks approximately; and the second tail on the right, from the 465 mark onwards. For this reason we will use three intermediate neuron layers, so that the first one learns one of these areas, the second one another area, and the third one the final residuals, which should correspond to the third area. We will therefore have 3 hidden layers in our network, plus one input and one output layer. The basic layer structure of the neural network should be similar to this sequence of layers, from left to right:

INPUT LAYER(1) > [HIDDEN(i)] > [HIDDEN(j)] > [HIDDEN(k)] > OUTPUT(1)

An input layer that accepts the predictor value X, a first hidden layer with i neurons, a second hidden layer with j neurons, a third hidden layer with k neurons and, finally, an output layer that returns the regression result for each sample.

# New sequential network structure.
model = Sequential()
 
# Input layer with dimension 1 and hidden layer i with 128 neurons. 
model.add(Dense(128, input_dim=1, activation='relu'))
# Dropout of 20% of the neurons and activation layer.
model.add(Dropout(.2))
model.add(Activation("linear"))
# Hidden layer j with 64 neurons plus activation layer.
model.add(Dense(64, activation='relu'))
model.add(Activation("linear"))
# Hidden layer k with 64 neurons.
model.add(Dense(64, activation='relu'))
# Output Layer.
model.add(Dense(1))
 
# Model is derived and compiled using mean square error as loss
# function, accuracy as metric and gradient descent optimizer.
model.compile(loss='mse', optimizer='adam', metrics=["accuracy"])
 
# Training model with train data. Fixed random seed:
numpy.random.seed(3)
model.fit(X_train, y_train, nb_epoch=256, batch_size=2, verbose=2)

Now the model is trained by iterating 256 times over all the training data, taking two samples at a time.

In order to graphically see the accuracy of the model, now we apply the regression model to new data that has not been used to create the model. We will also plot the predicted values versus the actual values.

# Predict the response variable with new data.
predicted = model.predict(X_test)
 
# Plot the predicted data in blue and the actual data in green
# to visually verify the accuracy of the model.
pyplot.plot(y_scaler.inverse_transform(predicted), color="blue")
pyplot.plot(y_scaler.inverse_transform(y_test), color="green")
pyplot.show()

And the produced graph shows that the network has adopted the same shape as the function:

[Figure: Eckerle4, predicted (blue) vs. actual (green) values]
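
To complement the visual check with a number, an error metric can also be computed on the test set; this is an addition of mine, using the scaled values as they come out of the model.

# Mean squared error between the scaled predictions and the scaled test targets.
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, predicted))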

This demonstrates the exceptional power of neural networks for solving complex statistical problems, especially those in which causality is not crucial, such as image processing or speech recognition.