Predicting Stock Exchange Prices with Machine Learning

This article describes how to reach roughly 75% accuracy when predicting the direction of the next day’s average price change. The target quantity is the 2-day simple moving average, because forecasts are much harder to obtain from unsmoothed daily prices. Two days is the minimum possible smoothing window, so the target alters the actual prices as little as possible.

I picked a company at random from the New York Stock Exchange: “CNH Industrial NV”. There is no particular reason for the choice; it was drawn from a couple of thousand files I extracted from either Yahoo! or Google Finance (I do not remember which). The files are uploaded here:

The method is valid for any financial data as long as it has the same structure. I have also tested it with Forex data getting similar accuracy levels with currencies such as EURUSD, GBPUSD or USDJPY. The interesting point of forecasting those quotes is that by examining where it fails, I think you will improve your price action trading skills and your understanding of the market and what matters.

Data Collection and Variable Configuration

There are millions of candidate variables that may seem worth analyzing. And which target value should we aim for? I like to think of price as an object subject to physical laws: it reacts to market forces and it has inertia, velocity, acceleration, etc.

Volume may act as a force, price may have a potential energy depending on whether it is very high or very low, the rate of change may be important, and so on. There are many other factors we could analyze, such as gaps, breakouts, technical patterns, candlestick analysis or price distribution within space, just to mention a few. For this example we will focus only on price action and volume.

I have the files saved in csv format to be used with Excel, so let’s start loading the csv file into a DataFrame object using Python.

# Importing all the libraries that we will use.
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import accuracy_score
# Load the data from a csv file.
CNHI = {"stock_name": "CNH Industrial NV", "data": pd.read_csv("./data/CNHI_excel.csv", sep="\t", header=0, decimal=',')}
# Drop the unused "Adj Close" column and index the rows by date.
CNHI["data"] = CNHI["data"].drop(columns="Adj Close").set_index("Date")

After loading the file, the previous code removes a column that won’t be used (“Adj Close”) and creates an index from the “Date” column. The date is not a variable we may use for forecasting, so there is no need to keep it as a column of the dataset.

The data now has the typical structure of financial data: Date, Open, High, Low, Close and Volume. The first three rows are shown in the next table:

Date Open High Low Close Volume
2013-09-30 12.75 13.08 12.5 12.5 352800
2013-10-01 12.76 13.16 12.75 12.92 1477900
2013-10-02 13.02 13.08 12.87 12.9 1631900


We are going to omit High, Low and Close, using only Open and Volume for the study. Let’s start preparing the data for the analysis. The predictors (X variables) used to predict the target magnitude (y variable) will be the following ones:

  • Two day simple moving average (SMA2). The formula is (Ct + Ct-1)/2, where Ct is the current day’s open price and Ct-1 the previous day’s open price. This formula is applied to each row of the data set.
Predictors = pd.DataFrame({"sma2":CNHI["data"].Open.rolling(window=2).mean()})
  • 1 day window SMA2. The previous day’s SMA2 value.
Predictors["sma2_1"] = Predictors.sma2.shift(1)
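As a quick sanity check of the moving-average definition, here is a minimal sketch that computes the two-day average both with the pandas rolling mean and by hand. The series name open_prices is only illustrative; the three values are the open prices for 2013-10-01 to 2013-10-03 that appear in the sample tables of this article.

```python
import pandas as pd

# Open prices for 2013-10-01, 2013-10-02 and 2013-10-03 (from the sample rows).
open_prices = pd.Series([12.76, 13.02, 12.77])

# Rolling mean over a window of two rows: (C_t + C_{t-1}) / 2.
sma2 = open_prices.rolling(window=2).mean()

# The same quantity computed by hand for comparison.
sma2_manual = (open_prices + open_prices.shift(1)) / 2

# The first row has no previous price, so both versions yield NaN there.
print(sma2.round(3).tolist())
```

The first row is always lost to the window, which is why the rows with nulls have to be dropped before training.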

And the other predictors will be:

  • Current day SMA2 increment. (SMA2t – SMA2t-1).
  • 1 day window SMA2 increment. (SMA2t-1 – SMA2t-2).
  • Current day volume increment. (Volt – Volt-1).
  • Current day volume rate of change. (Volt – Volt-1)/Volt
  • 1 day window open price. (Ct-1)
  • Current day open price increment. Ct – Ct-1
  • Current day open price. Ct.
Predictors["sma2_increment"] = Predictors.sma2.diff()  
Predictors["sma2_1_increment"] = Predictors.sma2_1.diff()  
Predictors["vol_increment"] = CNHI["data"].Volume.diff()
Predictors["vol_rel_increment"] = CNHI["data"].Volume.diff() / CNHI["data"].Volume
Predictors["open_1"] = CNHI["data"].Open.shift(1)
Predictors["open_incr"] = CNHI["data"].Open - CNHI["data"].Open.shift(1)
Predictors["open"] = CNHI["data"].Open
# The rows with nulls generated by rolling values will be removed.
Predictors = Predictors.dropna()

A sample of the first 5 rows:

Date sma2 sma2_1 sma2_increment sma2_1_increment vol_increment vol_rel_increment open_1 open_incr open
2013-10-03 12.895 12.89 0.005 0.135 -495500 -0.436026047 13.02 -0.25 12.77
2013-10-04 12.765 12.895 -0.13 0.005 -21800 -0.019558586 12.77 -0.01 12.76
2013-10-07 12.59 12.765 -0.175 -0.13 -400 -0.000359002 12.76 -0.34 12.42
2013-10-08 12.42 12.59 -0.17 -0.175 104600 0.08582212 12.42 0 12.42
2013-10-09 12.5 12.42 0.08 -0.17 -232400 -0.235604217 12.42 0.16 12.58


Target Variable

This is a classification problem: will the average price go up or down the next day? The target is built from the difference between today’s smoothed price and tomorrow’s (which is unknown).

target = pd.DataFrame({"value":Predictors.sma2.shift(-1) - Predictors.sma2}).dropna()

After calculating the data to predict, the three first rows look like this:

Date value
2013-10-03 -0.13
2013-10-04 -0.175
2013-10-07 -0.17

Finally we will match predictors and target values by date and remove those rows without counterpart in the other table.

X = pd.merge(Predictors, target,left_index=True,right_index=True)[Predictors.columns]
y = pd.merge(Predictors, target,left_index=True,right_index=True)[target.columns]

X now contains the predictors and y the target values. The table contains 1,059 records at this moment.
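The shift and alignment steps can be seen on a tiny example. This is only a sketch: the four sma2 values are taken from the sample rows shown earlier, and the names X_toy and y_toy are made up for the illustration. An inner join on the date index plays the role of the merge.

```python
import pandas as pd

# Smoothed prices indexed by date (values from the sample rows of this article).
sma2 = pd.Series(
    [12.895, 12.765, 12.59, 12.42],
    index=pd.to_datetime(["2013-10-03", "2013-10-04", "2013-10-07", "2013-10-08"]),
    name="sma2",
)

# Tomorrow's value minus today's value; the last row has no "tomorrow".
target = (sma2.shift(-1) - sma2).rename("value").dropna()

# An inner join on the index keeps only the dates present in both tables.
aligned = pd.concat([sma2, target], axis=1, join="inner")

X_toy = aligned[["sma2"]]
y_toy = aligned[["value"]]
print(aligned)
```

The last date disappears from both tables because its next-day value does not exist yet.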

Extreme Gradient Boosting prediction

Extreme gradient boosting is an exceptional machine learning technique for many reasons. It is based on decision trees and has nice features such as residual analysis, non-linear regression, feature selection tools, overfitting avoidance and many more. Other machine learning techniques commonly used for this type of analysis are Support Vector Machines, Neural Networks and Random Forests. I have used all of them for predicting market prices, and Extreme Gradient Boosting is always my first choice.

We will set up the regression model using the first 65% of the data, and with that model the next 35% of the data will be used to predict future values. This simulates the actual scenario in which we have past data to train our model and want to predict a future datum with the data currently on hand. The data will be split into two sets: the training set, used to fit the model, and the testing set, which won’t be used to build the model but only to test whether it works as expected with new data.

train_samples = int(X.shape[0] * 0.65)
X_train = X.iloc[:train_samples]
X_test = X.iloc[train_samples:]
y_train = y.iloc[:train_samples]
y_test = y.iloc[train_samples:]

After applying the data splitting, the two data sets contain:

  • Train records: 688.
  • Test records: 371.
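The record counts can be reproduced with a small helper. A minimal sketch, assuming a time-ordered table; chronological_split is a hypothetical name and the dummy frame only stands in for the real predictors.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.65):
    """Split a time-ordered DataFrame without shuffling (hypothetical helper)."""
    cut = int(frame.shape[0] * train_fraction)
    return frame.iloc[:cut], frame.iloc[cut:]

# 1,059 dummy rows stand in for the real predictors.
demo = pd.DataFrame({"x": range(1059)})
train, test = chronological_split(demo)
print(len(train), len(test))
```

Keeping the rows in order matters here: every test row is later in time than every training row, so the model never sees the future while it is being fit.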

The target variables will be transformed for binary classification. A positive change in the value of prices will be classified as 1 and a non-positive change as 0.

def getBinary(val):
    # 1 for a positive change, 0 for a non-positive change.
    if val > 0:
        return 1
    return 0
# and the transformation is applied on the test data for later use.
# The train data will be transformed while it is being fit.
y_test_binary = pd.DataFrame(y_test["value"].apply(getBinary))

And next, the model is trained and the test data predicted to verify the accuracy of the system:

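regressor = xgb.XGBRegressor(gamma=0.0, n_estimators=150, base_score=0.7, colsample_bytree=1, learning_rate=0.01)
# Fit on the training predictors; the training target is binarized on the fly.
xgbModel = regressor.fit(X_train, y_train.value.apply(getBinary))
y_predicted = xgbModel.predict(X_test)
# Regression outputs at or above 0.5 are read as "up" (1), the rest as "down" (0).
y_predicted_binary = [1 if yp >= 0.5 else 0 for yp in y_predicted]
print(accuracy_score(y_test_binary, y_predicted_binary))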
Out: 0.76010781671159033

So, the initial accuracy without optimizing the model is 76% when predicting the direction of the daily average price change for each of the next 371 trading days.
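An accuracy figure is easier to judge next to a naive reference. The following sketch, which is not part of the original study, computes the accuracy obtained by always predicting the most common class seen in training. The function name and the toy labels are assumptions for the illustration; with the real data one would pass the binarized training and test targets.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

# Illustrative labels only, not the CNHI data.
toy_train_labels = [1, 0, 1, 1, 0, 1]
toy_test_labels = [1, 0, 0, 1]
baseline = majority_baseline_accuracy(toy_train_labels, toy_test_labels)
print(baseline)
```

A model is only informative to the extent that it beats this kind of reference on the test period.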

The model can be optimized further: I have only set a few parameters to avoid overfitting the training data and to adjust the learning rate.

The features used should also be analyzed to avoid using redundant variables and to discard those with no correlation. New features should be added to try improved approaches and, to sum up, there is a lot of work that could be done around this basic model.

XGBoost also offers ways to study features. Let’s take a look at their importance:

fig = plt.figure(figsize=(8,8))
plt.xticks(rotation='vertical')
# Bar chart with the importance assigned to each predictor.
plt.bar([i for i in range(len(xgbModel.feature_importances_))], xgbModel.feature_importances_.tolist(), tick_label=X_test.columns, color="chocolate")

Clearly, this field is vast and especially interesting.

Network Analysis Applied to Product Management - Betweenness

Data can be analyzed using multiple approaches. When I think about product research or data mining for marketing, one of the first ideas that comes to mind is clustering: how to organize clients and products into homogeneous groups with similar attributes and analyze the evolution of these groups over time.

One may try to find the correlation between product A and product B to determine whether A and B are purchased together and so on. However, what if we want to analyze products depending on their location within a mall, their proximity, the height of the shelf they are located on, their price and their relationships with all the other products?

Graph theory may help to uncover many relevant features hidden within clients, prices, products, location and invoices at once. For instance:

  • Which are the most accessible (in terms of selection) elements within the network?
  • Which products are critical to build the shopping cart?
  • Are the elements of this network grouped in classes of any kind?
  • What differences can be found among different sites or countries?
  • Which product is the best one to promote another product?
  • Which is the model of one product that, even if it is not the best sold, must be included to sell other three particular products to a group of customers classified within the “definitive product acquisition still in progress” cluster?

Online Retail mall dataset description

The dataset used corresponds to the Online Retail dataset by Daqing Chen, Sai Liang Sain, and Kun Guo, “Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining”, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

The files are available online on: and consist of an Excel file with all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer.

The dataset has a total of 539,392 rows. For the purposes of the study all transactions with price equal to zero and those records having any null value have been removed.

There are three types of entities that have been used to create the vertices:

  • Clients: all their ids have been modified adding a leading “C^^” prefix.
  • Invoices: all their ids have been modified adding a leading “I^^” prefix.
  • Products: all their ids have been modified adding a leading “S^^” prefix corresponding to the initial letter of “Stock”.

Here is a sample of the original data used:

InvoiceDate InvoiceNo StockCode Description Quantity UnitPrice CustomerID Country
2011-03-03 09:41:00 545468 21166 COOK WITH WINE METAL SIGN 12 2.08 16571.0 United Kingdom
2011-04-07 12:30:00 549258 71459 HANGING JAM JAR T-LIGHT HOLDER 12 0.85 13102.0 United Kingdom
2011-06-23 17:20:00 557949 21094 SET/6 RED SPOTTY PAPER PLATES 12 0.85 15530.0 United Kingdom
2011-07-01 11:20:00 558682 21935 SUKI  SHOULDER BAG 3 4.13 United Kingdom
2011-07-20 12:51:00 560710 23295 SET OF 12 MINI LOAF BAKING CASES 1 0.83 14646.0 Netherlands
2011-07-27 10:40:00 561396 23118 PARISIENNE JEWELLERY DRAWER 4 7.5 13458.0 United Kingdom
2011-08-11 14:54:00 563035 22470 HEART OF WICKER LARGE 3 2.95 17790.0 United Kingdom
2011-08-22 10:59:00 563954 22623 BOX OF VINTAGE JIGSAW BLOCKS 12 5.95 16652.0 United Kingdom
2011-08-26 09:37:00 564559 37327 ASSTD MULTICOLOUR CIRCLES MUG 48 0.39 15811.0 United Kingdom
2011-12-04 10:10:00 580384 22314 OFFICE MUG WARMER CHOC+BLUE 12 1.25 17243.0 United Kingdom

Data Insights

The online store operates across 38 countries. The United Kingdom ranks first, with sales amounting to 8.2 million pounds sterling.

Country Amount Sold (thousand GBP) Rank
United Kingdom 8.209,93 1
Netherlands 284,66 2
EIRE 263,28 3
Germany 221,70 4
France 197,40 5
Australia 137,08 6
Switzerland 56,39 7
Spain 54,77 8
Belgium 40,91 9
Sweden 36,60 10
Japan 35,34 11
Norway 35,16 12
Portugal 29,37 13
Finland 22,33 14
Channel Islands 20,09 15
Denmark 18,77 16
Italy 16,89 17
Cyprus 12,95 18
Austria 10,15 19
Hong Kong 10,12 20
Singapore 9,12 21
Israel 7,91 22
Poland 7,21 23
Unspecified 4,75 24
Greece 4,71 25
Iceland 4,31 26
Canada 3,67 27
Malta 2,51 28
United Arab Emirates 1,90 29
USA 1,73 30
Lebanon 1,69 31
Lithuania 1,66 32
European Community 1,29 33
Brazil 1,14 34
RSA 1,00 35
Czech Republic 0,71 36
Bahrain 0,55 37
Saudi Arabia 0,13 38

The typical purchase/transaction amount has a:

  • Median equal to 303.83 GBP
  • Maximum transaction of 168,469.60 GBP
  • Minimum transaction of 0.38 GBP

The graphs below show the transaction distribution. The values are in base-10 logarithms to highlight the price scales:

Logarithmic scale. Frequency on y-axis.

And the boxplot for the same dataset:

Analysis of United Kingdom Transactions

Let’s take a look at the United Kingdom data. To do so, I am going to create a directed graph whose nodes will be the clients, the invoices and the products within the invoices. The edges or ties will be the correspondences between clients and invoices and between invoices and products. Products and clients will not have direct edges. The initial weight of an edge will be the price for product-invoice ties and the total amount of the invoice for invoice-client ties.

The network main features are:

  • Graph type: Directed Graph
  • Order: 27,468 vertices (clients + invoices + products).
  • Size: 743,510 edges.
  • Average degree: 27 (average number of ties per vertex, same value for in and out edges).
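To make the construction concrete, here is a minimal sketch that builds such a graph from a handful of made-up transactions and computes its order, size and average degree. It assumes each tie is stored in both directions and ignores weights; the ids are illustrative, only the “C^^”, “I^^” and “S^^” prefixes follow the convention described above.

```python
# Toy transactions: (customer id, invoice id, stock code). Illustrative values only.
transactions = [
    ("1", "100", "A"),
    ("1", "100", "B"),
    ("2", "101", "A"),
    ("2", "102", "C"),
]

edges = set()
for customer, invoice, stock in transactions:
    c, i, s = "C^^" + customer, "I^^" + invoice, "S^^" + stock
    # Assumption: every tie is stored as two directed edges, one per direction.
    edges.update({(c, i), (i, c), (i, s), (s, i)})

vertices = {vertex for edge in edges for vertex in edge}
order = len(vertices)            # number of vertices
size = len(edges)                # number of directed edges
average_degree = size / order    # same value for in and out edges
print(order, size, average_degree)
```

Using a set for the edges means that a product repeated on the same invoice still produces a single tie.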

A subset of the UK network is represented as a graph in the following figure:

Product and Client Analysis

The top five clients by transaction amount in GBP are:

Customer ID Amount (GBP)
C^^18102 256,438.49
C^^17450 187,482.17
C^^17511 88,125.38
C^^16684 65,892.08
C^^13694 62,653.1

The first client ego graph displays the invoices related to that client:

The central or core node represents the client and the nodes in the periphery are the related invoices. Its main features are:

  • 47 nodes (1 corresponds to the customer ID and 46 to the invoices).
  • 92 ties: 46 edges that go from the client to each invoice and another 46 from each invoice to the client.

Now, we may be interested in knowing which products were sold only once to our clients. Some of them may be very profitable, and we could analyze how the products we want to promote have been sold to other clients in order to create a specific marketing plan for them.

The number of ties is called degree, so we want to sort our products by their degree. Over 400 of them only appear on one invoice, just to mention a few codes:

StockCode Description Price (GBP)
S^^15039 SANDALWOOD FAN 0.85
S^^15039 SANDALWOOD FAN 0.53

Now we realize that we should split the product codes into sub-codes for each variety, but it is unnecessary for the purposes of the example.
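Sorting products by degree does not require a graph library. A minimal sketch, assuming the invoice lines are available as (invoice, product) pairs; the ids below are invented for the example.

```python
from collections import defaultdict

# (invoice id, stock code) pairs; the values are illustrative only.
invoice_lines = [
    ("I^^1", "S^^A"), ("I^^1", "S^^B"),
    ("I^^2", "S^^A"), ("I^^3", "S^^C"),
    ("I^^3", "S^^A"),
]

invoices_per_product = defaultdict(set)
for invoice, product in invoice_lines:
    invoices_per_product[product].add(invoice)

# Degree of a product node = number of distinct invoices it is tied to.
degree = {product: len(invoices) for product, invoices in invoices_per_product.items()}
ranking = sorted(degree.items(), key=lambda item: item[1], reverse=True)
sold_once = sorted(product for product, d in degree.items() if d == 1)
print(ranking)
print(sold_once)
```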

Centrality Study

Two important centrality measures are:

  • Closeness: The nodes that are, on average, at the shortest distance from all the other nodes are considered central nodes. They are good candidates to support the entire system and they can have a big overall influence on the other nodes.
  • Betweenness: The nodes with the highest betweenness may not have many connections, but the connections they have are critical to the system because if those nodes are lost, important parts of the system will be disconnected or pushed away from the rest. These nodes are good candidates for critical maintenance plans since they are key to the cohesion of the network. They receive more information than others since they are more often the only path that connects other nodes across the network, or the shortest path for many nodes to reach other nodes.
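Both measures can be tried on a toy network. The sketch below is a plain-Python implementation for small, undirected, unweighted graphs (betweenness follows Brandes’ algorithm); it is not the code used for the study, and the graph, two triangles joined through a bridge node "X", is made up for the illustration.

```python
from collections import deque

def shortest_path_counts(graph, source):
    """BFS from source: distances, path counts, predecessors and visit order."""
    distance = {source: 0}
    paths = {node: 0 for node in graph}
    paths[source] = 1
    predecessors = {node: [] for node in graph}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
            if distance[neighbour] == distance[node] + 1:
                paths[neighbour] += paths[node]
                predecessors[neighbour].append(node)
    return distance, paths, predecessors, order

def betweenness(graph):
    """Unnormalised betweenness for an undirected, unweighted graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        _, paths, predecessors, order = shortest_path_counts(graph, source)
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each undirected pair is visited from both ends, so halve the totals.
    return {node: value / 2 for node, value in score.items()}

def closeness(graph, node):
    """Inverse of the average shortest-path distance to every reachable node."""
    distance, _, _, _ = shortest_path_counts(graph, node)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0

# Two triangles joined through a single bridge node "X" (illustrative graph).
toy = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "X"],
    "X": ["C", "D"],
    "D": ["X", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
scores = betweenness(toy)
print(scores)
print(closeness(toy, "X"))
```

Node "X" has only two connections, yet every shortest path between the two triangles goes through it, so it gets the highest betweenness of the graph.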

Closeness is very similar to central magnitudes such as median, but betweenness is more related to links between different parts of the system and I find it very interesting to explore. The top products with the highest betweenness are:

StockCode Betweenness
S^^82484 5,98%
S^^85123A 3,43%
S^^85099B 3,42%
S^^21166 2,88%
S^^22993 1,87%
S^^22189 1,80%
S^^21080 1,68%
S^^22423 1,54%
S^^22170 1,49%
S^^21181 1,45%
S^^82482 1,45%
S^^23203 1,35%
S^^22197 1,31%
S^^82600 1,28%
S^^20685 1,26%
S^^21212 1,24%
S^^22960 1,22%
S^^22469 1,17%
S^^21175 1,05%
S^^20750 1,04%
S^^21174 1,02%
S^^21876 1,00%
S^^23202 0,98%
S^^21428 0,97%
S^^22659 0,91%

Is any of those nodes a product that promotes upselling, given that they seem to be linked to otherwise exclusive nodes or to appear wherever many other nodes do?

To answer this question we should analyze the invoices and the related products, not necessarily using graphs. Let’s take a look at the ego graph of one of those products:

We can observe the invoices related to that node (S^^82484) of the top five customers data subset. The same can be done with the rest, for example with the second product in the rank:

Now we have two products whose betweenness is the maximum found across the top-five clients subset. We could analyze whether those products are key to increasing the sales amount, for example by offering them to other clients within the network to facilitate the creation of new links and greater invoice amounts.

We could also be interested in those top five customers, who account for 22 percent of the total sales, because it is worth exploring how to start improving their loyalty, since upgrading the best is always the most difficult task.

Many other types of data analytics can be performed using graph and network theories, not only on products but also on companies, society and other complex domains and systems such as ecosystems, medicine, business processes or politics.

This article provides only a small glimpse of network and graph theory and of how it can be applied to multiple problems.

Deep Learning Nonlinear Regression

In this article we put a perceptron to work on a highly difficult nonlinear regression problem. The data has been generated using an exponential function with this shape:


The graph above corresponds to the values of the dataset, which can be downloaded from the Statistical Reference Datasets of the Information Technology Laboratory of the United States National Institute of Standards and Technology (NIST) on this link:

Neural networks are especially appropriate for learning patterns and remembering shapes. Perceptrons are a very basic yet very powerful type of neural network. Their structure is basically an array of weighted values that is recalculated and balanced iteratively. They can implement activation layers or functions to modify the output within a certain range or list of values.

In order to create the neural network we are going to use Keras, one of the most popular Python libraries. The code is as follows:

The first thing to do is to import the elements that we will use. We will not use aliases for the purpose of clarity:

# Numeric Python Library.
import numpy
# Python Data Analysis Library.
import pandas
# Scikit-learn Machine Learning Python Library modules.
#   Preprocessing utilities.
from sklearn import preprocessing
#   Data splitting utilities (the former cross_validation module).
from sklearn import model_selection
# Python graphical library
from matplotlib import pyplot
# Keras perceptron neuron layer implementation.
from keras.layers import Dense
# Keras Dropout layer implementation.
from keras.layers import Dropout
# Keras Activation Function layer implementation.
from keras.layers import Activation
# Keras Model object.
from keras.models import Sequential

In the previous code we have imported the numpy and pandas libraries to manage the data structures and perform operations with matrices. The two scikit-learn modules will be used to scale the data and to prepare the test and train data sets.

The matplotlib package will be used to render the graphs.

From Keras, the Sequential model is loaded; it is the structure the Artificial Neural Network model will be built upon. Three types of layers will be used:

  1. Dense: Those are the basic layers made with weighted neurons that form the perceptron. An entire perceptron could be built with this type of layer.
  2. Activation: Activation functions transform the output data from other layers.
  3. Dropout: This is a special type of layer used to avoid over-fitting by leaving a number of neurons out of the learning process.

First we load the dataset already formatted as csv.

# Preparing dataset
# Imports csv into pandas DataFrame object.
Eckerle4_df = pandas.read_csv("Eckerle4.csv", header=0)
# Converts dataframes into numpy objects.
Eckerle4_dataset = Eckerle4_df.values.astype("float32")
# Slicing all rows, second column...
X = Eckerle4_dataset[:,1]
# Slicing all rows, first column...
y = Eckerle4_dataset[:,0]
# Data Scaling from 0 to 1, X and y originally have very different scales.
X_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
y_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_scaled = X_scaler.fit_transform(X.reshape(-1,1))
y_scaled = y_scaler.fit_transform(y.reshape(-1,1))
# Preparing test and train data: 60% training, 40% testing.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_scaled, y_scaled, test_size=0.40, random_state=3)

The predictor variable is saved in variable X and the dependent variable in y. The two variables have values that differ by several orders of magnitude, and neural networks work better with values close to zero. For those two reasons the variables are scaled to remove their original magnitudes and put them on the same scale. Their values are proportionally transformed into the range from 0 to 1.
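The scaling itself is a simple linear map. As a minimal sketch of what a min-max scaler does for a single column, with hypothetical helper names and made-up values:

```python
import numpy as np

def min_max_scale(values):
    """Map values linearly onto [0, 1]; returns the scaled array and the bounds."""
    low, high = values.min(), values.max()
    return (values - low) / (high - low), (low, high)

def min_max_invert(scaled, bounds):
    """Undo min_max_scale using the stored bounds."""
    low, high = bounds
    return scaled * (high - low) + low

x = np.array([400.0, 425.0, 450.0, 500.0])  # illustrative values
x_scaled, x_bounds = min_max_scale(x)
x_restored = min_max_invert(x_scaled, x_bounds)
print(x_scaled)
```

Keeping the bounds around is what allows predictions to be converted back to the original units later on.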

The data is divided into two sets. One will be used to train the neural network, using 60% of all the samples; and the other will contain 40% of the data, that will be used to test if the model works well with out-of-the-sample data.

Now we are going to define the neural network. It will consist of an input layer to receive the data, several intermediate layers to process the weights, and a final output layer to return the prediction (regression) results.

The objective is that the network learns from the train data and can finally reproduce the original function with only 60% of the data. It could be less, it could be more; I have chosen 60% arbitrarily. In order to verify that the network has learnt the function, we will ask it to predict the response for the test data that was not used to create the model.

Now let’s think about the neural network topology. If we study the chart, there are three areas that differ considerably. Those are the left tail, up to the 440 mark, a peak between the 440 and 465 marks approximately, and the second tail on the right, from the 465 mark on. For this reason we will use three intermediate layers of neurons, so that the first one learns one of these areas, the second one another area, and the third one the final residuals that should correspond to the third area. We will therefore have 3 layers in our network plus one input and one output layer. The basic layer structure of the neural network should be similar to this, a sequence of layers from left to right with this topology:

INPUT LAYER(1) > [HIDDEN(i)] > [HIDDEN(j)] > [HIDDEN(k)] > OUTPUT(1)

An input layer that accepts one value X per sample, a first hidden layer that has i neurons, a second hidden layer that has j neurons, a third hidden layer that has k neurons, and finally, an output layer that returns the regression result y for each sample X.
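To get a feel for the size of such a network, the sketch below counts the trainable parameters (weights plus biases) of a chain of fully connected layers. It assumes a single input value, a single output value, and 128, 64 and 64 neurons for i, j and k; the helper name is made up for the illustration.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# One input, hidden layers of 128, 64 and 64 neurons, one output (assumed sizes).
sizes = [1, 128, 64, 64, 1]
total_parameters = dense_parameter_count(sizes)
print(total_parameters)  # 12737
```

Most of the parameters sit between the hidden layers, which is worth keeping in mind when the data set is small.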

# New sequential network structure.
model = Sequential()
# Input layer with dimension 1 and hidden layer i with 128 neurons.
model.add(Dense(128, input_dim=1, activation='relu'))
# Dropout of 20% of the neurons and activation layer.
model.add(Dropout(0.2))
# Hidden layer j with 64 neurons plus activation layer.
model.add(Dense(64, activation='relu'))
# Hidden layer k with 64 neurons.
model.add(Dense(64, activation='relu'))
# Output Layer: a single neuron that returns the regression value.
model.add(Dense(1))
# Model is derived and compiled using mean square error as loss
# function, accuracy as metric and the Adam optimizer.
model.compile(loss='mse', optimizer='adam', metrics=["accuracy"])
# Training model with train data. Fixed random seed:
numpy.random.seed(3)
model.fit(X_train, y_train, epochs=256, batch_size=2, verbose=2)

Now the model is trained by iterating 256 times through all the train data, taking two samples at a time.

In order to graphically see the accuracy of the model, now we apply the regression model to new data that has not been used to create the model. We will also plot the predicted values versus the actual values.

# Predict the response variable with new data
predicted = model.predict(X_test)
# Plot in blue color the predicted data and in green color the
# actual data to verify visually the accuracy of the model.
pyplot.plot(y_scaler.inverse_transform(predicted), color="blue")
pyplot.plot(y_scaler.inverse_transform(y_test), color="green")

And the produced graph shows that the network has adopted the same shape as the function:


This demonstrates the exceptional power of neural networks to solve complex statistical problems, especially those in which causality is not crucial such as image processing or speech recognition.

Exclusive and Independent Probability Events

There are two commonly used terms in probability and, more generally, in statistics:

  • Exclusive or disjoint events.
  • Independent events.

A great number of theorems and applications within the field of statistics depend on whether the studied events are mutually exclusive or not, and on whether they are mutually independent or not.

Disjoint or mutually exclusive events

Two events are disjoint if they cannot occur at the same time. For instance, the age ranges of a customer are disjoint: a particular customer cannot be both older and younger than twenty at the same time.

Another example is the status of an order. It may be in preparation, at the warehouse, en route or delivered to the consignee; those states are mutually exclusive as well.

On the other hand, non-disjoint events may coexist at the same point in time. A customer may live in a particular town and, at the same time, be more than twenty years old. Those two conditions are not mutually exclusive. In the same way, an order may be in preparation and being assembled, or in preparation and ready for delivery, at the same time.

The way to calculate the probabilities of two or more events differs depending on whether they are disjoint, and the outcome of the calculation varies accordingly.
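The difference shows up in the addition rule: for disjoint events the probabilities simply add up, while for non-disjoint events the overlap has to be subtracted once, P(A or B) = P(A) + P(B) - P(A and B). A minimal sketch with a fair six-sided die, where the event names are chosen only for the illustration:

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # faces of a fair six-sided die

def probability(event):
    """Probability of an event given as a set of faces."""
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
odd = {1, 3, 5}
greater_than_three = {4, 5, 6}

# Disjoint events: the probabilities simply add up.
p_even_or_odd = probability(even) + probability(odd)

# Non-disjoint events: the overlap {4, 6} must be subtracted once.
p_even_or_high = (probability(even) + probability(greater_than_three)
                  - probability(even & greater_than_three))
print(p_even_or_odd, p_even_or_high)
```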

Independent and dependent events

Two events are independent when the outcome of one does not depend on the other. In terms of probability, two events are independent when the probability of one of them is not affected by the occurrence of the other.

This is the case for games of chance such as lotteries and casino games. Every time a die is rolled, the chances of obtaining a particular outcome do not change; at each roll the probability of obtaining any of the six possible values of a six-sided die is equal to \(\frac{1}{6}\).

Conversely, the probability of a dependent event is affected by the occurrence of the other event. In this case we talk about conditional probability, which is expressed using the notation \(P(A|B)\). An example is the probability of selling an online product when the user has already opened an account on the site and returns: it differs depending on whether the second event (account opened previously) has occurred.
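Both ideas can be checked numerically. Two events A and B are independent exactly when P(A and B) = P(A)·P(B), and the conditional probability is P(A|B) = P(A and B) / P(B). The sketch below uses a fair six-sided die; the events are picked only to show one independent and one dependent pair.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # faces of a fair six-sided die

def probability(event):
    return Fraction(len(event & outcomes), len(outcomes))

def conditional(a, b):
    """P(A|B) = P(A and B) / P(B)."""
    return probability(a & b) / probability(b)

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return probability(a & b) == probability(a) * probability(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
greater_than_three = {4, 5, 6}

print(independent(even, at_most_four))        # True
print(independent(even, greater_than_three))  # False
print(conditional(even, greater_than_three))  # 2/3
```

Knowing that the roll is greater than three raises the probability of an even face from 1/2 to 2/3, while knowing that it is at most four leaves it at 1/2.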

Python for Digital Signal Processing

Before starting with this post, you may want to learn the basics of Python. If you’re an experienced programmer meeting Python for the first time, you will likely find it very easy to understand. One important thing about Python: it requires consistent indentation (4 spaces by convention) to delimit blocks of code. So if you get an error and your code seems perfect, check that each line is indented correctly. Python also has a particular way of dealing with arrays, closer to the R programming language than to C-like style.

Python’s core functionality is extended with thousands of free libraries, many of which are incredibly handy. Even though Python is not a compiled language, many of its libraries are written in C, with Python acting as a wrapper for them.

The libraries used in this article are:

  • scipy – Scientific library for Python.
  • numpy – Numeric library for Python.

To load a wav file in Python:

# Loads the wavfile module from the scipy.io package.
from scipy.io import wavfile
# Location of the wav file in the file system.
fileName = '/Downloads/Music_Genres/genres/country/country.00053.wav'
# Loads sample rate (samples per second) and signal data (wav). Notice that
# Python functions can return multiple variables at the same time.
sample_rate, data = wavfile.read(fileName)
# Print in stdout the sample rate, number of items
# and duration in seconds of the wav file.
print("Sample rate:{0}, data size:{1}, duration:{2} seconds".format(
    sample_rate, data.shape, len(data) // sample_rate))

The output generated should look like:

Sample rate:22050, data size:(661794,), duration:30 seconds

The output shows that the wav file contains a total of 661,794 samples (the data variable is an array with 661,794 elements). The sample rate is 22,050 samples per second. Dividing 661,794 elements by 22,050 samples per second, we obtain approximately 30 seconds, the length of the wav file.

The Fourier Transform

The Fourier transform is the method that we will use to extract the prevalent frequencies from the wav file. Each frequency corresponds to a musical tone; knowing the frequencies from a particular time interval, we can tell which tones are the most frequent within that interval, which makes it possible to infer the key and chords played during that time lapse.

This article is not going to enter into the details of the Fourier transform, only on how to use it to extract information regarding the frequency power from the wav signal analyzed. The video below is an intuitive introduction to the Fourier transform in case the reader is interested in it. It also includes examples of how to implement it algorithmically. It is quite advisable to watch it once now and then come back again to review it after the training in Fourier transform is completed.

Basically, we are given a signal (a wav file in this post) composed of n samples \(x[n]\), and we can get the frequency power within the signal with the FFT (Fast Fourier Transform) function. The FFT is an algorithm that computes the Fourier transform of sampled data efficiently.

The FFT function receives two arguments: the signal \(x\) and the number of items to retrieve \(k, k\leq n\). The commonly chosen k value is \(\frac{n}{2}\) because, for a real-valued signal, the magnitude of the FFT result \(fft[k]\) is symmetric around that length. This means that only the first half of the FFT output is required to read the occurrence of the different frequencies. So, in plain words, if the original signal file has 100 samples, about 50 output values are enough to describe its complete spectrum.
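The symmetry can be verified on a synthetic signal. This sketch uses numpy.fft rather than the scipy functions of this article, and the sample rate and tones are made up for the illustration. It checks that the magnitude of the upper half of the spectrum mirrors the lower half, and that rfft returns only the non-redundant part.

```python
import numpy as np

sample_rate = 100                  # samples per second (illustrative)
t = np.arange(100) / sample_rate   # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.fft(signal)
magnitude = np.abs(spectrum)

# For real input, |X[k]| equals |X[n - k]|: the upper half mirrors the lower half.
n = len(signal)
mirrored = np.allclose(magnitude[1:n // 2], magnitude[-1:n // 2:-1])

# rfft returns only the non-redundant half: n // 2 + 1 coefficients.
half = np.fft.rfft(signal)
print(mirrored, len(half))
```

Note that the whole signal still goes into the transform; what can be dropped is half of the output, not half of the input.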

In Python there are two useful functions to calculate the Fourier transform of a sample array, like the data variable where the wav file is stored:

  • fftfreq – Returns the frequency corresponding to each element of the Fourier transform of the signal \(x[n]\). This is the frequency that each fft element corresponds to.
  • fft – Returns the Fourier transform of the sample data. The elements are returned in the same order as those of fftfreq, so that each fft power element corresponds by position to an fftfreq frequency.

For instance, if the Fourier transform function returns fft = {0, 0.5, 1} and fftfreq = {100, 200, 300}, it means that the signal has a power of 0 at 100Hz, a power of 0.5 at 200Hz and a power of 1 at 300Hz, 300Hz being the most prevalent frequency.

The following code extracts the first 10 seconds from a wav file, applies the Fourier transform and gets the frequency associated with each item of the spectral data.
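
To see how the two arrays line up by position, here is a minimal sketch on a synthetic signal. It assumes NumPy's numpy.fft module instead of scipy.fftpack (it provides functions with the same names), and the 1000Hz sample rate and the 50Hz sine wave are made-up values for the illustration, not data from the wav file.

```python
import numpy as np

sample_rate = 1000                     # samples per second (assumed for this example)
N = 1000                               # one second of signal
t = np.arange(N) / float(sample_rate)  # time of each sample, in seconds

# Real-valued test signal: a single 50 Hz sine wave.
x = np.sin(2 * np.pi * 50 * t)

F = np.fft.fft(x)                         # complex spectrum, N elements
f = np.fft.fftfreq(N, 1.0 / sample_rate)  # frequency of each element of F
power = np.abs(F)

# For a real signal, element k and element N - k have the same magnitude.
mirrored = np.allclose(power[1:], power[1:][::-1])

# Keep the first half (non-negative frequencies) and find the strongest one.
half = N // 2
peak_freq = f[:half][np.argmax(power[:half])]
print(mirrored, round(peak_freq, 3))
```

The position of the largest value in power is used to index f, which is exactly the "same position" correspondence between fft and fftfreq.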

# Package that implements the fast
# Fourier transform functions.
from scipy import fftpack
# Module used to read the wav file (assumed here: scipy.io.wavfile).
from scipy.io import wavfile
import numpy as np
# Loads wav file as array.
fileName = './country.00053.wav'
sample_rate, data = wavfile.read(fileName)
# Extracting 10 seconds. N is the number of samples to
# extract, or elements from the array.
seconds_to_extract = 10
N = seconds_to_extract * sample_rate
# Knowing N and the sample rate, fftfreq gets the frequency
# associated with each FFT unit.
f = fftpack.fftfreq(N, 1.0/sample_rate)
# Extracts Fourier transform data from the sample,
# returning the spectrum analysis.
subdata = data[:N]
F = fftpack.fft(subdata)

F contains the spectrum (the power of each item is its absolute value, \(|F|\)) and f the frequency each item within F is related to. The higher the power, the more prevalent the frequency is across the signal. Filtering the frequencies using the f array and extracting the power, we could get a graph like the next one:

[Figure: fourier_transform_python]

On the y-axis, \(|F|\) is the absolute value of each element of F, and the values of f are the frequency (Hz) on the x-axis. The green and orange lines can be ignored. To get the subset of frequencies [200-900] displayed on the chart, the following code was used:

# Interval limits
Lower_freq = 200
Upper_freq = 900
# Boolean mask: f (frequencies) at or above the lower limit
# AND at or below the upper limit.
filter_subset = (f >= Lower_freq) & (f <= Upper_freq)
# Extracts filtered items from the frequency list.
f_subset = f[filter_subset]
# Extracts filtered items from the Fourier transform power list.
F_subset = F[filter_subset]
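
The same band filter can be tried without a wav file. The sketch below is an illustration under assumptions: it uses numpy.fft in place of scipy.fftpack, and the signal is synthetic (220Hz, 440Hz and an out-of-band 1000Hz sine wave with made-up amplitudes). It then picks the two strongest frequencies inside the band.

```python
import numpy as np

sample_rate = 22050
N = sample_rate                        # one second of samples
t = np.arange(N) / float(sample_rate)

# Synthetic stand-in for the wav data: two tones inside the band
# (220 Hz and 440 Hz) and a louder one outside it (1000 Hz).
signal = (1.0 * np.sin(2 * np.pi * 220 * t)
          + 0.5 * np.sin(2 * np.pi * 440 * t)
          + 2.0 * np.sin(2 * np.pi * 1000 * t))

F = np.fft.fft(signal)
f = np.fft.fftfreq(N, 1.0 / sample_rate)

# Keep only the frequencies between 200 Hz and 900 Hz.
band = (f >= 200) & (f <= 900)
f_subset = f[band]
power_subset = np.abs(F[band])

# Positions of the two largest powers inside the band.
top_two = np.argsort(power_subset)[::-1][:2]
strongest = sorted(round(freq, 3) for freq in f_subset[top_two])
print(strongest)
```

The 1000Hz component is the loudest in the whole signal, but it never reaches the result because the mask removes it before the powers are compared.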

Spectral Analysis and Harmony

Chromatic scale tone frequencies

The previous post, Emotions Within Digital Signals, gave an elementary introduction to harmony and digital signals. We are now going to study the range of tones between A3 and A5. Our central axis is tone A (or A4), whose frequency is equal to 440Hz.

The next table shows all the tones and frequencies within the chromatic scale belonging to the range between A3 and A5. The piano key number corresponding to each tone is also displayed.

Tone (scientific name)  Piano key  Helmholtz name          Frequency (Hz)
A3                      37         a                       220.000
A♯3/B♭3                 38         a♯/b♭                   233.082
B3                      39         b                       246.942
C4 (Middle C)           40         c′ (1-line octave)      261.626
C♯4/D♭4                 41         c♯′/d♭′                 277.183
D4                      42         d′                      293.665
D♯4/E♭4                 43         d♯′/e♭′                 311.127
E4                      44         e′                      329.628
F4                      45         f′                      349.228
F♯4/G♭4                 46         f♯′/g♭′                 369.994
G4                      47         g′                      391.995
G♯4/A♭4                 48         g♯′/a♭′                 415.305
A4 (A440)               49         a′                      440.000
A♯4/B♭4                 50         a♯′/b♭′                 466.164
B4                      51         b′                      493.883
C5 (Tenor C)            52         c′′ (2-line octave)     523.251
C♯5/D♭5                 53         c♯′′/d♭′′               554.365
D5                      54         d′′                     587.330
D♯5/E♭5                 55         d♯′′/e♭′′               622.254
E5                      56         e′′                     659.255
F5                      57         f′′                     698.456
F♯5/G♭5                 58         f♯′′/g♭′′               739.989
G5                      59         g′′                     783.991
G♯5/A♭5                 60         g♯′′/a♭′′               830.609
A5                      61         a′′                     880.000

The difference or leap between two tones is called an interval. One interesting feature of the chromatic scale is that it is composed of constant intervals: the frequency of each tone is the frequency of the previous one multiplied by the same factor, \(2^{1/12} \approx 1.0595\) (for instance, 233.082 / 220.000). After twelve such steps the frequency has doubled. For instance, tone A3 is equal to 220Hz, tone A4 to 440Hz and tone A5 to 880Hz. Each tone's frequency is double that of the same tone in the preceding octave.

The important idea is that we can treat tones as numbers and operate on their frequencies with basic arithmetic. Who said emotions cannot be explained by Science? Do not be intimidated if you know neither music theory nor Optical Physics; these texts will lead you by the hand on a trip at the end of which you will know how to extract the waves, tones and emotions from digital music even without knowing any of those.
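
The constant interval means the whole table can be rebuilt from a single reference tone. The following sketch assumes the standard equal-temperament relation (each key multiplies the frequency by the twelfth root of 2) anchored at piano key 49 = 440Hz, as in the table; the function name key_frequency is made up for this example.

```python
def key_frequency(key_number, a4_key=49, a4_freq=440.0):
    """Frequency in Hz of a piano key, assuming equal temperament.

    Each key above (or below) A4 multiplies (or divides) the
    frequency by 2 ** (1 / 12), so 12 keys double (or halve) it.
    """
    return a4_freq * 2 ** ((key_number - a4_key) / 12.0)


# Rebuild a few rows of the table: A3, Middle C, A4 and A5.
for key in (37, 40, 49, 61):
    print(key, round(key_frequency(key), 3))
```

Keys 37, 49 and 61 are exactly twelve keys apart, which is why their frequencies come out as 220, 440 and 880Hz.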

Frequency analysis in a nutshell

In order to analyze the frequencies that compose a piece of music, we take a part of it and extract a subset of frequencies. As with an equalizer, we filter the sound between two specific frequencies or tones. For instance, we could read the first ten seconds of a music mp3 file and generate a table displaying how many times tone A appears within that sequence. Going further, we could analyze how many tones appear and how many times each tone is played within those first 10 seconds.

As seen in the Emotions Within Digital Signals article, those tones can be used to define the chords and keys a piece of music is formed by.

In order to extract the signal frequency occurrences, we can use a frequency spectrum graph. This graph displays how many times a frequency appears in a signal and its power or prevalence over the rest. In this case, the signal is the first 10 seconds of music. Let’s see an example:

Signal and Frequency Spectrum Graphs

From the graph on the right we can see that the most used frequencies, those having the highest \(|F|\), are one close to 200Hz, another between 300Hz and 400Hz and a third one between 400Hz and 500Hz. The x-axis shows the frequency spectrum (or range) we are analyzing, and the y-axis the power of the signal. The higher the line at a certain point on the x-axis, the more power the signal has at that frequency.

To get an insight into the most used tones, the frequencies that have the most power can be extracted; in this case the dominant frequencies within the signal are 220Hz, 246.942Hz, 329.628Hz and 440Hz. Rounding those frequencies to the nearest integer and comparing them to the ones in the table above, we can extract some of the main tones within the first ten seconds of the song.
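
Looking a frequency up in the table can also be done programmatically. This is a sketch under assumptions, not the method used to produce the results in this article: it assumes equal temperament with A4 = 440Hz at piano key 49, uses only sharp names, and the helper name nearest_tone is invented here.

```python
import math

# Tone names within one octave, starting at C (sharps only).
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_tone(freq_hz, a4_freq=440.0):
    """Name of the chromatic tone closest to freq_hz, e.g. 'A3'."""
    # Number of semitones between the frequency and A4, rounded.
    semitones = int(round(12 * math.log2(freq_hz / a4_freq)))
    key = 49 + semitones               # piano key number (A4 is key 49)
    # Piano key 40 is C4, the first tone of octave 4.
    index = (key - 40) % 12
    octave = 4 + (key - 40) // 12
    return NAMES[index] + str(octave)


for freq in (220, 246.942, 329.628, 440):
    print(freq, nearest_tone(freq))
```

Because the distance is measured in semitones on a logarithmic scale, a slightly off value such as 247Hz still resolves to the same tone as 246.942Hz.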

Tone       Frequency (Hz)
A3         220
B3         247
E4         330
A4 (A440)  440

From the data above it can be determined that the dominant key within the first seconds is composed of tones A, B and E. That key corresponds to chord A2Sus (A 2nd suspended). A chord is the name given to a sound composed of multiple tones, that is, multiple frequencies. The names of the different chords are not described in this article, since there are many of them.

In terms of musical harmony, A2Sus, or generically speaking 2nd suspended chords, are groups of tones that create a sensation of waiting for something to be resolved. The listener is holding on until the song resolves into something. We could say that the first ten seconds of this song cause an emotion of expectation.
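
As an illustration of how a set of tones can be matched to a chord name, here is a minimal sketch. The lookup table is deliberately tiny and hypothetical (only four chord shapes), the root tone has to be supplied by the caller, and the function name chord_name is made up for this example.

```python
# Tone names within one octave, starting at C (sharps only).
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Hypothetical, very small table of chord shapes, expressed as
# semitone distances from the root tone.
CHORD_SHAPES = {
    (0, 4, 7): "major",
    (0, 3, 7): "minor",
    (0, 2, 7): "sus2",
    (0, 5, 7): "sus4",
}


def chord_name(tones, root):
    """Return e.g. 'A sus2' for tones A, B, E with root A, or None."""
    root_index = NAMES.index(root)
    # Distance in semitones from the root to every tone, without repeats.
    intervals = tuple(sorted({(NAMES.index(t) - root_index) % 12 for t in tones}))
    quality = CHORD_SHAPES.get(intervals)
    if quality is None:
        return None
    return root + " " + quality


print(chord_name(["A", "B", "E"], "A"))
print(chord_name(["C", "E", "G"], "C"))
```

The tones A, B and E sit 0, 2 and 7 semitones above A, which is the suspended 2nd shape named in the text.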

For more information on music and emotions, search in Google “emotions chords harmony”. For a good introduction to the matter I would recommend the paper Music and Emotions.

This article and the previous one, Emotions Within Digital Signals, lay the groundwork to successfully tackle the problem of extracting emotions from music sequences. I will explain how to perform that task using Python in the post Python for Digital Signal Processing.

Emotions Within Digital Signals

Music and artistic expression are conceived to provoke emotions in people. Music and visual arts travel in waves through the air across distances, from the transmitter to the receiver. Music is maybe the most influential form of art, capable of producing deep emotional effects, evoking feelings and awakening memories when one is exposed to it.

The sound perceived is nothing else than the effect of the vibration of the eardrum hit by the sound waves traveling through the air. Like a pendulum, a fast one, the eardrum oscillates and that oscillation is felt as an emotion by our brain.

The fundamental unit in music is the tone. When one sings a song, one is reproducing a sequence of tones in a certain order to produce a melody. In music, secondary tones usually follow the lead tone or principal melody. When multiple tones sound at the same time we may call it a chord. Chords define the temper of the music, and are in large part responsible for the emotions that individuals will appreciate when hearing the music.

The tone is the basic musical unit. Western music uses twelve typical tones (C, C#, D, D#, E, F, F#, G, G#, A, A#, B). That range of tones is called an octave, and that tone structure is also commonly called the chromatic scale.

In western music, each chord is always composed of two or more of those tones. For instance, the C Major chord is formed by tones C, E and G played at the same time.

In the same way that we call the sound of multiple tones at once a chord, we call the group of tones the music evolves through a key. A key is similar to a chord, and the basic difference is that keys are tones across time within the same space or plane, and chords are tones at the same instant but across different planes. To summarize, we can assume that chords are multiple simultaneous tones and keys are multiple tones belonging always to the same space of tones.

For instance, the C Major chord would consist of tones C, E and G played at the same time for two seconds. The C Major key could consist instead of tone C played on second 1, tone E played on second 2, and finally tone G played alone on second 3.

Remember that we said that music is just waves; in fact tones are waves too, and each tone has a unique corresponding wave. If we examine the most common waves within each part of a musical piece, we can find out which notes are defining that music within each time interval. We can therefore extract the tones, chords and the key of that music just by analyzing the frequency of the waves it is composed of.

Frequency is the number of cycles a wave completes per unit of time. It is measured in hertz. One hertz is equal to one cycle per second. Each tone has a fixed frequency that never changes. For instance, tone A corresponds to a frequency of 440Hz. Instruments are usually tuned using that tone A as a basis, meaning that all instruments tuned this way will produce the same frequencies for the same tones.

In the next post, Spectral Analysis and Harmony, we will see how we can take advantage of wave analysis (Digital Signal Processing) and music theory (Harmony) to programmatically identify feelings from music files.
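
The relation between frequency and the time taken by one cycle can be written down directly. The sketch below only restates the definition (one hertz is one cycle per second); the 22050 samples per second used in the second call is an assumed sample rate for a digital recording, and both function names are made up here.

```python
def period_seconds(freq_hz):
    """Time one full cycle takes, in seconds (inverse of the frequency)."""
    return 1.0 / freq_hz


def samples_per_cycle(freq_hz, sample_rate):
    """How many digital samples one cycle spans at a given sample rate."""
    return sample_rate / float(freq_hz)


# Tone A at 440 Hz completes 440 cycles every second.
print(round(period_seconds(440) * 1000, 3))      # period in milliseconds
print(round(samples_per_cycle(440, 22050), 2))   # samples in one cycle
```

A higher tone has a shorter period, so it is described by fewer samples per cycle at the same sample rate.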