Data can be analyzed using multiple approaches. When I think about product research or data mining for marketing, one of the first ideas that comes to mind is clustering: how to organize clients and products into homogeneous groups with similar attributes and analyze how these groups evolve over time.
One may try to find the correlation between product A and product B to determine whether they are purchased together, and so on. However, what if we want to analyze products depending on their location within a mall, their proximity to each other, the height of the shelf they are placed on, their price and their relationships with all the other products?
Graph theory may help to uncover many relevant features hidden within clients, prices, products, location and invoices at once. For instance:
- Which elements of the network are the most accessible (in terms of selection)?
- Which products are critical to build the shopping cart?
- Are the elements of this network grouped in classes of any kind?
- What differences can be found among different sites or countries?
- Which product is the best one to promote another product?
- Which model of a product, even if it is not the best seller, must be included in order to sell three other particular products to a group of customers classified within the “definitive product acquisition still in progress” cluster?
Online Retail mall dataset description
The dataset used corresponds to the Online Retail dataset by Daqing Chen, Sai Liang Sain, and Kun Guo, “Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining”, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
The files are available online at http://archive.ics.uci.edu/ml/datasets/online+retail and consist of an Excel file with all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer.
The dataset has a total of 539,392 rows. For the purposes of the study, all transactions with a price equal to zero and all records containing any null value have been removed.
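The cleaning rule described above can be sketched in a few lines of plain Python. This is only an illustration, assuming the rows have already been loaded as dictionaries keyed by the dataset's column names; the helper name `clean_rows` and the third sample row are hypothetical, and this is not the code used for the study.

```python
def clean_rows(rows):
    """Keep only rows with no missing value and a non-zero unit price."""
    cleaned = []
    for row in rows:
        # Treat None and empty or whitespace-only strings as null values.
        has_null = any(v is None or str(v).strip() == "" for v in row.values())
        if has_null:
            continue
        if float(row["UnitPrice"]) == 0:
            continue
        cleaned.append(row)
    return cleaned


sample = [
    {"InvoiceNo": "545468", "StockCode": "21166", "UnitPrice": "2.08", "CustomerID": "16571.0"},
    # Missing CustomerID: dropped.
    {"InvoiceNo": "558682", "StockCode": "21935", "UnitPrice": "4.13", "CustomerID": ""},
    # Hypothetical zero-price row, made up for this example: dropped.
    {"InvoiceNo": "000000", "StockCode": "99999", "UnitPrice": "0.0", "CustomerID": "12345.0"},
]
kept = clean_rows(sample)
print([row["InvoiceNo"] for row in kept])
```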
Three types of entities have been used to create the vertices:
- Clients: all their ids have been modified by adding a leading “C^^” prefix.
- Invoices: all their ids have been modified by adding a leading “I^^” prefix.
- Products: all their ids have been modified by adding a leading “S^^” prefix, corresponding to the initial letter of “Stock”.
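The prefixing scheme can be illustrated as follows. This is a minimal sketch: the function name `vertex_ids` is hypothetical, and it assumes that customer ids arrive as float-like strings (such as "16571.0") and should be normalised to integers before the prefix is added.

```python
def vertex_ids(row):
    """Return the (client, invoice, product) vertex ids for one transaction row."""
    # Assumption: CustomerID is a float-like string ("16571.0"), so normalise it.
    client = "C^^" + str(int(float(row["CustomerID"])))
    invoice = "I^^" + str(row["InvoiceNo"])
    product = "S^^" + str(row["StockCode"])
    return client, invoice, product


row = {"InvoiceNo": "545468", "StockCode": "21166", "CustomerID": "16571.0"}
print(vertex_ids(row))
```

The prefixes keep the three id spaces apart, so a client and an invoice that happen to share the same number never collapse into a single vertex.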
Here is a sample of the original data used:
InvoiceDate | InvoiceNo | StockCode | Description | Quantity | UnitPrice | CustomerID | Country |
2011-03-03 09:41:00 | 545468 | 21166 | COOK WITH WINE METAL SIGN | 12 | 2.08 | 16571.0 | United Kingdom |
2011-04-07 12:30:00 | 549258 | 71459 | HANGING JAM JAR T-LIGHT HOLDER | 12 | 0.85 | 13102.0 | United Kingdom |
2011-06-23 17:20:00 | 557949 | 21094 | SET/6 RED SPOTTY PAPER PLATES | 12 | 0.85 | 15530.0 | United Kingdom |
2011-07-01 11:20:00 | 558682 | 21935 | SUKI SHOULDER BAG | 3 | 4.13 | | United Kingdom |
2011-07-20 12:51:00 | 560710 | 23295 | SET OF 12 MINI LOAF BAKING CASES | 1 | 0.83 | 14646.0 | Netherlands |
2011-07-27 10:40:00 | 561396 | 23118 | PARISIENNE JEWELLERY DRAWER | 4 | 7.5 | 13458.0 | United Kingdom |
2011-08-11 14:54:00 | 563035 | 22470 | HEART OF WICKER LARGE | 3 | 2.95 | 17790.0 | United Kingdom |
2011-08-22 10:59:00 | 563954 | 22623 | BOX OF VINTAGE JIGSAW BLOCKS | 12 | 5.95 | 16652.0 | United Kingdom |
2011-08-26 09:37:00 | 564559 | 37327 | ASSTD MULTICOLOUR CIRCLES MUG | 48 | 0.39 | 15811.0 | United Kingdom |
2011-12-04 10:10:00 | 580384 | 22314 | OFFICE MUG WARMER CHOC+BLUE | 12 | 1.25 | 17243.0 | United Kingdom |
Data Insights
The online store operates across 38 countries. The United Kingdom ranks first, with sales amounting to 8.2 million pounds sterling.
Country | Amount Sold GBP (Thousands) | Rank |
United Kingdom | 8,209.93 | 1 |
Netherlands | 284.66 | 2 |
EIRE | 263.28 | 3 |
Germany | 221.70 | 4 |
France | 197.40 | 5 |
Australia | 137.08 | 6 |
Switzerland | 56.39 | 7 |
Spain | 54.77 | 8 |
Belgium | 40.91 | 9 |
Sweden | 36.60 | 10 |
Japan | 35.34 | 11 |
Norway | 35.16 | 12 |
Portugal | 29.37 | 13 |
Finland | 22.33 | 14 |
Channel Islands | 20.09 | 15 |
Denmark | 18.77 | 16 |
Italy | 16.89 | 17 |
Cyprus | 12.95 | 18 |
Austria | 10.15 | 19 |
Hong Kong | 10.12 | 20 |
Singapore | 9.12 | 21 |
Israel | 7.91 | 22 |
Poland | 7.21 | 23 |
Unspecified | 4.75 | 24 |
Greece | 4.71 | 25 |
Iceland | 4.31 | 26 |
Canada | 3.67 | 27 |
Malta | 2.51 | 28 |
United Arab Emirates | 1.90 | 29 |
USA | 1.73 | 30 |
Lebanon | 1.69 | 31 |
Lithuania | 1.66 | 32 |
European Community | 1.29 | 33 |
Brazil | 1.14 | 34 |
RSA | 1.00 | 35 |
Czech Republic | 0.71 | 36 |
Bahrain | 0.55 | 37 |
Saudi Arabia | 0.13 | 38 |
The transaction amounts have:
- A median of 303.83 GBP
- A maximum of 168,469.60 GBP
- A minimum of 0.38 GBP
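Summary statistics like these, together with a base-10 logarithm transform for plotting, can be obtained with the Python standard library. The amounts in the sketch below are hypothetical values chosen only for illustration; they are not the real invoice totals.

```python
import math
import statistics

# Hypothetical invoice totals in GBP, for illustration only (not real dataset values).
invoice_totals = [0.38, 15.00, 120.50, 303.83, 450.00, 2100.00, 168469.60]

summary = {
    "median": statistics.median(invoice_totals),
    "max": max(invoice_totals),
    "min": min(invoice_totals),
}
# Base-10 logarithms compress the scale so small and huge amounts fit on one axis.
log_totals = [math.log10(amount) for amount in invoice_totals]

print(summary)
print([round(value, 2) for value in log_totals])
```

The median is preferred over the mean here because a handful of very large invoices would otherwise dominate the picture.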
The graphs below show the transaction distribution. The values are in base-10 logarithms to highlight the price scales:

And the boxplot for the same dataset:
Analysis of United Kingdom Transactions
Let’s take a look at the United Kingdom data. To do so, I am going to create a directed graph whose nodes are the clients, the invoices and the products within the invoices. The edges or ties link clients with their invoices and invoices with their products; products and clients have no direct edges. The initial weight of an edge is the product price for product-invoice edges and the total invoice amount for invoice-client edges.
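A graph with this shape can be sketched with a plain dictionary of dictionaries. The function name `build_graph` is hypothetical, and two points are assumptions rather than details confirmed by the text: every tie is stored in both directions, and the invoice total is computed as the sum of quantity times unit price over the invoice lines. The second sample line is made up.

```python
from collections import defaultdict


def build_graph(rows):
    """Build a weighted directed graph as {source: {target: weight}}."""
    # Assumption: invoice total = sum of quantity * unit price over its lines.
    totals = defaultdict(float)
    for r in rows:
        totals["I^^" + r["InvoiceNo"]] += r["Quantity"] * r["UnitPrice"]

    graph = defaultdict(dict)
    for r in rows:
        client = "C^^" + r["CustomerID"]
        invoice = "I^^" + r["InvoiceNo"]
        product = "S^^" + r["StockCode"]
        # Client <-> invoice edges, weighted by the invoice total.
        graph[client][invoice] = totals[invoice]
        graph[invoice][client] = totals[invoice]
        # Invoice <-> product edges, weighted by the unit price.
        graph[invoice][product] = r["UnitPrice"]
        graph[product][invoice] = r["UnitPrice"]
    return graph


rows = [
    {"InvoiceNo": "545468", "StockCode": "21166", "Quantity": 12,
     "UnitPrice": 2.08, "CustomerID": "16571"},
    # Hypothetical second line on the same invoice, made up for this example.
    {"InvoiceNo": "545468", "StockCode": "99999", "Quantity": 2,
     "UnitPrice": 1.00, "CustomerID": "16571"},
]
graph = build_graph(rows)
order = len(graph)                                      # number of vertices
size = sum(len(targets) for targets in graph.values())  # number of directed edges
print(order, size)
```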
The main features of the network are:
- Graph type: Directed Graph
- Order: 27,468 vertices (clients + invoices + products).
- Size: 743,510 edges.
- Average degree: 27 (average number of ties per vertex, same value for in and out edges).
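The average degree follows directly from the order and the size listed above: in a directed graph every edge contributes exactly one out-tie and one in-tie, so both averages equal the number of edges divided by the number of vertices. A quick check, using only the figures quoted in the list:

```python
order = 27_468   # vertices (clients + invoices + products)
size = 743_510   # directed edges

# Each directed edge adds one out-tie and one in-tie, so both averages coincide.
average_degree = size / order
print(round(average_degree, 2))
```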
A subset of the UK network is represented as a graph in the following figure:
Product and Client Analysis
The top five clients by transaction amount in GBP are:
Customer ID | Amount (GBP) |
C^^18102 | 256,438.49 |
C^^17450 | 187,482.17 |
C^^17511 | 88,125.38 |
C^^16684 | 65,892.08 |
C^^13694 | 62,653.10 |
The ego graph of the first client displays the invoices related to that client:
The central or core node represents the client and the nodes in the periphery are the related invoices. Its main features are:
- 47 nodes (1 corresponds to the customer ID and 46 to the invoices).
- 92 ties: 46 edges from the client to each invoice and another 46 from each invoice to the client.
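An ego graph of this kind can be extracted from an adjacency dictionary as sketched below. The function name `ego_graph` and the toy client with three invoices are hypothetical; the sketch assumes a radius of one step around the central node and edges stored in both directions.

```python
def ego_graph(graph, center):
    """Return the radius-1 ego graph of `center` as (nodes, edges)."""
    # Nodes the center points to.
    neighbours = set(graph.get(center, {}))
    # Nodes that point to the center.
    neighbours |= {src for src, targets in graph.items() if center in targets}
    nodes = neighbours | {center}
    # Keep only the edges whose two endpoints are inside the ego graph.
    edges = [(s, t) for s in nodes for t in graph.get(s, {}) if t in nodes]
    return nodes, edges


# Hypothetical toy network: one client, three invoices, one product.
toy = {
    "C^^1": {"I^^a": 10.0, "I^^b": 20.0, "I^^c": 5.0},
    "I^^a": {"C^^1": 10.0, "S^^x": 2.5},
    "I^^b": {"C^^1": 20.0},
    "I^^c": {"C^^1": 5.0},
    "S^^x": {"I^^a": 2.5},
}
nodes, edges = ego_graph(toy, "C^^1")
print(len(nodes), len(edges))
```

With three invoices the client's ego graph has four nodes and six ties, the same pattern as the 47 nodes and 92 ties above. The product is two steps away from the client, so it is left out.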
Now, we may be interested in knowing which products were sold only once to our clients. Some of them may be very profitable, and we could analyze how the products we want to promote have been sold to other clients in order to create a specific marketing plan for them.
The number of ties of a node is called its degree, so we want to sort our products by degree. Over 400 of them appear on only one invoice. To mention just a few codes:
StockCode | Description | Price (GBP) |
S^^10125 | MINI FUNKY DESIGN TAPES | 0.85 |
S^^10125 | MINI FUNKY DESIGN TAPES | 0.42 |
S^^10135 | COLOURING PENCILS BROWN TUBE | 0.25 |
S^^10135 | COLOURING PENCILS BROWN TUBE | 1.25 |
S^^10135 | COLOURING PENCILS BROWN TUBE | 0.42 |
S^^10135 | COLOURING PENCILS BROWN TUBE | 2.46 |
S^^10135 | COLOURING PENCILS BROWN TUBE | 1.06 |
S^^15034 | PAPER POCKET TRAVELING FAN | 0.14 |
S^^15034 | PAPER POCKET TRAVELING FAN | 0.07 |
S^^15039 | SANDALWOOD FAN | 0.85 |
S^^15039 | SANDALWOOD FAN | 0.53 |
S^^15044A | PINK PAPER PARASOL | 5.79 |
S^^15044A | PINK PAPER PARASOL | 2.95 |
S^^15044A | PINK PAPER PARASOL | 2.55 |
S^^15044B | BLUE PAPER PARASOL | 2.55 |
S^^15044B | BLUE PAPER PARASOL | 2.95 |
S^^15044C | PURPLE PAPER PARASOL | 2.95 |
S^^15044C | PURPLE PAPER PARASOL | 2.55 |
S^^15044D | RED PAPER PARASOL | 2.55 |
S^^15044D | RED PAPER PARASOL | 2.95 |
S^^15044D | RED PAPER PARASOL | 5.79 |
S^^15058C | ICE CREAM DESIGN GARDEN PARASOL | 7.95 |
S^^15058C | ICE CREAM DESIGN GARDEN PARASOL | 3.95 |
S^^16043 | POP ART PUSH DOWN RUBBER | 0.12 |
At this point we realize that the product codes should be split into sub-codes for each variety, but that is unnecessary for the purposes of this example.
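Sorting products by degree, as done above, can be sketched as follows. The function name `product_degrees` and the invoice lines are hypothetical, and the degree of a product is taken here as the number of distinct invoices it appears on.

```python
from collections import defaultdict


def product_degrees(lines):
    """Map each product vertex to the number of distinct invoices it appears on."""
    invoices = defaultdict(set)
    for invoice_no, stock_code in lines:
        invoices["S^^" + stock_code].add("I^^" + invoice_no)
    return {product: len(found) for product, found in invoices.items()}


# Hypothetical (invoice, stock code) pairs, made up for this example.
lines = [
    ("1001", "21166"), ("1002", "21166"), ("1003", "21166"),
    ("1001", "10125"),
    ("1002", "15039"), ("1002", "15039"),  # repeated line on the same invoice
]
degrees = product_degrees(lines)
ranked = sorted(degrees.items(), key=lambda item: item[1])
sold_once = [product for product, degree in ranked if degree == 1]
print(ranked)
print(sold_once)
```

Using a set per product means that a product listed twice on the same invoice still counts as a single tie.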
Centrality Study
Two important centrality measures are:
- Closeness: The nodes that are, on average, the fewest steps away from all the other nodes are considered central nodes. They are usually well connected, which makes them good candidates to support the entire system, and they can have a big overall influence on the other nodes.
- Betweenness: The nodes with the highest betweenness may not have many connections, but the connections they have are critical to the system: if those nodes are lost, important parts of the system will be disconnected or pushed away from the rest. These nodes are good candidates for critical maintenance plans since they are key to the cohesion of the network. They also receive more information than others, since they often lie on the only path, or on the shortest path, connecting other nodes across the network.
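Both measures can be computed from shortest paths. The sketch below is a plain-Python illustration on a hypothetical toy network, not the code behind the figures in this article. It assumes unweighted edges stored in both directions, takes closeness as the inverse of the average shortest-path distance to the reachable nodes, and computes unnormalised betweenness with Brandes' algorithm, counting ordered pairs of nodes.

```python
from collections import deque


def closeness(graph, node):
    """Inverse of the average shortest-path distance from `node` (unweighted)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def betweenness(graph):
    """Brandes' algorithm for unweighted graphs, over ordered (source, target) pairs."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack, queue = [], deque([s])
        preds = {v: [] for v in graph}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical toy network: two invoices that share product B.
toy = {
    "I^^1": ["S^^A", "S^^B"],
    "I^^2": ["S^^B", "S^^C"],
    "S^^A": ["I^^1"],
    "S^^B": ["I^^1", "I^^2"],
    "S^^C": ["I^^2"],
}
scores = betweenness(toy)
print(max(scores, key=scores.get))
print(closeness(toy, "S^^B") > closeness(toy, "S^^A"))
```

In this toy network the shared product is the only bridge between the two invoices, so it gets the highest betweenness even though it has just two ties.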
Closeness is very similar to measures of central tendency such as the median, whereas betweenness is more related to the links between different parts of the system, and I find it very interesting to explore. The top products with the highest betweenness are:
StockCode | Betweenness |
S^^82484 | 5.98% |
S^^85123A | 3.43% |
S^^85099B | 3.42% |
S^^21166 | 2.88% |
S^^22993 | 1.87% |
S^^22189 | 1.80% |
S^^21080 | 1.68% |
S^^22423 | 1.54% |
S^^22170 | 1.49% |
S^^21181 | 1.45% |
S^^82482 | 1.45% |
S^^23203 | 1.35% |
S^^22197 | 1.31% |
S^^82600 | 1.28% |
S^^20685 | 1.26% |
S^^21212 | 1.24% |
S^^22960 | 1.22% |
S^^22469 | 1.17% |
S^^21175 | 1.05% |
S^^20750 | 1.04% |
S^^21174 | 1.02% |
S^^21876 | 1.00% |
S^^23202 | 0.98% |
S^^21428 | 0.97% |
S^^22659 | 0.91% |
Is any of those nodes a product that promotes upselling, given that they seem to be linked to otherwise exclusive nodes, or that many other nodes exist wherever they are present?
To answer this question we should analyze the invoices and the related products, not necessarily using graphs. Let’s take a look at the ego graph of one of those products:
We can observe the invoices related to that node (S^^82484) within the data subset of the top five customers. The same can be done with the rest, for example with the second product in the ranking:
Now we have two products whose betweenness is the highest found across the top-five clients subset. We could analyze whether those products are key to increasing the sales amount, for example by offering them to other clients within the network to facilitate the creation of new links and greater invoice amounts.
We could also be interested in those top five customers, who account for 22 percent of the total sales: it is worth exploring how to improve their loyalty, since improving on the best is always the most difficult task.
Many other types of data analytics can be performed using graph and network theories, not only on products but also on companies, society and other complex domains and systems such as ecosystems, medicine, business processes or politics.
This article offers only a brief glimpse of network and graph theory and of how it can actually be applied to multiple problems.