4.2 Analysing Data

The Digital Firm

Estimated Reading Time: 28 minutes


Statistical thinking is merely an artifact of human ignorance. We need to use statistics because nature does not.

Pierre-Simon Laplace

Introduction

In this session, we learn the very basics of data analysis. Here’s the program for today.

After a discussion of dimensional modelling, we will move on to collective intelligence and then very briefly introduce modern analytics.

Dimensional Models

In the theory sessions, you heard about dimensional modelling, OLAP cubes and whatnot. What are those? At their core, those concepts boil down to a simple idea: facts do not live in a vacuum. If I tell you “Sales were 150,000”, I have not given you any meaningful information. Sales of what? Sales where? Over which period of time? 150,000 of which unit? Intuitively, we all understand that we should provide facts along some dimensions, and we usually do.

Starting from this idea, the whole field of traditional business intelligence (or BI) created a number of concepts and techniques so that business people can access data in a format they intuitively understand. The need for up-to-date data created a tension in organisations between two “classes” of IT workers: those working on operations and those working on data and analytics. Although the skills are roughly transferable, even more so at the management level, the focus is quite different and the systems are usually decoupled.

While traditional IT owns most of the systems, both core (such as ERP, CRM, Master Data Management,…) and legacy (all the systems that are not directly involved in the core productive process), the data and analytics teams usually own big database systems. Traditionally, the process was organised around three phases: the ETL, for Extract-Transform-Load, the semantic layer, usually a data warehouse, and the presentation layer, usually data cubes. While very big companies today have a parallel structure for advanced data analysts, built around data lakes that combine the ETL and semantic layers and dispense with the presentation layer altogether, most non-technical people, even in very large companies, still access data through cubes: the precompiled data are easy to use and most people have, in fact, relatively stable needs in terms of data. Therefore, we will spend a little time understanding the traditional data flow.

ETL Layer

The ETL step is usually the messy one. Indeed, it involves interfacing your data analysis infrastructure with the rest of the world. The rest of the world is scary and messy; the world inside a data warehouse is neat and tidy (that is why people build them in the first place). You can see that, in order to brutally reduce entropy at the border between those two worlds, you need serious design, and that the process is relatively turbulent. Passing from an uncontrolled to a controlled environment usually requires serious organisation. Indeed, the ETL part of the data analysis workflow, even for a one-time analysis, is usually the trickiest part. It involves extracting data from a source (often what are called transactional systems, as opposed to analytical systems, or even the web), transforming them to standardize them and, finally, loading and storing them in the data infrastructure.

This process is usually done in batches: once a day, for instance, the whole database underpinning the ERP system is copied into the data warehouse. Indeed, running this extract sometimes has side effects (even though it shouldn’t, it might sometimes modify or break something) and in any case requires resources. Since it is more important to be able to perform the daily operations than to have analytics, it makes sense to contain the risky moment to a batch “window”. In other cases, data is ingested in real time.
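To fix ideas, here is a deliberately naive sketch of what such a nightly batch could look like in Python (the file, table and column names are invented for illustration; real ETL jobs use dedicated tools, incremental loads and a lot more error handling):

import sqlite3
import pandas as pd

# Extract: read a (hypothetical) nightly export of the transactional system
orders = pd.read_csv("erp_orders_export.csv")

# Transform: standardize names and types so they match the warehouse conventions
orders.columns = [c.strip().lower() for c in orders.columns]
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Load: append the batch to a table in the analytical database
warehouse = sqlite3.connect("warehouse.db")
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()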

Still, in the context of this class, and since we do not have a full-blown transactional system at our disposal, we have already seen the toolbox for the ETL part: we extract data with either web scraping or APIs. You already know about those, so I won’t spend any more time on that.

Semantic Layer

The goal of an analytics system is to answer business questions. In order to do so, your data must have meaning. Meaning is usually built by providing context to datapoints. Context may be thought of as dimensions (in the mathematical sense of the term) on which we position the datapoints.

A good part of the ETL process consists in reconciling dimensions to make them consistent. Usually, the most important dimension is time: assume we want to check whether our sales increase when the buzz around them on social media increases. To know that, we need to align our records of sales coming from our CRM with the ones coming directly from the social media. Fortunately, time is pretty much the same everywhere and for everyone (timezones notwithstanding). For other kinds of data, it is trickier: in our CRM, we have the sales at the level of the municipality or even the address (because we know where we ship), while it is hard, sometimes impossible or even illegal, to get data with such granularity about revenues, for example. In this case, we need to decide how to reconcile them. Do we really need information at the address level and, if so, how are we going to impute values? Or are we OK with aggregating sales at the municipality level?
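As a toy illustration of that last option, here is how one could roll address-level sales up to the municipality level with pandas (the dataframe and its columns are invented for the example):

import pandas as pd

# Hypothetical address-level sales coming out of the CRM
sales = pd.DataFrame({
    "address":      ["1 Main St", "2 Main St", "5 High St"],
    "municipality": ["Springfield", "Springfield", "Shelbyville"],
    "amount":       [120.0, 80.0, 200.0],
})

# Aggregate to the coarser granularity shared with the other data source
sales.groupby("municipality", as_index=False)["amount"].sum()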

Once we have that, we have a semantic layer: a part of the database where data is embedded in context. In this case, data is usually organised in a dimensional model. In such a model, numerical values are gathered in tables called fact tables. Besides these values, fact tables only contain foreign keys corresponding to points in the dimensions.

Example of a dimensional model: the Sales table is the fact table and all the rounds are dimensions (organised in several levels) - schema by Matteo Golfarelli - License: CC BY-SA

On this schema, book sales are gathered in a fact table and contextualised along several dimensions: on the left, dimensions related to the product - the exact book, which is then organised by author or by genre - and on the right the time dimension - each sale was made on a certain date, from which we can retrieve the week or the month and then the year. This specific organisation, where the hierarchies inside the dimensions are spread over several tables, is called a snowflake schema (because it kind of looks like a snowflake, with little arms going out from a centre - the fact table). In traditional dimensional modelling, the facts must be additive. This means that it must make business sense to add two rows of the fact table. This is the case here: adding the revenue of all the sales of a same book gives you the revenue this book generated. Similarly, adding all sales from a same date gives you the total revenue generated on that date. It is less clear for the discount: if the discount is expressed as a percentage of the revenue (for example a discount of \(0.05\) or \(5\%\)), then it is not additive: summing all the discount rates of the sales of a same book does not give you the global discount rate. However, in this case, the discount is likely expressed in euros, meaning that, on a row, if the discount field contains \(1\), for instance, the database analyst will understand that a discount of \(1\) euro was granted on the sale of that book. Now, if you add all the discounts along the book dimension, you do retrieve the total discounts granted on those books (and dividing by the sum of the revenues plus discounts gives you the discount rate for a book, or a date, depending on which dimension you aggregate along).
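To make this a bit more concrete, here is a minimal, made-up sketch of what the tables behind such a model could look like in SQL (the table and column names are invented for illustration, and the author/genre hierarchy is flattened into the book dimension for brevity):

import sqlite3

demo = sqlite3.connect(":memory:")  # throwaway in-memory database, purely for illustration
demo.executescript("""
    CREATE TABLE dim_book (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT, genre TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE fact_sales (
        book_id  INTEGER REFERENCES dim_book(book_id),  -- foreign keys pointing to the dimensions
        date_id  INTEGER REFERENCES dim_date(date_id),
        quantity INTEGER,                               -- additive measures
        revenue  REAL,
        discount REAL
    );
""")

Every row of fact_sales is one sale: a few numbers plus pointers into the dimensions, exactly as described above.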

At this point, you’re probably wondering what the point of all this is. In fact, organising your database in such a way makes it super easy to answer most of the business questions that typical business analysts ask. Let’s take it for a little spin. Unfortunately, publicly available snowflake or star schemas (dimensional models where the hierarchies are “flattened” into a single table per dimension) are relatively hard to come by, and we have no time to build one from a transactional system Those interested in understanding how to go from a transactional system to a dimensional one are invited to follow or get the material for the class Business Intelligence and Data Science during the second semester. . We are going to use a synthetic database created using this tool.

Let’s start by downloading the database that I generated (this code might run for a long time the first time, since the database is about 100 MB):

import pandas as pd
import sqlite3

# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)

# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source

import os.path
if not os.path.isfile("SSB.db"): # If you don't have the file yet...
    import urllib.request 
    urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/SSB.db", "SSB.db") # download it

con = sqlite3.connect("SSB.db") # create a connection with the database

Here’s what this database looks like:

The star schema

It seems simple enough: a fact table and 4 dimensions. Note that it differs from the schema above, which we called a snowflake schema, in that each dimension is comprised of only one table. Look, for example, at the time dimension. Sure, it is at the level of the day, but you can see that the information about the year has been “rolled up” into that table, so that the hierarchical structure of the dimension is represented by several columns in one table rather than several tables. This is a design decision that makes writing queries easier later on.

Now, talking about queries, let’s see how easy it is to answer business questions when we have a star schema! It all boils down to the fact that the joins here are trivial. Long gone are the days when we had to do crazy joins across distant tables. In a star schema, everything is only one table away. Let’s see a series of questions and how they translate into code. First, we need to realise that the database is relatively heavy: it contains a little more than \(1,000,000\) rows of facts, so we are above Excel scale here. So be a bit patient if the answers take a moment to arrive.

Let’s start easy: How much did we sell in 1995 to customers in Europe?

query = """ SELECT SUM(lo_ordtotalprice)
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            WHERE d.d_year = 1995 
                AND c.c_region = "EUROPE" 
        """

pd.read_sql_query(query, con)

SUM(lo_ordtotalprice)
0 582823124941

Boom! Blazing fast, crazy simple! But before I drop the mic and we move on to the next topic, I can sense some of you are still doubtful. Why am I hyping dimensional models so much? Well, it is because the very simple query we wrote above is about as hard a query as you will ever have to write in a dimensional model. The structure of a query in a dimensional model will always be the same (a worked instance follows the list):

  • The SELECT part will always be a mix of unaggregated dimension fields and aggregated measures
  • The FROM part will always join the fact table with dimension tables and sometimes (in the snowflake case) dimension tables within the same hierarchy
  • The WHERE part is a filter on one or more dimensions
  • The GROUP BY will always be along the dimensions of the unaggregated fields in the SELECT
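Here is one more instance of that four-part pattern against our database, with each clause annotated (total revenue per year for customers located in Asia; it reuses the con connection created above):

query = """ SELECT d.d_year, SUM(lo_revenue) AS revenue          -- unaggregated dimension field + aggregated measure
            FROM lineorder lo                                     -- the fact table...
                JOIN date d ON lo.lo_orderdate = d.d_datekey      -- ...joined to its dimensions
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            WHERE c.c_region = "ASIA"                             -- filter on a dimension
            GROUP BY d.d_year                                     -- group along the unaggregated dimension field
        """

pd.read_sql_query(query, con)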

If you’re one of those Excel wizards, and by this point you probably are, you’re probably thinking “pivot table” right now. You are right! We will come back to that. But for now, let me just show you a couple more queries and ask you to write a few as well. Here’s how we can get the cost of goods sold paid to suppliers from Middle Eastern countries from 1990 to 1995, broken down by the country of the supplier.

query = """ SELECT s.s_nation, SUM(lo_supplycost)
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN supplier s ON lo.lo_suppkey = s.s_suppkey
            WHERE d.d_year BETWEEN 1990 AND 1995 
                AND s.s_region = "MIDDLE EAST"
            GROUP BY s.s_nation
        """

pd.read_sql_query(query, con)

s_nation SUM(lo_supplycost)
0 EGYPT 2336691909
1 IRAN 2280002613
2 IRAQ 2288083426
3 JORDAN 1814785982
4 SAUDI ARABIA 2509290291

The query is remarkably similar to the previous one in its construction. Let’s now see from which customer countries the supply costs going to the Middle East come.

query = """ SELECT c.c_nation, s.s_nation, SUM(lo_supplycost)
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN customer c on lo.lo_custkey = c.c_custkey
                JOIN supplier s ON lo.lo_suppkey = s.s_suppkey
            WHERE d.d_year BETWEEN 1990 AND 1995 
                AND s.s_region = "MIDDLE EAST"
            GROUP BY c.c_nation, s.s_nation
        """

pd.read_sql_query(query, con)

c_nation s_nation SUM(lo_supplycost)
0 ALGERIA EGYPT 91510962
1 ALGERIA IRAN 86235221
2 ALGERIA IRAQ 87029185
3 ALGERIA JORDAN 70257007
4 ALGERIA SAUDI ARABIA 94949454
... ... ... ...
120 VIETNAM EGYPT 93074372
121 VIETNAM IRAN 87627326
122 VIETNAM IRAQ 88861180
123 VIETNAM JORDAN 74356296
124 VIETNAM SAUDI ARABIA 103470364

125 rows × 3 columns

Again, this is just a minor variation on the original query. Here are a few queries that you might try as exercises.

Exercise 1 - The Color of Money

For each year, compute the total revenue generated per color of part. Sort the results chronologically and, within each year, by decreasing revenue.

See solution
query = """ SELECT d.d_year, p.p_color, SUM(lo_revenue) as revenues
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN part p on lo.lo_partkey = p.p_partkey
            GROUP BY d.d_year, p.p_color
            ORDER BY d.d_year ASC, revenues DESC 
        """

pd.read_sql_query(query, con)

d_year p_color revenues
0 1992 blush 8013806433
1 1992 beige 8003628753
2 1992 blue 7954274115
3 1992 antique 7909984436
4 1992 aquamarine 7854998354
... ... ... ...
639 1998 spring 2674484870
640 1998 yellow 2596359593
641 1998 wheat 2572572362
642 1998 white 2545421666
643 1998 tomato 2522630675

644 rows × 3 columns

Exercise 2 - Multinational, Multimodal

For each region and for the whole dataset, what is the total quantity of products of color “chartreuse” that have been shipped with each shipment method?

See solution
query = """ SELECT c.c_region, lo.lo_shipmode, SUM(lo_quantity) as quantity
            FROM lineorder lo
                JOIN part p ON lo.lo_partkey = p.p_partkey
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            WHERE p.p_color = "chartreuse"
            GROUP BY c.c_region, lo.lo_shipmode
        """

pd.read_sql_query(query, con)

c_region lo_shipmode quantity
0 AFRICA AIR 10476
1 AFRICA FOB 9541
2 AFRICA MAIL 9197
3 AFRICA RAIL 10151
4 AFRICA REG AIR 9677
5 AFRICA SHIP 10400
6 AFRICA TRUCK 11537
7 AMERICA AIR 10802
8 AMERICA FOB 10303
9 AMERICA MAIL 10238
10 AMERICA RAIL 10313
11 AMERICA REG AIR 9810
12 AMERICA SHIP 10667
13 AMERICA TRUCK 9608
14 ASIA AIR 9091
15 ASIA FOB 11126
16 ASIA MAIL 10240
17 ASIA RAIL 9789
18 ASIA REG AIR 10475
19 ASIA SHIP 9901
20 ASIA TRUCK 9547
21 EUROPE AIR 9502
22 EUROPE FOB 8906
23 EUROPE MAIL 10244
24 EUROPE RAIL 9520
25 EUROPE REG AIR 10384
26 EUROPE SHIP 10926
27 EUROPE TRUCK 10372
28 MIDDLE EAST AIR 10023
29 MIDDLE EAST FOB 11655
30 MIDDLE EAST MAIL 11009
31 MIDDLE EAST RAIL 9883
32 MIDDLE EAST REG AIR 9009
33 MIDDLE EAST SHIP 9865
34 MIDDLE EAST TRUCK 10256

Exercise 3 - On Holiday

On average, do we sell more expensive items (in terms of average order price per unit sold) during regular days or holidays?

See solution
query = """ SELECT holiday, tot_sales / quant
            FROM
                (SELECT d.d_holidayfl as holiday, SUM(lo_ordtotalprice) tot_sales, SUM(lo_quantity) as quant
                FROM lineorder lo
                    JOIN date d ON lo.lo_orderdate = d.d_datekey
                GROUP BY holiday)
        """

pd.read_sql_query(query, con)

holiday tot_sales / quant
0 0 741151
1 1 738927

You get the gist. Writing queries against a star schema (and, to a lesser extent, a snowflake schema) is relatively easy. So easy, in fact, that the average business analyst, even if she is not versed in the subtle art of SQL querying, is able to manage. But anything that looks like code inspires fear in the average person. What people like are spreadsheets. Although you and I know that spreadsheets are dry and uninspiring, they are intuitive and integrate with the workflow in most companies. Thus the pivot table was born: it is just a nice user interface on top of a dimensional model.

One issue remains: Excel is slow and resource-hungry. It brings the computer to a screeching halt whenever you ask it to crunch a few million rows… That’s quite the conundrum… what to do?

Presentation Layer

Data workers have added a last layer, the presentation layer. It can take many forms: the shiny dashboards in Tableau, SQL terminals for geeks or, and that is what we will see here, OLAP cubes. OLAP, for On-Line Analytical Processing, is in fact just a set of pre-computed dimensional queries, aggregated at each level of the hierarchies and served to the spreadsheets of the analysts. That way, analysts do not need to work with millions of rows but with a few thousand, and the computationally expensive steps are done on big servers rather than on the small laptops or desktops of the people consuming the data.

An OLAP cube is just that: a set of pre-computed aggregations on a set of dimensions. It is in fact only a cube when there are three dimensions with the same number of components, but the term stuck. One has to admit that talking about “releasing the cube” or “querying the cube” adds a touch of mystery to the description of a desk job. And isn’t that what matters in the end?
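Just to demystify the term, here is a crude, do-it-yourself sketch of a tiny “cube” built from our SSB database: we pre-aggregate revenue along two dimension levels (year and customer region) and pivot the result so it reads like a small spreadsheet. A real OLAP engine pre-computes many such aggregations, at every level of every hierarchy, and serves them on demand.

query = """ SELECT d.d_year, c.c_region, SUM(lo_revenue) AS revenue
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            GROUP BY d.d_year, c.c_region
        """

mini_cube = pd.read_sql_query(query, con)

# Years as rows, regions as columns, total revenue in the cells
mini_cube.pivot(index="d_year", columns="c_region", values="revenue")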

Collective Intelligence

The internet has been weird lately. Everywhere you go, your every move is tracked and data about it is stored somewhere for people and corporations to use. While I personally believe this goes too far Although you can take some steps to mitigate the effects. , it is not unequivocally bad. When someone has information about a lot of people (and most big corporations do), they can sometimes do useful things with it for the same people the information is about. This is the mass-customization people talk about. The technical side of it has been referred to as collective intelligence.

Collective intelligence comes in many shapes and forms. Here, we are going to work through a light application on a classical problem: which films to recommend to somebody. We start by downloading a classical dataset containing movie ratings by a bunch of people (6,000 of them). For copyright reasons, I understand I cannot provide it to you directly. But I also assume I can provide you with code that will download, extract and strip the set of unnecessary data… If you’re a lawyer and you’re reading this, feel free to correct me and I’ll find some other sophistry to make it easy to get that data without any manual labour. A nice place to remember Aaron Swartz. Here goes the code:

import os.path, shutil, os
from zipfile import ZipFile

if not os.path.isfile("ratings.dat"): # If you don't have the file yet...
    import urllib.request 
    urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip",
                               "ml-1m.zip")
    with ZipFile("ml-1m.zip", 'r') as zip_ref:
        zip_ref.extract("ml-1m/ratings.dat", path = ".")
        zip_ref.extract("ml-1m/movies.dat")
        zip_ref.extract("ml-1m/users.dat")
    shutil.move("./ml-1m/ratings.dat", ".")
    shutil.move("./ml-1m/movies.dat", ".")
    shutil.move("./ml-1m/users.dat", ".")
    shutil.rmtree("./ml-1m")
    os.remove("./ml-1m.zip")

Mouahahaha, is there anything we cannot do with code? Anyway. Open the files you just downloaded using a text editor and take a look at their formatting: they are series of lines where the “fields” are separated by two colons: ::. We can load them using pandas into dataframes that will be easier to manipulate.

import pandas as pd

movies = pd.read_csv("movies.dat", sep="::",
                     names = ["ID", "Name", "Type"],
                     encoding = "latin-1") # some movie titles contain non-UTF-8 characters

users = pd.read_csv("users.dat", sep="::",
                    names = ["ID", "Gender", 
                            "Age", "Occupation",
                            "ZIP"])

ratings = pd.read_csv("ratings.dat", sep="::",
                     names = ["User", "Movie", 
                              "Rating", "Timestamp"])

You might get a warning that you can happily ignore It tells you that the default parser cannot handle a separator composed of two characters, so pandas hands the problem over to its slower Python engine, which knows how to handle such cases. If this bothers you, just add the parameter engine = "python" to the read_csv calls . We now have three dataframes: one containing information about the movies (which we will in fact not use that much), one describing the users, and one containing the rating every user gave to the movies they watched.

We are going to use a very mathematical but nonetheless relatively effective method to tackle this collaborative filtering problem, called Singular Value Decomposition See here for intuition about what the singular values are. . It supposes that we have a matrix containing the ratings, with every row representing an individual and every column representing a movie.

It’s big brain time! To transform the dataframe into such a matrix, we could just initialise a matrix full of zeros, insert the ratings and be done with it… but it is a bad idea. Most people haven’t rated most of the movies, so the rows would be full of zeros with a few ratings here and there. In the end, this would give the impression that people who haven’t rated a movie actually hated it… Surely we can do better. Since we are not really interested in keeping the information about which films are absolutely good or bad, we will standardize every column (one per movie) so that its mean is 0 and the modified rating of the people who did evaluate it represents their deviation from the mean of the other ratings. Scary, huh? Not quite; let’s do this step by step and using as little “magic” as possible. Let’s start by computing the average and standard deviation of the ratings for each of the movies. For some reason, the IDs of users and films start at 1. This is not Pythonic, so I’ll convert those IDs to 0-based indices. This will come in handy later, when we build the actual matrix.
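In symbols, if \(r_{u,m}\) is the rating user \(u\) gave to movie \(m\), and \(\mu_m\) and \(\sigma_m\) are the mean and standard deviation of the ratings received by movie \(m\), the value we will store in the matrix is \( \tilde{r}_{u,m} = \frac{r_{u,m} - \mu_m}{\sigma_m} \), while cells for which user \(u\) gave no rating stay at \(0\), i.e. at the (standardized) average opinion about that movie.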

import numpy as np

# Re-index users and movies so that their IDs become 0-based positions
users = users.reset_index()
movies = movies.reset_index()

users = users.set_index("ID")
movies = movies.set_index("ID")

# Map the original 1-based IDs in the ratings to the new 0-based positions
rebase_users = lambda x : int(users.loc[x, "index"])
rebase_movies = lambda x : int(movies.loc[x, "index"])

ratings["User"] = list(map(rebase_users, ratings.User))
ratings["Movie"] = list(map(rebase_movies, ratings.Movie))

users = users.set_index("index")
movies = movies.set_index("index")

# Mean and standard deviation of the ratings received by each movie
movies_spec = ratings.groupby("Movie").agg(["mean", "std"])["Rating"]

movies_spec[:5]

mean std
Movie
0 4.146846 0.852349
1 3.201141 0.983172
2 3.016736 1.071712
3 2.729412 1.013381
4 3.006757 1.025086

Let’s check: what is the movie at position 0 (ID 1 in the original dataframe), which got a high average rating? Toy Story… OK, quite good. And the fourth, which got a rather low average rating? Waiting to Exhale. Never heard of it. Apparently it got \(56\%\) on Rotten Tomatoes, so it is consistent with our score here. Cool, everything seems to be working. Moving on.
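If you want to reproduce that check yourself, you can look the titles up directly, since the movies dataframe now shares the same 0-based index as movies_spec (positions 0 and 3 correspond to the original IDs 1 and 4):

movies.loc[[0, 3], "Name"]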

We are now going to:

  1. Initiate a matrix with a bunch of zeros
  2. Iterate over the ratings dataframe:
  • For each row, compute the standardized rating
  • Insert it into the matrix
Let’s go

shape = (int(np.max(ratings.User)) + 1,
        int(np.max(ratings.Movie)) + 1)

design_matrix = np.full(shape, 0.0)

for _idx, row in ratings.iterrows():
    mean_movie = movies_spec.loc[row["Movie"]]["mean"]
    std_movie = movies_spec.loc[row["Movie"]]["std"]
    
    standardized_rating = (row["Rating"] - mean_movie) / std_movie
    
    design_matrix[row["User"], row["Movie"]] = standardized_rating

# Movies with too few or identical ratings have a zero or undefined std,
# which produces NaNs above; replace those with 0 to keep the matrix numerical
design_matrix = np.nan_to_num(design_matrix)

Let’s take a look at the beginning of the matrix.

design_matrix[:3, :3]
array([[1.00094387, 0.        , 0.        ],
       [0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        ]])

And verify that the first three columns are centered:

np.sum(design_matrix[:,:3], 0)
array([4.65849581e-13, 4.44089210e-14, 1.37223566e-13])

Those look like numerical errors If computers were able to represent numbers perfectly, you would expect those to be exactly 0. However, having such a small remainder after adding 6,000 values is OK … Close enough. We can now continue. Now for the big leap: we are going to decompose the matrix into its singular vectors. Ready? Here we go.

design_matrix.shape
(6040, 3883)
from scipy.sparse.linalg import svds
U, S, M_filtered = svds(design_matrix, k = 100)

That’s all… we now have to analyse the result. We produced those matrices with a twist: we reduced the number of dimensions on which the information is represented. Concretely, we do everything as if there were only 100 factors that determined the grade each viewer would assign to each film. It is a monumental trick: we determine, simultaneously, the similarity of people and of movies. Each person is represented in a space of 100 dimensions, and each movie is also represented in a space of 100 dimensions.
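If you want to convince yourself of what svds returned, you can simply inspect the shapes of the three objects: one row per viewer in U, one column per movie in M_filtered, and the 100 singular values in S in between.

U.shape, S.shape, M_filtered.shape
# expected: (6040, 100), (100,), (100, 3883)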

Your head is spinning? Understandable. But now we can retrieve a result that makes business sense: if we multiply the three matrices, we get a set of ratings that “could be observed” if the film choices were really driven by 100 abstract factors this number is a wild guess, use whatever value you’re comfortable with. Higher values allow for more variation, which is both a blessing, because you allow for greater customization, and a curse, because your recommendations become more noisy. On the topic, you can read about the curse of dimensionality and the excellent bias-variance trade-off article on Wikipedia. . Technically, we could gauge the “strength” of the signal on each dimension by looking at the absolute value of the singular values, but we will leave that for a finer analysis; here we just decide to keep the first 100 vectors.

new_base = np.dot(np.dot(U, np.diag(S)), M_filtered)
new_base[:3,:3]
array([[ 0.59018665, -0.04393189,  0.02897357],
       [ 0.16460058, -0.05389524,  0.08092475],
       [-0.28872293,  0.04940678,  0.03531134]])

Let’s identify a random user and check whether the films she actually likes are indeed similar to the films our system would recommend to her. Let’s start by inverting the transformation we made earlier. We will work with the column vectors to save some time In most programming languages, vectorized operations are faster than loops for a series of reasons. This is certainly the case for R, Matlab and most scientific programming languages, with the notable exception of Julia (for most cases). For Python, it depends on the packages you use. If performance is a concern for you, read up on the topic and make some experiments. . We then convert the matrix back to a “tall” dataframe with columns for the viewer ID and the movie ID in the 0-based index (to ensure consistency with the movies dataframe).

recommendations_mat = new_base.copy()

# Undo the standardization one movie (column) at a time:
# multiply by the movie's std and add back its mean rating
for movie in movies_spec.index:
    mean_movie = movies_spec.loc[movie]["mean"]
    std_movie = movies_spec.loc[movie]["std"]
    recommendations_mat[:, movie] = new_base[:, movie] * std_movie + mean_movie
    
recommendations = pd.DataFrame(recommendations_mat).reset_index()

recommendations_melted = recommendations.melt(id_vars = "index", var_name = "Movie",
                                      value_name = "Simulated Score")

recommendations_melted.columns = ["ID", "Movie", "Simulated Score"]

We now draw a viewer randomly and check the type of movies she evaluated the best.

viewer = np.random.choice(recommendations["index"])
viewer
5825

And we compare the list of the highest rated movies for this viewer and the top recommendations.

movies.loc[ratings.loc[ratings.User == viewer].sort_values("Rating", ascending = False)["Movie"]][:5]

Name Type
index
1216 Harold and Maude (1971) Comedy
1899 Breakfast Club, The (1985) Comedy|Drama
293 Pulp Fiction (1994) Crime|Drama
2928 Being John Malkovich (1999) Comedy
2789 American Beauty (1999) Comedy|Drama

Let’s now build our… TOP5!

Oh Yes

movies.loc[recommendations_melted.loc[recommendations_melted.ID == viewer].sort_values("Simulated Score", ascending = False)["Movie"].values][:5]

Name Type
index
589 Silence of the Lambs, The (1991) Drama|Thriller
2928 Being John Malkovich (1999) Comedy
293 Pulp Fiction (1994) Crime|Drama
2789 American Beauty (1999) Comedy|Drama
49 Usual Suspects, The (1995) Crime|Thriller

Here we see that the recommender system actually retrieves most of the movies the person rated the highest. It seems to indicate that the system got at least a reasonable idea of what that person likes. It will not work every time, of course, because some people have too idiosyncratic a taste to lend itself to such primitive modelling, but it is interesting to know that, just by having information about what other people like, and without even looking at the individual characteristics of the person, you can already build a recommender system. I won’t put an exercise here, but try to get the broad lines of how the code above works.

This is cool, but it is sooooooo 2010…

2010... a lifetime ago. It was a weird time. Wakawaka... eheh.. Wakawaka... ehoh

Since then, analytics has become much more advanced… much darker also. Eigenvector- or singular-vector-based techniques are relatively mechanical in the way they work (they use properties of the space and then distort it to get a notion of proximity), and they are not really data-hungry: they take observations about something in a sparse matrix and then try to extrapolate values to the empty cells. They do not easily accommodate predictors: indeed, imagine you added the age (or any other attribute) of the viewer in the analysis above; when rebasing the matrix, you would end up “predicting” the age of the viewer (which you already know) in the process, and the whole analysis becomes super messy. Those techniques are also difficult to tune to one’s desires, and they tend to only convince mathematicians and mathematically-minded people. Finally, they tend to bring people to a baseline and then segregate them around this baseline If you followed closely what we did: we added back the average score of each movie and only let it vary by the deviation from one’s own supposed assessment. .

Enter two Knights of the Privacy Apocalypse: Statistics and Computer Science. So begins the era of…

Modern Analytics - The Next Frontier

The Mechanical Turk Supposed for a long time to be one of the first automatons that played chess. Turns out somebody was inside the machine making all the decisions. lol. Like all cool apocalypse stories, this one has been long in the making. The concept of “Artificial Intelligence”, which probably goes back to Babbage and Ada Lovelace and gained traction with Alan Turing and his imitation game (better known as the Turing test), made some cool advances in the ’70s… but, even at the time, it was closer to the Mechanical Turk than to Asimov’s robots. Indeed, the approach to artificial intelligence was based on the manipulation of symbols or concepts. Researchers at the time were mostly trying to make machines “learn” the way one learns mathematics: start with axioms, build some theorems, use the results of those theorems to build other theorems, rinse and repeat.

As it turns out, this did not work so great. Despite early successes, artificial intelligence was considered old news for a long time (the so-called AI Winter) and interest faded. But, much like the One Ring, it was found again and came back in the possession of a Hobb… I mean, in the form of statistical machine learning.

Déagol finding the Ring I can't believe I waited this long to put a Lord of the Rings gif...

Indeed, while statistics had existed for a couple of centuries, the specific type of statistics useful for the tasks AI was asked to solve, and some of the required techniques, were out of reach for the mathematicians of the time. Indeed, they required fast computers and, in many cases, the ability to perform matrix computations very fast. Bizarrely, one of the reasons we now have those tools is the gaming community. Since one of the most important requirements for good 3D graphics is also fast matrix computation And some ill-intentioned people would say that gamers had money to spare because of their lack of social life. , manufacturers started to build increasingly fast Graphical Processing Units at prices that were accessible even to university researchers, who in turn used them to implement algorithms and concepts that had existed for a long time but were not really used because they were impractical. This is the story of AI in a nutshell, and how gamers saved the world… until they destroy it…

Indeed, while traditional AI techniques based on symbolic computation do not require data, those new algorithms need a lot of data to “learn”. How much data? Much, much data. But remember the start of our journey: a computer is just a big calculator, so what does “learning” mean in such a context? What do machines learn? Well, they learn mathematical relations. And of course, in order to have such relations, you need models, which we already talked about. So, in some sense, we’ve come full circle: much like modern programming languages, such as Python, are built on older, lower-level languages such as assembly but work at a higher level of abstraction, modern analytics algorithms are able to manipulate concepts and represent relations that you will typically find in today’s mental models. However, they depend in large part on the concepts created in the previous iterations of the information economy: if you do not have an ERP or some management information system, such as a CRM, you won’t have the data required to figure out which product you can sell to a specific consumer to maximize your profit. If you are not able to retrieve, join and aggregate data in a large, structured database, having killer algorithms won’t do you much good. If your business processes are not somehow modelled and abstracted, you won’t be able to improve them and expose them to your business partners and customers to make them nimble and adaptive. And, of course, if you’re not on the internet today, you’re not in business at all.

In the next video, I propose a small teaser and a very short intro about how you need to change your mindset to grasp modern analytics. Going into much more detail isn’t in the scope of this class, but I hope that, with these (admittedly very verbose) exercise sessions, you are now confident that you can use a computer to help you perform a broad array of tasks that would be hard or impossible without one. And you do not need to be a computer genius for that. Basic skills, a hefty dose of searching and the disposition to learn are all you need. Without further ado, here’s the very, very shallow overview of modern data science (or analytics).