4.2 Analysing Data

The Digital Firm

Estimated Reading Time: 28 minutes


Statistical thinking is merely an artifact of human ignorance. We need to use statistics because nature does not.

Pierre-Simon Laplace

Introduction

In this session, we learn the very basics of data analysis. Here’s the program for today.

After a discussion of dimensional modelling, we will move on to collective intelligence and then very briefly introduce modern analytics.

Dimensional Models

In the theory sessions, you heard about dimensional modelling, OLAP cubes and whatnot. What are those? At their core, those concepts boil down to a simple idea: facts do not live in a vacuum. If I tell you “Sales were 150,000”, I have not given you any meaningful information. Sales of what? Sales where? Over which period of time? 150,000 of which unit? Intuitively, we all understand that we should provide facts along some dimensions, and we usually do.

Starting from this idea, the whole field of traditional business intelligence (or BI) created a number of concepts and techniques so that business people can access data in a format they intuitively understand. The need for up-to-date data created a tension in organisations between two “classes” of IT workers: those working on operations and those working on data and analytics. Although the skills are roughly transferable, even more so at the management level, the focus is quite different and the systems are usually decoupled.

While traditional IT owns most of the systems, both core (such as ERP, CRM, Master Data Management,…) and legacy (all the systems that are not directly involved in the core productive process), the data and analytics teams usually own big database systems. Traditionally, the process was organised around three phases: the ETL, for Extract-Transform-Load, the semantic layer, usually a data warehouse, and the presentation layer, usually data cubes. While very big companies today have a parallel structure for advanced data analysts, built around data lakes that combine the ETL and semantic layers and dispense with the presentation layer altogether, most non-technical people, even in very large companies, still access data through cubes: the precompiled data are easy to use and most people have, in fact, relatively stable needs in terms of data. Therefore, we will spend a little time understanding the traditional data flow.

ETL Layer

The ETL step is usually the messy one. Indeed, it involves interfacing your data analysis infrastructure with the rest of the world. The rest of the world is scary and messy; the world inside a data warehouse is neat and tidy (that is why people build them in the first place). You can see that, in order to brutally reduce entropy at the border between those two worlds, you need serious design, and that the process is relatively turbulent. Passing from an uncontrolled to a controlled environment usually requires serious organisation. Indeed, the ETL part of the data analysis workflow, even for a one-time analysis, is usually the trickiest part. It involves extracting data from a source (often what are called transactional systems, as opposed to analytical systems, or even the web), transforming them to standardize them and, finally, loading and storing them in the data infrastructure.

This process is usually done in batches: once a day, for instance, the whole database underpinning the ERP system is copied into the data warehouse. Indeed, running this extract sometimes has side effects (even though it shouldn’t, it might sometimes modify or break something) and in any case requires resources. Since it is more important to be able to perform the daily operations than to have analytics, it makes sense to contain the risky moment to a batch “window”. In other cases, data is ingested in real time.
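To fix ideas, here is a deliberately naive sketch of what such a nightly batch could look like in Python (the file, table and column names are invented for illustration; real ETL jobs use dedicated tools, incremental loads and a lot more error handling):

import sqlite3
import pandas as pd

# Extract: read a (hypothetical) nightly export of the transactional system
orders = pd.read_csv("erp_orders_export.csv")

# Transform: standardize names and types so they match the warehouse conventions
orders.columns = [c.strip().lower() for c in orders.columns]
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Load: append the batch to a table in the analytical database
warehouse = sqlite3.connect("warehouse.db")
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()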

Still, in the context of this class, and since we do not have a full-blown transactional system at our disposal, we have already seen the toolbox for the ETL part: we extract data with either web scraping or APIs. You already know about those, so I won’t spend any more time on that.

Semantic Layer

The goal of an analytics system is to answer business questions. In order to do so, your data must have meaning. Meaning is usually built by providing context to datapoints. Context may be thought of as dimensions (in the mathematical sense of the term) on which we position the datapoints.

A good part of the ETL process consists in reconciling dimensions to make them consistent. Usually, the most important dimension is time: assume we want to check whether our sales increase when the buzz around them on social media increases. To know that, we need to align our records of sales coming from our CRM with the ones coming directly from the social media. Fortunately, time is pretty much the same everywhere and for everyone (timezones notwithstanding). For other kinds of data, it is trickier: in our CRM, we have the sales at the level of the municipality or even the address (because we know where we ship), while it is hard, sometimes impossible or even illegal, to get data with such granularity about revenues, for example. In this case, we need to decide how to reconcile them. Do we really need information at the address level and, if so, how are we going to impute values? Or are we OK with aggregating sales at the municipality level?
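As a toy illustration of that last option, here is how one could roll address-level sales up to the municipality level with pandas (the dataframe and its columns are invented for the example):

import pandas as pd

# Hypothetical address-level sales coming out of the CRM
sales = pd.DataFrame({
    "address":      ["1 Main St", "2 Main St", "5 High St"],
    "municipality": ["Springfield", "Springfield", "Shelbyville"],
    "amount":       [120.0, 80.0, 200.0],
})

# Aggregate to the coarser granularity shared with the other data source
sales.groupby("municipality", as_index=False)["amount"].sum()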

Once we have that, we have a semantic layer: a part of the database where data is embedded in context. In this case, data is usually organised in a dimensional model. In such a model, numerical values are gathered in tables called fact tables. Besides these values, fact tables only contain foreign keys corresponding to points in the dimensions.

Example of a dimensional model: the Sales table is the fact table and all the rounds are dimensions (organised in several levels) - schema by Matteo Golfarelli - License: CC BY-SA

On this schema, book sales are gathered in a fact table and contextualised along several dimensions: on the left, dimensions related to the product - the exact book, which is then organised by author or by genre - and on the right the time dimension - each sale was made on a certain date, from which we can retrieve the week or the month and then the year. This specific organisation, where the hierarchies inside the dimensions are spread over several tables, is called a snowflake schema (because it kind of looks like a snowflake, with little arms going out from a centre - the fact table). In traditional dimensional modelling, the facts must be additive. This means that it must make business sense to add two rows of the fact table. This is the case here: adding the revenue of all the sales of a same book gives you the revenue this book generated. Similarly, adding all sales from a same date gives you the total revenue generated on that date. It is less clear for the discount: if the discount is expressed as a percentage of the revenue (for example a discount of \(0.05\) or \(5\%\)), then it is not additive: summing all the discount rates of the sales of a same book does not give you the global discount rate. However, in this case, the discount is likely expressed in euros, meaning that, on a row, if the discount field contains \(1\), for instance, the database analyst will understand that a discount of \(1\) euro was granted on the sale of that book. Now, if you add all the discounts along the book dimension, you do retrieve the total discounts granted on those books (and dividing by the sum of the revenues plus discounts gives you the discount rate for a book, or a date, depending on which dimension you aggregate along).
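To make this a bit more concrete, here is a minimal, made-up sketch of what the tables behind such a model could look like in SQL (the table and column names are invented for illustration, and the author/genre hierarchy is flattened into the book dimension for brevity):

import sqlite3

demo = sqlite3.connect(":memory:")  # throwaway in-memory database, purely for illustration
demo.executescript("""
    CREATE TABLE dim_book (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT, genre TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE fact_sales (
        book_id  INTEGER REFERENCES dim_book(book_id),  -- foreign keys pointing to the dimensions
        date_id  INTEGER REFERENCES dim_date(date_id),
        quantity INTEGER,                               -- additive measures
        revenue  REAL,
        discount REAL
    );
""")

Every row of fact_sales is one sale: a few numbers plus pointers into the dimensions, exactly as described above.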

At this point, you’re probably wondering what the point of all this is. In fact, organising your database in such a way makes it super easy to answer most of the business questions that typical business analysts ask. Let’s take it for a little spin. Unfortunately, publicly available snowflake or star schemas (dimensional models where the hierarchies are “flattened” into a single table per dimension) are relatively hard to come by, and we have no time to build one from a transactional system Those interested in understanding how to go from a transactional system to a dimensional one are invited to follow or get the material for the class Business Intelligence and Data Science during the second semester. . We are going to use a synthetic database created using this tool.

Let’s start by downloading the database that I generated (this code might run for a long time the first time, since the database is about 100 MB):

import pandas as pd
import sqlite3

# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)

# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source

import os.path
if not os.path.isfile("SSB.db"): # If you don't have the file yet...
    import urllib.request 
    urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/SSB.db", "SSB.db") # download it

con = sqlite3.connect("SSB.db") # create a connection with the database

Here’s what this database looks like:

The star schema

It seems simple enough: a fact table and 4 dimensions. Note that it differs from the schema above, which we called a snowflake schema, in that each dimension is comprised of only one table. Look, for example, at the time dimension. Sure, it is at the level of the day, but you can see that the information about the year has been “rolled up” into that table, so that the hierarchical structure of the dimension is represented by several columns in one table rather than several tables. This is a design decision that makes writing queries easier later on.

Now, talking about queries, let’s see how easy it is to answer business questions when we have a star schema! It all boils down to the fact that the joins here are trivial. Long gone are the days when we had to do crazy joins across distant tables. In a star schema, everything is only one table away. Let’s see a series of questions and how they translate into code. First, we need to realise that the database is relatively heavy: it contains a little more than \(1,000,000\) rows of facts, so we are above Excel scale here. So be a bit patient if the answers take a moment to arrive.

Let’s start easy: How much did we sell in 1995 to customers in Europe?

query = """ SELECT SUM(lo_ordtotalprice)
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            WHERE d.d_year = 1995 
                AND c.c_region = "EUROPE" 
        """

pd.read_sql_query(query, con)

SUM(lo_ordtotalprice)
0 582823124941

Boom! Blazing fast, crazy simple! But before I drop the mic and we move on to the next topic, I can sense some of you are still doubtful. Why am I hyping dimensional models so much? Well, it is because the very simple query we wrote above is about as hard a query as you will ever have to write in a dimensional model. The structure of a query in a dimensional model will always be the same (a worked instance follows the list):

  • The SELECT part will always be a mix of unaggregated dimension fields and aggregated measures
  • The FROM part will always join the fact table with dimension tables and sometimes (in the snowflake case) dimension tables within the same hierarchy
  • The WHERE part is a filter on one or more dimensions
  • The GROUP BY will always be along the dimensions of the unaggregated fields in the SELECT
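Here is one more instance of that four-part pattern against our database, with each clause annotated (total revenue per year for customers located in Asia; it reuses the con connection created above):

query = """ SELECT d.d_year, SUM(lo_revenue) AS revenue          -- unaggregated dimension field + aggregated measure
            FROM lineorder lo                                     -- the fact table...
                JOIN date d ON lo.lo_orderdate = d.d_datekey      -- ...joined to its dimensions
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            WHERE c.c_region = "ASIA"                             -- filter on a dimension
            GROUP BY d.d_year                                     -- group along the unaggregated dimension field
        """

pd.read_sql_query(query, con)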

If you’re one of those Excel wizards, and by this point you probably are, you’re probably thinking “pivot table” right now. You are right! We will come back to that. But for now, let me just show you a couple more queries and ask you to write a few as well. Here’s how we can get the cost of goods sold paid to suppliers from Middle Eastern countries from 1990 to 1995, broken down by the country of the supplier.

query = """ SELECT s.s_nation, SUM(lo_supplycost)
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN supplier s ON lo.lo_suppkey = s.s_suppkey
            WHERE d.d_year BETWEEN 1990 AND 1995 
                AND s.s_region = "MIDDLE EAST"
            GROUP BY s.s_nation
        """

pd.read_sql_query(query, con)

s_nation SUM(lo_supplycost)
0 EGYPT 2336691909
1 IRAN 2280002613
2 IRAQ 2288083426
3 JORDAN 1814785982
4 SAUDI ARABIA 2509290291

The query is remarkably similar to the previous one in its construction. Let’s now see from which customer countries the supply costs going to the Middle East come.

query = """ SELECT c.c_nation, s.s_nation, SUM(lo_supplycost)
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN customer c on lo.lo_custkey = c.c_custkey
                JOIN supplier s ON lo.lo_suppkey = s.s_suppkey
            WHERE d.d_year BETWEEN 1990 AND 1995 
                AND s.s_region = "MIDDLE EAST"
            GROUP BY c.c_nation, s.s_nation
        """

pd.read_sql_query(query, con)

c_nation s_nation SUM(lo_supplycost)
0 ALGERIA EGYPT 91510962
1 ALGERIA IRAN 86235221
2 ALGERIA IRAQ 87029185
3 ALGERIA JORDAN 70257007
4 ALGERIA SAUDI ARABIA 94949454
... ... ... ...
120 VIETNAM EGYPT 93074372
121 VIETNAM IRAN 87627326
122 VIETNAM IRAQ 88861180
123 VIETNAM JORDAN 74356296
124 VIETNAM SAUDI ARABIA 103470364

125 rows × 3 columns

Again, this is just a minor variation on the original query. Here are a few queries that you might try as exercises.

Exercise 1 - The Color of Money

For each year, compute the total revenue generated per color of part. Sort the results chronologically and, within each year, by decreasing revenue.

See solution
query = """ SELECT d.d_year, p.p_color, SUM(lo_revenue) as revenues
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN part p on lo.lo_partkey = p.p_partkey
            GROUP BY d.d_year, p.p_color
            ORDER BY d.d_year ASC, revenues DESC 
        """

pd.read_sql_query(query, con)

d_year p_color revenues
0 1992 blush 8013806433
1 1992 beige 8003628753
2 1992 blue 7954274115
3 1992 antique 7909984436
4 1992 aquamarine 7854998354
... ... ... ...
639 1998 spring 2674484870
640 1998 yellow 2596359593
641 1998 wheat 2572572362
642 1998 white 2545421666
643 1998 tomato 2522630675

644 rows × 3 columns

Exercise 2 - Multinational, Multimodal

For each region and for the whole dataset, what is the total quantity of products of color “chartreuse” that have been shipped with each shipment method?

See solution
query = """ SELECT c.c_region, lo.lo_shipmode, SUM(lo_quantity) as quantity
            FROM lineorder lo
                JOIN part p ON lo.lo_partkey = p.p_partkey
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            WHERE p.p_color = "chartreuse"
            GROUP BY c.c_region, lo.lo_shipmode
        """

pd.read_sql_query(query, con)

c_region lo_shipmode quantity
0 AFRICA AIR 10476
1 AFRICA FOB 9541
2 AFRICA MAIL 9197
3 AFRICA RAIL 10151
4 AFRICA REG AIR 9677
5 AFRICA SHIP 10400
6 AFRICA TRUCK 11537
7 AMERICA AIR 10802
8 AMERICA FOB 10303
9 AMERICA MAIL 10238
10 AMERICA RAIL 10313
11 AMERICA REG AIR 9810
12 AMERICA SHIP 10667
13 AMERICA TRUCK 9608
14 ASIA AIR 9091
15 ASIA FOB 11126
16 ASIA MAIL 10240
17 ASIA RAIL 9789
18 ASIA REG AIR 10475
19 ASIA SHIP 9901
20 ASIA TRUCK 9547
21 EUROPE AIR 9502
22 EUROPE FOB 8906
23 EUROPE MAIL 10244
24 EUROPE RAIL 9520
25 EUROPE REG AIR 10384
26 EUROPE SHIP 10926
27 EUROPE TRUCK 10372
28 MIDDLE EAST AIR 10023
29 MIDDLE EAST FOB 11655
30 MIDDLE EAST MAIL 11009
31 MIDDLE EAST RAIL 9883
32 MIDDLE EAST REG AIR 9009
33 MIDDLE EAST SHIP 9865
34 MIDDLE EAST TRUCK 10256

Exercise 3 - On Holiday

On average, do we sell more expensive items (in terms of average order price per unit sold) during regular days or holidays?

See solution
query = """ SELECT holiday, tot_sales / quant
            FROM
                (SELECT d.d_holidayfl as holiday, SUM(lo_ordtotalprice) tot_sales, SUM(lo_quantity) as quant
                FROM lineorder lo
                    JOIN date d ON lo.lo_orderdate = d.d_datekey
                GROUP BY holiday)
        """

pd.read_sql_query(query, con)

holiday tot_sales / quant
0 0 741151
1 1 738927

You get the gist. Writing queries against a star schema (and, to a lesser extent, a snowflake schema) is relatively easy. So easy, in fact, that the average business analyst, even if she is not versed in the subtle art of SQL querying, is able to manage. But anything that looks like code inspires fear in the average person. What people like are spreadsheets. Although you and I know that spreadsheets are dry and uninspiring, they are intuitive and integrate with the workflow in most companies. Thus the pivot table was born: it is just a nice user interface on top of a dimensional model.

One issue remains: Excel is slow and resource-hungry. It brings the computer to a screeching halt whenever you ask it to crunch a few million rows… That’s quite the conundrum… what to do?

Presentation Layer

Data workers have added a last layer, the presentation layer. It can take many forms: the shiny dashboards in Tableau, SQL terminals for geeks or, and that is what we will see here, OLAP cubes. OLAP, for On-Line Analytical Processing, is in fact just a set of pre-computed dimensional queries, aggregated at each level of the hierarchies and served to the spreadsheets of the analysts. That way, analysts do not need to work with millions of rows but with a few thousand, and the computationally expensive steps are done on big servers rather than on the small laptops or desktops of the people consuming the data.

An OLAP cube is just that: a set of pre-computed aggregations on a set of dimensions. It is in fact only a cube when there are three dimensions with the same number of components, but the term stuck. One has to admit that talking about “releasing the cube” or “querying the cube” adds a touch of mystery to the description of a desk job. And isn’t that what matters in the end?
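Just to demystify the term, here is a crude, do-it-yourself sketch of a tiny “cube” built from our SSB database: we pre-aggregate revenue along two dimension levels (year and customer region) and pivot the result so it reads like a small spreadsheet. A real OLAP engine pre-computes many such aggregations, at every level of every hierarchy, and serves them on demand.

query = """ SELECT d.d_year, c.c_region, SUM(lo_revenue) AS revenue
            FROM lineorder lo
                JOIN date d ON lo.lo_orderdate = d.d_datekey
                JOIN customer c ON lo.lo_custkey = c.c_custkey
            GROUP BY d.d_year, c.c_region
        """

mini_cube = pd.read_sql_query(query, con)

# Years as rows, regions as columns, total revenue in the cells
mini_cube.pivot(index="d_year", columns="c_region", values="revenue")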

Collective Intelligence

The internet has been weird lately. Everywhere you go, your every move is tracked and data about it is stored somewhere for people and corporations to use. While I personally believe this goes too far Although you can take some steps to mitigate the effects. , it is not unequivocally bad. When someone has information about a lot of people (and most big corporations do), they can sometimes do useful things with it for the same people the information is about. This is the mass-customization people talk about. The technical side of it has been referred to as collective intelligence.

Collective intelligence comes in many shapes and forms. Here, we are going to work through a light application on a classical problem: which films to recommend to somebody. We start by downloading a classical dataset containing movie ratings by a bunch of people (6,000 of them). For copyright reasons, I understand I cannot provide it to you directly. But I also assume I can provide you with code that will download, extract and strip the set of unnecessary data… If you’re a lawyer and you’re reading this, feel free to correct me and I’ll find some other sophistry to make it easy to get that data without any manual labour. A nice place to remember Aaron Swartz. Here goes the code:

import os.path, shutil, os
from zipfile import ZipFile

if not os.path.isfile("ratings.dat"): # If you don't have the file yet...
    import urllib.request 
    urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip",
                               "ml-1m.zip")
    with ZipFile("ml-1m.zip", 'r') as zip_ref:
        zip_ref.extract("ml-1m/ratings.dat", path = ".")
        zip_ref.extract("ml-1m/movies.dat")
        zip_ref.extract("ml-1m/users.dat")
    shutil.move("./ml-1m/ratings.dat", ".")
    shutil.move("./ml-1m/movies.dat", ".")
    shutil.move("./ml-1m/users.dat", ".")
    shutil.rmtree("./ml-1m")
    os.remove("./ml-1m.zip")

Mouahahaha, is there anything we cannot do with code? Anyway. Open the files you just downloaded using a text editor and take a look at their formatting: they are series of lines where the “fields” are separated by two colons: ::. We can load them using pandas into dataframes that will be easier to manipulate.

import pandas as pd

movies = pd.read_csv("movies.dat", sep="::",
                     names = ["ID", "Name", "Type"],
                     encoding = "latin-1") # some movie titles contain non-UTF-8 characters

users = pd.read_csv("users.dat", sep="::",
                    names = ["ID", "Gender", 
                            "Age", "Occupation",
                            "ZIP"])

ratings = pd.read_csv("ratings.dat", sep="::",
                     names = ["User", "Movie", 
                              "Rating", "Timestamp"])

You might get a warning that you can happily ignore It tells you that the default parser cannot handle a separator composed of two characters, so pandas hands the problem over to its slower Python engine, which knows how to handle such cases. If this bothers you, just add the parameter engine = "python" to the read_csv calls . We now have three dataframes: one containing information about the movies (which we will in fact not use that much), one describing the users, and one containing the rating every user gave to the movies they watched.

We are going to use a very mathematical but nonetheless relatively effective method to tackle this collaborative filtering problem, called Singular Value Decomposition See here for intuition about what the singular values are. . It supposes that we have a matrix containing the ratings, with every row representing an individual and every column representing a movie.

It’s big brain time! To transform the dataframe into such a matrix, we could just initialise a matrix full of zeros, insert the ratings and be done with it… but it is a bad idea. Most people haven’t rated most of the movies, so the rows would be full of zeros with a few ratings here and there. In the end, this would give the impression that people who haven’t rated a movie actually hated it… Surely we can do better. Since we are not really interested in keeping the information about which films are absolutely good or bad, we will standardize every column (one per movie) so that its mean is 0 and the modified rating of the people who did evaluate it represents their deviation from the mean of the other ratings. Scary, huh? Not quite; let’s do this step by step and using as little “magic” as possible. Let’s start by computing the average and standard deviation of the ratings for each of the movies. For some reason, the IDs of users and films start at 1. This is not Pythonic, so I’ll convert those IDs to 0-based indices. This will come in handy later, when we build the actual matrix.
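In symbols, if \(r_{u,m}\) is the rating user \(u\) gave to movie \(m\), and \(\mu_m\) and \(\sigma_m\) are the mean and standard deviation of the ratings received by movie \(m\), the value we will store in the matrix is \( \tilde{r}_{u,m} = \frac{r_{u,m} - \mu_m}{\sigma_m} \), while cells for which user \(u\) gave no rating stay at \(0\), i.e. at the (standardized) average opinion about that movie.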

import numpy as np

# Re-index users and movies so that their IDs become 0-based positions
users = users.reset_index()
movies = movies.reset_index()

users = users.set_index("ID")
movies = movies.set_index("ID")

# Map the original 1-based IDs in the ratings to the new 0-based positions
rebase_users = lambda x : int(users.loc[x, "index"])
rebase_movies = lambda x : int(movies.loc[x, "index"])

ratings["User"] = list(map(rebase_users, ratings.User))
ratings["Movie"] = list(map(rebase_movies, ratings.Movie))

users = users.set_index("index")
movies = movies.set_index("index")

# Mean and standard deviation of the ratings received by each movie
movies_spec = ratings.groupby("Movie").agg(["mean", "std"])["Rating"]

movies_spec[:5]

mean std
Movie
0 4.146846 0.852349
1 3.201141 0.983172
2 3.016736 1.071712
3 2.729412 1.013381
4 3.006757 1.025086

Let’s check: what is the movie at position 0 (ID 1 in the original dataframe), which got a high average rating? Toy Story… OK, quite good. And the fourth, which got a rather low average rating? Waiting to Exhale. Never heard of it. Apparently it got \(56\%\) on Rotten Tomatoes, so it is consistent with our score here. Cool, everything seems to be working. Moving on.
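If you want to reproduce that check yourself, you can look the titles up directly, since the movies dataframe now shares the same 0-based index as movies_spec (positions 0 and 3 correspond to the original IDs 1 and 4):

movies.loc[[0, 3], "Name"]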

We are now going to:

  1. Initiate a matrix with a bunch of zeros
  2. Iterate over the ratings dataframe:
  • For each row, compute the standardized rating
  • Insert it into the matrix
Let’s go

shape = (int(np.max(ratings.User)) + 1,
        int(np.max(ratings.Movie)) + 1)

design_matrix = np.full(shape, 0.0)

for _idx, row in ratings.iterrows():
    mean_movie = movies_spec.loc[row["Movie"]]["mean"]
    std_movie = movies_spec.loc[row["Movie"]]["std"]
    
    standardized_rating = (row["Rating"] - mean_movie) / std_movie
    
    design_matrix[row["User"], row["Movie"]] = standardized_rating

# Movies with too few or identical ratings have a zero or undefined std,
# which produces NaNs above; replace those with 0 to keep the matrix numerical
design_matrix = np.nan_to_num(design_matrix)

Let’s take a look at the beginning of the matrix.

design_matrix[:3, :3]
array([[1.00094387, 0.        , 0.        ],
       [0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        ]])

And verify that the first three columns are centered:

np.sum(design_matrix[:,:3], 0)
array([4.65849581e-13, 4.44089210e-14, 1.37223566e-13])

Those look like numerical errors If computers were able to represent numbers perfectly, you would expect those to be exactly 0. However, having such a small remainder after adding 6,000 values is OK … Close enough. We can now continue. Now for the big leap: we are going to decompose the matrix into its singular vectors. Ready? Here we go.

design_matrix.shape
(6040, 3883)
from scipy.sparse.linalg import svds
U, S, M_filtered = svds(design_matrix, k = 100)

That’s all… we now have to analyse the result. We produced those matrices with a twist: we reduced the number of dimensions on which the information is represented. Concretely, we do everything as if there were only 100 factors that determined the grade each viewer would assign to each film. It is a monumental trick: we determine, simultaneously, the similarity of people and of movies. Each person is represented in a space of 100 dimensions, and each movie is also represented in a space of 100 dimensions.
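If you want to convince yourself of what svds returned, you can simply inspect the shapes of the three objects: one row per viewer in U, one column per movie in M_filtered, and the 100 singular values in S in between.

U.shape, S.shape, M_filtered.shape
# expected: (6040, 100), (100,), (100, 3883)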

Your head is spinning? Understandable. But now we can retrieve a result that makes business sense: if we multiply the three matrices, we get a set of ratings that “could be observed” if the film choices were really driven by 100 abstract factors this number is a wild guess, use whatever value you’re comfortable with. Higher values allow for more variation, which is both a blessing, because you allow for greater customization, and a curse, because your recommendations become more noisy. On the topic, you can read about the curse of dimensionality and the excellent bias-variance trade-off article on Wikipedia. . Technically, we could gauge the “strength” of the signal on each dimension by looking at the absolute value of the singular values, but we will leave that for a finer analysis; here we just decide to keep the first 100 vectors.

new_base = np.dot(np.dot(U, np.diag(S)), M_filtered)
new_base[:3,:3]
array([[ 0.59018665, -0.04393189,  0.02897357],
       [ 0.16460058, -0.05389524,  0.08092475],
       [-0.28872293,  0.04940678,  0.03531134]])

Let’s identify a random user and check whether the films she actually likes are indeed similar to the films our system would recommend to her. Let’s start by inverting the transformation we made earlier. We will work with the column vectors to save some time In most programming languages, vectorized operations are faster than loops for a series of reasons. This is certainly the case for R, Matlab and most scientific programming languages, with the notable exception of Julia (for most cases). For Python, it depends on the packages you use. If performance is a concern for you, read up on the topic and make some experiments. . We then convert the matrix back to a “tall” dataframe with columns for the viewer ID and the movie ID in the 0-based index (to ensure consistency with the movies dataframe).

recommendations_mat = new_base.copy()

# Undo the standardization one movie (column) at a time:
# multiply by the movie's std and add back its mean rating
for movie in movies_spec.index:
    mean_movie = movies_spec.loc[movie]["mean"]
    std_movie = movies_spec.loc[movie]["std"]
    recommendations_mat[:, movie] = new_base[:, movie] * std_movie + mean_movie
    
recommendations = pd.DataFrame(recommendations_mat).reset_index()

recommendations_melted = recommendations.melt(id_vars = "index", var_name = "Movie",
                                      value_name = "Simulated Score")

recommendations_melted.columns = ["ID", "Movie", "Simulated Score"]

We now draw a viewer randomly and check the type of movies she evaluated the best.

viewer = np.random.choice(recommendations["index"])
viewer
5825

And we compare the list of the highest rated movies for this viewer and the top recommendations.

movies.loc[ratings.loc[ratings.User == viewer].sort_values("Rating", ascending = False)["Movie"]][:5]

Name Type
index
1216 Harold and Maude (1971) Comedy
1899 Breakfast Club, The (1985) Comedy|Drama
293 Pulp Fiction (1994) Crime|Drama
2928 Being John Malkovich (1999) Comedy
2789 American Beauty (1999) Comedy|Drama

Let’s now build our… TOP5!

Oh Yes

movies.loc[recommendations_melted.loc[recommendations_melted.ID == viewer].sort_values("Simulated Score", ascending = False)["Movie"].values][:5]

Name Type
index
589 Silence of the Lambs, The (1991) Drama|Thriller
2928 Being John Malkovich (1999) Comedy
293 Pulp Fiction (1994) Crime|Drama
2789 American Beauty (1999) Comedy|Drama
49 Usual Suspects, The (1995) Crime|Thriller

Here we see that the recommender system actually retrieves most of the movies the person rated the highest. It seems to indicate that the system got at least a reasonable idea of what that person likes. It will not work every time, of course, because some people have too idiosyncratic a taste to lend itself to such primitive modelling, but it is interesting to know that, just by having information about what other people like, and without even looking at the individual characteristics of the person, you can already build a recommender system. I won’t put an exercise here, but try to get the broad lines of how the code above works.

This is cool, but it is sooooooo 2010…

2010... a lifetime ago. It was a weird time. Wakawaka... eheh.. Wakawaka... ehoh

Since then, analytics has become much more advanced… much darker also. Eigenvector- or singular-vector-based techniques are relatively mechanical in the way they work (they use properties of the space and then distort it to get a notion of proximity), and they are not really data-hungry: they take observations about something in a sparse matrix and then try to extrapolate values to the empty cells. They do not easily accommodate predictors: indeed, imagine you added the age (or any other attribute) of the viewer in the analysis above; when rebasing the matrix, you would end up “predicting” the age of the viewer (which you already know) in the process, and the whole analysis becomes super messy. Those techniques are also difficult to tune to one’s desires, and they tend to only convince mathematicians and mathematically-minded people. Finally, they tend to bring people to a baseline and then segregate them around this baseline If you followed closely what we did: we added back the average score of each movie and only let it vary by the deviation from one’s own supposed assessment. .

Enter two Knights of the Privacy Apocalypse: Statistics and Computer Science. So begins the era of…

Modern Analytics - The Next Frontier

The Mechanical Turk Supposed for a long time to be one of the first automatons that played chess. Turns out somebody was inside the machine making all the decisions. lol. Like all cool apocalypse stories, this one has been long in the making. The concept of “Artificial Intelligence”, which probably goes back to Babbage and Ada Lovelace and gained traction with Alan Turing and his imitation game (better known as the Turing test), made some cool advances in the ’70s… but, even at the time, it was closer to the Mechanical Turk than to Asimov’s robots. Indeed, the approach to artificial intelligence was based on the manipulation of symbols or concepts. Researchers at the time were mostly trying to make machines “learn” the way one learns mathematics: start with axioms, build some theorems, use the results of those theorems to build other theorems, rinse and repeat.

As it turns out, this did not work so great. Despite early successes, artificial intelligence was considered old news for a long time (the so-called AI Winter) and interest faded. But, much like the One Ring, it was found again and came back in the possession of a Hobb… I mean, in the form of statistical machine learning.

Déagol finding the Ring I can't believe I waited this long to put a Lord of the Rings gif...

Indeed, while statistics had existed for a couple of centuries, the specific type of statistics useful for the tasks AI was asked to solve, and some of the required techniques, were out of reach for the mathematicians of the time. Indeed, they required fast computers and, in many cases, the ability to perform matrix computations very fast. Bizarrely, one of the reasons we now have those tools is the gaming community. Since one of the most important requirements for good 3D graphics is also fast matrix computation And some ill-intentioned people would say that gamers had money to spare because of their lack of social life. , manufacturers started to build increasingly fast Graphical Processing Units at prices that were accessible even to university researchers, who in turn used them to implement algorithms and concepts that had existed for a long time but were not really used because they were impractical. This is the story of AI in a nutshell, and how gamers saved the world… until they destroy it…

Indeed, while traditional AI techniques based on symbolic computation do not require data, those new algorithms need a lot of data to “learn”. How much data? Much, much data. But remember the start of our journey: a computer is just a big calculator, so what does “learning” mean in such a context? What do machines learn? Well, they learn mathematical relations. And of course, in order to have such relations, you need models, which we already talked about. So, in some sense, we’ve come full circle: much like modern programming languages, such as Python, are built on older, lower-level languages such as assembly but work at a higher level of abstraction, modern analytics algorithms are able to manipulate concepts and represent relations that you will typically find in today’s mental models. However, they depend in large part on the concepts created in the previous iterations of the information economy: if you do not have an ERP or some management information system, such as a CRM, you won’t have the data required to figure out which product you can sell to a specific consumer to maximize your profit. If you are not able to retrieve, join and aggregate data in a large, structured database, having killer algorithms won’t do you much good. If your business processes are not somehow modelled and abstracted, you won’t be able to improve them and expose them to your business partners and customers to make them nimble and adaptive. And, of course, if you’re not on the internet today, you’re not in business at all.

In the next video, I propose a small teaser and a very short intro about how you need to change your mindset to grasp modern analytics. Going into much more detail isn’t in the scope of this class, but I hope that, with these (admittedly very verbose) exercise sessions, you are now confident that you can use a computer to help you perform a broad array of tasks that would be hard or impossible without one. And you do not need to be a computer genius for that. Basic skills, a hefty dose of searching and the disposition to learn are all you need. Without further ado, here’s the very, very shallow overview of modern data science (or analytics).