1.2 - SQL

The Digital Firm

Estimated Reading Time: 51 minutes


L'Université Libre de Bruxelles fonde l'enseignement et la recherche sur le principe du libre examen. Celui-ci postule, en toute matière, le rejet de l'argument d'autorité et l'indépendance de jugement.

Article premier des statuts organiques de l'Université libre de Bruxelles

The Free University of Brussels bases its teachings and research on the principle of free inquiry. This principle postulates, in all matters, the rejection of the argument of authority and independence of thought.

First article of the Statutes of the Free University of Brussels (free translation)

Introduction

SQL, or Structured Query Language, is not like Python. It is not a programming language but a query language. This means that most of what we have seen in the Python refresher does not apply when writing SQL. Python and SQL are therefore complements rather than substitutes.

You will use Python to manipulate data, respond to new events from the user and implement advanced conditional logic. And you will use SQL (or one of its descendants, the so-called NoSQL languages) to retrieve information that is stored in some structured fashion. SQL is particularly at ease with relational databases (which you have probably seen or will soon see during the theory lectures).

In those databases, the data is stored in tables that are linked together. The easiest way for business/economics students to conceptualize those tables is as tabs of a spreadsheet with named columns. Each record in those tables has a field, or list of fields, that identifies it uniquely. For example, you probably have a student number, which is the field the University uses to identify you uniquely, even if another student with the same name and date of birth is enrolled at the same university. Speaking of students, here is what a table containing students enrolled at the university might look like.

Students

student_id name date_of_birth city_of_residence
1 Aloysius 1999-07-05 Brussels
2 Clothilde 1998-12-14 Brussels
3 Damian 1999-02-28 Dilbeek
4 Clothilde 1999-06-06 Anderlecht
5 Eya 1998-10-15 Brussels

Now, remark that we have two students named Clothilde who are obviously two different people, as they share neither the same birthday nor the same city. This is not a problem at all, and even if they shared that information by an enormous coincidence, they would still be differentiated in the database because they have different student_ids. In the lingo, the attributes (I’ll use attribute and field interchangeably) that enable differentiating two records, even when all the other fields are rigorously identical, are called primary keys. This is a very important concept: all use of relational databases, and a good part of our modern way of life, hinges on it.
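To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database (nothing is written to disk). The table and a few of the sample rows mirror the Students example above; the "Impostor" row is made up purely to show the primary key doing its job.

```python
import sqlite3

# In-memory database, for illustration only
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Students (
    student_id INTEGER PRIMARY KEY,
    name TEXT,
    date_of_birth TEXT,
    city_of_residence TEXT
)""")
con.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [(1, "Aloysius", "1999-07-05", "Brussels"),
     (2, "Clothilde", "1998-12-14", "Brussels"),
     (4, "Clothilde", "1999-06-06", "Anderlecht")]
)

# Two Clothildes coexist happily, but reusing student_id 2 is refused
try:
    con.execute("INSERT INTO Students VALUES (2, 'Impostor', '2000-01-01', 'Gent')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)  # the engine enforces the primary key's uniqueness
```

Notice that the engine, not your code, enforces uniqueness: any attempt to duplicate a primary key is rejected before it reaches the table.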

Now, how does this differ from a spreadsheet, you may ask. Well, the presence of a primary key enables the creation of links between tables. While most spreadsheet systems enable those as well, they are often very painful to maintain and manage. Let’s see some examples. If we have students, it is reasonable to assume they follow courses. Here is a small table with some courses.

Courses

course_id name
GEST-S482 The Digital Firm
GEST-S428 Some less cool course

It would also be expected that those courses are given by professors. Let’s create a table for them too…

Professors

employee_id name
1 Niccolo di Pantaloni de Mare
2 Tom Marvolo Riddle
3 Frank Randall

Nice! Now, to express the fact that Niccolo gives the class “The Digital Firm”, we are actually going to modify the table containing the courses to add a field. This field will be peculiar: it will only contain copies of primary keys from the table Professors. The semantics of this field is that, every time a record has a number in this field, the professor with this primary key gives this class. As such, the table Courses becomes:

Courses

course_id name professor_id
GEST-S482 The Digital Firm 1
GEST-S428 Some less cool course 3

Here we see that professor Niccolo gives the class “The Digital Firm” and professor Frank gives “Some less cool course”. Technically, a field that contains copies of the primary key of another table is called a foreign key. This is the mechanism that underpins link creation in relational databases. In fact, if a professor gives several classes, we can just copy her primary key into several rows of the Courses table and be done with it. This is why this type of link is called a One-to-Many relation (because one professor can give several classes but a class can only be given by one professor).
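The foreign key can be sketched in SQLite as well. The course GEST-S999 below is invented just to show one professor linked to several courses; note that SQLite only enforces foreign keys when the pragma is switched on.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys
con.execute("CREATE TABLE Professors (employee_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE Courses (
    course_id TEXT PRIMARY KEY,
    name TEXT,
    professor_id INTEGER REFERENCES Professors(employee_id)  -- the foreign key
)""")
con.execute("INSERT INTO Professors VALUES (1, 'Niccolo di Pantaloni de Mare')")

# One-to-Many: copy the same primary key into several Courses rows
con.execute("INSERT INTO Courses VALUES ('GEST-S482', 'The Digital Firm', 1)")
con.execute("INSERT INTO Courses VALUES ('GEST-S999', 'A hypothetical course', 1)")
```

With the pragma on, inserting a course that points to a non-existent professor would raise an IntegrityError: the link is not just a convention, the engine guards it.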

But you might ask: “Why haven’t we linked students with courses? Surely we can use foreign keys to do so!” And surely, we can. However, it is not so easy. Take a minute to think about how you would do it in the example and draw it on a piece of paper, I’ll wait.

Done? Marvelous! As you could see, there is an issue. A student can be enrolled in several classes but, at the same time, a class can be given to several students, fortunately for our universities’ budgets. Copying the primary key of one side of the relation onto the other table would not work. In fact, it would make matters worse: it would force a student to have at most one class and a class to have at most one student (try it for yourself!). We therefore have to resort to a different construct: the junction table. This is a table that is created between Students and Courses. Every time a student registers for a class, a new record is created in this table containing the primary key of the student and the primary key of the course as foreign keys. What’s even cooler is that, now, students can register several times for the same class, for example if they have to retake it (what’s cool here is that we are able to represent this, not that students have to retake the class; please try to pass on your first trial…). You just need to add a field with the year during which the student was registered. You can also, for example, log the grade of the student in the class, and that kind of information. This table, which we can call, for example, Enrollments, might look like:

Enrollments

student_id course_id year grade
2 GEST-S482 2020 12
1 GEST-S482 2019 8
2 GEST-S428 2020 14
1 GEST-S482 2020 16

As one can see, student 2 (one of the Clothildes) was registered for both courses (GEST-S482 on the first row and then GEST-S428 on the third row, both in 2020) and both students (Clothilde and Aloysius) were registered for course GEST-S482 (in fact, Aloysius had to retake the course the following year). Since a course can have several students and a student can have several courses, we call this type of relation a Many-to-Many (or sometimes an M-N) relation.
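The junction table too can be sketched in a few lines of SQLite. This is a minimal illustration of the Enrollments example above, with a composite primary key: it is the combination (student, course, year) that must be unique, which is exactly what lets Aloysius retake GEST-S482 in a later year.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Enrollments (
    student_id INTEGER,
    course_id TEXT,
    year INTEGER,
    grade INTEGER,
    -- composite primary key: the combination of the three fields must be unique
    PRIMARY KEY (student_id, course_id, year)
)""")
con.executemany("INSERT INTO Enrollments VALUES (?, ?, ?, ?)", [
    (2, "GEST-S482", 2020, 12),
    (1, "GEST-S482", 2019, 8),   # Aloysius, first attempt...
    (2, "GEST-S428", 2020, 14),
    (1, "GEST-S482", 2020, 16),  # ...and his retake the next year: same student and course, new year
])
```

Inserting the same (student, course, year) triple twice would be rejected, while repeating any one of the three values on its own is perfectly fine.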

Go Further

There isn’t much more to learn about the basic structure of relational databases. In this video, I talk about a more arcane topic, the self-join to represent hierarchies (or network) of entities. You can safely skip this if you are not interested.

Rather than detailing the content of the table each time, people working with databases usually resort to representing the tables and relations in what is called an Entities-Relations Diagram (often abbreviated as ERD). Those schemas are composed by rectangles representing the tables with their name and each field along with its type.

These tables are connected by graphical representations of the relations between them. There are two types of relations. Straight lines represent one-to-one relations, which arise when one record in one of the tables can be connected to at most one record in the other (this is the type of relation that might represent the connection between a person and her Belgian ID card: a person can have at most one valid ID card at a time and an ID card can only belong to one person). The second type of relation is represented by a forked line. Those are meant to be a graphical instantiation of a one-to-many relation. In this case, the straight end of the line represents the “one” side of the relation while the forked end represents the “many” side.

Below is an ERD representing the above scenario with students, professors, courses and enrollments.

The ERD corresponding to the scenario above - Made with DrawSQL

As you can see, every piece of information of the scenario is represented in this compact visualization (except for the values contained in the tables, but those are not really needed to work out the queries, as you will see afterwards). For example, the primary keys are explicitly represented. You might have remarked that the Enrollments table has several key symbols. This does not mean that there are several primary keys in this table but rather that the primary key is constituted of several fields: the combination of values in those fields must be unique, not the values in each field taken separately. Those are the fundamentals of the design of relational databases. It seems basic, but most of today’s data architecture enabling the world to work the way it does is based on it. You can represent most scenarios you can think of with relational databases (sometimes going through some crazy hoops, but still). The catch is that you have to impose a rigid structure. This is why, today, some database systems have broken free of, or taken liberties with, the entity-relation paradigm to allow for more flexible designs, but those are outside the scope of this class.

Queries

In your job as a business engineer or economist, you will mostly have to retrieve data from databases. In most relational databases, this is done using a somewhat universal querying language called SQL. It is not only a fairly sought-after skill for landing a data analyst job, it is also a good way to woo your boss.

Extract from the Amazon show Jack Ryan - S01E01

In this context, the most useful (and arguably the most complex) statement (or instruction) we will use is the SELECT statement. This will be the topic of the remainder of the session.

The SELECT statement

What the heck are we doing?

Before going on with the rest of the session, it is important to understand why we talk about SQL but wrap it in a bunch of Python code. It is because, most of the time, you want to do something with the data you retrieve. In this class, we’ll use Python to do so (since we already use this language for other topics).

However, make no mistake, we are using three types of “programs” under the hood. Our “view”, or graphical interface, is the Jupyter notebook (even if you read this as a web page, it is just a notebook whose result was “exported” to a web page), which renders the result of whatever code we run in the code cells. We have Python, which manages all the logic once the data has been retrieved from the disk. As such, it might be compared to a “controller” layer (this is a broad approximation, go along with it for now; we’ll go into more detail on these concepts in the theory sessions). Finally, we have the SQL server: a program that lurks in the dark (or runs as a process on your computer), waiting for another program to send it a query (again, an approximation in the case of SQLite, the engine we are using, but most SQL systems work with server programs; the SQLite engine is actually started at the moment you submit a query to it). Once it receives a valid query, it goes to the disk to retrieve the required data and sends the result of the query back to the program that requested it. All this happens in a split second and you have the impression that everything happens in one place, but there is in fact a whole message flow going on.
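This message flow can be seen even without pandas. A minimal sketch, using only the built-in sqlite3 module and a throwaway in-memory table: Python hands a query string to the engine through the connection and collects the rows the engine sends back.

```python
import sqlite3

# Open the connection: the "tunnel" between Python and the SQLite engine
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Demo (x INTEGER)")
con.executemany("INSERT INTO Demo VALUES (?)", [(1,), (2,), (3,)])

cursor = con.execute("SELECT x FROM Demo")  # Python sends the query to the engine...
rows = cursor.fetchall()                    # ...and collects the result it sends back
print(rows)  # [(1,), (2,), (3,)]
```

pd.read_sql_query, used throughout this session, does the same round-trip and additionally formats the rows into a nice table.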

Representation of the process to get a query executed
import pandas as pd
import sqlite3

# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)

# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source

import os.path
if not os.path.isfile("finnhub.db"): # If you don't have the file yet...
    import urllib.request 
    urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/finnhub.db", "finnhub.db") # download it

con = sqlite3.connect("finnhub.db") # create a connection with the database

query = "SELECT * FROM sqlite_master" # This is the SQL code that we want to execute

pd.read_sql_query(query, con) # We use the pandas library to send the query through the connection
# retrieve the data and format it into a table

type name tbl_name rootpage sql
0 table Symbols Symbols 2 CREATE TABLE "Symbols" (\n"description" TEXT,\...
1 table Companies Companies 3 CREATE TABLE "Companies" (\n"address" TEXT,\n ...
2 table Peers Peers 32 CREATE TABLE "Peers" (\n"symbol" TEXT,\n "pee...
3 table Executives Executives 49 CREATE TABLE "Executives" (\n"age" TEXT,\n "n...
4 table Candlestick Candlestick 74 CREATE TABLE "Candlestick" (\n"close_price" RE...

The second point I need to discuss before starting the explanation is the database that we are going to use in the examples. It contains information about listed companies and their top executives (their board of directors). It was collected by Benoit Decorte using the API from Finnhub (we will see how to retrieve this kind of data in an automated fashion during a future session). I hosted the database on this website and the boilerplate code downloads it to the same folder as your notebook if it is not already present. The ERD for this database is the following.

A representation of the database used here adapted from the original by Benoit Decorte, using DrawSQL

As you can see, we have a record of companies along with their executives, groups of other companies considered as peers, their financial results (summarized as candlestick data), some evaluations by financial analysts and the headlines of some news involving the company. What links all the information about a company is a specific field, called a ticker or symbol.

Remark that, in this schema, the table Symbols is easily bypassed: since the ticker is a foreign key in all other tables (and, in fact, even the primary key in the table Companies), we can join the tables using the key directly. This will become clearer as we go.

After this super long introduction, we can go ahead and learn the SELECT statement. The syntax is plain and we will go through all the keywords successively. The statement follows this pattern:

SELECT [fields or aggregates]

FROM [tables or other selects, that might be joined together]

WHERE [conditions on the lines]

GROUP BY [aggregation criteria]

HAVING [conditions on the aggregates]

ORDER BY [fields on which to order and direction]

LIMIT [number of final lines to show]

Don’t let the simple syntax fool you. With this, you can answer as many questions as your heart desires (if you have a relational database with the data, of course). Let’s start!

SELECT … FROM …

The SELECT statement starts with the… SELECT clause (a statement is like a function while a clause is a part of this statement). In this clause, you list the fields you would like to retrieve in the result. Together with the FROM clause, in which you name the tables from which you want to retrieve the fields, they are the only mandatory parts of a query.

Assume you want the names of all the companies in the database. You could write the query in the following way (the three double-quotes in Python enable the creation of a multi-line string; SQL disregards carriage returns, but, for your sanity, try to write your queries on several lines to make them more readable, and keep a uniform style as much as possible):

query = """ SELECT name
            FROM Companies
"""

pd.read_sql_query(query, con)

name
0 AGILENT TECHNOLOGIES INC
1 ALCOA CORP
2 PERTH MINT PHYSICAL GOLD ETF
3 None
4 DORSEY WRIGHT ADR ETF
... ...
216 WIDEPOINT CORP
217 XEROX CORP
218 YELP INC
219 ZILLOW GROUP INC
220 ZEDGE INC

221 rows × 1 columns

We see that here, we have 221 companies and their names. Note that some companies have an empty name. We’ll fix this later (in fact, those are tickers that correspond to funds and the like, not to companies).

This is the simplest example we can think of. With just this, we can only show all the rows of a table. The most sophisticated operations we can perform are basic math operations within the same row. For example, see that in the table Executives, you have the age of the person and the year she became a board member? Using simply SELECT and FROM, we can find out the age she was when she became a board member (plus or minus 1 year, since the granularity of the data is at the level of the year here).

query = """ SELECT name, age - 2020 + since
            FROM Executives
"""

pd.read_sql_query(query, con)

name age - 2020 + since
0 Jeffery H. Boyd 60
1 Glenn D. Fogel 56
2 David I. Goulden 57
3 Gillian Tans 47
4 Peter J. Millones 48
... ... ...
1389 Gerald D. Held 60
1390 Kathryn M. Hill 55
1391 Deborah Linda Kerr 44
1392 Scott F. Schenkel 48
1393 George T. Shaheen 59

1394 rows × 2 columns

In many relational database systems, we strive to attain what is called normalization: we seek to reduce data redundancy (an explanation is available in the Youtube video here). Since each piece of data is in its own table, it makes sense to want to merge information coming from several tables. For example, imagine that you want the names of all the board members along with the name of the company in which they serve. This information is in two different tables: the table Companies contains the name of the company (under the field name) while the table Executives contains the name of the board member (also under the field name). To make both pieces of information appear in the same result, we need to use what is called a join.

You can declare a JOIN in the FROM clause. The concept is the following: you give the name of the first table, then JOIN, then the second table, and then you use the keyword ON to specify the fields that should match between the two tables. Seems abstract? Here’s an example. We know that symbol in the table Executives is a foreign key (this is why there is a little fork arriving at it) from the table Symbols, and we know that the table Symbols has a one-to-one relation with the table Companies (meaning that each symbol corresponds to at most one company and each company has at most one symbol, or ticker). Well then, it means that the value in the field symbol of the table Executives is a copy of a value in the field ticker of the table Companies. If we were to produce all possible combinations of rows from Executives with rows from Companies and keep only the ones for which the field symbol in Executives is equal to the field ticker in Companies, we would indeed have all the information about the executives alongside all the information about the company in which they serve. That’s precisely what a join does.

Watch this unfold in a very small example:

Converting this into code, we have:

query = """SELECT Executives.Name, Companies.name
            FROM Executives JOIN Companies
                ON (Executives.symbol = Companies.ticker)
"""

pd.read_sql_query(query, con)

name name
0 Jeffery H. Boyd BOOKING HOLDINGS INC
1 Glenn D. Fogel BOOKING HOLDINGS INC
2 David I. Goulden BOOKING HOLDINGS INC
3 Gillian Tans BOOKING HOLDINGS INC
4 Peter J. Millones BOOKING HOLDINGS INC
... ... ...
1389 Gerald D. Held NETAPP INC
1390 Kathryn M. Hill NETAPP INC
1391 Deborah Linda Kerr NETAPP INC
1392 Scott F. Schenkel NETAPP INC
1393 George T. Shaheen NETAPP INC

1394 rows × 2 columns

Now, this type of JOIN is the most common. It is called an INNER join. In it, only the rows that have a correspondence in the other table show up in the results. For instance, imagine we are joining a table containing courses with a table containing professors. If we use an inner join, a course that has no professor assigned will not appear in the result, and a professor who isn’t connected to any course will not appear either. Now imagine you want to make sure that all courses appear once and, if there is a professor responsible for a course, you want to see the name of the professor. This is a job for a LEFT join. A left join will necessarily show all the records of the table that comes before the LEFT JOIN keyword, and only show the information from the rows of the second table if they match a row from the first table.

Most of the time, you’ll end up doing either an inner join (very frequently) or a left join (seldom). Besides those, there are also right and outer joins, which are outside the scope of this class.
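The difference between the two is easy to see on a toy version of the courses/professors example. This sketch uses an in-memory SQLite database; the "Orphan course" row is invented precisely so that it has no matching professor.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Professors (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses (course_id TEXT, name TEXT, professor_id INTEGER);
INSERT INTO Professors VALUES (1, 'Niccolo'), (2, 'Tom');
INSERT INTO Courses VALUES ('GEST-S482', 'The Digital Firm', 1),
                           ('GEST-S000', 'Orphan course', NULL);
""")

# INNER join: only rows with a correspondence on the other side survive
inner = con.execute("""SELECT c.name, p.name FROM Courses c
                       JOIN Professors p ON (c.professor_id = p.employee_id)""").fetchall()

# LEFT join: every Courses row survives; missing professors come back as NULL (None)
left = con.execute("""SELECT c.name, p.name FROM Courses c
                      LEFT JOIN Professors p ON (c.professor_id = p.employee_id)""").fetchall()

print(inner)  # the orphan course disappears
print(left)   # the orphan course stays, with None in the professor column
```

Note also that Tom, a professor with no course, appears in neither result: to keep him instead, you would put Professors on the left side of the LEFT JOIN.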

WHERE…

What we have so far is a way to retrieve all the rows from a table or from the result of a junction of tables. That’s nice but not very useful in many scenarios. The first refinement is to be able to filter out some of the rows. This is what the WHERE clause does: it checks a condition, or series of conditions, for each row. If the conditions are verified for a row, the row shows up in the results; otherwise it is discarded from the query.

The conditions are fairly classical. You can test equality with = (do not mix it up with Python, where equality is checked with ==), “greater than” >, “smaller than” <, “greater or equal” >=, “smaller or equal” <= and “different from” <> (also different from Python, where this is written !=). Besides that, for text and chains of characters, you can check patterns by using the keyword LIKE, with percentage signs % as wildcards representing zero or more arbitrary characters. For example, imagine you just want to see the executives that are younger than 40 years. You can do so with:

query = """SELECT name, age
            FROM Executives
            WHERE age < 40
"""

pd.read_sql_query(query, con)

name age
0 Eyal Lalo
1 Michael Porter 38
2 Mark Locks
3 Jon Sanders
4 Michael Friedman
... ... ...
190 Jared Smith
191 Matt Frey
192 William C. Martin 39
193 Spencer Scott
194 Atish Gude

195 rows × 2 columns

That’s a good start but, as you can see, we also have a bunch of executives for whom we do not know the age. For those, the value in the database was coded as an empty string ("") and it is evaluated as smaller than any number. We need to get rid of those. The trick here is to compose several conditions using the keyword AND: we can just add an additional condition that filters out such values. As such:

query = """SELECT name, age
            FROM Executives
            WHERE age < 40 AND age <> ""
"""

pd.read_sql_query(query, con)

name age
0 Michael Porter 38
1 Alexander B. Spiro 35
2 Benjamin Silbermann 36
3 Evan Sharp 36
4 Anneka R. Gupta 31
... ... ...
60 Melissa Thomas 39
61 Joseph Levin 39
62 Julia Hartz 39
63 Joseph Levin 39
64 William C. Martin 39

65 rows × 2 columns

If we want to see all the executives that are younger than 40 years and whose first name starts with the letter J, we could do:

query = """SELECT name, age
            FROM Executives
            WHERE age < 40 
                AND age <> ""
                AND name LIKE "J%"
"""

pd.read_sql_query(query, con)

name age
0 Joseph Levin 39
1 John Paul Knab 38
2 Jesse A. Cohn 38
3 Jose J. Vargas 39
4 Jayme Mendal 33
5 Joseph Levin 39
6 Jamie Cohen 32
7 Jantoon Reigersman 37
8 Joseph Bzdyl 36
9 Joseph Levin 39
10 Julia Hartz 39
11 Joseph Levin 39

Here, remark that LIKE "J%" matches any string starting with a J, whatever follows. You can always invert a condition using the keyword NOT:

query = """SELECT name, age
            FROM Executives
            WHERE age < 40 
                AND age <> ""
                AND NOT name LIKE "J%"
"""

pd.read_sql_query(query, con)

name age
0 Michael Porter 38
1 Alexander B. Spiro 35
2 Benjamin Silbermann 36
3 Evan Sharp 36
4 Anneka R. Gupta 31
5 Rachel Barnett 37
6 Christina Sindoni Ciocca 31
7 Liqun Liu 38
8 Chelsea Clinton 39
9 Sean K. Middleton 37
10 Tiffany Kuo 27
11 Michael Ross Pope 35
12 Gregory R. Blank 38
13 Zach Thomann 35
14 Benjamin L. Rosenzweig 33
15 Lev Peker 36
16 David Meniane 35
17 Ryan S. Schram 39
18 Garrard Brown 36
19 Katrina Lake 36
20 Chad Summe 39
21 Raymond T. Walsh 33
22 David B. Peters 39
23 Vikram R. Raghavan 36
24 Nathan Forrest Auwerda 33
25 Amit Goyal 39
26 Kamelia Aryafar 33
27 Sumit Goyal 34
28 Seth A. Moore 36
29 Logan D. Green 35
30 Pouyan Salehi 39
31 William Gray Stream 39
32 Maureen Mullen 38
33 Matias de Tezanos 38
34 Michael Montano 33
35 Bret Taylor 39
36 Darryl Auguste 30
37 Shimrit Markette 35
38 Mira Wilczek 37
39 Katrina Lake 36
40 Samuel H Altman 30
41 Chelsea Clinton 39
42 Ismail Azeri 39
43 Mark Zuckerberg 34
44 Evan Spiegel 28
45 Robert C. Murphy 30
46 Poppy Thorpe 34
47 Erin N. Lantz 38
48 Matthew B. Salzberg 35
49 Ilia M. Papas 37
50 Tracy Britt Cool 35
51 Melissa Thomas 39
52 William C. Martin 39

GROUP BY…

This is the last difficult part of the SELECT statement. The GROUP BY clause enables performing aggregation. Remember that the SELECT clause enables doing simple operations within a row, such as adding or subtracting two fields? Well, the GROUP BY clause makes it possible to do operations across rows, on a certain field. For example, you can find the average age of executives using the following syntax:

query = """SELECT AVG(age)
            FROM Executives
"""

pd.read_sql_query(query, con)

AVG(age)
0 50.672166

Indeed, you “squeezed” all the rows into one, summarizing the values in the age field by taking their average. This is neat, but what does GROUP BY have to do with it? Well, by default, if you don’t specify otherwise, aggregation operations in the SELECT clause (such as AVG, MAX, MIN, SUM and COUNT) will squeeze the table all the way down to 1 row. Most of the time, the question you have only requires squeezing the table up to a certain level. For instance, knowing that the average age of directors is 50.67 years is nice, but it does not tell you which companies employ the youngest directors. This is where GROUP BY shines. By specifying a field or set of fields in this clause, you force the database engine to aggregate within each value of this field (or unique combination of values in the set of fields). As such, if you wonder what the average age of the directors in each company is, you can do (the WHERE age <> "" is there to avoid counting people without a recorded age, whose age would otherwise be treated as 0):

query = """SELECT symbol, AVG(age)
            FROM Executives
            WHERE age <> ""
            GROUP BY symbol 
"""

pd.read_sql_query(query, con)

symbol AVG(age)
0 AAPL 58.933333
1 ALOT 62.363636
2 AMRH 53.800000
3 AMZN 58.812500
4 ANGI 46.266667
... ... ...
97 WDC 58.285714
98 WTRH 48.400000
99 WYY 54.400000
100 XRX 53.454545
101 YELP 50.916667

102 rows × 2 columns

In this case, you end up with one line per symbol and the average age of the directors that have this symbol as a foreign key.

While the average (AVG), sum (SUM), maximum (MAX) and minimum (MIN) all behave as you would expect, COUNT is a bit special: it counts the number of rows for which the combination of fields that you pass between parentheses is not entirely NULL. Most of the time, using COUNT, you really want to count the number of underlying rows in your query; this is why COUNT is often used with a star * (meaning “all the fields available”) between the parentheses. For example, in order to count the number of directors in each company, you can do:

query = """SELECT symbol, COUNT(*)
            FROM Executives
            WHERE age <> ""
            GROUP BY symbol
"""

pd.read_sql_query(query, con)

symbol COUNT(*)
0 AAPL 15
1 ALOT 11
2 AMRH 5
3 AMZN 16
4 ANGI 15
... ... ...
97 WDC 14
98 WTRH 10
99 WYY 5
100 XRX 11
101 YELP 12

102 rows × 2 columns

In this case, COUNT(*) counts all the rows of each group, whether or not some of their fields are empty.

HAVING…

Sometimes, it makes sense to filter lines based on the result of an aggregation. The issue is that clauses are executed sequentially and, by the time the aggregation is performed, the WHERE clause has already been executed (because otherwise, you would have wrong values in your aggregated fields). This is the reason why the HAVING clause exists. For example, the following code produces an error (note that you can use the keyword AS to give an alias to a field; this is especially useful when you create new aggregated fields):

query = """SELECT symbol, AVG(age) AS average_age
            FROM Executives
            WHERE average_age < 40 AND age <> ""
            GROUP BY symbol
"""

pd.read_sql_query(query, con)
---------------------------------------------------------------------------

OperationalError                          Traceback (most recent call last)

~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1680         try:
-> 1681             cur.execute(*args, **kwargs)
   1682             return cur


OperationalError: misuse of aggregate: AVG()


The above exception was the direct cause of the following exception:


DatabaseError                             Traceback (most recent call last)

<ipython-input-12-6a25273a221f> in <module>
      5 """
      6 
----> 7 pd.read_sql_query(query, con)


~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize)
    381         coerce_float=coerce_float,
    382         parse_dates=parse_dates,
--> 383         chunksize=chunksize,
    384     )
    385 


~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
   1725 
   1726         args = _convert_params(sql, params)
-> 1727         cursor = self.execute(*args)
   1728         columns = [col_desc[0] for col_desc in cursor.description]
   1729 


~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1691 
   1692             ex = DatabaseError(f"Execution failed on sql '{args[0]}': {exc}")
-> 1693             raise ex from exc
   1694 
   1695     @staticmethod


DatabaseError: Execution failed on sql 'SELECT symbol, AVG(age) AS average_age
            FROM Executives
            WHERE average_age < 40 AND age <> ""
            GROUP BY symbol
': misuse of aggregate: AVG()

This huge error message basically tells us that we cannot do that, for a good reason: we try to use the result of the AVG aggregation in the WHERE clause but, at the moment this clause is evaluated, the aggregation has not yet been calculated. To fix this and retrieve the companies where directors are on average younger than 40, we need to use the HAVING clause:

query = """SELECT symbol, AVG(age) AS average_age
            FROM Executives
            WHERE age <> ""
            GROUP BY symbol
            HAVING average_age < 40
"""

pd.read_sql_query(query, con)

symbol average_age

It turns out none of the companies in our database has a young board of directors. It’s part of the reason we’re in this mess, if you ask me, but anyway. For the rest, the HAVING clause can perform mostly the same checks as the WHERE clause.

ORDER BY…

Perhaps unsurprisingly, the ORDER BY clause enables sorting the results of the query. It is usually a fairly unsophisticated clause; the supported orders are numerical (1, 2, 3, 7 or 1.1, 1.2, 1.7, 6.5) and alphabetical. You can also specify whether you want it from small to large (using the ASC, for ascending, keyword, which is the default) or from large to small (using the DESC, for descending, keyword). You can also specify several fields: the engine will start sorting on the first and, if there are ties, resolve them by resorting to the second field, and so on.

In our example, assume we want to show all directors, sorted by the alphabetical order of the name of their company and, for people in the same company, by decreasing order of age (from oldest to youngest) and, if two directors from the same company have the same age, by the alphabetical order of their name (we do not expect two directors to have exactly the same name, so we stop there). We can do so as follows. You can alias a table by writing the alias just after the table name. For example, in this query, we alias the table Companies as c by writing FROM Companies c. In the rest of the query, every time we type c, the engine understands we refer to the table Companies.

query = """SELECT c.name, e.name, e.age
            FROM Companies c
                JOIN Executives e
                ON (c.ticker = e.symbol)
            ORDER BY c.name ASC, 
                    e.age DESC, 
                    e.name
"""

pd.read_sql_query(query, con)

name name age
0 1-800-FLOWERS.COM James A. Cannavino 75
1 1-800-FLOWERS.COM Eugene F. DeMark 72
2 1-800-FLOWERS.COM Adam Hanft 69
3 1-800-FLOWERS.COM James Francis McCann 68
4 1-800-FLOWERS.COM Leonard J. Elmore 67
... ... ... ...
1389 YELP INC Laurence Wilson 46
1390 YELP INC James Miln 45
1391 YELP INC George Hu 44
1392 YELP INC Jeremy Stoppelman 41
1393 YELP INC Vivek Patel 40

1394 rows × 3 columns

Here, remark that we have asked to sort first by name of the company in alphabetical order, then by age of the executive in descending order, then by name of the executive in alphabetical order. We omitted the ASC on the last field; it is no big deal since it is the default value. We could also have dropped it after c.name.
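If you want to convince yourself of how ties are resolved, here is a tiny self-contained version of the three-key sort on an in-memory database (the names and ages are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE People (company TEXT, name TEXT, age INTEGER)")
con.executemany("INSERT INTO People VALUES (?, ?, ?)",
                [("Acme", "Bob", 50), ("Acme", "Alice", 50),
                 ("Zeta", "Carl", 30), ("Acme", "Dan", 60)])

# Sort by company, then by decreasing age, then by name (ASC is the default)
rows = con.execute("""
    SELECT company, name, age
    FROM People
    ORDER BY company ASC, age DESC, name
""").fetchall()
# Dan (60) comes before the two 50-year-olds; the Alice/Bob tie on age
# is broken alphabetically by name
print(rows)  # [('Acme', 'Dan', 60), ('Acme', 'Alice', 50), ('Acme', 'Bob', 50), ('Zeta', 'Carl', 30)]
```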

LIMIT

The last and maybe easiest clause is LIMIT. It is executed at the end of the query and cuts the results. If it is followed by one positive number x, it shows the first x results. (In SQLite, a negative number means no limit at all: every row is returned.) Finally, if you give it two numbers, x and y, it shows the y rows that follow the first x rows.

query = """SELECT symbol, AVG(age) as average_age
            FROM Executives
            GROUP BY symbol
            ORDER BY average_age
            LIMIT 3
"""

pd.read_sql_query(query, con)

symbol average_age
0 SONM 17.166667
1 DXC 31.064516
2 ANGI 31.545455

This gives you the 3 companies where the directors are the youngest on average.

query = """SELECT symbol, AVG(age) as average_age
            FROM Executives
            GROUP BY symbol
            ORDER BY average_age
            LIMIT 2, 5
"""

pd.read_sql_query(query, con)

symbol average_age
0 ANGI 31.545455
1 BOXL 32.555556
2 IMMR 32.666667
3 APRN 37.307692
4 MEET 37.583333

This query gives you the 5 companies starting at index 2 of the full result (the 3rd company, ANGI in this case, as the index starts at 0): the first two rows are skipped.
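As a side note, SQLite also accepts the more explicit LIMIT y OFFSET x spelling, which does exactly the same thing as LIMIT x, y: skip x rows, then return the next y. A throwaway in-memory check (made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (n INTEGER)")
con.executemany("INSERT INTO T VALUES (?)", [(i,) for i in range(10)])

# "LIMIT 2, 5" is offset 2, count 5 -- the same as "LIMIT 5 OFFSET 2"
a = con.execute("SELECT n FROM T ORDER BY n LIMIT 2, 5").fetchall()
b = con.execute("SELECT n FROM T ORDER BY n LIMIT 5 OFFSET 2").fetchall()
print(a)  # [(2,), (3,), (4,), (5,), (6,)]
assert a == b
```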

Tying everything together

You now know everything you need to perform arbitrarily complex queries in relational databases.


Let’s try to do some (very) complex queries. In the first query, I will figure out the average age of the executives in the company which registered the highest relative stock price increase during the month of January 2019. In order for you to realize all the challenges and understand the way of thinking, I’ll go step by step without preparation. Please be patient as I make all the mistakes I warned you about above…

query = """
SELECT AVG(age)
FROM 
    Executives
    JOIN
    (SELECT last_quotation.symbol, close_price/open_price AS evolution
    FROM
        (SELECT symbol, close_price
        FROM Candlestick
        WHERE timestamp = "2019-01-31") last_quotation
        JOIN
        (SELECT symbol, open_price
        FROM Candlestick
        WHERE timestamp = "2019-01-02") first_quotation
        ON last_quotation.symbol = first_quotation.symbol
    ORDER BY evolution DESC
    LIMIT 1) max_evol
    ON Executives.symbol = max_evol.symbol
"""

pd.read_sql_query(query, con)

AVG(age)
0 32.555556
Go Further

Look at how the Peers table is constructed. It has two foreign keys, both pointing to companies. Concretely, this allows representing a directed graph in relational database form. Querying those is not hard once you understand the difference between the table on disk and an invocation of this table. In this video, we retrieve the names of those companies using an SQL query.

query = """
SELECT origin.name, destination.name 
FROM 
    Peers
    JOIN Companies origin
        ON Peers.symbol = origin.ticker
    JOIN Companies destination
        ON Peers.peers_symbol = destination.ticker
"""

pd.read_sql_query(query, con)

name name
0 BOOKING HOLDINGS INC SHUTTERSTOCK INC
1 BOOKING HOLDINGS INC LANDS' END INC
2 BOOKING HOLDINGS INC QUOTIENT TECHNOLOGY INC
3 BOOKING HOLDINGS INC RUBICON PROJECT INC
4 BOOKING HOLDINGS INC GRUBHUB INC
... ... ...
2811 NETAPP INC SOCKET MOBILE INC
2812 NETAPP INC INTEVAC INC
2813 NETAPP INC TRANSACT TECHNOLOGIES INC
2814 NETAPP INC IMMERSION CORP
2815 NETAPP INC QUALSTAR CORP

2816 rows × 2 columns
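The pattern of joining the same table twice under two aliases can be reproduced on a toy graph if you want to experiment without the full database (all the names here are invented): one alias resolves the origin of each edge, the other the destination.

```python
import sqlite3

# Tiny made-up graph mirroring the Peers pattern
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Companies (ticker TEXT, name TEXT)")
con.execute("CREATE TABLE Peers (symbol TEXT, peers_symbol TEXT)")
con.executemany("INSERT INTO Companies VALUES (?, ?)",
                [("A", "Alpha"), ("B", "Beta"), ("C", "Gamma")])
con.executemany("INSERT INTO Peers VALUES (?, ?)",
                [("A", "B"), ("A", "C")])

# The Companies table is joined twice, once per end of the edge
rows = con.execute("""
    SELECT origin.name, destination.name
    FROM Peers
        JOIN Companies origin ON Peers.symbol = origin.ticker
        JOIN Companies destination ON Peers.peers_symbol = destination.ticker
""").fetchall()
print(rows)  # two edges: Alpha -> Beta and Alpha -> Gamma
```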

Check your understanding

Once again, this was a long session, but it is fundamental for what’s to come (and for your life in general). Here are a few questions to check whether you remember what’s in this class. This does not count for your grade and the answers are anonymous. Try it as many times as you want.

Exercises

ERD I - Fakebook

You want to model a simple competitor to Facebook. You have users, who may become friends with each other and can publish posts, to which their friends may then react or comment; some of the posts may contain one or more pictures. Design the ER diagram representing this app.

See solution

We model just enough of Facebook to turn a profit. There will always be time to become multi-billionaires later.

Here’s the result of our efforts. Let the VC know, there’s a new kid in town!

Note, however, that if you do not have the same solution, it does not mean that yours is wrong. The thing with ERDs is that there is more than one way to skin a cat, if only because everyone has a different understanding of any problem. The important thing is to be able to justify your choices and to understand their consequences. That said, some designs are dysfunctional in the sense that they cannot accommodate common-sense situations (think of a one-to-one relation between Users and Posts, which would mean that users could only post something once and then bail). The name of the game is to avoid those nonsensical pieces of drawing. Everything else is fair game, as long as what you draw represents what you had in mind.

<figure >

A solution for a social network model

I guess we are ready to be rich now.

ERD II - The CoViD app

Design an ER diagram for an app managing the tracking of CoViD. You want to be able to figure out who was, or may have been, in contact with people affected by the Corona Virus Disease. As a rule, you do not need to track people on their way to and from places, only in closed places. You can use any tool to model the ERD; if you have no preference, you can use DrawSQL with a trial account.

See solution

The solution is actually less complex than one could expect. Of course, in reality, most of the challenge consists in keeping private data safe and optimizing searches so that you don’t have to wait for days for each query to execute on the gigantic database.

And here is the result of this exercise. Again this is not the unique solution. You might have thought of additional entities that could be represented and used in the disease tracking workflow.

<figure >

A solution for the exercise with the CoViD app

ERD III - The Bill of Material

You want to represent the “recipe book” for your product: an electric bike. Since you manufacture all parts of the bike, you assemble everything starting from raw materials. So you have 3 types of materials: finished products, intermediary products and raw materials. Only raw materials have directly computable costs.

You use raw materials and intermediary products to produce both finished products and intermediary products. Since you want to be able to compute the total cost of each product, you want to be able to trace all the products that enter the production of any specific product at any stage (be it to produce the final product or any of its intermediary components).

In technical terms, this is known as a Bill of Materials (or BOM). Since we are only interested in the cost of materials, you do not need to include the costs of the process. Design the ERD that represents this problem.

See solution

This exercise is a little more complex in its solution, as the situation is a little less familiar to the average database designer. The point is that you do not need to constrain the form of the network representing the BOM: a simple table with two connections to the Products (or Materials, or whatever) table is all you need.

Note that you might have wanted to separate raw inputs, semi-finished goods and finished products into separate tables. If you go this way, I’d still recommend creating a unique abstract table where all of those get an ID. Indeed, having an ID for the concept of material or product is important to avoid having to create 3 fields (raw_material_id, semi_finished_good_id and finished_good_id), even more so because, that way, you would also have to have several fields for the product being constructed. Similarly, creating a BOM table for each type of material is a bad idea, since most of the fields will be the same and there is nothing special in a recipe for a semi-finished good that is not in the one for a finished good. The UnitsOfMeasure entity is more of a detail, and it is totally understandable if you don’t have it in your design.

<figure >

A solution for the BOM exercise

SQL - Setup

Let’s play with the stock price database that we have. Let’s start by making sure that we have it: download the full database. Beware that it is quite heavy (250 MB or so), so executing this cell might take quite a while. Fortunately, you only need to execute it once, as it will find the previously downloaded version the next time.

import pandas as pd
import sqlite3

# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)

# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source

import os.path
if not os.path.isfile("stocks_full.db"): # If you don't have the file yet...
    import urllib.request 
    urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/stocks_full.db", 
                               "stocks_full.db") # download it

con = sqlite3.connect("stocks_full.db") # create a connection with the database

query = "SELECT * FROM sqlite_master" # This is the SQL code that we want to execute

pd.read_sql_query(query, con) # We use the pandas library to send the query through the connection
# retrieve the data and format it into a table

type name tbl_name rootpage sql
0 table Symbols Symbols 2 CREATE TABLE "Symbols" (\n"symbol" TEXT,\n "d...
1 table Companies Companies 26 CREATE TABLE "Companies" (\n"ticker" TEXT,\n ...
2 table Peers Peers 276 CREATE TABLE "Peers" (\n"symbol" TEXT,\n "pee...
3 table Executives Executives 20214 CREATE TABLE "Executives" (\n"id" INTEGER,\n ...
4 table Candlesticks Candlesticks 20338 CREATE TABLE "Candlesticks" (\n"id" INTEGER,\n...
5 table Recommendations Recommendations 57926 CREATE TABLE "Recommendations" (\n"id" INTEGER...
6 table Evaluations Evaluations 57942 CREATE TABLE "Evaluations" (\n"id" INTEGER,\n ...
7 table MajorDevelopments MajorDevelopments 57985 CREATE TABLE "MajorDevelopments" (\n"id" INTEG...
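If you want to see sqlite_master at work without downloading anything, the same query works on any SQLite database, including a throwaway in-memory one: SQLite maintains this catalogue itself, with one row per table or index and the CREATE statement that produced it.

```python
import sqlite3

# A fresh in-memory database with a single (made-up) table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Students (student_id INTEGER, name TEXT)")

# sqlite_master lists every object the database knows about
rows = con.execute("SELECT type, name FROM sqlite_master").fetchall()
print(rows)  # [('table', 'Students')]
```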

SQL I - Thinking like a finance maverick

<figure >

How you should feel regarding the financial validity of these exercises

First, let me preface these questions with a disclaimer: this is not a finance course. So what we are going to do during this session probably makes little to no financial sense (please don’t show this notebook to any finance professor).

That being said, let’s start. We’re first going to make our very own index to assess the attractiveness of sectors! That’s right, Goldman Sachs: beware!

In order to do so, produce the total number of recommendations for each industry in the dataset (the gind field in the Companies table) in 2019. Technically, we want the sums of the strongBuy, buy, hold, sell and strongSell recommendations emitted by experts in 2019. Note that you do not need to compute them at the level of individual evaluations, since this was already done for you in the Recommendations table.

In order to do so, think of combining aggregations and joins across the tables to retrieve the information you want.

See solution

This exercise illustrates how joins and aggregations may be combined to produce views of your data at different levels of a hierarchy. In this case, companies are grouped along one of their dimensions, the industry they are a part of, to produce statistics on such aggregates. (Note that there is an edit in this video near the end, so the video seems to jump; this is normal, I still need to hone my mad video editing skillz.)

query = """
        SELECT gind as Industry, SUM(strongBuy), 
            SUM(buy), SUM(hold), SUM(sell), SUM(strongSell)
        FROM Companies c
            JOIN Recommendations r
            ON (c.ticker = r.symbol)
        WHERE r.period LIKE "2019%" AND c.gind <> "N/A"
        GROUP BY gind
"""

pd.read_sql_query(query, con)

Industry SUM(strongBuy) SUM(buy) SUM(hold) SUM(sell) SUM(strongSell)
0 IT Services 567 1125 889 110 22
1 Interactive Media & Services 759 1309 1445 377 45
2 Internet & Direct Marketing Retail 746 1344 1504 161 14
3 Technology Hardware, Storage & Peripherals 648 1001 1323 124 38
4 Trading Companies & Distributors 34 22 142 1 1

SQL II - If advices were good, they wouldn’t be free

Financial analysts usually give advice about whether they think you should keep, buy or sell some stocks (these opinions are called Evaluations here). Thing is, sometimes they are wrong. So what we are going to do is take all the evaluations given by the analysts during the month of October 2019 and check whether following the recommendation of the analyst during this period would have resulted in a profit or a loss over the following month. If the last recommendation of an analyst during the month was “Buy”, “Strong Buy”, “Outperform” or “Overweight” and the stock went up during the month of November, the analyst gets a point; otherwise, they lose one. Same if the last recommendation of the analyst was “Sell”, “Underweight” or “Underperform” and the stock lost value.

Break this query in smaller, simpler queries and then just assemble them like legos.

To finish, establish a ranking of the analysts by decreasing order of points they got.

See solution

Step 1 - This query is not complicated from a syntax standpoint (there are few subtleties in the SQL language) but it is hard to organise. This is why we spend the first video thinking about how to break down the problem into a set of small components we can then put together. Try doing it by yourself before watching the video, as this is the single most important step to solve this exercise; the rest is just data plumbing. If you do not break the query down the same way I do, try implementing your solution in code and let me know if you manage or hit a road block. There is usually more than one valid solution; just be sure you understand precisely what you do and what data you retrieve. For example, using MAX(gradeTime) directly to retrieve the last evaluation looks correct but might in fact produce wrong answers (in this particular case, it’s a pity SQLite is so tolerant with imprecise instructions).

and here is the map and the corresponding videos:

<figure >

Proposed breakdown of problem 2 with corresponding videos

Step 2 - We start by generating the performance of the stocks during the month of November 2019.

Note that hardcoding the values for the first and last quotations during the month is a bad idea. However, since I’m feeling quite lazy and hypocritical today, let’s say fixing that is an exercise left for you to test your understanding.

query_perf = """
            SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
            FROM
                (SELECT ca.symbol, ca.open_price as open_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-01") open
                JOIN
                (SELECT ca.symbol, ca.close_price as close_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-29") close
                ON close.symbol = open.symbol
"""

pd.read_sql_query(query_perf, con)

symbol var
0 FINX 0.060153
1 BTA -0.015711
2 NORW -0.014141
3 NIM -0.024621
4 VONG 0.038809
... ... ...
2157 IBMI 0.001394
2158 MFV -0.015152
2159 MNCL -0.001986
2160 VTWO 0.034113
2161 VTI 0.032387

2162 rows × 2 columns

Step 3 - Retrieving the last Evaluation by each analyst for every company in the month of October 2019. Beware: this step is the trickiest of the whole exercise.

query_last_eval = """
        SELECT e.symbol, e.company, e.toGrade
        FROM Evaluations e
            JOIN
            (SELECT symbol, company, MAX(gradeTime) as last_eval
            FROM Evaluations
            WHERE gradeTime LIKE "2019-10%"
            GROUP BY symbol, company) last
            ON e.symbol = last.symbol AND e.company = last.company 
                                    AND e.gradeTime = last.last_eval
        GROUP BY e.symbol, e.company
"""

pd.read_sql_query(query_last_eval, con)

symbol company toGrade
0 AAPL Bank of America Buy
1 AAPL Barclays Equal-Weight
2 AAPL Canaccord Genuity Buy
3 AAPL Cascend Buy
4 AAPL Cowen & Co. Outperform
... ... ... ...
150 W Morgan Stanley Equal-Weight
151 W Wells Fargo Market Perform
152 WDC Loop Capital Buy
153 YELP Morgan Stanley Underweight
154 Z Morgan Stanley Equal-Weight

155 rows × 3 columns

Step 4 - Using the two components built previously, we can now compute the hits and misses. Please note that in a business setting, we would probably not build this whole query in one go; we would rather encapsulate some of the steps into views or stored procedures. We have not seen those here, so we’ll do without.

query_pos = """
        SELECT company, COUNT(*)
        FROM
            (SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
            FROM
                (SELECT ca.symbol, ca.open_price as open_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-01") open
                JOIN
                (SELECT ca.symbol, ca.close_price as close_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-29") close
                ON close.symbol = open.symbol) perf
            JOIN
            (SELECT e.symbol, e.company, e.toGrade
            FROM Evaluations e
                JOIN
                (SELECT symbol, company, MAX(gradeTime) as last_eval
                FROM Evaluations
                WHERE gradeTime LIKE "2019-10%"
                GROUP BY symbol, company) last
                ON e.symbol = last.symbol AND e.company = last.company 
                                        AND e.gradeTime = last.last_eval
            GROUP BY e.symbol, e.company) eval
            ON perf.symbol = eval.symbol
        WHERE
            (var > 0 AND toGrade IN ("Buy", "Strong Buy", 
                                    "Outperform", "Overweight"))
            OR
            (var < 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
        GROUP BY company
"""

query_neg = """
        SELECT company, COUNT(*)
        FROM
            (SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
            FROM
                (SELECT ca.symbol, ca.open_price as open_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-01") open
                JOIN
                (SELECT ca.symbol, ca.close_price as close_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-29") close
                ON close.symbol = open.symbol) perf
            JOIN
            (SELECT e.symbol, e.company, e.toGrade
            FROM Evaluations e
                JOIN
                (SELECT symbol, company, MAX(gradeTime) as last_eval
                FROM Evaluations
                WHERE gradeTime LIKE "2019-10%"
                GROUP BY symbol, company) last
                ON e.symbol = last.symbol AND e.company = last.company 
                                        AND e.gradeTime = last.last_eval
            GROUP BY e.symbol, e.company) eval
            ON perf.symbol = eval.symbol
        WHERE
            (var < 0 AND toGrade IN ("Buy", "Strong Buy", 
                                    "Outperform", "Overweight"))
            OR
            (var > 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
        GROUP BY company
"""

pd.read_sql_query(query_pos, con)

company COUNT(*)
0 Arete 1
1 B. Riley 1
2 BMO Capital 1
3 Baird 1
4 Bank of America 3
5 Barclays 5
6 Benchmark 2
7 Canaccord Genuity 2
8 Cascend 2
9 Citigroup 2
10 Cowen & Co. 1
11 Craig-Hallum 1
12 Credit Suisse 6
13 DA Davidson 1
14 Deutsche Bank 1
15 JP Morgan 4
16 Jefferies 1
17 KeyBanc 3
18 LightShed Partners 1
19 Longbow Research 1
20 MKM Partners 1
21 Mizuho 1
22 Morgan Stanley 5
23 National Securities 1
24 Nomura 1
25 Oppenheimer 1
26 Piper Jaffray 2
27 RBC Capital 1
28 Raymond James 3
29 SunTrust Robinson Humphrey 3
30 Telsey Advisory Group 1
31 UBS 4
32 Wedbush 3
33 Wells Fargo 1
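As an aside, here is a minimal sketch (on made-up numbers) of the views mentioned above: CREATE VIEW gives a name to a query, after which the rest of the code can refer to it as if it were a table, which keeps the big outer query readable.

```python
import sqlite3

# Toy candlesticks table with invented prices
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Candlesticks
               (symbol TEXT, timestamp TEXT, open_price REAL, close_price REAL)""")
con.executemany("INSERT INTO Candlesticks VALUES (?, ?, ?, ?)",
                [("XYZ", "2019-11-01", 100.0, 101.0),
                 ("XYZ", "2019-11-29", 102.0, 110.0)])

# Encapsulate the "performance" subquery as a named view
con.execute("""
    CREATE VIEW perf AS
    SELECT o.symbol, (c.close_price / o.open_price) - 1 AS var
    FROM (SELECT symbol, open_price FROM Candlesticks
          WHERE timestamp = '2019-11-01') o
         JOIN
         (SELECT symbol, close_price FROM Candlesticks
          WHERE timestamp = '2019-11-29') c
         ON o.symbol = c.symbol
""")

# The view can now be queried like any table
rows = con.execute("SELECT * FROM perf").fetchall()
print(rows)  # one row: ('XYZ', ~0.1)
```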

Step 5 - We can now put everything together!

The final query looks gruesome but it’s mostly a matter of copy/pasting the components above. Also, we used the UNION ALL statement to concatenate the results of the two queries, to overcome a situation that might arise if you just join the positive and negative queries. Kudos if you spotted the potential issue and went and found a solution! This is what programming is all about. If not, do not feel bad; it was pretty vicious.

Also, the code here is a little different from the one in the video: I aliased the SUM(score) as final_score and ordered the results by descending order of that value.

query = """
SELECT company, SUM(score) as final_score
FROM
    (SELECT company, COUNT(*) as score
        FROM
            (SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
            FROM
                (SELECT ca.symbol, ca.open_price as open_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-01") open
                JOIN
                (SELECT ca.symbol, ca.close_price as close_price
                FROM Candlesticks ca
                WHERE ca.timestamp = "2019-11-29") close
                ON close.symbol = open.symbol) perf
            JOIN
            (SELECT e.symbol, e.company, e.toGrade
            FROM Evaluations e
                JOIN
                (SELECT symbol, company, MAX(gradeTime) as last_eval
                FROM Evaluations
                WHERE gradeTime LIKE "2019-10%"
                GROUP BY symbol, company) last
                ON e.symbol = last.symbol AND e.company = last.company 
                                        AND e.gradeTime = last.last_eval
            GROUP BY e.symbol, e.company) eval
            ON perf.symbol = eval.symbol
        WHERE
            (var > 0 AND toGrade IN ("Buy", "Strong Buy", 
                                    "Outperform", "Overweight"))
            OR
            (var < 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
        GROUP BY company
    UNION ALL
    SELECT company,  - COUNT(*)
    FROM
        (SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
        FROM
            (SELECT ca.symbol, ca.open_price as open_price
            FROM Candlesticks ca
            WHERE ca.timestamp = "2019-11-01") open
            JOIN
            (SELECT ca.symbol, ca.close_price as close_price
            FROM Candlesticks ca
            WHERE ca.timestamp = "2019-11-29") close
            ON close.symbol = open.symbol) perf
        JOIN
        (SELECT e.symbol, e.company, e.toGrade
        FROM Evaluations e
            JOIN
            (SELECT symbol, company, MAX(gradeTime) as last_eval
            FROM Evaluations
            WHERE gradeTime LIKE "2019-10%"
            GROUP BY symbol, company) last
            ON e.symbol = last.symbol AND e.company = last.company 
                                    AND e.gradeTime = last.last_eval
        GROUP BY e.symbol, e.company) eval
        ON perf.symbol = eval.symbol
    WHERE
        (var < 0 AND toGrade IN ("Buy", "Strong Buy", 
                                "Outperform", "Overweight"))
        OR
        (var > 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
    GROUP BY company)
GROUP BY company
ORDER BY final_score DESC
"""

pd.read_sql_query(query, con)

company final_score
0 Credit Suisse 6
1 Barclays 5
2 Morgan Stanley 4
3 JP Morgan 4
4 Wedbush 3
5 UBS 3
6 SunTrust Robinson Humphrey 3
7 Raymond James 2
8 Piper Jaffray 2
9 KeyBanc 2
10 Citigroup 2
11 Cascend 2
12 Canaccord Genuity 2
13 Benchmark 2
14 Wells Fargo 1
15 Telsey Advisory Group 1
16 Mizuho 1
17 MKM Partners 1
18 Longbow Research 1
19 LightShed Partners 1
20 Jefferies 1
21 DA Davidson 1
22 Craig-Hallum 1
23 Cowen & Co. 1
24 Baird 1
25 BMO Capital 1
26 B. Riley 1
27 Arete 1
28 RBC Capital 0
29 Oppenheimer 0
30 National Securities 0
31 Bank of America 0
32 Loop Capital -1
33 Deutsche Bank -1
34 Nomura -2
35 Goldman Sachs -3

It would probably be a good idea to report the result as a percentage of the advice given, but this too is left as an additional exercise if you’re interested.
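The UNION ALL trick used above can be isolated on toy data: an analyst may show up only in the hits query or only in the misses query, so a join between the two would silently drop them, while concatenating both results and summing per analyst keeps everyone. A sketch with invented numbers:

```python
import sqlite3

# Two made-up score tables, standing in for the positive and negative queries
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Hits (company TEXT, score INTEGER)")
con.execute("CREATE TABLE Misses (company TEXT, score INTEGER)")
con.executemany("INSERT INTO Hits VALUES (?, ?)", [("A", 3), ("B", 1)])
con.executemany("INSERT INTO Misses VALUES (?, ?)", [("A", -1), ("C", -2)])

# B has no misses and C has no hits; a join on company would lose both,
# but UNION ALL + SUM keeps every analyst
rows = con.execute("""
    SELECT company, SUM(score) AS final_score
    FROM (SELECT * FROM Hits UNION ALL SELECT * FROM Misses)
    GROUP BY company
    ORDER BY final_score DESC
""").fetchall()
print(rows)  # [('A', 2), ('B', 1), ('C', -2)]
```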

SQL III - A horse race

If capitalism is a race, companies are horses. But some of them are more like donkeys. Compute the growth rate of each company’s stock price in January 2020 (that is, the ratio of the latest closing price to the earliest opening price). Compare it with those of the company’s peers and give the rank of the company in the list of its peers.

You might want to think about what this rank represents to solve this one.

See solution

Step 1 - Much like for exercise 2, we start by breaking down the query in smaller work packages. Fortunately, this is significantly easier in this case.

And here is the plan:

<figure >

Proposed breakdown of problem 3 with corresponding videos

Step 2 - We start, much like for step 2 of Question 2, by figuring out the performance of the stocks during the month of January. The same caveats and hypocrisy regarding the answer apply.

And here is the code:

query_perf = """
    SELECT open.symbol, (close_price/open_price) - 1 as perf
    FROM
       (SELECT ca.symbol, ca.open_price
       FROM Candlesticks ca
       WHERE ca.timestamp = "2020-01-02") open
       JOIN
       (SELECT ca.symbol, ca.close_price
       FROM Candlesticks ca
       WHERE ca.timestamp = "2020-01-31") close
       ON open.symbol = close.symbol
"""

pd.read_sql_query(query_perf, con)

symbol perf
0 FINX 0.034916
1 BTA 0.008567
2 NORW -0.081423
3 NIM 0.014272
4 VONG 0.014878
... ... ...
2168 MFV 0.077519
2169 MNCL -0.000978
2170 VTWO -0.037350
2171 PAAC 0.010499
2172 VTI -0.007044

2173 rows × 2 columns

Step 3 - We can now count the number of peers that have a better performance than each company. Please take very good care to understand what we are doing at this step, as it is the very heart of this solution. The distinction between a table on the disk and a table (or the result of a query) in memory is paramount for understanding how queries are parsed and understood by a database engine!

And here is the query:

query_numb_peers = """
    SELECT p.symbol, COUNT(*)
    FROM Peers p
        JOIN
        (SELECT open.symbol, (close_price/open_price) - 1 as perf
        FROM
           (SELECT ca.symbol, ca.open_price
           FROM Candlesticks ca
           WHERE ca.timestamp = "2020-01-02") open
           JOIN
           (SELECT ca.symbol, ca.close_price
           FROM Candlesticks ca
           WHERE ca.timestamp = "2020-01-31") close
           ON open.symbol = close.symbol) perf_company
        ON p.symbol = perf_company.symbol
        JOIN
        (SELECT open.symbol, (close_price/open_price) - 1 as perf
        FROM
           (SELECT ca.symbol, ca.open_price
           FROM Candlesticks ca
           WHERE ca.timestamp = "2020-01-02") open
           JOIN
           (SELECT ca.symbol, ca.close_price
           FROM Candlesticks ca
           WHERE ca.timestamp = "2020-01-31") close
           ON open.symbol = close.symbol) perf_peer
        ON p.peers_symbol = perf_peer.symbol
    WHERE perf_peer.perf > perf_company.perf
    GROUP BY p.symbol
"""

pd.read_sql_query(query_numb_peers, con)

symbol COUNT(*)
0 AAAU 132
1 AADR 1108
2 AAPL 3
3 AAXJ 1866
4 ACAM 418
... ... ...
2162 ZJPN 1446
2163 ZMLP 1828
2164 ZROZ 39
2165 ZSL 978
2166 ZTR 648

2167 rows × 2 columns

Step 4 - Here’s the final step. The logic is similar to the one in exercise 2, with a twist: we start by creating a query that returns all the companies that have at least 1 peer and gives them an offset of 1 in the rank (the why is explained in the video).

And here is the final query:

query = """
    SELECT symbol, SUM(position) as rank
    FROM
        (SELECT DISTINCT symbol, 1 as position
        FROM Peers
        UNION ALL
        SELECT p.symbol, COUNT(*)
        FROM Peers p
            JOIN
            (SELECT open.symbol, (close_price/open_price) - 1 as perf
            FROM
               (SELECT ca.symbol, ca.open_price
               FROM Candlesticks ca
               WHERE ca.timestamp = "2020-01-02") open
               JOIN
               (SELECT ca.symbol, ca.close_price
               FROM Candlesticks ca
               WHERE ca.timestamp = "2020-01-31") close
               ON open.symbol = close.symbol) perf_company
            ON p.symbol = perf_company.symbol
            JOIN
            (SELECT open.symbol, (close_price/open_price) - 1 as perf
            FROM
               (SELECT ca.symbol, ca.open_price
               FROM Candlesticks ca
               WHERE ca.timestamp = "2020-01-02") open
               JOIN
               (SELECT ca.symbol, ca.close_price
               FROM Candlesticks ca
               WHERE ca.timestamp = "2020-01-31") close
               ON open.symbol = close.symbol) perf_peer
            ON p.peers_symbol = perf_peer.symbol
        WHERE perf_peer.perf > perf_company.perf
        GROUP BY p.symbol)
    GROUP BY symbol
"""

pd.read_sql_query(query, con)

symbol rank
0 AAAU 133
1 AADR 1109
2 AAPL 4
3 AAXJ 1867
4 ACAM 419
... ... ...
2200 ZJPN 1447
2201 ZMLP 1829
2202 ZROZ 40
2203 ZSL 979
2204 ZTR 649

2205 rows × 2 columns
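The ranking trick condenses to very little once taken out of context: a company’s rank among its peers is 1 plus the number of peers that performed better, and the UNION ALL is what supplies that offset of 1. A toy version with invented performances:

```python
import sqlite3

# Made-up peer graph and January performances
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Peers (symbol TEXT, peers_symbol TEXT)")
con.execute("CREATE TABLE Perf (symbol TEXT, perf REAL)")
con.executemany("INSERT INTO Peers VALUES (?, ?)",
                [("A", "B"), ("A", "C"), ("B", "A"), ("B", "C")])
con.executemany("INSERT INTO Perf VALUES (?, ?)",
                [("A", 0.05), ("B", 0.10), ("C", 0.01)])

rows = con.execute("""
    SELECT symbol, SUM(position) AS rank
    FROM (SELECT DISTINCT symbol, 1 AS position FROM Peers  -- the offset of 1
          UNION ALL
          SELECT p.symbol, COUNT(*)              -- peers that did better
          FROM Peers p
              JOIN Perf pc ON p.symbol = pc.symbol
              JOIN Perf pp ON p.peers_symbol = pp.symbol
          WHERE pp.perf > pc.perf
          GROUP BY p.symbol)
    GROUP BY symbol
""").fetchall()
print(rows)  # A is beaten by B only -> rank 2; B beats both its peers -> rank 1
```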

Aaaaaaaand, we are now in business! Time to relax, and then go on with understanding how we can model business processes to improve and automate them.

Where to go next?

Do the exercises if they are already available on the UV (the corrections will be posted on this page pretty soon). Otherwise, or if you want to know more about business processes and how to represent, digitize and, ultimately, automate them, head to the BPMN page!