1.2 - SQL
The Digital Firm
Estimated Reading Time: 51 minutes
L'Université Libre de Bruxelles fonde l'enseignement et la recherche sur le principe du libre examen. Celui-ci postule, en toute matière, le rejet de l'argument d'autorité et l'indépendance de jugement.
The Free University of Brussels bases its teachings and research on the principle of free inquiry. This principle postulates, in all matters, the rejection of the argument of authority and independence of thought.
Introduction
SQL, or Structured Query Language, is not like Python. It is indeed not a programming language but a query language. This means that most of what we have seen in the Python refresher does not apply when writing SQL. Python and SQL are therefore complements rather than substitutes.
You will use Python to manipulate data, respond to new events from the user and implement advanced conditional logic. And you will use SQL (or one of its descendants, the so-called NoSQL languages) to retrieve information that is stored in some structured fashion. SQL is particularly at ease with relational databases (which you probably have seen or will soon see during the theory lectures).
In those databases, the data is stored in tables that are linked together. The easiest way for business/economics students to conceptualize those tables is as tabs in a spreadsheet with named columns. Each record in those tables has a field, or list of fields, that identifies it uniquely. For example, you probably have a student number, which is the field the University uses to identify you uniquely, even if another student with the same name and date of birth is enrolled at the same university. Speaking of students, here is what a table containing students enrolled at the university might look like.
Students
student_id | name | date_of_birth | city_of_residence |
---|---|---|---|
1 | Aloysius | 1999-07-05 | Brussels |
2 | Clothilde | 1998-12-14 | Brussels |
3 | Damian | 1999-02-28 | Dilbeek |
4 | Clothilde | 1999-06-06 | Anderlecht |
5 | Eya | 1998-10-15 | Brussels |
Now, remark that we have two students named Clothilde who are obviously two different people, as they share neither the same birthday nor the same city. This is not a problem at all, and even if they shared that information by an enormous coincidence, they would still be differentiated in the database because they have different student_id values. In the lingo, the attribute (I'll use attribute and field interchangeably) or set of attributes that enables differentiating two records, even if all the other fields are rigorously identical, is called the primary key. This is a very important concept: all use of relational databases, and a good part of our modern way of life, hinges on it.
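For the curious, this uniqueness guarantee is something the database engine itself enforces. Here is a minimal, illustrative sketch with SQLite (the engine we will use later, in memory, so nothing touches your disk); the table mirrors the Students example above:

```python
import sqlite3

# Toy in-memory database: the Students table above, with student_id as primary key.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Students (
    student_id INTEGER PRIMARY KEY,
    name TEXT,
    date_of_birth TEXT,
    city_of_residence TEXT
)""")
con.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)", [
    (2, "Clothilde", "1998-12-14", "Brussels"),
    (4, "Clothilde", "1999-06-06", "Anderlecht"),
])
# Both Clothildes coexist happily: the primary key tells them apart.
rows = con.execute("SELECT student_id, name FROM Students").fetchall()

# The engine refuses a second record with an existing primary key.
try:
    con.execute("INSERT INTO Students VALUES (2, 'Impostor', '2000-01-01', 'Gent')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The attempted insert with the already-used key 2 is rejected by the engine, which is exactly the guarantee that makes primary keys usable as identifiers.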
Now, how does this differ from a spreadsheet, you may ask. Well, the presence of a primary key enables the creation of links between tables. While most spreadsheet systems enable links as well, they are often super painful to maintain and manage. Let's see an example. If we have students, it would be reasonable to assume they follow courses. Here is a small table with some courses.
Courses
course_id | name |
---|---|
GEST-S482 | The Digital Firm |
GEST-S428 | Some less cool course |
It would also be expected that those courses are given by professors. Let’s create a table for them too…
Professors
employee_id | name |
---|---|
1 | Niccolo di Pantaloni de Mare |
2 | Tom Marvolo Riddle |
3 | Frank Randall |
Nice! Now, to express the fact that Niccolo gives the class “The Digital Firm”, we are actually going to modify the table containing the courses to add a field. This field will be peculiar: it will only contain copies of primary keys from the table Professors. The semantics of this field are that, whenever a record has a number in this field, the professor with this primary key gives this class. As such, the table Courses becomes:
Courses
course_id | name | professor_id |
---|---|---|
GEST-S482 | The Digital Firm | 1 |
GEST-S428 | Some less cool course | 3 |
Here we see that professor Niccolo gives the class “The Digital Firm” and professor Frank gives “Some less cool course”. Technically, the field that contains copies of the primary key of another table is called a foreign key. This is the mechanism that underpins link creation in relational databases. In fact, if a professor gives several classes, we can just copy her primary key in several rows of the Courses table and be done with it. This is why this type of link is called a One-to-Many relation (because one professor can give several classes but a class can only be given by one professor).
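If you want to see what declaring such a foreign key looks like in practice, here is a hedged SQLite sketch of the two tables above (names taken from the example; the column types are assumptions for the illustration):

```python
import sqlite3

# Toy in-memory database: Professors and Courses, with professor_id in Courses
# declared as a foreign key holding copies of primary keys from Professors.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Professors (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses (
    course_id TEXT PRIMARY KEY,
    name TEXT,
    professor_id INTEGER REFERENCES Professors(employee_id)
);
INSERT INTO Professors VALUES (1, 'Niccolo'), (3, 'Frank');
INSERT INTO Courses VALUES ('GEST-S482', 'The Digital Firm', 1),
                           ('GEST-S428', 'Some less cool course', 3);
""")
# Each course row carries a copy of one professor's primary key: the "one" side
# can be copied into as many course rows as needed, hence One-to-Many.
rows = con.execute("SELECT course_id, professor_id FROM Courses").fetchall()
```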
But you might ask: “Why haven't we linked students with courses? Surely we can use foreign keys to do so!” And surely, we can. However, it is not so easy. Take a minute to think about how you would do it in the example and draw it on a piece of paper, I'll wait.
…
…
…
Done? Marvelous! As you could see, there is an issue. A student can be enrolled in several classes but, at the same time, a class can be given to several students, fortunately for our universities' budgets. Copying the primary key of one side of the relation into the other table would not work. In fact, it would make matters worse: it would force a student to have at most one class and a class to have at most one student (try it for yourself!). We therefore have to resort to a different construct: the junction table. This is a table that is created between Students and Courses. Every time a student registers for a class, a new record is created in this table, containing the primary key of the student and the primary key of the course as foreign keys. What's even cooler is that, now, students can register several times for the same class, for example if they have to retake it (what's cool here is that we are able to represent it, not that students have to retake the class; please try to pass on your first attempt…). You just need to add a field with the year during which the student was registered. You can also, for example, log the grade of the student in this class and that kind of information. This table, which we can call, for example, Enrollments, might look like:
Enrollments
student_id | course_id | year | grade |
---|---|---|---|
2 | GEST-S482 | 2020 | 12 |
1 | GEST-S482 | 2019 | 8 |
2 | GEST-S428 | 2020 | 14 |
1 | GEST-S482 | 2020 | 16 |
As one can see, student 2 (one of the Clothildes) was registered for both courses (GEST-S482 on the first row and then GEST-S428 on the third row, both in 2020) and both students (Clothilde and Aloysius) were registered for course GEST-S482 (in fact, Aloysius had to retake the course for 2 years). Since a course can have several students and a student can have several courses, we call this type of relation a Many-to-Many (or sometimes an M-N) relation.
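The Enrollments rows above can be sketched as a junction table in SQLite; the composite primary key on (student, course, year) is an assumption for the illustration, chosen so that re-enrolment in a later year is allowed:

```python
import sqlite3

# Toy in-memory junction table: each row pairs a student's key with a course's
# key (both foreign keys), plus the year and the grade.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Enrollments (
    student_id INTEGER,
    course_id TEXT,
    year INTEGER,
    grade INTEGER,
    PRIMARY KEY (student_id, course_id, year)  -- composite primary key
);
INSERT INTO Enrollments VALUES
    (2, 'GEST-S482', 2020, 12),
    (1, 'GEST-S482', 2019, 8),
    (2, 'GEST-S428', 2020, 14),
    (1, 'GEST-S482', 2020, 16);
""")
# Aloysius (student 1) appears twice for GEST-S482: once per year.
years = con.execute("SELECT year FROM Enrollments WHERE student_id = 1").fetchall()
```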
Go Further
There isn’t much more to learn about the basic structure of relational databases. In this video, I talk about a more arcane topic, the self-join, used to represent hierarchies (or networks) of entities. You can safely skip this if you are not interested.
Rather than detailing the content of each table every time, people working with databases usually represent the tables and relations in what is called an Entity-Relationship Diagram (often abbreviated as ERD). Those schemas are composed of rectangles representing the tables, with their name and each field along with its type.
These tables are connected by graphical representations of the relations between them. There are two types of relations. The straight lines represent one-to-one relations, which arise when one record in one of the tables can be connected to at most one record in the other (this is the type of relation that might represent the connection between a person and her Belgian ID card: a person can have at most one valid ID card at a time and an ID card can only belong to one person). The second type of relation is represented by a forked line. Those are a graphical instantiation of a one-to-many relation. In this case, the straight end of the line represents the “one” side of the relation while the forked end represents the “many” side.
Below is an ERD representing the above scenario with students, professors, courses and enrollments.
As you can see, every piece of information in the scenario is represented in this compact visualization (except for the values contained in the tables, but those are not really needed to work out the queries, as you will see afterwards). For example, the primary keys are explicitly represented. You might have noticed that the Enrollments table has several key symbols. This does not mean that there are several primary keys in this table but rather that the primary key is constituted of several fields. The combination of values in those fields must be unique, not the values in each of those fields taken separately. Those are the fundamentals of the design of relational databases. It seems basic, but most of today's data architecture enabling the world to work the way it does is based on that. You can represent most scenarios you can think of with relational databases (sometimes going through some crazy hoops, but still). The catch is that you have to impose a rigid structure. This is why, today, some database systems have broken free of, or taken liberties with, the entity-relationship paradigm to allow for more flexible designs, but those are outside the scope of this class.
Queries
In your job as a business engineer or economist, you are mostly going to have to retrieve data from databases. In most relational databases, this is done using a somewhat universal querying language called SQL. It is not only a fairly sought-after skill for landing a data analyst job, it is also a good way to woo your boss. Extract from Amazon’s show Jack Ryan - S01E01
In this context, the most useful (and arguably the most complex) statement (or instruction) we will use is the SELECT statement. This will be the topic of the remainder of the session.
The SELECT statement
What the heck are we doing?
Before going on with the rest of the session, it is important to understand why we talk about SQL but wrap it in a bunch of Python code. It is because, most of the time, you want to do something with the data you retrieve. In this class, we'll use Python to do so (since we already use this language for other topics).
However, make no mistake, we are using three types of “programs” under the hood. Our “view”, or graphical interface, is the Jupyter notebook (even if you read this as a web page, it is just a notebook whose result was “exported” to a web page), which renders the result of whatever code we run in the code cells. We have Python, which is going to manage all the logic once the data has been retrieved from the disk. As such, it might be compared to a “controller” layer (this is a broad approximation; go along with it for now, we'll go into more detail on those concepts in the theory sessions). Finally, we have the SQL server - a program that lurks in the dark (or runs as a process on your computer), waiting for another program to send it a query (again, an approximation in the case of SQLite, the engine we are using, but most SQL systems work with server programs; the SQLite engine is actually started at the moment you submit a query to it). Once it receives a valid query, it goes to the disk to retrieve the required data and sends the result of the query back to the program that requested it. All this happens in a split second and you have the impression that everything happens in one place, but there is in fact a whole message flow going on.
import pandas as pd
import sqlite3
# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)
# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source
import os.path
if not os.path.isfile("finnhub.db"):  # If you don't have the file yet...
    import urllib.request
    urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/finnhub.db", "finnhub.db")  # download it
con = sqlite3.connect("finnhub.db") # create a connection with the database
query = "SELECT * FROM sqlite_master" # This is the SQL code that we want to execute
pd.read_sql_query(query, con) # We use the pandas library to send the query through the connection
# retrieve the data and format it into a table
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | Symbols | Symbols | 2 | CREATE TABLE "Symbols" (\n"description" TEXT,\... |
1 | table | Companies | Companies | 3 | CREATE TABLE "Companies" (\n"address" TEXT,\n ... |
2 | table | Peers | Peers | 32 | CREATE TABLE "Peers" (\n"symbol" TEXT,\n "pee... |
3 | table | Executives | Executives | 49 | CREATE TABLE "Executives" (\n"age" TEXT,\n "n... |
4 | table | Candlestick | Candlestick | 74 | CREATE TABLE "Candlestick" (\n"close_price" RE... |
The second point I need to discuss before starting the explanation is the database that we are going to use in the examples. It is a base containing information about listed companies and their top executives (their boards of directors). It was collected by Benoit Decorte using the API from Finnhub (we will see how to retrieve this kind of data in an automated fashion during a future session). I hosted the database on this website, and the boilerplate code downloads it to the same folder as your notebook if it is not already present. The ERD for this base is the following.
As you can see, we have a record of companies along with their executives, some groups of other companies considered as peers, their financial results (summarized in candlestick data), some evaluations by financial analysts and the headlines of some news involving the company. What links all the information about a company is a specific field, called a ticker or symbol.
Remark that, in this schema, the table Symbols is easily bypassed: since the ticker is the foreign key in all other tables (and, in fact, even the primary key in the table Companies), we can join the tables using this key directly. This will become clearer as we go.
After this super long introduction, we can go ahead and learn the SELECT statement. The syntax is plain and we will go through all the keywords successively. The statement follows this pattern:
SELECT [fields or aggregates]
FROM [tables or other selects, that might be joined together]
WHERE [conditions on the lines]
GROUP BY [aggregation criteria]
HAVING [conditions on the aggregates]
ORDER BY [fields on which to order and direction]
LIMIT [number of final lines to show]
Don’t let the simple syntax fool you. With this, you can answer as many questions as your heart desires (if you have a relational database with the data, of course). Let’s start!
SELECT … FROM …
The SELECT statement starts with the… SELECT clause (a statement is like a function, while a clause is a part of this statement). In this clause, you list the fields you would like to retrieve in the result. Together with the FROM clause, in which you name the tables from which you want to retrieve the fields, they are the only mandatory parts of a query.
Assume you want to have the name of all the companies in the database. You could write the query in the following way (the three double quotes in Python enable the creation of a multi-line string; SQL disregards carriage returns - going to a new line - but, for your sanity, try to write your queries on several lines to make them more readable, and keep a uniform style as much as possible):
query = """ SELECT name
FROM Companies
"""
pd.read_sql_query(query, con)
name | |
---|---|
0 | AGILENT TECHNOLOGIES INC |
1 | ALCOA CORP |
2 | PERTH MINT PHYSICAL GOLD ETF |
3 | None |
4 | DORSEY WRIGHT ADR ETF |
... | ... |
216 | WIDEPOINT CORP |
217 | XEROX CORP |
218 | YELP INC |
219 | ZILLOW GROUP INC |
220 | ZEDGE INC |
221 rows × 1 columns
We see that, here, we have 221 companies and their names. Note that we have some companies for which the name is empty. We'll fix that later (in fact, those are tickers that correspond to funds and the like rather than to companies).
This is the simplest example we can think of. With just this, we can only show all the rows of a table. The most sophisticated operations we can perform are basic math operations within the same row. For example, see that in the table Executives you have the age of the person and the year she became a board member? Using simply SELECT and FROM, we can find out the age at which she became a board member (plus or minus 1 year, since the granularity of the data here is at the level of the year).
query = """ SELECT name, age - 2020 + since
FROM Executives
"""
pd.read_sql_query(query, con)
name | age - 2020 + since | |
---|---|---|
0 | Jeffery H. Boyd | 60 |
1 | Glenn D. Fogel | 56 |
2 | David I. Goulden | 57 |
3 | Gillian Tans | 47 |
4 | Peter J. Millones | 48 |
... | ... | ... |
1389 | Gerald D. Held | 60 |
1390 | Kathryn M. Hill | 55 |
1391 | Deborah Linda Kerr | 44 |
1392 | Scott F. Schenkel | 48 |
1393 | George T. Shaheen | 59 |
1394 rows × 2 columns
In many relational database systems, we strive to attain what is called normalization. It means that we seek to reduce data redundancy (an explanation is available in the YouTube video here). Since each piece of data is in its own table, it makes sense to want to merge information coming from several tables. For example, imagine that you want to have the name of all the board members along with the name of the company in which they serve. This information is in two different tables: the table Companies contains the name of the company (under the field name) while the table Executives contains the name of the board member, also under the field name. To make both pieces of information appear in the same result, we need to use what is called a join.
You can declare a JOIN in the FROM clause. The concept is the following: you give the name of the first table, then JOIN, then the second table, and then you use the keyword ON to specify the fields that should be common between the two tables. Seems abstract? Here's an example. We know that symbol in the table Executives is a foreign key (this is why there is a little fork arriving at it) from the table Symbols, and we know that the table Symbols has a one-to-one relation with the table Companies (meaning that each symbol corresponds to at most one company and that each company has at most one symbol, or ticker). Well then, it means that the value in the field symbol of the table Executives is a copy of a value in the field ticker in the table Companies. If we were to produce all possible combinations of rows from Executives with the rows from Companies and keep only the ones for which the field symbol in Executives is equal to the field ticker in Companies, we would indeed have all the information about the executives alongside all the information about the company in which they serve. That's precisely what a join does.
Watch this unfold in a very small example:
Converting this into code, we have:
query = """SELECT Executives.name, Companies.name
FROM Executives JOIN Companies
ON (Executives.symbol = Companies.ticker)
"""
pd.read_sql_query(query, con)
name | name | |
---|---|---|
0 | Jeffery H. Boyd | BOOKING HOLDINGS INC |
1 | Glenn D. Fogel | BOOKING HOLDINGS INC |
2 | David I. Goulden | BOOKING HOLDINGS INC |
3 | Gillian Tans | BOOKING HOLDINGS INC |
4 | Peter J. Millones | BOOKING HOLDINGS INC |
... | ... | ... |
1389 | Gerald D. Held | NETAPP INC |
1390 | Kathryn M. Hill | NETAPP INC |
1391 | Deborah Linda Kerr | NETAPP INC |
1392 | Scott F. Schenkel | NETAPP INC |
1393 | George T. Shaheen | NETAPP INC |
1394 rows × 2 columns
Now, this type of JOIN is the most common. It is called an INNER join. In it, only the rows that have a correspondence in the other table will show up in the results. For instance, imagine we are joining a table containing courses with a table containing professors. If we use an inner join, a course that has no professor assigned will not appear in the result, and a professor who isn't connected to any course will not appear either. Now imagine you want to make sure that all courses appear once and, if there is a professor responsible for one, you want to see the name of the professor. This is a job for a LEFT join. A left join will necessarily show all the records of the table that comes before the LEFT JOIN keyword and only show the information from the rows of the second table if they match a row from the first table.
Most of the time, you'll end up doing either an inner join (very frequently) or a left join (seldom). Besides those, there are also the right and the outer joins, which are outside the scope of this class.
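To make the difference concrete, here is a self-contained left-join sketch reusing the toy courses-and-professors example from the introduction (invented data, not the finnhub base):

```python
import sqlite3

# Toy in-memory tables: one course has a professor, the other has none (NULL).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Professors (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses (course_id TEXT, name TEXT, professor_id INTEGER);
INSERT INTO Professors VALUES (1, 'Niccolo');
INSERT INTO Courses VALUES ('GEST-S482', 'The Digital Firm', 1),
                           ('GEST-S999', 'Orphan course', NULL);
""")
# LEFT JOIN keeps every course; the unmatched one gets NULL (None) as name.
rows = con.execute("""SELECT Courses.name, Professors.name
                      FROM Courses LEFT JOIN Professors
                      ON Courses.professor_id = Professors.employee_id""").fetchall()
```

An INNER join on the same data would have dropped the orphan course entirely.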
WHERE…
What we have so far is a way to retrieve all the rows from a table or from the result of a join of tables. That's nice but not very useful for many scenarios. The first refinement is to be able to filter out some of the rows from the tables. This is what the WHERE clause does. It checks a condition, or series of conditions, for each row. If the conditions are verified for this row, the row will show up in the results; otherwise it is discarded from the query.
The conditions are fairly classical. You can test an equality with = (do not mix it up with Python, where equality is checked with ==), “greater than” with >, “smaller than” with <, “greater than or equal” with >=, “smaller than or equal” with <= and “different from” with <> (which is also different from Python, where this is evaluated with !=). Besides that, for text and chains of characters, you can check patterns by using the keyword LIKE together with percentage signs % as wildcards representing zero or more arbitrary characters. For example, imagine you just want to see the executives that are younger than 40 years. You can do so with:
query = """SELECT name, age
FROM Executives
WHERE age < 40
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Eyal Lalo | |
1 | Michael Porter | 38 |
2 | Mark Locks | |
3 | Jon Sanders | |
4 | Michael Friedman | |
... | ... | ... |
190 | Jared Smith | |
191 | Matt Frey | |
192 | William C. Martin | 39 |
193 | Spencer Scott | |
194 | Atish Gude |
195 rows × 2 columns
That's a good start but, as you can see, we also have a bunch of executives for whom we do not know the age. For those, the value in the database was coded as an empty string ("") and it is evaluated as smaller than any number. We need to get rid of those. The trick here is to compose several conditions using the keyword AND. We can just add an additional condition that filters out such values. As such:
query = """SELECT name, age
FROM Executives
WHERE age < 40 AND age <> ""
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Michael Porter | 38 |
1 | Alexander B. Spiro | 35 |
2 | Benjamin Silbermann | 36 |
3 | Evan Sharp | 36 |
4 | Anneka R. Gupta | 31 |
... | ... | ... |
60 | Melissa Thomas | 39 |
61 | Joseph Levin | 39 |
62 | Julia Hartz | 39 |
63 | Joseph Levin | 39 |
64 | William C. Martin | 39 |
65 rows × 2 columns
If we want to see all the executives that are younger than 40 years and whose first name starts with the letter J, we could do:
query = """SELECT name, age
FROM Executives
WHERE age < 40
AND age <> ""
AND name LIKE "J%"
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Joseph Levin | 39 |
1 | John Paul Knab | 38 |
2 | Jesse A. Cohn | 38 |
3 | Jose J. Vargas | 39 |
4 | Jayme Mendal | 33 |
5 | Joseph Levin | 39 |
6 | Jamie Cohen | 32 |
7 | Jantoon Reigersman | 37 |
8 | Joseph Bzdyl | 36 |
9 | Joseph Levin | 39 |
10 | Julia Hartz | 39 |
11 | Joseph Levin | 39 |
Here, remark that LIKE "J%" matches any string starting with a J, whatever follows it. You can always invert a condition using the keyword NOT:
query = """SELECT name, age
FROM Executives
WHERE age < 40
AND age <> ""
AND NOT name LIKE "J%"
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Michael Porter | 38 |
1 | Alexander B. Spiro | 35 |
2 | Benjamin Silbermann | 36 |
3 | Evan Sharp | 36 |
4 | Anneka R. Gupta | 31 |
5 | Rachel Barnett | 37 |
6 | Christina Sindoni Ciocca | 31 |
7 | Liqun Liu | 38 |
8 | Chelsea Clinton | 39 |
9 | Sean K. Middleton | 37 |
10 | Tiffany Kuo | 27 |
11 | Michael Ross Pope | 35 |
12 | Gregory R. Blank | 38 |
13 | Zach Thomann | 35 |
14 | Benjamin L. Rosenzweig | 33 |
15 | Lev Peker | 36 |
16 | David Meniane | 35 |
17 | Ryan S. Schram | 39 |
18 | Garrard Brown | 36 |
19 | Katrina Lake | 36 |
20 | Chad Summe | 39 |
21 | Raymond T. Walsh | 33 |
22 | David B. Peters | 39 |
23 | Vikram R. Raghavan | 36 |
24 | Nathan Forrest Auwerda | 33 |
25 | Amit Goyal | 39 |
26 | Kamelia Aryafar | 33 |
27 | Sumit Goyal | 34 |
28 | Seth A. Moore | 36 |
29 | Logan D. Green | 35 |
30 | Pouyan Salehi | 39 |
31 | William Gray Stream | 39 |
32 | Maureen Mullen | 38 |
33 | Matias de Tezanos | 38 |
34 | Michael Montano | 33 |
35 | Bret Taylor | 39 |
36 | Darryl Auguste | 30 |
37 | Shimrit Markette | 35 |
38 | Mira Wilczek | 37 |
39 | Katrina Lake | 36 |
40 | Samuel H Altman | 30 |
41 | Chelsea Clinton | 39 |
42 | Ismail Azeri | 39 |
43 | Mark Zuckerberg | 34 |
44 | Evan Spiegel | 28 |
45 | Robert C. Murphy | 30 |
46 | Poppy Thorpe | 34 |
47 | Erin N. Lantz | 38 |
48 | Matthew B. Salzberg | 35 |
49 | Ilia M. Papas | 37 |
50 | Tracy Britt Cool | 35 |
51 | Melissa Thomas | 39 |
52 | William C. Martin | 39 |
GROUP BY…
This is the last difficult part of the SELECT statement. The GROUP BY clause enables performing aggregation. Remember that the SELECT clause enables doing simple operations within a row, such as adding or subtracting two fields? Well, the GROUP BY clause makes it possible to do some operations across rows, on a certain field. For example, you can find the average age of executives using the following syntax:
query = """SELECT AVG(age)
FROM Executives
"""
pd.read_sql_query(query, con)
AVG(age) | |
---|---|
0 | 50.672166 |
Indeed, you “squeezed” all the rows into one, summarizing the values in the age field by taking their average. This is neat, but what does the GROUP BY have to do with it? Well, by default, if you don't specify otherwise, aggregation operations in the SELECT clause (such as AVG, MAX, MIN, SUM and COUNT) will squeeze the table all the way down to 1 row. Most of the time, the question you have only requires squeezing the table up to a certain level. For instance, knowing that the average age of directors is 50.67 years is nice, but it is of no use if you want to know which companies employ the youngest directors. This is where the GROUP BY shines. By specifying a field or set of fields in this clause, you force the database engine to aggregate within each value of this field (or unique combination of values in the set of fields). As such, if you wonder what the average age of the directors in each company is, you can do (the WHERE age <> "" is there to avoid counting people without a recorded age, whose age would otherwise be treated as 0):
query = """SELECT symbol, AVG(age)
FROM Executives
WHERE age <> ""
GROUP BY symbol
"""
pd.read_sql_query(query, con)
symbol | AVG(age) | |
---|---|---|
0 | AAPL | 58.933333 |
1 | ALOT | 62.363636 |
2 | AMRH | 53.800000 |
3 | AMZN | 58.812500 |
4 | ANGI | 46.266667 |
... | ... | ... |
97 | WDC | 58.285714 |
98 | WTRH | 48.400000 |
99 | WYY | 54.400000 |
100 | XRX | 53.454545 |
101 | YELP | 50.916667 |
102 rows × 2 columns
In this case, you end up with one line per symbol and the average age of the directors whose records carry this symbol as a foreign key.
While the average (AVG), sum (SUM), maximum (MAX) and minimum (MIN) all behave as you would expect, COUNT is a bit special: it counts the number of rows for which the combination of fields that you pass between parentheses is not entirely NULL. Most of the time, using COUNT, you really want to count the number of underlying rows in your query; this is why COUNT is often used with a star * (meaning “all the fields available”) between the parentheses. For example, in order to count the number of directors in each company, you can do:
query = """SELECT symbol, COUNT(*)
FROM Executives
WHERE age <> ""
GROUP BY symbol
"""
pd.read_sql_query(query, con)
symbol | COUNT(*) | |
---|---|---|
0 | AAPL | 15 |
1 | ALOT | 11 |
2 | AMRH | 5 |
3 | AMZN | 16 |
4 | ANGI | 15 |
... | ... | ... |
97 | WDC | 14 |
98 | WTRH | 10 |
99 | WYY | 5 |
100 | XRX | 11 |
101 | YELP | 12 |
102 rows × 2 columns
In this case, COUNT(*) counts all the rows in each group.
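The difference between COUNT(*) and COUNT(field) can be checked on a tiny invented table: COUNT(*) counts every row, while COUNT(age) skips the rows where age is NULL.

```python
import sqlite3

# Toy in-memory table where one person has no recorded age (NULL).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE People (name TEXT, age INTEGER);
INSERT INTO People VALUES ('Ada', 36), ('Bob', NULL), ('Cleo', 41);
""")
all_rows, with_age = con.execute("SELECT COUNT(*), COUNT(age) FROM People").fetchone()
# COUNT(*) sees 3 rows; COUNT(age) only counts the 2 rows with a non-NULL age.
```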
HAVING…
Sometimes, it makes sense to want to filter some lines based on the result of an aggregation. The issue is that clauses are executed sequentially and, when the aggregation is performed, the WHERE clause has already been executed (because otherwise, you would have wrong values in your aggregated fields). This is the reason why the HAVING clause exists. For example, the following code will produce an error:
You can use the keyword AS to give an alias to your field. This is especially useful when you create new aggregated fields.
query = """SELECT symbol, AVG(age) AS average_age
FROM Executives
WHERE average_age < 40 AND age <> ""
GROUP BY symbol
"""
pd.read_sql_query(query, con)
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
1680 try:
-> 1681 cur.execute(*args, **kwargs)
1682 return cur
OperationalError: misuse of aggregate: AVG()
The above exception was the direct cause of the following exception:
DatabaseError Traceback (most recent call last)
<ipython-input-12-6a25273a221f> in <module>
5 """
6
----> 7 pd.read_sql_query(query, con)
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize)
381 coerce_float=coerce_float,
382 parse_dates=parse_dates,
--> 383 chunksize=chunksize,
384 )
385
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
1725
1726 args = _convert_params(sql, params)
-> 1727 cursor = self.execute(*args)
1728 columns = [col_desc[0] for col_desc in cursor.description]
1729
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
1691
1692 ex = DatabaseError(f"Execution failed on sql '{args[0]}': {exc}")
-> 1693 raise ex from exc
1694
1695 @staticmethod
DatabaseError: Execution failed on sql 'SELECT symbol, AVG(age) AS average_age
FROM Executives
WHERE average_age < 40 AND age <> ""
GROUP BY symbol
': misuse of aggregate: AVG()
This huge error message basically tells us that we cannot do that, for a good reason: we try to use the result of the AVG aggregation in the WHERE clause but, at the moment this clause is evaluated, the aggregate has not yet been calculated. To fix this and retrieve the companies whose directors are, on average, younger than 40, we need to use the HAVING clause:
query = """SELECT symbol, AVG(age) AS average_age
FROM Executives
WHERE age <> ""
GROUP BY symbol
HAVING average_age < 40
"""
pd.read_sql_query(query, con)
symbol | average_age |
---|
It turns out none of the companies in our base has a young board of directors. It's part of the reason we're in this mess, if you ask me, but anyway. For the rest, the HAVING clause can perform most of the same checks as the WHERE clause.
ORDER BY…
Perhaps unsurprisingly, the ORDER BY clause enables sorting the results of the query. It is usually a fairly unsophisticated clause; the supported orders are the numerical order (1, 2, 3, 7 or 1.1, 1.2, 1.7, 6.5) and the alphabetical order. You can also specify whether you want it from small to large (using the ASC, for ascending, keyword, which is the default) or from large to small (using the DESC, for descending, keyword). You can also specify several fields: it will start sorting on the first and, if there are ties, those will be resolved by resorting to the second field, and so on.
In our example, assume we want to show all directors sorted alphabetically by the name of their company, then, for people in the same company, by decreasing order of age (from oldest to youngest) and, if two directors from the same company have the same age, alphabetically by name (we do not expect two directors to have exactly the same name, so we stop there). We can do so as such:
You can alias a table by writing the alias just after the table name. For example, in this query, we alias the table Companies as c by writing FROM Companies c. In the rest of the query, every time we type c, the engine understands that we refer to the table Companies.
query = """SELECT c.name, e.name, e.age
FROM Companies c
JOIN Executives e
ON (c.ticker = e.symbol)
ORDER BY c.name ASC,
e.age DESC,
e.name
"""
pd.read_sql_query(query, con)
name | name | age | |
---|---|---|---|
0 | 1-800-FLOWERS.COM | James A. Cannavino | 75 |
1 | 1-800-FLOWERS.COM | Eugene F. DeMark | 72 |
2 | 1-800-FLOWERS.COM | Adam Hanft | 69 |
3 | 1-800-FLOWERS.COM | James Francis McCann | 68 |
4 | 1-800-FLOWERS.COM | Leonard J. Elmore | 67 |
... | ... | ... | ... |
1389 | YELP INC | Laurence Wilson | 46 |
1390 | YELP INC | James Miln | 45 |
1391 | YELP INC | George Hu | 44 |
1392 | YELP INC | Jeremy Stoppelman | 41 |
1393 | YELP INC | Vivek Patel | 40 |
1394 rows × 3 columns
Here, remark that we asked to sort first by company name in alphabetical order, then by the executive’s age in descending order, then by the executive’s name in alphabetical order. We omitted the ASC on that last field, which is no big deal since it is the default. We could also have dropped it after c.name.
LIMIT
The last and maybe easiest clause is LIMIT. It is executed at the end of the query and truncates the result. Followed by one positive number x, it shows the first x results. (Beware: in SQLite, a negative number simply removes the limit; other engines may reject it outright.) Finally, if you give it two numbers, x and y, it skips the first x rows and shows the y rows that follow.
query = """SELECT symbol, AVG(age) as average_age
FROM Executives
GROUP BY symbol
ORDER BY average_age
LIMIT 3
"""
pd.read_sql_query(query, con)
symbol | average_age | |
---|---|---|
0 | SONM | 17.166667 |
1 | DXC | 31.064516 |
2 | ANGI | 31.545455 |
This gives you the 3 companies whose directors are the youngest on average.
query = """SELECT symbol, AVG(age) as average_age
FROM Executives
GROUP BY symbol
ORDER BY average_age
LIMIT 2, 5
"""
pd.read_sql_query(query, con)
symbol | average_age | |
---|---|---|
0 | ANGI | 31.545455 |
1 | BOXL | 32.555556 |
2 | IMMR | 32.666667 |
3 | APRN | 37.307692 |
4 | MEET | 37.583333 |
This query skips the first 2 rows of the ordered result and returns the next 5, starting with the company at index 2 (ANGI in this case, since the index starts at 0).
Tying everything together
You now know everything you need to perform arbitrarily complex queries in relational databases.
Let’s try to do some (very) complex queries. In the first one, I will figure out the average age of the executives of the company that registered the highest relative stock price increase during the month of January 2019. So that you can appreciate all the challenges and understand the way of thinking, I’ll go step by step without preparation. Please be patient as I make all the mistakes I warned you about above…
query = """
SELECT AVG(age)
FROM
Executives
JOIN
(SELECT last_quotation.symbol, close_price/open_price AS evolution
FROM
(SELECT symbol, close_price
FROM Candlestick
WHERE timestamp = "2019-01-31") last_quotation
JOIN
(SELECT symbol, open_price
FROM Candlestick
WHERE timestamp = "2019-01-02") first_quotation
ON last_quotation.symbol = first_quotation.symbol
ORDER BY evolution DESC
LIMIT 1) max_evol
ON Executives.symbol = max_evol.symbol
"""
pd.read_sql_query(query, con)
AVG(age) | |
---|---|
0 | 32.555556 |
Go Further
Look at how the Peers table is constructed. It has two foreign keys, both referring to companies. Concretely, this allows representing a directed graph in relational form. Querying it is not hard once you understand the difference between the table on disk and an invocation of that table in a query. In this video, we retrieve the names of those companies using an SQL query.
query = """
SELECT origin.name, destination.name
FROM
Peers
JOIN Companies origin
ON Peers.symbol = origin.ticker
JOIN Companies destination
ON Peers.peers_symbol = destination.ticker
"""
pd.read_sql_query(query, con)
name | name | |
---|---|---|
0 | BOOKING HOLDINGS INC | SHUTTERSTOCK INC |
1 | BOOKING HOLDINGS INC | LANDS' END INC |
2 | BOOKING HOLDINGS INC | QUOTIENT TECHNOLOGY INC |
3 | BOOKING HOLDINGS INC | RUBICON PROJECT INC |
4 | BOOKING HOLDINGS INC | GRUBHUB INC |
... | ... | ... |
2811 | NETAPP INC | SOCKET MOBILE INC |
2812 | NETAPP INC | INTEVAC INC |
2813 | NETAPP INC | TRANSACT TECHNOLOGIES INC |
2814 | NETAPP INC | IMMERSION CORP |
2815 | NETAPP INC | QUALSTAR CORP |
2816 rows × 2 columns
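The double invocation of the Companies table is easier to see on a tiny made-up graph (all names invented for illustration):

```python
import sqlite3
import pandas as pd

# A tiny directed graph with invented companies
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Companies (ticker TEXT, name TEXT);
CREATE TABLE Peers (symbol TEXT, peers_symbol TEXT);
INSERT INTO Companies VALUES ('A', 'Acme'), ('B', 'Bolt'), ('C', 'Crux');
INSERT INTO Peers VALUES ('A', 'B'), ('A', 'C'), ('B', 'C');
""")

# Companies is joined twice, once per endpoint of each edge;
# the aliases keep the two invocations of the same table apart
query = """SELECT origin.name AS origin, destination.name AS destination
           FROM Peers
           JOIN Companies origin ON Peers.symbol = origin.ticker
           JOIN Companies destination ON Peers.peers_symbol = destination.ticker"""
df = pd.read_sql_query(query, con_demo)
print(df)
```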
Check your understanding
Once again, this was a long session, but it is fundamental for what’s to come (and for your life in general). Here are a few questions to check whether you remember what’s in this class. This does not count towards your grade and the answers are anonymous. Try it as many times as you want.
Exercises
ERD I - Fakebook
You want to model a simple competitor to Facebook. You need users, who may become friends with each other and can publish posts, to which their friends may then react or comment; some posts may contain one or more pictures. Design the ER diagram representing this app.
See solution
We model just enough of Facebook to turn a profit. There will always be time to become multi-billionaires later.
Here’s the result of our efforts. Let the VC know, there’s a new kid in town!
Note, however, that if you do not have the same solution, it does not mean that yours is wrong. The thing with ERDs is that there is more than one way to skin a cat, if only because everyone has a different understanding of any problem. The important thing is to be able to justify your choices and to understand their consequences. That said, some designs are dysfunctional in the sense that they cannot accommodate common-sense situations (think of a one-to-one relation between Users and Posts, which would mean that users could only post something once and then bail). The name of the game is to avoid those nonsensical drawings. Everything else is fair game, as long as what you draw represents what you had in mind.
<figure >
I guess we are ready to be rich now.
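As a hedged sketch, here is one possible translation of such a design into SQLite DDL. Every table and column name below is our own choice, not the reference solution; in particular we fold “reactions” into the Comments table to keep things short:

```python
import sqlite3

# One possible DDL for the Fakebook ERD; every name here is our own choice
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Users (user_id INTEGER PRIMARY KEY, name TEXT);
-- friendship: a many-to-many relation between Users and itself
CREATE TABLE Friendships (
    user_id   INTEGER REFERENCES Users(user_id),
    friend_id INTEGER REFERENCES Users(user_id),
    PRIMARY KEY (user_id, friend_id));
CREATE TABLE Posts (
    post_id   INTEGER PRIMARY KEY,
    author_id INTEGER REFERENCES Users(user_id),
    content   TEXT);
-- a post may carry zero or more pictures (one-to-many)
CREATE TABLE Pictures (
    picture_id INTEGER PRIMARY KEY,
    post_id    INTEGER REFERENCES Posts(post_id),
    url        TEXT);
-- comments (and, in this simplification, reactions) by friends on a post
CREATE TABLE Comments (
    comment_id INTEGER PRIMARY KEY,
    post_id    INTEGER REFERENCES Posts(post_id),
    author_id  INTEGER REFERENCES Users(user_id),
    content    TEXT);
""")
tables = [r[0] for r in con_demo.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```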
ERD II - The CoViD app
Design an ER diagram for an app that tracks CoViD. You want to be able to figure out who was, or may have been, in contact with people affected by the Corona Virus Disease. As a rule, you do not need to track people on their way to and from places, only inside closed places. You can use any tool to model the ERD; if you have no preference, you can use DrawSQL with a trial account.
See solution
The solution is actually less complex than one could expect. Of course, in reality, most of the challenge consists in keeping private data safe and optimizing searches so that you don’t have to wait days for each query to execute on the gigantic database.
And here is the result of this exercise. Again this is not the unique solution. You might have thought of additional entities that could be represented and used in the disease tracking workflow.
<figure >
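As an illustration of what such a design enables, here is a minimal sketch of the contact query, assuming a hypothetical schema with Persons and timestamped Visits to closed places (all names and data are invented):

```python
import sqlite3
import pandas as pd

# Hypothetical schema: persons and their timestamped visits to closed places
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Persons (person_id INTEGER PRIMARY KEY, name TEXT, infected INTEGER);
CREATE TABLE Visits (person_id INTEGER, place_id INTEGER,
                     arrival TEXT, departure TEXT);
INSERT INTO Persons VALUES (1, 'Ann', 1), (2, 'Ben', 0), (3, 'Cara', 0);
INSERT INTO Visits VALUES
    (1, 10, '2020-03-01 10:00', '2020-03-01 12:00'),
    (2, 10, '2020-03-01 11:00', '2020-03-01 13:00'),
    (3, 10, '2020-03-01 14:00', '2020-03-01 15:00');
""")

# Two visits to the same place overlap when each starts before the other ends
query = """SELECT DISTINCT p.name
           FROM Visits v_inf
           JOIN Persons inf ON v_inf.person_id = inf.person_id
                            AND inf.infected = 1
           JOIN Visits v ON v.place_id = v_inf.place_id
                        AND v.person_id <> v_inf.person_id
                        AND v.arrival < v_inf.departure
                        AND v_inf.arrival < v.departure
           JOIN Persons p ON v.person_id = p.person_id"""
contacts = pd.read_sql_query(query, con_demo)
print(contacts)  # Ben overlapped with Ann; Cara came too late
```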
ERD III - The Bill of Material
You want to represent the “recipe book” for your product: an electric bike. Since you manufacture all parts of the bike yourself, you start from raw materials and assemble everything. You therefore have 3 types of materials: finished products, intermediary products and raw materials. Only raw materials have directly computable costs.
You use raw materials and intermediary products to produce both finished products and other intermediary products. Since you want to be able to compute the total cost of each product, you need to trace all the products that enter the production of any specific product at any stage (be it to produce the final product or any of its intermediary components).
In technical terms, this is known as a Bill of Materials (or BOM). Since we are only interested in the cost of materials, you do not need to include the costs of the process. Design the ERD that represents this problem.
See solution
This exercise is a little more complex in its solution, as the situation is a little less familiar to the average database designer. The point is that you do not need to constrain the form of the network representing the BOM: a simple table with two connections to the Products (or Materials, or whatever) table is all you need.
Note that you might have wanted to separate raw inputs, semi-finished goods and finished products into separate tables. If you go this way, I’d still recommend creating a single abstract table where all of those get an ID. Having one ID for the concept of material or product avoids having to create 3 fields (raw_material_id, semi_finished_good_id and finished_good_id); otherwise you would also need several fields for the product being constructed. Similarly, creating a BOM table for each type of material is a bad idea, since most of the fields would be the same and there is nothing special about a recipe for a semi-finished good compared to one for a finished good. The UnitsOfMeasure table is more of a detail and it is totally understandable if you don’t have it in your design.
<figure >
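As a sketch of why the single Materials table pays off: the total raw-material cost of a product can then be rolled up with one recursive query. Recursive queries (WITH RECURSIVE) were not covered in this session, and the miniature schema and data below are made up, so treat this as an illustration rather than the reference solution:

```python
import sqlite3

# Miniature BOM: one Materials table, one BOM table with two FKs into it
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Materials (material_id INTEGER PRIMARY KEY,
                        name TEXT, unit_cost REAL);
CREATE TABLE BOM (parent_id INTEGER REFERENCES Materials(material_id),
                  child_id  INTEGER REFERENCES Materials(material_id),
                  quantity  REAL);
-- only raw materials (spoke, rim) carry a direct cost
INSERT INTO Materials VALUES
    (1, 'bike', NULL), (2, 'wheel', NULL), (3, 'spoke', 0.5), (4, 'rim', 8.0);
INSERT INTO BOM VALUES (1, 2, 2), (2, 3, 32), (2, 4, 1);
""")

# Walk the BOM downwards from the bike, multiplying quantities along the way
query = """WITH RECURSIVE needs(material_id, qty) AS (
               SELECT 1, 1.0
               UNION ALL
               SELECT b.child_id, needs.qty * b.quantity
               FROM BOM b JOIN needs ON b.parent_id = needs.material_id)
           SELECT SUM(needs.qty * m.unit_cost)
           FROM needs JOIN Materials m ON needs.material_id = m.material_id
           WHERE m.unit_cost IS NOT NULL"""
total_cost = con_demo.execute(query).fetchone()[0]
print(total_cost)  # 64 spokes at 0.5 plus 2 rims at 8.0 = 48.0
```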
SQL - Setup
Let’s play with the stock price database that we have. Let’s start by making sure that we have it: download the full database. Beware that it is quite heavy (250Mb or so), so executing this cell may take quite a while. Fortunately, you only need to execute it once, as subsequent runs will find the previously downloaded version.
import pandas as pd
import sqlite3
# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)
# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source
import os.path
if not os.path.isfile("stocks_full.db"): # If you don't have the file yet...
import urllib.request
urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/stocks_full.db",
"stocks_full.db") # download it
con = sqlite3.connect("stocks_full.db") # create a connection with the database
query = "SELECT * FROM sqlite_master" # This is the SQL code that we want to execute
pd.read_sql_query(query, con) # We use the pandas library to send the query through the connection
# retrieve the data and format it into a table
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | Symbols | Symbols | 2 | CREATE TABLE "Symbols" (\n"symbol" TEXT,\n "d... |
1 | table | Companies | Companies | 26 | CREATE TABLE "Companies" (\n"ticker" TEXT,\n ... |
2 | table | Peers | Peers | 276 | CREATE TABLE "Peers" (\n"symbol" TEXT,\n "pee... |
3 | table | Executives | Executives | 20214 | CREATE TABLE "Executives" (\n"id" INTEGER,\n ... |
4 | table | Candlesticks | Candlesticks | 20338 | CREATE TABLE "Candlesticks" (\n"id" INTEGER,\n... |
5 | table | Recommendations | Recommendations | 57926 | CREATE TABLE "Recommendations" (\n"id" INTEGER... |
6 | table | Evaluations | Evaluations | 57942 | CREATE TABLE "Evaluations" (\n"id" INTEGER,\n ... |
7 | table | MajorDevelopments | MajorDevelopments | 57985 | CREATE TABLE "MajorDevelopments" (\n"id" INTEG... |
SQL I - Thinking like a finance maverick
First, let me preface these questions with a disclaimer: this is not a finance course. So what we are going to do during this session probably makes little to no financial sense (please don’t show this notebook to any finance professor).
That being said, let’s start. We are first going to build our very own index to assess the attractiveness of sectors! That’s right, Goldman Sachs: beware!
In order to do so, produce the total number of recommendations emitted in 2019 for each industry in the dataset (the gind field in the Companies table). Technically, we want the sums of the strongBuy, buy, hold, sell and strongSell recommendations emitted by experts in 2019. Note that you do not need to compute them at the level of individual evaluations, since this was already done for you in the Recommendations table.
To get there, think of combining aggregations and joins across the tables to retrieve the information you want.
See solution
This exercise illustrates how joins and aggregations may be combined to produce views of your data at different levels of a hierarchy. In this case, companies are grouped along one of their dimensions, the industry they are a part of, to produce statistics on those aggregates. (Note that there is an edit in this video around the end, so the video seems to jump. This is normal; I still need to hone my mad video editing skillz.)
query = """
SELECT gind as Industry, SUM(strongBuy),
SUM(buy), SUM(hold), SUM(sell), SUM(strongSell)
FROM Companies c
JOIN Recommendations r
ON (c.ticker = r.symbol)
WHERE r.period LIKE "2019%" AND c.gind <> "N/A"
GROUP BY gind
"""
pd.read_sql_query(query, con)
Industry | SUM(strongBuy) | SUM(buy) | SUM(hold) | SUM(sell) | SUM(strongSell) | |
---|---|---|---|---|---|---|
0 | IT Services | 567 | 1125 | 889 | 110 | 22 |
1 | Interactive Media & Services | 759 | 1309 | 1445 | 377 | 45 |
2 | Internet & Direct Marketing Retail | 746 | 1344 | 1504 | 161 | 14 |
3 | Technology Hardware, Storage & Peripherals | 648 | 1001 | 1323 | 124 | 38 |
4 | Trading Companies & Distributors | 34 | 22 | 142 | 1 | 1 |
SQL II - If advices were good, they wouldn’t be free
Financial analysts usually give advice about whether they think you should keep, buy or sell a stock (here, those are called Evaluations). Thing is, sometimes they are wrong. So we are going to take all the evaluations given by analysts during the month of October 2019 and check whether following each recommendation would have resulted in a profit or a loss over the following month. If the last recommendation of an analyst during the month was “Buy”, “Strong Buy”, “Outperform” or “Overweight” and the stock went up during November, the analyst gets a point; otherwise, they lose one. Same if the last recommendation of the analyst was “Sell”, “Underweight” or “Underperform” and the stock lost value.
Break this query in smaller, simpler queries and then just assemble them like legos.
To finish, establish a ranking of the analysts by decreasing order of points they got.
See solution
Step 1 - This query is not complicated from a syntax standpoint (there are very few subtleties in the SQL involved) but it is hard to organise. This is why we spend the first video thinking about how to break the problem down into a set of small components we can then put together. Try doing it by yourself before watching the video, as this is the single most important step in solving this exercise; the rest is just data plumbing. If you do not break the query down the same way I do, try implementing your solution in code and let me know whether you manage or hit a roadblock. There is usually more than one valid solution; just be sure you understand precisely what you do and what data you retrieve. For example, using MAX(gradeTime) directly to retrieve the last evaluation looks correct but might in fact produce wrong answers (in this particular case, it’s a pity SQLite is so tolerant of imprecise instructions).
and here is the map and the corresponding videos:
<figure >
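The SQLite tolerance mentioned above can be demonstrated on a toy table (invented data). With a single MAX() aggregate, SQLite fills the bare columns from the row that achieved the maximum; this is documented, SQLite-specific behaviour that most other engines reject outright, which is why relying on it blindly is risky:

```python
import sqlite3
import pandas as pd

# Toy evaluations table with two grades from the same analyst
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Evaluations (symbol TEXT, company TEXT,
                          toGrade TEXT, gradeTime TEXT);
INSERT INTO Evaluations VALUES
    ('AAA', 'BankX', 'Sell', '2019-10-01'),
    ('AAA', 'BankX', 'Buy',  '2019-10-20');
""")

# SQLite (and SQLite only) fills the bare toGrade column from the row
# that achieved the MAX(); most engines reject this query altogether
query = """SELECT symbol, toGrade, MAX(gradeTime) AS last_eval
           FROM Evaluations
           GROUP BY symbol, company"""
df = pd.read_sql_query(query, con_demo)
print(df)  # toGrade is 'Buy', the grade of the latest row
```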
Step 2 - We start by generating the performance of the stocks during the month of November 2019.
Note that hardcoding the timestamps of the first and last quotations of the month is a bad idea. However, since I’m feeling quite lazy and hypocritical today, let’s say fixing that is an exercise left for you to test your understanding.
query_perf = """
SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol
"""
pd.read_sql_query(query_perf, con)
symbol | var | |
---|---|---|
0 | FINX | 0.060153 |
1 | BTA | -0.015711 |
2 | NORW | -0.014141 |
3 | NIM | -0.024621 |
4 | VONG | 0.038809 |
... | ... | ... |
2157 | IBMI | 0.001394 |
2158 | MFV | -0.015152 |
2159 | MNCL | -0.001986 |
2160 | VTWO | 0.034113 |
2161 | VTI | 0.032387 |
2162 rows × 2 columns
Step 3 - Retrieving the last Evaluation by analysts for every company in the month of October 2019. Beware, this step is the trickiest of the whole exercise.
query_last_eval = """
SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company
"""
pd.read_sql_query(query_last_eval, con)
symbol | company | toGrade | |
---|---|---|---|
0 | AAPL | Bank of America | Buy |
1 | AAPL | Barclays | Equal-Weight |
2 | AAPL | Canaccord Genuity | Buy |
3 | AAPL | Cascend | Buy |
4 | AAPL | Cowen & Co. | Outperform |
... | ... | ... | ... |
150 | W | Morgan Stanley | Equal-Weight |
151 | W | Wells Fargo | Market Perform |
152 | WDC | Loop Capital | Buy |
153 | YELP | Morgan Stanley | Underweight |
154 | Z | Morgan Stanley | Equal-Weight |
155 rows × 3 columns
Step 4 - Using the two components built previously, we can now compute the hits and misses. Please note that in a business setting we would probably not build this whole query in one go; we would rather encapsulate some of the steps into views or stored procedures. We have not seen those here, so we’ll do without.
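For the curious, here is a minimal sketch of what such an encapsulation could look like in SQLite, which supports views (though not stored procedures); the miniature Candlesticks data is invented:

```python
import sqlite3
import pandas as pd

# Miniature Candlesticks table with invented prices
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Candlesticks (symbol TEXT, timestamp TEXT,
                           open_price REAL, close_price REAL);
INSERT INTO Candlesticks VALUES
    ('AAA', '2019-11-01', 100.0, 102.0),
    ('AAA', '2019-11-29', 108.0, 110.0);
-- a view stores the query itself; afterwards it can be used like a table
CREATE VIEW NovemberPerf AS
    SELECT open.symbol, (close.close_price / open.open_price) - 1 AS var
    FROM (SELECT symbol, open_price FROM Candlesticks
          WHERE timestamp = '2019-11-01') open
    JOIN (SELECT symbol, close_price FROM Candlesticks
          WHERE timestamp = '2019-11-29') close
    ON open.symbol = close.symbol;
""")
perf = pd.read_sql_query("SELECT * FROM NovemberPerf", con_demo)
print(perf)  # var = 110 / 100 - 1 = 0.1
```

Once defined, the view can replace the corresponding subquery everywhere it appears, which shortens the big queries below considerably.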
query_pos = """
SELECT company, COUNT(*)
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var > 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var < 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company
"""
query_neg = """
SELECT company, COUNT(*)
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var < 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var > 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company
"""
pd.read_sql_query(query_pos, con)
company | COUNT(*) | |
---|---|---|
0 | Arete | 1 |
1 | B. Riley | 1 |
2 | BMO Capital | 1 |
3 | Baird | 1 |
4 | Bank of America | 3 |
5 | Barclays | 5 |
6 | Benchmark | 2 |
7 | Canaccord Genuity | 2 |
8 | Cascend | 2 |
9 | Citigroup | 2 |
10 | Cowen & Co. | 1 |
11 | Craig-Hallum | 1 |
12 | Credit Suisse | 6 |
13 | DA Davidson | 1 |
14 | Deutsche Bank | 1 |
15 | JP Morgan | 4 |
16 | Jefferies | 1 |
17 | KeyBanc | 3 |
18 | LightShed Partners | 1 |
19 | Longbow Research | 1 |
20 | MKM Partners | 1 |
21 | Mizuho | 1 |
22 | Morgan Stanley | 5 |
23 | National Securities | 1 |
24 | Nomura | 1 |
25 | Oppenheimer | 1 |
26 | Piper Jaffray | 2 |
27 | RBC Capital | 1 |
28 | Raymond James | 3 |
29 | SunTrust Robinson Humphrey | 3 |
30 | Telsey Advisory Group | 1 |
31 | UBS | 4 |
32 | Wedbush | 3 |
33 | Wells Fargo | 1 |
Step 5 - We can now put everything together!
The final query looks gruesome but it’s mostly a matter of copy/pasting the components above. Also, we used the UNION ALL statement to concatenate the results of the two queries, to overcome a situation that might arise if you simply join the positive and negative queries. Kudos if you spotted the potential issue and looked for (and found) a solution! This is what programming is all about. If not, do not feel bad; it was pretty vicious.
Also, the code here is a little different from the one in the video: I aliased the SUM(score) as final_score and ordered the result by descending order of that value.
query = """
SELECT company, SUM(score) as final_score
FROM
(SELECT company, COUNT(*) as score
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var > 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var < 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company
UNION ALL
SELECT company, - COUNT(*)
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var < 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var > 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company)
GROUP BY company
ORDER BY final_score DESC
"""
pd.read_sql_query(query, con)
company | final_score | |
---|---|---|
0 | Credit Suisse | 6 |
1 | Barclays | 5 |
2 | Morgan Stanley | 4 |
3 | JP Morgan | 4 |
4 | Wedbush | 3 |
5 | UBS | 3 |
6 | SunTrust Robinson Humphrey | 3 |
7 | Raymond James | 2 |
8 | Piper Jaffray | 2 |
9 | KeyBanc | 2 |
10 | Citigroup | 2 |
11 | Cascend | 2 |
12 | Canaccord Genuity | 2 |
13 | Benchmark | 2 |
14 | Wells Fargo | 1 |
15 | Telsey Advisory Group | 1 |
16 | Mizuho | 1 |
17 | MKM Partners | 1 |
18 | Longbow Research | 1 |
19 | LightShed Partners | 1 |
20 | Jefferies | 1 |
21 | DA Davidson | 1 |
22 | Craig-Hallum | 1 |
23 | Cowen & Co. | 1 |
24 | Baird | 1 |
25 | BMO Capital | 1 |
26 | B. Riley | 1 |
27 | Arete | 1 |
28 | RBC Capital | 0 |
29 | Oppenheimer | 0 |
30 | National Securities | 0 |
31 | Bank of America | 0 |
32 | Loop Capital | -1 |
33 | Deutsche Bank | -1 |
34 | Nomura | -2 |
35 | Goldman Sachs | -3 |
It would probably be a good idea to report the result as a percentage of the advice given, but this too is left as an additional exercise if you’re interested.
SQL III - A horse race
If capitalism is a race, companies are horses. But some of them are more like donkeys. Compute the growth rate of each company’s stock price over January 2020 (the ratio of the latest closing price to the earliest opening price). Compare it with those of the company’s peers and give the rank of the company within the list of its peers.
You might want to think about what this rank represents in order to solve this one.
See solution
Step 1 - Much like for exercise 2, we start by breaking down the query into smaller work packages. Fortunately, this is significantly easier in this case.
And here is the plan:
<figure >
Step 2 - Much like in step 2 of exercise 2, we start by figuring out the performance of the stock during the month of January. The same caveats and hypocrisy regarding the hardcoded dates apply.
And here is the code:
query_perf = """
SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol
"""
pd.read_sql_query(query_perf, con)
symbol | perf | |
---|---|---|
0 | FINX | 0.034916 |
1 | BTA | 0.008567 |
2 | NORW | -0.081423 |
3 | NIM | 0.014272 |
4 | VONG | 0.014878 |
... | ... | ... |
2168 | MFV | 0.077519 |
2169 | MNCL | -0.000978 |
2170 | VTWO | -0.037350 |
2171 | PAAC | 0.010499 |
2172 | VTI | -0.007044 |
2173 rows × 2 columns
Step 3 - We can now count the number of peers that have a better performance than each company. Please take very good care to understand what we are doing at this step, as it is the very heart of the solution. The distinction between a table on disk and a table (or the result of a query) in memory is paramount to understanding how queries are parsed and understood by a database engine!
And here is the query:
query_numb_peers = """
SELECT p.symbol, COUNT(*)
FROM Peers p
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_company
ON p.symbol = perf_company.symbol
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_peer
ON p.peers_symbol = perf_peer.symbol
WHERE perf_peer.perf > perf_company.perf
GROUP BY p.symbol
"""
pd.read_sql_query(query_numb_peers, con)
symbol | COUNT(*) | |
---|---|---|
0 | AAAU | 132 |
1 | AADR | 1108 |
2 | AAPL | 3 |
3 | AAXJ | 1866 |
4 | ACAM | 418 |
... | ... | ... |
2162 | ZJPN | 1446 |
2163 | ZMLP | 1828 |
2164 | ZROZ | 39 |
2165 | ZSL | 978 |
2166 | ZTR | 648 |
2167 rows × 2 columns
Step 4 - Here’s the final step. The logic is similar to the one in exercise 2, with a twist: we start with a query that returns all the companies that have at least 1 peer and gives each of them an offset of 1 in the rank (the why is explained in the video).
And here is the final query:
query = """
SELECT symbol, SUM(position) as rank
FROM
(SELECT DISTINCT symbol, 1 as position
FROM Peers
UNION ALL
SELECT p.symbol, COUNT(*)
FROM Peers p
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_company
ON p.symbol = perf_company.symbol
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_peer
ON p.peers_symbol = perf_peer.symbol
WHERE perf_peer.perf > perf_company.perf
GROUP BY p.symbol)
GROUP BY symbol
"""
pd.read_sql_query(query, con)
symbol | rank | |
---|---|---|
0 | AAAU | 133 |
1 | AADR | 1109 |
2 | AAPL | 4 |
3 | AAXJ | 1867 |
4 | ACAM | 419 |
... | ... | ... |
2200 | ZJPN | 1447 |
2201 | ZMLP | 1829 |
2202 | ZROZ | 40 |
2203 | ZSL | 979 |
2204 | ZTR | 649 |
2205 rows × 2 columns
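The baseline-plus-count trick is easier to verify on a toy table (invented data): every symbol gets a baseline position of 1, and each row with a strictly better performance adds 1 to it:

```python
import sqlite3
import pandas as pd

# Invented January performances for three symbols
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Perf (symbol TEXT, perf REAL);
INSERT INTO Perf VALUES ('AAA', 0.10), ('BBB', 0.05), ('CCC', -0.02);
""")

# rank = 1 (baseline offset) + number of rows with a strictly better perf;
# the baseline row is what keeps the best performer in the result at all
query = """SELECT symbol, SUM(position) AS rank
           FROM (SELECT DISTINCT symbol, 1 AS position
                 FROM Perf
                 UNION ALL
                 SELECT me.symbol, COUNT(*)
                 FROM Perf me JOIN Perf other ON other.perf > me.perf
                 GROUP BY me.symbol)
           GROUP BY symbol"""
ranks = pd.read_sql_query(query, con_demo)
print(dict(zip(ranks["symbol"], ranks["rank"])))  # AAA ranks 1, BBB 2, CCC 3
```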
Aaaaaaaand, we are now in business! Time to relax, and then to go on with understanding how we can model business processes to improve and automate them.
Where to go next?
Do the exercises if they are already available on the UV (the corrections will be posted on this page pretty soon). Otherwise, or if you want to know more about business processes and how to represent, digitize and, ultimately, automate them, head to the BPMN page!