1.2 - SQL
The Digital Firm
Estimated Reading Time: 51 minutes
L'Université Libre de Bruxelles fonde l'enseignement et la recherche sur le principe du libre examen. Celui-ci postule, en toute matière, le rejet de l'argument d'autorité et l'indépendance de jugement.
The Free University of Brussels bases its teachings and research on the principle of free inquiry. This principle postulates, in all matters, the rejection of the argument of authority and independence of thought.
Introduction
SQL, or Structured Query Language, is not like Python. It is indeed not a programming language but a query language. This means that most of what we have seen in the Python refresher does not apply when writing SQL. Python and SQL are therefore complements rather than substitutes.
You will use Python to manipulate data, respond to new events from the user and implement advanced conditional logic. And you will use SQL (or one of its descendants, the so-called NoSQL languages) to retrieve information that is stored in some structured fashion. SQL is particularly at ease with relational databases (which you probably have seen or will soon see during the theory lectures).
In those databases, the data is stored in tables that are linked together. The easiest way for business/economics students to conceptualize those tables is as tabs in a spreadsheet with named columns. Each record in those tables has a field, or list of fields, that identifies it uniquely. For example, you probably have a student number, which is the field the University uses to identify you uniquely, even if another student with the same name and date of birth is enrolled at the same university. Speaking of students, here is what a table containing students enrolled at the university might look like.
Students
student_id | name | date_of_birth | city_of_residence |
---|---|---|---|
1 | Aloysius | 1999-07-05 | Brussels |
2 | Clothilde | 1998-12-14 | Brussels |
3 | Damian | 1999-02-28 | Dilbeek |
4 | Clothilde | 1999-06-06 | Anderlecht |
5 | Eya | 1998-10-15 | Brussels |
Now, remark that we have two students named Clothilde who are obviously two different people, as they share neither the same birthday nor the same city. This is not a problem at all, and even if they shared that information by an enormous coincidence, they would still be differentiated in the database because they have different student_id values. In the lingo, the attribute (I'll use attribute and field interchangeably) or set of attributes that enables differentiating two records, even if all the other fields are rigorously identical, is called the primary key. This is a very important concept: all use of relational databases, and a good part of our modern way of life, hinges on it.
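For the curious, this uniqueness guarantee is something the database engine itself enforces. Here is a minimal, illustrative sketch with SQLite (the engine we will use later, in memory, so nothing touches your disk); the table mirrors the Students example above:

```python
import sqlite3

# Toy in-memory database: the Students table above, with student_id as primary key.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Students (
    student_id INTEGER PRIMARY KEY,
    name TEXT,
    date_of_birth TEXT,
    city_of_residence TEXT
)""")
con.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)", [
    (2, "Clothilde", "1998-12-14", "Brussels"),
    (4, "Clothilde", "1999-06-06", "Anderlecht"),
])
# Both Clothildes coexist happily: the primary key tells them apart.
rows = con.execute("SELECT student_id, name FROM Students").fetchall()

# The engine refuses a second record with an existing primary key.
try:
    con.execute("INSERT INTO Students VALUES (2, 'Impostor', '2000-01-01', 'Gent')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The attempted insert with the already-used key 2 is rejected by the engine, which is exactly the guarantee that makes primary keys usable as identifiers.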
Now, how does this differ from a spreadsheet, you may ask. Well, the presence of a primary key enables the creation of links between tables. While most spreadsheet systems enable links as well, they are often super painful to maintain and manage. Let's see an example. If we have students, it would be reasonable to assume they follow courses. Here is a small table with some courses.
Courses
course_id | name |
---|---|
GEST-S482 | The Digital Firm |
GEST-S428 | Some less cool course |
It would also be expected that those courses are given by professors. Let’s create a table for them too…
Professors
employee_id | name |
---|---|
1 | Niccolo di Pantaloni de Mare |
2 | Tom Marvolo Riddle |
3 | Frank Randall |
Nice! Now, to express the fact that Niccolo gives the class “The Digital Firm”, we are actually going to modify the table containing the courses to add a field. This field will be peculiar: it will only contain copies of primary keys from the table Professors. The semantics of this field are that, whenever a record has a number in this field, the professor with this primary key gives this class. As such, the table Courses becomes:
Courses
course_id | name | professor_id |
---|---|---|
GEST-S482 | The Digital Firm | 1 |
GEST-S428 | Some less cool course | 3 |
Here we see that professor Niccolo gives the class “The Digital Firm” and professor Frank gives “Some less cool course”. Technically, the field that contains copies of the primary key of another table is called a foreign key. This is the mechanism that underpins link creation in relational databases. In fact, if a professor gives several classes, we can just copy her primary key in several rows of the Courses table and be done with it. This is why this type of link is called a One-to-Many relation (because one professor can give several classes but a class can only be given by one professor).
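If you want to see what declaring such a foreign key looks like in practice, here is a hedged SQLite sketch of the two tables above (names taken from the example; the column types are assumptions for the illustration):

```python
import sqlite3

# Toy in-memory database: Professors and Courses, with professor_id in Courses
# declared as a foreign key holding copies of primary keys from Professors.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Professors (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses (
    course_id TEXT PRIMARY KEY,
    name TEXT,
    professor_id INTEGER REFERENCES Professors(employee_id)
);
INSERT INTO Professors VALUES (1, 'Niccolo'), (3, 'Frank');
INSERT INTO Courses VALUES ('GEST-S482', 'The Digital Firm', 1),
                           ('GEST-S428', 'Some less cool course', 3);
""")
# Each course row carries a copy of one professor's primary key: the "one" side
# can be copied into as many course rows as needed, hence One-to-Many.
rows = con.execute("SELECT course_id, professor_id FROM Courses").fetchall()
```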
But you might ask: “Why haven't we linked students with courses? Surely we can use foreign keys to do so!” And surely, we can. However, it is not so easy. Take a minute to think about how you would do it in the example and draw it on a piece of paper, I'll wait.
…
…
…
Done? Marvelous! As you could see, there is an issue. A student can be enrolled in several classes but, at the same time, a class can be given to several students, fortunately for our universities' budgets. Copying the primary key of one side of the relation into the other table would not work. In fact, it would make matters worse: it would force a student to have at most one class and a class to have at most one student (try it for yourself!). We therefore have to resort to a different construct: the junction table. This is a table that is created between Students and Courses. Every time a student registers for a class, a new record is created in this table, containing the primary key of the student and the primary key of the course as foreign keys. What's even cooler is that, now, students can register several times for the same class, for example if they have to retake it (what's cool here is that we are able to represent it, not that students have to retake the class; please try to pass on your first attempt…). You just need to add a field with the year during which the student was registered. You can also, for example, log the grade of the student in this class and that kind of information. This table, which we can call, for example, Enrollments, might look like:
Enrollments
student_id | course_id | year | grade |
---|---|---|---|
2 | GEST-S482 | 2020 | 12 |
1 | GEST-S482 | 2019 | 8 |
2 | GEST-S428 | 2020 | 14 |
1 | GEST-S482 | 2020 | 16 |
As one can see, student 2 (one of the Clothildes) was registered for both courses (GEST-S482 on the first row and then GEST-S428 on the third row, both in 2020) and both students (Clothilde and Aloysius) were registered for course GEST-S482 (in fact, Aloysius had to retake the course for 2 years). Since a course can have several students and a student can have several courses, we call this type of relation a Many-to-Many (or sometimes an M-N) relation.
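The Enrollments rows above can be sketched as a junction table in SQLite; the composite primary key on (student, course, year) is an assumption for the illustration, chosen so that re-enrolment in a later year is allowed:

```python
import sqlite3

# Toy in-memory junction table: each row pairs a student's key with a course's
# key (both foreign keys), plus the year and the grade.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Enrollments (
    student_id INTEGER,
    course_id TEXT,
    year INTEGER,
    grade INTEGER,
    PRIMARY KEY (student_id, course_id, year)  -- composite primary key
);
INSERT INTO Enrollments VALUES
    (2, 'GEST-S482', 2020, 12),
    (1, 'GEST-S482', 2019, 8),
    (2, 'GEST-S428', 2020, 14),
    (1, 'GEST-S482', 2020, 16);
""")
# Aloysius (student 1) appears twice for GEST-S482: once per year.
years = con.execute("SELECT year FROM Enrollments WHERE student_id = 1").fetchall()
```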
Go Further
There isn’t much more to learn about the basic structure of relational databases. In this video, I talk about a more arcane topic, the self-join, used to represent hierarchies (or networks) of entities. You can safely skip this if you are not interested.
Rather than detailing the content of each table every time, people working with databases usually represent the tables and relations in what is called an Entity-Relationship Diagram (often abbreviated as ERD). Those schemas are composed of rectangles representing the tables, with their name and each field along with its type.
These tables are connected by graphical representations of the relations between them. There are two types of relations. The straight lines represent one-to-one relations, which arise when one record in one of the tables can be connected to at most one record in the other (this is the type of relation that might represent the connection between a person and her Belgian ID card: a person can have at most one valid ID card at a time and an ID card can only belong to one person). The second type of relation is represented by a forked line. Those are a graphical instantiation of a one-to-many relation. In this case, the straight end of the line represents the “one” side of the relation while the forked end represents the “many” side.
Below is an ERD representing the above scenario with students, professors, courses and enrollments.
As you can see, every piece of information in the scenario is represented in this compact visualization (except for the values contained in the tables, but those are not really needed to work out the queries, as you will see afterwards). For example, the primary keys are explicitly represented. You might have noticed that the Enrollments table has several key symbols. This does not mean that there are several primary keys in this table but rather that the primary key is constituted of several fields. The combination of values in those fields must be unique, not the values in each of those fields taken separately. Those are the fundamentals of the design of relational databases. It seems basic, but most of today's data architecture enabling the world to work the way it does is based on that. You can represent most scenarios you can think of with relational databases (sometimes going through some crazy hoops, but still). The catch is that you have to impose a rigid structure. This is why, today, some database systems have broken free of, or taken liberties with, the entity-relationship paradigm to allow for more flexible designs, but those are outside the scope of this class.
Queries
In your job as a business engineer or economist, you are mostly going to have to retrieve data from databases. In most relational databases, this is done using a somewhat universal querying language called SQL. It is not only a fairly sought-after skill for landing a data analyst job, it is also a good way to woo your boss. Extract from Amazon’s show Jack Ryan - S01E01
In this context, the most useful (and arguably the most complex) statement (or instruction) we will use is the SELECT statement. This will be the topic of the remainder of the session.
The SELECT statement
What the heck are we doing?
Before going on with the rest of the session, it is important to understand why we talk about SQL but wrap it in a bunch of Python code. It is because, most of the time, you want to do something with the data you retrieve. In this class, we'll use Python to do so (since we already use this language for other topics).
However, make no mistake, we are using three types of “programs” under the hood. Our “view”, or graphical interface, is the Jupyter notebook (even if you read this as a web page, it is just a notebook whose result was “exported” to a web page), which renders the result of whatever code we run in the code cells. We have Python, which is going to manage all the logic once the data has been retrieved from the disk. As such, it might be compared to a “controller” layer (this is a broad approximation; go along with it for now, we'll go into more detail on those concepts in the theory sessions). Finally, we have the SQL server - a program that lurks in the dark (or runs as a process on your computer), waiting for another program to send it a query (again, an approximation in the case of SQLite, the engine we are using, but most SQL systems work with server programs; the SQLite engine is actually started at the moment you submit a query to it). Once it receives a valid query, it goes to the disk to retrieve the required data and sends the result of the query back to the program that requested it. All this happens in a split second and you have the impression that everything happens in one place, but there is in fact a whole message flow going on.
import pandas as pd
import sqlite3
# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)
# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source
import os.path
if not os.path.isfile("finnhub.db"):  # If you don't have the file yet...
    import urllib.request
    urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/finnhub.db", "finnhub.db")  # download it
con = sqlite3.connect("finnhub.db") # create a connection with the database
query = "SELECT * FROM sqlite_master" # This is the SQL code that we want to execute
pd.read_sql_query(query, con) # We use the pandas library to send the query through the connection
# retrieve the data and format it into a table
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | Symbols | Symbols | 2 | CREATE TABLE "Symbols" (\n"description" TEXT,\... |
1 | table | Companies | Companies | 3 | CREATE TABLE "Companies" (\n"address" TEXT,\n ... |
2 | table | Peers | Peers | 32 | CREATE TABLE "Peers" (\n"symbol" TEXT,\n "pee... |
3 | table | Executives | Executives | 49 | CREATE TABLE "Executives" (\n"age" TEXT,\n "n... |
4 | table | Candlestick | Candlestick | 74 | CREATE TABLE "Candlestick" (\n"close_price" RE... |
The second point I need to discuss before starting the explanation is the database that we are going to use in the examples. It is a base containing information about listed companies and their top executives (their boards of directors). It was collected by Benoit Decorte using the API from Finnhub (we will see how to retrieve this kind of data in an automated fashion during a future session). I hosted the database on this website, and the boilerplate code downloads it to the same folder as your notebook if it is not already present. The ERD for this base is the following.
As you can see, we have a record of companies along with their executives, some groups of other companies considered as peers, their financial results (summarized in candlestick data), some evaluations by financial analysts and the headlines of some news involving the company. What links all the information about a company is a specific field, called a ticker or symbol.
Remark that, in this schema, the table Symbols is easily bypassed: since the ticker is the foreign key in all other tables (and, in fact, even the primary key in the table Companies), we can join the tables using this key directly. This will become clearer as we go.
After this super long introduction, we can go ahead and learn the SELECT statement. The syntax is plain and we will go through all the keywords successively. The statement follows this pattern:
SELECT [fields or aggregates]
FROM [tables or other selects, that might be joined together]
WHERE [conditions on the lines]
GROUP BY [aggregation criteria]
HAVING [conditions on the aggregates]
ORDER BY [fields on which to order and direction]
LIMIT [number of final lines to show]
Don’t let the simple syntax fool you. With this, you can answer as many questions as your heart desires (if you have a relational database with the data, of course). Let’s start!
SELECT … FROM …
The SELECT statement starts with the… SELECT clause (a statement is like a function, while a clause is a part of this statement). In this clause, you list the fields you would like to retrieve in the result. Together with the FROM clause, in which you name the tables from which you want to retrieve the fields, they are the only mandatory parts of a query.
Assume you want to have the name of all the companies in the database. You could write the query in the following way (the three double quotes in Python enable the creation of a multi-line string; SQL disregards carriage returns - going to a new line - but, for your sanity, try to write your queries on several lines to make them more readable, and keep a uniform style as much as possible):
query = """ SELECT name
FROM Companies
"""
pd.read_sql_query(query, con)
name | |
---|---|
0 | AGILENT TECHNOLOGIES INC |
1 | ALCOA CORP |
2 | PERTH MINT PHYSICAL GOLD ETF |
3 | None |
4 | DORSEY WRIGHT ADR ETF |
... | ... |
216 | WIDEPOINT CORP |
217 | XEROX CORP |
218 | YELP INC |
219 | ZILLOW GROUP INC |
220 | ZEDGE INC |
221 rows × 1 columns
We see that, here, we have 221 companies and their names. Note that we have some companies for which the name is empty. We'll fix that later (in fact, those are tickers that correspond to funds and the like rather than to companies).
This is the simplest example we can think of. With just this, we can only show all the rows of a table. The most sophisticated operations we can perform are basic math operations within the same row. For example, see that in the table Executives you have the age of the person and the year she became a board member? Using simply SELECT and FROM, we can find out the age at which she became a board member (plus or minus 1 year, since the granularity of the data here is at the level of the year).
query = """ SELECT name, age - 2020 + since
FROM Executives
"""
pd.read_sql_query(query, con)
name | age - 2020 + since | |
---|---|---|
0 | Jeffery H. Boyd | 60 |
1 | Glenn D. Fogel | 56 |
2 | David I. Goulden | 57 |
3 | Gillian Tans | 47 |
4 | Peter J. Millones | 48 |
... | ... | ... |
1389 | Gerald D. Held | 60 |
1390 | Kathryn M. Hill | 55 |
1391 | Deborah Linda Kerr | 44 |
1392 | Scott F. Schenkel | 48 |
1393 | George T. Shaheen | 59 |
1394 rows × 2 columns
In many relational database systems, we strive to attain what is called normalization. It means that we seek to reduce data redundancy (an explanation is available in the YouTube video here). Since each piece of data is in its own table, it makes sense to want to merge information coming from several tables. For example, imagine that you want to have the name of all the board members along with the name of the company in which they serve. This information is in two different tables: the table Companies contains the name of the company (under the field name) while the table Executives contains the name of the board member, also under the field name. To make both pieces of information appear in the same result, we need to use what is called a join.
You can declare a JOIN in the FROM clause. The concept is the following: you give the name of the first table, then JOIN, then the second table, and then you use the keyword ON to specify the fields that should be common between the two tables. Seems abstract? Here's an example. We know that symbol in the table Executives is a foreign key (this is why there is a little fork arriving at it) from the table Symbols, and we know that the table Symbols has a one-to-one relation with the table Companies (meaning that each symbol corresponds to at most one company and that each company has at most one symbol, or ticker). Well then, it means that the value in the field symbol of the table Executives is a copy of a value in the field ticker in the table Companies. If we were to produce all possible combinations of rows from Executives with the rows from Companies and keep only the ones for which the field symbol in Executives is equal to the field ticker in Companies, we would indeed have all the information about the executives alongside all the information about the company in which they serve. That's precisely what a join does.
Watch this unfold in a very small example:
Converting this into code, we have:
query = """SELECT Executives.name, Companies.name
FROM Executives JOIN Companies
ON (Executives.symbol = Companies.ticker)
"""
pd.read_sql_query(query, con)
name | name | |
---|---|---|
0 | Jeffery H. Boyd | BOOKING HOLDINGS INC |
1 | Glenn D. Fogel | BOOKING HOLDINGS INC |
2 | David I. Goulden | BOOKING HOLDINGS INC |
3 | Gillian Tans | BOOKING HOLDINGS INC |
4 | Peter J. Millones | BOOKING HOLDINGS INC |
... | ... | ... |
1389 | Gerald D. Held | NETAPP INC |
1390 | Kathryn M. Hill | NETAPP INC |
1391 | Deborah Linda Kerr | NETAPP INC |
1392 | Scott F. Schenkel | NETAPP INC |
1393 | George T. Shaheen | NETAPP INC |
1394 rows × 2 columns
Now, this type of JOIN is the most common. It is called an INNER join. In it, only the rows that have a correspondence in the other table will show up in the results. For instance, imagine we are joining a table containing courses with a table containing professors. If we use an inner join, a course that has no professor assigned will not appear in the result, and a professor who isn't connected to any course will not appear either. Now imagine you want to make sure that all courses appear once and, if there is a professor responsible for one, you want to see the name of the professor. This is a job for a LEFT join. A left join will necessarily show all the records of the table that comes before the LEFT JOIN keyword and only show the information from the rows of the second table if they match a row from the first table.
Most of the time, you'll end up doing either an inner join (very frequently) or a left join (seldom). Besides those, there are also the right and the outer joins, which are outside the scope of this class.
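To make the difference concrete, here is a self-contained left-join sketch reusing the toy courses-and-professors example from the introduction (invented data, not the finnhub base):

```python
import sqlite3

# Toy in-memory tables: one course has a professor, the other has none (NULL).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Professors (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses (course_id TEXT, name TEXT, professor_id INTEGER);
INSERT INTO Professors VALUES (1, 'Niccolo');
INSERT INTO Courses VALUES ('GEST-S482', 'The Digital Firm', 1),
                           ('GEST-S999', 'Orphan course', NULL);
""")
# LEFT JOIN keeps every course; the unmatched one gets NULL (None) as name.
rows = con.execute("""SELECT Courses.name, Professors.name
                      FROM Courses LEFT JOIN Professors
                      ON Courses.professor_id = Professors.employee_id""").fetchall()
```

An INNER join on the same data would have dropped the orphan course entirely.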
WHERE…
What we have so far is a way to retrieve all the rows from a table or from the result of a join of tables. That's nice but not very useful for many scenarios. The first refinement is to be able to filter out some of the rows from the tables. This is what the WHERE clause does. It checks a condition, or series of conditions, for each row. If the conditions are verified for this row, the row will show up in the results; otherwise it is discarded from the query.
The conditions are fairly classical. You can test an equality with = (do not mix it up with Python, where equality is checked with ==), “greater than” with >, “smaller than” with <, “greater than or equal” with >=, “smaller than or equal” with <= and “different from” with <> (which is also different from Python, where this is evaluated with !=). Besides that, for text and chains of characters, you can check patterns by using the keyword LIKE together with percentage signs % as wildcards representing zero or more arbitrary characters. For example, imagine you just want to see the executives that are younger than 40 years. You can do so with:
query = """SELECT name, age
FROM Executives
WHERE age < 40
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Eyal Lalo | |
1 | Michael Porter | 38 |
2 | Mark Locks | |
3 | Jon Sanders | |
4 | Michael Friedman | |
... | ... | ... |
190 | Jared Smith | |
191 | Matt Frey | |
192 | William C. Martin | 39 |
193 | Spencer Scott | |
194 | Atish Gude |
195 rows × 2 columns
That's a good start but, as you can see, we also have a bunch of executives for whom we do not know the age. For those, the value in the database was coded as an empty string ("") and it is evaluated as smaller than any number. We need to get rid of those. The trick here is to compose several conditions using the keyword AND. We can just add an additional condition that filters out such values. As such:
query = """SELECT name, age
FROM Executives
WHERE age < 40 AND age <> ""
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Michael Porter | 38 |
1 | Alexander B. Spiro | 35 |
2 | Benjamin Silbermann | 36 |
3 | Evan Sharp | 36 |
4 | Anneka R. Gupta | 31 |
... | ... | ... |
60 | Melissa Thomas | 39 |
61 | Joseph Levin | 39 |
62 | Julia Hartz | 39 |
63 | Joseph Levin | 39 |
64 | William C. Martin | 39 |
65 rows × 2 columns
If we want to see all the executives that are younger than 40 years and whose first name starts with the letter J, we could do:
query = """SELECT name, age
FROM Executives
WHERE age < 40
AND age <> ""
AND name LIKE "J%"
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Joseph Levin | 39 |
1 | John Paul Knab | 38 |
2 | Jesse A. Cohn | 38 |
3 | Jose J. Vargas | 39 |
4 | Jayme Mendal | 33 |
5 | Joseph Levin | 39 |
6 | Jamie Cohen | 32 |
7 | Jantoon Reigersman | 37 |
8 | Joseph Bzdyl | 36 |
9 | Joseph Levin | 39 |
10 | Julia Hartz | 39 |
11 | Joseph Levin | 39 |
Here, remark that LIKE "J%" matches any string starting with a J, whatever follows it. You can always invert a condition using the keyword NOT:
query = """SELECT name, age
FROM Executives
WHERE age < 40
AND age <> ""
AND NOT name LIKE "J%"
"""
pd.read_sql_query(query, con)
name | age | |
---|---|---|
0 | Michael Porter | 38 |
1 | Alexander B. Spiro | 35 |
2 | Benjamin Silbermann | 36 |
3 | Evan Sharp | 36 |
4 | Anneka R. Gupta | 31 |
5 | Rachel Barnett | 37 |
6 | Christina Sindoni Ciocca | 31 |
7 | Liqun Liu | 38 |
8 | Chelsea Clinton | 39 |
9 | Sean K. Middleton | 37 |
10 | Tiffany Kuo | 27 |
11 | Michael Ross Pope | 35 |
12 | Gregory R. Blank | 38 |
13 | Zach Thomann | 35 |
14 | Benjamin L. Rosenzweig | 33 |
15 | Lev Peker | 36 |
16 | David Meniane | 35 |
17 | Ryan S. Schram | 39 |
18 | Garrard Brown | 36 |
19 | Katrina Lake | 36 |
20 | Chad Summe | 39 |
21 | Raymond T. Walsh | 33 |
22 | David B. Peters | 39 |
23 | Vikram R. Raghavan | 36 |
24 | Nathan Forrest Auwerda | 33 |
25 | Amit Goyal | 39 |
26 | Kamelia Aryafar | 33 |
27 | Sumit Goyal | 34 |
28 | Seth A. Moore | 36 |
29 | Logan D. Green | 35 |
30 | Pouyan Salehi | 39 |
31 | William Gray Stream | 39 |
32 | Maureen Mullen | 38 |
33 | Matias de Tezanos | 38 |
34 | Michael Montano | 33 |
35 | Bret Taylor | 39 |
36 | Darryl Auguste | 30 |
37 | Shimrit Markette | 35 |
38 | Mira Wilczek | 37 |
39 | Katrina Lake | 36 |
40 | Samuel H Altman | 30 |
41 | Chelsea Clinton | 39 |
42 | Ismail Azeri | 39 |
43 | Mark Zuckerberg | 34 |
44 | Evan Spiegel | 28 |
45 | Robert C. Murphy | 30 |
46 | Poppy Thorpe | 34 |
47 | Erin N. Lantz | 38 |
48 | Matthew B. Salzberg | 35 |
49 | Ilia M. Papas | 37 |
50 | Tracy Britt Cool | 35 |
51 | Melissa Thomas | 39 |
52 | William C. Martin | 39 |
GROUP BY…
This is the last difficult part of the SELECT statement. The GROUP BY clause enables performing aggregation. Remember that the SELECT clause enables doing simple operations within a row, such as adding or subtracting two fields? Well, the GROUP BY clause makes it possible to do some operations across rows, on a certain field. For example, you can find the average age of executives using the following syntax:
query = """SELECT AVG(age)
FROM Executives
"""
pd.read_sql_query(query, con)
AVG(age) | |
---|---|
0 | 50.672166 |
Indeed, you “squeezed” all the rows into one, summarizing the values in the age field by taking their average. This is neat, but what does the GROUP BY have to do with it? Well, by default, if you don't specify otherwise, aggregation operations in the SELECT clause (such as AVG, MAX, MIN, SUM and COUNT) will squeeze the table all the way down to 1 row. Most of the time, the question you have only requires squeezing the table up to a certain level. For instance, knowing that the average age of directors is 50.67 years is nice, but it is of no use if you want to know which companies employ the youngest directors. This is where the GROUP BY shines. By specifying a field or set of fields in this clause, you force the database engine to aggregate within each value of this field (or unique combination of values in the set of fields). As such, if you wonder what the average age of the directors in each company is, you can do (the WHERE age <> "" is there to avoid counting people without a recorded age, whose age would otherwise be treated as 0):
query = """SELECT symbol, AVG(age)
FROM Executives
WHERE age <> ""
GROUP BY symbol
"""
pd.read_sql_query(query, con)
symbol | AVG(age) | |
---|---|---|
0 | AAPL | 58.933333 |
1 | ALOT | 62.363636 |
2 | AMRH | 53.800000 |
3 | AMZN | 58.812500 |
4 | ANGI | 46.266667 |
... | ... | ... |
97 | WDC | 58.285714 |
98 | WTRH | 48.400000 |
99 | WYY | 54.400000 |
100 | XRX | 53.454545 |
101 | YELP | 50.916667 |
102 rows × 2 columns
In this case, you end up with one line per symbol and the average age of the directors whose records carry this symbol as a foreign key.
While the average (AVG), sum (SUM), maximum (MAX) and minimum (MIN) all behave as you would expect, COUNT is a bit special: it counts the number of rows for which the combination of fields that you pass between parentheses is not entirely NULL. Most of the time, using COUNT, you really want to count the number of underlying rows in your query; this is why COUNT is often used with a star * (meaning “all the fields available”) between the parentheses. For example, in order to count the number of directors in each company, you can do:
query = """SELECT symbol, COUNT(*)
FROM Executives
WHERE age <> ""
GROUP BY symbol
"""
pd.read_sql_query(query, con)
symbol | COUNT(*) | |
---|---|---|
0 | AAPL | 15 |
1 | ALOT | 11 |
2 | AMRH | 5 |
3 | AMZN | 16 |
4 | ANGI | 15 |
... | ... | ... |
97 | WDC | 14 |
98 | WTRH | 10 |
99 | WYY | 5 |
100 | XRX | 11 |
101 | YELP | 12 |
102 rows × 2 columns
In this case, COUNT(*) counts all the rows in each group.
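The difference between COUNT(*) and COUNT(field) can be checked on a tiny invented table: COUNT(*) counts every row, while COUNT(age) skips the rows where age is NULL.

```python
import sqlite3

# Toy in-memory table where one person has no recorded age (NULL).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE People (name TEXT, age INTEGER);
INSERT INTO People VALUES ('Ada', 36), ('Bob', NULL), ('Cleo', 41);
""")
all_rows, with_age = con.execute("SELECT COUNT(*), COUNT(age) FROM People").fetchone()
# COUNT(*) sees 3 rows; COUNT(age) only counts the 2 rows with a non-NULL age.
```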
HAVING…
Sometimes, it makes sense to want to filter some lines based on the result of an aggregation. The issue is that clauses are executed sequentially and, when the aggregation is performed, the WHERE clause has already been executed (because otherwise, you would have wrong values in your aggregated fields). This is the reason why the HAVING clause exists. For example, the following code will produce an error:
You can use the keyword AS to give an alias to your field. This is especially useful when you create new aggregated fields.
query = """SELECT symbol, AVG(age) AS average_age
FROM Executives
WHERE average_age < 40 AND age <> ""
GROUP BY symbol
"""
pd.read_sql_query(query, con)
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
1680 try:
-> 1681 cur.execute(*args, **kwargs)
1682 return cur
OperationalError: misuse of aggregate: AVG()
The above exception was the direct cause of the following exception:
DatabaseError Traceback (most recent call last)
<ipython-input-12-6a25273a221f> in <module>
5 """
6
----> 7 pd.read_sql_query(query, con)
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize)
381 coerce_float=coerce_float,
382 parse_dates=parse_dates,
--> 383 chunksize=chunksize,
384 )
385
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
1725
1726 args = _convert_params(sql, params)
-> 1727 cursor = self.execute(*args)
1728 columns = [col_desc[0] for col_desc in cursor.description]
1729
~/.pyenv/versions/3.7.6/envs/digital_business/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
1691
1692 ex = DatabaseError(f"Execution failed on sql '{args[0]}': {exc}")
-> 1693 raise ex from exc
1694
1695 @staticmethod
DatabaseError: Execution failed on sql 'SELECT symbol, AVG(age) AS average_age
FROM Executives
WHERE average_age < 40 AND age <> ""
GROUP BY symbol
': misuse of aggregate: AVG()
This huge error message basically tells us that we cannot do that, for a good reason: we try to use the result of the AVG aggregation in the WHERE clause but, at the moment this clause is evaluated, the aggregate has not yet been calculated. To fix this and retrieve the companies whose directors are, on average, younger than 40, we need to use the HAVING clause:
query = """SELECT symbol, AVG(age) AS average_age
FROM Executives
WHERE age <> ""
GROUP BY symbol
HAVING average_age < 40
"""
pd.read_sql_query(query, con)
symbol | average_age |
---|
It turns out none of the companies in our base has a young board of directors. It's part of the reason we're in this mess, if you ask me, but anyway. For the rest, the HAVING clause can perform most of the same checks as the WHERE clause.
ORDER BY…
Perhaps unsurprisingly, the ORDER BY clause enables sorting the results of the query. It is usually a fairly unsophisticated clause; the supported orders are the numerical order (1, 2, 3, 7 or 1.1, 1.2, 1.7, 6.5) and the alphabetical order. You can also specify whether you want it from small to large (using the ASC, for ascending, keyword, which is the default) or from large to small (using the DESC, for descending, keyword). You can also specify several fields: it will start sorting on the first and, if there are ties, those will be resolved by resorting to the second field, and so on.
In our example, assume we want to show all directors sorted alphabetically by the name of their company, then, for people in the same company, by decreasing order of age (from oldest to youngest) and, if two directors from the same company have the same age, alphabetically by name (we do not expect two directors to have exactly the same name, so we stop there). We can do so as such:
You can alias a table by writing the alias just after the table name. For example, in this query, we alias the table Companies as c by writing FROM Companies c. In the rest of the query, every time we type c, the engine understands that we refer to the table Companies.
query = """SELECT c.name, e.name, e.age
FROM Companies c
JOIN Executives e
ON (c.ticker = e.symbol)
ORDER BY c.name ASC,
e.age DESC,
e.name
"""
pd.read_sql_query(query, con)
name | name | age | |
---|---|---|---|
0 | 1-800-FLOWERS.COM | James A. Cannavino | 75 |
1 | 1-800-FLOWERS.COM | Eugene F. DeMark | 72 |
2 | 1-800-FLOWERS.COM | Adam Hanft | 69 |
3 | 1-800-FLOWERS.COM | James Francis McCann | 68 |
4 | 1-800-FLOWERS.COM | Leonard J. Elmore | 67 |
... | ... | ... | ... |
1389 | YELP INC | Laurence Wilson | 46 |
1390 | YELP INC | James Miln | 45 |
1391 | YELP INC | George Hu | 44 |
1392 | YELP INC | Jeremy Stoppelman | 41 |
1393 | YELP INC | Vivek Patel | 40 |
1394 rows × 3 columns
Here, remark that we asked to sort first by company name in alphabetical order, then by the executive’s age in descending order, then by the executive’s name in alphabetical order. We omitted the ASC on that last field, which is no big deal since it is the default. We could also have dropped it after c.name.
LIMIT
The last and maybe easiest clause is LIMIT. It is executed at the end of the query and truncates the result. Followed by one positive number x, it shows the first x results. (Beware: in SQLite, a negative number simply removes the limit; other engines may reject it outright.) Finally, if you give it two numbers, x and y, it skips the first x rows and shows the y rows that follow.
query = """SELECT symbol, AVG(age) as average_age
FROM Executives
GROUP BY symbol
ORDER BY average_age
LIMIT 3
"""
pd.read_sql_query(query, con)
symbol | average_age | |
---|---|---|
0 | SONM | 17.166667 |
1 | DXC | 31.064516 |
2 | ANGI | 31.545455 |
This gives you the 3 companies whose directors are the youngest on average.
query = """SELECT symbol, AVG(age) as average_age
FROM Executives
GROUP BY symbol
ORDER BY average_age
LIMIT 2, 5
"""
pd.read_sql_query(query, con)
symbol | average_age | |
---|---|---|
0 | ANGI | 31.545455 |
1 | BOXL | 32.555556 |
2 | IMMR | 32.666667 |
3 | APRN | 37.307692 |
4 | MEET | 37.583333 |
This query skips the first 2 rows of the ordered result and returns the next 5, starting with the company at index 2 (ANGI in this case, since the index starts at 0).
Tying everything together
You now know everything you need to perform arbitrarily complex queries in relational databases.
Let’s try to do some (very) complex queries. In the first one, I will figure out the average age of the executives of the company that registered the highest relative stock price increase during the month of January 2019. So that you can appreciate all the challenges and understand the way of thinking, I’ll go step by step without preparation. Please be patient as I make all the mistakes I warned you about above…
query = """
SELECT AVG(age)
FROM
Executives
JOIN
(SELECT last_quotation.symbol, close_price/open_price AS evolution
FROM
(SELECT symbol, close_price
FROM Candlestick
WHERE timestamp = "2019-01-31") last_quotation
JOIN
(SELECT symbol, open_price
FROM Candlestick
WHERE timestamp = "2019-01-02") first_quotation
ON last_quotation.symbol = first_quotation.symbol
ORDER BY evolution DESC
LIMIT 1) max_evol
ON Executives.symbol = max_evol.symbol
"""
pd.read_sql_query(query, con)
AVG(age) | |
---|---|
0 | 32.555556 |
Go Further
Look at how the Peers table is constructed. It has two foreign keys, both referring to companies. Concretely, this allows representing a directed graph in relational form. Querying it is not hard once you understand the difference between the table on disk and an invocation of that table in a query. In this video, we retrieve the names of those companies using an SQL query.
query = """
SELECT origin.name, destination.name
FROM
Peers
JOIN Companies origin
ON Peers.symbol = origin.ticker
JOIN Companies destination
ON Peers.peers_symbol = destination.ticker
"""
pd.read_sql_query(query, con)
name | name | |
---|---|---|
0 | BOOKING HOLDINGS INC | SHUTTERSTOCK INC |
1 | BOOKING HOLDINGS INC | LANDS' END INC |
2 | BOOKING HOLDINGS INC | QUOTIENT TECHNOLOGY INC |
3 | BOOKING HOLDINGS INC | RUBICON PROJECT INC |
4 | BOOKING HOLDINGS INC | GRUBHUB INC |
... | ... | ... |
2811 | NETAPP INC | SOCKET MOBILE INC |
2812 | NETAPP INC | INTEVAC INC |
2813 | NETAPP INC | TRANSACT TECHNOLOGIES INC |
2814 | NETAPP INC | IMMERSION CORP |
2815 | NETAPP INC | QUALSTAR CORP |
2816 rows × 2 columns
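The double invocation of the Companies table is easier to see on a tiny made-up graph (all names invented for illustration):

```python
import sqlite3
import pandas as pd

# A tiny directed graph with invented companies
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Companies (ticker TEXT, name TEXT);
CREATE TABLE Peers (symbol TEXT, peers_symbol TEXT);
INSERT INTO Companies VALUES ('A', 'Acme'), ('B', 'Bolt'), ('C', 'Crux');
INSERT INTO Peers VALUES ('A', 'B'), ('A', 'C'), ('B', 'C');
""")

# Companies is joined twice, once per endpoint of each edge;
# the aliases keep the two invocations of the same table apart
query = """SELECT origin.name AS origin, destination.name AS destination
           FROM Peers
           JOIN Companies origin ON Peers.symbol = origin.ticker
           JOIN Companies destination ON Peers.peers_symbol = destination.ticker"""
df = pd.read_sql_query(query, con_demo)
print(df)
```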
Check your understanding
Once again, this was a long session, but it is fundamental for what’s to come (and for your life in general). Here are a few questions to check whether you remember what’s in this class. This does not count towards your grade and the answers are anonymous. Try it as many times as you want.
Exercises
ERD I - Fakebook
You want to model a simple competitor to Facebook. You need users, who may become friends with each other and can publish posts, to which their friends may then react or comment; some posts may contain one or more pictures. Design the ER diagram representing this app.
See solution
We model just enough of Facebook to turn a profit. There will always be time to become multi-billionaires later.
Here’s the result of our efforts. Let the VC know, there’s a new kid in town!
Note, however, that if you do not have the same solution, it does not mean that yours is wrong. The thing with ERDs is that there is more than one way to skin a cat, if only because everyone has a different understanding of any problem. The important thing is to be able to justify your choices and to understand their consequences. That said, some designs are dysfunctional in the sense that they cannot accommodate common-sense situations (think of a one-to-one relation between Users and Posts, which would mean that users could only post something once and then bail). The name of the game is to avoid those nonsensical drawings. Everything else is fair game, as long as what you draw represents what you had in mind.
<figure >
I guess we are ready to be rich now.
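As a hedged sketch, here is one possible translation of such a design into SQLite DDL. Every table and column name below is our own choice, not the reference solution; in particular we fold “reactions” into the Comments table to keep things short:

```python
import sqlite3

# One possible DDL for the Fakebook ERD; every name here is our own choice
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Users (user_id INTEGER PRIMARY KEY, name TEXT);
-- friendship: a many-to-many relation between Users and itself
CREATE TABLE Friendships (
    user_id   INTEGER REFERENCES Users(user_id),
    friend_id INTEGER REFERENCES Users(user_id),
    PRIMARY KEY (user_id, friend_id));
CREATE TABLE Posts (
    post_id   INTEGER PRIMARY KEY,
    author_id INTEGER REFERENCES Users(user_id),
    content   TEXT);
-- a post may carry zero or more pictures (one-to-many)
CREATE TABLE Pictures (
    picture_id INTEGER PRIMARY KEY,
    post_id    INTEGER REFERENCES Posts(post_id),
    url        TEXT);
-- comments (and, in this simplification, reactions) by friends on a post
CREATE TABLE Comments (
    comment_id INTEGER PRIMARY KEY,
    post_id    INTEGER REFERENCES Posts(post_id),
    author_id  INTEGER REFERENCES Users(user_id),
    content    TEXT);
""")
tables = [r[0] for r in con_demo.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```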
ERD II - The CoViD app
Design an ER diagram for an app that tracks CoViD. You want to be able to figure out who was, or may have been, in contact with people affected by the Corona Virus Disease. As a rule, you do not need to track people on their way to and from places, only inside closed places. You can use any tool to model the ERD; if you have no preference, you can use DrawSQL with a trial account.
See solution
The solution is actually less complex than one could expect. Of course, in reality, most of the challenge consists in keeping private data safe and optimizing searches so that you don’t have to wait days for each query to execute on the gigantic database.
And here is the result of this exercise. Again this is not the unique solution. You might have thought of additional entities that could be represented and used in the disease tracking workflow.
<figure >
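As an illustration of what such a design enables, here is a minimal sketch of the contact query, assuming a hypothetical schema with Persons and timestamped Visits to closed places (all names and data are invented):

```python
import sqlite3
import pandas as pd

# Hypothetical schema: persons and their timestamped visits to closed places
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Persons (person_id INTEGER PRIMARY KEY, name TEXT, infected INTEGER);
CREATE TABLE Visits (person_id INTEGER, place_id INTEGER,
                     arrival TEXT, departure TEXT);
INSERT INTO Persons VALUES (1, 'Ann', 1), (2, 'Ben', 0), (3, 'Cara', 0);
INSERT INTO Visits VALUES
    (1, 10, '2020-03-01 10:00', '2020-03-01 12:00'),
    (2, 10, '2020-03-01 11:00', '2020-03-01 13:00'),
    (3, 10, '2020-03-01 14:00', '2020-03-01 15:00');
""")

# Two visits to the same place overlap when each starts before the other ends
query = """SELECT DISTINCT p.name
           FROM Visits v_inf
           JOIN Persons inf ON v_inf.person_id = inf.person_id
                            AND inf.infected = 1
           JOIN Visits v ON v.place_id = v_inf.place_id
                        AND v.person_id <> v_inf.person_id
                        AND v.arrival < v_inf.departure
                        AND v_inf.arrival < v.departure
           JOIN Persons p ON v.person_id = p.person_id"""
contacts = pd.read_sql_query(query, con_demo)
print(contacts)  # Ben overlapped with Ann; Cara came too late
```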
ERD III - The Bill of Material
You want to represent the “recipe book” for your product: an electric bike. Since you manufacture all parts of the bike yourself, you start from raw materials and assemble everything. You therefore have 3 types of materials: finished products, intermediary products and raw materials. Only raw materials have directly computable costs.
You use raw materials and intermediary products to produce both finished products and other intermediary products. Since you want to be able to compute the total cost of each product, you need to trace all the products that enter the production of any specific product at any stage (be it to produce the final product or any of its intermediary components).
In technical terms, this is known as a Bill of Materials (or BOM). Since we are only interested in the cost of materials, you do not need to include the costs of the process. Design the ERD that represents this problem.
See solution
This exercise is a little more complex in its solution, as the situation is a little less familiar to the average database designer. The point is that you do not need to constrain the form of the network representing the BOM: a simple table with two connections to the Products (or Materials, or whatever) table is all you need.
Note that you might have wanted to separate raw inputs, semi-finished goods and finished products into separate tables. If you go this way, I’d still recommend creating a single abstract table where all of those get an ID. Having one ID for the concept of material or product avoids having to create 3 fields (raw_material_id, semi_finished_good_id and finished_good_id); otherwise you would also need several fields for the product being constructed. Similarly, creating a BOM table for each type of material is a bad idea, since most of the fields would be the same and there is nothing special about a recipe for a semi-finished good compared to one for a finished good. The UnitsOfMeasure table is more of a detail and it is totally understandable if you don’t have it in your design.
<figure >
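As a sketch of why the single Materials table pays off: the total raw-material cost of a product can then be rolled up with one recursive query. Recursive queries (WITH RECURSIVE) were not covered in this session, and the miniature schema and data below are made up, so treat this as an illustration rather than the reference solution:

```python
import sqlite3

# Miniature BOM: one Materials table, one BOM table with two FKs into it
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Materials (material_id INTEGER PRIMARY KEY,
                        name TEXT, unit_cost REAL);
CREATE TABLE BOM (parent_id INTEGER REFERENCES Materials(material_id),
                  child_id  INTEGER REFERENCES Materials(material_id),
                  quantity  REAL);
-- only raw materials (spoke, rim) carry a direct cost
INSERT INTO Materials VALUES
    (1, 'bike', NULL), (2, 'wheel', NULL), (3, 'spoke', 0.5), (4, 'rim', 8.0);
INSERT INTO BOM VALUES (1, 2, 2), (2, 3, 32), (2, 4, 1);
""")

# Walk the BOM downwards from the bike, multiplying quantities along the way
query = """WITH RECURSIVE needs(material_id, qty) AS (
               SELECT 1, 1.0
               UNION ALL
               SELECT b.child_id, needs.qty * b.quantity
               FROM BOM b JOIN needs ON b.parent_id = needs.material_id)
           SELECT SUM(needs.qty * m.unit_cost)
           FROM needs JOIN Materials m ON needs.material_id = m.material_id
           WHERE m.unit_cost IS NOT NULL"""
total_cost = con_demo.execute(query).fetchone()[0]
print(total_cost)  # 64 spokes at 0.5 plus 2 rims at 8.0 = 48.0
```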
SQL - Setup
Let’s play with the stock price database that we have. Let’s start by making sure that we have it: download the full database. Beware that it is quite heavy (250Mb or so), so executing this cell may take quite a while. Fortunately, you only need to execute it once, as subsequent runs will find the previously downloaded version.
import pandas as pd
import sqlite3
# This cell retrieves the database, downloads it to the folder where you have your notebook
# It then opens a connection with the database engine (it is like a tunnel through which you will
# pass the query from python to sqlite and the results back)
# You need to have the pandas library installed. If you don't have it and do not know
# how to install it, have a look at the notebook about modularity and open-source
import os.path
if not os.path.isfile("stocks_full.db"): # If you don't have the file yet...
import urllib.request
urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/stocks_full.db",
"stocks_full.db") # download it
con = sqlite3.connect("stocks_full.db") # create a connection with the database
query = "SELECT * FROM sqlite_master" # This is the SQL code that we want to execute
pd.read_sql_query(query, con) # We use the pandas library to send the query through the connection
# retrieve the data and format it into a table
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | Symbols | Symbols | 2 | CREATE TABLE "Symbols" (\n"symbol" TEXT,\n "d... |
1 | table | Companies | Companies | 26 | CREATE TABLE "Companies" (\n"ticker" TEXT,\n ... |
2 | table | Peers | Peers | 276 | CREATE TABLE "Peers" (\n"symbol" TEXT,\n "pee... |
3 | table | Executives | Executives | 20214 | CREATE TABLE "Executives" (\n"id" INTEGER,\n ... |
4 | table | Candlesticks | Candlesticks | 20338 | CREATE TABLE "Candlesticks" (\n"id" INTEGER,\n... |
5 | table | Recommendations | Recommendations | 57926 | CREATE TABLE "Recommendations" (\n"id" INTEGER... |
6 | table | Evaluations | Evaluations | 57942 | CREATE TABLE "Evaluations" (\n"id" INTEGER,\n ... |
7 | table | MajorDevelopments | MajorDevelopments | 57985 | CREATE TABLE "MajorDevelopments" (\n"id" INTEG... |
SQL I - Thinking like a finance maverick
First, let me preface these questions with a disclaimer: this is not a finance course. So what we are going to do during this session probably makes little to no financial sense (please don’t show this notebook to any finance professor).
That being said, let’s start. We are first going to build our very own index to assess the attractiveness of sectors! That’s right, Goldman Sachs: beware!
In order to do so, produce the total number of recommendations emitted in 2019 for each industry in the dataset (the gind field in the Companies table). Technically, we want the sums of the strongBuy, buy, hold, sell and strongSell recommendations emitted by experts in 2019. Note that you do not need to compute them at the level of individual evaluations, since this was already done for you in the Recommendations table.
To get there, think of combining aggregations and joins across the tables to retrieve the information you want.
See solution
This exercise illustrates how joins and aggregations may be combined to produce views of your data at different levels of a hierarchy. In this case, companies are grouped along one of their dimensions, the industry they are a part of, to produce statistics on those aggregates. (Note that there is an edit in this video around the end, so the video seems to jump. This is normal; I still need to hone my mad video editing skillz.)
query = """
SELECT gind as Industry, SUM(strongBuy),
SUM(buy), SUM(hold), SUM(sell), SUM(strongSell)
FROM Companies c
JOIN Recommendations r
ON (c.ticker = r.symbol)
WHERE r.period LIKE "2019%" AND c.gind <> "N/A"
GROUP BY gind
"""
pd.read_sql_query(query, con)
Industry | SUM(strongBuy) | SUM(buy) | SUM(hold) | SUM(sell) | SUM(strongSell) | |
---|---|---|---|---|---|---|
0 | IT Services | 567 | 1125 | 889 | 110 | 22 |
1 | Interactive Media & Services | 759 | 1309 | 1445 | 377 | 45 |
2 | Internet & Direct Marketing Retail | 746 | 1344 | 1504 | 161 | 14 |
3 | Technology Hardware, Storage & Peripherals | 648 | 1001 | 1323 | 124 | 38 |
4 | Trading Companies & Distributors | 34 | 22 | 142 | 1 | 1 |
SQL II - If advices were good, they wouldn’t be free
Financial analysts usually give advice about whether they think you should keep, buy or sell a stock (here, those are called Evaluations). Thing is, sometimes they are wrong. So we are going to take all the evaluations given by analysts during the month of October 2019 and check whether following each recommendation would have resulted in a profit or a loss over the following month. If the last recommendation of an analyst during the month was “Buy”, “Strong Buy”, “Outperform” or “Overweight” and the stock went up during November, the analyst gets a point; otherwise, they lose one. Same if the last recommendation of the analyst was “Sell”, “Underweight” or “Underperform” and the stock lost value.
Break this query in smaller, simpler queries and then just assemble them like legos.
To finish, establish a ranking of the analysts by decreasing order of points they got.
See solution
Step 1 - This query is not complicated from a syntax standpoint (there are very few subtleties in the SQL involved) but it is hard to organise. This is why we spend the first video thinking about how to break the problem down into a set of small components we can then put together. Try doing it by yourself before watching the video, as this is the single most important step in solving this exercise; the rest is just data plumbing. If you do not break the query down the same way I do, try implementing your solution in code and let me know whether you manage or hit a roadblock. There is usually more than one valid solution; just be sure you understand precisely what you do and what data you retrieve. For example, using MAX(gradeTime) directly to retrieve the last evaluation looks correct but might in fact produce wrong answers (in this particular case, it’s a pity SQLite is so tolerant of imprecise instructions).
and here is the map and the corresponding videos:
<figure >
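The SQLite tolerance mentioned above can be demonstrated on a toy table (invented data). With a single MAX() aggregate, SQLite fills the bare columns from the row that achieved the maximum; this is documented, SQLite-specific behaviour that most other engines reject outright, which is why relying on it blindly is risky:

```python
import sqlite3
import pandas as pd

# Toy evaluations table with two grades from the same analyst
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Evaluations (symbol TEXT, company TEXT,
                          toGrade TEXT, gradeTime TEXT);
INSERT INTO Evaluations VALUES
    ('AAA', 'BankX', 'Sell', '2019-10-01'),
    ('AAA', 'BankX', 'Buy',  '2019-10-20');
""")

# SQLite (and SQLite only) fills the bare toGrade column from the row
# that achieved the MAX(); most engines reject this query altogether
query = """SELECT symbol, toGrade, MAX(gradeTime) AS last_eval
           FROM Evaluations
           GROUP BY symbol, company"""
df = pd.read_sql_query(query, con_demo)
print(df)  # toGrade is 'Buy', the grade of the latest row
```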
Step 2 - We start by generating the performance of the stocks during the month of November 2019.
Note that hardcoding the timestamps of the first and last quotations of the month is a bad idea. However, since I’m feeling quite lazy and hypocritical today, let’s say fixing that is an exercise left for you to test your understanding.
query_perf = """
SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol
"""
pd.read_sql_query(query_perf, con)
symbol | var | |
---|---|---|
0 | FINX | 0.060153 |
1 | BTA | -0.015711 |
2 | NORW | -0.014141 |
3 | NIM | -0.024621 |
4 | VONG | 0.038809 |
... | ... | ... |
2157 | IBMI | 0.001394 |
2158 | MFV | -0.015152 |
2159 | MNCL | -0.001986 |
2160 | VTWO | 0.034113 |
2161 | VTI | 0.032387 |
2162 rows × 2 columns
Step 3 - Retrieving the last Evaluation by analysts for every company in the month of October 2019. Beware, this step is the trickiest of the whole exercise.
query_last_eval = """
SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company
"""
pd.read_sql_query(query_last_eval, con)
symbol | company | toGrade | |
---|---|---|---|
0 | AAPL | Bank of America | Buy |
1 | AAPL | Barclays | Equal-Weight |
2 | AAPL | Canaccord Genuity | Buy |
3 | AAPL | Cascend | Buy |
4 | AAPL | Cowen & Co. | Outperform |
... | ... | ... | ... |
150 | W | Morgan Stanley | Equal-Weight |
151 | W | Wells Fargo | Market Perform |
152 | WDC | Loop Capital | Buy |
153 | YELP | Morgan Stanley | Underweight |
154 | Z | Morgan Stanley | Equal-Weight |
155 rows × 3 columns
Step 4 - Using the two components built previously, we can now compute the hits and misses. Please note that in a business setting we would probably not build this whole query in one go; we would rather encapsulate some of the steps into views or stored procedures. We have not seen those here, so we’ll do without.
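For the curious, here is a minimal sketch of what such an encapsulation could look like in SQLite, which supports views (though not stored procedures); the miniature Candlesticks data is invented:

```python
import sqlite3
import pandas as pd

# Miniature Candlesticks table with invented prices
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Candlesticks (symbol TEXT, timestamp TEXT,
                           open_price REAL, close_price REAL);
INSERT INTO Candlesticks VALUES
    ('AAA', '2019-11-01', 100.0, 102.0),
    ('AAA', '2019-11-29', 108.0, 110.0);
-- a view stores the query itself; afterwards it can be used like a table
CREATE VIEW NovemberPerf AS
    SELECT open.symbol, (close.close_price / open.open_price) - 1 AS var
    FROM (SELECT symbol, open_price FROM Candlesticks
          WHERE timestamp = '2019-11-01') open
    JOIN (SELECT symbol, close_price FROM Candlesticks
          WHERE timestamp = '2019-11-29') close
    ON open.symbol = close.symbol;
""")
perf = pd.read_sql_query("SELECT * FROM NovemberPerf", con_demo)
print(perf)  # var = 110 / 100 - 1 = 0.1
```

Once defined, the view can replace the corresponding subquery everywhere it appears, which shortens the big queries below considerably.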
query_pos = """
SELECT company, COUNT(*)
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var > 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var < 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company
"""
query_neg = """
SELECT company, COUNT(*)
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var < 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var > 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company
"""
pd.read_sql_query(query_pos, con)
company | COUNT(*) | |
---|---|---|
0 | Arete | 1 |
1 | B. Riley | 1 |
2 | BMO Capital | 1 |
3 | Baird | 1 |
4 | Bank of America | 3 |
5 | Barclays | 5 |
6 | Benchmark | 2 |
7 | Canaccord Genuity | 2 |
8 | Cascend | 2 |
9 | Citigroup | 2 |
10 | Cowen & Co. | 1 |
11 | Craig-Hallum | 1 |
12 | Credit Suisse | 6 |
13 | DA Davidson | 1 |
14 | Deutsche Bank | 1 |
15 | JP Morgan | 4 |
16 | Jefferies | 1 |
17 | KeyBanc | 3 |
18 | LightShed Partners | 1 |
19 | Longbow Research | 1 |
20 | MKM Partners | 1 |
21 | Mizuho | 1 |
22 | Morgan Stanley | 5 |
23 | National Securities | 1 |
24 | Nomura | 1 |
25 | Oppenheimer | 1 |
26 | Piper Jaffray | 2 |
27 | RBC Capital | 1 |
28 | Raymond James | 3 |
29 | SunTrust Robinson Humphrey | 3 |
30 | Telsey Advisory Group | 1 |
31 | UBS | 4 |
32 | Wedbush | 3 |
33 | Wells Fargo | 1 |
Step 5 - We can now put everything together!
The final query looks gruesome but it’s mostly a matter of copy/pasting the components above. Also, we used the UNION ALL statement to concatenate the results of the two queries, to overcome a situation that might arise if you simply join the positive and negative queries. Kudos if you spotted the potential issue and looked for (and found) a solution! This is what programming is all about. If not, do not feel bad; it was pretty vicious.
Also, the code here is a little different from the one in the video: I aliased the SUM(score) as final_score and ordered the result by descending order of that value.
query = """
SELECT company, SUM(score) as final_score
FROM
(SELECT company, COUNT(*) as score
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var > 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var < 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company
UNION ALL
SELECT company, - COUNT(*)
FROM
(SELECT close.symbol, (close.close_price/open.open_price) - 1 AS var
FROM
(SELECT ca.symbol, ca.open_price as open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-01") open
JOIN
(SELECT ca.symbol, ca.close_price as close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2019-11-29") close
ON close.symbol = open.symbol) perf
JOIN
(SELECT e.symbol, e.company, e.toGrade
FROM Evaluations e
JOIN
(SELECT symbol, company, MAX(gradeTime) as last_eval
FROM Evaluations
WHERE gradeTime LIKE "2019-10%"
GROUP BY symbol, company) last
ON e.symbol = last.symbol AND e.company = last.company
AND e.gradeTime = last.last_eval
GROUP BY e.symbol, e.company) eval
ON perf.symbol = eval.symbol
WHERE
(var < 0 AND toGrade IN ("Buy", "Strong Buy",
"Outperform", "Overweight"))
OR
(var > 0 AND toGrade IN ("Sell", "Underweight", "Underperform"))
GROUP BY company)
GROUP BY company
ORDER BY final_score DESC
"""
pd.read_sql_query(query, con)
company | final_score | |
---|---|---|
0 | Credit Suisse | 6 |
1 | Barclays | 5 |
2 | Morgan Stanley | 4 |
3 | JP Morgan | 4 |
4 | Wedbush | 3 |
5 | UBS | 3 |
6 | SunTrust Robinson Humphrey | 3 |
7 | Raymond James | 2 |
8 | Piper Jaffray | 2 |
9 | KeyBanc | 2 |
10 | Citigroup | 2 |
11 | Cascend | 2 |
12 | Canaccord Genuity | 2 |
13 | Benchmark | 2 |
14 | Wells Fargo | 1 |
15 | Telsey Advisory Group | 1 |
16 | Mizuho | 1 |
17 | MKM Partners | 1 |
18 | Longbow Research | 1 |
19 | LightShed Partners | 1 |
20 | Jefferies | 1 |
21 | DA Davidson | 1 |
22 | Craig-Hallum | 1 |
23 | Cowen & Co. | 1 |
24 | Baird | 1 |
25 | BMO Capital | 1 |
26 | B. Riley | 1 |
27 | Arete | 1 |
28 | RBC Capital | 0 |
29 | Oppenheimer | 0 |
30 | National Securities | 0 |
31 | Bank of America | 0 |
32 | Loop Capital | -1 |
33 | Deutsche Bank | -1 |
34 | Nomura | -2 |
35 | Goldman Sachs | -3 |
It would probably be a good idea to report the result as a percentage of the advice given, but this too is left as an additional exercise if you’re interested.
SQL III - A horse race
If capitalism is a race, companies are horses. But some of them are more like donkeys. Compute the growth rate of each company’s stock price over January 2020 (the ratio of the latest closing price to the earliest opening price). Compare it with those of the company’s peers and give the rank of the company within the list of its peers.
You might want to think about what this rank represents in order to solve this one.
See solution
Step 1 - Much like for exercise 2, we start by breaking down the query into smaller work packages. Fortunately, this is significantly easier in this case.
And here is the plan:
<figure >
Step 2 - Much like in step 2 of exercise 2, we start by figuring out the performance of the stock during the month of January. The same caveats and hypocrisy regarding the hardcoded dates apply.
And here is the code:
query_perf = """
SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol
"""
pd.read_sql_query(query_perf, con)
symbol | perf | |
---|---|---|
0 | FINX | 0.034916 |
1 | BTA | 0.008567 |
2 | NORW | -0.081423 |
3 | NIM | 0.014272 |
4 | VONG | 0.014878 |
... | ... | ... |
2168 | MFV | 0.077519 |
2169 | MNCL | -0.000978 |
2170 | VTWO | -0.037350 |
2171 | PAAC | 0.010499 |
2172 | VTI | -0.007044 |
2173 rows × 2 columns
Step 3 - We can now count the number of peers that have a better performance than each company. Please take very good care to understand what we are doing at this step, as it is the very heart of the solution. The distinction between a table on disk and a table (or the result of a query) in memory is paramount to understanding how queries are parsed and understood by a database engine!
And here is the query:
query_numb_peers = """
SELECT p.symbol, COUNT(*)
FROM Peers p
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_company
ON p.symbol = perf_company.symbol
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_peer
ON p.peers_symbol = perf_peer.symbol
WHERE perf_peer.perf > perf_company.perf
GROUP BY p.symbol
"""
pd.read_sql_query(query_numb_peers, con)
symbol | COUNT(*) | |
---|---|---|
0 | AAAU | 132 |
1 | AADR | 1108 |
2 | AAPL | 3 |
3 | AAXJ | 1866 |
4 | ACAM | 418 |
... | ... | ... |
2162 | ZJPN | 1446 |
2163 | ZMLP | 1828 |
2164 | ZROZ | 39 |
2165 | ZSL | 978 |
2166 | ZTR | 648 |
2167 rows × 2 columns
Step 4 - Here’s the final step. The logic is similar to the one in exercise 2, with a twist: we start with a query that returns all the companies that have at least 1 peer and gives each of them an offset of 1 in the rank (the why is explained in the video).
And here is the final query:
query = """
SELECT symbol, SUM(position) as rank
FROM
(SELECT DISTINCT symbol, 1 as position
FROM Peers
UNION ALL
SELECT p.symbol, COUNT(*)
FROM Peers p
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_company
ON p.symbol = perf_company.symbol
JOIN
(SELECT open.symbol, (close_price/open_price) - 1 as perf
FROM
(SELECT ca.symbol, ca.open_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-02") open
JOIN
(SELECT ca.symbol, ca.close_price
FROM Candlesticks ca
WHERE ca.timestamp = "2020-01-31") close
ON open.symbol = close.symbol) perf_peer
ON p.peers_symbol = perf_peer.symbol
WHERE perf_peer.perf > perf_company.perf
GROUP BY p.symbol)
GROUP BY symbol
"""
pd.read_sql_query(query, con)
symbol | rank | |
---|---|---|
0 | AAAU | 133 |
1 | AADR | 1109 |
2 | AAPL | 4 |
3 | AAXJ | 1867 |
4 | ACAM | 419 |
... | ... | ... |
2200 | ZJPN | 1447 |
2201 | ZMLP | 1829 |
2202 | ZROZ | 40 |
2203 | ZSL | 979 |
2204 | ZTR | 649 |
2205 rows × 2 columns
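The baseline-plus-count trick is easier to verify on a toy table (invented data): every symbol gets a baseline position of 1, and each row with a strictly better performance adds 1 to it:

```python
import sqlite3
import pandas as pd

# Invented January performances for three symbols
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE Perf (symbol TEXT, perf REAL);
INSERT INTO Perf VALUES ('AAA', 0.10), ('BBB', 0.05), ('CCC', -0.02);
""")

# rank = 1 (baseline offset) + number of rows with a strictly better perf;
# the baseline row is what keeps the best performer in the result at all
query = """SELECT symbol, SUM(position) AS rank
           FROM (SELECT DISTINCT symbol, 1 AS position
                 FROM Perf
                 UNION ALL
                 SELECT me.symbol, COUNT(*)
                 FROM Perf me JOIN Perf other ON other.perf > me.perf
                 GROUP BY me.symbol)
           GROUP BY symbol"""
ranks = pd.read_sql_query(query, con_demo)
print(dict(zip(ranks["symbol"], ranks["rank"])))  # AAA ranks 1, BBB 2, CCC 3
```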
Aaaaaaaand, we are now in business! Time to relax, and then to go on with understanding how we can model business processes to improve and automate them.
Where to go next?
Do the exercises if they are already available on the UV (the corrections will be posted on this page pretty soon). Otherwise, or if you want to know more about business processes and how to represent, digitize and, ultimately, automate them, head to the BPMN page!