1.1 - Python

The Digital Firm

Estimated Reading Time: 43 minutes

Thinking must never submit itself, neither to a dogma, nor to a party, nor to a passion, nor to an interest, nor to a preconceived idea, nor to anything whatsoever, except to the facts themselves, because for it to submit to anything else would be the end of its existence.

Henri Poincaré, Discourse at the occasion of the 75th anniversary of the Free University of Brussels (1909)

Introduction

  • DOWNEY A. (2015), “Think Python”, 2nd Edition, Green Tea Press, Needham, Massachusetts. Available here
  • SWEIGART A. (2020), “Automate the Boring Stuff with Python”, 2nd Edition, No Starch Press, San Francisco, California. Available here

Variables

Basics

We start our programming journey by discussing what variables are. They are one of the most basic components of any programming language: programming without them would amount to using computers as glorified calculators.

Variables can be seen as little “pockets” where you put things. Those things are values. You can put a value inside a variable by using the equal sign. For example, if I want to create a variable with the name age_of_the_captain and give it the value 42, I would write:

age_of_the_captain = 42

If, afterwards, I want to print the age of the captain on the screen, I can ask Python to do so by using the name of the variable rather than the value.

print(age_of_the_captain)
42

Using variables is a great way to:

  1. Avoid having to keep a lot of things in your head as you program
  2. Deal with data that is not known at the moment you program
  3. Make your code more readable

To explain point 2 in the list above, think about it in the following way: say you want to write a program that asks the user her age and determines whether she is young or old (young here being defined as below 22 years old). You’re in a pickle! Indeed, how will you write this very simple program if you do not know the age of the user?

In this case, salvation comes through the use of a variable: you are going to create a variable and, once the user has entered her age, this variable will hold the correct value. You can then use this variable to determine the behavior of your program (whether it has to consider the person young or old).

For instance (if you’re running the notebook on your computer, feel free to change the age and see what changes):

age_of_the_user = input("What is your age?")
if int(age_of_the_user) >= 22: 
    # We need to use int() because by default, user input is a string - see below 
    status = "old"
else:
    status = "young"
print( "You are " + status )
What is your age? 31


You are old

What a brutal little piece of software… Anyway, note that in this short code, I actually created two variables: age_of_the_user and status. I could have created just the first one, but I chose to create status for two reasons. First, I can then reuse the status elsewhere in my program without rechecking the age, so I’m saving my computer time by committing computed information to memory. Second, by doing so, I only had to write one print() statement. Time is money and, by avoiding rewriting this, I may have saved a fraction of a cent (of course, by writing this explanation, I probably wasted it away). It is trivial, and arguably prejudicial in such a simple example, but trying to avoid rewriting chunks of code has multiple benefits that we will not discuss here.
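To make the first reason concrete, here is a minimal sketch in which status is computed once and then reused. The hard-coded answer "31" stands in for input() so that the sketch runs without a keyboard, and the second print() is a made-up second use of the variable.

```python
# Hypothetical stand-in for input(): a hard-coded answer keeps the sketch non-interactive.
age_of_the_user = "31"

# The age is converted and compared once; the outcome is stored in status.
if int(age_of_the_user) >= 22:
    status = "old"
else:
    status = "young"

# status is reused twice without converting or comparing the age again.
print("You are " + status)
print("Status on file: " + status)
```

Changing the hard-coded age to a value below 22 changes both printed lines at once, since both rely on the same variable.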

Data Type

In Python, each variable has a type, according to the value it contains. You don’t have to specify this type each time you create a variable, as Python is dynamically typed (it recognizes, or infers, the type of the variable by evaluating its value). This saves time while programming but, on the other hand, it can create problems during the execution of your program: functions (the operations we execute on variables, such as the print command in the snippet of code above) can usually operate on certain types of variables and not on others.

Consider a mathematical operation, such as a division. Most of the time, it only makes sense to divide two numbers: trying to divide a string of characters is difficult to conceptualize from a mathematical point of view. In Python, division is represented by the symbol /. It is a special kind of function: most functions are written as the name of the function followed by the variables they work on between parentheses (see below), whereas / is written between the two values. As such, you can do:

10/2
5.0
11/2
5.5

But doing:

"This is a sentence!"/2
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-6-961b9ae07049> in <module>
----> 1 "This is a sentence!"/2


TypeError: unsupported operand type(s) for /: 'str' and 'int'

Take a minute to read the above error message; I guarantee this isn’t the last time you see it. Python is telling you exactly what I just explained, in its weird, mechanical language: TypeError is the general description of the error raised when you try to run a function with (a combination of) variable(s) of the wrong type. It then goes on to describe the exact issue: “unsupported operand type(s) for /: 'str' and 'int'” means that the function / cannot be used with a string (of characters) as the first value and an integer number as the second one.

Talking about types, there are actually a small number of them (at least at first - those are the basic, or primitive types). You can discover the type of a variable, or even of a value itself, by using the type function. Let’s experiment.

my_cool_integer = 11
type(my_cool_integer)
int

The first type is the integer number: a number without a decimal point. In the snippet of code above, I create a variable called my_cool_integer and assign it the value 11, which is an integer. (There are actually few requirements on the name a variable can take, as long as you use standard characters. Try to keep your variable names descriptive of what their content represents.) We then use the function type to discover the type of the variable. Note that I did not use print to make the result of the type function appear on the screen here. This is linked to the way Jupyter, which I discussed in the introductory session, works: by default, it displays the result of the last line of code in the cell. Beware that it does not do so for the other lines.
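The remark about Jupyter can be illustrated with a small sketch. The variable another_integer is made up for the occasion; the point is only that every result you want to see, except the one on the last line of a cell, has to go through print().

```python
my_cool_integer = 11
another_integer = 2020  # hypothetical second variable, for illustration only

# These lines are not the last of the cell, so print() is needed to see the result.
print(type(my_cool_integer))
print(type(another_integer))

# In Jupyter, only the result of this last line is displayed automatically.
my_cool_integer + another_integer
```

Run as a plain script outside Jupyter, the last line displays nothing at all, which is another reason to get used to print().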

For decimal numbers (commonly called floating point or float numbers), we’d have:

an_equally_cool_float = 11.0
type(an_equally_cool_float)
float

and you also have strings of characters (or simply strings, which Python abbreviates to str) that start and end with a single or double quote:

this_is_a_string = "'sup people?"
type(this_is_a_string)
str

Those are the most frequent primitive types you are likely to manipulate as we continue our journey to communicate with computers. Values can sometimes be converted from one type to another. For example, an integer can be converted to a float (which just takes a 0 after the decimal point). To do so, use the name of the type as a function and give it the value you want to convert. This process is called casting.

cool_number = 11
float(cool_number)
11.0

Here, the value that was originally the integer 11 was converted to the float 11.0. Note that this does not change the type of the variable cool_number. Instead, it creates a new, converted value. As such:

print(type(cool_number))
cool_number
<class 'int'>





11

See? It is still an integer. If we want to “save” the result of the transformation, we have to assign it, either to the cool_number variable or to another one.

cool_number = float(cool_number)
print(type(cool_number))
cool_number
<class 'float'>





11.0

The first line converts the value into a float and then assigns the result to the cool_number variable, erasing the integer that was initially stored inside. I then print the type of the variable, which has indeed been changed to float, and display the value contained in the variable, which is 11.0.

Of course, casting only works for certain types of conversions. For example:

int("What's the point of converting this into an integer?")
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-13-3b6ba5f634bf> in <module>
----> 1 int("What's the point of converting this into an integer?")


ValueError: invalid literal for int() with base 10: "What's the point of converting this into an integer?"

This gives you a ValueError, as Python has no clue how to convert this sentence into an integer number. Note, however, that we did not get a TypeError (which would indicate that this operation cannot be done with this type of variable)… This indicates that certain values of the type str would work. And indeed:

a_string_that_contains_an_int = "42"
print(type(a_string_that_contains_an_int))
int(a_string_that_contains_an_int)
<class 'str'>





42

Above, note that the variable a_string_that_contains_an_int is indeed a string (that’s what we discover in line 2). Python does not try to convert the string "42" by itself; it only does so when you force it to by casting it. This is exactly what we do in line 3.
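Casting works in other directions too. The sketch below uses made-up variable names to show three more conversions: a string to a float, an integer to a string (handy when gluing it to other text with +), and a float to an integer, which drops the decimal part rather than rounding.

```python
price_as_text = "3.5"          # hypothetical string that holds a decimal number
price = float(price_as_text)   # str -> float
print(price + 1)               # 4.5

captain_age = 42
message = "The captain is " + str(captain_age)  # int -> str before concatenating
print(message)                 # The captain is 42

truncated = int(11.9)          # float -> int cuts the decimals off
print(truncated)               # 11
```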

Go Further

You might have been surprised that I used division as an example of an operation that cannot be done on a string and an integer. Multiplication would usually have come to mind faster. Open a Python interpreter and try to multiply a string by an integer. Also, try to add two strings together. Indeed, Python implements, in creative ways, some operations that would be nonsensical in pure arithmetic!

Data Structures

Basics

Very often, it makes sense to “group” pieces of information that go together. Think about how you would represent the data about one of your users. Suppose your first user is called Hans. You could surely describe all the info about Hans in different variables and keep his name in the name of each variable.

hans_name = "Gruber"
hans_age = 55
hans_email = "[email protected]"

But there are several issues. Python is not aware of the meaning of variable names: for it, the variable hans_name isn’t any closer to hans_email than to a variable named pink_banana, so you have to maintain a log in your own head of which variables are connected while you write the program. Moreover, if there are new users, you have to rewrite your code to create variables for them and keep them in your head until, unavoidably, your brain goes…

Your head explodes

To avoid finding ourselves in those predicaments, programming language developers have thought about ways to link certain data. Those are the data structures. Usually, and certainly in Python’s case, the most common data structures are lists and dictionaries.

Dictionaries

Dictionaries are structures that enable storing values under special keys. This is abstract, but it is in fact relatively intuitive once you see some examples.

Let’s go back to the example with our user, Hans, from above. In this case, we track several pieces of information about Hans (his name, age and email address). Those fragments of information are facets of the same underlying entity (in this case, the person named Hans Gruber). We can therefore store them all together in a dictionary. To do so, let’s write:

hans = {"first_name" : "Hans",
        "name" : "Gruber",
        "age" : 42,
       "email" : "[email protected]"}

That’s it. We have a dictionary representing Hans. We created it by using curly braces (the { and } symbols) and then passing pairs of keys and values separated by a colon (:), the pairs being separated from one another by a comma (,).

Now, having a dictionary is one thing, but we need to use it. There are two main operations you will probably want to do most often with a dictionary: find the value under a certain key, and create new key/value pairs (or update the value under an existing key, which happens to be done in the same way).

Retrieving the value under a known key is done using the name of the dictionary and, between square brackets, the key. For example, we can retrieve the email address of Hans by writing:

hans["email"]
'[email protected]'

Of course, you can only retrieve values that are stored under existing keys. If you try to do:

hans["height"]
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-18-056ec473fe8f> in <module>
----> 1 hans["height"]


KeyError: 'height'

you obtain the infamous KeyError, which simply tells you that no value was stored under that key, because the key does not exist in this dictionary. Note that the key must be exactly the same as the one you stored, and this entails that the comparison is case-sensitive (meaning that uppercase and lowercase characters are not the same). In this case, “email” is indeed a key of the dictionary stored in the variable hans, but “Email” is not. This kind of small difference has caused many a KeyError through history, and attention to detail is as important in programming as a reliable source of caffeine.
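If you would rather check whether a key exists before asking for its value, Python’s in operator does that and answers with True or False. The sketch below is self-contained: it rebuilds a trimmed-down hans dictionary, and the email address in it is a made-up placeholder.

```python
# A trimmed-down copy of the dictionary; the email is a made-up placeholder.
hans = {"first_name": "Hans", "name": "Gruber", "email": "hans@example.com"}

has_email = "email" in hans              # exact match: the key exists
has_capitalised_email = "Email" in hans  # different case: not the same key
has_height = "height" in hans            # never stored: the key does not exist

print(has_email, has_capitalised_email, has_height)  # True False False
```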

Now that we know that our data structure currently does not contain Hans’ height, we can add it. Doing so is done by assigning a value to the key in the dictionary, as if it was a regular variable. In this case, we can write:

hans["height"] = 1.79

and if we try to retrieve this value, we do not expect running into a KeyError anymore. The proof:

hans["height"]
1.79

After a while, you realise that Alan Rickman, the actor who played Hans Gruber in Die Hard (and whom many of you probably know best for his portrayal of Severus Snape), was actually 1.85m tall. (Believe it or not, there is actually a website with the height of many celebrities: CelebHeight.)

We can correct this information by simply assigning a new value to the dictionary under the relevant key. As such:

hans["height"] = 1.85

The information has indeed been corrected:

hans["height"]
1.85

We now know most of what’s important to manipulate dictionaries. Before letting you roam the world with this extremely relevant knowledge, let me give you a warning and a useful trick. The warning is that a dictionary is not a positional structure: its values are reached through their keys, not through a rank. Since Python 3.7, a dictionary does remember the order in which its keys were inserted, but older versions made no such promise, and declaring a key before another does not make it “before” that other one in any meaningful sense. This is not an issue when you start writing code but, as you grow bolder and start writing more complicated pieces of software, you might tend to forget it.

The trick is more directly useful: sometimes, you want to have access to all the keys in a dictionary. To do so, you can use the keys() method. (Without entering into too much detail, a method is akin to a function that you apply to an object. You use it by putting a dot, the name of the method and then parentheses after the name of the object. For those interested in Object-Oriented Programming, there is a primer on StackAbuse.) To print all the keys in our dictionary hans, I would write:

hans.keys()
dict_keys(['first_name', 'name', 'age', 'email', 'height'])

We will see that this kind of access is particularly useful with the ability to iterate on lists (this dict_keys is something that “resembles” a list and can be cast into one).

Cool, we have the info about Hans Gruber, and all the information about him is somehow “connected” together in our program (because it is all part of the same dictionary). But this does not help me if I need to manually create one dictionary for every user of my application. We will tackle this aspect with lists.
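Casting the result of keys() into a list works like the casts seen earlier: use the name of the target type, list, as a function. This is a minimal sketch on a trimmed-down copy of the dictionary; the order shown assumes Python 3.7 or later, where keys come out in insertion order.

```python
# A trimmed-down copy of the dictionary built earlier.
hans = {"first_name": "Hans", "name": "Gruber", "age": 42}

keys_as_list = list(hans.keys())  # cast the dict_keys object into a real list
print(keys_as_list)               # ['first_name', 'name', 'age']
print(type(keys_as_list))         # <class 'list'>
```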

Yippee Ki-Yay!

Lists

Lists are ordered collections of values, data structures or variables. That’s it. Now, you might wonder in which cases we encounter those. There are two main cases: when there is a de facto sequence of objects (in a time series, for example, it makes sense that the first elements come before the last ones) and when there are several similar objects but you are not too sure about their number.

Think about the software we are writing: it turns out we are building a small application for all the villains in Die Hard to connect and finally make John McClane… die hard (badum-tssss). OK, so, Hans Gruber is already created; let’s create a second one.

karl = {"first_name": "Karl",
       "name" : "Vreski",
       "age" : 39}

We now have Karl Vreski but, for now, he and Hans are not part of any common structure. That’s where the list comes in handy. Both hans and karl represent the same type of entity (they are Die Hard villains) and there may be many more. Imagine we decide to store them in a dictionary. Each time we want to add a new villain, we would have to create a new key in it. Sure, we could use their name as a key, but what if two villains have the same name? It turns out the brother of Hans Gruber is also a villain, so this is not good. And surely, first names are even worse… We will talk about this more in a future session (about SQL) when we discuss the topic of primary keys but, for now, let’s just agree it is a bad idea.

The alternative is therefore to put them in a list. The first villain will have number 0 (this is important: in Python, list indices start at 0), the second number 1, and so forth. Every time you add a villain, it goes to the end of the list. Let’s now create a list with our two villains.
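To see why the family name makes a poor key, here is a small sketch with a hypothetical dictionary called villains_by_name. Simon is the first name of Hans Gruber’s brother; storing him under the same key silently replaces Hans instead of adding a second entry.

```python
# Hypothetical dictionary keyed by family name.
villains_by_name = {}
villains_by_name["Gruber"] = {"first_name": "Hans", "name": "Gruber"}
villains_by_name["Gruber"] = {"first_name": "Simon", "name": "Gruber"}  # same key again

number_of_villains = len(villains_by_name)
survivor = villains_by_name["Gruber"]["first_name"]

print(number_of_villains)  # 1
print(survivor)            # Simon
```

No error is raised: assigning to an existing key is exactly how a value is updated, so the first brother simply disappears.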

die_hard_villains = [hans, karl]

Note that we create a list by putting elements, separated by commas (,), between two square brackets ([]) that mark the limits of the list. We can look at the content of this list by just displaying it.

die_hard_villains
[{'first_name': 'Hans',
  'name': 'Gruber',
  'age': 42,
  'email': '[email protected]',
  'height': 1.85},
 {'first_name': 'Karl', 'name': 'Vreski', 'age': 39}]

If I want to retrieve the first villain, I can simply access the list at element 0 using the square brackets as we have done for the dictionaries.

die_hard_villains[0]
{'first_name': 'Hans',
 'name': 'Gruber',
 'age': 42,
 'email': '[email protected]',
 'height': 1.85}

It is indeed the dictionary containing Hans Gruber. Note that, while the keys of a dictionary are usually strings, the “keys” of a list are always integers in a range: if an integer is a valid index in a list, every integer between 0 and this integer is a valid index of the list, too.

If you try to access the list at an index that does not exist:

die_hard_villains[2]
---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-28-fbf59438a171> in <module>
----> 1 die_hard_villains[2]


IndexError: list index out of range

You get an IndexError with the description “list index out of range”. This happens rather frequently when you forget that the first element is indexed by 0: you might be led to think that the last valid index of a list of N elements is N, while in fact it is N-1. For instance, in the list above, we have 2 elements, indexed by 0 and 1. There is therefore no element 2.

It is super easy to add elements at the end of a list. To do so, you can simply use the method .append(the_element_you_want_to_add). Let’s add Irina Komarov.

irina = {"first_name" : "Irina",
        "name" : "Komarov",
        "age" : 30}
die_hard_villains.append(irina)
die_hard_villains
[{'first_name': 'Hans',
  'name': 'Gruber',
  'age': 42,
  'email': '[email protected]',
  'height': 1.85},
 {'first_name': 'Karl', 'name': 'Vreski', 'age': 39},
 {'first_name': 'Irina', 'name': 'Komarov', 'age': 30}]

Another handy tool is the len function, which returns the length of a list. For example, we can find out that we currently have 3 villains in our list by writing:

len(die_hard_villains)
3

and we can therefore retrieve the last element by writing:

die_hard_villains[len(die_hard_villains) - 1]
{'first_name': 'Irina', 'name': 'Komarov', 'age': 30}

Note that in Python, as in many programming languages or even plain math, composition (i.e. the nesting of several functions) is always resolved from the inside out. In the code above, we first resolve len(die_hard_villains), which gives 3; we then resolve 3-1, which gives 2; and finally, we resolve die_hard_villains[2], which returns the dictionary containing the information about Irina.
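As a side note, Python also accepts negative indices, which count from the end of the list, so the last element can be reached without len. The sketch below uses a stand-in list of first names so that it runs on its own.

```python
# Stand-in list of first names, so that the sketch runs on its own.
villain_first_names = ["Hans", "Karl", "Irina"]

last_with_len = villain_first_names[len(villain_first_names) - 1]
last_with_negative_index = villain_first_names[-1]  # -1 means "first from the end"

print(last_with_len, last_with_negative_index)  # Irina Irina
```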

you’re amazing

Now, we have sneakily created an embedded data structure! Surprise! See, here we have a list containing dictionaries. It turns out that, when working with data coming from the internet, this kind of crazy structure is incredibly common. This is why part of the exercises is dedicated to becoming very familiar with them. It is not so hard once you’ve done a few, and it brings you a long way towards being able to communicate with a broad array of machines. Assume we want to access Hans Gruber’s email: with our die_hard_villains list, we can do so relatively easily (if we know that Hans is the first element of our list) with:

die_hard_villains[0]["email"]
'[email protected]'

See, here we access the key “email” of the first element of the list. This contains the information we want but it forces us to know the structure of the full list. For example, I have to know that Hans is the first element at the index 0 and that he has an email (what if he hadn’t?). Fortunately, we have what we call control flow.
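For the “what if he hadn’t?” question, one possible answer, sketched here under the assumption that a fallback text is acceptable, is the dictionary method .get(), which returns a default value instead of raising a KeyError when the key is missing. The email address and the fallback text below are made-up placeholders.

```python
# Trimmed-down copies of the villains; the email is a made-up placeholder.
hans = {"first_name": "Hans", "email": "hans@example.com"}
karl = {"first_name": "Karl"}  # no "email" key, as in the text
die_hard_villains = [hans, karl]

hans_email = die_hard_villains[0]["email"]
# .get() takes the key and, optionally, the value to return if the key is absent.
karl_email = die_hard_villains[1].get("email", "no email on file")

print(hans_email)  # hans@example.com
print(karl_email)  # no email on file
```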

Time for a break

OK, we have covered a lot of ground so far. Here is a nice place to take a break. If you have lost motivation and feel like this is pointless, may I suggest that, rather than binge-watching something on Netflix, you try to get back on the bandwagon by watching this video about the beauty you might find in writing code (it’s long but worth it)?

The art of code

It is a pity that the learning curve for coding is so steep at the beginning but, once you pass the initial slope, the first plateau already provides you with abilities that would have been considered supernatural a couple of decades ago. Take a breather, grab something to drink and then let’s power together through the last part of this arguably relatively dry first lesson. The rewards you’ll reap will be plenty, I promise.

Control Flow

Basics

So far, everything we’ve seen makes programming seem like using a fancy calculator: you have to do most of the heavy lifting, and it seems that you have to work for the computer rather than having it work for you. This changes now.

Computers are better than humans at doing boring stuff again and again without getting tired. If you think about the execution of your code as a little reading head on a record player (like an old-timey phonograph) reading instructions in sequence, telling your program to repeat something amounts to telling the reading head to go back in your code. As such, consider the pseudocode below.

  1. say “hello”
  2. wait 20 seconds
  3. go to point 1

The above code, if read by a computer that understands English, will just have the computer saying “hello” every 20 seconds until the end of time or, more realistically, until there is some malfunction with its hardware.

To avoid this, we also need some conditional logic. Here is the modified pseudocode with this logic. Now it’s just going to repeat “hello” 1000 times (much more than a human should ever be made to do).

  1. initialize a counter, named “my_counter” at 0
  2. say “hello”
  3. increase the value of “my_counter” by 1
  4. wait 20 seconds
  5. if my_counter is smaller than 1000, go to point 2

See, with conditional logic and what are called loops, we can create arbitrarily complex logic and behavior. This is called control flow, as you control the execution flow of your program (you make the “reading head” of the interpreter jump to certain lines in your code). Control flow is one of the crucial parts that make some people hope to have computers as smart as humans some day. (For readers interested in links between computing, art and intelligence, I warmly recommend the monumental “Gödel, Escher, Bach”, by Douglas Hofstadter.) It’s that important.
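For the curious, the counter pseudocode above could be sketched in Python as follows. This is only one possible translation: it uses a while loop, the limit is lowered from 1000 to 3, and the 20-second wait is left out so that the sketch finishes instantly.

```python
my_counter = 0                   # 1. initialize the counter at 0

while my_counter < 3:            # 5. keep going while the counter is below the limit
    print("hello")               # 2. say "hello"
    my_counter = my_counter + 1  # 3. increase the counter by 1
    # 4. the wait is deliberately omitted in this sketch

print("Said hello", my_counter, "times")
```

Python has no “go to” instruction: the jump back is expressed by the indented block under the while line, which is repeated as long as the condition holds.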

Conditional Logic

Python makes writing conditional logic very close to writing English statements. Let’s say I have a variable that contains a string of characters. If the length of this string is larger than 5 characters, I want to print the word followed by “Wow, that’s a long word!” on the screen. Otherwise, I want to print the word followed by “That’s an OK word.”. Let’s start with the first part. The code below checks if the length of the word is higher than 5 and prints the appropriate sentence if so.

word = "python" # this has 6 letters

if len(word) > 5:
    print(word, ": ", "Wow, that's a long word!")
python :  Wow, that's a long word!

To test a condition, write if, then the test you want to execute (in this case, whether the length of the word is higher than 5), then a colon, and then start a new line and indent what you want to happen. To indent code means to add 2 or 4 spaces (or a tab, but avoid tabs if you can) before the beginning of the line. All the consecutive indented lines will only be executed if the condition in the if clause is fulfilled. The first non-indented line indicates the end of the block that is only executed conditionally. For example,

word = "python" # this has 6 letters

if len(word) > 5:
    print("Test started")
    print(word, ": ", "Wow, that's a long word!")
print("Test finished")
Test started
python :  Wow, that's a long word!
Test finished

Now let’s see what happens if we use the name of another programming language, Ruby.

word = "ruby" # this has 4 letters

if len(word) > 5:
    print("Test started")
    print(word, ": ", "Wow, that's a long word!")
print("Test finished")
Test finished

The only part of the code that gets executed is the last line. Why? Because, since the condition was not verified (the length of the word was not higher than 5 characters), the indented lines were simply ignored. How can we do something if the condition is verified and something else if it isn’t? We can use the keyword else. For instance:

word = "ruby" # this has 4 letters

if len(word) > 5:
    print("Test passed")
    print(word, ": ", "Wow, that's a long word!")
else:
    print("Test failed")
    print(word, ": ", "That's an OK word.")
print("Test finished")
Test failed
ruby :  That's an OK word.
Test finished

In this case, since the condition is not verified, only the indented part in the else clause is executed (the reading head “jumps” from the line with the if to the line immediately after the else). The last line gets executed no matter what, as it is not indented and therefore does not depend on what happens in the condition block.

Now, imagine we also want to show a different message if the word is very short, say, less than 3 letters. Sure, you could put an if in the else clause and indent twice (with 8 spaces if you use 4 spaces for indentation, or with 4 spaces if you use 2), but you can also use the elif clause. As such, let’s try with another programming language name, “Go”.

word = "go" # this has 2 letters

if len(word) > 5:
    print("Test passed")
    print(word, ": ", "Wow, that's a long word!")
elif len(word) < 3:
    print("Alternative test passed")
    print(word, ": ", "Such a short word!")
else:
    print("Test failed")
    print(word, ": ", "That's an OK word.")
print("Test finished")
Alternative test passed
go :  Such a short word!
Test finished
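For comparison, the nested alternative mentioned above (an if placed inside the else clause and indented twice) could look like the sketch below. The variable message is introduced here only to keep the sketch short; the decision logic is the same as in the elif version.

```python
word = "go"  # this has 2 letters

if len(word) > 5:
    message = "Wow, that's a long word!"
else:
    # The second test lives inside the else clause, one indentation level deeper.
    if len(word) < 3:
        message = "Such a short word!"
    else:
        message = "That's an OK word."

print(word, ": ", message)
```

Both versions behave identically; elif mostly saves one level of indentation for each additional test.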

Keep in mind that with if...elif...else... only one of the clauses is executed: the first one whose condition matches. Once the associated code block (the adjacent instructions indented at the same level) is executed, the reading head of the program jumps out of the whole structure. As such, if we were to write the condition as:

word = "Objective-C" # this has 11 letters

if len(word) > 5:
    print("Test passed")
    print(word, ": ", "Wow, that's a long word!")
elif len(word) > 8:
    print("Alternative super long test passed")
    print(word, ": ", "This word is way too long")
elif len(word) < 3:
    print("Alternative super short test passed")
    print(word, ": ", "Such a short word!")
else:
    print("Test failed")
    print(word, ": ", "That's an OK word.")
print("Test finished")
Test passed
Objective-C :  Wow, that's a long word!
Test finished

Although the word has more than 8 letters, you don’t see the corresponding message (This word is way too long) because, as the first condition matches, the code under this condition is executed and the program then jumps to the last line, not evaluating any of the remaining conditions in the elif.
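One way to make the “way too long” message reachable, sketched below, is to test the most restrictive condition first. The variable message is introduced only to keep the sketch compact.

```python
word = "Objective-C"  # this has 11 characters

if len(word) > 8:        # the most restrictive test now comes first
    message = "This word is way too long"
elif len(word) > 5:      # only reached by words of 6 to 8 characters
    message = "Wow, that's a long word!"
elif len(word) < 3:
    message = "Such a short word!"
else:
    message = "That's an OK word."

print(word, ": ", message)
```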

Loops

We wrote a lot of code to do a simple thing in the last section. And what is even worse is that we rewrote the same code snippets several times! Surely this isn’t the best use of our time. In fact, in programming, there is a loose principle (more like a general guideline) that says that you should not repeat yourself; repeated code is sometimes considered a code smell. (A code smell is a code pattern indicating that the code might benefit from some change. Think of it as a sentence in an assignment that you know is bad but leave in nonetheless until the next revision.) Programmers abbreviate this principle as DRY (don’t repeat yourself).

What if we had a huge list with the names of countless programming languages? Would we rewrite the full if...elif...else... countless times? Of course not. To tackle this kind of problem, we can use loops. Those were introduced in the basics section of this chapter. At its core, a loop is just a structure that makes the program repeat the same instructions in a sequence.

In Python, there are two main types of loops: the for loop, which iterates over a collection of values and which we will see in detail here, and the while loop, which combines the concept of a loop with that of conditional logic to repeat a portion of code while some condition is verified. I’ll leave it to you to learn about the while loop by yourself, as it is not so different from a for loop (over an infinite collection) combined with an if statement.

So, at this point you’re probably wondering what “iterate over a collection” means. Well, it means that, if you pass a data structure to a for loop that is composed of several values (or other data structure) and is ordered, the loop is going to execute the code in its block once for the first element, then once for the second, then the third, and so forth.

Still not entirely sure what this means? This is normal: it is an extremely theoretical way to see a loop. So here I’m going to write the code that takes every number between 0 and 5 and prints its square on the screen. I’ll do so using the concept of lists that we have seen before. In the code, ** is the operator that corresponds to exponentiation.

numbers_to_square = [0, 1, 2, 3, 4, 5]
current_number = numbers_to_square[0] # We take the first value of the list
print(current_number**2) # We print its squared value
current_number = numbers_to_square[1] # We take the second value of the list
print(current_number**2) # We print its squared value
current_number = numbers_to_square[2] # We take the third value of the list
print(current_number**2) # We print its squared value
current_number = numbers_to_square[3] # We take the fourth value of the list
print(current_number**2) # We print its squared value
current_number = numbers_to_square[4] # We take the fifth value of the list
print(current_number**2) # We print its squared value
current_number = numbers_to_square[5] # We take the sixth value of the list
print(current_number**2) # We print its squared value
0
1
4
9
16
25

Sure, the code works but, as the saying goes, “Ain’t nobody got time for that”. So we need to use the for loop to speed coding up while retaining the logic. We can rewrite the code above in 2 lines. The function range creates a kind of “list”. If you only give it 1 argument, it will produce the list of all integers between 0 (included) and the argument (excluded). If you pass two arguments, it will be the list of integers between the first argument (included) and the second (excluded).

for current_number in range(6):
    print(current_number ** 2)
0
1
4
9
16
25

Much more compact. Note that, in this loop, current_number acts as a variable that is reassigned at each iteration of the loop to the value found at the current position in the collection.
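
To make the two forms of range concrete, here is a small sketch. The variable names from_zero, from_two and squares are only illustrative, and wrapping range in list() is just a convenient way to look at the values it produces.

```python
# range with one argument: from 0 (included) to 6 (excluded)
from_zero = list(range(6))
print(from_zero)

# range with two arguments: from 2 (included) to 6 (excluded)
from_two = list(range(2, 6))
print(from_two)

# The loop variable takes each of these values in turn
squares = []
for current_number in range(2, 6):
    squares.append(current_number ** 2)
print(squares)
```

The last loop only squares the numbers from 2 to 5 because the first argument moves the starting point of the range.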

Of course, a collection can be any kind of list and contain anything, even dictionaries. For example, here is the loop that is going to print the first name of all the Die Hard villains I gathered previously.

for villain in die_hard_villains:
    print(villain["first_name"])
Hans
Karl
Irina

In the code above, villain successively takes the value of each dictionary in the list. This means that inside the loop (i.e. in the code that is indented below the for statement), I can use it as I would any dictionary. In this case, I ask to print the value that is stored under the first_name key.

You can also iterate over dictionaries directly, but to do so you need to jump through some hoops. The first way is to use the .keys() method we mentioned above. Don’t forget that dictionaries should not be treated as ordered collections! Since Python 3.7 they keep the order in which the keys were inserted, but older versions give no such guarantee, so do not do anything in your loop that depends on one specific key being used before another as this will not necessarily produce the result you expect. The loop looks like this:

for one_key in hans.keys():
    print(one_key, ": ", hans[one_key])
first_name :  Hans
name :  Gruber
age :  42
email :  [email protected]
height :  1.85

Alternatively, you can retrieve the key and the value in the loop using the .items() method. You can do it like this:

for one_key, one_value in hans.items():
    print(one_key, ": ", one_value)
first_name :  Hans
name :  Gruber
age :  42
email :  [email protected]
height :  1.85

Both methods work. The second is probably a little clearer, but it is mostly a matter of opinion.
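
If the order of the keys does matter to you, one possible approach is to sort them explicitly rather than rely on the dictionary. The sketch below assumes a reduced copy of the hans dictionary (without the email key) and uses the built-in sorted function, which is an addition to what the text above describes.

```python
# A reduced, illustrative copy of the hans dictionary
hans = {"first_name": "Hans", "name": "Gruber", "age": 42, "height": 1.85}

# sorted() returns the keys as a list in alphabetical order
ordered_keys = sorted(hans.keys())

for one_key in ordered_keys:
    print(one_key, ": ", hans[one_key])
```

This way the loop visits the keys in the same order whatever the way the dictionary was built.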

Tying it all together

Watch how you can link all the concepts in a simple exercise and then go on and solve the exercises for this session.

Here is the code used in this video. You need a copy of the master_2017.json file in the same folder as this notebook if you want to execute it.

This imports the tweets into a list of dictionaries (which themselves contain lists and dictionaries under certain keys):

import json

with open("master_2017.json", 'r') as tweets_repo:
    content = tweets_repo.read()

trump_tweets = json.loads(content)
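
If it helps to see what json.loads does on a smaller scale, here is a sketch that uses a tiny JSON string instead of master_2017.json. The content of the string is invented for the illustration and only mimics the shape of a tweet.

```python
import json

# A made-up JSON document: a list containing a single object
content = '[{"full_text": "Hello world", "entities": {"hashtags": []}, "retweet_count": 3}]'

tweets = json.loads(content)

print(type(tweets))                       # the JSON array became a Python list
print(type(tweets[0]))                    # the JSON object became a dictionary
print(tweets[0]["full_text"])             # values are reached with the usual syntax
print(tweets[0]["entities"]["hashtags"])  # nested structures are converted too
```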

This pretty-prints the first 3 dictionaries representing tweets:

from pprint import pprint

pprint(trump_tweets[:3])
[{'contributors': None,
  'coordinates': None,
  'created_at': 'Mon Jan 01 13:37:52 +0000 2018',
  'display_text_range': [0, 119],
  'entities': {'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []},
  'favorite_count': 51473,
  'favorited': False,
  'full_text': 'Will be leaving Florida for Washington (D.C.) today at 4:00 '
               'P.M. Much work to be done, but it will be a great New Year!',
  'geo': None,
  'id': 947824196909961216,
  'id_str': '947824196909961216',
  'in_reply_to_screen_name': None,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'is_quote_status': False,
  'lang': 'en',
  'place': None,
  'retweet_count': 8237,
  'retweeted': False,
  'source': '<a href="http://twitter.com/download/iphone" '
            'rel="nofollow">Twitter for iPhone</a>',
  'truncated': False,
  'user': {'contributors_enabled': False,
           'created_at': 'Wed Mar 18 13:46:38 +0000 2009',
           'default_profile': False,
           'default_profile_image': False,
           'description': '45th President of the United States of America🇺🇸',
           'entities': {'description': {'urls': []},
                        'url': {'urls': [{'display_url': 'Instagram.com/realDonaldTrump',
                                          'expanded_url': 'http://www.Instagram.com/realDonaldTrump',
                                          'indices': [0, 23],
                                          'url': 'https://t.co/OMxB0x7xC5'}]}},
           'favourites_count': 24,
           'follow_request_sent': False,
           'followers_count': 45551365,
           'following': True,
           'friends_count': 45,
           'geo_enabled': True,
           'has_extended_profile': False,
           'id': 25073877,
           'id_str': '25073877',
           'is_translation_enabled': True,
           'is_translator': False,
           'lang': 'en',
           'listed_count': 81603,
           'location': 'Washington, DC',
           'name': 'Donald J. Trump',
           'notifications': True,
           'profile_background_color': '6D5C18',
           'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
           'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
           'profile_background_tile': True,
           'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1514347856',
           'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
           'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
           'profile_link_color': '1B95E0',
           'profile_sidebar_border_color': 'BDDCAD',
           'profile_sidebar_fill_color': 'C5CEC0',
           'profile_text_color': '333333',
           'profile_use_background_image': True,
           'protected': False,
           'screen_name': 'realDonaldTrump',
           'statuses_count': 36673,
           'time_zone': 'Eastern Time (US & Canada)',
           'translator_type': 'regular',
           'url': 'https://t.co/OMxB0x7xC5',
           'utc_offset': -18000,
           'verified': True}},
 {'contributors': None,
  'coordinates': None,
  'created_at': 'Mon Jan 01 12:44:40 +0000 2018',
  'display_text_range': [0, 283],
  'entities': {'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []},
  'favorite_count': 53557,
  'favorited': False,
  'full_text': 'Iran is failing at every level despite the terrible deal made '
               'with them by the Obama Administration. The great Iranian '
               'people have been repressed for many years. They are hungry for '
               'food &amp; for freedom. Along with human rights, the wealth of '
               'Iran is being looted. TIME FOR CHANGE!',
  'geo': None,
  'id': 947810806430826496,
  'id_str': '947810806430826496',
  'in_reply_to_screen_name': 'realDonaldTrump',
  'in_reply_to_status_id': 947544600918372353,
  'in_reply_to_status_id_str': '947544600918372353',
  'in_reply_to_user_id': 25073877,
  'in_reply_to_user_id_str': '25073877',
  'is_quote_status': False,
  'lang': 'en',
  'place': None,
  'retweet_count': 14595,
  'retweeted': False,
  'source': '<a href="http://twitter.com/download/iphone" '
            'rel="nofollow">Twitter for iPhone</a>',
  'truncated': False,
  'user': {'contributors_enabled': False,
           'created_at': 'Wed Mar 18 13:46:38 +0000 2009',
           'default_profile': False,
           'default_profile_image': False,
           'description': '45th President of the United States of America🇺🇸',
           'entities': {'description': {'urls': []},
                        'url': {'urls': [{'display_url': 'Instagram.com/realDonaldTrump',
                                          'expanded_url': 'http://www.Instagram.com/realDonaldTrump',
                                          'indices': [0, 23],
                                          'url': 'https://t.co/OMxB0x7xC5'}]}},
           'favourites_count': 24,
           'follow_request_sent': False,
           'followers_count': 45551365,
           'following': True,
           'friends_count': 45,
           'geo_enabled': True,
           'has_extended_profile': False,
           'id': 25073877,
           'id_str': '25073877',
           'is_translation_enabled': True,
           'is_translator': False,
           'lang': 'en',
           'listed_count': 81603,
           'location': 'Washington, DC',
           'name': 'Donald J. Trump',
           'notifications': True,
           'profile_background_color': '6D5C18',
           'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
           'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
           'profile_background_tile': True,
           'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1514347856',
           'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
           'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
           'profile_link_color': '1B95E0',
           'profile_sidebar_border_color': 'BDDCAD',
           'profile_sidebar_fill_color': 'C5CEC0',
           'profile_text_color': '333333',
           'profile_use_background_image': True,
           'protected': False,
           'screen_name': 'realDonaldTrump',
           'statuses_count': 36673,
           'time_zone': 'Eastern Time (US & Canada)',
           'translator_type': 'regular',
           'url': 'https://t.co/OMxB0x7xC5',
           'utc_offset': -18000,
           'verified': True}},
 {'contributors': None,
  'coordinates': None,
  'created_at': 'Mon Jan 01 12:12:00 +0000 2018',
  'display_text_range': [0, 284],
  'entities': {'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []},
  'favorite_count': 138808,
  'favorited': False,
  'full_text': 'The United States has foolishly given Pakistan more than 33 '
               'billion dollars in aid over the last 15 years, and they have '
               'given us nothing but lies &amp; deceit, thinking of our '
               'leaders as fools. They give safe haven to the terrorists we '
               'hunt in Afghanistan, with little help. No more!',
  'geo': None,
  'id': 947802588174577664,
  'id_str': '947802588174577664',
  'in_reply_to_screen_name': None,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'is_quote_status': False,
  'lang': 'en',
  'place': None,
  'retweet_count': 49566,
  'retweeted': False,
  'source': '<a href="http://twitter.com/download/iphone" '
            'rel="nofollow">Twitter for iPhone</a>',
  'truncated': False,
  'user': {'contributors_enabled': False,
           'created_at': 'Wed Mar 18 13:46:38 +0000 2009',
           'default_profile': False,
           'default_profile_image': False,
           'description': '45th President of the United States of America🇺🇸',
           'entities': {'description': {'urls': []},
                        'url': {'urls': [{'display_url': 'Instagram.com/realDonaldTrump',
                                          'expanded_url': 'http://www.Instagram.com/realDonaldTrump',
                                          'indices': [0, 23],
                                          'url': 'https://t.co/OMxB0x7xC5'}]}},
           'favourites_count': 24,
           'follow_request_sent': False,
           'followers_count': 45551365,
           'following': True,
           'friends_count': 45,
           'geo_enabled': True,
           'has_extended_profile': False,
           'id': 25073877,
           'id_str': '25073877',
           'is_translation_enabled': True,
           'is_translator': False,
           'lang': 'en',
           'listed_count': 81603,
           'location': 'Washington, DC',
           'name': 'Donald J. Trump',
           'notifications': True,
           'profile_background_color': '6D5C18',
           'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
           'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
           'profile_background_tile': True,
           'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1514347856',
           'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
           'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
           'profile_link_color': '1B95E0',
           'profile_sidebar_border_color': 'BDDCAD',
           'profile_sidebar_fill_color': 'C5CEC0',
           'profile_text_color': '333333',
           'profile_use_background_image': True,
           'protected': False,
           'screen_name': 'realDonaldTrump',
           'statuses_count': 36673,
           'time_zone': 'Eastern Time (US & Canada)',
           'translator_type': 'regular',
           'url': 'https://t.co/OMxB0x7xC5',
           'utc_offset': -18000,
           'verified': True}}]

This shows the number of tweets contained in the list:

len(trump_tweets)
2605

This counts the number of times the name “Obama” is mentioned in a tweet during that year:

obama_counter = 0

for tweet in trump_tweets:
    if "full_text" in tweet.keys():
        words_in_the_tweet = tweet["full_text"].split()
        for word in words_in_the_tweet:
            if word == "Obama":
                obama_counter += 1
    elif "text" in tweet.keys():
        words_in_the_tweet = tweet["text"].split()
        for word in words_in_the_tweet:
            if word == "Obama":
                obama_counter += 1
    else:
        print("Couldn't treat this tweet")

And this prints the result:

print("Trump said 'Obama' ", obama_counter,
     " times in 2017 on Twitter!")
Trump said 'Obama'  50  times in 2017 on Twitter!
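
The counting loop above repeats the same code for the "full_text" and "text" keys and only counts words that are exactly equal to "Obama", so "Obama," or "Obama!" are missed. Here is one possible variant, given as a sketch rather than as the code from the video: the function count_word and the sample tweets are hypothetical, and it relies on the dictionary method .get() and on string.punctuation from the standard library.

```python
import string

def count_word(tweets, target):
    """Count the words equal to target, ignoring punctuation stuck to them."""
    counter = 0
    for tweet in tweets:
        # Use "full_text" when it exists, otherwise fall back on "text"
        text = tweet.get("full_text", tweet.get("text"))
        if text is None:
            print("Couldn't treat this tweet")
        else:
            for word in text.split():
                # Remove punctuation at both ends of the word before comparing
                if word.strip(string.punctuation) == target:
                    counter += 1
    return counter

# Made-up tweets, only there to exercise the three branches
sample_tweets = [
    {"full_text": "Obama, again. Thanks Obama!"},
    {"text": "Nothing to see here"},
    {"id": 3},
]

sample_count = count_word(sample_tweets, "Obama")
print(sample_count)
```

Note that this variant may return a higher count than the original loop on the real data, since it is more tolerant about punctuation.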

Check your understanding

This session was quite long. What about a little test to check if you have assimilated some of the concepts seen here? It is ungraded (the grade you get is only for you, it will not be taken into consideration for your class grade) and anonymous - I’ll get your answers, but they will not be tied to your name and I have no way of knowing who replied what (or even who took the test). You can take this test several times.

Exercises

I - Trees of Brussels™

You can cut us, you can trim us

But you’ll have to answer, too

Ohoh, the trees of Brussels

(Go ahead and solve this exercise on the tune of the Clash’s “Guns of Brixton” for maximum effect)

Trees are cool. OK, “cool” might be an overstatement… but I mean, they take out \(CO_2\) and provide fresh air. I guess what I’m really trying to say is that trees have different shapes and sizes, OK?

After this moderately convincing intro, here’s the situation: Brussels, for some reason, has a database of what they call “remarkable trees”, which IMHO is interesting. The code below retrieves this dataset, but it is formatted in a weird way in the variable remarkable_trees. Please reformat it as a dictionary that has, as its keys, the “espece” (French for species) and, as values, a list of all the circumferences of the remarkable trees of that species on its territory.

For instance, if there are two Abies concolor (Sapin du Colorado) and three Abies grandis (Sapin de Vancouver):

{"Abies concolor\rSapin du Colorado": [31, 54], "Abies grandis\rSapin de Vancouver": [51, 72, 66]}

Step 1 - Execute the cell below to retrieve all the remarkable trees

The code will then print them so you can check how they are organised.

import requests
from pprint import pprint

# retrieve the dataset
# documentation: https://opendata.brussels.be/explore/dataset/remarkable-trees/information/

url = "https://opendata.brussels.be/api/records/1.0/search/?dataset=remarkable-trees&rows=3000"
remarkable_trees = requests.get(url).json()

# We produce a small-sized preview so that you can have an idea of what the remarkable_trees
# dictionary looks like but for the exercise, work with the full remarkable_trees, please.
preview = remarkable_trees.copy()
preview["records"] = preview["records"][:3]

pprint(preview)
{'nhits': 1047,
 'parameters': {'dataset': 'remarkable-trees',
                'format': 'json',
                'rows': 3000,
                'timezone': 'UTC'},
 'records': [{'datasetid': 'remarkable-trees',
              'fields': {'circ': 217,
                         'commune': 'Bruxelles',
                         'diametre_de_la_couronne': 14,
                         'espece': 'Aesculus x carnea\r'
                                   'Marronnier à fleurs rouges',
                         'hauteur': 15,
                         'id': 977,
                         'plantation': 'Arbre isolé',
                         'position': 'Arbre visible de la voirie',
                         'rue': 'Avenue De Béjar 22',
                         'statut': "À l'inventaire scientifique"},
              'record_timestamp': '2014-08-13T20:36:16+00:00',
              'recordid': '607ffcaa4269665f4502d82a1bab3567115c0e69'},
             {'datasetid': 'remarkable-trees',
              'fields': {'circ': 191,
                         'commune': 'Bruxelles',
                         'diametre_de_la_couronne': 14,
                         'espece': 'Ailanthus altissima\rAilante glanduleux',
                         'hauteur': 11,
                         'id': 3440,
                         'plantation': 'Arbre isolé',
                         'position': 'Arbre visible de la voirie, occupant une '
                                     'position centrale dans le paysage.',
                         'rue': 'Rue du Marché aux Herbes 111',
                         'statut': "À l'inventaire scientifique"},
              'record_timestamp': '2014-08-13T20:36:16+00:00',
              'recordid': '8eb07528450e71d64227e47aa2758d61242f9a11'},
             {'datasetid': 'remarkable-trees',
              'fields': {'circ': 204,
                         'commune': 'Bruxelles',
                         'espece': "Cedrus atlantica\rCèdre de l'Atlas",
                         'id': 515,
                         'site': 'Bois de la Cambre',
                         'statut': "À l'inventaire scientifique"},
              'record_timestamp': '2014-08-13T20:36:16+00:00',
              'recordid': 'b94ed84e6c85ea75b0d484d56baaed2623caa100'}]}

Step 2 - Your code goes here

Produce the dictionary asked above, using the data contained in the variable remarkable_trees.

See solution
trees = remarkable_trees["records"]

circumference_by_species = {}

for tree in trees:
    species = tree["fields"]["espece"]
    circumference = tree["fields"]["circ"]
    if species in circumference_by_species.keys():
        circumference_by_species[species].append(circumference)
    else:
        circumference_by_species[species] = [circumference]
        
pprint(circumference_by_species)

Now that this is done, check which species of tree has the largest circumferences. You’re well on your way to becoming an amateur botanist!
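
One way to approach this last question is sketched below. It assumes that “largest” means the species holding the single biggest circumference (using the average would be another valid reading), and it works on the small example dictionary shown earlier rather than on the real data.

```python
# The small example dictionary from the statement of the exercise
circumference_by_species = {
    "Abies concolor\rSapin du Colorado": [31, 54],
    "Abies grandis\rSapin de Vancouver": [51, 72, 66],
}

largest_species = None
largest_circumference = 0

for species, circumferences in circumference_by_species.items():
    biggest = max(circumferences)  # biggest tree of this species
    if biggest > largest_circumference:
        largest_circumference = biggest
        largest_species = species

# The Latin and French names are separated by a "\r" character
print(largest_species.split("\r"), largest_circumference)
```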

II - Beyond Harry Potter

Some people still borrow books from public libraries and in 2013, the city of Brussels decided it would be nice to publish the 100 most borrowed French language books in each of those. In some libraries, there aren’t 100 most borrowed books, so the dataset is actually a little smaller but it’s still sizable. It contains 1691 books.

Say you want to stay up-to-date by reading the most-borrowed book in each of the libraries in the categories “Instructifs adultes”, “Fiction adultes” and “BD adultes” (those are comic books). Which books will you need to read?

Step 1 - Run this code to retrieve all the books in the dataset

import requests
from pprint import pprint

# retrieve the dataset
# documentation: https://opendata.brussels.be/explore/dataset/prets-dans-les-bibliotheques-francophones-en-2013/information/

url = "https://opendata.brussels.be/api/records/1.0/search/?dataset=bruxelles_top_100_livres_empruntes_par_bibilotheque&rows=3000" \
       + "&facet=documents&facet=type_document"
borrowed_books = requests.get(url).json()

# We produce a small-sized preview so that you can have an idea of what the borrowed_books
# dictionary looks like but for the exercise, work with the full borrowed_books, please.
preview = borrowed_books.copy()
preview["records"] = preview["records"][:3]

pprint(preview)
{'nhits': 1691,
 'parameters': {'dataset': 'bruxelles_top_100_livres_empruntes_par_bibilotheque',
                'facet': ['documents', 'type_document'],
                'format': 'json',
                'rows': 3000,
                'timezone': 'UTC'},
 'records': [{'datasetid': 'bruxelles_top_100_livres_empruntes_par_bibilotheque',
              'fields': {'auteur': 'Saint-Mars,  Dominique de',
                         'code_bibilotheque': 'Adolphe Max',
                         'nombre_d_emprunts': 12,
                         'rang': 38,
                         'titre': 'Lili veut être une star',
                         'type': 'Fiction jeunesse'},
              'record_timestamp': '2015-07-29T14:55:00+00:00',
              'recordid': '3a33cfbece2dbd03a0c5373d2985887a2917384a'},
             {'datasetid': 'bruxelles_top_100_livres_empruntes_par_bibilotheque',
              'fields': {'auteur': 'Lévy,  Didier',
                         'code_bibilotheque': 'Adolphe Max',
                         'nombre_d_emprunts': 11,
                         'rang': 95,
                         'titre': "La fée Coquillette et l'ours mal léché",
                         'type': 'Fiction jeunesse'},
              'record_timestamp': '2015-07-29T14:55:00+00:00',
              'recordid': 'f1ac29aadaaaf20962706f727c2c09cbfe0be217'},
             {'datasetid': 'bruxelles_top_100_livres_empruntes_par_bibilotheque',
              'fields': {'auteur': 'Saint-Mars,  Dominique de',
                         'code_bibilotheque': 'Bruxelles 1',
                         'nombre_d_emprunts': 24,
                         'rang': 4,
                         'titre': 'Lili trouve sa maîtresse méchante',
                         'type': 'Fiction jeunesse'},
              'record_timestamp': '2015-07-29T14:55:00+00:00',
              'recordid': 'c7b6bac91bb4077216a9f58632e59b73d7bbaa0a'}]}

Step 2 - Write your code

Write your code in the cell below and figure out which books were the most borrowed in 2013. I suggest you format your output as a list of tuples (tuples are lists of fixed size) of which the first element is the name of the book, the second the author and the third the number of libraries in which this book was the most borrowed.

Note that this exercise is significantly more difficult than the first. The most borrowed book in a category is not necessarily the top-ranked book of its library: it might be that the most borrowed book in this library was in fact a youth book. Break the problem down into several tasks and work little by little, checking at each step that the previous one works correctly. This problem is not a coding problem as much as a modelling problem. Start working on a sheet of paper before going to the code if it helps.

See solution
books = borrowed_books["records"]

best_books = {}

for book in books:
    name = book["fields"]["titre"]
    library = book["fields"]["code_bibilotheque"]
    rank = book["fields"]["rang"]
    type_book = book["fields"]["type"]
    
    if type_book in ["Instructifs adultes",
                    "Fiction adultes",
                    "BD adultes"]:
        if library in best_books.keys():
            if type_book in best_books[library]:
                if rank < best_books[library][type_book][1]:
                    best_books[library][type_book] = (name, rank)
            else:
                best_books[library][type_book] = (name, rank)
        else:
            best_books[library] = {type_book : (name, rank)}
    
pprint(best_books)    
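
The solution above stops at the best book per library and per category. To get closer to the output suggested in Step 2, one could then count in how many libraries each title comes first. The sketch below does so on a hypothetical best_books dictionary (the library names and titles are placeholders, not real data) and leaves the author out, since the solution does not store it.

```python
# Hypothetical content, with the same shape as the best_books dictionary above
best_books = {
    "Library A": {"Fiction adultes": ("Title 1", 3), "BD adultes": ("Title 2", 10)},
    "Library B": {"Fiction adultes": ("Title 1", 1)},
}

libraries_by_title = {}

for library in best_books.keys():
    for type_book in best_books[library].keys():
        title = best_books[library][type_book][0]  # first element of the tuple
        if title in libraries_by_title.keys():
            libraries_by_title[title] += 1
        else:
            libraries_by_title[title] = 1

# .items() yields (title, count) pairs, which we store as a list of tuples
most_borrowed = list(libraries_by_title.items())
print(most_borrowed)
```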

Now is the perfect time to go enjoy a book or something.

III - The curious case of the hidden meme

Images you display on a screen are just streams of bits. As such, they are sequences of 1s and 0s. At a higher level, if you save your images in PNG (Portable Network Graphics), they are also coordinates of pixels in the color space. This means that each pixel is represented by 4 numbers between 0 and 255: the first three are the Red, Green and Blue components, frequently referred to as the RGB coordinates. The last one is the Alpha component, representing the transparency of the pixel, but we will not work with it. Mixing the RGB components makes it possible to represent any color a screen can display (discretized, however, as the color is coded on 24 bits: 8 bits for each component of the color).
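
As a quick illustration of this representation, here is what a single pixel could look like in Python. The values are chosen for the example and the variable names are hypothetical.

```python
# One pixel as a tuple: (Red, Green, Blue, Alpha)
pixel = (61, 61, 74, 255)

red = pixel[0]
green = pixel[1]
blue = pixel[2]
alpha = pixel[3]  # 255 means fully opaque

# Each RGB component can take 256 values (0 to 255), i.e. 8 bits
number_of_bits = 3 * 8
number_of_colors = 256 ** 3

print(number_of_bits, number_of_colors)
```

The number of colors that can be coded on 24 bits is therefore a little under 17 million.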

Now, the human eye is fairly insensitive to small variations in the values in the RGB components. Increasing or decreasing the value of the components of the pixel by one does not change the colour that we perceive. We can use this property to hide pictures, text or sound inside pictures (or sounds). In this case, I hid a picture inside one of those two pictures:

xzibit

xzibit2

Turns out, there is a secret meme in the first of these images while the second one is just the Xzibit meme… I, for one, cannot really spot any difference but let’s retrieve the hidden meme.
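
Before decoding, it may help to see how hiding could work in the first place. The sketch below is an assumption about the general technique, not the code that was used to build the picture: it relies on the even/odd convention described in Step 2 below, and the function hide_byte and the carrier values are hypothetical.

```python
def hide_byte(carrier_values, secret_value):
    """Hide one value (0 to 255) in the parity of 8 carrier values."""
    bits = format(secret_value, "08b")  # e.g. 209 gives "11010001"
    hidden = []
    for value, bit in zip(carrier_values, bits):
        if value % 2 != int(bit):
            # Move the value by one to flip its parity, staying within 0-255
            if value == 255:
                value -= 1
            else:
                value += 1
        hidden.append(value)
    return hidden

# Eight made-up RGB components in which we hide the value 209
carrier = [60, 61, 74, 54, 56, 68, 54, 54]
stego = hide_byte(carrier, 209)
print(stego)
```

No component moves by more than one, which is why the picture looks unchanged to the eye.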

Step 0: acquire and install the Pillow library

Use pip or conda to install the library called Pillow. This is required to run the following cell (if you don’t get an error message after running it, it is correctly installed):

from PIL import Image

import os.path
if not os.path.isfile("unsuspicious_meme.png"): # If you don't have the file yet...
    import urllib.request 
    urllib.request.urlretrieve("http://homepages.ulb.ac.be/~choffreu/resources/unsuspicious_meme.png", "unsuspicious_meme.png") # download it

# Open and load the image
im = Image.open("unsuspicious_meme.png")

Step 1: Convert the image in a sequence of RGB values

That’s where the real work starts! The scaffolding code in the following cell extracts the information about each pixel into a list containing tuples (those are sort of immutable lists delimited by parentheses () rather than brackets []). Your job is to convert this into a sequence of values. Visually:

pixels = [(\(R_1\), \(G_1\), \(B_1\), \(A_1\)), (\(R_2\), \(G_2\), \(B_2\), \(A_2\)), (\(R_3\), \(G_3\), \(B_3\), \(A_3\)), (\(R_4\), \(G_4\), \(B_4\), \(A_4\)), (\(R_5\), \(G_5\), \(B_5\), \(A_5\)), (\(R_6\), \(G_6\), \(B_6\), \(A_6\)), (\(R_7\), \(G_7\), \(B_7\), \(A_7\)), (\(R_8\), \(G_8\), \(B_8\), \(A_8\))]

needs to be converted into:

pixels_in_a_row = [\(R_1\), \(G_1\), \(B_1\), \(R_2\), \(G_2\), \(B_2\), \(R_3\), \(G_3\), \(B_3\), \(R_4\), \(G_4\), \(B_4\), \(R_5\), \(G_5\), \(B_5\), \(R_6\), \(G_6\), \(B_6\), \(R_7\), \(G_7\), \(B_7\), \(R_8\), \(G_8\), \(B_8\)]

Note that we did not keep the Alpha component as, in this picture, there is no transparency. The first 8 values of your variable pixels_in_a_row should be 61, 61, 74, 55, 56, 68, 54, 55 and the length of this list should be 540000.

See solution
# Convert the image to a list of tuples representing pixels
pixels = list(im.getdata())

### Your code goes here below ###
pixels_in_a_row = []

for pixel in pixels:
    for position in range(3):
        pixels_in_a_row.append(pixel[position])
        
pixels_in_a_row

Step 2: Convert this image into a sequence of 1 and 0

We now convert the sequence of values we got into a sequence of 1s and 0s. This can be done in several ways, depending on the conventions you adopt with the person who will receive your hidden meme, but in this case we went for the obvious one: even or odd. Even numbers (numbers which, divided by two, produce an integer) are coded as 0 and odd numbers (the ones which, divided by two, produce a float with .5 as the decimal part) are coded as 1. For example:

pixels_in_a_row = [61, 61, 74, 55, 56, 68, 54, 55]

Will be converted to:

binary_coding = [1, 1, 0, 1, 0, 0, 0, 1]

These should be the first 8 values of your binary coding and it should be 540000 values long.

See solution
### Your code goes here below ###
binary_coding = [value%2 for value in pixels_in_a_row]
    
binary_coding
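
The solution above uses a compact notation called a list comprehension. If this notation is new to you, the sketch below shows the same computation written as an ordinary for loop, applied to the 8 example values given earlier. The % operator returns the remainder of the division, which is 0 for even numbers and 1 for odd ones.

```python
pixels_in_a_row = [61, 61, 74, 55, 56, 68, 54, 55]

binary_coding = []

for value in pixels_in_a_row:
    # value % 2 is the remainder of the division by 2
    binary_coding.append(value % 2)

print(binary_coding)
```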

Step 3: Convert these as a sequence of integers coded on 8bits

We are nearly there! Now, we want to join the 1’s and 0’s in small bundles of 8 consecutive characters (important hint: to do so, it is actually easier to treat the numbers as strings of characters and concatenate them!) and convert those into decimal values that we put in a list.

As such:

binary_coding = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0]

becomes:

strings_of_numbers = ['11010001', '10111010', '01011100']

which, once transformed from binary to decimal becomes:

rgb_as_list = [209, 186, 92]

and the length of rgb_as_list should be 67500.

See solution
### Your code goes here below ###
strings_of_numbers = []
acc_string = ''
counter = 0

for element in binary_coding:
    acc_string = acc_string + str(element)
    counter += 1
    if counter == 8:
        strings_of_numbers.append(acc_string)
        acc_string = ''
        counter = 0

rgb_as_list = [ int(number, 2) for number in strings_of_numbers]

rgb_as_list
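
The counter in the solution is one way to cut the list into bundles of 8. Another possibility, sketched here on the example values given above, is to let range jump from 8 to 8 (its optional third argument is the step) and to slice the list. This sketch assumes that the length of the list is a multiple of 8, otherwise the last, shorter bundle would be converted as well.

```python
binary_coding = [1, 1, 0, 1, 0, 0, 0, 1,
                 1, 0, 1, 1, 1, 0, 1, 0,
                 0, 1, 0, 1, 1, 1, 0, 0]

rgb_as_list = []

# start takes the values 0, 8, 16, ...
for start in range(0, len(binary_coding), 8):
    bundle = binary_coding[start:start + 8]  # 8 consecutive bits
    acc_string = ""
    for bit in bundle:
        acc_string = acc_string + str(bit)
    # int(..., 2) reads the string as a number written in base 2
    rgb_as_list.append(int(acc_string, 2))

print(rgb_as_list)
```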

Step 4: Convert back to RGBA coordinates

We now need to produce a list of tuples containing the value of each pixel in the RGBA format. This means that we need to copy the values in the rgb_as_list variable into a sequence of 4-element tuples, with the first 3 values coming from the list of values you discovered and the last one set to 255 (full opacity). Technically:

rgb_as_list = [209, 186, 92, 209, 186, 93, 208, 184, 94]

becomes:

the_hidden_meme = [(209, 186, 92, 255), (209, 186, 93, 255), (208, 184, 94, 255)]

See solution
### Your code goes here below ###
the_hidden_meme = []
acc_list = []
counter = 0

for rgb_comp in rgb_as_list:
    acc_list.append(rgb_comp)
    counter += 1
    if counter == 3:
        tuple_pixel = (acc_list[0],
                      acc_list[1],
                      acc_list[2],
                      255)
        the_hidden_meme.append(tuple_pixel)
        acc_list = []
        counter = 0
        
the_hidden_meme

Step 5: Discover… the HIDDEN MEME

You have moved quite a lot of data around so far. Think about the fact that you had your computer parse, analyse, transform and store up to 540,000 values in a matter of seconds. A few decades ago, cryptanalysts would have spent thousands of human-hours of hard labor (man-hours would be particularly wrong here: most of the people who did heavy computations in the 20th century, and many pioneers of computing machines, were women… more on that later) to discover the secret we do not want them to know: The almighty Hidden Meme™.

It’s now up to you to behold it, if you dare. I wrote a helper function and code so that, if you have the list of tuples in a variable named the_hidden_meme, you just have to execute the cell and gaze into the meme’s glory. Respect it, it cost me dearly (I had to download Comic Sans to make it, I still feel terrible about it).

def image_from_list(pixels_list, size, mode = "RGBA"):
    img = Image.new(mode, size)
    img.putdata(pixels_list)
    return img

image_from_list(the_hidden_meme, [150, 150])

Note that you can hide another image inside this meme too, and so on. There is, however, a size constraint: since we need 8 bits to represent 1 value in the RGB space, the number of pixels is divided by 8 at each layer of steganography. However, if you start with one full HD picture, you can store a few images before ending up with a very small image.

Also, we don’t know each other so well yet: I might just be bullshitting you and loading the hidden meme from somewhere with some cleverly obfuscated code (although, to be fair, it is not so hard to verify what my code does here). Verify that the information was actually contained in the image and nowhere else by replacing all the instances of unsuspicious_meme.png - the first of the two pictures above - by xzibit_meme_stegano_scaled.png - the second one - and rerunning all the steps. The result will certainly be much more abstract…
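
The size constraint can be checked with a few lines of arithmetic. The sketch below assumes a full HD picture of 1920 by 1080 pixels and computes how many pixels are left after each layer of hiding. The variable names are illustrative.

```python
# Number of pixels in a full HD picture
pixels = 1920 * 1080

sizes = [pixels]

# Each layer of steganography divides the number of pixels by 8
for layer in range(4):
    pixels = pixels // 8
    sizes.append(pixels)

print(sizes)
```

After four layers, only about 500 pixels are left, which is roughly a 22 by 22 picture.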

Where to go now?

Either the chapter on Open-Source and Modularity to learn how to use functions others wrote for you to your advantage or the one on SQL to delve into the world of data representation.