Python is a powerful and widely used programming language that debuted in 1991. While Python 3 is the current standard, you might encounter Python 2 code in older resources. The language offers flexibility in how you interact with it - you can choose from various interfaces and run multiple independent instances (called kernels) simultaneously on your computer.
One important detail to remember: Python is case sensitive. For example, the built-in function type (lowercase) tells you what kind of object you’re working with, but writing it as Type will trigger an error (a NameError, unless you have defined something with that name) because Python treats uppercase and lowercase letters as distinct characters.
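A minimal sketch of this behaviour follows. The try/except wrapper is only there so that the example runs to the end; in a notebook you would simply see the error message.

```python
# The built-in function is spelled in lowercase.
kind = type(2)
print(kind)  # <class 'int'>

# The capitalized spelling is a different name, which nobody has defined.
try:
    Type(2)
except NameError as err:
    message = str(err)
    print("NameError:", message)
```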
There are many distributions of Python, but Anaconda (anaconda.com) is the preferred choice in the data science community. Anaconda comes pre-packaged with most resources needed for this course. When installing Anaconda, select Python 3, and you’ll get access to the Anaconda Navigator - a graphical desktop interface for launching various Python environments. For those who prefer command-line tools, you can bypass the Navigator using a command-line interface (CLI): Terminal on Mac or the Anaconda prompt on Windows.
Historically, developers favored the Jupyter Qt console or IPython (Interactive Python). IPython enhanced the basic Python shell with additional features, while the Qt console added a graphical interface with menus and mouse support using the Qt toolkit. IPython introduced magic commands, denoted by %cmd (like %cd for changing directories). While these commands aren’t part of Python itself, they’re still referenced in many tutorials. For a complete list of magic commands, use %quickref. Though you can omit the % prefix, we’ll keep it to distinguish magic commands from standard Python code.
Today’s data scientists primarily use IPython’s successor: Jupyter. The Jupyter ecosystem introduces the notebook paradigm, allowing seamless integration of code, output, and documentation. Jupyter Notebook leads this space, with Apache Zeppelin as a notable alternative. All of these platforms support multiple programming languages beyond Python, including R. These course materials were developed using a Python-based Jupyter notebook. The page that you are reading right now is a rendered notebook.
Anaconda also includes two full-featured Integrated Development Environments (IDEs): Spyder and Visual Studio Code. These provide comprehensive development tools like consoles, text editors, debuggers, and variable inspectors. While we won’t cover these advanced features, users familiar with IDEs (like RStudio for R) may prefer these environments. Feel free to use any tool you’re comfortable with, though our instructions will assume you’re using Jupyter Notebook, Jupyter Lab or Visual Studio Code - the concepts generally transfer to other interfaces with minimal adaptation.
In case you are not able to install or run Anaconda on your own computer, it is recommended that you create a Google account and use Google Colab instead (though it has some limitations).
Code is typed into code cells. When you enter a cell and press Shift+Return, the code runs and the output appears underneath. Python returns either an output (preceded by a label such as [1]:), a (typically long and difficult) error message, or no answer at all. A supersimple example is the calculation 2 + 2.
So, if you enter 2 + 2, the output will be the result of this calculation. But, when you want to store this result for later use (in the same session), you enter it with a name, as in a = 2 + 2.
This creates the variable a. Note that the value of 2 + 2 is not displayed now. But you can call it by entering the name a on its own.
In Python, when you assign a value to a variable that has already been created, the previous assignment is forgotten. So:
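The steps above can be sketched together as follows. The second assignment (7 - 2) and the helper name first are illustrative choices, and print is used so that the values are shown even outside a notebook.

```python
a = 2 + 2       # the result is stored under the name a; nothing is displayed
first = a       # keep a copy of the first value for comparison
print(first)    # 4

a = 7 - 2       # a new assignment to the same name (illustrative value)
print(a)        # 5: the previous value of a is forgotten
```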
If you copy and paste into the console code chunks that you have edited in a text editor, you can input several code lines at once. In that case, you will only get the output for the last line. If the cursor is not at the end of the last line, you have to press Shift+Return to get the output. Here is a simple example:
Note. You would probably have written b^2 for the square of 2, but the caret symbol (^) plays a different role in Python, where powers are written with a double asterisk (**).
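A short sketch of a multi-line cell, assuming b holds the value 2 (the names square and xor are only for illustration). It also shows what the caret actually does: it is the bitwise XOR operator.

```python
b = 2
b + 1            # in a notebook, this value is not displayed because more lines follow
square = b**2    # ** is the power operator
xor = b ^ 2      # ^ is bitwise XOR, so 2 ^ 2 gives 0, not 4
print(square, xor)   # 4 0
```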
When you start a new kernel, only a core set of Python resources is initially available. You’ll need to import additional packages for most data science tasks, as these core resources are intentionally kept minimal. These extra resources come in packages, which themselves are split into different modules. Python can be extended by more than 200,000 packages. Some large packages, like scikit-learn, are collections of modules called subpackages.
For example, if you need to calculate the square root of 2, you would import the math module, which includes various mathematical functions. Once imported, all its functions become available. The notation math.sqrt indicates that sqrt is a function from the math module.
Alternatively, you can import only the functions that you plan to use:
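Both import styles can be sketched as follows, using the square root of 2 mentioned above (the names root and same_root are only for illustration).

```python
import math                 # import the whole module

root = math.sqrt(2)         # qualified name: module.function
print(root)

from math import sqrt       # import only the function you plan to use

same_root = sqrt(2)         # no prefix is needed now
print(same_root)
```

The first style makes the origin of each function explicit, which helps when several packages offer functions with the same name.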
The following example generates a random number from a normal distribution by calling a function in the random module of the NumPy package.
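A minimal sketch, assuming NumPy is installed (it ships with Anaconda). Called without arguments, the function draws a single value from the standard normal distribution; the result changes on every run.

```python
import numpy as np

# Draw one value from a normal distribution with mean 0 and standard deviation 1.
x = np.random.normal()
print(x)
```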
Packages must be installed on your computer before they can be imported, and these imports only last for the current kernel session. The math module, being part of the Python Standard Library, is always available by default.
If you use the Anaconda distribution, most packages used in this course are already available and can be directly imported. Any other package has to be installed first (only once). You have two installation options:
conda: the package manager that comes with Anaconda
pip (pypi.org/project/pip): the standard Python package installer, though it may encounter version conflicts between related packages
Both pip and conda can be run directly from Jupyter notebooks, which we’ll explore later as the need comes up.
This course approaches Python from a data science perspective, so we focus on its four core packages:
NumPy (2006, building on the Numeric library of 1995): Provides support for large vectors and matrices (called arrays) - the foundation for scientific computing in Python
Matplotlib: Built on NumPy, this is Python’s primary visualization library
Pandas: A powerful data management library inspired by R, built on top of NumPy and Matplotlib. Its introduction has significantly increased Python’s popularity in data science, making Python/Pandas a common alternative to R
scikit-learn (2008): A comprehensive machine learning library that works with both NumPy arrays and Pandas objects (though it always returns NumPy arrays)
The data types in Python are similar to those of other languages. The type can be learned with the function type. The main data types are:
First, we have integer numbers (type int). There are subdivisions of this basic type, such as int64, but you don’t need to know about that to start your Python trip.
We can also have floating-point numbers (type float), that is, numbers with decimals. We also have subdivisions here, such as float64.
Under type bool, Python admits Boolean values, which are either True or False. In Python, Boolean variables are converted to type int or float by applying a mathematical operator.
Besides numbers, we can also manage strings, with type str. Strings come in Python with many methods attached. They will be discussed, in the Pandas context, in a specific chapter.
Python also has type datetime for dealing with dates and times. This will also be discussed later in a specific chapter.
As in other languages, data can have different data types in Python. The data type can be learned with the function type. Let me start with the numeric types. For the variable a defined above:
So, a has type int (meaning integer). Another numeric type is that of floating-point numbers (float), which have decimals:
There are subdivisions of these two basic types (such as int64), but I skip them in this brief tutorial. Note that, in Python, integers are not, as in the mathematics textbook, a subset of the real numbers, but a different type:
In the above square root calculation, b got type float because this is what the math function sqrt returns. The functions int and float can be used to convert numbers from one type to another type (sometimes at a loss):
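The numeric types discussed above can be sketched as follows. The values assigned to a and b are assumptions that mirror the earlier examples, and the names as_float and as_int are only for illustration.

```python
import math

a = 2 + 2
print(type(a))             # <class 'int'>

b = math.sqrt(2)
print(type(b))             # <class 'float'>

# An integer and a float with the same value still have different types.
print(type(2), type(2.0))

# Conversions between the two types
as_float = float(2)        # 2.0
as_int = int(2.7)          # 2: the decimal part is dropped, not rounded
print(as_float, as_int)
```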
We also have Boolean (bool) variables, whose value is either True or False:
Even if they don’t appear explicitly, Booleans are always under the hood. When you enter an expression involving a comparison such as 5 < a, the Python interpreter evaluates it, returning either True or False.
Here, I have defined a variable by means of such an expression, so I got a Boolean variable. Warning: as a comparison operator, equality is denoted by two equal signs. This may surprise you.
Boolean variables can be converted to int and float type with the functions mentioned above, but also by applying a mathematical operator:
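A short sketch of these points, assuming a holds the value 4 (the names d and total are only for illustration).

```python
a = 4                            # assumed value for this sketch
d = 5 < a                        # the comparison is evaluated first, then stored
print(d, type(d))                # False <class 'bool'>

print(a == 4)                    # equality test with two equal signs: True

# Conversion to numbers, explicitly and through arithmetic
print(int(True), float(False))   # 1 0.0
total = True + 1                 # True counts as 1 in arithmetic
print(total)                     # 2
```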
Besides numbers, we can also manage strings with type str:
The quote marks indicate type str. You can use single or double quotes, but take care to use the same on both sides of the string. Strings come in Python with many methods attached. These methods will be discussed later in this course.
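A minimal sketch with an illustrative string (the value 'Mary' and the names c and upper are assumptions). The method upper is one example of the many string methods.

```python
c = 'Mary'               # single quotes
print(type(c))           # <class 'str'>
print(c == "Mary")       # double quotes define the same string: True

upper = c.upper()        # a string method returns a new string
print(upper)             # MARY
```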
Python has several types for objects that work as data containers. The most versatile is the list, which is represented as a sequence of comma-separated values inside square brackets.
Lists can contain items of different type. A simple example of a list, of length 4, follows.
Lists can be concatenated in a very simple way in Python:
Now, the length of newlist is 6:
The first item of mylist can be extracted as mylist[0], the second item as mylist[1], etc. The last item can be extracted either as mylist[3] or as mylist[-1]. Sublists can be extracted by using a colon inside the brackets, as in:
Note that 0:2 includes 0 but not 2. This is a general rule for indexing in Python. Other examples:
The items of a list are ordered, and can be repeated. This is not so in other data containers.
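The list operations described above can be sketched as follows. The four names in mylist and the two numbers appended to it are illustrative, chosen only to match the lengths mentioned in the text.

```python
mylist = ['Mary', 'John', 'Gerald', 'Yennefer']   # illustrative list of length 4
newlist = mylist + [2, 3]                         # concatenation; items may differ in type
print(len(newlist))                               # 6

print(mylist[0])                # first item: Mary
print(mylist[3], mylist[-1])    # two ways to get the last item: Yennefer Yennefer

print(mylist[0:2])              # ['Mary', 'John']: position 2 is excluded
print(mylist[2:])               # ['Gerald', 'Yennefer']: from position 2 to the end
print(mylist[:3])               # ['Mary', 'John', 'Gerald']: the first three items
```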
A set is represented in the same way as a list, but with curly braces replacing the square brackets:
A difference between the list and the set is that the elements of a set are not ordered, and repetition is ignored.
A tuple is like a list, represented with parentheses instead of square brackets:
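A small sketch of both containers, with illustrative values. The repeated name is dropped by the set but kept by the tuple.

```python
myset = {'Mary', 'John', 'Mary'}
print(myset)                 # two elements; the display order is not guaranteed
print(len(myset))            # 2

mytuple = ('Mary', 'John', 'Mary')
print(mytuple[0])            # tuples are ordered, so indexing works: Mary
print(len(mytuple))          # 3
```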
A range is a sequence of integers which in many aspects works as a list, but the terms of the sequence are not saved as in a list. Instead, only the procedure to create the sequence is saved. The syntax is range(start, end, step). Example:
Note that the items from a range cannot be printed directly. So, I have converted the range to a list here with the function list. If the step is omitted, it is assumed to be 1:
If the start is also omitted, it is assumed to be 0:
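The three forms of range can be sketched as follows (the particular start, end and step values are only illustrative). As with list slices, the end value is excluded.

```python
r = range(0, 10, 2)
print(r)                     # range(0, 10, 2): only the recipe is stored
evens = list(r)
print(evens)                 # [0, 2, 4, 6, 8]

no_step = list(range(5, 10))
print(no_step)               # [5, 6, 7, 8, 9]

no_start = list(range(5))
print(no_start)              # [0, 1, 2, 3, 4]
```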
A dictionary is a set of key/value pairs. For instance, the following dictionary contains three features of an individual:
The keys can be listed:
In the dictionary, a value is not extracted using an index which indicates its order in a sequence, as in the list, but using the corresponding key:
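A minimal sketch of a dictionary with three features (the keys and values are made up for illustration).

```python
person = {'name': 'Mary', 'age': 31, 'city': 'Barcelona'}   # illustrative data

print(person.keys())          # dict_keys(['name', 'age', 'city'])
keys = list(person.keys())    # the keys as an ordinary list
print(keys)

age = person['age']           # a value is extracted with its key, not a position
print(age)                    # 31
```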
Example
Note: if you want to run this example, you will need to install the dload library.
Dictionaries are important because a lot of real-life data comes in the form of dictionaries. Let’s say you want to get the latest information on Bitcoin. You may want to call the CoinGecko API (Application Programming Interface) to obtain that data.
Typically, these APIs are essentially just websites that you visit. In this case:
https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids=bitcoin&order=market_cap_desc&per_page=100&page=1&sparkline=false&locale=en
import dload
api_url = 'https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids=bitcoin&order=market_cap_desc&per_page=100&page=1&sparkline=false&locale=en'
data = dload.json(api_url)[0]
data
{'id': 'bitcoin',
'symbol': 'btc',
'name': 'Bitcoin',
'image': 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png?1547033579',
'current_price': 29016,
'market_cap': 563323025278,
'market_cap_rank': 1,
'fully_diluted_valuation': 610914958763,
'total_volume': 21563774846,
'high_24h': 29434,
'low_24h': 28191,
'price_change_24h': 766.32,
'price_change_percentage_24h': 2.71267,
'market_cap_change_24h': 16194739000,
'market_cap_change_percentage_24h': 2.95995,
'circulating_supply': 19364043.0,
'total_supply': 21000000.0,
'max_supply': 21000000.0,
'ath': 69045,
'ath_change_percentage': -57.8662,
'ath_date': '2021-11-10T14:24:11.849Z',
'atl': 67.81,
'atl_change_percentage': 42801.66278,
'atl_date': '2013-07-06T00:00:00.000Z',
'roi': None,
'last_updated': '2023-05-04T13:29:17.410Z'}
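Once the response is stored as a dictionary, individual fields are read by key. A minimal sketch, using a hand-copied excerpt of the output above instead of a live API call (the names price and spread are only for illustration):

```python
# Excerpt of the dictionary shown above; a live call would return newer values.
data = {
    'id': 'bitcoin',
    'symbol': 'btc',
    'current_price': 29016,
    'high_24h': 29434,
    'low_24h': 28191,
}

price = data['current_price']
spread = data['high_24h'] - data['low_24h']   # range of the price over the last 24 hours
print(price, spread)                          # 29016 1243
```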
A function takes a collection of arguments and performs an action. Let me present a couple of examples of value-returning functions. They are easily distinguished from other functions, because the definition’s last line is a return clause.
A first example follows. Note the indentation after the colon.
When we define a function, Python just takes note of the definition, accepting it when it is syntactically correct (parentheses, commas, etc). The function can be applied later to different arguments.
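The function used in this section can be written as follows. The definition is the one that appears in the error messages further down; the arguments 0 and 2 are illustrative choices for which the function is well defined.

```python
def f(x):
    y = 1/(1 - x**2)     # the indented lines form the body of the function
    return y

at_zero = f(0)           # 1/(1 - 0) = 1.0
at_two = f(2)            # 1/(1 - 4), approximately -0.333
print(at_zero, at_two)
```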
If we apply the function to an argument for which it does not make sense, Python will return an error message which depends on the values supplied for the argument.
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[50], line 1
----> 1 f(1)

Cell In[48], line 2, in f(x)
      1 def f(x):
----> 2     y = 1/(1 - x**2)
      3     return y

ZeroDivisionError: division by zero
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[51], line 1
----> 1 f('Mary')

Cell In[48], line 2, in f(x)
      1 def f(x):
----> 2     y = 1/(1 - x**2)
      3     return y

TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
Short functions can be defined on a single line:
Sometimes, you will see them written in an even shorter way called lambda functions:
Functions can have more than one argument, as in:
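The three variants can be sketched as follows. The single-line and lambda versions reuse the function f from above, while g and the two-argument function h are hypothetical examples.

```python
def f(x): return 1/(1 - x**2)            # the same function on a single line

g = lambda x: 1/(1 - x**2)               # equivalent lambda function

def h(x, y): return x*y/(x**2 + y**2)    # hypothetical function of two arguments

print(f(2) == g(2))      # True: both definitions compute the same value
ratio = h(1, 2)          # 1*2/(1 + 4)
print(ratio)
```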
Loops are used in practically all programming languages, so you have probably seen them if you have any kind of programming experience. In particular, for loops are used to avoid repetition. Suppose that you wish to extract the first letter of each name in the list mylist, storing them in a new list. You can use a for loop to iterate the extraction:
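A minimal sketch of such a loop, assuming mylist holds four illustrative names (the name initials is also an assumption).

```python
mylist = ['Mary', 'John', 'Gerald', 'Yennefer']   # illustrative names

initials = []
for name in mylist:
    initials.append(name[0])     # name[0] is the first character of the string

print(initials)                  # ['M', 'J', 'G', 'Y']
```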
Loops are much less frequent in data science practice, because NumPy and Pandas provide vectorized functions which, when applied to a data container such as a Pandas series, return a data container of the same shape whose terms are the values of the function on the corresponding terms of the original container. Nevertheless, we may occasionally use a for loop in this course.
Also ubiquitous in programming is conditional logic, operationalized through if-then-else commands. You also have this in Python. For instance, if you wish to create a dummy flag for names with more than 5 letters in the list mylist, you can do it with:
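One way to sketch this, again assuming four illustrative names in mylist (the name flags is an assumption). Note that Python has no then keyword: the colon and the indentation play that role.

```python
mylist = ['Mary', 'John', 'Gerald', 'Yennefer']   # illustrative names

flags = []
for name in mylist:
    if len(name) > 5:
        flags.append(1)      # the name has more than 5 letters
    else:
        flags.append(0)

print(flags)                 # [0, 0, 1, 1]
```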
It is also rare to find explicit if-then-else statements in data science, since “vectorial” syntax is preferred (and typically leads to faster execution).
You try it
Create a phone book using a dict in which the names of 5 of your friends are the keys and their phone numbers are the values.
Then, also create a reverse mapping so that you can find out who’s calling you based on the number.
Define a function called find_caller_id that uses the reverse mapping to return the name of the person who’s calling you based on the number.
[('John', '+12345'),
('Gerald', '+15345'),
('Mary', '+12385'),
('Yennefer', '+19345'),
('Rience', '+12315')]
('John', '+12345')
('Gerald', '+15345')
('Mary', '+12385')
('Yennefer', '+19345')
('Rience', '+12315')
tuple
John +12345
Gerald +15345
Mary +12385
Yennefer +19345
Rience +12315
{'+12345': 'John',
'+15345': 'Gerald',
'+12385': 'Mary',
'+19345': 'Yennefer',
'+12315': 'Rience'}
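One possible solution, consistent with the outputs above, is sketched below. The names phone_book and reverse_book, and the fallback text for unknown numbers, are assumptions rather than part of the exercise.

```python
phone_book = {
    'John': '+12345',
    'Gerald': '+15345',
    'Mary': '+12385',
    'Yennefer': '+19345',
    'Rience': '+12315',
}

# items() yields (name, number) tuples, which can be unpacked in a loop.
for name, number in phone_book.items():
    print(name, number)

# Reverse mapping: swap keys and values with a dictionary comprehension.
reverse_book = {number: name for name, number in phone_book.items()}

def find_caller_id(number):
    # get returns the fallback text (an assumption) when the number is unknown.
    return reverse_book.get(number, 'Unknown caller')

print(find_caller_id('+12385'))   # Mary
```

The reverse mapping only works because every number is unique. If two friends shared a number, the later entry would overwrite the earlier one.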
You are allowed to use ChatGPT or other LLMs in this course. However, I recommend using them as a last resort or when exploring some feature of the Python language that you are not familiar with. Then, make sure to look up the relevant libraries to verify that you’re using them correctly. ChatGPT 4 is substantially better than ChatGPT 3.5 and has a dedicated tool to help you write code. Even so, the advice here is kept general so that it works with every LLM.
Don’t ask general questions; instead, ask specific ones and show ChatGPT some data so it knows what you’re dealing with. These examples contain concepts and libraries that we haven’t covered yet, but the general style of asking tends to work for problems encountered in this course.
Prompt:
I want you to act as a coder. Please write me a regex in Python that extracts the email from the following pandas Series: s = pd.Series('Enric <enric@iese.edu>', 'Franz <franz@iese.edu>').
Response:
Here’s a Python code snippet using pandas and regular expression to extract emails:
import pandas as pd
import re
s = pd.Series(['Enric <enric@iese.edu>', 'Franz <franz@iese.edu>'])
email_pattern = r'<(.*?)>'
s_extracted = s.str.extract(email_pattern)
print(s_extracted)
                0
0  enric@iese.edu
1  franz@iese.edu
This will output a new Series with just the email addresses.
Note how it (a) corrected a bug in my original code, but (b) introduced a redundancy as it’s importing a package (re) that it’s not using …
Prompt:
I have the following dataframe called df_aapl, plot the rolling average of the Open price over time using just pandas, no matplotlib or numpy. Make it look professional and add a title:
                  Open        High         Low       Close   Adj Close    Volume
2023-08-01  196.240005  196.729996  195.279999  195.610001  195.346207  35175100
2023-08-02  195.039993  195.179993  191.850006  192.580002  192.320297  50389300
2023-08-03  191.570007  192.369995  190.690002  191.169998  190.912201  61235200
Response:
You can plot the rolling average of the “Open” price using Pandas directly like this:
import pandas as pd
data = {
'Open': [196.24, 195.039993, 191.570007],
'High': [196.729996, 195.179993, 192.369995],
'Low': [195.279999, 191.850006, 190.690002],
'Close': [195.610001, 192.580002, 191.169998],
'Adj Close': [195.346207, 192.320297, 190.912201],
'Volume': [35175100, 50389300, 61235200]
}
index = pd.to_datetime(['2023-08-01', '2023-08-02', '2023-08-03'])
df_aapl = pd.DataFrame(data, index=index)
df_aapl['Open'].rolling(window=2).mean().plot(title='Rolling Average of Open Price', grid=True);
The code works, but as it didn’t have access to your data, it created its own. Simply replace data with your own variable name and you should be good to go.
There are many books for learning about Python, but some of them would not be appropriate for learning how to work with data in Python. It can even happen that you do not find anything about data in many of them. Mind that Python has so many applications that the intersection of the know-how of all Python users is relatively narrow. For an introduction to Python as a programming language, in a computer science context, I would recommend Zelle (2010). For the self-learning data scientist, McKinney (2017) and VanderPlas (2017) are both worth their price.
There are also plenty of learning materials on the Internet, including MOOCs. For instance, Coursera has a pack of courses on Python (see coursera.org/courses?query=python). But, probably, the most attractive marketplace for data science courses is DataCamp. They offer, under subscription or academic license, an impressive collection of courses, most of them focused on either R or Python. In addition to following DataCamp courses, you can also benefit from the DataCamp Community Tutorials, which are free and cover a wide range of topics. Finally, a good place to start is CS Dojo.
Adapted from: Miguel Ángel Canela