Introduction to Python

What is Python?

Python is a powerful and widely used programming language that debuted in 1991. While Python 3 is the current standard, you might still encounter Python 2 code in older resources, such as earlier editions of many books. The language offers flexibility in how you interact with it: you can choose from various interfaces and run multiple independent instances (called kernels) simultaneously on your computer.

One important detail to remember: Python is case sensitive. For example, the built-in function type (lowercase) tells you what kind of object you’re working with, but writing it as Type will trigger an error (unless you have created a new function with this name), because Python treats uppercase and lowercase letters as distinct characters.
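A minimal sketch of the case-sensitivity rule, assuming no object named Type has been defined in the session. The try/except construct is used here only to capture the error message instead of stopping the cell:

```python
# type (lowercase) is a built-in function
print(type(2))          # <class 'int'>

# Type (capitalized) has not been defined, so Python raises a NameError
try:
    Type(2)
except NameError as err:
    message = str(err)
    print(message)      # name 'Type' is not defined
```

In a notebook, entering type(2) as the last line of a cell displays int; print shows the longer form above.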

The Anaconda distribution

There are many distributions of Python, but Anaconda (anaconda.com) is the preferred choice in the data science community. Anaconda comes pre-packaged with most resources needed for this course. When installing Anaconda, select Python 3, and you’ll get access to the Anaconda Navigator, a desktop graphical interface for launching various Python environments. For those who prefer command-line tools, you can bypass the Navigator using a command-line interface (CLI): Terminal on Mac or the Anaconda Prompt on Windows.

Historically, developers favored the Jupyter Qt console or IPython (Interactive Python). IPython enhanced the basic Python shell with additional features, while the Qt console added a graphical interface with menus and mouse support using the Qt toolkit. IPython introduced magic commands, denoted by %cmd (like %cd for changing directories). While these commands aren’t part of Python itself, they’re still referenced in many tutorials. For a complete list of magic commands, use %quickref. Though you can omit the % prefix, we’ll keep it to distinguish magic commands from standard Python code.

Today’s data scientists primarily use IPython’s successor: Jupyter. The Jupyter ecosystem introduces the notebook paradigm, allowing seamless integration of code, output, and documentation. Jupyter Notebook leads this space, with Apache Zeppelin a notable alternative. These platforms support multiple programming languages beyond Python, including R. These course materials were developed using a Python-based Jupyter notebook. The page that you are reading right now is a rendered notebook.

Anaconda also includes two full-featured Integrated Development Environments (IDEs): Spyder and Visual Studio Code. These provide comprehensive development tools like consoles, text editors, debuggers, and variable inspectors. While we won’t cover these advanced features, users familiar with IDEs (like RStudio for R) may prefer these environments. Feel free to use any tool you’re comfortable with, though our instructions will assume you’re using Jupyter Notebook, Jupyter Lab or Visual Studio Code - the concepts generally transfer to other interfaces with minimal adaptation.

If you are not able to install or run Anaconda on your own computer, we recommend that you create a Google account and use Google Colab instead (though it has some limitations).

Typing Python

Code is typed into code cells like the one below. When you enter a cell and press Shift+Return, the code runs and the output appears underneath. Python returns either an output (preceded by a label such as [1]:), a (typically long and difficult) error message, or no answer at all. Here is a very simple example:

2+2
4

So, if you enter 2 + 2, the output will be the result of this calculation. But, when you want to store this result for later use (in the same session), you enter it with a name, as follows:

a = 2 + 2

This creates the variable a. Note that the value of 2 + 2 is not outputted now. But you can call it:

a
4

In Python, when you assign a value to a variable that has already been created, the previous assignment is forgotten. So:

a = 7 - 2
a
5

If you copy and paste code chunks from a text editor into the console, you can input several lines at once. In that case, you will only get the output for the last line. If the cursor is not at the end of the last line, you have to press Shift+Return to get the output. Here is a simple example:

b = 2 * 3
b - 1
b**2
36

Note. You would probably have written b^2 for the square of b, but the caret symbol (^) plays a different role in Python.
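To make the difference concrete, here is a short sketch comparing the two operators. In Python, the caret is the bitwise XOR operator for integers, so it produces a result without any warning, just not the square:

```python
b = 2 * 3          # b is 6
square = b ** 2    # exponentiation: 6 squared
xor = b ^ 2        # bitwise XOR of binary 110 and 010, which is 100
print(square)      # 36
print(xor)         # 4
```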

Python Packages

When you start a new kernel, only a core set of Python resources is initially available. You’ll need to import additional packages for most data science tasks, as these core resources are intentionally kept minimal. These extra resources come in packages, which themselves are split into different modules. Python can be extended by more than 200,000 packages. Some large packages, like scikit-learn, are organized into subpackages, each of which is a collection of modules.

For example, if you need to calculate the square root of 2, you would import the math package, which includes various mathematical functions. Once imported, all its functions become available. The notation math.sqrt indicates that sqrt is a function from the math package.

import math
math.sqrt(2)
1.4142135623730951

Alternatively, you can import only the functions that you plan to use:

from math import sqrt
sqrt(2)
1.4142135623730951

The following example generates a random number from a standard normal distribution by calling a function in the random subpackage of the NumPy package.

import numpy.random as rnd
rnd.randn()
-1.0764016072133342

Installing packages

Packages must be installed on your computer before they can be imported, and these imports only last for the current kernel session. The math package, being part of the Python Standard Library, is always available by default.

If you use the Anaconda distribution, most packages used in this course are already available and can be imported directly. If a package is missing, you have to install it (only once), and you have two installation options:

  • pip: The basic Python package installer (pypi.org/project/pip), though it may encounter version conflicts between related packages
  • conda: The recommended installer for Anaconda users, as it better manages package dependencies and version conflicts

Both pip and conda can be run directly from Jupyter notebooks, which we’ll explore later as the need comes up.
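As a rough sketch, you can check whether a package is importable before deciding to install it. The helper name is_installed is hypothetical (it is not part of any library); it relies on importlib.util.find_spec from the standard library. The %pip and %conda magic commands shown in the comments are how an installation is typically started from a notebook cell; some_package is a placeholder.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python's import system can find the package."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed('math'))                            # True: standard library
print(is_installed('a_package_that_does_not_exist'))   # False

# In a Jupyter notebook cell, a missing package could then be installed with
#   %pip install some_package
# or, for conda users,
#   %conda install some_package
```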

This course approaches Python from a data science perspective, so we focus on its four core packages:

  • NumPy (whose predecessor, Numeric, dates from 1995): Provides support for large vectors and matrices (called arrays) - the foundation for scientific computing in Python

  • Matplotlib: Built on NumPy, this is Python’s primary visualization library

  • Pandas: A powerful data management library inspired by R, built on top of NumPy and Matplotlib. Its introduction has significantly increased Python’s popularity in data science, making Python/Pandas a common alternative to R

  • scikit-learn (2008): A comprehensive machine learning library that works with both NumPy arrays and Pandas objects (though it always returns NumPy arrays)

Data Types

The data types in Python are similar to those of other languages. The type can be learned with the function type. The main data types are:

  • First, we have integer numbers (type int). There are subdivisions of this basic type, such as int64, but you don’t need to know about that to start your Python trip.

  • We can also have floating-point numbers (type float), that is, numbers with decimals. We also have subdivisions here, such as float64.

  • Under type bool, Python admits Boolean values, which are either True or False. In Python, Boolean variables are converted to type int or float by applying a mathematical operator.

  • Besides numbers, we can also manage strings, with type str. Strings come in Python with many methods attached. They will be discussed, in the Pandas context, in a specific chapter.

  • Python also has type datetime for dealing with dates and times. This will also be discussed later in a specific chapter.

Numeric types

Let me start with the numeric types. For the variable a defined above:

type(a)
int

So, a has type int (meaning integer). Another numeric type is that of floating-point numbers (float), which have decimals:

b = math.sqrt(2)
type(b)
float

There are subdivisions of these two basic types (such as int64), but I skip them in this brief tutorial. Note that, in Python, integers are not, as in the mathematics textbook, a subset of the real numbers, but a different type:

type(2)
int
type(2.0)
float

In the above square root calculation, b got type float because this is what the math function sqrt returns. The functions int and float can be used to convert numbers from one type to another type (sometimes at a loss):

float(2)
2.0
int(2.3)
2
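The "loss" in the conversion deserves a quick illustration. As a small sketch: int drops the decimal part (it truncates toward zero), whereas the built-in function round goes to the nearest integer.

```python
truncated = int(2.7)      # decimals are dropped, not rounded
toward_zero = int(-2.7)   # truncation goes toward zero
rounded = round(2.7)      # nearest integer
print(truncated)          # 2
print(toward_zero)        # -2
print(rounded)            # 3
```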

Boolean Data

We also have Boolean (bool) variables, whose value is either True or False:

d = 5 < a
d
False
type(d)
bool

Even if they don’t appear explicitly, Booleans are always under the hood. When you enter an expression involving a comparison such as 5 < a, the Python interpreter evaluates it, returning either True or False.

Here, I have defined a variable by means of such an expression, so I got a Boolean variable. Warning: as a comparison operator, equality is denoted by two equal signs. This may surprise you.

a == 4
False

Boolean variables can be converted to int and float type with the functions mentioned above, but also by applying a mathematical operator:

math.sqrt(d)
0.0
1 - d
1
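This conversion is handy for counting how many conditions hold, since True behaves as 1 and False as 0 in arithmetic. A minimal sketch, with a set to 5 as above:

```python
a = 5
n_true = (a < 7) + (a == 5) + (a > 9)   # True + True + False
print(n_true)           # 2
print(True * 10)        # 10
print(float(True))      # 1.0
```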

Strings

Besides numbers, we can also manage strings with type str:

c = 'Lovelace'
type(c)
str

The quote marks indicate type str. You can use single or double quotes, but take care to use the same kind on both sides of the string. Strings come in Python with many methods attached. These methods will be discussed later in this course.
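As a brief preview, here is a sketch of a few built-in string methods and operations applied to the string defined above. Nothing here needs an extra package:

```python
c = 'Lovelace'
upper = c.upper()                  # all capitals
starts = c.startswith('Love')      # test the beginning of the string
full_name = 'Ada' + ' ' + c        # + concatenates strings
print(upper)        # LOVELACE
print(starts)       # True
print(full_name)    # Ada Lovelace
print(len(c))       # 8
```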

Data Containers

Lists

Python has several types for objects that work as data containers. The most versatile is the list, which is represented as a sequence of comma-separated values inside square brackets.

Lists can contain items of different types. A simple example of a list, of length 4, follows.

mylist = ['Lovelace', 'Babbage', 'Shannon', 'Turing']
len(mylist)
4

Lists can be concatenated in a very simple way in Python:

newlist = mylist + [2, 3]
newlist
['Lovelace', 'Babbage', 'Shannon', 'Turing', 2, 3]

Now, the length of newlist is 6:

len(newlist)
6

The first item of mylist can be extracted as mylist[0], the second item as mylist[1], etc. The last item can be extracted either as mylist[3] or as mylist[-1]. Sublists can be extracted by using a colon inside the brackets, as in:

mylist[0:2]
['Lovelace', 'Babbage']

Note that 0:2 includes 0 but not 2. This is a general rule for indexing in Python. Other examples:

mylist[2:]
['Shannon', 'Turing']
mylist[:3]
['Lovelace', 'Babbage', 'Shannon']
mylist[:-1]
['Lovelace', 'Babbage', 'Shannon']

The items of a list are ordered, and can be repeated. This is not so in other data containers.

Set

A set is represented in the same way as a list, but with curly braces replacing the square brackets:

myset = {'Messi', 'Cristiano', 'Neymar', 'Coutinho'}

A difference between the list and the set is that the elements of a set are not ordered, and repetition is ignored.
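A short sketch of what "repetition is ignored" means in practice. The list names is made up for illustration; converting it to a set keeps one copy of each element:

```python
names = ['Messi', 'Cristiano', 'Messi', 'Neymar', 'Messi']
unique_names = set(names)          # duplicates collapse into one element
print(len(names))                  # 5
print(len(unique_names))           # 3
print('Neymar' in unique_names)    # True

# Order and repetition do not matter when comparing sets
same = {'a', 'b'} == {'b', 'a', 'a'}
print(same)                        # True
```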

Tuple

A tuple is like a list, represented with parentheses instead of square brackets:

mytuple = ('Messi', 'Cristiano', 'Neymar', 'Coutinho')
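One practical difference from a list is that a tuple cannot be modified once it has been created. A minimal sketch, using try/except only to show the error message without stopping the cell:

```python
players = ['Messi', 'Cristiano']
players[0] = 'Neymar'              # lists can be changed in place
print(players)                     # ['Neymar', 'Cristiano']

mytuple = ('Messi', 'Cristiano', 'Neymar', 'Coutinho')
try:
    mytuple[0] = 'Neymar'          # tuples cannot
except TypeError as err:
    message = str(err)
    print(message)   # 'tuple' object does not support item assignment
```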

Ranges

A range is a sequence of integers which in many aspects works as a list, but the terms of the sequence are not saved as in a list. Instead, only the procedure to create the sequence is saved. The syntax is range(start, end, step). Example:

myrange = range(0, 10, 2)
list(myrange)
[0, 2, 4, 6, 8]

Note that the items of a range cannot be printed directly. So, I have converted the range to a list here with the function list. If the step is omitted, it is assumed to be 1:

list(range(5, 12))
[5, 6, 7, 8, 9, 10, 11]

If the start is also omitted, it is assumed to be 0:

list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
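Because only the rule for generating the sequence is stored, even a very long range is cheap to create. The sketch below builds a range with a billion terms and still supports length, indexing and membership tests without ever materializing a list:

```python
big = range(10**9)            # created instantly, no billion-item list is built
n_terms = len(big)
sixth = big[5]                # indexing works as in a list
last = big[-1]
print(n_terms)                # 1000000000
print(sixth)                  # 5
print(last)                   # 999999999
print(10**9 in big)           # False: the end value is excluded
```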

Dictionaries

A dictionary is a collection of key/value pairs. For instance, the following dictionary contains three features of an individual:

my_dict = {'name': 'Joan', 'gender': 'F', 'age': 32}

The keys can be listed:

my_dict.keys()
dict_keys(['name', 'gender', 'age'])

In the dictionary, a value is not extracted using an index which indicates its order in a sequence, as in the list, but using the corresponding key:

my_dict['name']
'Joan'
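A few more everyday dictionary operations, sketched on the same example. The city and email entries are invented for illustration. The get method returns a fallback value instead of raising an error when a key is missing:

```python
my_dict = {'name': 'Joan', 'gender': 'F', 'age': 32}

my_dict['age'] = 33               # update the value of an existing key
my_dict['city'] = 'Barcelona'     # add a new key/value pair

email = my_dict.get('email', 'unknown')   # key is missing, so the default is used
print(email)                      # unknown

# Loop over the key/value pairs
for key, value in my_dict.items():
    print(key, value)
```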

Example

Note: if you want to run this example you will need to install the dload library.

Dictionaries are important because a lot of real-life data comes in the form of dictionaries. Let’s say you want to get the latest information on bitcoin. You may want to call the CoinGecko API (Application Programming Interface) to obtain that data.

Typically, these APIs are essentially just websites that you visit. In this case:

https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids=bitcoin&order=market_cap_desc&per_page=100&page=1&sparkline=false&locale=en

import dload

api_url = 'https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids=bitcoin&order=market_cap_desc&per_page=100&page=1&sparkline=false&locale=en'
data = dload.json(api_url)[0]

data
{'id': 'bitcoin',
 'symbol': 'btc',
 'name': 'Bitcoin',
 'image': 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png?1547033579',
 'current_price': 29016,
 'market_cap': 563323025278,
 'market_cap_rank': 1,
 'fully_diluted_valuation': 610914958763,
 'total_volume': 21563774846,
 'high_24h': 29434,
 'low_24h': 28191,
 'price_change_24h': 766.32,
 'price_change_percentage_24h': 2.71267,
 'market_cap_change_24h': 16194739000,
 'market_cap_change_percentage_24h': 2.95995,
 'circulating_supply': 19364043.0,
 'total_supply': 21000000.0,
 'max_supply': 21000000.0,
 'ath': 69045,
 'ath_change_percentage': -57.8662,
 'ath_date': '2021-11-10T14:24:11.849Z',
 'atl': 67.81,
 'atl_change_percentage': 42801.66278,
 'atl_date': '2013-07-06T00:00:00.000Z',
 'roi': None,
 'last_updated': '2023-05-04T13:29:17.410Z'}
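If you cannot install dload or have no network access, you can still see how such a response becomes a dictionary. The sketch below uses the json module from the standard library on a shortened, hand-copied fragment of the output above; raw_text stands in for the text an API would send back:

```python
import json

# Shortened stand-in for the text returned by the API (JSON format)
raw_text = '[{"id": "bitcoin", "symbol": "btc", "current_price": 29016, "roi": null}]'

records = json.loads(raw_text)    # JSON text -> Python list of dictionaries
data = records[0]                 # first (and only) record

print(type(data))                 # <class 'dict'>
print(data['current_price'])      # 29016
print(data['roi'])                # None (JSON null becomes None)
```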

Functions

A function takes a collection of arguments and performs an action. Let me present a couple of examples of value-returning functions. They are easily distinguished from other functions, because the definition’s last line is a return clause.

A first example follows. Note the indentation after the colon.

def f(x):
    y = 1/(1 - x**2)
    return y

When we define a function, Python just takes note of the definition, accepting it when it is syntactically correct (parentheses, commas, etc). The function can be applied later to different arguments.

f(2)
-0.3333333333333333

If we apply the function to an argument for which it does not make sense, Python will return an error message which depends on the values supplied for the argument.

f(1)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[50], line 1
----> 1 f(1)

Cell In[48], line 2, in f(x)
      1 def f(x):
----> 2     y = 1/(1 - x**2)
      3     return y

ZeroDivisionError: division by zero
f('Mary')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[51], line 1
----> 1 f('Mary')

Cell In[48], line 2, in f(x)
      1 def f(x):
----> 2     y = 1/(1 - x**2)
      3     return y

TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
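If you would rather get a message than a traceback, the errors can be intercepted. The following is a sketch, not part of the original material: safe_f is a hypothetical wrapper that returns None whenever f fails with one of the two errors shown above.

```python
def f(x):
    y = 1/(1 - x**2)
    return y

def safe_f(x):
    """Hypothetical wrapper: return None instead of stopping with an error."""
    try:
        return f(x)
    except (ZeroDivisionError, TypeError) as err:
        print(f'Cannot evaluate f({x!r}): {err}')
        return None

print(safe_f(2))        # -0.3333333333333333
print(safe_f(1))        # prints a message, then None
print(safe_f('Mary'))   # prints a message, then None
```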

Short functions can be defined on a single line:

def g(x): return x**2
g(3)
9

Sometimes, you will see them written in an even shorter way called lambda functions:

g = lambda x : x**2
g(3)
9

Functions can have more than one argument, as in:

def g(x, y): return x*y/(x**2 + y**2)
g(1, 1)
0.5

Loops and conditional logic

Loops are used in practically all programming languages, so you have probably seen them if you have any kind of programming experience. In particular, for loops are used to avoid repetition. Suppose that you wish to extract the first letter of each name in the list mylist, storing the letters in a new list. You can use a for loop, written here in the compact form known as a list comprehension, to iterate the extraction:

inilist = [name[0] for name in mylist]
inilist
['L', 'B', 'S', 'T']
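The one-line form above is shorthand for an ordinary for loop. Written out step by step, the same extraction could look like this sketch (note the indentation after the colon, as in function definitions):

```python
mylist = ['Lovelace', 'Babbage', 'Shannon', 'Turing']

inilist = []                     # start with an empty list
for name in mylist:              # visit each name in turn
    inilist.append(name[0])      # add its first letter to the new list

print(inilist)                   # ['L', 'B', 'S', 'T']
```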

Loops are much less frequent in data science practice, because NumPy and Pandas provide vectorized functions. When applied to a data container such as a Pandas series, a vectorized function returns a data container of the same shape, whose terms are the values of the function on the corresponding terms of the original container. Nevertheless, we may occasionally use a for loop in this course.
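A small sketch of the contrast, assuming NumPy is installed (it ships with Anaconda). The price values are made up. The loop version handles one item at a time, while the vectorized version applies the operation to the whole array at once:

```python
import numpy as np

prices = [10.0, 12.5, 9.0]

# Loop version: process the items one by one
doubled_loop = [2 * p for p in prices]

# Vectorized version: one operation on the whole array
doubled_vec = 2 * np.array(prices)

# NumPy functions are vectorized too
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))

print(doubled_loop)   # [20.0, 25.0, 18.0]
print(doubled_vec)    # [20. 25. 18.]
print(roots)          # [1. 2. 3.]
```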

Also ubiquitous in programming is conditional logic, operationalized through if-then-else commands. You also have this in Python. For instance, if you wish to create a dummy flag for names with more than 5 letters in the list mylist, you can do it with:

flaglist = [1 if len(name) > 5 else 0 for name in mylist]
flaglist
[1, 1, 1, 1]

It is also rare to find explicit if-then-else statements in data science, since “vectorial” syntax is preferred (and typically leads to faster execution).
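For reference, this is what the explicit form looks like. The sketch below uses if, elif and else inside a for loop to label each name by length; the thresholds and labels are arbitrary choices for illustration:

```python
mylist = ['Lovelace', 'Babbage', 'Shannon', 'Turing']

labels = []
for name in mylist:
    if len(name) > 7:
        label = 'long'
    elif len(name) > 6:
        label = 'medium'
    else:
        label = 'short'
    labels.append(label)

print(labels)   # ['long', 'medium', 'medium', 'short']
```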

You try it

  1. Create a phone book using a dict, with the names of 5 of your friends as the keys and their phone numbers as the values.

  2. Then, also create a reverse mapping so that you can find out who’s calling you based on the number.

  3. Define a function called find_caller_id that returns the name of the person who’s calling you based on the number, use the reverse mapping.

Solution A
phonebook = {   
    'John'     : '+12345',
    'Gerald'   : '+15345',
    'Mary'     : '+12385',
    'Yennefer' : '+19345',
    'Rience'   : '+12315',
}
Solution B
phonebook_items = phonebook.items()
list(phonebook_items)
[('John', '+12345'),
 ('Gerald', '+15345'),
 ('Mary', '+12385'),
 ('Yennefer', '+19345'),
 ('Rience', '+12315')]
Solution C
for item in phonebook.items():
    print(item)
    
type(item)
('John', '+12345')
('Gerald', '+15345')
('Mary', '+12385')
('Yennefer', '+19345')
('Rience', '+12315')
tuple
for one,two in phonebook.items():
    print(one,two)
John +12345
Gerald +15345
Mary +12385
Yennefer +19345
Rience +12315
reverse_phonebook = {val: key for key, val in phonebook.items()}
reverse_phonebook
{'+12345': 'John',
 '+15345': 'Gerald',
 '+12385': 'Mary',
 '+19345': 'Yennefer',
 '+12315': 'Rience'}
def find_caller_id(number):
    return reverse_phonebook[number]

print(f"The person who called you was {find_caller_id('+12345')}")
The person who called you was John
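One possible refinement, offered as a sketch rather than as part of the exercise: with square brackets, an unknown number raises a KeyError. Using the dictionary method get with a default value avoids that. The default argument and its text are assumptions made for this illustration, and the reverse phone book is shortened.

```python
reverse_phonebook = {'+12345': 'John', '+15345': 'Gerald'}

def find_caller_id(number, default='Unknown caller'):
    # get returns the default when the number is not in the dictionary
    return reverse_phonebook.get(number, default)

known = find_caller_id('+12345')
unknown = find_caller_id('+00000')
print(known)      # John
print(unknown)    # Unknown caller
```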

Asking ChatGPT for help

You are allowed to use ChatGPT or other LLMs in this course. However, I recommend using them as a last resort or when exploring some feature of the Python language that you are not familiar with. Then, make sure to look up the relevant libraries to verify that you’re using them correctly. ChatGPT 4 is substantially better than ChatGPT 3.5 and has a dedicated tool to help you write code. Even so, the advice here is kept general so that it works with every LLM.

Prompt Suggestions

Don’t ask general questions. Instead, ask specific questions and show ChatGPT some data so it knows what you’re dealing with. These examples contain concepts and libraries that we haven’t covered yet, but the general style of asking tends to work for problems encountered in this course.

Example 1:

Prompt:

I want you to act as a coder. Please write me a regex in Python that extracts the email from the following pandas Series: s = pd.Series('Enric <enric@iese.edu>', 'Franz <franz@iese.edu>').

Response:

Here’s a Python code snippet using pandas and regular expression to extract emails:

import pandas as pd
import re

s = pd.Series(['Enric <enric@iese.edu>', 'Franz <franz@iese.edu>'])
email_pattern = r'<(.*?)>'
s_extracted = s.str.extract(email_pattern)

print(s_extracted)
                0
0  enric@iese.edu
1  franz@iese.edu

This will output a new Series with just the email addresses.

Note how it (a) corrected a bug in my original code, but (b) introduced a redundancy as it’s importing a package (re) that it’s not using …

Example 2:

Prompt:

I have the following dataframe called df_aapl, plot the rolling average of the Open price over time using just pandas, no matplotlib or numpy. Make it look professional and add a title:
                  Open        High         Low       Close   Adj Close    Volume
2023-08-01  196.240005  196.729996  195.279999  195.610001  195.346207  35175100
2023-08-02  195.039993  195.179993  191.850006  192.580002  192.320297  50389300
2023-08-03  191.570007  192.369995  190.690002  191.169998  190.912201  61235200

Response:

You can plot the rolling average of the “Open” price using Pandas directly like this:

import pandas as pd

data = {
    'Open': [196.24, 195.039993, 191.570007],
    'High': [196.729996, 195.179993, 192.369995],
    'Low': [195.279999, 191.850006, 190.690002],
    'Close': [195.610001, 192.580002, 191.169998],
    'Adj Close': [195.346207, 192.320297, 190.912201],
    'Volume': [35175100, 50389300, 61235200]
}
index = pd.to_datetime(['2023-08-01', '2023-08-02', '2023-08-03'])
df_aapl = pd.DataFrame(data, index=index)

df_aapl['Open'].rolling(window=2).mean().plot(title='Rolling Average of Open Price', grid=True);

The code works, but since ChatGPT didn’t have access to your data, it built its own small sample. Simply replace that sample with your own DataFrame and you should be good to go.

Additional Resources

There are many books for learning about Python, but some of them would not be appropriate for learning how to work with data in Python. It can even happen that you do not find anything about data in many of them. Mind that Python has so many applications that the intersection of the know-how of all Python users is relatively narrow. For an introduction to Python as a programming language, in a computer science context, I would recommend Zelle (2010). For the self-learning data scientist, McKinney (2017) and VanderPlas (2017) are both worth their price.

There are also plenty of learning materials on the Internet, including MOOCs. For instance, Coursera has a pack of courses on Python (see coursera.org/courses?query=python). But, probably, the most attractive marketplace for data science courses is DataCamp. They offer, under subscription or academic license, an impressive collection of courses, most of them focused on either R or Python. In addition to following DataCamp courses, you can also benefit from the DataCamp Community Tutorials, which are free and cover a wide range of topics. Finally, a good place to start is CS Dojo.

References

Adapted from: Miguel Ángel Canela