
NumPy: Numerical Python

Why learn NumPy?

In today’s data-driven world, everything can be quantified and represented as numbers—from user behavior on websites to patterns in medical data. Understanding these numerical representations is essential for machine learning, data analysis, and scientific computing. NumPy serves as the foundational package for numerical operations in Python, making it easier for you to delve into these domains.

NumPy Arrays

This tutorial introduces you to NumPy’s data containers, offering a quick look at how to manage data sets in Python. In the realm of mathematics, vectors and matrices are fundamental units represented as sequences and grids of numbers, respectively. They are integral to linear algebra, a crucial area in computational tasks underlying modern Machine Learning and Artificial Intelligence. In NumPy, these structures are simplified as one-dimensional (1D) and two-dimensional (2D) arrays, and the library even supports higher-dimensional arrays with ease.

To get started, import NumPy like this:

import numpy as np

A 1D array can be created from a list with the NumPy function array. If the items of the list have different types, they are converted to a common type when the array is created. A simple example follows.

mylist = [2, 7, 14, 5, 9]
mylist
[2, 7, 14, 5, 9]
type(mylist)
list
arr1 = np.array([1,2,3,4])
arr1
array([1, 2, 3, 4])

This looks the same, but it’s a very different beast! The constraint that everything has the same type is a useful one as it allows us to operate on the array more naturally. Try these for comparison:

arr1 + 1
mylist + [1]
arr1 * 2
mylist * 2
[2, 7, 14, 5, 9, 2, 7, 14, 5, 9]
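Only the last result is displayed above. As a minimal sketch, reusing the values of arr1 and mylist defined earlier, the four expressions evaluate as follows:

```python
import numpy as np

mylist = [2, 7, 14, 5, 9]
arr1 = np.array([1, 2, 3, 4])

# Arrays: arithmetic is applied element by element
arr_plus = arr1 + 1       # array([2, 3, 4, 5])
arr_times = arr1 * 2      # array([2, 4, 6, 8])

# Lists: + concatenates and * repeats
list_plus = mylist + [1]  # [2, 7, 14, 5, 9, 1]
list_times = mylist * 2   # [2, 7, 14, 5, 9, 2, 7, 14, 5, 9]

print(arr_plus, arr_times)
print(list_plus, list_times)
```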

In case of mixed types (never do this!) you will get unexpected results:

arr1 = np.array([1,'a',3,4])
arr1
array(['1', 'a', '3', '4'], dtype='<U21')

There are two types involved with arrays. The array itself is of type numpy.ndarray, and its elements share a common type, which can be checked using the dtype attribute (here arr1.dtype):

type(arr1)
numpy.ndarray
arr1.dtype
dtype('<U21')
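The conversion to a common type can be seen by comparing a few arrays. The sketch below uses illustrative variable names (ints, floats, mixed) that are not part of the tutorial. The exact integer and string types depend on the platform, so the comments only describe them loosely.

```python
import numpy as np

ints = np.array([1, 2, 3, 4])
floats = np.array([1, 2.5, 3])    # the integers are converted to floats
mixed = np.array([1, 'a', 3, 4])  # everything is converted to strings

print(type(ints))    # numpy.ndarray, the same for every array
print(ints.dtype)    # an integer type (int64 on most platforms)
print(floats.dtype)  # float64
print(mixed.dtype)   # a Unicode string type such as <U21
```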

A 2D array can be directly created from a list of lists of equal length. The terms are entered row-by-row:

my_list_of_lists = [
    [0, 7, 2, 3], 
    [3, 9, -5, 1]
]
my_list_of_lists
[[0, 7, 2, 3], [3, 9, -5, 1]]
arr2 = np.array(my_list_of_lists)
arr2
array([[ 0,  7,  2,  3],
       [ 3,  9, -5,  1]])

Although we visualize a vector as a column (or as a row) and a matrix as a rectangular arrangement, with rows and columns, it is not so in the computer. The 1d array is just a sequence of elements of the same type, neither horizontal nor vertical. It has one axis, which is the 0-axis.

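In a similar way, a 2d array is a sequence of 1d arrays of the same length and type. It has two axes. When we visualize it as rows and columns, axis=0 runs down the rows (vertically), while axis=1 runs across the columns (horizontally). An operation applied along axis=0 therefore gives one result per column, and an operation applied along axis=1 gives one result per row.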
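A small sketch of what the two axes mean in practice, using the arr2 values from above and the array method sum:

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3],
                 [3, 9, -5, 1]])

col_sums = arr2.sum(axis=0)  # collapse the rows: one sum per column
row_sums = arr2.sum(axis=1)  # collapse the columns: one sum per row

print(col_sums)  # [ 3 16 -3  4]
print(row_sums)  # [12  8]
```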

The number of terms stored along an axis is the length of that axis. The lengths of all axes are collected in the attribute shape:

arr1.shape
(4,)
arr2.shape
(2, 4)
arr3 = np.random.randn(2,3,4)
arr3.shape
(2, 3, 4)

This is a tuple, meaning that you can extract one of the elements, but you cannot reassign it.

print(arr2.shape[0])
2
arr2.shape[0] = 123
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[35], line 1
----> 1 arr2.shape[0] = 123

TypeError: 'tuple' object does not support item assignment

You try it:

  1. Create a prediction function price that predicts the price of an apartment based on its surface:

    Price = 100,000 + 50,000 x (surf in m2)

  2. Calculate the price of a house of 130 m2.

  3. Calculate the price over a range of surfaces 50,60,..,130

...
Solution
def price(surf):
    return 100000 + 50000 * surf


print(price(130))

surfaces = np.array([50,60,70,80,90,100,110,120,130]) # Don't use me
surfaces = np.array(list(range(50,131,10)))   # This is better
surfaces = np.arange(50,131,10)               # This is best
prices   = price(surfaces)

prices
6600000
array([2600000, 3100000, 3600000, 4100000, 4600000, 5100000, 5600000,
       6100000, 6600000])
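The stop value 131 in the solution may look odd. Like range, np.arange excludes its stop value, so the stop has to lie just past 130 for 130 to be included. A short sketch (the names with_130 and without_130 are only illustrative):

```python
import numpy as np

with_130 = np.arange(50, 131, 10)     # stop=131, so 130 is included
without_130 = np.arange(50, 130, 10)  # stop=130 is excluded, last value is 120

print(with_130)     # [ 50  60  70  80  90 100 110 120 130]
print(without_130)  # [ 50  60  70  80  90 100 110 120]
```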

NumPy functions

NumPy incorporates vectorized forms of the mathematical functions of the package math. A vectorized function is one that, when applied to an array, returns an array with the same shape, whose terms are the values of the function on the corresponding terms of the original array. For instance, the NumPy square root function np.sqrt takes the square root of every term of a numeric array. NumPy also has functions that summarize an array instead of transforming it term by term: np.max returns the maximum over all the terms of a numeric array:

# Heights in cm [Female, Male]
heights = np.array([[160, 175],  
                    [155, 180],  
                    [165, 170],  
                    [162, 178],  
                    [158, 172]])
np.max(heights)
180
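To make the difference between an element-wise (vectorized) function and a summarizing function concrete, here is a sketch with made-up values. The array areas is an assumption for illustration only.

```python
import math
import numpy as np

areas = np.array([1.0, 4.0, 9.0, 16.0])  # illustrative values

roots = np.sqrt(areas)                      # element-wise, same shape as the input
roots_loop = [math.sqrt(a) for a in areas]  # math.sqrt handles one number at a time
largest = np.max(areas)                     # a single number for the whole array

print(roots)    # [1. 2. 3. 4.]
print(largest)  # 16.0
```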

You can also tell NumPy to do this kind of calculation along the rows or columns using the axis parameter. Let’s calculate the mean height for females vs. males using the np.mean() function. With axis=0 the rows are collapsed, which leaves one mean per column:

np.mean(heights ,axis=0)
array([160., 175.])
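For comparison, a sketch of the same idea with axis=1 and with another function. The variable names are illustrative. Averaging a female and a male height in each row has no particular meaning here and is only meant to show the mechanics.

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

mean_per_row = np.mean(heights, axis=1)      # collapse the columns: one mean per row
tallest_per_group = np.max(heights, axis=0)  # collapse the rows: one maximum per column

print(mean_per_row)       # values: 167.5, 167.5, 167.5, 170.0, 165.0
print(tallest_per_group)  # values: 165, 180
```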

NumPy also provides common mathematical and statistical functions, such as median, max, sum, sqrt, std, quantile etc.

Functions that are defined in terms of vectorized functions are automatically vectorized. Let’s try this with an exercise:

You try it:

Given the heights and weights of Females and Males, calculate the BMI of the whole population. Remember that the formula for BMI is:

\[ \text{BMI} = \frac{\text{weight (in kg)}}{\text{height (in meters)}^2} \]

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

# Weights in kg [Female, Male]
weights = np.array([[55, 70],
                    [52, 77],
                    [58, 68],
                    [54, 75],
                    [53, 72]])

def bmi(height, weight):
    # complete me ...
    return 0

# call your function
...

# compute the average
...
Solution
# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])
# Weights in kg [Female, Male]
weights = np.array([[55, 70],
                    [52, 77],
                    [58, 68],
                    [54, 75],
                    [53, 72]])

# long version
def bmi(height, weight):
    return weight / ((height/100.)**2)

# short version
bmi = lambda height, weight: weight / ((height/100.)**2)

all_bmis = bmi(heights, weights)

np.mean(all_bmis,axis=0)
array([21.2478296 , 23.63214401])
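The bmi function accepts whole arrays because it is built only from arithmetic operators, which NumPy applies term by term. The same holds for any function composed of vectorized pieces. As a further sketch, here is a hypothetical helper hypotenuse (not part of the tutorial) that works on single numbers and on arrays alike:

```python
import numpy as np

def hypotenuse(a, b):
    # hypothetical helper: only arithmetic and np.sqrt, so arrays work too
    return np.sqrt(a**2 + b**2)

legs_a = np.array([3, 5, 8])    # made-up values
legs_b = np.array([4, 12, 15])  # made-up values

single = hypotenuse(3, 4)
several = hypotenuse(legs_a, legs_b)

print(single)   # 5.0
print(several)  # [ 5. 13. 17.]
```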

Subsetting arrays

Slicing a 1D array is done the same as for a list:

arr1 = np.array([5,4,2,41])
arr1[0]
5
arr1[:3]
array([5, 4, 2])

The same applies to two-dimensional arrays, but we need two indexes within the square brackets. The first index selects the rows (axis=0), and the second index the columns (axis=1):

heights[[1,2], :3]
array([[155, 180],
       [165, 170]])
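A few more combinations of row and column indexes, sketched on the same heights array (the variable names are illustrative):

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

one_value = heights[0, 1]        # row 0, column 1 -> 175
female_col = heights[:, 0]       # all rows, column 0 -> 1D array of length 5
first_two_rows = heights[:2, :]  # rows 0 and 1, all columns -> shape (2, 2)

print(one_value)
print(female_col)
print(first_two_rows)
```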

Step intervals can be used if we need to select only every n-th element of the array:

arr1, arr1[::2]
(array([ 5,  4,  2, 41]), array([5, 2]))

One special case of this which is used quite often is to use steps of -1, which is equivalent to reversing the array.

arr1, arr1[::-1]
(array([ 5,  4,  2, 41]), array([41,  2,  4,  5]))
arr1
array([ 5,  4,  2, 41])
arr1[::2]
array([5, 2])

Here’s an overview of common slicing operations from McKinney (2017):

Filtering

Subsets of an array can also be extracted by means of expressions that act as filters. A comparison involving an array is evaluated term by term and produces a Boolean array (called a Boolean mask):

filter_tall = heights > 160
filter_tall
array([[False,  True],
       [False,  True],
       [ True,  True],
       [ True,  True],
       [False,  True]])

We can use this mask as an index. It keeps only the heights above 160 cm and returns them as a 1D array:

heights[filter_tall]
array([175, 180, 165, 170, 162, 178, 172])

These can be combined into complex expressions as suited for the application:

filter_tall   = heights > 160
filter_short  = heights < 180
heights[filter_tall & filter_short ]
array([175, 165, 170, 162, 178, 172])
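Besides & ("and"), masks can be combined with | ("or") and negated with ~ ("not"). The sketch below assumes the same heights array. Note that each comparison needs its own parentheses and that the Python keywords and, or and not do not work on arrays.

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

# "or": shorter than 160 cm or at least 178 cm
extremes = heights[(heights < 160) | (heights >= 178)]

# "not": everyone who is NOT taller than 160 cm
not_tall = heights[~(heights > 160)]

print(extremes)  # [155 180 178 158]
print(not_tall)  # [160 155 158]
```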

Since heights and weights have the same shape, a mask built from one can be used to filter the other:

weights[filter_tall]
array([70, 77, 58, 68, 54, 75, 72])

Extra exercises for home

You try it

What is the average BMI of people taller than 170 cm?

...
Solution
all_bmis = bmi(heights, weights)

filter_tall = heights > 170
np.mean(all_bmis[filter_tall])
23.65782707606685

You try it

Try to change the BMI function so that instead of giving you your BMI, it returns a classification (string). The BMI classifications are:

  • Below 18.5 Underweight
  • 18.5 – 24.9 Healthy Weight
  • 25.0 – 29.9 Overweight
  • 30.0 and Above Obesity
...
Solution
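def bmi_classification(height, weight):
    # height in cm, weight in kg (single numbers, not arrays)
    score = bmi(height, weight)

    # contiguous ranges, so that a score such as 24.95 is not misclassified
    if score < 18.5:
        return "Underweight"
    elif score < 25.0:
        return "Healthy Weight"
    elif score < 30.0:
        return "Overweight"
    else:
        return "Obesity"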
    
bmi_classification(180,79)
'Healthy Weight'
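Because this solution uses if, it only works for one person at a time. With arrays, the comparison yields a Boolean array whose truth value is ambiguous, and Python raises an error. One possible vectorized variant is sketched below with np.select. The function name bmi_classification_array and the sample values are made up for illustration, and the ranges are assumed to be contiguous (below 18.5, below 25, below 30).

```python
import numpy as np

def bmi(height, weight):
    # height in cm, weight in kg, as in the earlier solution
    return weight / ((height / 100.0) ** 2)

def bmi_classification_array(height, weight):
    # hypothetical name; works term by term on arrays of any shape
    score = bmi(height, weight)
    conditions = [score < 18.5, score < 25.0, score < 30.0]
    labels = ["Underweight", "Healthy Weight", "Overweight"]
    # np.select picks the label of the first condition that is True,
    # and falls back to the default when none is
    return np.select(conditions, labels, default="Obesity")

sample_heights = np.array([180, 160, 170, 150])  # made-up values, in cm
sample_weights = np.array([79, 45, 80, 70])      # made-up values, in kg

classes = bmi_classification_array(sample_heights, sample_weights)
print(classes)
```

The order of the conditions matters: a score of 17 also satisfies score < 25.0, but np.select stops at the first match.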