import numpy as np
NumPy: Numerical Python
Why learn NumPy?
In today’s data-driven world, everything can be quantified and represented as numbers—from user behavior on websites to patterns in medical data. Understanding these numerical representations is essential for machine learning, data analysis, and scientific computing. NumPy serves as the foundational package for numerical operations in Python, making it easier for you to delve into these domains.
NumPy Arrays
This tutorial introduces you to NumPy’s data containers, offering a quick look at how to manage data sets in Python. In the realm of mathematics, vectors and matrices are fundamental units represented as sequences and grids of numbers, respectively. They are integral to linear algebra, a crucial area in computational tasks underlying modern Machine Learning and Artificial Intelligence. In NumPy, these structures are simplified as one-dimensional (1D) and two-dimensional (2D) arrays, and the library even supports higher-dimensional arrays with ease.
To get started, import NumPy like this:
import numpy as np
A 1D array can be created from a list with the NumPy function array. If the items of the list have different types, they are converted to a common type when the array is created. A simple example follows.
mylist = [2, 7, 14, 5, 9]
mylist
[2, 7, 14, 5, 9]
type(mylist)
list
arr1 = np.array([1,2,3,4])
arr1
array([1, 2, 3, 4])
This looks the same, but it’s a very different beast! The constraint that everything has the same type is a useful one as it allows us to operate on the array more naturally. Try these for comparison:
arr1 + 1
mylist + [1]
arr1 * 2
mylist * 2
[2, 7, 14, 5, 9, 2, 7, 14, 5, 9]
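Only the result of the last expression is displayed above. As a minimal sketch of what each of the four expressions evaluates to, the following block stores every result in its own variable. The names list_example and arr_example are illustrative and not part of the tutorial's session.

```python
import numpy as np

list_example = [2, 7, 14, 5, 9]        # plain Python list
arr_example = np.array([1, 2, 3, 4])   # NumPy array

# Arrays: arithmetic is applied element by element
arr_plus_one = arr_example + 1         # values 2, 3, 4, 5
arr_doubled = arr_example * 2          # values 2, 4, 6, 8

# Lists: + concatenates and * repeats the whole list
list_plus_one = list_example + [1]     # [2, 7, 14, 5, 9, 1]
list_doubled = list_example * 2        # [2, 7, 14, 5, 9, 2, 7, 14, 5, 9]
```

The same operator therefore means something different for the two containers: arithmetic for arrays, sequence manipulation for lists.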
In case of mixed types (never do this!) you will get unexpected results:
arr1 = np.array([1,'a',3,4])
arr1
array(['1', 'a', '3', '4'], dtype='<U21')
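The conversion to a common type also happens in less dramatic cases. The sketch below, with illustrative variable names, suggests how integers mixed with a float are promoted to floats, while integers mixed with a string all become strings.

```python
import numpy as np

# Integers mixed with a float: everything is promoted to float
ints_and_floats = np.array([1, 2.5, 3])
print(ints_and_floats.dtype)        # float64

# Integers mixed with a string: everything becomes a Unicode string
ints_and_strings = np.array([1, 'a', 3])
print(ints_and_strings.dtype.kind)  # U
print(ints_and_strings[0] == '1')   # True: the integer 1 is now the string '1'
```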
There are two types involved with arrays. The array itself has the type numpy.ndarray, and its elements also have a type, which can be checked using myarray.dtype:
type(arr1)
numpy.ndarray
arr1.dtype
dtype('<U21')
A 2D array can be directly created from a list of lists of equal length. The terms are entered row-by-row:
my_list_of_lists = [
    [0, 7, 2, 3],
    [3, 9, -5, 1]
]
my_list_of_lists
[[0, 7, 2, 3], [3, 9, -5, 1]]
arr2 = np.array(my_list_of_lists)
arr2
array([[ 0, 7, 2, 3],
[ 3, 9, -5, 1]])
Although we visualize a vector as a column (or as a row) and a matrix as a rectangular arrangement with rows and columns, it is not so in the computer. A 1D array is just a sequence of elements of the same type, neither horizontal nor vertical. It has one axis, which is the 0-axis.
In a similar way, a 2D array is a sequence of 1D arrays of the same length and type. It has two axes. When we visualize it as rows and columns, axis=0 indexes the rows (moving down the grid), while axis=1 indexes the columns (moving across it).
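A small sketch may help fix the two axes in mind. It sums a 2D array along each axis in turn; the name grid is illustrative, and the values match the 2D example above.

```python
import numpy as np

grid = np.array([[0, 7, 2, 3],
                 [3, 9, -5, 1]])

# axis=0: combine the rows, leaving one result per column
col_sums = np.sum(grid, axis=0)   # values 3, 16, -3, 4

# axis=1: combine the columns, leaving one result per row
row_sums = np.sum(grid, axis=1)   # values 12, 8
```

The axis you pass is the one that disappears from the result: summing a (2, 4) array over axis=0 leaves 4 values, and summing over axis=1 leaves 2.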
The number of terms stored along an axis is the dimension of that axis. The dimensions are collected in the attribute shape:
arr1.shape
(4,)
arr2.shape
(2, 4)
arr3 = np.random.randn(2,3,4)
arr3.shape
(2, 3, 4)
This is a tuple, meaning that you can extract one of the elements, but you cannot reassign it.
print(arr2.shape[0])
2
arr2.shape[0] = 123
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[35], line 1
----> 1 arr2.shape[0] = 123

TypeError: 'tuple' object does not support item assignment
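Although individual entries of the shape tuple cannot be reassigned, NumPy does provide the reshape method, which returns an array with the same elements arranged in a new shape. A brief sketch, using an illustrative array named grid with the same values as the 2D example above:

```python
import numpy as np

grid = np.array([[0, 7, 2, 3],
                 [3, 9, -5, 1]])

print(grid.shape)   # (2, 4)
print(grid.ndim)    # 2 (number of axes)
print(grid.size)    # 8 (total number of elements)

# Same 8 elements, now arranged as 4 rows of 2
reshaped = grid.reshape(4, 2)
print(reshaped.shape)   # (4, 2)
```

The new shape must hold exactly the same number of elements; asking for a shape such as (3, 3) here would raise an error.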
You try it:
Create a prediction function price that predicts the price of an apartment based on its surface: Price = 100,000 + 50,000 x (surf in m²)
Calculate the price of an apartment of 130 m².
Calculate the price over a range of surfaces 50, 60, ..., 130
...
Solution
def price(surf):
    return 100000 + 50000 * surf

print(price(130))

surfaces = np.array([50,60,70,80,90,100,110,120,130]) # Don't use me
surfaces = np.array(list(range(50,131,10))) # This is better
surfaces = np.arange(50,131,10) # This is best

prices = price(surfaces)
prices
6600000
array([2600000, 3100000, 3600000, 4100000, 4600000, 5100000, 5600000,
6100000, 6600000])
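The stop value 131 in the solution may look odd. Like Python's range, np.arange excludes the stop value, so a stop just past 130 is needed to include 130 itself. A minimal sketch with illustrative variable names:

```python
import numpy as np

with_130 = np.arange(50, 131, 10)      # stop=131 is excluded, so 130 is the last value
without_130 = np.arange(50, 130, 10)   # stop=130 is excluded, so the last value is 120

print(with_130[-1], len(with_130))         # 130 9
print(without_130[-1], len(without_130))   # 120 8
```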
NumPy functions
NumPy incorporates vectorized forms of the mathematical functions of the package math. A vectorized function is one that, when applied to an array, returns an array with the same shape, whose terms are the values of the function on the corresponding terms of the original array. For instance, the NumPy square root function np.sqrt takes the square root of every term of a numeric array. NumPy also provides aggregation functions, which reduce an array to a single value. For instance, np.max returns the largest term of a numeric array:
# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])
np.max(heights)
180
You can also tell NumPy to do this kind of calculation along the rows or columns using the axis parameter. Let’s try calculating the mean for females vs. males using the np.mean() function.
np.mean(heights, axis=0)
array([160., 175.])
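For comparison, here is a sketch of the same call along the other axis. With axis=1 the mean is taken within each row, which gives one value per female/male pair. That is not a particularly meaningful statistic here, but it shows how the axis parameter changes the shape of the result. The array is repeated so that the block runs on its own.

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

mean_per_column = np.mean(heights, axis=0)   # 2 values: 160.0 (female), 175.0 (male)
mean_per_row = np.mean(heights, axis=1)      # 5 values: 167.5, 167.5, 167.5, 170.0, 165.0
mean_overall = np.mean(heights)              # no axis: a single value, 167.5
```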
NumPy also provides common mathematical and statistical functions, such as median, max, sum, sqrt, std, quantile, etc.
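These functions fall into two groups, which the following sketch tries to illustrate on a small made-up array named values: element-wise functions such as np.sqrt keep the shape of the input, while aggregations such as np.sum and np.median collapse it to a single number unless an axis is given.

```python
import numpy as np

values = np.array([[1.0, 4.0],
                   [9.0, 16.0]])

# Element-wise: one output per input, same shape as the input
roots = np.sqrt(values)        # [[1., 2.], [3., 4.]]

# Aggregations: the whole array is reduced to one number
total = np.sum(values)         # 30.0
middle = np.median(values)     # 6.5, the midpoint between 4 and 9
```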
Functions that are defined in terms of vectorized functions are automatically vectorized. Let’s try this with an exercise:
You try it:
Given the heights and weights of Females and Males, calculate the BMI of the whole population. Remember that the formula for BMI is:
\[ \text{BMI} = \frac{\text{weight}}{\text{height (in meter)}^2} \]
# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

# Weights in kg [Female, Male]
weights = np.array([[55, 70],
                    [52, 77],
                    [58, 68],
                    [54, 75],
                    [53, 72]])

def bmi(height, weight):
    # complete me ...
    return 0

# call your function
...
# compute the average
...
Solution
# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

# Weights in kg [Female, Male]
weights = np.array([[55, 70],
                    [52, 77],
                    [58, 68],
                    [54, 75],
                    [53, 72]])

# long version
def bmi(height, weight):
    # height is given in cm, so convert to meters first
    return weight / ((height/100.)**2)

# short version
bmi = lambda height, weight: weight / ((height/100.)**2)

all_bmis = bmi(heights, weights)

np.mean(all_bmis, axis=0)
array([21.2478296 , 23.63214401])
Subsetting arrays
Indexing and slicing a 1D array works the same way as for a list:
arr1 = np.array([5,4,2,41])
arr1[0]
5
arr1[:3]
array([5, 4, 2])
The same applies to two-dimensional arrays, but we need two indexes within the square brackets. The first index selects the rows (axis=0), and the second index the columns (axis=1):
heights[[1,2], :3]
array([[155, 180],
[165, 170]])
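A few more row and column selections, as a hedged sketch on the same heights data (repeated here so the block runs on its own; the result names are illustrative):

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

female_column = heights[:, 0]   # all rows, first column: 160, 155, 165, 162, 158
second_row = heights[1, :]      # second row, all columns: 155, 180
single_value = heights[1, 1]    # second row, second column: 180
middle_rows = heights[1:3, :]   # rows 1 and 2 (row 3 is excluded), all columns
```

A bare colon means "everything along this axis", so heights[:, 0] reads as "every row, column 0".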
Step intervals can be used if we need to select only every n-th element of the array:
arr1, arr1[::2]
(array([ 5, 4, 2, 41]), array([5, 2]))
A special case that is used quite often is a step of -1, which reverses the array.
arr1, arr1[::-1]
(array([ 5, 4, 2, 41]), array([41, 2, 4, 5]))
arr1
array([ 5, 4, 2, 41])
arr1[::2]
array([5, 2])
Here’s an overview of common slicing operations from McKinney (2017):
Filtering
Subsets of an array can also be extracted by means of expressions which act as filters. A comparison involving an array is evaluated element by element and returns a Boolean array of the same shape (called a Boolean mask):
filter_tall = heights > 160
filter_tall
array([[False, True],
[False, True],
[ True, True],
[ True, True],
[False, True]])
We can use this mask as an index. It keeps only the entries where the mask is True (here, everyone taller than 160 cm) and returns them as a flat 1D array:
heights[filter_tall]
array([175, 180, 165, 170, 162, 178, 172])
These can be combined into complex expressions as suited for the application:
filter_tall = heights > 160
filter_short = heights < 180
heights[filter_tall & filter_short]
array([175, 165, 170, 162, 178, 172])
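Besides & (and), masks can be combined with | (or) and inverted with ~ (not). The sketch below writes the comparisons inline, which requires parentheses around each one because these operators bind more tightly than comparisons such as < and >. The data is repeated so the block is self-contained, and the result names are illustrative.

```python
import numpy as np

# Heights in cm [Female, Male]
heights = np.array([[160, 175],
                    [155, 180],
                    [165, 170],
                    [162, 178],
                    [158, 172]])

# "or": shorter than 160 cm OR taller than 175 cm
extremes = heights[(heights < 160) | (heights > 175)]   # 155, 180, 178, 158

# "not": invert the mask, i.e. everyone who is NOT taller than 160 cm
not_tall = heights[~(heights > 160)]                    # 160, 155, 158
```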
Since heights and weights have the same shape, a mask built from one can be used to filter the other:
weights[filter_tall]
array([70, 77, 58, 68, 54, 75, 72])
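The same idea works on a single column. As a sketch under the assumption that the first column holds the females (as in the comments of the data above), the following block selects the weights of the females taller than 160 cm and averages them. The variable names are illustrative.

```python
import numpy as np

# Heights in cm and weights in kg, columns are [Female, Male]
heights = np.array([[160, 175], [155, 180], [165, 170], [162, 178], [158, 172]])
weights = np.array([[55, 70], [52, 77], [58, 68], [54, 75], [53, 72]])

female_heights = heights[:, 0]
female_weights = weights[:, 0]

# Mask built from the heights, applied to the weights
tall_female_weights = female_weights[female_heights > 160]   # 58, 54
mean_weight = tall_female_weights.mean()                     # 56.0
```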
Extra exercises for home
You try it
What is the average BMI of people taller than 170 cm?
...
Solution
all_bmis = bmi(heights, weights)

filter_tall = heights > 170
np.mean(all_bmis[filter_tall])
23.65782707606685
You try it
Try to change the BMI function so that instead of giving you your BMI, it returns a classification (string). The BMI classifications are:
- Below 18.5: Underweight
- 18.5 – 24.9: Healthy Weight
- 25.0 – 29.9: Overweight
- 30.0 and above: Obesity
...
Solution
def bmi_classification(height, weight):
    score = bmi(height, weight)

    # strict upper bounds, so values such as 24.95 do not fall between classes
    if score < 18.5:
        return "Underweight"
    elif score < 25:
        return "Healthy Weight"
    elif score < 30:
        return "Overweight"
    else:
        return "Obesity"

bmi_classification(180,79)
'Healthy Weight'
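The solution above works for a single person, but an if statement cannot decide on a whole array at once: comparing an array gives a Boolean array, and Python raises a ValueError when asked to treat it as one True or False. One possible array-friendly variant, offered as a sketch rather than as part of the original solution, uses np.select. The function name bmi_classification_vec and the sample data are hypothetical.

```python
import numpy as np

def bmi(height, weight):
    # height in cm, weight in kg (same formula as in the tutorial)
    return weight / ((height / 100.0) ** 2)

def bmi_classification_vec(height, weight):
    # Hypothetical array-friendly variant of bmi_classification
    score = bmi(np.asarray(height), np.asarray(weight))
    conditions = [score < 18.5, score < 25.0, score < 30.0]
    labels = ["Underweight", "Healthy Weight", "Overweight"]
    # np.select takes, for each element, the label of the first condition
    # that is True, and falls back to the default when none is
    return np.select(conditions, labels, default="Obesity")

# Made-up sample values, one per class
sample_heights = np.array([150, 180, 170, 160])
sample_weights = np.array([40, 79, 80, 80])
classes = bmi_classification_vec(sample_heights, sample_weights)
print(classes)
```

Because the conditions are checked in order, each one only needs an upper bound, mirroring the if/elif chain of the scalar version.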