import pandas as pd
Pandas: Better than Excel
Tabular data sets are datasets that look like Excel sheets. They can be imported into Pandas data frames from various file formats.
Pandas data frames
Pandas provides a wide range of data wrangling tools. It is typically imported as:
At the heart of any pandas analysis lies a pythonic equivalent of an excel sheet, called a data frame. It looks as follows:
A dataframe is thus: a 2d array of data plus the column names and an index, which contains the row names. An important difference between numpy
2d arrays and pandas
data frames is that a data frame does not have a data type, each column has its own data type. In pandas, we don’t call them columns, but series.
We can import an data file (excel, csv, …) using the pd.read_x()
command where ‘x’ is csv, excel. The cars data file that will be working with today is available here. You can import it as follows:
= pd.read_csv('./resources/cars.csv') df
There are a lot of extra arguments that we have not specified, you’ll find the most commonly used ones at the bottom of this document. We can show it by just typing its name:
df
introduction | group | brand | model_name | length | width | height | curb_weight | total_max_power_hp | price | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2007 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo 147 | 4223 | 1729.00 | 1442.00 | 1185 | 105 | 21950 |
1 | 2008 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo 159 | 4661 | 1830.00 | 1422.00 | 1360 | 140 | 28450 |
2 | 2013 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo 4C | 3990 | 1868.00 | 1184.00 | 895 | 240 | 75360 |
3 | 2008 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo 8C | 4381 | 1894.00 | 1341.00 | 1560 | 450 | 220000 |
4 | 2009 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo Brera | 4413 | 1830.00 | 1372.00 | 1420 | 200 | 40950 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
586 | 2007 | Geely | Volvo | Volvo V70 | 4823 | 1861.00 | 1540.00 | 1467 | 145 | 39895 |
587 | 2020 | Geely | Volvo | Volvo XC40 | 4425 | 1900.00 | 1652.00 | 2150 | 408 | 58775 |
588 | 2010 | Geely | Volvo | Volvo XC60 | 4628 | 1891.00 | 1713.00 | 1610 | 203 | 49495 |
589 | 2004 | Geely | Volvo | Volvo XC70 | 4733 | 1860.00 | 1562.00 | 1664 | 210 | 49815 |
590 | 2003 | Geely | Volvo | Volvo XC90 | 4798 | 1898.00 | 1743.00 | 2023 | 210 | 56115 |
591 rows × 10 columns
We can extract its index, values and columns using the appropriately properties of the df:
df.index
RangeIndex(start=0, stop=591, step=1)
Without a explicit specification, the index is created automatically, as a RangeIndex
. In this example, since the columns have different data types, df.values
takes object
type. The third component of the data frame is a list with the column names, which can be extracted as the attribute columns
:
df.values
array([[2007, 'Fiat Chrysler Automobiles', 'Alfa Romeo', ..., 1185, 105,
21950],
[2008, 'Fiat Chrysler Automobiles', 'Alfa Romeo', ..., 1360, 140,
28450],
[2013, 'Fiat Chrysler Automobiles', 'Alfa Romeo', ..., 895, 240,
75360],
...,
[2010, 'Geely', 'Volvo', ..., 1610, 203, 49495],
[2004, 'Geely', 'Volvo', ..., 1664, 210, 49815],
[2003, 'Geely', 'Volvo', ..., 2023, 210, 56115]], dtype=object)
df.columns
Index(['introduction', 'group', 'brand', 'model_name', 'length', 'width',
'height', 'curb_weight', 'total_max_power_hp', 'price'],
dtype='object')
We can retreive the number of rows/columns using the familiar shape
command:
df.shape
(591, 10)
We can also retreive the data types of each column using the dtypes
property:
df.dtypes
introduction int64
group object
brand object
model_name object
length int64
width float64
height float64
curb_weight int64
total_max_power_hp int64
price int64
dtype: object
Note that the data type of e.g., the fourth column, for which you would have expected str
, is reported as object
. Don’t worry about this, you can apply string functions to this column, as will be seen later in this course.
Exploring Pandas objects
The methods head
and tail
extract the first and the last rows of a data frame, respectively. The default number of rows extracted is 5, but you can pass a custom number.
2) df.head(
introduction | group | brand | model_name | length | width | height | curb_weight | total_max_power_hp | price | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2007 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo 147 | 4223 | 1729.00 | 1442.00 | 1185 | 105 | 21950 |
1 | 2008 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo 159 | 4661 | 1830.00 | 1422.00 | 1360 | 140 | 28450 |
The content of a data frame can also be explored with the method info
. It reports the dimensions, the data type and the number of non-missing values of every column of the data frame.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 591 entries, 0 to 590
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 introduction 591 non-null int64
1 group 591 non-null object
2 brand 591 non-null object
3 model_name 591 non-null object
4 length 591 non-null int64
5 width 590 non-null float64
6 height 590 non-null float64
7 curb_weight 591 non-null int64
8 total_max_power_hp 591 non-null int64
9 price 591 non-null int64
dtypes: float64(2), int64(5), object(3)
memory usage: 46.3+ KB
Extracting a Series
A Series (column) can be extracted by typing its name in quotes:
'introduction'] df[
0 2007
1 2008
2 2013
3 2008
4 2009
...
586 2007
587 2020
588 2010
589 2004
590 2003
Name: introduction, Length: 591, dtype: int64
# This is faster to type, but gives issues if you have special characters in
# the name of the dataframe's column.
df.introduction
0 2007
1 2008
2 2013
3 2008
4 2009
...
586 2007
587 2020
588 2010
589 2004
590 2003
Name: introduction, Length: 591, dtype: int64
Note how the series is not just the values of that column, but also includes a reference to the name and the index. These can be extracted if you need them:
df.introduction.index
df.introduction.name df.introduction.shape
(591,)
You try it:
Create a pandas Series of the volume of all the cars in m3.
Hint: start by extracting the height, width and length. Then, multiply the three numbers.
...
Ellipsis
Solution
= df['height'] * df['width'] * df['length']
volume = df.height * df.width * df.length volume
Extracting a dataframe
You can also extract a data subdataframe containing a subset of complete columns from a data frame. You can specify this with a list containing the names of those columns:
'introduction', 'price','model_name']] df[[
introduction | price | model_name | |
---|---|---|---|
0 | 2007 | 21950 | Alfa Romeo 147 |
1 | 2008 | 28450 | Alfa Romeo 159 |
2 | 2013 | 75360 | Alfa Romeo 4C |
3 | 2008 | 220000 | Alfa Romeo 8C |
4 | 2009 | 40950 | Alfa Romeo Brera |
... | ... | ... | ... |
586 | 2007 | 39895 | Volvo V70 |
587 | 2020 | 58775 | Volvo XC40 |
588 | 2010 | 49495 | Volvo XC60 |
589 | 2004 | 49815 | Volvo XC70 |
590 | 2003 | 56115 | Volvo XC90 |
591 rows × 3 columns
The function describe
returns a conventional statistical summary. The columns of type object are omitted, except when all the columns have that type. Then the report contains only counts. This function also works for series.
df.describe()
introduction | length | width | height | curb_weight | total_max_power_hp | price | |
---|---|---|---|---|---|---|---|
count | 591.00 | 591.00 | 590.00 | 590.00 | 591.00 | 591.00 | 591.00 |
mean | 2011.40 | 4438.12 | 1808.69 | 1547.02 | 1442.04 | 190.54 | 63054.16 |
std | 5.79 | 471.16 | 106.20 | 203.17 | 379.72 | 146.91 | 91429.29 |
min | 2001.00 | 2338.00 | 1396.00 | 1117.00 | 474.00 | 16.00 | 6995.00 |
25% | 2006.00 | 4178.00 | 1750.00 | 1430.00 | 1164.00 | 100.00 | 21087.00 |
50% | 2011.00 | 4495.00 | 1810.50 | 1513.50 | 1401.00 | 136.00 | 33207.00 |
75% | 2017.00 | 4764.00 | 1881.50 | 1657.25 | 1694.50 | 231.00 | 60778.50 |
max | 2020.00 | 6088.00 | 2070.00 | 2687.00 | 2635.00 | 1000.00 | 909770.00 |
Statistics can also be calculated manually using min
, max
, mean
, std
, quantile
, etc.
df.price.median()
33207.0
Warning: You can extract a subframe with a single column. Beware that this is not the same as a series. df['introduction']
is a series with shape (5,)
, and df[['introduction']]
is a data frame with shape (5,1)! This is a common source of confusion for people starting out in pandas.
Adding data
During the process of handling a dataset, you may want to add new (calculated) columns to the dataframe. You can add them using the column indexing syntax:
'price_squared'] = df['price'] * df['price'] df[
'price_squared'] df[
0 481802500
1 809402500
2 5679129600
3 48400000000
4 1676902500
...
586 1591611025
587 3454500625
588 2449755025
589 2481534225
590 3148893225
Name: price_squared, Length: 591, dtype: int64
Row filtering expressions
In data science, rows are typically filtered by expressions (also called filters, or Boolean masks). An expression is a set of conditions that we wish to impose. They are encoded as a Boolean array.
Let’s say that we want to get all cars introduced from the year 2010 onwards. We know that the introduction
columns contains the year so we could write an expression for it as follows:
= df['introduction'] >= 2010 introduction_post_2010
This variable is Boolean numpy array that is True
if the condition holds and False
otherwise.
introduction_post_2010
0 False
1 False
2 True
3 False
4 False
...
586 False
587 True
588 True
589 False
590 False
Name: introduction, Length: 591, dtype: bool
Such an expression can readily be pluggeds into the indexing of a dataframe to produce another (smaller) one where the condition applies:
df.shape
(591, 11)
df[introduction_post_2010].shape
(369, 11)
For small expression, this is usually written as a one-liner:
'introduction'] >= 2010].head(10) df[df[
introduction | group | brand | model_name | length | width | height | curb_weight | total_max_power_hp | price | price_squared | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2013 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo 4C | 3990 | 1868.00 | 1184.00 | 895 | 240 | 75360 | 5679129600 |
7 | 2020 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo Giulia | 4643 | 1860.00 | 1436.00 | 1404 | 200 | 44170 | 1950988900 |
8 | 2014 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo Giulietta | 4351 | 1798.00 | 1465.00 | 1255 | 105 | 23450 | 549902500 |
9 | 2011 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo MiTo | 4065 | 1721.00 | 1434.00 | 1055 | 70 | 16450 | 270602500 |
10 | 2016 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo Mito | 4063 | 1720.00 | 1453.00 | 1055 | 78 | 17950 | 322202500 |
11 | 2020 | Fiat Chrysler Automobiles | Alfa Romeo | Alfa Romeo Stelvio | 4687 | 1903.00 | 1671.00 | 1635 | 200 | 63335 | 4011322225 |
12 | 2011 | Independent | Aston Martin | Aston Martin Cygnet | 3078 | 1680.00 | 1500.00 | 988 | 98 | 41063 | 1686169969 |
13 | 2018 | Independent | Aston Martin | Aston Martin DB11 | 4739 | 1940.00 | 1279.00 | 1870 | 510 | 262908 | 69120616464 |
14 | 2012 | Independent | Aston Martin | Aston Martin DB9 | 4720 | 1875.00 | 1282.00 | 1785 | 517 | 258501 | 66822767001 |
16 | 2010 | Independent | Aston Martin | Aston Martin Rapide | 5019 | 1875.00 | 1360.00 | 1965 | 476 | 270234 | 73026414756 |
Of course, we can make more complicated expressions. If you want all car models form BMW after the year 2010:
= df['brand'] == 'BMW'
is_bmw = df['introduction'] >= 2010
after_2010
# the & symbol means "and", it is going to check the condition
# row-by-row for both parts of the expression and return True
# if both conditions are True
= is_bmw & after_2010
bmw_after_2010
10) df[bmw_after_2010].head(
introduction | group | brand | model_name | length | width | height | curb_weight | total_max_power_hp | price | price_squared | |
---|---|---|---|---|---|---|---|---|---|---|---|
47 | 2012 | BMW | BMW | BMW 114i | 4324 | 1765.00 | 1421.00 | 1265 | 102 | 25990 | 675480100 |
48 | 2020 | BMW | BMW | BMW 116i | 4319 | 1799.00 | 1434.00 | 1295 | 109 | 29989 | 899340121 |
49 | 2011 | BMW | BMW | BMW 118i | 4360 | 1748.00 | 1411.00 | 1395 | 143 | 36990 | 1368260100 |
50 | 2011 | BMW | BMW | BMW 120i | 4360 | 1748.00 | 1423.00 | 1275 | 170 | 36990 | 1368260100 |
51 | 2018 | BMW | BMW | BMW 216i | 4354 | 1800.00 | 1555.00 | 1325 | 109 | 32940 | 1085043600 |
52 | 2017 | BMW | BMW | BMW 218i | 4432 | 1774.00 | 1413.00 | 1475 | 136 | 47070 | 2215584900 |
53 | 2012 | BMW | BMW | BMW 316i | 4624 | 1811.00 | 1429.00 | 1360 | 136 | 33700 | 1135690000 |
56 | 2020 | BMW | BMW | BMW 318i | 4709 | 1827.00 | 1435.00 | 1445 | 156 | 41226 | 1699583076 |
57 | 2010 | BMW | BMW | BMW 320i | 4612 | 1782.00 | 1384.00 | 1570 | 170 | 51000 | 2601000000 |
58 | 2017 | BMW | BMW | BMW 418i | 4640 | 1825.00 | 1377.00 | 1405 | 136 | 41093 | 1688634649 |
Once expressions and indexing mechanisms become more complicated, I advise you to apply them step-by-step in order to not cause confusion. Here’s another example where we apply both a row and column filter:
= df['introduction'] > 2000
myfilter = df[myfilter]
filtered_data = filtered_data[['introduction', 'model_name', 'price']] filtered_data_just_3_columns
A lot of questions in data science can be answered using complex combinations of these filters - something that gets pretty cumbersome pretty quickly in Excel …
You try it
Extract mass-consumer cars built after 2010. What’s their average price?
We define a mass-consumer car as one for which the price is less than 25K dollars.
Hint: work in steps:
- first filter the data
- then extract the price
- then apply the
mean()
function
...
Ellipsis
Subsetting by index and label
Besides this, there are two additional ways to carry out a selection:
- Selection by label is specified by adding
.loc
after the name of the data frame. The selection of the rows is based on the index, and that of the columns is based on the column names. - Selection by position uses
.iloc
. The selection of the rows is based on the row number and that of the columns on the column number.
In both cases, if you enter a single specification inside the brackets, it refers to the rows. If you enter two specifications, the first one refers to the rows and the second one to the columns.
# df.loc[row_labels, column_labels]
1,2,3,4,5],['model_name','price']] df.loc[[
model_name | price | |
---|---|---|
1 | Alfa Romeo 159 | 28450 |
2 | Alfa Romeo 4C | 75360 |
3 | Alfa Romeo 8C | 220000 |
4 | Alfa Romeo Brera | 40950 |
5 | Alfa Romeo Crosswagon | 38855 |
1:5,2:6] df.iloc[
brand | model_name | length | width | |
---|---|---|---|---|
1 | Alfa Romeo | Alfa Romeo 159 | 4661 | 1830.00 |
2 | Alfa Romeo | Alfa Romeo 4C | 3990 | 1868.00 |
3 | Alfa Romeo | Alfa Romeo 8C | 4381 | 1894.00 |
4 | Alfa Romeo | Alfa Romeo Brera | 4413 | 1830.00 |
Pivot Tables
We will not cover all the possible functions of pandas here now, but as a rule of thumb if you can do it in Excel, there’s a (usually faster) way to do it in Pandas. A more complex - but frequently used - Excel operation is the pivot function:
In pandas, this can be done in one line of code by calling the .pivot_table()
on your dataframe. By default, it will calculate the average, but you can specify other functions using the aggfunc
argument.
='group',columns='introduction',values='price') df.pivot_table(index
introduction | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
group | ||||||||||||||||||||
BMW | NaN | NaN | 38250.00 | NaN | 53700.00 | 57000.00 | 89365.00 | NaN | NaN | 51000.00 | 36990.00 | 27995.00 | 62478.50 | 29495.00 | 112168.00 | NaN | 47070.00 | 43347.00 | 78968.50 | 41226.00 |
Daimler AG | 17555.00 | 28090.00 | 24350.00 | NaN | 10490.00 | NaN | 70901.00 | 78550.00 | 31950.00 | 145445.00 | 48450.00 | NaN | 48395.00 | 61470.00 | 75108.00 | 45538.00 | NaN | 39086.00 | 50795.00 | 53990.00 |
Fiat Chrysler Automobiles | 15175.00 | 52395.00 | 124507.00 | 32500.00 | 27075.00 | 17950.00 | 27472.50 | 27622.50 | 27725.00 | 25990.00 | 49470.00 | 41395.00 | 48677.50 | 23450.00 | NaN | 35165.00 | 136008.00 | NaN | 25355.00 | 31315.00 |
Ford Motor Company | NaN | NaN | 17645.00 | NaN | 15695.00 | NaN | NaN | NaN | NaN | NaN | NaN | 19555.00 | NaN | 19490.00 | 30425.00 | NaN | NaN | 21635.00 | 74995.00 | 49877.50 |
Geely | 46222.00 | 22635.00 | 56115.00 | 49815.00 | NaN | 45800.00 | 26995.00 | NaN | 59922.00 | 49495.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 49824.00 |
General Motors | 39498.00 | NaN | NaN | 115450.00 | 16595.00 | 27230.00 | 29464.00 | 11295.00 | NaN | 49755.00 | 39245.00 | NaN | 26795.00 | 18995.00 | 66970.00 | NaN | 103195.00 | NaN | NaN | NaN |
Groupe PSA | 28995.00 | 18100.00 | 20897.50 | 18990.00 | 23845.00 | 55490.00 | 36307.50 | 31340.00 | 12020.00 | 23665.00 | 37092.50 | 14245.00 | 18200.00 | 29080.00 | 13715.00 | 30382.50 | 32387.00 | 30430.00 | 26624.50 | 35625.00 |
Honda | 24940.00 | 126000.00 | NaN | 40730.00 | NaN | NaN | NaN | 14900.00 | 19990.00 | NaN | NaN | NaN | 24990.00 | NaN | NaN | NaN | 25045.00 | 33872.50 | NaN | 34500.00 |
Hyundai Motor Group | 15375.00 | NaN | NaN | 36145.00 | 10495.00 | NaN | NaN | NaN | NaN | 34995.00 | 21795.00 | 8695.00 | NaN | NaN | NaN | 25041.00 | 21747.00 | 33207.00 | 27132.00 | 24097.00 |
Independent | 28725.00 | 586919.00 | 40794.50 | 244387.00 | 329816.00 | 21130.00 | 25795.00 | 72995.00 | 57995.00 | 185097.00 | 236965.00 | 285993.50 | 26495.00 | 96194.00 | 169542.50 | NaN | 169101.00 | 227868.50 | 29580.00 | 276937.50 |
RNMA Alliance | 19595.00 | 28247.50 | 17295.00 | 30699.00 | 14760.00 | 44515.00 | 25990.00 | 59550.00 | 32036.50 | 22990.00 | 27493.50 | 11890.00 | 20240.00 | 31135.00 | 35940.00 | 49945.00 | 52035.00 | 35883.50 | 29825.00 | 29410.00 |
Suzuki | 14999.00 | NaN | NaN | NaN | NaN | 11227.00 | NaN | 16057.50 | NaN | 29999.00 | NaN | NaN | NaN | NaN | 10244.00 | 15144.00 | NaN | 20744.00 | NaN | 63900.00 |
Tata Motors | NaN | 99800.00 | 33500.00 | 52200.00 | NaN | NaN | 73795.00 | NaN | NaN | NaN | 30970.00 | 61600.00 | NaN | 53120.00 | 101460.00 | NaN | 95540.00 | NaN | NaN | 60506.00 |
Toyota | 12249.00 | 62475.00 | 36550.00 | NaN | 10515.00 | 10699.00 | 22890.00 | NaN | 17397.50 | 12150.00 | 18995.00 | 39990.00 | NaN | NaN | 31980.00 | 61395.00 | 68552.50 | 38100.00 | 34182.50 | 40247.50 |
Volkswagen Group | 326860.00 | 83615.00 | 45125.00 | NaN | 107440.00 | 18085.00 | NaN | 98770.00 | 21295.00 | 38940.00 | 25755.00 | 56825.00 | 45750.00 | 127050.00 | 42930.00 | 82380.00 | 31210.00 | 62277.00 | 33155.00 | 42585.00 |
The NaN
(not a number) values are places where there was no entry. E.g., there was no car in the dataset for BMW group in 2001.
Plotting with pandas
Behind the scenes pandas calls on other python libraries, including numpy
but also matplotlib
. This means that it contains a lot of convenience functions. One that we will encounter in the next classes is the ability to plot things. Those can be accessed by calling .plot
on the Series or Dataframe.
E.g., to investigate the relation between height of a car and its horse power, we could do a scatter plot:
#df.plot.scatter(x='height',y='total_max_power_hp',alpha=0.5);
df.corr()
introduction | length | width | height | curb_weight | total_max_power_hp | price | price_squared | |
---|---|---|---|---|---|---|---|---|
introduction | 1.00 | 0.19 | 0.31 | 0.15 | 0.23 | 0.10 | 0.01 | -0.05 |
length | 0.19 | 1.00 | 0.83 | 0.29 | 0.84 | 0.44 | 0.32 | 0.14 |
width | 0.31 | 0.83 | 1.00 | 0.24 | 0.82 | 0.60 | 0.49 | 0.29 |
height | 0.15 | 0.29 | 0.24 | 1.00 | 0.37 | -0.35 | -0.34 | -0.28 |
curb_weight | 0.23 | 0.84 | 0.82 | 0.37 | 1.00 | 0.53 | 0.36 | 0.14 |
total_max_power_hp | 0.10 | 0.44 | 0.60 | -0.35 | 0.53 | 1.00 | 0.87 | 0.60 |
price | 0.01 | 0.32 | 0.49 | -0.34 | 0.36 | 0.87 | 1.00 | 0.88 |
price_squared | -0.05 | 0.14 | 0.29 | -0.28 | 0.14 | 0.60 | 0.88 | 1.00 |
To see how this pattern evolved over time we could color dots per year
='height',y='total_max_power_hp', c='introduction'); df.plot.scatter(x
We will see in the next class what all the commands below mean, but I’m writing it here to show to you what a typical flow would look like to do an exploratory plot of your data: you concatenate a long sequence of commands until at then you reach a plot
statement. Note the \
symbol at the end of each line which means - continue the statement on the next line as if it were on this line. This is to aid legibility but you could perfectly write it on one line if you wanted to:
'introduction')['total_max_power_hp']\
df.groupby(\
.median()5)\
.rolling(\
.mean()
.plot(='Horse Power over Time',
title='Horse Power',
ylabel='Introduction Year',
xlabel='r--',
style=[2005,2020]
xlim; )
I recommend that at home you diagonally go through the official documentation to see all the supported plots, but you can always ask ChatGPT for help if you have something specific in mind.
Extra Arguments for Loading Datasets
Although defaults work in most cases satisfactorily, it is worth to comment a few things about some optional arguments of pd.read_csv()
command. The list is not complete, but enough to give you an idea of the extent to which you can customize this function.
sep
specifies the column separator. The default issep=','
, but CSV files created with Excel may needsep=';'
.header
andnames
specify the row where the data to be imported start and the column names. The default isheader=0
,names=None
, which makes Pandas start reading from the first row and take it as the column names. When the data come without names, you can useheader=None
,names=namelist
to provide a list of names. With a positive value for header, you can skip some rows.index_col
specifies a column that you wish to use as the index, if that is the case. The default isindex_col=None
. If the intended index comes in the first column, as frequently happens, you useindex_col=0
.usecols
specifies the columns to be read. You can specify them in a list, either by name or by position. The default isusecols=None
, which means that you wish to read all the columns.dtype
specifies the data types of the columns of the data frame. This saves time with big data sets. The default isdtype=None
, which means that Python will guess the data type, based on what it reads. When all the entries in a column are numbers, that column is imported as numeric. If there is, at least, one entry that is not numeric, all read as strings, and data typeobject
is assigned to that column.encoding
. If the string data contained in a CSV file can contain special characters (such as ñ, or á), which can make trouble, you may need to control this. The default in Python isencoding='utf-8'
. So, if you reading a CSV file created in Excel (in Western Europe), you may need to setencoding='latin'
to read the special characters in the right way.