```python
import pandas as pd

fname = './resources/AirBnB-Madrid.csv'
df = pd.read_csv(fname, index_col='listing_id')
```
From Data to Model: AirBnB
AirBnB is an online marketplace and homestay network that allows individuals to list or rent short-term lodging in residential properties. The cost of accommodation is set by the property owner, and the company charges percentage service fees from both guests and hosts for every booking made. Since its inception in 2008, Airbnb has experienced exponential growth and currently boasts over 2 million listings across approximately 200 countries.
Notably, AirBnB regularly updates and releases data on its platform. Interested parties can access this data via the Inside Airbnb site (insideairbnb.com). The most recent update, posted last month, covers more than 100 areas, with the majority of these being in the US and Europe. For the purpose of this example, we will use local data, which was captured on May 9th, 2023.
📁 Loading the data
Description
The file AirBnB-Madrid.csv
contains data on 18,302 local Airbnb listings. The language in the descriptions is typically English or Spanish (with exceptions). The text comes in UTF-8 encoding, so special characters may not be correctly shown (in Spanish words such as “habitación”) in Windows machines and Microsoft applications.
The variables are:
- listing_id, a unique listing’s ID. An active listing is a property listed on Airbnb. Listings may include entire homes or apartments, private rooms or shared spaces.
- host_id, a unique host’s ID.
- host_since, the date of the host’s first listing in Airbnb, as yyyy-mm-dd.
- name, the listing’s name. A minimal description (maximum 35 characters) of the place, intended to be appealing, such as “Centric Bohemian next Ramblas&Macba”.
- neighbourhood, the neighbourhood of the listing. The neighbourhoods are sourced from city or open source GIS files.
- property_type, the type of property listed. Typically ‘Apartment’, ‘Bed & Breakfast’ or ‘House’, but it can also be ‘Boat’, ‘Loft’, or others.
- room_type, taking values ‘Entire home/apt’, ‘Hotel room’, ‘Private room’ and ‘Shared room’.
- bedrooms, the number of available bedrooms.
- price, the daily listing’s price on that date in euros. The price shown is for the listing as a whole, not per person. The price that you see when you search Airbnb (with dates selected) is the total price divided by the number of nights you selected. When a listing has been booked for several days, the price can be lower, since the host can apply different prices depending on the number of days booked.
- number_of_reviews, the number of reviews of this listing that have been posted.
- review_scores_rating, the average reviewers’ rating of overall experience (what was your guest’s overall experience?). Listings are rated in the range 0-5.
Source: Inside Airbnb, edited.
Importing the data
Let’s import the data to a Pandas data frame, taking listing_id as the index and leaving the rest at its default settings. You can download it here.
Our Data Processing Pipeline
```mermaid
flowchart LR
    A[Import Data]
    A --> B(Check Duplicates)
    A --> C(Handle Null Values)
    A --> C2(...)
    B --> Clean[Clean Data]
    C --> Clean
    C2 --> Clean
    Clean --> D[Descriptive]
    D --> E[Modeling]
```
🧹 Data Cleaning
```python
df.describe()
```

| | host_id | bedrooms | price | number_of_reviews | review_scores_rating |
|---|---|---|---|---|---|
count | 2.123900e+04 | 19667.000000 | 21239.000000 | 21239.000000 | 17168.000000 |
mean | 1.818276e+08 | 1.475467 | 139.106173 | 42.764396 | 4.596427 |
std | 1.626871e+08 | 0.846230 | 660.313831 | 78.737076 | 0.621636 |
min | 7.952000e+03 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 3.255115e+07 | 1.000000 | 50.000000 | 1.000000 | 4.500000 |
50% | 1.289901e+08 | 1.000000 | 85.000000 | 10.000000 | 4.750000 |
75% | 3.205107e+08 | 2.000000 | 133.000000 | 47.000000 | 4.920000 |
max | 5.050783e+08 | 18.000000 | 71577.000000 | 889.000000 | 5.000000 |
We can do a preliminary check of the dataframe using info and head:
```python
df.head(3)
df.tail(3)
```
| | host_id | host_since | name | neighbourhood | property_type | room_type | bedrooms | price | number_of_reviews | review_scores_rating |
|---|---|---|---|---|---|---|---|---|---|---|
| listing_id | | | | | | | | | | |
843487874169862949 | 230936437 | 2018-12-16 | Habitación privada en Madrid. | Valverde | Private room in rental unit | Private room | 1.0 | 46.0 | 1 | 5.0 |
843544239423340038 | 455852114 | 2022-04-24 | Habitación Privada Confort | Ventas | Private room in rental unit | Private room | 1.0 | 28.0 | 1 | 5.0 |
843551963497475593 | 504579859 | 2023-03-09 | Habitación en piso compartido | Imperial | Private room in rental unit | Private room | 1.0 | 60.0 | 0 | NaN |
```python
df.info()
```
<class 'pandas.core.frame.DataFrame'>
Index: 21239 entries, 6369 to 843551963497475593
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 host_id 21239 non-null int64
1 host_since 21236 non-null object
2 name 21235 non-null object
3 neighbourhood 21239 non-null object
4 property_type 21239 non-null object
5 room_type 21239 non-null object
6 bedrooms 19667 non-null float64
7 price 21239 non-null float64
8 number_of_reviews 21239 non-null int64
9 review_scores_rating 17168 non-null float64
dtypes: float64(3), int64(2), object(5)
memory usage: 1.8+ MB
This already tells us that some columns have missing values. That is a fact of life in data science, so we need to be able to handle it.
Missing values
Missing values are denoted by NaN in Pandas. When a Pandas object is built, both plain Python’s None and NumPy’s nan are taken as NaN. Since np.nan has type float, a numeric series containing NaN values gets type float.
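A quick sanity check of this typing behaviour, using a small made-up series:

```python
import numpy as np
import pandas as pd

# Both None and np.nan are stored as NaN, and the NaN forces a float dtype
s = pd.Series([1, None, 3, np.nan])
print(s.dtype)         # float64
print(s.isna().sum())  # 2
```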
Three useful Pandas methods related to missing values, which can be applied to both series and data frames, are:

- isna returns a Boolean mask indicating which terms are missing.
- fillna is used for replacing NaN’s by a fixed value, set by the user.
- dropna returns the same data frame minus the rows that contain at least one missing value. If a list of columns is specified, the missing values are searched only in those columns.
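As a minimal sketch of these three methods on a small made-up frame (not part of the Airbnb data):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

mask = demo.isna()                   # Boolean mask, same shape as demo
filled = demo.fillna({'a': 0.0})     # replace NaN's in column 'a' by 0.0
trimmed = demo.dropna(subset=['b'])  # drop rows where 'b' is missing

print(mask.values.sum())  # 2 missing values in total
print(trimmed.shape)      # (2, 2)
```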
For this dataset, in particular, we have a relevant number of listings for which rating scores are not available (many guests don’t rate the lodgings). A report on all missing values can be printed with:
```python
df.isna().sum()
```
host_id 0
host_since 3
name 4
neighbourhood 0
property_type 0
room_type 0
bedrooms 1572
price 0
number_of_reviews 0
review_scores_rating 4071
dtype: int64
Note that df.isna() is a data frame of the same shape as df, whose terms have type bool. With sum, we get the number of True values per column.
We can drop the NA’s with the dropna() command. Since I’m not going to analyze ratings or bedrooms for now, I will only drop NA’s for name and host_since.

```python
print('Before dropping NAs:', df.shape)
df = df.dropna(subset=['name','host_since'])
print('After dropping NAs:', df.shape)
```
Before dropping NAs: (21239, 10)
After dropping NAs: (21232, 10)
Duplicates
There are two useful Pandas functions for managing duplicates:
- drop_duplicates drops the duplicated entries (in a series) or the duplicated rows (in a data frame).
- duplicated returns a Boolean series indicating which entries (for a series) or which rows (for a data frame) are duplicated.
For instance, the listing’s ID in this data set is never duplicated, so it makes sense to use it as the index:
```python
df.index.duplicated().sum()
```
0
But a different question is whether several listings have exactly the same data (e.g., this happens when they refer to beds in a shared bedroom). How often does this occur in our dataset?

```python
df.duplicated().sum()
```
22
How does duplicated work? It reads the data top-down. For entries that occur for the first time, it returns False. For entries that have already occurred before, it returns True.
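For instance, on a small made-up series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'c'])

# first occurrences are False, repeats are True
print(s.duplicated().tolist())       # [False, False, True, True, False]
print(s.drop_duplicates().tolist())  # ['a', 'b', 'c']
```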
Let us now permanently remove the duplicates from the dataframe using the drop_duplicates function.
```python
print('Before dropping duplicates:', df.shape)
df = df.drop_duplicates()
print('After dropping duplicates:', df.shape)
df.shape
```
Before dropping duplicates: (21232, 10)
After dropping duplicates: (21210, 10)
(21210, 10)
Problem (?) solved!
It’s good practice to save the dataframe after preprocessing; this allows you to continue with a clean dataset going forward.
```python
df.to_csv('./resources/AirBnB-cleaned.csv')
```
💡 Descriptive Analytics
We reload the dataset.
```python
df = pd.read_csv('./resources/AirBnB-cleaned.csv', index_col='listing_id')
```
Much of the work of a data scientist comes down to manipulating the data to answer specific business questions. Most of these operations come in the form of either simple statistics applied to the whole dataset, or simple statistics applied to subgroups of the dataset.
Let’s try to answer a few questions now to see examples of such procedures:
Is AirBnB still P2P?
Airbnb was initially designed to function as a peer-to-peer (P2P) network, where individuals could rent out their apartments to others. However, some critics argue that it has now evolved into a platform that allows investors with large real estate portfolios to offer their properties for rent. The question remains: is this the case here?
One way to investigate this is by examining the ratio of the number of listings to the number of hosts on the platform. By extracting the list of unique host IDs as the index of the series returned by value_counts, we can obtain a ranked list of hosts based on the number of listings they have. Alternatively, we can obtain an array of these IDs using the unique method.
```python
df['host_id'].unique()
```
array([ 13660, 83531, 346366726, ..., 230936437, 455852114,
504579859])
To count them:
```python
# option 1: look at the shape
df['host_id'].unique().shape

# option 2: use nunique()
df['host_id'].nunique()
```
11073
This is the number of hosts. Compare it with the number of listings:

```python
df.shape
```
(21210, 10)
So how many hosts are actually following the original AirBnB philosophy?
Let’s take a look at the original dataframe. There too, we can check the value counts to see the number of listings per host. Then we count the number of hosts that have just 1 property.
```python
df['host_id'].value_counts()
```
host_id
377605855 263
346367515 148
471977881 143
434208714 127
28038703 101
...
205944086 1
216063028 1
11857294 1
53974722 1
504579859 1
Name: count, Length: 11073, dtype: int64
Let’s see how many have just 1:
```python
((df['host_id'].value_counts() == 1)*1).sum()
```
8503
We can easily identify the hosts with the biggest portfolios:
```python
df['host_id'].value_counts().head(10)
```
host_id
377605855 263
346367515 148
471977881 143
434208714 127
28038703 101
291253690 97
107335103 96
107141654 96
40877988 86
17499500 80
Name: count, dtype: int64
You try it
The number of hosts with more than 10 listings is:
...
Ellipsis
Solution
```python
(df['host_id'].value_counts() > 10).sum()
```
194
How are experienced hosts behaving?
The file contains some information about the hosts, that is, the people renting out the lodgings. For instance, you may be interested in the more experienced hosts, using a Boolean mask like df['host_since'] < '2012-01-01' to select the listings with these more experienced hosts.
```python
filter_experienced = df['host_since'] < '2012-01-01'
df_experienced = df[filter_experienced]

df_experienced.shape
```
(437, 10)
Looks like there are 437 listings from hosts that joined before 2012. How many listings does each of them have? Let’s use the value_counts function to find out.
You try it
Count how many listings each host that joined before 2012 has:
- Figure out which hosts joined before 2012.
- Use this as a filter.
- Apply value_counts to the resulting dataframe.
...
Ellipsis
Solution
```python
filter_experienced = df['host_since'] < '2012-01-01'
df_experienced = df[filter_experienced]
df_experienced['host_id'].value_counts()
```
host_id
1408525 65
901218 36
596469 26
142578 11
1108348 9
..
606925 1
1206817 1
70273 1
901541 1
1499853 1
Name: count, Length: 179, dtype: int64
So, we have quite a few hosts in this list. Many of them have one listing, but the top hosts look like companies. You may wonder how frequent this is.
Distribution of the price
What’s the distribution of the listing’s price? We can explore this point with a histogram. The histogram is returned by the Pandas method plot.hist().

```python
df['price'].plot.hist();
```
A statistical summary, as the one returned by df.describe(), can be helpful:

```python
df['price'].describe()
```
count 21210.000000
mean 139.201367
std 660.758099
min 0.000000
25% 50.000000
50% 85.000000
75% 133.000000
max 71577.000000
Name: price, dtype: float64
Alternatively, we can compute other quantiles using df.quantile():

```python
quantiles = df['price'].quantile(q=[0.05,0.95])
quantiles
```
0.05 23.0
0.95 331.0
Name: price, dtype: float64
You try it
These 5-95% quantiles look like reasonable limits for trimming. Try plotting the distribution for just this part of the dataset.
- Calculate the quantiles
- Filter out bad data (outliers)
- Plot the distribution
The goal is to recreate the plot below:
...
Ellipsis
Solution
```python
df_no_outlier = df[(df['price'] >= 23) & (df['price'] <= 331)]
df_no_outlier.price.plot.hist(figsize=(8,4), color='gray', rwidth=0.9, bins=40).set_xlim(50,320);
```
Fancy Solution with Bar Highlights
```python
import matplotlib.pyplot as plt

# I start at 25 here to ensure that round number bars are prominently shown
df_no_outlier = df[(df['price'] >= 25) & (df['price'] <= 331)]
df_no_outlier.price.plot.hist(figsize=(8,4), color='gray', rwidth=0.9, bins=40);

# We repeat the plot, but this time keep track of the bars
hist_data, bins, bars = plt.hist(df_no_outlier.price, bins=40, rwidth=0.9, color='gray')

# Then, we define the indexes of the bars that we want to highlight
highlight_bins = [0, 9, 16, 22, 29, 35]

# Loop through them and change their fc (face color)
for i in highlight_bins:
    bars[i].set_fc('red')

#plt.show()
plt.gca().set_xlim(50,320);
```
Hint: To make it look more like mine, use the argument bins, which specifies the number of intervals into which the range covered by the data is partitioned. Note how intervals starting at multiples of 10 have higher bars. This is not rare: prices don’t change continuously, and hosts prefer prices which are multiples of 10. The histogram also shows that e.g., 50, 100 and 150 euros are popular prices.
Note. The rule for counting the observations in every bin is as follows. For an interval with limits a and b, the values x such that a <= x < b are counted, except for the last interval, for which the right limit is also included.
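This half-open convention comes from NumPy’s histogram routine, which the plotting machinery relies on; a minimal check with made-up values:

```python
import numpy as np

# Three bins over [0, 3): [0, 1), [1, 2) and [2, 3]
counts, edges = np.histogram([0, 1, 2, 3], bins=3, range=(0, 3))
print(counts)  # [1 1 2] -- the value 3 lands in the last bin, whose right limit is closed
```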
Average price per room type
Suppose that you are interested in the average price per room type. You can get this by applying groupby(), which works as follows:

This method groups the rows according to one or more grouping variables. Here, you specify groupby(by='room_type'). Next, you select the columns to be aggregated, which here would be just price. Finally, you specify the aggregation function. Having only the mean, you can do it just by adding mean() or by specifying .agg('mean'). The method round() works as you would expect it to.
```python
df.groupby('room_type')
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa6660e2690>
```python
df.groupby('room_type')['price']
```
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fa6604a0f10>
```python
df.groupby('room_type')['price'].mean()
```
room_type
Entire home/apt 160.338694
Hotel room 204.131148
Private room 98.362606
Shared room 104.191638
Name: price, dtype: float64
The above way of writing a groupby is a shorthand notation for the full version using agg, as shown below:

```python
df.groupby(by='room_type')['price'].agg('mean').round()
```
room_type
Entire home/apt 160.0
Hotel room 204.0
Private room 98.0
Shared room 104.0
Name: price, dtype: float64
While the differences in prices across room types may be as expected (does anything stand out?), relying on the average price alone may not always provide an accurate representation of the data. Skewed distributions can result in extreme values in the right tail, which can significantly affect the mean price and lead it to deviate from the central tendency of the data. In such cases, the median price provides a more representative measure of the mid-point of the distribution.
To calculate the median price per room type, the aggregation function mean in our calculations should be replaced by median. Similar rules apply for the ‘usual’ statistics:

```python
df.groupby(by='room_type')['price'].agg(['min','max','mean','median']).round()
```
| | min | max | mean | median |
|---|---|---|---|---|
| room_type | | | | |
Entire home/apt | 10.0 | 71577.0 | 160.0 | 104.0 |
Hotel room | 0.0 | 1149.0 | 204.0 | 125.0 |
Private room | 8.0 | 40714.0 | 98.0 | 40.0 |
Shared room | 11.0 | 3000.0 | 104.0 | 28.0 |
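A tiny made-up illustration of how a single extreme value drags the mean while leaving the median untouched:

```python
import pandas as pd

# nine modest prices plus one extreme listing
prices = pd.Series([30] * 9 + [3000])
print(prices.mean())    # 327.0
print(prices.median())  # 30.0
```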
You can even provide your own functions!
```python
# price_series is a series passed to the groupby function
# it is applied to every split-group
def myquantiles(price_series):
    return [
        price_series.quantile(q=.5).round(),
        price_series.quantile(q=.95).round()
    ]

df.groupby(by='room_type')['price'].agg(['min','max','mean',myquantiles]).round()
```
| | min | max | mean | myquantiles |
|---|---|---|---|---|
| room_type | | | | |
Entire home/apt | 10.0 | 71577.0 | 160.0 | [104.0, 324.0] |
Hotel room | 0.0 | 1149.0 | 204.0 | [125.0, 1000.0] |
Private room | 8.0 | 40714.0 | 98.0 | [40.0, 350.0] |
Shared room | 11.0 | 3000.0 | 104.0 | [28.0, 481.0] |
Note that myquantiles is called in the ‘gear’ / agg step and is thus applied to every single split-group separately. The input (price_series) is the prices for that specific group. The output is the quantiles for that specific group. These outputs are merged into one column called myquantiles.
The aggregation functions are applied to each column separately. If instead, you would like to use all columns to do some advanced calculation, you can use the `.apply(myfunction)` method instead of the `.agg(myfunction)` method.
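As a sketch of that difference, with a made-up frame and a hypothetical price_per_night function (neither is part of this dataset's analysis): agg hands each selected column to the function separately, while apply hands over the whole sub-frame per group, so the function can combine columns.

```python
import pandas as pd

demo = pd.DataFrame({'g': ['a', 'a', 'b'],
                     'price': [10, 20, 40],
                     'nights': [1, 2, 4]})

# .apply sees the full sub-frame per group, so it can mix columns
def price_per_night(group):
    return (group['price'] / group['nights']).mean()

result = demo.groupby('g')[['price', 'nights']].apply(price_per_night)
print(result)  # both groups work out to 10.0 per night
```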
The results obtained from using the median price provide a more meaningful perspective compared to the earlier results, where shared rooms appeared to be more expensive than private rooms. Using the median price allows for a more accurate representation of the mid-point of the price distribution and reduces the influence of extreme values that can bias the results.
Top-10 neighbourhoods
A natural analysis of the local Airbnb data is to take a look at the most popular neighbourhoods by number of listings. The top ten list can be extracted with value_counts:

```python
df['neighbourhood'].value_counts().head(10)
```
neighbourhood
Embajadores 2573
Universidad 2009
Palacio 1679
Sol 1187
Justicia 1049
Cortes 931
Trafalgar 394
Palos de Moguer 373
Goya 341
Argüelles 312
Name: count, dtype: int64
Are these neighbourhoods more, or perhaps less, expensive? One could wonder if there is an association between the median price and the amount of property listed (following the laws of supply and demand).
```python
nb_counts = df.groupby(by='neighbourhood')['price'].agg(['count','mean','median'])

nb_counts.sort_values('count', ascending=False).head(10)
```
| | count | mean | median |
|---|---|---|---|
| neighbourhood | | | |
Embajadores | 2573 | 145.029149 | 86.0 |
Universidad | 2009 | 127.534594 | 96.0 |
Palacio | 1679 | 144.176891 | 100.0 |
Sol | 1187 | 153.972199 | 112.0 |
Justicia | 1049 | 151.954242 | 100.0 |
Cortes | 931 | 149.902256 | 115.0 |
Trafalgar | 394 | 192.180203 | 95.5 |
Palos de Moguer | 373 | 206.426273 | 82.0 |
Goya | 341 | 211.741935 | 116.0 |
Argüelles | 312 | 127.147436 | 90.5 |
Sometimes, we prefer to name the columns or use special functions that are not built-in by default. This can be done by providing a dict instead of a list to the agg function. The keys will be the names of the columns, the values are the calculations to be done:
```python
def my_property_function(x):
    return x.mode()

df.groupby(by='neighbourhood').agg({
    'price' : ['count', 'median'],
    'property_type': [my_property_function]
})
```
| neighbourhood | price / count | price / median | property_type / my_property_function |
|---|---|---|---|
Abrantes | 40 | 37.0 | Private room in rental unit |
Acacias | 169 | 64.0 | Entire rental unit |
Adelfas | 96 | 70.0 | Entire rental unit |
Aeropuerto | 8 | 30.5 | Private room in rental unit |
Aguilas | 41 | 45.0 | Private room in rental unit |
... | ... | ... | ... |
Valverde | 95 | 65.0 | Entire rental unit |
Ventas | 174 | 50.0 | Private room in rental unit |
Vinateros | 23 | 40.0 | Private room in rental unit |
Vista Alegre | 124 | 48.5 | Private room in rental unit |
Zofío | 26 | 68.5 | Entire rental unit |
126 rows × 3 columns
You try it
What is the actual correlation between price and number of listings? Follow these steps to find out:
- Do a group-by to get the counts and the median price:
  - group by neighbourhood
  - extract the price as a Series
  - aggregate to get the count and the median
- Use the .corr() function to obtain the correlation.
Your goal is to produce the following table:
...
Ellipsis
Solution
```python
df.groupby(by='neighbourhood')['price'].agg(['count','median']).corr()
```

| | count | median |
|---|---|---|
count | 1.000000 | 0.177183 |
median | 0.177183 | 1.000000 |
```python
df.groupby('neighbourhood')['price'].agg(['count','median','mean']).corr()
```

| | count | median | mean |
|---|---|---|---|
count | 1.000000 | 0.177183 | 0.070872 |
median | 0.177183 | 1.000000 | 0.600848 |
mean | 0.070872 | 0.600848 | 1.000000 |
The correlation is weak, so supply alone does not explain the prices; something more must be going on. We’ll try some actual statistics towards the end of the class.
Pivoting
In this dataset too, a pivot table can be handy. Let us try to answer the following question.
You try it
What is the median price of a property in each neighbourhood? Organize by room type:
- Rows = room type
- Columns = neighbourhood
- Values = price
Use the syntax df.pivot_table(index=..., columns=..., values=...)
Straightforward Solution
```python
pivot = df.pivot_table(index='room_type', columns='neighbourhood', values='price', aggfunc='median')
```
Advanced Solution
```python
# Here's a more complex solution, I will pivot on 2 statistics, both the median and the count.
# I use the count to remove any neighbourhood with very few records as we cannot calculate
# meaningful statistics for those.
#
# Then, I sort the values. I will use Entire home/apt prices as a leading indicator.

# 1. pivot on both stats
pivot = df.pivot_table(index='room_type', columns='neighbourhood', values='price', aggfunc=['median','count'])

# 2. find out/filter which neighbourhoods have less than 10 recorded prices in Entire home/apt
count_columns = pivot['count']
valid_neighborhoods = count_columns.loc['Entire home/apt'][count_columns.loc['Entire home/apt'] >= 10].index
filtered_pivot = pivot.loc[:, pivot.columns.get_level_values(1).isin(valid_neighborhoods)]

# 3. we're only interested in the median price, so we pick that one. Then, we also sort from
# most to least expensive
filtered_pivot['median'].T.sort_values(by='Entire home/apt', ascending=False).T
```
neighbourhood | Rosas | Jerónimos | Castellana | Goya | Recoletos | Colina | Niño Jesús | Sol | Rios Rosas | Cortes | ... | Adelfas | Marroquina | Almendrales | Pradolongo | Concepción | Zofío | San Andrés | Aluche | Vista Alegre | Entrevías |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
room_type | |||||||||||||||||||||
Entire home/apt | 165.0 | 158.0 | 151.0 | 141.0 | 140.0 | 140.0 | 135.0 | 126.0 | 125.0 | 125.0 | ... | 75.0 | 75.0 | 75.0 | 73.5 | 73.0 | 73.0 | 70.0 | 69.5 | 67.0 | 66.0 |
Hotel room | NaN | 0.0 | NaN | NaN | 197.0 | NaN | NaN | 41.0 | 106.5 | 159.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 200.0 | NaN | NaN | NaN |
Private room | 375.0 | 46.5 | 35.0 | 43.5 | 39.0 | 38.5 | 46.0 | 54.0 | 38.0 | 60.0 | ... | 35.0 | 41.0 | 35.0 | 41.0 | 29.0 | 28.0 | 28.0 | 28.0 | 30.0 | 32.0 |
Shared room | 1000.0 | NaN | 168.5 | NaN | NaN | NaN | 18.0 | 30.0 | 42.0 | 31.0 | ... | NaN | NaN | NaN | 17.0 | 180.0 | NaN | 11.0 | 18.0 | NaN | 15.0 |
4 rows × 100 columns
📈 Build a regression model to predict prices of AirBnB listings.
Lastly, we could build some basic statistical regressions to model the prices of local AirBnB listings. This could allow us to potentially spot over- or under-valued rooms. For home-owners, on the other hand, it could give an indication of what the right price for a property would be.
We start by importing statsmodels as explained in the accompanying document:
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
```
We will only look at the dataframe without outliers here.
```python
model1 = smf.ols("""
    price ~ bedrooms + number_of_reviews + review_scores_rating
""", data=df_no_outlier).fit()

model1.summary()
```
Dep. Variable: | price | R-squared: | 0.233 |
Model: | OLS | Adj. R-squared: | 0.233 |
Method: | Least Squares | F-statistic: | 1455. |
Date: | Wed, 27 Sep 2023 | Prob (F-statistic): | 0.00 |
Time: | 10:00:46 | Log-Likelihood: | -77530. |
No. Observations: | 14380 | AIC: | 1.551e+05 |
Df Residuals: | 14376 | BIC: | 1.551e+05 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
Intercept | 31.6574 | 3.588 | 8.822 | 0.000 | 24.624 | 38.691 |
bedrooms | 36.8712 | 0.559 | 65.956 | 0.000 | 35.775 | 37.967 |
number_of_reviews | -0.0025 | 0.005 | -0.473 | 0.636 | -0.013 | 0.008 |
review_scores_rating | 3.0993 | 0.755 | 4.106 | 0.000 | 1.620 | 4.579 |
Omnibus: | 2498.921 | Durbin-Watson: | 1.858 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 19666.142 |
Skew: | 0.617 | Prob(JB): | 0.00 |
Kurtosis: | 8.595 | Cond. No. | 828. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```python
model2 = smf.ols("""
    price ~ bedrooms + C(neighbourhood) + number_of_reviews + review_scores_rating
""", data=df_no_outlier).fit()

model2.summary()
```
Dep. Variable: | price | R-squared: | 0.346 |
Model: | OLS | Adj. R-squared: | 0.340 |
Method: | Least Squares | F-statistic: | 59.32 |
Date: | Wed, 27 Sep 2023 | Prob (F-statistic): | 0.00 |
Time: | 10:02:28 | Log-Likelihood: | -76386. |
No. Observations: | 14380 | AIC: | 1.530e+05 |
Df Residuals: | 14252 | BIC: | 1.540e+05 |
Df Model: | 127 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
Intercept | -3.4442 | 11.702 | -0.294 | 0.769 | -26.381 | 19.493 |
C(neighbourhood)[T.Acacias] | 14.8142 | 12.185 | 1.216 | 0.224 | -9.071 | 38.699 |
C(neighbourhood)[T.Adelfas] | 3.8755 | 12.717 | 0.305 | 0.761 | -21.051 | 28.802 |
C(neighbourhood)[T.Aeropuerto] | -18.0615 | 27.109 | -0.666 | 0.505 | -71.200 | 35.077 |
C(neighbourhood)[T.Aguilas] | -11.9278 | 15.604 | -0.764 | 0.445 | -42.515 | 18.659 |
C(neighbourhood)[T.Alameda de Osuna] | 15.2433 | 14.125 | 1.079 | 0.281 | -12.443 | 42.930 |
C(neighbourhood)[T.Almagro] | 58.9526 | 12.138 | 4.857 | 0.000 | 35.161 | 82.745 |
C(neighbourhood)[T.Almenara] | 25.3954 | 12.629 | 2.011 | 0.044 | 0.640 | 50.150 |
C(neighbourhood)[T.Almendrales] | 11.1726 | 13.249 | 0.843 | 0.399 | -14.798 | 37.143 |
C(neighbourhood)[T.Aluche] | -15.0626 | 13.910 | -1.083 | 0.279 | -42.329 | 12.203 |
C(neighbourhood)[T.Ambroz] | -23.4862 | 21.792 | -1.078 | 0.281 | -66.201 | 19.229 |
C(neighbourhood)[T.Amposta] | 9.9295 | 36.634 | 0.271 | 0.786 | -61.877 | 81.736 |
C(neighbourhood)[T.Apostol Santiago] | 12.4785 | 15.439 | 0.808 | 0.419 | -17.784 | 42.741 |
C(neighbourhood)[T.Arapiles] | 28.7075 | 12.165 | 2.360 | 0.018 | 4.862 | 52.553 |
C(neighbourhood)[T.Aravaca] | 7.1299 | 14.281 | 0.499 | 0.618 | -20.863 | 35.123 |
C(neighbourhood)[T.Arcos] | -13.7477 | 14.877 | -0.924 | 0.355 | -42.908 | 15.413 |
C(neighbourhood)[T.Argüelles] | 41.8478 | 11.832 | 3.537 | 0.000 | 18.655 | 65.040 |
C(neighbourhood)[T.Atocha] | 35.5168 | 17.071 | 2.081 | 0.037 | 2.056 | 68.978 |
C(neighbourhood)[T.Bellas Vistas] | 13.8640 | 12.187 | 1.138 | 0.255 | -10.025 | 37.753 |
C(neighbourhood)[T.Berruguete] | 23.9287 | 12.495 | 1.915 | 0.056 | -0.564 | 48.421 |
C(neighbourhood)[T.Buenavista] | 6.7709 | 13.283 | 0.510 | 0.610 | -19.266 | 32.808 |
C(neighbourhood)[T.Butarque] | 11.5161 | 18.670 | 0.617 | 0.537 | -25.080 | 48.112 |
C(neighbourhood)[T.Campamento] | -18.5179 | 15.792 | -1.173 | 0.241 | -49.472 | 12.436 |
C(neighbourhood)[T.Canillas] | 0.3081 | 12.568 | 0.025 | 0.980 | -24.326 | 24.942 |
C(neighbourhood)[T.Canillejas] | 8.4796 | 13.117 | 0.646 | 0.518 | -17.231 | 34.190 |
C(neighbourhood)[T.Casa de Campo] | 25.3565 | 13.443 | 1.886 | 0.059 | -0.993 | 51.706 |
C(neighbourhood)[T.Casco Histórico de Barajas] | 31.7824 | 14.662 | 2.168 | 0.030 | 3.042 | 60.523 |
C(neighbourhood)[T.Casco Histórico de Vallecas] | 1.5332 | 13.362 | 0.115 | 0.909 | -24.658 | 27.724 |
C(neighbourhood)[T.Casco Histórico de Vicálvaro] | -2.4579 | 14.881 | -0.165 | 0.869 | -31.626 | 26.710 |
C(neighbourhood)[T.Castellana] | 68.3615 | 12.155 | 5.624 | 0.000 | 44.536 | 92.187 |
C(neighbourhood)[T.Castilla] | 34.5684 | 13.791 | 2.507 | 0.012 | 7.536 | 61.600 |
C(neighbourhood)[T.Castillejos] | 33.7429 | 12.095 | 2.790 | 0.005 | 10.035 | 57.450 |
C(neighbourhood)[T.Chopera] | 16.6831 | 12.298 | 1.357 | 0.175 | -7.423 | 40.789 |
C(neighbourhood)[T.Ciudad Jardín] | 22.0364 | 12.925 | 1.705 | 0.088 | -3.299 | 47.372 |
C(neighbourhood)[T.Ciudad Universitaria] | 12.7401 | 14.757 | 0.863 | 0.388 | -16.186 | 41.666 |
C(neighbourhood)[T.Colina] | 47.0181 | 16.724 | 2.811 | 0.005 | 14.238 | 79.799 |
C(neighbourhood)[T.Comillas] | 18.0013 | 12.399 | 1.452 | 0.147 | -6.303 | 42.306 |
C(neighbourhood)[T.Concepción] | 10.4510 | 13.445 | 0.777 | 0.437 | -15.903 | 36.805 |
C(neighbourhood)[T.Corralejos] | 33.2445 | 30.623 | 1.086 | 0.278 | -26.780 | 93.269 |
C(neighbourhood)[T.Cortes] | 61.1947 | 11.473 | 5.334 | 0.000 | 38.707 | 83.683 |
C(neighbourhood)[T.Costillares] | 21.7116 | 15.790 | 1.375 | 0.169 | -9.240 | 52.663 |
C(neighbourhood)[T.Cuatro Caminos] | 36.0319 | 11.898 | 3.028 | 0.002 | 12.711 | 59.353 |
C(neighbourhood)[T.Cuatro Vientos] | 4.5201 | 20.772 | 0.218 | 0.828 | -36.196 | 45.236 |
C(neighbourhood)[T.Cármenes] | 11.2344 | 14.118 | 0.796 | 0.426 | -16.439 | 38.907 |
C(neighbourhood)[T.Delicias] | 19.5064 | 12.030 | 1.622 | 0.105 | -4.073 | 43.086 |
C(neighbourhood)[T.El Goloso] | 35.2039 | 19.945 | 1.765 | 0.078 | -3.891 | 74.299 |
C(neighbourhood)[T.El Pardo] | 13.4382 | 50.560 | 0.266 | 0.790 | -85.665 | 112.542 |
C(neighbourhood)[T.El Plantío] | 7.5515 | 20.777 | 0.363 | 0.716 | -33.173 | 48.276 |
C(neighbourhood)[T.El Viso] | 51.7654 | 13.119 | 3.946 | 0.000 | 26.050 | 77.481 |
C(neighbourhood)[T.Embajadores] | 38.6202 | 11.372 | 3.396 | 0.001 | 16.330 | 60.911 |
C(neighbourhood)[T.Entrevías] | -6.2525 | 15.606 | -0.401 | 0.689 | -36.842 | 24.337 |
C(neighbourhood)[T.Estrella] | 36.5910 | 16.722 | 2.188 | 0.029 | 3.813 | 69.369 |
C(neighbourhood)[T.Fontarrón] | -7.4839 | 14.875 | -0.503 | 0.615 | -36.641 | 21.673 |
C(neighbourhood)[T.Fuente del Berro] | 31.4966 | 12.699 | 2.480 | 0.013 | 6.605 | 56.388 |
C(neighbourhood)[T.Gaztambide] | 32.1650 | 12.138 | 2.650 | 0.008 | 8.373 | 55.957 |
C(neighbourhood)[T.Goya] | 49.8259 | 11.759 | 4.237 | 0.000 | 26.776 | 72.876 |
C(neighbourhood)[T.Guindalera] | 31.4116 | 11.845 | 2.652 | 0.008 | 8.195 | 54.628 |
C(neighbourhood)[T.Hellín] | -5.6841 | 18.674 | -0.304 | 0.761 | -42.287 | 30.919 |
C(neighbourhood)[T.Hispanoamérica] | 46.2799 | 12.660 | 3.655 | 0.000 | 21.464 | 71.096 |
C(neighbourhood)[T.Horcajo] | 2.945e-14 | 1.59e-14 | 1.856 | 0.063 | -1.64e-15 | 6.05e-14 |
C(neighbourhood)[T.Ibiza] | 31.7360 | 11.981 | 2.649 | 0.008 | 8.251 | 55.221 |
C(neighbourhood)[T.Imperial] | 22.3133 | 12.179 | 1.832 | 0.067 | -1.560 | 46.186 |
C(neighbourhood)[T.Jerónimos] | 67.9640 | 12.699 | 5.352 | 0.000 | 43.073 | 92.855 |
C(neighbourhood)[T.Justicia] | 51.9468 | 11.446 | 4.538 | 0.000 | 29.510 | 74.383 |
C(neighbourhood)[T.La Paz] | 2.0962 | 15.603 | 0.134 | 0.893 | -28.488 | 32.681 |
C(neighbourhood)[T.Legazpi] | 21.0025 | 14.121 | 1.487 | 0.137 | -6.677 | 48.682 |
C(neighbourhood)[T.Lista] | 53.2862 | 12.083 | 4.410 | 0.000 | 29.601 | 76.971 |
C(neighbourhood)[T.Los Angeles] | 6.2183 | 16.726 | 0.372 | 0.710 | -26.568 | 39.004 |
C(neighbourhood)[T.Los Rosales] | 0.6175 | 16.454 | 0.038 | 0.970 | -31.634 | 32.869 |
C(neighbourhood)[T.Lucero] | -7.5381 | 13.059 | -0.577 | 0.564 | -33.136 | 18.059 |
C(neighbourhood)[T.Marroquina] | 8.7202 | 17.027 | 0.512 | 0.609 | -24.654 | 42.095 |
C(neighbourhood)[T.Media Legua] | -1.8909 | 17.742 | -0.107 | 0.915 | -36.668 | 32.886 |
C(neighbourhood)[T.Mirasierra] | 1.0352 | 15.995 | 0.065 | 0.948 | -30.317 | 32.388 |
C(neighbourhood)[T.Moscardó] | 4.3998 | 12.626 | 0.348 | 0.727 | -20.348 | 29.148 |
C(neighbourhood)[T.Niño Jesús] | 50.0095 | 13.582 | 3.682 | 0.000 | 23.387 | 76.632 |
C(neighbourhood)[T.Nueva España] | 35.6868 | 13.059 | 2.733 | 0.006 | 10.090 | 61.284 |
C(neighbourhood)[T.Numancia] | 11.3359 | 12.186 | 0.930 | 0.352 | -12.550 | 35.222 |
C(neighbourhood)[T.Opañel] | 24.7965 | 12.660 | 1.959 | 0.050 | -0.019 | 49.612 |
C(neighbourhood)[T.Orcasitas] | -42.1824 | 17.364 | -2.429 | 0.015 | -76.218 | -8.147 |
C(neighbourhood)[T.Orcasur] | 11.9868 | 13.359 | 0.897 | 0.370 | -14.198 | 38.172 |
C(neighbourhood)[T.Pacífico] | 13.8451 | 11.828 | 1.171 | 0.242 | -9.340 | 37.030 |
C(neighbourhood)[T.Palacio] | 52.5310 | 11.398 | 4.609 | 0.000 | 30.189 | 74.873 |
C(neighbourhood)[T.Palomas] | 40.7050 | 23.084 | 1.763 | 0.078 | -4.542 | 85.952 |
C(neighbourhood)[T.Palomeras Bajas] | -10.6048 | 13.682 | -0.775 | 0.438 | -37.423 | 16.213 |
C(neighbourhood)[T.Palomeras Sureste] | -1.8211 | 15.605 | -0.117 | 0.907 | -32.408 | 28.766 |
C(neighbourhood)[T.Palos de Moguer] | 34.0250 | 11.694 | 2.910 | 0.004 | 11.103 | 56.946 |
C(neighbourhood)[T.Pavones] | -29.6977 | 24.777 | -1.199 | 0.231 | -78.265 | 18.869 |
C(neighbourhood)[T.Peñagrande] | 1.2636 | 15.439 | 0.082 | 0.935 | -29.000 | 31.527 |
C(neighbourhood)[T.Pilar] | 7.8029 | 13.736 | 0.568 | 0.570 | -19.122 | 34.728 |
C(neighbourhood)[T.Pinar del Rey] | 16.8362 | 12.486 | 1.348 | 0.178 | -7.638 | 41.310 |
C(neighbourhood)[T.Piovera] | 36.5582 | 15.793 | 2.315 | 0.021 | 5.602 | 67.514 |
C(neighbourhood)[T.Portazgo] | -15.9508 | 16.454 | -0.969 | 0.332 | -48.203 | 16.302 |
C(neighbourhood)[T.Pradolongo] | 6.6079 | 15.788 | 0.419 | 0.676 | -24.338 | 37.554 |
C(neighbourhood)[T.Prosperidad] | 28.2371 | 12.138 | 2.326 | 0.020 | 4.446 | 52.029 |
C(neighbourhood)[T.Pueblo Nuevo] | 4.7364 | 12.229 | 0.387 | 0.699 | -19.233 | 28.706 |
C(neighbourhood)[T.Puerta Bonita] | -7.4593 | 13.910 | -0.536 | 0.592 | -34.725 | 19.807 |
C(neighbourhood)[T.Puerta del Angel] | 2.9935 | 11.882 | 0.252 | 0.801 | -20.297 | 26.284 |
C(neighbourhood)[T.Quintana] | 4.7823 | 12.633 | 0.379 | 0.705 | -19.979 | 29.544 |
C(neighbourhood)[T.Recoletos] | 69.0001 | 11.885 | 5.806 | 0.000 | 45.704 | 92.296 |
C(neighbourhood)[T.Rejas] | 14.7888 | 13.001 | 1.138 | 0.255 | -10.694 | 40.272 |
C(neighbourhood)[T.Rios Rosas] | 43.8075 | 12.103 | 3.619 | 0.000 | 20.083 | 67.532 |
C(neighbourhood)[T.Rosas] | 9.7475 | 14.878 | 0.655 | 0.512 | -19.416 | 38.911 |
C(neighbourhood)[T.Salvador] | -2.8517 | 18.676 | -0.153 | 0.879 | -39.459 | 33.755 |
C(neighbourhood)[T.San Andrés] | 7.3299 | 13.910 | 0.527 | 0.598 | -19.936 | 34.596 |
C(neighbourhood)[T.San Cristobal] | -10.3517 | 19.946 | -0.519 | 0.604 | -49.449 | 28.745 |
C(neighbourhood)[T.San Diego] | 11.9146 | 12.080 | 0.986 | 0.324 | -11.763 | 35.592 |
C(neighbourhood)[T.San Fermín] | -7.8443 | 14.765 | -0.531 | 0.595 | -36.786 | 21.097 |
C(neighbourhood)[T.San Isidro] | 14.6513 | 12.156 | 1.205 | 0.228 | -9.176 | 38.479 |
C(neighbourhood)[T.San Juan Bautista] | 30.3704 | 15.283 | 1.987 | 0.047 | 0.414 | 60.327 |
C(neighbourhood)[T.San Pascual] | 9.5865 | 14.651 | 0.654 | 0.513 | -19.132 | 38.305 |
C(neighbourhood)[T.Santa Eugenia] | -8.5277 | 20.777 | -0.410 | 0.681 | -49.254 | 32.199 |
C(neighbourhood)[T.Simancas] | 27.9041 | 13.089 | 2.132 | 0.033 | 2.249 | 53.559 |
C(neighbourhood)[T.Sol] | 64.7310 | 11.439 | 5.659 | 0.000 | 42.309 | 87.153 |
C(neighbourhood)[T.Timón] | 3.8465 | 13.681 | 0.281 | 0.779 | -22.971 | 30.664 |
C(neighbourhood)[T.Trafalgar] | 40.9136 | 11.729 | 3.488 | 0.000 | 17.924 | 63.903 |
C(neighbourhood)[T.Universidad] | 52.6676 | 11.386 | 4.626 | 0.000 | 30.349 | 74.986 |
C(neighbourhood)[T.Valdeacederas] | 9.9176 | 12.750 | 0.778 | 0.437 | -15.073 | 34.909 |
C(neighbourhood)[T.Valdefuentes] | 24.6241 | 12.369 | 1.991 | 0.047 | 0.379 | 48.869 |
C(neighbourhood)[T.Valdemarín] | 22.7797 | 36.637 | 0.622 | 0.534 | -49.033 | 94.593 |
C(neighbourhood)[T.Valdezarza] | 0.6239 | 13.180 | 0.047 | 0.962 | -25.211 | 26.458 |
C(neighbourhood)[T.Vallehermoso] | 42.3945 | 13.731 | 3.087 | 0.002 | 15.480 | 69.309 |
C(neighbourhood)[T.Valverde] | 5.2359 | 12.791 | 0.409 | 0.682 | -19.836 | 30.308 |
C(neighbourhood)[T.Ventas] | 3.1014 | 12.239 | 0.253 | 0.800 | -20.888 | 27.091 |
C(neighbourhood)[T.Vinateros] | -6.8132 | 19.944 | -0.342 | 0.733 | -45.906 | 32.280 |
C(neighbourhood)[T.Vista Alegre] | 0.3320 | 12.711 | 0.026 | 0.979 | -24.583 | 25.247 |
C(neighbourhood)[T.Zofío] | 10.2001 | 16.455 | 0.620 | 0.535 | -22.055 | 42.455 |
bedrooms | 36.4822 | 0.524 | 69.672 | 0.000 | 35.456 | 37.509 |
number_of_reviews | -0.0417 | 0.005 | -8.227 | 0.000 | -0.052 | -0.032 |
review_scores_rating | 3.2728 | 0.706 | 4.639 | 0.000 | 1.890 | 4.656 |
Omnibus: | 2562.636 | Durbin-Watson: | 1.865 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 32841.808 |
Skew: | 0.473 | Prob(JB): | 0.00 |
Kurtosis: | 10.343 | Cond. No. | 2.42e+18 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.47e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
You try it
By how many euros, on average, do the model's predictions miss the actual price?
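The question asks for the mean absolute error of the predictions. A minimal sketch on made-up numbers (with the fitted statsmodels results object you would instead take `results.resid.abs().mean()`):

```python
# Mean absolute error: the average number of euros by which the
# predictions miss the observed prices. The values below are
# invented for illustration.
actual = [100.0, 80.0, 120.0]      # hypothetical observed prices
predicted = [90.0, 85.0, 130.0]    # hypothetical model predictions

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # → 8.333...
```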
Extra: grouping and aggregation
The following functions group or aggregate data tables into new ones. We have already seen the first three, but the others may be useful as well:
- `value_counts` extracts a frequency table as a series. The table contains the counts of the occurrences of every value of a given series. It does not include missing values.
- `pivot_table` extracts a spreadsheet-style pivot table as a data frame. For a Pandas data frame `df`, the syntax is `pd.pivot_table(df, values=cname1, index=cname2)`. This returns a one-way table containing the average value of `cname1` for the groups defined by `cname2`. Instead of the average, you can get a different summary by adding an argument `aggfunc=fname`. With an additional argument `columns=cname3`, you get a two-way table. For two-way tables, it works the same as `crosstab`, but it only applies to columns from the same data frame.
- `groupby` groups the rows of a data frame so that an aggregation function can be applied to every group, extracting a SQL-like table as a data frame. The syntax is `df.groupby(by=cname).fname()`, where `fname` is `mean`, `sum`, `count`, etc.
- `crosstab` extracts a simple cross tabulation as a Pandas data frame. For two series of the same length `s1` and `s2`, the syntax is `pd.crosstab(s1, s2)`. Then `s1` will be placed on the rows and `s2` on the columns. By default, `crosstab` extracts a frequency table, unless an array of values and an aggregation function are passed, e.g., as `values=s3` and `aggfunc=fname`.
- Finally, if you actually know some SQL and would like to use that syntax instead, you can use a package like `fugue` or `pandasql`.
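To make the correspondence between these functions concrete, here is a minimal sketch on a made-up data frame (the column names mimic the AirBnB data, but the values are invented):

```python
import pandas as pd

# A tiny stand-in for the AirBnB data (hypothetical values)
df = pd.DataFrame({
    'neighbourhood': ['Sol', 'Sol', 'Palacio', 'Palacio', 'Sol'],
    'room_type': ['Private', 'Entire', 'Entire', 'Entire', 'Private'],
    'price': [50, 120, 100, 90, 60],
})

# value_counts: frequency table as a series
counts = df['neighbourhood'].value_counts()

# pivot_table: one-way table with the average price per neighbourhood
avg = pd.pivot_table(df, values='price', index='neighbourhood')

# groupby: the same summary, SQL-style
avg2 = df.groupby(by='neighbourhood')['price'].mean()

# crosstab: two-way frequency table of neighbourhood vs room type
ct = pd.crosstab(df['neighbourhood'], df['room_type'])
```

Note that `avg` and `avg2` contain the same numbers: `pivot_table` and `groupby` are two routes to the same aggregation.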
You try it
There are many paths to the same solution in pandas. A great exercise to realize this is to try to mimic the output of value_counts()
using a groupby()
instead; note that the aggregation function would be … count
.
# How would you do this with a groupby?
df.neighbourhood.value_counts()
Embajadores 2573
Universidad 2010
Palacio 1680
Sol 1187
Justicia 1049
...
Palomas 7
Corralejos 6
Valdemarín 5
El Pardo 1
Horcajo 1
Name: neighbourhood, Length: 126, dtype: int64
Solution
df.groupby('neighbourhood')['name'].agg('count').sort_values(ascending=False)
neighbourhood
Embajadores 2573
Universidad 2009
Palacio 1679
Sol 1187
Justicia 1049
...
Palomas 7
Corralejos 6
Valdemarín 5
El Pardo 1
Horcajo 1
Name: name, Length: 126, dtype: int64
Extra: performing SQL on dataframes
If you are already familiar with SQL, you may find it more convenient to use it instead of the grouping functions presented above. Here is a simple example of how to run an SQL query against a pandas data frame.
! pip install fugue
from fugue_sql import fsql
# fugue does not allow indices, so we must remove it
table = df.reset_index()
# Queries are very sql-like, but have some unique syntax
# at the end. Here, we are asking for a dataframe called
# df_out.
query_1 = """
SELECT COUNT(*) AS cnt FROM table
GROUP BY host_id
ORDER BY cnt DESC
YIELD DATAFRAME AS df_out
"""
# result is a dictionary containing all the requested outputs
# in this case, just df_out
result = fsql(query_1).run()
result['df_out'].native.head(10)
 | cnt |
---|---|
9343 | 263 |
9114 | 148 |
10457 | 143 |
9861 | 127 |
2572 | 101 |
8603 | 97 |
5315 | 96 |
5324 | 96 |
3292 | 86 |
1874 | 80 |
References
Adapted from: Miguel Ángel Canela