```python
import pandas as pd

fname = './resources/AirBnB-Madrid.csv'
df = pd.read_csv(fname, index_col='listing_id')
```
From Data to Model: AirBnB
AirBnB is an online marketplace and homestay network that allows individuals to list or rent short-term lodging in residential properties. The cost of accommodation is set by the property owner, and the company charges percentage service fees from both guests and hosts for every booking made. Since its inception in 2008, Airbnb has experienced exponential growth and currently boasts over 2 million listings across approximately 200 countries.
Notably, AirBnB regularly updates and releases data on its platform. Interested parties can access this data via the Inside Airbnb site (insideairbnb.com). The most recent update, posted last month, covers more than 100 areas, with the majority of these being in the US and Europe. For the purpose of this example, we will use local data, which was captured on May 9th, 2023.
📁 Loading the data
Description
The file AirBnB-Madrid.csv
contains data on 18,302 local Airbnb listings. The language in the descriptions is typically English or Spanish (with exceptions). The text comes in UTF-8 encoding, so special characters may not be correctly shown (in Spanish words such as “habitación”) in Windows machines and Microsoft applications.
The variables are:
- listing_id, a unique listing’s ID. An active listing is a property listed on Airbnb. Listings may include entire homes or apartments, private rooms or shared spaces.
- host_id, a unique host’s ID.
- host_since, the date of the host’s first listing in Airbnb, as yyyy-mm-dd.
- name, the listing’s name. A minimal description (maximum 35 characters) of the place, intended to be appealing, such as “Centric Bohemian next Ramblas&Macba”.
- neighbourhood, the neighbourhood of the listing. The neighbourhoods are sourced from city or open source GIS files.
- property_type, the type of property listed. Typically ‘Apartment’, ‘Bed & Breakfast’ or ‘House’, but it can also be ‘Boat’, ‘Loft’, or others.
- room_type, taking values ‘Entire home/apt’, ‘Hotel room’, ‘Private room’ and ‘Shared room’.
- bedrooms, the number of available bedrooms.
- price, the daily listing’s price on that date in euros. The price shown is for the listing as a whole, not per person. The price that you see when you search Airbnb (with dates selected) is the total price divided by the number of nights you selected. When a listing has been booked for several days, the price can be lower, since the host can apply different prices depending on the number of days booked.
- number_of_reviews, the number of reviews of this listing that have been posted.
- review_scores_rating, the average reviewers’ rating of overall experience (what was your guest’s overall experience?). Listings are rated in the range 0-5.
Source: Inside Airbnb, edited.
Importing the data
Let’s import the data to a Pandas data frame, taking listing_id as the index and leaving the rest at its default settings. You can download it here.
Our Data Processing Pipeline
```mermaid
flowchart LR
    A[Import Data]
    A --> B(Check Duplicates)
    A --> C(Handle Null Values)
    A --> C2(...)
    B --> Clean[Clean Data]
    C --> Clean
    C2 --> Clean
    Clean --> D[Descriptive]
    D --> E[Modeling]
```
🧹 Data Cleaning
```python
df.describe()
```

| | host_id | bedrooms | price | number_of_reviews | review_scores_rating |
|---|---|---|---|---|---|
count | 2.123900e+04 | 19667.000000 | 21239.000000 | 21239.000000 | 17168.000000 |
mean | 1.818276e+08 | 1.475467 | 139.106173 | 42.764396 | 4.596427 |
std | 1.626871e+08 | 0.846230 | 660.313831 | 78.737076 | 0.621636 |
min | 7.952000e+03 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 3.255115e+07 | 1.000000 | 50.000000 | 1.000000 | 4.500000 |
50% | 1.289901e+08 | 1.000000 | 85.000000 | 10.000000 | 4.750000 |
75% | 3.205107e+08 | 2.000000 | 133.000000 | 47.000000 | 4.920000 |
max | 5.050783e+08 | 18.000000 | 71577.000000 | 889.000000 | 5.000000 |
We can do a preliminary check of the dataframe using info and head:
```python
df.head(3)
df.tail(3)
```
| | host_id | host_since | name | neighbourhood | property_type | room_type | bedrooms | price | number_of_reviews | review_scores_rating |
|---|---|---|---|---|---|---|---|---|---|---|
| listing_id | | | | | | | | | | |
843487874169862949 | 230936437 | 2018-12-16 | Habitación privada en Madrid. | Valverde | Private room in rental unit | Private room | 1.0 | 46.0 | 1 | 5.0 |
843544239423340038 | 455852114 | 2022-04-24 | Habitación Privada Confort | Ventas | Private room in rental unit | Private room | 1.0 | 28.0 | 1 | 5.0 |
843551963497475593 | 504579859 | 2023-03-09 | Habitación en piso compartido | Imperial | Private room in rental unit | Private room | 1.0 | 60.0 | 0 | NaN |
```python
df.info()
```
<class 'pandas.core.frame.DataFrame'>
Index: 21239 entries, 6369 to 843551963497475593
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 host_id 21239 non-null int64
1 host_since 21236 non-null object
2 name 21235 non-null object
3 neighbourhood 21239 non-null object
4 property_type 21239 non-null object
5 room_type 21239 non-null object
6 bedrooms 19667 non-null float64
7 price 21239 non-null float64
8 number_of_reviews 21239 non-null int64
9 review_scores_rating 17168 non-null float64
dtypes: float64(3), int64(2), object(5)
memory usage: 1.8+ MB
This already tells us that some columns have missing values. That is a fact of life in data science, so we need to be able to handle it.
Missing values
Missing values are denoted by NaN in Pandas. When a Pandas object is built, both plain Python’s None and NumPy’s nan are taken as NaN. Since np.nan has type float, a numeric series containing NaN values gets type float.
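A quick sanity check of this typing behaviour, using a small made-up series:

```python
import numpy as np
import pandas as pd

# Both None and np.nan are stored as NaN, and the NaN forces a float dtype
s = pd.Series([1, None, 3, np.nan])
print(s.dtype)         # float64
print(s.isna().sum())  # 2
```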
Three useful Pandas methods related to missing values, which can be applied to both series and data frames, are:

- isna returns a Boolean mask indicating which terms are missing.
- fillna is used for replacing NaN’s by a fixed value, set by the user.
- dropna returns the same data frame minus the rows that contain at least one missing value. If a list of columns is specified, the missing values are searched only in those columns.
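As a minimal sketch of these three methods on a small made-up frame (not part of the Airbnb data):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

mask = demo.isna()                   # Boolean mask, same shape as demo
filled = demo.fillna({'a': 0.0})     # replace NaN's in column 'a' by 0.0
trimmed = demo.dropna(subset=['b'])  # drop rows where 'b' is missing

print(mask.values.sum())  # 2 missing values in total
print(trimmed.shape)      # (2, 2)
```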
For this dataset, in particular, we have a relevant number of listings for which rating scores are not available (many guests don’t rate the lodgings). A report on all missing values can be printed with:
```python
df.isna().sum()
```
host_id 0
host_since 3
name 4
neighbourhood 0
property_type 0
room_type 0
bedrooms 1572
price 0
number_of_reviews 0
review_scores_rating 4071
dtype: int64
Note that df.isna() is a data frame of the same shape as df, whose terms have type bool. With sum, we get the number of True values per column.
We can drop the NA’s with the dropna() command. Since I’m not going to analyze ratings or bedrooms for now, I will only drop NA’s for name and host_since.

```python
print('Before dropping NAs:', df.shape)
df = df.dropna(subset=['name','host_since'])
print('After dropping NAs:', df.shape)
```
Before dropping NAs: (21239, 10)
After dropping NAs: (21232, 10)
Duplicates
There are two useful Pandas functions for managing duplicates:
- drop_duplicates drops the duplicated entries (in a series) or the duplicated rows (in a data frame).
- duplicated returns a Boolean series indicating which entries (for a series) or which rows (for a data frame) are duplicated.
For instance, the listing’s ID in this data set is never duplicated, so it makes sense to use it as the index:
```python
df.index.duplicated().sum()
```
0
But a different question is whether several listings have exactly the same data (e.g., this happens when they refer to beds in a shared bedroom). How often does this occur in our dataset?

```python
df.duplicated().sum()
```
22
How does duplicated work? It reads the data top-down. For entries that occur for the first time, it returns False. For entries that have already occurred before, it returns True.
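For instance, on a small made-up series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'c'])

# first occurrences are False, repeats are True
print(s.duplicated().tolist())       # [False, False, True, True, False]
print(s.drop_duplicates().tolist())  # ['a', 'b', 'c']
```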
Let us now permanently remove the duplicates from the dataframe using the drop_duplicates function.
```python
print('Before dropping duplicates:', df.shape)
df = df.drop_duplicates()
print('After dropping duplicates:', df.shape)
df.shape
```
Before dropping duplicates: (21232, 10)
After dropping duplicates: (21210, 10)
(21210, 10)
Problem (?) solved!
It’s good practice to save the dataframe after preprocessing; this allows you to continue with a clean dataset going forward.
```python
df.to_csv('./resources/AirBnB-cleaned.csv')
```
💡 Descriptive Analytics
We reload the dataset.
```python
df = pd.read_csv('./resources/AirBnB-cleaned.csv', index_col='listing_id')
```
Much of the work of a data scientist comes down to manipulating the data to answer specific business questions. Most of these operations come in the form of either simple statistics applied to the whole dataset, or simple statistics applied to subgroups of the dataset.
Let’s try to answer a few questions now to see examples of such procedures:
Is AirBnB still P2P?
Airbnb was initially designed to function as a peer-to-peer (P2P) network, where individuals could rent out their apartments to others. However, some critics argue that it has now evolved into a platform that allows investors with large real estate portfolios to offer their properties for rent. The question remains: is this the case here?
One way to investigate this is by examining the ratio of the number of listings to the number of hosts on the platform. By extracting the list of unique host IDs as the index of the series returned by value_counts, we can obtain a ranked list of hosts based on the number of listings they have. Alternatively, we can obtain an array of these IDs using the unique method.
```python
df['host_id'].unique()
```
array([ 13660, 83531, 346366726, ..., 230936437, 455852114,
504579859])
To count them:
```python
# option 1: look at the shape
df['host_id'].unique().shape

# option 2: use nunique()
df['host_id'].nunique()
```
11073
This is the number of hosts. Compare it with the number of listings:

```python
df.shape
```
(21210, 10)
So how many hosts are actually following the original AirBnB philosophy?
Let’s take a look at the original dataframe. There too, we can check the value counts to see the number of listings per host. Then we count the number of hosts that have just 1 property.
```python
df['host_id'].value_counts()
```
host_id
377605855 263
346367515 148
471977881 143
434208714 127
28038703 101
...
205944086 1
216063028 1
11857294 1
53974722 1
504579859 1
Name: count, Length: 11073, dtype: int64
Let’s see how many have just 1:
```python
((df['host_id'].value_counts() == 1)*1).sum()
```
8503
We can easily identify the hosts with the biggest portfolios:
```python
df['host_id'].value_counts().head(10)
```
host_id
377605855 263
346367515 148
471977881 143
434208714 127
28038703 101
291253690 97
107335103 96
107141654 96
40877988 86
17499500 80
Name: count, dtype: int64
You try it
The number of hosts with more than 10 listings is:
...
Ellipsis
Solution
```python
(df['host_id'].value_counts() > 10).sum()
```
194
How are experienced hosts behaving?
The file contains some information about the hosts, that is, the people renting out the lodgings. For instance, you may be interested in the more experienced hosts, using a Boolean mask like df['host_since'] < '2012-01-01' to select the listings with these more experienced hosts.
```python
filter_experienced = df['host_since'] < '2012-01-01'
df_experienced = df[filter_experienced]

df_experienced.shape
```
(437, 10)
Looks like there are 437 listings from hosts that joined before 2012. How many listings does each of them have? Let’s use the value_counts function to find out.
You try it
Count how many listings each host that joined before 2012 has:
- Figure out which hosts joined before 2012.
- Use this as a filter.
- Apply value_counts to the resulting dataframe.
...
Ellipsis
Solution
```python
filter_experienced = df['host_since'] < '2012-01-01'
df_experienced = df[filter_experienced]
df_experienced['host_id'].value_counts()
```
host_id
1408525 65
901218 36
596469 26
142578 11
1108348 9
..
606925 1
1206817 1
70273 1
901541 1
1499853 1
Name: count, Length: 179, dtype: int64
So, we have quite a few hosts in this list. Many of them have one listing, but the top hosts look like companies. You may wonder how frequent this is.
Distribution of the price
What’s the distribution of the listing’s price? We can explore this point with a histogram. The histogram is returned by the Pandas method plot.hist().

```python
df['price'].plot.hist();
```
A statistical summary, as the one returned by df.describe(), can be helpful:

```python
df['price'].describe()
```
count 21210.000000
mean 139.201367
std 660.758099
min 0.000000
25% 50.000000
50% 85.000000
75% 133.000000
max 71577.000000
Name: price, dtype: float64
Alternatively, we can compute other quantiles using df.quantile():

```python
quantiles = df['price'].quantile(q=[0.05,0.95])
quantiles
```
0.05 23.0
0.95 331.0
Name: price, dtype: float64
You try it
These 5-95% quantiles look like reasonable limits for trimming. Try plotting the distribution for just this part of the dataset.
- Calculate the quantiles
- Filter out bad data (outliers)
- Plot the distribution
The goal is to recreate the plot below:
...
Ellipsis
Solution
```python
df_no_outlier = df[(df['price'] >= 23) & (df['price'] <= 331)]
df_no_outlier.price.plot.hist(figsize=(8,4), color='gray', rwidth=0.9, bins=40).set_xlim(50,320);
```
Fancy Solution with Bar Highlights
```python
import matplotlib.pyplot as plt

# I start at 25 here to ensure that round number bars are prominently shown
df_no_outlier = df[(df['price'] >= 25) & (df['price'] <= 331)]
df_no_outlier.price.plot.hist(figsize=(8,4), color='gray', rwidth=0.9, bins=40);

# We repeat the plot, but this time keep track of the bars
hist_data, bins, bars = plt.hist(df_no_outlier.price, bins=40, rwidth=0.9, color='gray')

# Then, we define the indexes of the bars that we want to highlight
highlight_bins = [0, 9, 16, 22, 29, 35]

# Loop through them and change their fc (face color)
for i in highlight_bins:
    bars[i].set_fc('red')

#plt.show()
plt.gca().set_xlim(50,320);
```
Hint: To make it look more like mine, use the argument bins, which specifies the number of intervals into which the range covered by the data is partitioned. Note how intervals starting at multiples of 10 have higher bars. This is not rare: prices don’t change continuously, and hosts prefer prices which are multiples of 10. The histogram also shows that e.g., 50, 100 and 150 euros are popular prices.
Note. The rule for counting the observations in every bin is as follows. For an interval with limits a and b, the values x such that a <= x < b are counted, except for the last interval, for which the right limit is also included.
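This half-open convention comes from NumPy’s histogram routine, which the plotting machinery relies on; a minimal check with made-up values:

```python
import numpy as np

# Three bins over [0, 3): [0, 1), [1, 2) and [2, 3]
counts, edges = np.histogram([0, 1, 2, 3], bins=3, range=(0, 3))
print(counts)  # [1 1 2] -- the value 3 lands in the last bin, whose right limit is closed
```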
Average price per room type
Suppose that you are interested in the average price per room type. You can get this by applying groupby(), which works as follows:

This method groups the rows according to one or more grouping variables. Here, you specify groupby(by='room_type'). Next, you select the columns to be aggregated, which here would be just price. Finally, you specify the aggregation function. Having only the mean, you can do it just by adding mean() or by specifying .agg('mean'). The method round() works as you would expect it to.
```python
df.groupby('room_type')
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa6660e2690>
```python
df.groupby('room_type')['price']
```
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fa6604a0f10>
```python
df.groupby('room_type')['price'].mean()
```
room_type
Entire home/apt 160.338694
Hotel room 204.131148
Private room 98.362606
Shared room 104.191638
Name: price, dtype: float64
The above way of writing a groupby is a shorthand notation for the full version using agg, as shown below:

```python
df.groupby(by='room_type')['price'].agg('mean').round()
```
room_type
Entire home/apt 160.0
Hotel room 204.0
Private room 98.0
Shared room 104.0
Name: price, dtype: float64
While the differences in prices across room types may be as expected (does anything stand out?), relying on the average price alone may not always provide an accurate representation of the data. Skewed distributions can result in extreme values in the right tail, which can significantly affect the mean price and lead it to deviate from the central tendency of the data. In such cases, the median price provides a more representative measure of the mid-point of the distribution.
To calculate the median price per room type, the aggregation function mean in our calculations should be replaced by median. Similar rules apply for the ‘usual’ statistics:

```python
df.groupby(by='room_type')['price'].agg(['min','max','mean','median']).round()
```
| | min | max | mean | median |
|---|---|---|---|---|
| room_type | | | | |
Entire home/apt | 10.0 | 71577.0 | 160.0 | 104.0 |
Hotel room | 0.0 | 1149.0 | 204.0 | 125.0 |
Private room | 8.0 | 40714.0 | 98.0 | 40.0 |
Shared room | 11.0 | 3000.0 | 104.0 | 28.0 |
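A tiny made-up illustration of how a single extreme value drags the mean while leaving the median untouched:

```python
import pandas as pd

# nine modest prices plus one extreme listing
prices = pd.Series([30] * 9 + [3000])
print(prices.mean())    # 327.0
print(prices.median())  # 30.0
```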
You can even provide your own functions!
```python
# price_series is a series passed to the groupby function
# it is applied to every split-group
def myquantiles(price_series):
    return [
        price_series.quantile(q=.5).round(),
        price_series.quantile(q=.95).round()
    ]

df.groupby(by='room_type')['price'].agg(['min','max','mean',myquantiles]).round()
```
| | min | max | mean | myquantiles |
|---|---|---|---|---|
| room_type | | | | |
Entire home/apt | 10.0 | 71577.0 | 160.0 | [104.0, 324.0] |
Hotel room | 0.0 | 1149.0 | 204.0 | [125.0, 1000.0] |
Private room | 8.0 | 40714.0 | 98.0 | [40.0, 350.0] |
Shared room | 11.0 | 3000.0 | 104.0 | [28.0, 481.0] |
Note that myquantiles is called in the ‘gear’ / agg step and is thus applied to every single split-group separately. The input (price_series) is the prices for that specific group. The output is the quantiles for that specific group. These outputs are merged into one column called myquantiles.
The aggregation functions are applied to each column separately. If instead, you would like to use all columns to do some advanced calculation, you can use the `.apply(myfunction)` method instead of the `.agg(myfunction)` method.
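As a sketch of that difference, with a made-up frame and a hypothetical price_per_night function (neither is part of this dataset's analysis): agg hands each selected column to the function separately, while apply hands over the whole sub-frame per group, so the function can combine columns.

```python
import pandas as pd

demo = pd.DataFrame({'g': ['a', 'a', 'b'],
                     'price': [10, 20, 40],
                     'nights': [1, 2, 4]})

# .apply sees the full sub-frame per group, so it can mix columns
def price_per_night(group):
    return (group['price'] / group['nights']).mean()

result = demo.groupby('g')[['price', 'nights']].apply(price_per_night)
print(result)  # both groups work out to 10.0 per night
```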
The results obtained from using the median price provide a more meaningful perspective compared to the earlier results, where shared rooms appeared to be more expensive than private rooms. Using the median price allows for a more accurate representation of the mid-point of the price distribution and reduces the influence of extreme values that can bias the results.
Top-10 neighbourhoods
A natural analysis of the local Airbnb data is to take a look at the most popular neighbourhoods by number of listings. The top ten list can be extracted with value_counts:

```python
df['neighbourhood'].value_counts().head(10)
```
neighbourhood
Embajadores 2573
Universidad 2009
Palacio 1679
Sol 1187
Justicia 1049
Cortes 931
Trafalgar 394
Palos de Moguer 373
Goya 341
Argüelles 312
Name: count, dtype: int64
Are these neighbourhoods more, or perhaps less, expensive? One could wonder if there is an association between the median price and the amount of property listed (following the laws of supply and demand).
```python
nb_counts = df.groupby(by='neighbourhood')['price'].agg(['count','mean','median'])

nb_counts.sort_values('count', ascending=False).head(10)
```
| | count | mean | median |
|---|---|---|---|
| neighbourhood | | | |
Embajadores | 2573 | 145.029149 | 86.0 |
Universidad | 2009 | 127.534594 | 96.0 |
Palacio | 1679 | 144.176891 | 100.0 |
Sol | 1187 | 153.972199 | 112.0 |
Justicia | 1049 | 151.954242 | 100.0 |
Cortes | 931 | 149.902256 | 115.0 |
Trafalgar | 394 | 192.180203 | 95.5 |
Palos de Moguer | 373 | 206.426273 | 82.0 |
Goya | 341 | 211.741935 | 116.0 |
Argüelles | 312 | 127.147436 | 90.5 |
Sometimes, we prefer to name the columns or use special functions that are not built-in by default. This can be done by providing a dict instead of a list to the agg function. The keys will be the names of the columns, the values are the calculations to be done:
```python
def my_property_function(x):
    return x.mode()

df.groupby(by='neighbourhood').agg({
    'price' : ['count', 'median'],
    'property_type': [my_property_function]
})
```
| neighbourhood | price / count | price / median | property_type / my_property_function |
|---|---|---|---|
Abrantes | 40 | 37.0 | Private room in rental unit |
Acacias | 169 | 64.0 | Entire rental unit |
Adelfas | 96 | 70.0 | Entire rental unit |
Aeropuerto | 8 | 30.5 | Private room in rental unit |
Aguilas | 41 | 45.0 | Private room in rental unit |
... | ... | ... | ... |
Valverde | 95 | 65.0 | Entire rental unit |
Ventas | 174 | 50.0 | Private room in rental unit |
Vinateros | 23 | 40.0 | Private room in rental unit |
Vista Alegre | 124 | 48.5 | Private room in rental unit |
Zofío | 26 | 68.5 | Entire rental unit |
126 rows × 3 columns
You try it
What is the actual correlation between price and number of listings? Follow these steps to find out:
- Do a group-by to get the counts and the median price:
  - group by neighbourhood
  - extract the price as a Series
  - aggregate to get the count and the median
- Use the .corr() function to obtain the correlation.
Your goal is to produce the following table:
...
Ellipsis
Solution
```python
df.groupby(by='neighbourhood')['price'].agg(['count','median']).corr()
```

| | count | median |
|---|---|---|
count | 1.000000 | 0.177183 |
median | 0.177183 | 1.000000 |
```python
df.groupby('neighbourhood')['price'].agg(['count','median','mean']).corr()
```

| | count | median | mean |
|---|---|---|---|
count | 1.000000 | 0.177183 | 0.070872 |
median | 0.177183 | 1.000000 | 0.600848 |
mean | 0.070872 | 0.600848 | 1.000000 |
The correlation is weak, so supply alone does not explain the prices; something more must be going on. We’ll try some actual statistics towards the end of the class.
Pivoting
In this dataset too, a pivot table can be handy. Let us try to answer the following question.
You try it
What is the median price of a property in each neighbourhood? Organize by room type:
- Rows = room type
- Columns = neighbourhood
- Values = price
Use the syntax df.pivot_table(index=..., columns=..., values=...)
Straightforward Solution
```python
pivot = df.pivot_table(index='room_type', columns='neighbourhood', values='price', aggfunc='median')
```
Advanced Solution
```python
# Here's a more complex solution, I will pivot on 2 statistics, both the median and the count.
# I use the count to remove any neighbourhood with very few records as we cannot calculate
# meaningful statistics for those.
#
# Then, I sort the values. I will use Entire home/apt prices as a leading indicator.

# 1. pivot on both stats
pivot = df.pivot_table(index='room_type', columns='neighbourhood', values='price', aggfunc=['median','count'])

# 2. find out/filter which neighbourhoods have less than 10 recorded prices in Entire home/apt
count_columns = pivot['count']
valid_neighborhoods = count_columns.loc['Entire home/apt'][count_columns.loc['Entire home/apt'] >= 10].index
filtered_pivot = pivot.loc[:, pivot.columns.get_level_values(1).isin(valid_neighborhoods)]

# 3. we're only interested in the median price, so we pick that one. Then, we also sort from
# most to least expensive
filtered_pivot['median'].T.sort_values(by='Entire home/apt', ascending=False).T
```
neighbourhood | Rosas | Jerónimos | Castellana | Goya | Recoletos | Colina | Niño Jesús | Sol | Rios Rosas | Cortes | ... | Adelfas | Marroquina | Almendrales | Pradolongo | Concepción | Zofío | San Andrés | Aluche | Vista Alegre | Entrevías |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
room_type | |||||||||||||||||||||
Entire home/apt | 165.0 | 158.0 | 151.0 | 141.0 | 140.0 | 140.0 | 135.0 | 126.0 | 125.0 | 125.0 | ... | 75.0 | 75.0 | 75.0 | 73.5 | 73.0 | 73.0 | 70.0 | 69.5 | 67.0 | 66.0 |
Hotel room | NaN | 0.0 | NaN | NaN | 197.0 | NaN | NaN | 41.0 | 106.5 | 159.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 200.0 | NaN | NaN | NaN |
Private room | 375.0 | 46.5 | 35.0 | 43.5 | 39.0 | 38.5 | 46.0 | 54.0 | 38.0 | 60.0 | ... | 35.0 | 41.0 | 35.0 | 41.0 | 29.0 | 28.0 | 28.0 | 28.0 | 30.0 | 32.0 |
Shared room | 1000.0 | NaN | 168.5 | NaN | NaN | NaN | 18.0 | 30.0 | 42.0 | 31.0 | ... | NaN | NaN | NaN | 17.0 | 180.0 | NaN | 11.0 | 18.0 | NaN | 15.0 |
4 rows × 100 columns
📈 Build a regression model to predict prices of AirBnB listings.
Lastly, we could build some basic statistical regressions to model the prices of local AirBnB listings. This could allow us to potentially spot over- or under-valued rooms. For home-owners, on the other hand, it could give an indication of what the right price for a property would be.
We start by importing statsmodels as explained in the accompanying document:
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
```
We will only look at the dataframe without outliers here.
```python
model1 = smf.ols("""
    price ~ bedrooms + number_of_reviews + review_scores_rating
""", data=df_no_outlier).fit()

model1.summary()
```
Dep. Variable: | price | R-squared: | 0.233 |
Model: | OLS | Adj. R-squared: | 0.233 |
Method: | Least Squares | F-statistic: | 1455. |
Date: | Wed, 27 Sep 2023 | Prob (F-statistic): | 0.00 |
Time: | 10:00:46 | Log-Likelihood: | -77530. |
No. Observations: | 14380 | AIC: | 1.551e+05 |
Df Residuals: | 14376 | BIC: | 1.551e+05 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
Intercept | 31.6574 | 3.588 | 8.822 | 0.000 | 24.624 | 38.691 |
bedrooms | 36.8712 | 0.559 | 65.956 | 0.000 | 35.775 | 37.967 |
number_of_reviews | -0.0025 | 0.005 | -0.473 | 0.636 | -0.013 | 0.008 |
review_scores_rating | 3.0993 | 0.755 | 4.106 | 0.000 | 1.620 | 4.579 |
Omnibus: | 2498.921 | Durbin-Watson: | 1.858 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 19666.142 |
Skew: | 0.617 | Prob(JB): | 0.00 |
Kurtosis: | 8.595 | Cond. No. | 828. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```python
model2 = smf.ols("""
    price ~ bedrooms + C(neighbourhood) + number_of_reviews + review_scores_rating
""", data=df_no_outlier).fit()

model2.summary()
```
Dep. Variable: | price | R-squared: | 0.346 |
Model: | OLS | Adj. R-squared: | 0.340 |
Method: | Least Squares | F-statistic: | 59.32 |
Date: | Wed, 27 Sep 2023 | Prob (F-statistic): | 0.00 |
Time: | 10:02:28 | Log-Likelihood: | -76386. |
No. Observations: | 14380 | AIC: | 1.530e+05 |
Df Residuals: | 14252 | BIC: | 1.540e+05 |
Df Model: | 127 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
Intercept | -3.4442 | 11.702 | -0.294 | 0.769 | -26.381 | 19.493 |
C(neighbourhood)[T.Acacias] | 14.8142 | 12.185 | 1.216 | 0.224 | -9.071 | 38.699 |
C(neighbourhood)[T.Adelfas] | 3.8755 | 12.717 | 0.305 | 0.761 | -21.051 | 28.802 |
C(neighbourhood)[T.Aeropuerto] | -18.0615 | 27.109 | -0.666 | 0.505 | -71.200 | 35.077 |
C(neighbourhood)[T.Aguilas] | -11.9278 | 15.604 | -0.764 | 0.445 | -42.515 | 18.659 |
C(neighbourhood)[T.Alameda de Osuna] | 15.2433 | 14.125 | 1.079 | 0.281 | -12.443 | 42.930 |
C(neighbourhood)[T.Almagro] | 58.9526 | 12.138 | 4.857 | 0.000 | 35.161 | 82.745 |
C(neighbourhood)[T.Almenara] | 25.3954 | 12.629 | 2.011 | 0.044 | 0.640 | 50.150 |
C(neighbourhood)[T.Almendrales] | 11.1726 | 13.249 | 0.843 | 0.399 | -14.798 | 37.143 |
C(neighbourhood)[T.Aluche] | -15.0626 | 13.910 | -1.083 | 0.279 | -42.329 | 12.203 |
C(neighbourhood)[T.Ambroz] | -23.4862 | 21.792 | -1.078 | 0.281 | -66.201 | 19.229 |
C(neighbourhood)[T.Amposta] | 9.9295 | 36.634 | 0.271 | 0.786 | -61.877 | 81.736 |
C(neighbourhood)[T.Apostol Santiago] | 12.4785 | 15.439 | 0.808 | 0.419 | -17.784 | 42.741 |
C(neighbourhood)[T.Arapiles] | 28.7075 | 12.165 | 2.360 | 0.018 | 4.862 | 52.553 |
C(neighbourhood)[T.Aravaca] | 7.1299 | 14.281 | 0.499 | 0.618 | -20.863 | 35.123 |
C(neighbourhood)[T.Arcos] | -13.7477 | 14.877 | -0.924 | 0.355 | -42.908 | 15.413 |
C(neighbourhood)[T.Argüelles] | 41.8478 | 11.832 | 3.537 | 0.000 | 18.655 | 65.040 |
C(neighbourhood)[T.Atocha] | 35.5168 | 17.071 | 2.081 | 0.037 | 2.056 | 68.978 |
C(neighbourhood)[T.Bellas Vistas] | 13.8640 | 12.187 | 1.138 | 0.255 | -10.025 | 37.753 |
C(neighbourhood)[T.Berruguete] | 23.9287 | 12.495 | 1.915 | 0.056 | -0.564 | 48.421 |
C(neighbourhood)[T.Buenavista] | 6.7709 | 13.283 | 0.510 | 0.610 | -19.266 | 32.808 |
C(neighbourhood)[T.Butarque] | 11.5161 | 18.670 | 0.617 | 0.537 | -25.080 | 48.112 |
C(neighbourhood)[T.Campamento] | -18.5179 | 15.792 | -1.173 | 0.241 | -49.472 | 12.436 |
C(neighbourhood)[T.Canillas] | 0.3081 | 12.568 | 0.025 | 0.980 | -24.326 | 24.942 |
C(neighbourhood)[T.Canillejas] | 8.4796 | 13.117 | 0.646 | 0.518 | -17.231 | 34.190 |
C(neighbourhood)[T.Casa de Campo] | 25.3565 | 13.443 | 1.886 | 0.059 | -0.993 | 51.706 |
C(neighbourhood)[T.Casco Histórico de Barajas] | 31.7824 | 14.662 | 2.168 | 0.030 | 3.042 | 60.523 |
C(neighbourhood)[T.Casco Histórico de Vallecas] | 1.5332 | 13.362 | 0.115 | 0.909 | -24.658 | 27.724 |
C(neighbourhood)[T.Casco Histórico de Vicálvaro] | -2.4579 | 14.881 | -0.165 | 0.869 | -31.626 | 26.710 |
C(neighbourhood)[T.Castellana] | 68.3615 | 12.155 | 5.624 | 0.000 | 44.536 | 92.187 |
C(neighbourhood)[T.Castilla] | 34.5684 | 13.791 | 2.507 | 0.012 | 7.536 | 61.600 |
C(neighbourhood)[T.Castillejos] | 33.7429 | 12.095 | 2.790 | 0.005 | 10.035 | 57.450 |
C(neighbourhood)[T.Chopera] | 16.6831 | 12.298 | 1.357 | 0.175 | -7.423 | 40.789 |
C(neighbourhood)[T.Ciudad Jardín] | 22.0364 | 12.925 | 1.705 | 0.088 | -3.299 | 47.372 |
C(neighbourhood)[T.Ciudad Universitaria] | 12.7401 | 14.757 | 0.863 | 0.388 | -16.186 | 41.666 |
C(neighbourhood)[T.Colina] | 47.0181 | 16.724 | 2.811 | 0.005 | 14.238 | 79.799 |
C(neighbourhood)[T.Comillas] | 18.0013 | 12.399 | 1.452 | 0.147 | -6.303 | 42.306 |
C(neighbourhood)[T.Concepción] | 10.4510 | 13.445 | 0.777 | 0.437 | -15.903 | 36.805 |
C(neighbourhood)[T.Corralejos] | 33.2445 | 30.623 | 1.086 | 0.278 | -26.780 | 93.269 |
C(neighbourhood)[T.Cortes] | 61.1947 | 11.473 | 5.334 | 0.000 | 38.707 | 83.683 |
C(neighbourhood)[T.Costillares] | 21.7116 | 15.790 | 1.375 | 0.169 | -9.240 | 52.663 |
C(neighbourhood)[T.Cuatro Caminos] | 36.0319 | 11.898 | 3.028 | 0.002 | 12.711 | 59.353 |
C(neighbourhood)[T.Cuatro Vientos] | 4.5201 | 20.772 | 0.218 | 0.828 | -36.196 | 45.236 |
C(neighbourhood)[T.Cármenes] | 11.2344 | 14.118 | 0.796 | 0.426 | -16.439 | 38.907 |
C(neighbourhood)[T.Delicias] | 19.5064 | 12.030 | 1.622 | 0.105 | -4.073 | 43.086 |
C(neighbourhood)[T.El Goloso] | 35.2039 | 19.945 | 1.765 | 0.078 | -3.891 | 74.299 |
C(neighbourhood)[T.El Pardo] | 13.4382 | 50.560 | 0.266 | 0.790 | -85.665 | 112.542 |
C(neighbourhood)[T.El Plantío] | 7.5515 | 20.777 | 0.363 | 0.716 | -33.173 | 48.276 |
C(neighbourhood)[T.El Viso] | 51.7654 | 13.119 | 3.946 | 0.000 | 26.050 | 77.481 |
C(neighbourhood)[T.Embajadores] | 38.6202 | 11.372 | 3.396 | 0.001 | 16.330 | 60.911 |
C(neighbourhood)[T.Entrevías] | -6.2525 | 15.606 | -0.401 | 0.689 | -36.842 | 24.337 |
C(neighbourhood)[T.Estrella] | 36.5910 | 16.722 | 2.188 | 0.029 | 3.813 | 69.369 |
C(neighbourhood)[T.Fontarrón] | -7.4839 | 14.875 | -0.503 | 0.615 | -36.641 | 21.673 |
C(neighbourhood)[T.Fuente del Berro] | 31.4966 | 12.699 | 2.480 | 0.013 | 6.605 | 56.388 |
C(neighbourhood)[T.Gaztambide] | 32.1650 | 12.138 | 2.650 | 0.008 | 8.373 | 55.957 |
C(neighbourhood)[T.Goya] | 49.8259 | 11.759 | 4.237 | 0.000 | 26.776 | 72.876 |
C(neighbourhood)[T.Guindalera] | 31.4116 | 11.845 | 2.652 | 0.008 | 8.195 | 54.628 |
C(neighbourhood)[T.Hellín] | -5.6841 | 18.674 | -0.304 | 0.761 | -42.287 | 30.919 |
C(neighbourhood)[T.Hispanoamérica] | 46.2799 | 12.660 | 3.655 | 0.000 | 21.464 | 71.096 |
C(neighbourhood)[T.Horcajo] | 2.945e-14 | 1.59e-14 | 1.856 | 0.063 | -1.64e-15 | 6.05e-14 |
C(neighbourhood)[T.Ibiza] | 31.7360 | 11.981 | 2.649 | 0.008 | 8.251 | 55.221 |
C(neighbourhood)[T.Imperial] | 22.3133 | 12.179 | 1.832 | 0.067 | -1.560 | 46.186 |
C(neighbourhood)[T.Jerónimos] | 67.9640 | 12.699 | 5.352 | 0.000 | 43.073 | 92.855 |
C(neighbourhood)[T.Justicia] | 51.9468 | 11.446 | 4.538 | 0.000 | 29.510 | 74.383 |
C(neighbourhood)[T.La Paz] | 2.0962 | 15.603 | 0.134 | 0.893 | -28.488 | 32.681 |
C(neighbourhood)[T.Legazpi] | 21.0025 | 14.121 | 1.487 | 0.137 | -6.677 | 48.682 |
C(neighbourhood)[T.Lista] | 53.2862 | 12.083 | 4.410 | 0.000 | 29.601 | 76.971 |
C(neighbourhood)[T.Los Angeles] | 6.2183 | 16.726 | 0.372 | 0.710 | -26.568 | 39.004 |
C(neighbourhood)[T.Los Rosales] | 0.6175 | 16.454 | 0.038 | 0.970 | -31.634 | 32.869 |
C(neighbourhood)[T.Lucero] | -7.5381 | 13.059 | -0.577 | 0.564 | -33.136 | 18.059 |
C(neighbourhood)[T.Marroquina] | 8.7202 | 17.027 | 0.512 | 0.609 | -24.654 | 42.095 |
C(neighbourhood)[T.Media Legua] | -1.8909 | 17.742 | -0.107 | 0.915 | -36.668 | 32.886 |
C(neighbourhood)[T.Mirasierra] | 1.0352 | 15.995 | 0.065 | 0.948 | -30.317 | 32.388 |
C(neighbourhood)[T.Moscardó] | 4.3998 | 12.626 | 0.348 | 0.727 | -20.348 | 29.148 |
C(neighbourhood)[T.Niño Jesús] | 50.0095 | 13.582 | 3.682 | 0.000 | 23.387 | 76.632 |
C(neighbourhood)[T.Nueva España] | 35.6868 | 13.059 | 2.733 | 0.006 | 10.090 | 61.284 |
C(neighbourhood)[T.Numancia] | 11.3359 | 12.186 | 0.930 | 0.352 | -12.550 | 35.222 |
C(neighbourhood)[T.Opañel] | 24.7965 | 12.660 | 1.959 | 0.050 | -0.019 | 49.612 |
C(neighbourhood)[T.Orcasitas] | -42.1824 | 17.364 | -2.429 | 0.015 | -76.218 | -8.147 |
C(neighbourhood)[T.Orcasur] | 11.9868 | 13.359 | 0.897 | 0.370 | -14.198 | 38.172 |
C(neighbourhood)[T.Pacífico] | 13.8451 | 11.828 | 1.171 | 0.242 | -9.340 | 37.030 |
C(neighbourhood)[T.Palacio] | 52.5310 | 11.398 | 4.609 | 0.000 | 30.189 | 74.873 |
C(neighbourhood)[T.Palomas] | 40.7050 | 23.084 | 1.763 | 0.078 | -4.542 | 85.952 |
C(neighbourhood)[T.Palomeras Bajas] | -10.6048 | 13.682 | -0.775 | 0.438 | -37.423 | 16.213 |
C(neighbourhood)[T.Palomeras Sureste] | -1.8211 | 15.605 | -0.117 | 0.907 | -32.408 | 28.766 |
C(neighbourhood)[T.Palos de Moguer] | 34.0250 | 11.694 | 2.910 | 0.004 | 11.103 | 56.946 |
C(neighbourhood)[T.Pavones] | -29.6977 | 24.777 | -1.199 | 0.231 | -78.265 | 18.869 |
C(neighbourhood)[T.Peñagrande] | 1.2636 | 15.439 | 0.082 | 0.935 | -29.000 | 31.527 |
C(neighbourhood)[T.Pilar] | 7.8029 | 13.736 | 0.568 | 0.570 | -19.122 | 34.728 |
C(neighbourhood)[T.Pinar del Rey] | 16.8362 | 12.486 | 1.348 | 0.178 | -7.638 | 41.310 |
C(neighbourhood)[T.Piovera] | 36.5582 | 15.793 | 2.315 | 0.021 | 5.602 | 67.514 |
C(neighbourhood)[T.Portazgo] | -15.9508 | 16.454 | -0.969 | 0.332 | -48.203 | 16.302 |
C(neighbourhood)[T.Pradolongo] | 6.6079 | 15.788 | 0.419 | 0.676 | -24.338 | 37.554 |
C(neighbourhood)[T.Prosperidad] | 28.2371 | 12.138 | 2.326 | 0.020 | 4.446 | 52.029 |
C(neighbourhood)[T.Pueblo Nuevo] | 4.7364 | 12.229 | 0.387 | 0.699 | -19.233 | 28.706 |
C(neighbourhood)[T.Puerta Bonita] | -7.4593 | 13.910 | -0.536 | 0.592 | -34.725 | 19.807 |
C(neighbourhood)[T.Puerta del Angel] | 2.9935 | 11.882 | 0.252 | 0.801 | -20.297 | 26.284 |
C(neighbourhood)[T.Quintana] | 4.7823 | 12.633 | 0.379 | 0.705 | -19.979 | 29.544 |
C(neighbourhood)[T.Recoletos] | 69.0001 | 11.885 | 5.806 | 0.000 | 45.704 | 92.296 |
C(neighbourhood)[T.Rejas] | 14.7888 | 13.001 | 1.138 | 0.255 | -10.694 | 40.272 |
C(neighbourhood)[T.Rios Rosas] | 43.8075 | 12.103 | 3.619 | 0.000 | 20.083 | 67.532 |
C(neighbourhood)[T.Rosas] | 9.7475 | 14.878 | 0.655 | 0.512 | -19.416 | 38.911 |
C(neighbourhood)[T.Salvador] | -2.8517 | 18.676 | -0.153 | 0.879 | -39.459 | 33.755 |
C(neighbourhood)[T.San Andrés] | 7.3299 | 13.910 | 0.527 | 0.598 | -19.936 | 34.596 |
C(neighbourhood)[T.San Cristobal] | -10.3517 | 19.946 | -0.519 | 0.604 | -49.449 | 28.745 |
C(neighbourhood)[T.San Diego] | 11.9146 | 12.080 | 0.986 | 0.324 | -11.763 | 35.592 |
C(neighbourhood)[T.San Fermín] | -7.8443 | 14.765 | -0.531 | 0.595 | -36.786 | 21.097 |
C(neighbourhood)[T.San Isidro] | 14.6513 | 12.156 | 1.205 | 0.228 | -9.176 | 38.479 |
C(neighbourhood)[T.San Juan Bautista] | 30.3704 | 15.283 | 1.987 | 0.047 | 0.414 | 60.327 |
C(neighbourhood)[T.San Pascual] | 9.5865 | 14.651 | 0.654 | 0.513 | -19.132 | 38.305 |
C(neighbourhood)[T.Santa Eugenia] | -8.5277 | 20.777 | -0.410 | 0.681 | -49.254 | 32.199 |
C(neighbourhood)[T.Simancas] | 27.9041 | 13.089 | 2.132 | 0.033 | 2.249 | 53.559 |
C(neighbourhood)[T.Sol] | 64.7310 | 11.439 | 5.659 | 0.000 | 42.309 | 87.153 |
C(neighbourhood)[T.Timón] | 3.8465 | 13.681 | 0.281 | 0.779 | -22.971 | 30.664 |
C(neighbourhood)[T.Trafalgar] | 40.9136 | 11.729 | 3.488 | 0.000 | 17.924 | 63.903 |
C(neighbourhood)[T.Universidad] | 52.6676 | 11.386 | 4.626 | 0.000 | 30.349 | 74.986 |
C(neighbourhood)[T.Valdeacederas] | 9.9176 | 12.750 | 0.778 | 0.437 | -15.073 | 34.909 |
C(neighbourhood)[T.Valdefuentes] | 24.6241 | 12.369 | 1.991 | 0.047 | 0.379 | 48.869 |
C(neighbourhood)[T.Valdemarín] | 22.7797 | 36.637 | 0.622 | 0.534 | -49.033 | 94.593 |
C(neighbourhood)[T.Valdezarza] | 0.6239 | 13.180 | 0.047 | 0.962 | -25.211 | 26.458 |
C(neighbourhood)[T.Vallehermoso] | 42.3945 | 13.731 | 3.087 | 0.002 | 15.480 | 69.309 |
C(neighbourhood)[T.Valverde] | 5.2359 | 12.791 | 0.409 | 0.682 | -19.836 | 30.308 |
C(neighbourhood)[T.Ventas] | 3.1014 | 12.239 | 0.253 | 0.800 | -20.888 | 27.091 |
C(neighbourhood)[T.Vinateros] | -6.8132 | 19.944 | -0.342 | 0.733 | -45.906 | 32.280 |
C(neighbourhood)[T.Vista Alegre] | 0.3320 | 12.711 | 0.026 | 0.979 | -24.583 | 25.247 |
C(neighbourhood)[T.Zofío] | 10.2001 | 16.455 | 0.620 | 0.535 | -22.055 | 42.455 |
bedrooms | 36.4822 | 0.524 | 69.672 | 0.000 | 35.456 | 37.509 |
number_of_reviews | -0.0417 | 0.005 | -8.227 | 0.000 | -0.052 | -0.032 |
review_scores_rating | 3.2728 | 0.706 | 4.639 | 0.000 | 1.890 | 4.656 |
Omnibus: | 2562.636 | Durbin-Watson: | 1.865 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 32841.808 |
Skew: | 0.473 | Prob(JB): | 0.00 |
Kurtosis: | 10.343 | Cond. No. | 2.42e+18 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.47e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
You try it
By how many euros, on average, do the model's predictions miss the actual price?
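The question asks for the mean absolute error of the predictions. A minimal sketch on made-up numbers (with the fitted statsmodels results object you would instead take `results.resid.abs().mean()`):

```python
# Mean absolute error: the average number of euros by which the
# predictions miss the observed prices. The values below are
# invented for illustration.
actual = [100.0, 80.0, 120.0]      # hypothetical observed prices
predicted = [90.0, 85.0, 130.0]    # hypothetical model predictions

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # → 8.333...
```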
Extra: grouping and aggregation
The following functions group or aggregate data tables into new ones. We have already seen the first three, but the others may be useful as well:
- `value_counts` extracts a frequency table as a series. The table contains the counts of the occurrences of every value of a given series. It does not include missing values.
- `pivot_table` extracts a spreadsheet-style pivot table as a data frame. For a Pandas data frame `df`, the syntax is `pd.pivot_table(df, values=cname1, index=cname2)`. This returns a one-way table containing the average value of `cname1` for the groups defined by `cname2`. Instead of the average, you can get a different summary by adding an argument `aggfunc=fname`. With an additional argument `columns=cname3`, you get a two-way table. For two-way tables, it works the same as `crosstab`, but it only applies to columns from the same data frame.
- `groupby` groups the rows of a data frame so that an aggregation function can be applied to every group, extracting a SQL-like table as a data frame. The syntax is `df.groupby(by=cname).fname()`, where `fname` is `mean`, `sum`, `count`, etc.
- `crosstab` extracts a simple cross tabulation as a Pandas data frame. For two series of the same length `s1` and `s2`, the syntax is `pd.crosstab(s1, s2)`. Then `s1` will be placed on the rows and `s2` on the columns. By default, `crosstab` extracts a frequency table, unless an array of values and an aggregation function are passed, e.g., as `values=s3` and `aggfunc=fname`.
- Finally, if you actually know some SQL and would like to use that syntax instead, you can use a package like `fugue` or `pandasql`.
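To make the correspondence between these functions concrete, here is a minimal sketch on a made-up data frame (the column names mimic the AirBnB data, but the values are invented):

```python
import pandas as pd

# A tiny stand-in for the AirBnB data (hypothetical values)
df = pd.DataFrame({
    'neighbourhood': ['Sol', 'Sol', 'Palacio', 'Palacio', 'Sol'],
    'room_type': ['Private', 'Entire', 'Entire', 'Entire', 'Private'],
    'price': [50, 120, 100, 90, 60],
})

# value_counts: frequency table as a series
counts = df['neighbourhood'].value_counts()

# pivot_table: one-way table with the average price per neighbourhood
avg = pd.pivot_table(df, values='price', index='neighbourhood')

# groupby: the same summary, SQL-style
avg2 = df.groupby(by='neighbourhood')['price'].mean()

# crosstab: two-way frequency table of neighbourhood vs room type
ct = pd.crosstab(df['neighbourhood'], df['room_type'])
```

Note that `avg` and `avg2` contain the same numbers: `pivot_table` and `groupby` are two routes to the same aggregation.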
You try it
There are many paths to the same solution in pandas. A great exercise to realize this is to try to mimic the output of value_counts()
using a groupby()
instead; note that the aggregation function would be … count
.
# How would you do this with a groupby?
df.neighbourhood.value_counts()
Embajadores 2573
Universidad 2010
Palacio 1680
Sol 1187
Justicia 1049
...
Palomas 7
Corralejos 6
Valdemarín 5
El Pardo 1
Horcajo 1
Name: neighbourhood, Length: 126, dtype: int64
Solution
df.groupby('neighbourhood')['name'].agg('count').sort_values(ascending=False)
neighbourhood
Embajadores 2573
Universidad 2009
Palacio 1679
Sol 1187
Justicia 1049
...
Palomas 7
Corralejos 6
Valdemarín 5
El Pardo 1
Horcajo 1
Name: name, Length: 126, dtype: int64
Extra: performing SQL on dataframes
If you are already familiar with SQL, you may find it more convenient to use it instead of the grouping functions presented above. Here is a simple example of how to run an SQL query against a pandas data frame.
! pip install fugue
from fugue_sql import fsql
# fugue does not allow indices, so we must remove it
table = df.reset_index()
# Queries are very sql-like, but have some unique syntax
# at the end. Here, we are asking for a dataframe called
# df_out.
query_1 = """
SELECT COUNT(*) AS cnt FROM table
GROUP BY host_id
ORDER BY cnt DESC
YIELD DATAFRAME AS df_out
"""
# result is a dictionary containing all the requested outputs
# in this case, just df_out
result = fsql(query_1).run()
result['df_out'].native.head(10)
 | cnt |
---|---|
9343 | 263 |
9114 | 148 |
10457 | 143 |
9861 | 127 |
2572 | 101 |
8603 | 97 |
5315 | 96 |
5324 | 96 |
3292 | 86 |
1874 | 80 |
References
Adapted from: Miguel Ángel Canela