From Data to Model: AirBnB

AirBnB is an online marketplace and homestay network that allows individuals to list or rent short-term lodging in residential properties. The cost of accommodation is set by the property owner, and the company charges percentage service fees from both guests and hosts for every booking made. Since its inception in 2008, Airbnb has experienced exponential growth and currently boasts over 2 million listings across approximately 200 countries.

Notably, AirBnB regularly updates and releases data on its platform. Interested parties can access this data via the Inside Airbnb site (insideairbnb.com). The most recent update, posted last month, covers more than 100 areas, most of them in the US and Europe. For the purpose of this example, we will use local data, captured on May 9th, 2023.

📁 Loading the data

Description

The file AirBnB-Madrid.csv contains data on 18,302 local Airbnb listings. The language in the descriptions is typically English or Spanish (with exceptions). The text comes in UTF-8 encoding, so special characters (as in Spanish words such as “habitación”) may not be displayed correctly on Windows machines and in Microsoft applications, which default to other encodings.

The variables are:

  • listing_id, a unique listing’s ID. An active listing is a property listed on Airbnb. Listings may include entire homes or apartments, private rooms or shared spaces.

  • host_id, a unique host’s ID.

  • host_since, the date of the host’s first listing in Airbnb, as yyyy-mm-dd.

  • name, the listing’s name. A minimal description (maximum 35 characters) of the place, intended to be appealing, such as “Centric Bohemian next Ramblas&Macba”.

  • neighbourhood, the neighbourhood of the listing. The neighbourhoods are sourced from city or open source GIS files.

  • property_type, the type of property listed. Typically ‘Appartment’, ‘Bed & Breakfast’ or ‘House’, but it can also be ‘Boat’, ‘Loft’, or others.

  • room_type, taking values ‘Entire home/apt’, ‘Private room’ and ‘Shared room’.

  • bedrooms, the number of available bedrooms.

  • price, the listing’s daily price on that date, in euros. The price shown is for the listing as a whole, not per person. The price you see when you search Airbnb (with dates selected) is the total price divided by the number of nights selected. When a listing is booked for several days, the price can be lower, since the host can apply different prices depending on the number of days booked.

  • number_of_reviews, the number of reviews of this listing that have been posted.

  • review_scores_rating, the average reviewers’ rating of overall experience (what was the guest’s overall experience?). Listings are rated on a 0-5 scale.

Source: Inside Airbnb, edited.

Importing the data

Let’s import the data into a Pandas data frame, taking listing_id as the index and leaving the rest at its default settings. You can download it here.

import pandas as pd
fname = './resources/AirBnB-Madrid.csv'
df    = pd.read_csv(fname, index_col='listing_id')

Our Data Processing Pipeline

flowchart LR
  A[Import Data] 
  A --> B(Check Duplicates)
  A --> C(Handle Null Values)
  A --> C2(...)

  B --> Clean[Clean Data]
  C --> Clean
  C2 --> Clean
  
  Clean --> D[Descriptive]
  D --> E[Modeling]

🧹 Data Cleaning

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 21239 entries, 6369 to 843551963497475593
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   host_id               21239 non-null  int64  
 1   host_since            21236 non-null  object 
 2   name                  21235 non-null  object 
 3   neighbourhood         21239 non-null  object 
 4   property_type         21239 non-null  object 
 5   room_type             21239 non-null  object 
 6   bedrooms              19667 non-null  float64
 7   price                 21239 non-null  float64
 8   number_of_reviews     21239 non-null  int64  
 9   review_scores_rating  17168 non-null  float64
dtypes: float64(3), int64(2), object(5)
memory usage: 1.8+ MB
df.describe()
             host_id      bedrooms         price  number_of_reviews  review_scores_rating
count   2.123900e+04  19667.000000  21239.000000       21239.000000          17168.000000
mean    1.818276e+08      1.475467    139.106173          42.764396              4.596427
std     1.626871e+08      0.846230    660.313831          78.737076              0.621636
min     7.952000e+03      1.000000      0.000000           0.000000              0.000000
25%     3.255115e+07      1.000000     50.000000           1.000000              4.500000
50%     1.289901e+08      1.000000     85.000000          10.000000              4.750000
75%     3.205107e+08      2.000000    133.000000          47.000000              4.920000
max     5.050783e+08     18.000000  71577.000000         889.000000              5.000000

We can also take a quick look at the first and last rows with head and tail:

df.head(3)
df.tail(3)
                      host_id  host_since                           name  neighbourhood                property_type     room_type  bedrooms  price  number_of_reviews  review_scores_rating
listing_id
843487874169862949  230936437  2018-12-16  Habitación privada en Madrid.       Valverde  Private room in rental unit  Private room       1.0   46.0                  1                   5.0
843544239423340038  455852114  2022-04-24     Habitación Privada Confort         Ventas  Private room in rental unit  Private room       1.0   28.0                  1                   5.0
843551963497475593  504579859  2023-03-09  Habitación en piso compartido       Imperial  Private room in rental unit  Private room       1.0   60.0                  0                   NaN

This already tells us that some columns have missing values. Missing data is a fact of life in data science, so we need to be able to handle it.

Missing values

Missing values are denoted by NaN in Pandas. When a Pandas object is built, both plain Python’s None and NumPy’s nan are taken as NaN. Since np.nan has type float, a numeric series containing NaN values gets type float.
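A quick check of this behaviour, on a minimal toy series:

```python
import numpy as np
import pandas as pd

# None and np.nan are both stored as NaN; the series becomes float
s = pd.Series([1, None, np.nan])
print(s.dtype)         # float64
print(s.isna().sum())  # 2 missing entries
```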

Three useful Pandas methods related to missing values, which can be applied to both series and data frames, are:

  • isna returns a Boolean mask indicating which terms are missing.
  • fillna replaces NaN’s by a fixed value, set by the user.
  • dropna returns the same data frame minus the rows that contain at least one missing value. If a list of columns is specified, missing values are searched only in those columns.
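A minimal illustration of the three methods, on toy data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])
print(s.isna().tolist())     # [False, True, False, True]
print(s.fillna(0).tolist())  # [1.0, 0.0, 3.0, 0.0]
print(s.dropna().tolist())   # [1.0, 3.0]

# On a data frame, dropna can be restricted to certain columns
frame = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, 2]})
print(frame.dropna(subset=['a']).shape)  # (1, 2): only the row with a missing 'a' is dropped
```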

For this dataset, in particular, we have a relevant number of listings for which rating scores are not available (many guests don’t rate the lodgings). A report on all missing values can be printed with:

df.isna().sum()
host_id                    0
host_since                 3
name                       4
neighbourhood              0
property_type              0
room_type                  0
bedrooms                1572
price                      0
number_of_reviews          0
review_scores_rating    4071
dtype: int64

Note that df.isna() is a data frame of the same shape as df, whose terms have type bool. With sum, we get the number of True values per column.

We can drop the NA’s with the dropna() command. Since I’m not going to analyze ratings or bedrooms for now, I will only drop NA’s for name and host_since.

print('Before dropping NAs:', df.shape)
df = df.dropna(subset=['name','host_since'])
print('After dropping NAs:', df.shape)
Before dropping NAs: (21239, 10)
After dropping NAs: (21232, 10)

Duplicates

There are two useful Pandas functions for managing duplicates:

  • drop_duplicates drops the duplicated entries (in a series) or the duplicated rows (in a data frame).
  • duplicated returns a Boolean series indicating which entries (for a series) or which rows (for a data frame) are duplicated.

For instance, the listing’s ID in this data set is never duplicated, so it makes sense to use it as the index:

df.index.duplicated().sum()
0

But a different question is whether several listings share exactly the same data (this can happen, e.g., when they refer to beds in a shared bedroom). How often does this occur in our dataset?

df.duplicated().sum()
22

How does duplicated work? It reads the data top-down. For entries that occur for the first time, it returns False. For terms that have already occurred before, it returns True.
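A small sketch of this behaviour on a toy series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'c'])
# Only repeats of an entry seen earlier are flagged
print(s.duplicated().tolist())       # [False, False, True, True, False]
print(s.drop_duplicates().tolist())  # ['a', 'b', 'c']
```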

Let us now permanently remove the duplicates from the dataframe using the drop_duplicates function.

print('Before dropping duplicates:', df.shape)
df = df.drop_duplicates()
print('After dropping duplicates:', df.shape)

df.shape
Before dropping duplicates: (21232, 10)
After dropping duplicates: (21210, 10)
(21210, 10)

Problem (?) solved!

It’s good practice to save the dataframe after preprocessing; this lets you continue with a clean dataset going forward.

df.to_csv('./resources/AirBnB-cleaned.csv')

💡 Descriptive Analytics

We reload the dataset.

df = pd.read_csv('./resources/AirBnB-cleaned.csv', index_col='listing_id')

Much of a data scientist’s work comes down to manipulating the data to answer specific business questions. Most of these operations are either simple statistics applied to the whole dataset, or simple statistics applied to subgroups of it.

Let’s try to answer a few questions now to see examples of such procedures:

Is AirBnB still P2P?

Airbnb was initially designed to function as a peer-to-peer (P2P) network, where individuals could rent out their apartments to others. However, some critics argue that it has now evolved into a platform that allows investors with large real estate portfolios to offer their properties for rent. The question remains: is this the case here?

One way to investigate this is by examining the ratio of the number of listings to the number of hosts on the platform. By extracting the list of unique host IDs as the index of the series returned by value_counts, we can obtain a ranked list of hosts based on the number of listings they have. Alternatively, we can obtain an array of these IDs using the unique method.

df['host_id'].unique()
array([    13660,     83531, 346366726, ..., 230936437, 455852114,
       504579859])

To count them:

# option 1: look at the shape
df['host_id'].unique().shape

# option 2: use nunique()
df['host_id'].nunique()
11073

That is the number of hosts. They account for this number of listings:

df.shape
(21210, 10)

So how many hosts are actually following the original AirBnB philosophy?

Let’s take a look at the original dataframe. There, too, we can check the value counts to see the number of listings per host. Then we count the hosts that have just one property.

df['host_id'].value_counts()
host_id
377605855    263
346367515    148
471977881    143
434208714    127
28038703     101
            ... 
205944086      1
216063028      1
11857294       1
53974722       1
504579859      1
Name: count, Length: 11073, dtype: int64

Let’s see how many have just 1:

(df['host_id'].value_counts() == 1).sum()
8503

We can easily identify the hosts with the biggest portfolios:

df['host_id'].value_counts().head(10)
host_id
377605855    263
346367515    148
471977881    143
434208714    127
28038703     101
291253690     97
107335103     96
107141654     96
40877988      86
17499500      80
Name: count, dtype: int64

You try it

The number of hosts with more than 10 listings is:

...
Ellipsis
Solution
(df['host_id'].value_counts() > 10).sum()
194

How are experienced hosts behaving?

The file contains some information about the hosts, that is, the people renting out the lodgings. For instance, you may be interested in the more experienced hosts, using a Boolean mask like df['host_since'] < '2012-01-01' to select their listings.

filter_experienced = df['host_since'] < '2012-01-01'
df_experienced = df[filter_experienced]

df_experienced.shape
(437, 10)
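Note that host_since is stored as strings (dtype object), and the comparison above works because ISO yyyy-mm-dd dates sort the same way as the strings themselves. Converting with pd.to_datetime is a more robust alternative; a small sketch on two toy dates:

```python
import pandas as pd

dates = pd.Series(['2011-06-30', '2015-01-15'])

# String comparison: works for ISO-formatted dates only
mask_str = dates < '2012-01-01'
# Datetime comparison: works regardless of formatting
mask_date = pd.to_datetime(dates) < pd.Timestamp('2012-01-01')

print(mask_str.tolist())   # [True, False]
print(mask_date.tolist())  # [True, False]
```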

Looks like there are 437 listings from hosts that joined before 2012. How many listings does each of them have? Let’s use the value_counts function to find out.

You try it

Count how many listings each host that joined before 2012 has:

  1. Figure out which hosts joined before 2012.
  2. Use this as a filter.
  3. Apply value_counts to the resulting dataframe.
...
Ellipsis
Solution
filter_experienced = df['host_since'] < '2012-01-01'
df_experienced     = df[filter_experienced]
df_experienced['host_id'].value_counts()
host_id
1408525    65
901218     36
596469     26
142578     11
1108348     9
           ..
606925      1
1206817     1
70273       1
901541      1
1499853     1
Name: count, Length: 179, dtype: int64

So, we have quite a few hosts in this list. Many of them have one listing, but the top hosts look like companies. You may wonder how frequent this is.

Distribution of the price

What’s the distribution of the listing’s price? We can explore this point with a histogram. The histogram is returned by the Pandas method plot.hist().

df['price'].plot.hist();

The distribution of the prices (naive version). This is not really useful, since a few very expensive listings distort the whole picture. But we can trim the data, by dropping the most expensive listings, to get a better view.

A statistical summary, as the one returned by df.describe(), can be helpful:

df['price'].describe()
count    21210.000000
mean       139.201367
std        660.758099
min          0.000000
25%         50.000000
50%         85.000000
75%        133.000000
max      71577.000000
Name: price, dtype: float64

Alternatively, we can compute other quantiles using df.quantile():

quantiles = df['price'].quantile(q=[0.05,0.95])
quantiles
0.05     23.0
0.95    331.0
Name: price, dtype: float64

You try it

These 5-95% quantiles look like reasonable limits for trimming. Try plotting the distribution for just this part of the dataset.

  1. Calculate the quantiles
  2. Filter out bad data (outliers)
  3. Plot the distribution

The goal is to recreate the plot below:

...
Ellipsis
Solution
df_no_outlier = df[(df['price'] >= 23) & (df['price'] <= 331)]
df_no_outlier.price.plot.hist(figsize=(8,4), color='gray', rwidth=0.9, bins=40).set_xlim(50,320);

Fancy Solution with Bar Highlights
import matplotlib.pyplot as plt

# I start at 25 here to ensure that round number bars are prominently shown
df_no_outlier = df[(df['price'] >= 25) & (df['price'] <= 331)]
df_no_outlier.price.plot.hist(figsize=(8,4), color='gray', rwidth=0.9, bins=40);

# We repeat the plot, but this time keep track of the bars 
hist_data, bins, bars = plt.hist(df_no_outlier.price, bins=40, rwidth=0.9, color='gray')

# Then, we define the index of the bar that we want to highlight
highlight_bins = [0, 9, 16, 22,29,35] 

# Loop through them and change their fc (face color)
for i in highlight_bins:
    bars[i].set_fc('red')

#plt.show()

plt.gca().set_xlim(50,320);

Hint: To make it look more like mine, use the bins argument, which specifies the number of intervals into which the range covered by the data is partitioned. Note how intervals starting at multiples of 10 have higher bars. This is not rare: prices don’t change continuously, and hosts prefer prices that are multiples of 10. The histogram also shows that, e.g., 50, 100 and 150 euros are popular prices.

Note. The rule for counting the observations in every bin is as follows: for an interval with limits a and b, the values x such that a <= x < b are counted, except for the last interval, for which the right limit is also included.
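This is the same rule that NumPy’s histogram function follows, so we can check it on toy numbers:

```python
import numpy as np

# Three bins over [0, 3]: the intervals are [0,1), [1,2), [2,3]
counts, edges = np.histogram([0, 1, 2, 3], bins=3, range=(0, 3))
print(edges)   # [0. 1. 2. 3.]
print(counts)  # [1 1 2]: the value 3 lands in the last bin, whose right limit is closed
```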

Average price per room type

Suppose that you are interested in the average price per room type. You can get this by applying groupby() which works as follows:

This method groups the rows according to one or more grouping variables. Here, you specify groupby(by='room_type').

Next, you select the columns to be aggregated, which, here would be just price. Finally, you specify the aggregation function. Having only the mean, you can do it just by adding mean() or by specifying .agg('mean'). The method round() works as you would expect it to.

df.groupby('room_type')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa6660e2690>
df.groupby('room_type')['price']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fa6604a0f10>
df.groupby('room_type')['price'].mean()
room_type
Entire home/apt    160.338694
Hotel room         204.131148
Private room        98.362606
Shared room        104.191638
Name: price, dtype: float64

The above way of writing a groupby is a shorthand notation for the full version using agg as shown below:

df.groupby(by='room_type')['price'].agg('mean').round()
room_type
Entire home/apt    160.0
Hotel room         204.0
Private room        98.0
Shared room        104.0
Name: price, dtype: float64

While the differences in prices across room types may be as expected (does anything stand out?), the average price alone may not represent the data accurately. Skewed distributions can contain extreme values in the right tail, which pull the mean away from the central tendency of the data. In such cases, the median price provides a more representative measure of the mid-point of the distribution.
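A tiny numeric illustration of why the mean misleads with skewed data:

```python
import pandas as pd

# A few 'ordinary' prices plus one extreme listing
prices = pd.Series([40, 45, 50, 55, 60, 5000])
print(prices.mean())    # 875.0: dragged up by the outlier
print(prices.median())  # 52.5: barely affected
```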

To calculate the median price per room type, simply replace the aggregation function mean with median. Similar rules apply for the other ‘usual’ statistics:

df.groupby(by='room_type')['price'].agg(['min','max','mean','median']).round()
                  min      max   mean  median
room_type
Entire home/apt  10.0  71577.0  160.0   104.0
Hotel room        0.0   1149.0  204.0   125.0
Private room      8.0  40714.0   98.0    40.0
Shared room      11.0   3000.0  104.0    28.0

You can even provide your own functions!

# price_series is a series passed to the groupby function
# it is applied to every split-group
def myquantiles(price_series):
    return [
        price_series.quantile(q=.5).round(),
        price_series.quantile(q=.95).round()
    ]

df.groupby(by='room_type')['price'].agg(['min','max','mean',myquantiles]).round()
                  min      max   mean      myquantiles
room_type
Entire home/apt  10.0  71577.0  160.0   [104.0, 324.0]
Hotel room        0.0   1149.0  204.0  [125.0, 1000.0]
Private room      8.0  40714.0   98.0    [40.0, 350.0]
Shared room      11.0   3000.0  104.0    [28.0, 481.0]

Note that myquantiles is called in the ‘gear’ / agg step and is thus applied to every single split-group separately. The input (price_series) is the series of prices for that specific group; the output is the pair of quantiles for that group. These outputs are merged into one column called myquantiles.

The aggregation functions are applied to each column separately. If instead, you would like to use all columns to do some advanced calculation, you can use the `.apply(myfunction)` method instead of the `.agg(myfunction)` method.
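A minimal sketch of the difference, on a hypothetical toy frame: agg sees one column at a time, while apply receives the whole sub-frame for each group and can therefore combine columns.

```python
import pandas as pd

# Hypothetical toy data, just to illustrate apply vs agg
df_toy = pd.DataFrame({
    'room_type': ['Private', 'Private', 'Entire', 'Entire'],
    'price':     [40, 60, 100, 140],
    'bedrooms':  [1, 1, 2, 4],
})

# apply receives the whole sub-frame per group,
# so the function can combine several columns
def price_per_bedroom(group):
    return (group['price'] / group['bedrooms']).mean()

print(df_toy.groupby('room_type').apply(price_per_bedroom))
# Entire: (100/2 + 140/4) / 2 = 42.5, Private: (40/1 + 60/1) / 2 = 50.0
```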

The results obtained from using the median price provide a more meaningful perspective compared to the earlier results, where shared rooms appeared to be more expensive than private rooms. Using the median price allows for a more accurate representation of the mid-point of the price distribution and reduces the influence of extreme values that can bias the results.

Top-10 neighbourhoods

A natural analysis of the local Airbnb data is to take a look at the most popular neighbourhoods by number of listings. The top-ten list can be extracted with value_counts:

df['neighbourhood'].value_counts().head(10)
neighbourhood
Embajadores        2573
Universidad        2009
Palacio            1679
Sol                1187
Justicia           1049
Cortes              931
Trafalgar           394
Palos de Moguer     373
Goya                341
Argüelles           312
Name: count, dtype: int64

Are these neighbourhoods more, or perhaps less, expensive? One could wonder whether there is an association between the median price and the number of properties listed (following the law of supply and demand).

nb_counts = df.groupby(by='neighbourhood')['price'].agg(['count','mean','median'])

nb_counts.sort_values('count', ascending=False).head(10)
                 count        mean  median
neighbourhood
Embajadores       2573  145.029149    86.0
Universidad       2009  127.534594    96.0
Palacio           1679  144.176891   100.0
Sol               1187  153.972199   112.0
Justicia          1049  151.954242   100.0
Cortes             931  149.902256   115.0
Trafalgar          394  192.180203    95.5
Palos de Moguer    373  206.426273    82.0
Goya               341  211.741935   116.0
Argüelles          312  127.147436    90.5

Sometimes, we prefer to name the columns or use custom functions that are not built in by default. This can be done by providing a dict instead of a list to the agg function. The keys are the names of the columns; the values are the calculations to be done:

def my_property_function(x):
    return x.mode()

df.groupby(by='neighbourhood').agg({
    'price'        : ['count', 'median'],
    'property_type': [my_property_function]
})
              price                       property_type
              count median         my_property_function
neighbourhood
Abrantes         40   37.0  Private room in rental unit
Acacias         169   64.0           Entire rental unit
Adelfas          96   70.0           Entire rental unit
Aeropuerto        8   30.5  Private room in rental unit
Aguilas          41   45.0  Private room in rental unit
...             ...    ...                          ...
Valverde         95   65.0           Entire rental unit
Ventas          174   50.0  Private room in rental unit
Vinateros        23   40.0  Private room in rental unit
Vista Alegre    124   48.5  Private room in rental unit
Zofío            26   68.5           Entire rental unit

126 rows × 3 columns

You try it

What is the actual correlation between price and number of listings? Follow these steps to find out:

  1. Do a group-by to get the counts and the median price:
    • group by neighbourhood
    • extract the price as a Series
    • aggregate to get the count and the median
  2. use the .corr() function to obtain the correlation.

Your goal is to produce the following table:

...
Ellipsis
Solution
df.groupby(by='neighbourhood')['price'].agg(['count','median']).corr()
Correlation Matrix of Key Statistics

           count    median
count   1.000000  0.177183
median  0.177183  1.000000

df.groupby('neighbourhood')['price'].agg(['count','median','mean']).corr()

           count    median      mean
count   1.000000  0.177183  0.070872
median  0.177183  1.000000  0.600848
mean    0.070872  0.600848  1.000000

So price doesn’t seem to correlate directly with the number of listings; something more must be going on. We’ll try some actual statistics towards the end of the class.

Pivoting

In this dataset too, a pivot table can be handy. Let us try to answer the following question.

You try it

What is the median price of a property in each neighbourhood? Organize by room type:

  • Rows = room type
  • Columns = neighbourhood
  • Values = price

Use the syntax df.pivot_table(index=..., columns=..., values=...)

Straightforward Solution
pivot = df.pivot_table(index='room_type',columns='neighbourhood', values='price', aggfunc='median')
Advanced Solution
# Here's a more complex solution, I will pivot on 2 statistics, both the mean and the count
# I use the count to remove any neighbourhood with very few records as we cannot calculate
# meaningful statistics for those. 
#
# Then, I sort the values. I will use Entire home/apt prices as a leading indicator.

# 1. pivot on both stats
pivot = df.pivot_table(index='room_type',columns='neighbourhood', values='price', aggfunc=['median','count'])

# 2. find out/filter which neighbourhoods have less than 10 recorded prices in Entire home/apt
count_columns       = pivot['count']
valid_neighborhoods = count_columns.loc['Entire home/apt'][count_columns.loc['Entire home/apt'] >= 10].index
filtered_pivot      = pivot.loc[:, pivot.columns.get_level_values(1).isin(valid_neighborhoods)]

# 3. we're only interested in the median price, so we pick that one. Then, we also sort from
#    most to least expensive
filtered_pivot['median'].T.sort_values(by='Entire home/apt', ascending=False).T
neighbourhood     Rosas  Jerónimos  Castellana   Goya  Recoletos  Colina  Niño Jesús    Sol  Rios Rosas  Cortes
room_type
Entire home/apt   165.0      158.0       151.0  141.0      140.0   140.0       135.0  126.0       125.0   125.0
Hotel room          NaN        0.0         NaN    NaN      197.0     NaN         NaN   41.0       106.5   159.0
Private room      375.0       46.5        35.0   43.5       39.0    38.5        46.0   54.0        38.0    60.0
Shared room      1000.0        NaN       168.5    NaN        NaN     NaN        18.0   30.0        42.0    31.0

neighbourhood    ...  Adelfas  Marroquina  Almendrales  Pradolongo  Concepción  Zofío  San Andrés  Aluche  Vista Alegre  Entrevías
room_type
Entire home/apt  ...     75.0        75.0         75.0        73.5        73.0   73.0        70.0    69.5          67.0       66.0
Hotel room       ...      NaN         NaN          NaN         NaN         NaN    NaN       200.0     NaN           NaN        NaN
Private room     ...     35.0        41.0         35.0        41.0        29.0   28.0        28.0    28.0          30.0       32.0
Shared room      ...      NaN         NaN          NaN        17.0       180.0    NaN        11.0    18.0           NaN       15.0

4 rows × 100 columns

📈 Build a regression model to predict prices of AirBnB listings.

Lastly, we can build some basic regression models for the prices of local AirBnB listings. This could allow us to spot over- or under-valued rooms. For home-owners, on the other hand, it could give an indication of the right price for a property.

We start by importing statsmodels as explained in the accompanying document:

import statsmodels.api as sm
import statsmodels.formula.api as smf

We will only look at the dataframe without outliers here.

model1 = smf.ols("""
    price ~ bedrooms + number_of_reviews + review_scores_rating
""", data=df_no_outlier).fit()
model1.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.233
Model:                            OLS   Adj. R-squared:                  0.233
Method:                 Least Squares   F-statistic:                     1455.
Date:                Wed, 27 Sep 2023   Prob (F-statistic):               0.00
Time:                        10:00:46   Log-Likelihood:                -77530.
No. Observations:               14380   AIC:                         1.551e+05
Df Residuals:                   14376   BIC:                         1.551e+05
Df Model:                           3
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               31.6574      3.588      8.822      0.000      24.624      38.691
bedrooms                36.8712      0.559     65.956      0.000      35.775      37.967
number_of_reviews       -0.0025      0.005     -0.473      0.636      -0.013       0.008
review_scores_rating     3.0993      0.755      4.106      0.000       1.620       4.579
==============================================================================
Omnibus:                     2498.921   Durbin-Watson:                   1.858
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            19666.142
Skew:                           0.617   Prob(JB):                         0.00
Kurtosis:                       8.595   Cond. No.                         828.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
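A fitted statsmodels result can also score new data through its predict method. A minimal sketch on hypothetical toy data (not the AirBnB file), fitting a one-variable model and pricing a 2-bedroom listing:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for the real listings, for illustration only
toy = pd.DataFrame({
    'price':    [50, 80, 110, 140, 60, 95],
    'bedrooms': [1, 2, 3, 4, 1, 2],
})
model = smf.ols('price ~ bedrooms', data=toy).fit()

# Predict the price of a hypothetical 2-bedroom listing
print(model.predict(pd.DataFrame({'bedrooms': [2]})).iloc[0])
```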
model2 = smf.ols("""
    price ~ bedrooms + C(neighbourhood) + number_of_reviews + review_scores_rating
""", data=df_no_outlier).fit()
model2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.346
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     59.32
Date:                Wed, 27 Sep 2023   Prob (F-statistic):               0.00
Time:                        10:02:28   Log-Likelihood:                -76386.
No. Observations:               14380   AIC:                         1.530e+05
Df Residuals:                   14252   BIC:                         1.540e+05
Df Model:                         127
Covariance Type:            nonrobust
==============================================================================
                                          coef    std err          t      P>|t|      [0.025      0.975]
Intercept -3.4442 11.702 -0.294 0.769 -26.381 19.493
C(neighbourhood)[T.Acacias] 14.8142 12.185 1.216 0.224 -9.071 38.699
C(neighbourhood)[T.Adelfas] 3.8755 12.717 0.305 0.761 -21.051 28.802
C(neighbourhood)[T.Aeropuerto] -18.0615 27.109 -0.666 0.505 -71.200 35.077
C(neighbourhood)[T.Aguilas] -11.9278 15.604 -0.764 0.445 -42.515 18.659
C(neighbourhood)[T.Alameda de Osuna] 15.2433 14.125 1.079 0.281 -12.443 42.930
C(neighbourhood)[T.Almagro] 58.9526 12.138 4.857 0.000 35.161 82.745
C(neighbourhood)[T.Almenara] 25.3954 12.629 2.011 0.044 0.640 50.150
C(neighbourhood)[T.Almendrales] 11.1726 13.249 0.843 0.399 -14.798 37.143
C(neighbourhood)[T.Aluche] -15.0626 13.910 -1.083 0.279 -42.329 12.203
C(neighbourhood)[T.Ambroz] -23.4862 21.792 -1.078 0.281 -66.201 19.229
C(neighbourhood)[T.Amposta] 9.9295 36.634 0.271 0.786 -61.877 81.736
C(neighbourhood)[T.Apostol Santiago] 12.4785 15.439 0.808 0.419 -17.784 42.741
C(neighbourhood)[T.Arapiles] 28.7075 12.165 2.360 0.018 4.862 52.553
C(neighbourhood)[T.Aravaca] 7.1299 14.281 0.499 0.618 -20.863 35.123
C(neighbourhood)[T.Arcos] -13.7477 14.877 -0.924 0.355 -42.908 15.413
C(neighbourhood)[T.Argüelles] 41.8478 11.832 3.537 0.000 18.655 65.040
C(neighbourhood)[T.Atocha] 35.5168 17.071 2.081 0.037 2.056 68.978
C(neighbourhood)[T.Bellas Vistas] 13.8640 12.187 1.138 0.255 -10.025 37.753
C(neighbourhood)[T.Berruguete] 23.9287 12.495 1.915 0.056 -0.564 48.421
C(neighbourhood)[T.Buenavista] 6.7709 13.283 0.510 0.610 -19.266 32.808
C(neighbourhood)[T.Butarque] 11.5161 18.670 0.617 0.537 -25.080 48.112
C(neighbourhood)[T.Campamento] -18.5179 15.792 -1.173 0.241 -49.472 12.436
C(neighbourhood)[T.Canillas] 0.3081 12.568 0.025 0.980 -24.326 24.942
C(neighbourhood)[T.Canillejas] 8.4796 13.117 0.646 0.518 -17.231 34.190
C(neighbourhood)[T.Casa de Campo] 25.3565 13.443 1.886 0.059 -0.993 51.706
C(neighbourhood)[T.Casco Histórico de Barajas] 31.7824 14.662 2.168 0.030 3.042 60.523
C(neighbourhood)[T.Casco Histórico de Vallecas] 1.5332 13.362 0.115 0.909 -24.658 27.724
C(neighbourhood)[T.Casco Histórico de Vicálvaro] -2.4579 14.881 -0.165 0.869 -31.626 26.710
C(neighbourhood)[T.Castellana] 68.3615 12.155 5.624 0.000 44.536 92.187
C(neighbourhood)[T.Castilla] 34.5684 13.791 2.507 0.012 7.536 61.600
C(neighbourhood)[T.Castillejos] 33.7429 12.095 2.790 0.005 10.035 57.450
C(neighbourhood)[T.Chopera] 16.6831 12.298 1.357 0.175 -7.423 40.789
C(neighbourhood)[T.Ciudad Jardín] 22.0364 12.925 1.705 0.088 -3.299 47.372
C(neighbourhood)[T.Ciudad Universitaria] 12.7401 14.757 0.863 0.388 -16.186 41.666
C(neighbourhood)[T.Colina] 47.0181 16.724 2.811 0.005 14.238 79.799
C(neighbourhood)[T.Comillas] 18.0013 12.399 1.452 0.147 -6.303 42.306
C(neighbourhood)[T.Concepción] 10.4510 13.445 0.777 0.437 -15.903 36.805
C(neighbourhood)[T.Corralejos] 33.2445 30.623 1.086 0.278 -26.780 93.269
C(neighbourhood)[T.Cortes] 61.1947 11.473 5.334 0.000 38.707 83.683
C(neighbourhood)[T.Costillares] 21.7116 15.790 1.375 0.169 -9.240 52.663
C(neighbourhood)[T.Cuatro Caminos] 36.0319 11.898 3.028 0.002 12.711 59.353
C(neighbourhood)[T.Cuatro Vientos] 4.5201 20.772 0.218 0.828 -36.196 45.236
C(neighbourhood)[T.Cármenes] 11.2344 14.118 0.796 0.426 -16.439 38.907
C(neighbourhood)[T.Delicias] 19.5064 12.030 1.622 0.105 -4.073 43.086
C(neighbourhood)[T.El Goloso] 35.2039 19.945 1.765 0.078 -3.891 74.299
C(neighbourhood)[T.El Pardo] 13.4382 50.560 0.266 0.790 -85.665 112.542
C(neighbourhood)[T.El Plantío] 7.5515 20.777 0.363 0.716 -33.173 48.276
C(neighbourhood)[T.El Viso] 51.7654 13.119 3.946 0.000 26.050 77.481
C(neighbourhood)[T.Embajadores] 38.6202 11.372 3.396 0.001 16.330 60.911
C(neighbourhood)[T.Entrevías] -6.2525 15.606 -0.401 0.689 -36.842 24.337
C(neighbourhood)[T.Estrella] 36.5910 16.722 2.188 0.029 3.813 69.369
C(neighbourhood)[T.Fontarrón] -7.4839 14.875 -0.503 0.615 -36.641 21.673
C(neighbourhood)[T.Fuente del Berro] 31.4966 12.699 2.480 0.013 6.605 56.388
C(neighbourhood)[T.Gaztambide] 32.1650 12.138 2.650 0.008 8.373 55.957
C(neighbourhood)[T.Goya] 49.8259 11.759 4.237 0.000 26.776 72.876
C(neighbourhood)[T.Guindalera] 31.4116 11.845 2.652 0.008 8.195 54.628
C(neighbourhood)[T.Hellín] -5.6841 18.674 -0.304 0.761 -42.287 30.919
C(neighbourhood)[T.Hispanoamérica] 46.2799 12.660 3.655 0.000 21.464 71.096
C(neighbourhood)[T.Horcajo] 2.945e-14 1.59e-14 1.856 0.063 -1.64e-15 6.05e-14
C(neighbourhood)[T.Ibiza] 31.7360 11.981 2.649 0.008 8.251 55.221
C(neighbourhood)[T.Imperial] 22.3133 12.179 1.832 0.067 -1.560 46.186
C(neighbourhood)[T.Jerónimos] 67.9640 12.699 5.352 0.000 43.073 92.855
C(neighbourhood)[T.Justicia] 51.9468 11.446 4.538 0.000 29.510 74.383
C(neighbourhood)[T.La Paz] 2.0962 15.603 0.134 0.893 -28.488 32.681
C(neighbourhood)[T.Legazpi] 21.0025 14.121 1.487 0.137 -6.677 48.682
C(neighbourhood)[T.Lista] 53.2862 12.083 4.410 0.000 29.601 76.971
C(neighbourhood)[T.Los Angeles] 6.2183 16.726 0.372 0.710 -26.568 39.004
C(neighbourhood)[T.Los Rosales] 0.6175 16.454 0.038 0.970 -31.634 32.869
C(neighbourhood)[T.Lucero] -7.5381 13.059 -0.577 0.564 -33.136 18.059
C(neighbourhood)[T.Marroquina] 8.7202 17.027 0.512 0.609 -24.654 42.095
C(neighbourhood)[T.Media Legua] -1.8909 17.742 -0.107 0.915 -36.668 32.886
C(neighbourhood)[T.Mirasierra] 1.0352 15.995 0.065 0.948 -30.317 32.388
C(neighbourhood)[T.Moscardó] 4.3998 12.626 0.348 0.727 -20.348 29.148
C(neighbourhood)[T.Niño Jesús] 50.0095 13.582 3.682 0.000 23.387 76.632
C(neighbourhood)[T.Nueva España] 35.6868 13.059 2.733 0.006 10.090 61.284
C(neighbourhood)[T.Numancia] 11.3359 12.186 0.930 0.352 -12.550 35.222
C(neighbourhood)[T.Opañel] 24.7965 12.660 1.959 0.050 -0.019 49.612
C(neighbourhood)[T.Orcasitas] -42.1824 17.364 -2.429 0.015 -76.218 -8.147
C(neighbourhood)[T.Orcasur] 11.9868 13.359 0.897 0.370 -14.198 38.172
C(neighbourhood)[T.Pacífico] 13.8451 11.828 1.171 0.242 -9.340 37.030
C(neighbourhood)[T.Palacio] 52.5310 11.398 4.609 0.000 30.189 74.873
C(neighbourhood)[T.Palomas] 40.7050 23.084 1.763 0.078 -4.542 85.952
C(neighbourhood)[T.Palomeras Bajas] -10.6048 13.682 -0.775 0.438 -37.423 16.213
C(neighbourhood)[T.Palomeras Sureste] -1.8211 15.605 -0.117 0.907 -32.408 28.766
C(neighbourhood)[T.Palos de Moguer] 34.0250 11.694 2.910 0.004 11.103 56.946
C(neighbourhood)[T.Pavones] -29.6977 24.777 -1.199 0.231 -78.265 18.869
C(neighbourhood)[T.Peñagrande] 1.2636 15.439 0.082 0.935 -29.000 31.527
C(neighbourhood)[T.Pilar] 7.8029 13.736 0.568 0.570 -19.122 34.728
C(neighbourhood)[T.Pinar del Rey] 16.8362 12.486 1.348 0.178 -7.638 41.310
C(neighbourhood)[T.Piovera] 36.5582 15.793 2.315 0.021 5.602 67.514
C(neighbourhood)[T.Portazgo] -15.9508 16.454 -0.969 0.332 -48.203 16.302
C(neighbourhood)[T.Pradolongo] 6.6079 15.788 0.419 0.676 -24.338 37.554
C(neighbourhood)[T.Prosperidad] 28.2371 12.138 2.326 0.020 4.446 52.029
C(neighbourhood)[T.Pueblo Nuevo] 4.7364 12.229 0.387 0.699 -19.233 28.706
C(neighbourhood)[T.Puerta Bonita] -7.4593 13.910 -0.536 0.592 -34.725 19.807
C(neighbourhood)[T.Puerta del Angel] 2.9935 11.882 0.252 0.801 -20.297 26.284
C(neighbourhood)[T.Quintana] 4.7823 12.633 0.379 0.705 -19.979 29.544
C(neighbourhood)[T.Recoletos] 69.0001 11.885 5.806 0.000 45.704 92.296
C(neighbourhood)[T.Rejas] 14.7888 13.001 1.138 0.255 -10.694 40.272
C(neighbourhood)[T.Rios Rosas] 43.8075 12.103 3.619 0.000 20.083 67.532
C(neighbourhood)[T.Rosas] 9.7475 14.878 0.655 0.512 -19.416 38.911
C(neighbourhood)[T.Salvador] -2.8517 18.676 -0.153 0.879 -39.459 33.755
C(neighbourhood)[T.San Andrés] 7.3299 13.910 0.527 0.598 -19.936 34.596
C(neighbourhood)[T.San Cristobal] -10.3517 19.946 -0.519 0.604 -49.449 28.745
C(neighbourhood)[T.San Diego] 11.9146 12.080 0.986 0.324 -11.763 35.592
C(neighbourhood)[T.San Fermín] -7.8443 14.765 -0.531 0.595 -36.786 21.097
C(neighbourhood)[T.San Isidro] 14.6513 12.156 1.205 0.228 -9.176 38.479
C(neighbourhood)[T.San Juan Bautista] 30.3704 15.283 1.987 0.047 0.414 60.327
C(neighbourhood)[T.San Pascual] 9.5865 14.651 0.654 0.513 -19.132 38.305
C(neighbourhood)[T.Santa Eugenia] -8.5277 20.777 -0.410 0.681 -49.254 32.199
C(neighbourhood)[T.Simancas] 27.9041 13.089 2.132 0.033 2.249 53.559
C(neighbourhood)[T.Sol] 64.7310 11.439 5.659 0.000 42.309 87.153
C(neighbourhood)[T.Timón] 3.8465 13.681 0.281 0.779 -22.971 30.664
C(neighbourhood)[T.Trafalgar] 40.9136 11.729 3.488 0.000 17.924 63.903
C(neighbourhood)[T.Universidad] 52.6676 11.386 4.626 0.000 30.349 74.986
C(neighbourhood)[T.Valdeacederas] 9.9176 12.750 0.778 0.437 -15.073 34.909
C(neighbourhood)[T.Valdefuentes] 24.6241 12.369 1.991 0.047 0.379 48.869
C(neighbourhood)[T.Valdemarín] 22.7797 36.637 0.622 0.534 -49.033 94.593
C(neighbourhood)[T.Valdezarza] 0.6239 13.180 0.047 0.962 -25.211 26.458
C(neighbourhood)[T.Vallehermoso] 42.3945 13.731 3.087 0.002 15.480 69.309
C(neighbourhood)[T.Valverde] 5.2359 12.791 0.409 0.682 -19.836 30.308
C(neighbourhood)[T.Ventas] 3.1014 12.239 0.253 0.800 -20.888 27.091
C(neighbourhood)[T.Vinateros] -6.8132 19.944 -0.342 0.733 -45.906 32.280
C(neighbourhood)[T.Vista Alegre] 0.3320 12.711 0.026 0.979 -24.583 25.247
C(neighbourhood)[T.Zofío] 10.2001 16.455 0.620 0.535 -22.055 42.455
bedrooms 36.4822 0.524 69.672 0.000 35.456 37.509
number_of_reviews -0.0417 0.005 -8.227 0.000 -0.052 -0.032
review_scores_rating 3.2728 0.706 4.639 0.000 1.890 4.656
Omnibus: 2562.636 Durbin-Watson: 1.865
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32841.808
Skew: 0.473 Prob(JB): 0.00
Kurtosis: 10.343 Cond. No. 2.42e+18


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.47e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

You try it:

By how many euros does the model miss the mark, on average?

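One way to approach the question is to look at the residuals of the fitted model. The following is a minimal sketch on synthetic data (the variable names and numbers below are hypothetical stand-ins, not the AirBnB data itself):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data standing in for the listings data frame
rng = np.random.default_rng(0)
toy = pd.DataFrame({'bedrooms': rng.integers(1, 5, size=200)})
toy['price'] = 40 + 36 * toy['bedrooms'] + rng.normal(0, 30, size=200)

model = smf.ols('price ~ bedrooms', data=toy).fit()

# Mean absolute error of the residuals: the average number of euros
# by which the model misses the observed price
mae = np.abs(model.resid).mean()
```

The same `np.abs(results.resid).mean()` pattern applies to any fitted statsmodels results object.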
Extra: grouping and aggregation

The following is a list of functions that group/aggregate data tables into new ones. We have already seen the first three, but the others may be useful as well:

  • value_counts extracts a frequency table as a series. The table contains the counts of the occurrences of every value of a given series. It does not include the missing values.
  • pivot_table extracts a spreadsheet-style pivot table as a data frame. For a Pandas data frame df, the syntax is pd.pivot_table(df, values=cname1, index=cname2). This returns a one-way table containing the average value of cname1 for the groups defined by cname2. Instead of the average, you can get a different summary by adding an argument aggfunc=fname. With an additional argument columns=cname3, you get a two-way table. For two-way tables, it works the same as crosstab, but it only applies to columns from the same data frame.
  • groupby groups the rows of a data frame so that an aggregation function can be applied to every group, extracting a SQL-like table as a data frame. The syntax is df.groupby(by=cname).fname(), where fname is mean, sum, count, etc.
  • crosstab extracts a simple cross tabulation as a Pandas data frame. For two series of the same length s1 and s2, the syntax is pd.crosstab(s1, s2). Then s1 will be placed on the rows and s2 on the columns. By default, crosstab extracts a frequency table, unless an array of values and an aggregation function are passed, e.g., as values=s3 and aggfunc=fname.
  • Finally, if you actually know some SQL and would like to use that syntax instead, you can use a package like fugue or pandasql.
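The four functions above can be compared side by side on a tiny illustrative data frame (the columns below mimic the AirBnB data but the values are made up):

```python
import pandas as pd

# Tiny hypothetical data frame, not the real listings
df = pd.DataFrame({
    'room_type': ['Entire home', 'Private room', 'Entire home', 'Private room'],
    'neighbourhood': ['Sol', 'Sol', 'Palacio', 'Palacio'],
    'price': [120, 45, 150, 50],
})

# Frequency table as a series
counts = df['room_type'].value_counts()

# One-way pivot table: average price per room type
pt = pd.pivot_table(df, values='price', index='room_type')

# Equivalent aggregation with groupby
gb = df.groupby(by='room_type')['price'].mean()

# Cross tabulation of two series (a frequency table by default)
ct = pd.crosstab(df['room_type'], df['neighbourhood'])
```

Note that `pivot_table` returns a data frame while the `groupby` version returns a series, although the numbers agree.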

You try it

There are many paths to the same solution in pandas. A great exercise to realize this is to try to mimic the output of value_counts() using groupby() instead. Note that the aggregation function would be … count.

# How would you do this with a groupby?
df.neighbourhood.value_counts()
Embajadores    2573
Universidad    2010
Palacio        1680
Sol            1187
Justicia       1049
               ... 
Palomas           7
Corralejos        6
Valdemarín        5
El Pardo          1
Horcajo           1
Name: neighbourhood, Length: 126, dtype: int64
Solution
df.groupby('neighbourhood')['name'].agg('count').sort_values(ascending=False)
neighbourhood
Embajadores    2573
Universidad    2009
Palacio        1679
Sol            1187
Justicia       1049
               ... 
Palomas           7
Corralejos        6
Valdemarín        5
El Pardo          1
Horcajo           1
Name: name, Length: 126, dtype: int64

Extra: performing SQL on dataframes

If you are already familiar with SQL, you may find it more convenient to work with it instead of the grouping functions presented before. Here’s a simple example of how you would execute an SQL query in pandas.

! pip install fugue
from fugue_sql import fsql

# fugue does not allow indices, so we must remove it
table = df.reset_index()

# Queries are very sql-like, but have some unique syntax
# at the end. Here, we are asking for a dataframe called
# df_out.
query_1 = """
    SELECT COUNT(*) AS cnt FROM table 
    GROUP BY host_id
    ORDER BY cnt DESC
    YIELD DATAFRAME AS df_out
"""

# result is a dictionary containing all the requested outputs
# in this case, just df_out
result = fsql(query_1).run()
result['df_out'].native.head(10)
cnt
9343 263
9114 148
10457 143
9861 127
2572 101
8603 97
5315 96
5324 96
3292 86
1874 80
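For comparison, the same aggregation can be written in plain pandas without any SQL layer. A sketch assuming a `host_id` column as in the query above (the data below is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical mini listings table with a host_id column
table = pd.DataFrame({'host_id': [101, 101, 101, 202, 202, 303]})

# Equivalent of:
#   SELECT COUNT(*) AS cnt FROM table GROUP BY host_id ORDER BY cnt DESC
cnt = (table.groupby('host_id')
            .size()
            .sort_values(ascending=False)
            .rename('cnt'))
```

The `groupby(...).size()` chain maps one-to-one onto the GROUP BY / COUNT(*) / ORDER BY clauses of the query.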

References

Adapted from: Miguel Ángel Canela