Stats 140SL AirBnb Sentiment Analysis

Data taken from

http://insideairbnb.com/get-the-data.html

Libraries

library(tidyverse)
library(DT)
library(caret)
library(leaps)
library(parallel)
library(glmnet)
library(reticulate)
library(syuzhet)

Downloading the Data

The data was scraped from the Inside Airbnb page linked above.

Each city has 3 files we are interested in: “neighbourhoods.csv”, “reviews.csv.gz”, and “listings.csv”.
The webpage only lets us download one file at a time for a particular city. To get around this, we saved the html page containing the list of available files for each city. Luckily, this html file also contains the download link for each of those files.

We automated the download process by creating a folder named after each city and downloading the 3 files for that city into it.
In order to make the data ‘tidy’, we added the name of the city as a column to each of the csv files and combined them into 3 large data frames: listings, reviews, and neighbourhoods.
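The tidy step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory CSV strings and the `combine_with_city` helper are hypothetical stand-ins, not the files or code used in this report (the actual combining is done in R further down).

```python
import csv
import io

# Hypothetical in-memory stand-ins for two cities' listings.csv files.
city_files = {
    "Austin": "id,price\n1,100\n2,80\n",
    "Los Angeles": "id,price\n3,150\n",
}

def combine_with_city(files_by_city):
    """Read each city's CSV text and tag every row with its city name."""
    combined = []
    for city, text in files_by_city.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["city"] = city  # the extra column that keeps the data tidy
            combined.append(row)
    return combined

listings = combine_with_city(city_files)
for row in listings:
    print(row)
```

Every observation keeps its original columns and gains a city value, so rows from different cities can live in one table without losing their origin.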

Getting Links

import os
import requests
from bs4 import BeautifulSoup

We open the saved html page in order to extract the urls of the files we want to download.
We also keep track of the name of each file.

with open("scrape/airbnb.html", encoding="utf8") as file:  # Use file to refer to the file object
    html_doc = file.read()
soup = BeautifulSoup(html_doc, 'html.parser')

cities = soup.find_all("table")
cities = [city.find_all("tbody")[0] for city in cities]

Extracting the Links

We define a function to extract the city name, file name and url of the file for a given city.

def extract_city(city):
    """Return the city name and the download links for its current files."""
    result = []
    rows = city.find_all("tr")
    # The city name is in the second cell of the first row.
    city_name = rows[0].find_all("td")[1].text
    for tr in rows:
        # Rows without a class attribute return None, so fall back to an empty list.
        if "archived" not in (tr.get("class") or []):
            cell = tr.find_all("td")[2]
            city_file = cell.text
            if city_file in ["reviews.csv.gz", "listings.csv", "neighbourhoods.csv"]:
                result.append({
                    "file_name": city_file,
                    "url": cell.find_all("a")[0].get("href"),
                })
    return {"city_name": city_name, "data": result}

download_dir = "cities"
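To make the assumed table layout concrete, here is a small self-contained sketch of the same filtering logic. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; the sample rows, dates, and `example.com` urls are made up for illustration, and `RowParser` is a hypothetical helper, not part of the scraper above.

```python
from html.parser import HTMLParser

WANTED = {"reviews.csv.gz", "listings.csv", "neighbourhoods.csv"}

# Hypothetical sample of the table layout the scraper relies on.
SAMPLE = """
<table><tbody>
<tr class="current"><td>2020-01-01</td><td>Austin</td>
    <td><a href="http://example.com/austin/listings.csv">listings.csv</a></td></tr>
<tr class="archived"><td>2019-01-01</td><td>Austin</td>
    <td><a href="http://example.com/old/listings.csv">listings.csv</a></td></tr>
<tr><td>2020-01-01</td><td>Austin</td>
    <td><a href="http://example.com/austin/reviews.csv.gz">reviews.csv.gz</a></td></tr>
<tr class="current"><td>2020-01-01</td><td>Austin</td>
    <td><a href="http://example.com/austin/calendar.csv.gz">calendar.csv.gz</a></td></tr>
</tbody></table>
"""

class RowParser(HTMLParser):
    """Collect, per <tr>, its classes, cell texts, and the first link in each cell."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            classes = (attrs.get("class") or "").split()
            self._row = {"classes": classes, "cells": [], "hrefs": []}
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row["cells"].append("")
            self._row["hrefs"].append(None)
        elif tag == "a" and self._in_td and self._row["hrefs"][-1] is None:
            self._row["hrefs"][-1] = attrs.get("href")

    def handle_data(self, data):
        if self._in_td:
            self._row["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_td = False

parser = RowParser()
parser.feed(SAMPLE)

# Keep non-archived rows whose third cell names one of the wanted files.
links = [
    {"file_name": row["cells"][2].strip(), "url": row["hrefs"][2]}
    for row in parser.rows
    if "archived" not in row["classes"]
    and len(row["cells"]) > 2
    and row["cells"][2].strip() in WANTED
]
print(links)
```

Note the third sample row has no class attribute at all; treating a missing class as an empty list keeps such rows instead of raising an error.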

We define 2 more functions: one to download a single file, and one to download all files for a city.

def download_file(file_name, url, city_name):
    r = requests.get(url)
    r.raise_for_status()  # stop on a failed request instead of saving an error page
    city_dir = os.path.join(download_dir, city_name)
    # Create the city folder (and download_dir itself) if it does not exist yet.
    os.makedirs(city_dir, exist_ok=True)
    with open(os.path.join(city_dir, file_name), 'wb') as f:
        f.write(r.content)

def download_city(city_name, files):
    for file in files:
        download_file(file["file_name"], file["url"], city_name)

Downloading the Files

Then we use these functions to extract all the information we want from each city.
To actually download everything, uncomment the last line, which calls the download_city function we defined earlier for each city.

extracted_cities = [extract_city(city) for city in cities]
# DOWNLOADS FILES
#[download_city(city["city_name"], city["data"]) for city in extracted_cities]
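Before uncommenting the download line, it can help to preview what would be fetched and where it would be saved. The following is a minimal dry-run sketch; `plan_downloads` is a hypothetical helper, and the example entry only mimics the shape returned by extract_city with a made-up url.

```python
import os

def plan_downloads(extracted, download_dir="cities"):
    """Return (url, target_path) pairs without touching the network or disk."""
    plan = []
    for city in extracted:
        for f in city["data"]:
            target = os.path.join(download_dir, city["city_name"], f["file_name"])
            plan.append((f["url"], target))
    return plan

# Hypothetical entry in the shape returned by extract_city.
example = [{
    "city_name": "Austin",
    "data": [{"file_name": "listings.csv",
              "url": "http://example.com/austin/listings.csv"}],
}]

for url, target in plan_downloads(example):
    print(url, "->", target)
```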

Making the Data Tidy.

We are going to make the data tidy by combining all of the observations of different cities from different csv files into 3 massive data frames.
In order to keep the data tidy, we will add a column called city, which will specify what city the observation belongs to.

data_directory <- file.path("scrape","cities")
# All city folders found on disk
cities <- list.files(data_directory)
# Override with a fixed subset of cities
cities <- c("New York City","Los Angeles","San Francisco","Austin","Washington, D.C")
city_paths <- file.path(data_directory,cities)
names(city_paths) <- cities
file_names <- c("reviews.csv.gz","listings.csv","neighbourhoods.csv")

Process and combine each Data Frame.

We combine each dataframe and add the city column.

gather_file <- function(filename){
  cities%>%lapply(function(city){
    city_path <-  file.path(data_directory,city)
    file_path <- file.path(city_path,filename)
    df <- read_csv(file_path)
    if(!is.null(df$neighbourhood)){
      df$neighbourhood <- as.character(df$neighbourhood)
    }
    df <- cbind(df,"city"=rep(city,nrow(df)))
    df
  })
}

airbnb <- file_names%>%lapply(function(f){
  #bind_rows(gather_file(f))
})
names(airbnb) <- c("reviews","listings","neighbourhoods")

Dump the object into an RData File.

#reviews <- airbnb$reviews; listings <- airbnb$listings; neighbourhoods <- airbnb$neighbourhoods
#save(reviews, listings, neighbourhoods, file="airbnb.RData")

Preview of the Data

Listings

The Listings data frame is our main Airbnb listings table.
The majority of our visualizations and analysis will happen on this table.

Reviews

The Reviews Table is more of a supplementary table to the listings table. The listings table already has a variable for the total number of reviews. The Reviews table allows us to additionally see the date of those reviews and which listing they belong to.

Neighbourhoods

The Neighbourhoods table is also a supplementary table that could be added later.

Visualizations

Room Type

From the chart we see that the most common type of room offered on Airbnb globally is an entire apartment or house. Private rooms come second, with a small number of hotel rooms and shared rooms offered as well.

Summary

In the summary data, we see that the means for all room types seem unrealistically large. We attribute this to the maximum-price outliers for each room type. Since we cannot say for certain that those exorbitant prices are erroneous, we choose to leave them in and look at the median prices instead. The medians confirm the natural assumption that an entire house, hotel, or apartment is more expensive per night than a shared or private room.
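The effect of a single extreme price on the mean, and the robustness of the median, can be seen in a tiny example. The prices below are made up purely for illustration and are not taken from the listings data.

```python
from statistics import mean, median

# Made-up nightly prices: four typical listings and one extreme outlier.
prices = [80, 95, 110, 120, 10000]

print("mean:  ", mean(prices))    # pulled far upward by the single outlier
print("median:", median(prices))  # stays at a typical price
```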

City

Let's look at the number of Airbnb listings by city.

Based on the data, we see that the city with the most unique listings is London, U.K., with 76619. However, this graph is crowded and hard to read. To further our analysis, we will examine the top 5 cities in terms of unique listings.

Top 5 Cities

We see that Paris follows London with 6634 unique listings, then Sicily with 48503, New York City with 44666, and Shanghai with 35572.

Sentiment Analysis

We encountered an interesting problem when trying to extract sentiments from the review comments. R developers often favor apply-style functions for their speed and convenience, but these come with a costly tradeoff: memory.

Time Space Complexity

If we were to apply a function to our reviews data frame in order to extract another data frame that has the sentiment columns, we would be creating two instances of the object in our memory.

this_will_fail <- function(){
  reviews%>%apply(1,function(c){
    c['sentiment'] <- get_sentiment(c['comments'])
    c  # return the whole row so the result keeps every column plus sentiment
  })%>%t%>%as.data.frame
}

Extracting the sentiments this way will almost certainly max out the RAM available on most machines. Instead, we use an approach that trades time for space. Since we don't want to run out of RAM, we create a new column on our existing data frame, listings, and store the average sentiment of the reviews associated with each listing. This avoids storing a sentiment score for every single review. While it runs, you will see your RAM usage oscillate as memory is freed.
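The idea of keeping only a per-listing average can be sketched as follows. This is a single-pass variant of the idea, not the approach taken by the R code in this report (which filters the reviews once per listing), and `toy_sentiment` is a hypothetical word-count scorer that stands in for get_sentiment; it is not the syuzhet algorithm.

```python
from collections import defaultdict

def toy_sentiment(text):
    """Hypothetical stand-in scorer; not the syuzhet algorithm."""
    positive = {"great", "clean", "lovely"}
    negative = {"dirty", "noisy", "rude"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

def mean_sentiment_by_listing(reviews):
    """Consume (listing_id, comment) pairs one at a time, keeping only a
    running total and a count per listing instead of every score."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for listing_id, comment in reviews:
        totals[listing_id] += toy_sentiment(comment)
        counts[listing_id] += 1
    return {lid: totals[lid] / counts[lid] for lid in totals}

# Made-up reviews for two listings.
reviews = [
    (1, "great clean place"),
    (1, "noisy street"),
    (2, "lovely host"),
]
print(mean_sentiment_by_listing(reviews))
```

Memory grows with the number of listings rather than the number of reviews, which is the space saving the paragraph above describes.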

Extracting the sentiment

load("SentimentListing.RData")

listing_ids <- reviews$listing_id%>%unique
#Add an empty column for sentiments
listings$sentiment <- rep(NA,nrow(listings))
listing_sentiment <- function(lid){
  sentiments <- reviews%>%filter(listing_id==lid)%>%apply(1,function(row){
    comments <- iconv(row["comments"], to = 'utf-8')
    sentiment <- get_sentiment(comments)
    sentiment
  })
  score <- mean(sentiments)
  index <- listings$id==lid
  listings[index,"sentiment"]<<-score
}

for(lid in listing_ids){
  #listing_sentiment(lid)
}

datatable(head(listings), options = list(autoWidth = TRUE,pageLength = 5))

Variable Selection Model

Missing Values

The columns “neighbourhood_group” and “number_of_reviews_ltm” were removed because the majority of their values were NA.

A quick backward stepwise selection shows which of the variables we selected are the least important. After a rough cleaning of the data to remove missing values, we convert the applicable columns into factors and split the data into a training and a test set. We then perform backward selection to examine which variables are the most important. The procedure drops availability_365, minimum_nights, longitude, city, latitude, and price, in that order, and keeps neighbourhood, room_type, number_of_reviews, reviews_per_month, and calculated_host_listings_count as the most important variables in predicting sentiment.

load('SentimentListing.RData') 
listings<-listings[,-which(names(listings) %in% c("neighbourhood_group", "number_of_reviews_ltm"))]
listings<-na.omit(listings)

set.seed(1)

cols <- c("room_type","city")

listings[cols] <- lapply(listings[cols], factor) 

subset<-sample(nrow(listings),nrow(listings)*0.70)
test<-listings[-subset,]
test<-test[1:20000,]
train<-listings[subset,]
train<-train[1:20000,]

m1<-lm(sentiment~neighbourhood+latitude+longitude+room_type+minimum_nights+number_of_reviews+reviews_per_month+ calculated_host_listings_count+availability_365+city+price,data=train)

anova(m1)

## Analysis of Variance Table
## 
## Response: sentiment
##                                   Df Sum Sq Mean Sq  F value    Pr(>F)    
## neighbourhood                    544   4464    8.21   3.7856 < 2.2e-16 ***
## latitude                           1      0    0.30   0.1402    0.7081    
## longitude                          1      4    3.54   1.6313    0.2015    
## room_type                          3    273   90.91  41.9376 < 2.2e-16 ***
## minimum_nights                     1      1    1.14   0.5265    0.4681    
## number_of_reviews                  1     78   78.39  36.1597  1.85e-09 ***
## reviews_per_month                  1    375  374.73 172.8635 < 2.2e-16 ***
## calculated_host_listings_count     1    240  239.65 110.5519 < 2.2e-16 ***
## availability_365                   1      0    0.01   0.0031    0.9556    
## city                               2      8    3.95   1.8243    0.1614    
## price                              1      3    3.21   1.4808    0.2237    
## Residuals                      19442  42146    2.17                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

null<-lm(sentiment~1,data=train)
step.model<-step(m1, scope = list(lower = null, upper = m1), 
    direction = "backward")

## Start:  AIC=16024.24
## sentiment ~ neighbourhood + latitude + longitude + room_type + 
##     minimum_nights + number_of_reviews + reviews_per_month + 
##     calculated_host_listings_count + availability_365 + city + 
##     price
## 
##                                   Df Sum of Sq   RSS   AIC
## - availability_365                 1      0.03 42146 16022
## - minimum_nights                   1      1.08 42147 16023
## - longitude                        1      2.08 42148 16023
## - price                            1      3.21 42149 16024
## - city                             2      8.10 42154 16024
## <none>                                         42146 16024
## - latitude                         1      5.60 42152 16025
## - calculated_host_listings_count   1    236.55 42383 16134
## - room_type                        3    281.01 42427 16151
## - number_of_reviews                1    367.96 42514 16196
## - reviews_per_month                1    382.54 42529 16203
## - neighbourhood                  542   3031.36 45178 16329
## 
## Step:  AIC=16022.25
## sentiment ~ neighbourhood + latitude + longitude + room_type + 
##     minimum_nights + number_of_reviews + reviews_per_month + 
##     calculated_host_listings_count + city + price
## 
##                                   Df Sum of Sq   RSS   AIC
## - minimum_nights                   1      1.08 42147 16021
## - longitude                        1      2.07 42148 16021
## - price                            1      3.19 42149 16022
## - city                             2      8.10 42154 16022
## <none>                                         42146 16022
## - latitude                         1      5.60 42152 16023
## - calculated_host_listings_count   1    238.49 42385 16133
## - room_type                        3    281.43 42428 16149
## - number_of_reviews                1    369.63 42516 16195
## - reviews_per_month                1    382.84 42529 16201
## - neighbourhood                  542   3037.14 45183 16330
## 
## Step:  AIC=16020.77
## sentiment ~ neighbourhood + latitude + longitude + room_type + 
##     number_of_reviews + reviews_per_month + calculated_host_listings_count + 
##     city + price
## 
##                                   Df Sum of Sq   RSS   AIC
## - longitude                        1      2.09 42149 16020
## - price                            1      3.18 42150 16020
## - city                             2      8.11 42155 16021
## <none>                                         42147 16021
## - latitude                         1      5.59 42153 16021
## - calculated_host_listings_count   1    238.47 42386 16132
## - room_type                        3    281.66 42429 16148
## - number_of_reviews                1    369.77 42517 16194
## - reviews_per_month                1    382.81 42530 16200
## - neighbourhood                  542   3036.27 45184 16328
## 
## Step:  AIC=16019.76
## sentiment ~ neighbourhood + latitude + room_type + number_of_reviews + 
##     reviews_per_month + calculated_host_listings_count + city + 
##     price
## 
##                                   Df Sum of Sq   RSS   AIC
## - city                             2       7.1 42157 16019
## - price                            1       3.1 42153 16019
## <none>                                         42149 16020
## - latitude                         1       5.7 42155 16020
## - calculated_host_listings_count   1     238.2 42388 16130
## - room_type                        3     281.9 42431 16147
## - number_of_reviews                1     369.9 42519 16192
## - reviews_per_month                1     384.1 42533 16199
## - neighbourhood                  542    3343.6 45493 16463
## 
## Step:  AIC=16019.13
## sentiment ~ neighbourhood + latitude + room_type + number_of_reviews + 
##     reviews_per_month + calculated_host_listings_count + price
## 
##                                   Df Sum of Sq   RSS   AIC
## - latitude                         1       0.1 42157 16017
## - price                            1       3.2 42160 16019
## <none>                                         42157 16019
## - calculated_host_listings_count   1     241.9 42398 16132
## - room_type                        3     282.9 42439 16147
## - number_of_reviews                1     367.7 42524 16191
## - reviews_per_month                1     383.2 42540 16198
## - neighbourhood                  544    3579.9 45736 16561
## 
## Step:  AIC=16017.17
## sentiment ~ neighbourhood + room_type + number_of_reviews + reviews_per_month + 
##     calculated_host_listings_count + price
## 
##                                   Df Sum of Sq   RSS   AIC
## - price                            1       3.2 42160 16017
## <none>                                         42157 16017
## - calculated_host_listings_count   1     242.4 42399 16130
## - room_type                        3     282.9 42439 16145
## - number_of_reviews                1     367.6 42524 16189
## - reviews_per_month                1     383.2 42540 16196
## - neighbourhood                  544    3579.9 45737 16559
## 
## Step:  AIC=16016.69
## sentiment ~ neighbourhood + room_type + number_of_reviews + reviews_per_month + 
##     calculated_host_listings_count
## 
##                                   Df Sum of Sq   RSS   AIC
## <none>                                         42160 16017
## - calculated_host_listings_count   1     242.5 42402 16129
## - room_type                        3     279.7 42439 16143
## - number_of_reviews                1     369.3 42529 16189
## - reviews_per_month                1     381.8 42542 16195
## - neighbourhood                  544    3577.9 45738 16558
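The AIC values in the trace can be checked by hand. The sketch below assumes that step() reports AIC for linear models on the scale n * log(RSS / n) + 2 * edf, where edf is the number of estimated parameters; the inputs are the rounded figures printed above (20000 training rows, 19442 residual degrees of freedom, RSS 42146), so the results only match the trace up to rounding.

```python
import math

def lm_aic(rss, n, edf, k=2.0):
    """AIC on the scale assumed for step(): n * log(RSS / n) + k * edf."""
    return n * math.log(rss / n) + k * edf

n = 20000              # rows in the training set
edf_full = n - 19442   # residual Df from the anova table implies 558 parameters
rss_full = 42146       # rounded RSS of the full model from the trace

# Full model, to compare with "Start:  AIC=16024.24".
print(round(lm_aic(rss_full, n, edf_full), 2))

# Dropping availability_365 removes one parameter while the rounded RSS is
# unchanged, to compare with "Step:  AIC=16022.25".
print(round(lm_aic(rss_full, n, edf_full - 1), 2))
```

Because removing availability_365 leaves the RSS essentially unchanged while saving one parameter, the AIC falls by about 2, which is why it is the first variable dropped.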
