Stats 140SL Airbnb Sentiment Analysis
Data taken from
http://insideairbnb.com/get-the-data.html
Libraries
Downloading the Data
The Inside Airbnb page we scraped the data from is the one linked above.
Each city has 3 files we are interested in: “neighbourhoods.csv”, “reviews.csv”, and “listings.csv”.
The webpage only lets us download one file at a time for a particular city. To work around this, we saved the html page containing the list of available files for each city. Conveniently, this html file also contains the download link for each of those files.
We automated the download process by creating a folder for the name of each city and downloading the 3 files for that particular city.
In order to make the data ‘tidy’, we added the name of the city as a column on each of the csv files and combined them to make 3 large data frames: listings, reviews, and neighbourhoods.
Getting Links
We open our raw html files in order to extract the url to the files we want to download.
We also keep track of the name of the file.
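As a rough illustration of this step, the sketch below pulls file names and URLs out of a small table using only Python's standard library `html.parser`. The report's own function works on parsed objects with a `find_all` method, so this is a swapped-in technique, not the code that was run. The sample markup, the `example.com` URLs, and the `LinkCollector` class are all assumptions made for the example; the real page layout may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup: city name in the second cell, file link in the third.
SAMPLE_HTML = """
<table>
  <tr><td>01 Jan</td><td>Austin</td><td><a href="http://example.com/austin/listings.csv">listings.csv</a></td></tr>
  <tr><td>01 Jan</td><td>Austin</td><td><a href="http://example.com/austin/reviews.csv.gz">reviews.csv.gz</a></td></tr>
  <tr><td>01 Jan</td><td>Austin</td><td><a href="http://example.com/austin/calendar.csv.gz">calendar.csv.gz</a></td></tr>
</table>
"""

# The three file names the report is interested in.
WANTED = {"reviews.csv.gz", "listings.csv", "neighbourhoods.csv"}


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


parser = LinkCollector()
parser.feed(SAMPLE_HTML)

# Keep the file name alongside its URL, and drop files we do not need.
wanted_links = [(name, url) for name, url in parser.links if name in WANTED]
print(wanted_links)
```

Filtering on the link text keeps the file name and URL paired, which is what the later download step needs.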
Extracting the Links
We define a function to extract the city name, file name and url of the file for a given city.
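def extract_city(city):
    """Return the city name and the download links found in one city's table."""
    result = []
    wanted = ["reviews.csv.gz", "listings.csv", "neighbourhoods.csv"]
    # The city name sits in the second cell of the first row.
    city_name = city.find_all("tr")[0].find_all("td")[1].text
    for tr in city.find_all("tr"):
        # Rows without a class attribute return None, so fall back to an empty list.
        if "archived" not in (tr.get("class") or []):
            city_file = tr.find_all("td")[2].text
            if city_file in wanted:
                city_data = {
                    "file_name": city_file,
                    "url": tr.find_all("td")[2].find_all("a")[0].get("href"),
                }
                result.append(city_data)
    return {"city_name": city_name, "data": result}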
download_dir = "cities"
We define two more functions: one to download a single file, and one to download all files for a city.
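import os
import requests

def download_file(file_name, url, city_name):
    """Download one file into download_dir/<city_name>/."""
    r = requests.get(url)
    r.raise_for_status()  # stop rather than write an error page to disk
    city_dir = os.path.join(download_dir, city_name)
    # Create the city folder (and download_dir itself) if it does not exist yet.
    os.makedirs(city_dir, exist_ok=True)
    with open(os.path.join(city_dir, file_name), 'wb') as f:
        f.write(r.content)

def download_city(city_name, files):
    """Download every file listed for a city."""
    for file in files:
        download_file(file["file_name"], file["url"], city_name)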
Downloading the Files
Then we use these functions to extract all the information we want from each city.
If we want, we can also uncomment the next line and use our download_city function we defined earlier to download all the content.
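The driver loop for this step is not shown in the text, so here is a hedged sketch of what it could look like as a dry run. It assumes each city is a dictionary with the `city_name` and `data` keys returned by the extraction function; the `plan_downloads` helper, the sample cities, and the `example.com` URLs are hypothetical. It only computes where each file would be saved and touches neither the network nor the disk.

```python
import os

download_dir = "cities"

# Hypothetical output of the extraction step for two cities.
cities = [
    {"city_name": "Austin", "data": [
        {"file_name": "listings.csv", "url": "http://example.com/austin/listings.csv"},
        {"file_name": "reviews.csv.gz", "url": "http://example.com/austin/reviews.csv.gz"},
    ]},
    {"city_name": "Los Angeles", "data": [
        {"file_name": "neighbourhoods.csv", "url": "http://example.com/la/neighbourhoods.csv"},
    ]},
]


def plan_downloads(cities, root=download_dir):
    """Return (target_path, url) pairs: one folder per city, one entry per file."""
    plan = []
    for city in cities:
        for f in city["data"]:
            target = os.path.join(root, city["city_name"], f["file_name"])
            plan.append((target, f["url"]))
    return plan


plan = plan_downloads(cities)
for path, url in plan:
    print(path, "<-", url)
```

Printing the plan first is a cheap way to check the folder layout before committing to a long download.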
Making the Data Tidy
We are going to make the data tidy by combining all of the observations of different cities from different csv files into 3 massive data frames.
In order to keep the data tidy, we will add a column called city, which will specify what city the observation belongs to.
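The idea of tagging each row with its city before stacking the tables can be sketched in a few lines of Python. This is an illustration only, separate from the R code the report uses: the CSV contents are made up and the `combine_with_city` helper is hypothetical.

```python
import csv
import io

# Hypothetical per-city CSV contents standing in for files on disk.
city_csvs = {
    "Austin": "id,price\n1,100\n2,150\n",
    "Los Angeles": "id,price\n3,200\n",
}


def combine_with_city(city_csvs):
    """Read each city's rows and add a 'city' column before stacking them."""
    combined = []
    for city, text in city_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["city"] = city  # every observation records where it came from
            combined.append(row)
    return combined


listings = combine_with_city(city_csvs)
for row in listings:
    print(row)
```

Because the city is stored on every row, the combined table can later be grouped or filtered by city without going back to the folder structure.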
data_directory <- file.path("scrape","cities")
cities <- list.files(data_directory)
# Restrict the analysis to five cities (this overrides the full folder listing above)
cities <- c("New York City","Los Angeles","San Francisco","Austin","Washington, D.C")
city_paths <- file.path(data_directory,cities)
names(city_paths) <- cities
file_names <- c("reviews.csv.gz","listings.csv","neighbourhoods.csv")
Process and combine each Data Frame.
We combine each dataframe and add the city column.
gather_file <- function(filename){
  cities %>% lapply(function(city){
    city_path <- file.path(data_directory, city)
    file_path <- file.path(city_path, filename)
    df <- read_csv(file_path)
    # Force neighbourhood to character so the column type matches across cities
    if(!is.null(df$neighbourhood)){
      df$neighbourhood <- as.character(df$neighbourhood)
    }
    # Tag every row with the city it came from
    df <- cbind(df, "city" = rep(city, nrow(df)))
    df
  })
}
airbnb <- file_names %>% lapply(function(f){
  # Uncomment to build the combined data frames; as written, each element is NULL
  #bind_rows(gather_file(f))
})
names(airbnb) <- c("reviews","listings","neighbourhoods")
Preview of the Data
Listings
The Listings Data Frame is our main Airbnb listings table.
The majority of our visualizations and analysis use this table.
Reviews
The Reviews Table is more of a supplementary table to the listings table. The listings table already has a variable for the total number of reviews. The Reviews table allows us to additionally see the date of those reviews and which listing they belong to.
Neighbourhoods
The Neighbourhoods table is also a supplementary table that could be added later.
Visualizations
Room Type
From the chart we see that the most common type of room offered on Airbnb globally is an entire apartment or house. Private rooms come second, with a small number of hotel rooms and shared rooms offered as well.
Summary
For the summary data, the mean price for every room type seems unrealistically large. We attribute this to the maximum-price outliers in each room type. Since we cannot say for certain that those exorbitant prices are erroneous, we choose to leave them in and look at the median prices instead. The medians confirm the natural assumption that an entire house or apartment, or a hotel room, costs more per night than a shared or private room.
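To see why the median is the safer summary here, consider a small worked example. The prices below are invented for illustration and are not taken from the listings data.

```python
from statistics import mean, median

# Made-up nightly prices: five ordinary listings and one extreme outlier.
prices = [60, 75, 80, 95, 110, 9999]

mean_price = mean(prices)      # pulled far upward by the single outlier
median_price = median(prices)  # average of the two middle values, 80 and 95

print("mean:", mean_price)
print("median:", median_price)
```

One extreme listing is enough to push the mean far above what a typical guest would pay, while the median stays among the ordinary prices.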
City
Let's look at the number of Airbnb listings by city.
Based on the data, the city with the most unique listings is London, U.K., with 76619. However, the graph is crowded and hard to read. To take the analysis further, we examine the five cities with the most unique listings.
Top 5 Cities
We see that Paris follows London with 6634 unique listings, then Sicily with 48503, New York City with 44666, and Shanghai with 35572.
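Counting unique listings per city and keeping the largest few can be sketched as follows. The rows are toy data invented for the example (including one deliberate duplicate), and only the top two cities are kept to keep the output short; the same call with 5 would give a top-five ranking.

```python
from collections import Counter

# Toy (city, listing_id) rows; ("London", 2) appears twice on purpose.
rows = [
    ("London", 1), ("London", 2), ("London", 2), ("London", 3),
    ("Paris", 4), ("Paris", 5),
    ("Austin", 6),
]

# De-duplicate first so each listing is counted once per city.
unique_rows = set(rows)
counts = Counter(city for city, _ in unique_rows)

# most_common(n) returns the n cities with the highest counts, largest first.
top_cities = counts.most_common(2)
print(top_cities)
```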
Sentiment Analysis
We encountered an interesting problem when extracting sentiments from the review comments. R developers usually favor apply-style functions for their speed and convenience, but these come with a cost in memory.
Time Space Complexity
If we were to apply a function to our reviews data frame in order to extract another data frame that has the sentiment columns, we would be creating two instances of the object in our memory.
this_will_fail <- function(){
  reviews %>% apply(1, function(c){
    c['sentiment'] <- get_sentiment(c['comments'])
    c  # return the whole row so the result has a sentiment column
  }) %>% t %>% as.data.frame
}
Extracting the sentiments this way will almost certainly exhaust the RAM available on most machines. Instead, we use an approach that trades time for space. We add a new column to our existing data frame, listings, holding the average sentiment of the reviews associated with each listing. This avoids storing a sentiment score for every single review, so you will see your RAM usage oscillate as memory is freed.
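The same space-saving idea can be expressed as a single pass that keeps only a running sum and count per listing. The sketch below is an alternative illustration in Python, not the R code used in this report: `toy_sentiment` and its tiny lexicon are hypothetical stand-ins for a real sentiment scorer, and the reviews are invented.

```python
from collections import defaultdict

# Toy lexicon standing in for a real sentiment scorer (hypothetical values).
LEXICON = {"great": 1.0, "clean": 0.5, "dirty": -1.0, "noisy": -0.5}


def toy_sentiment(text):
    """Sum the lexicon scores of the words in a comment."""
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())


def mean_sentiment_by_listing(reviews):
    """Stream over (listing_id, comment) pairs, keeping one [sum, count] per listing."""
    totals = defaultdict(lambda: [0.0, 0])
    for listing_id, comment in reviews:
        acc = totals[listing_id]
        acc[0] += toy_sentiment(comment)
        acc[1] += 1
    return {lid: total / n for lid, (total, n) in totals.items()}


reviews = [
    (1, "great clean place"),
    (1, "noisy street"),
    (2, "dirty room"),
]

# Passing an iterator shows that the reviews never need to be held in memory at once.
result = mean_sentiment_by_listing(iter(reviews))
print(result)
```

Memory use grows with the number of listings and not with the number of reviews, since individual scores are discarded as soon as they are added to the running total.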
Extracting the sentiment
listing_ids <- reviews$listing_id%>%unique
#Add an empty column for sentiments
listings$sentiment <- rep(NA,nrow(listings))
listing_sentiment <- function(lid){
  sentiments <- reviews %>% filter(listing_id == lid) %>% apply(1, function(row){
    comments <- iconv(row["comments"], to = 'utf-8')
    sentiment <- get_sentiment(comments)
    sentiment
  })
  score <- mean(sentiments)
  index <- listings$id == lid
  # <<- writes to the global listings data frame rather than a local copy
  listings[index, "sentiment"] <<- score
}
for(lid in listing_ids){
  # Uncomment to compute the scores (slow)
  #listing_sentiment(lid)
}
datatable(head(listings), options = list(autoWidth = TRUE,pageLength = 5))
Variable Selection Model
Missing Values
The columns “neighbourhood_group” and “number_of_reviews_ltm” were removed because the majority of their values were NA.
A quick backward stepwise selection shows which of our selected variables matter least for predicting a listing's average sentiment. After dropping those two columns and removing rows with missing values, we converted room_type and city into factors, split the data 70/30 into training and test sets, and kept the first 20000 rows of each. We then ran backward selection by AIC on the training set. The procedure drops availability_365, minimum_nights, longitude, city, latitude, and price, in that order, and retains neighbourhood, room_type, number_of_reviews, reviews_per_month, and calculated_host_listings_count. Of the retained variables, removing neighbourhood raises the AIC the most, followed by reviews_per_month and number_of_reviews.
load('SentimentListing.RData')
listings<-listings[,-which(names(listings) %in% c("neighbourhood_group", "number_of_reviews_ltm"))]
listings<-na.omit(listings)
set.seed(1)
cols <- c("room_type","city")
listings[cols] <- lapply(listings[cols], factor)
subset<-sample(nrow(listings),nrow(listings)*0.70)
test<-listings[-subset,]
test<-test[1:20000,]
train<-listings[subset,]
train<-train[1:20000,]
m1<-lm(sentiment~neighbourhood+latitude+longitude+room_type+minimum_nights+number_of_reviews+reviews_per_month+ calculated_host_listings_count+availability_365+city+price,data=train)
anova(m1)
## Analysis of Variance Table
##
## Response: sentiment
##                                   Df Sum Sq Mean Sq  F value    Pr(>F)
## neighbourhood                    544   4464    8.21   3.7856 < 2.2e-16 ***
## latitude                           1      0    0.30   0.1402    0.7081
## longitude                          1      4    3.54   1.6313    0.2015
## room_type                          3    273   90.91  41.9376 < 2.2e-16 ***
## minimum_nights                     1      1    1.14   0.5265    0.4681
## number_of_reviews                  1     78   78.39  36.1597  1.85e-09 ***
## reviews_per_month                  1    375  374.73 172.8635 < 2.2e-16 ***
## calculated_host_listings_count     1    240  239.65 110.5519 < 2.2e-16 ***
## availability_365                   1      0    0.01   0.0031    0.9556
## city                               2      8    3.95   1.8243    0.1614
## price                              1      3    3.21   1.4808    0.2237
## Residuals                      19442  42146    2.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
null<-lm(sentiment~1,data=train)
step.model<-step(m1, scope = list(lower = null, upper = m1),
direction = "backward")
## Start: AIC=16024.24
## sentiment ~ neighbourhood + latitude + longitude + room_type +
## minimum_nights + number_of_reviews + reviews_per_month +
## calculated_host_listings_count + availability_365 + city +
## price
##
## Df Sum of Sq RSS AIC
## - availability_365 1 0.03 42146 16022
## - minimum_nights 1 1.08 42147 16023
## - longitude 1 2.08 42148 16023
## - price 1 3.21 42149 16024
## - city 2 8.10 42154 16024
## <none> 42146 16024
## - latitude 1 5.60 42152 16025
## - calculated_host_listings_count 1 236.55 42383 16134
## - room_type 3 281.01 42427 16151
## - number_of_reviews 1 367.96 42514 16196
## - reviews_per_month 1 382.54 42529 16203
## - neighbourhood 542 3031.36 45178 16329
##
## Step: AIC=16022.25
## sentiment ~ neighbourhood + latitude + longitude + room_type +
## minimum_nights + number_of_reviews + reviews_per_month +
## calculated_host_listings_count + city + price
##
## Df Sum of Sq RSS AIC
## - minimum_nights 1 1.08 42147 16021
## - longitude 1 2.07 42148 16021
## - price 1 3.19 42149 16022
## - city 2 8.10 42154 16022
## <none> 42146 16022
## - latitude 1 5.60 42152 16023
## - calculated_host_listings_count 1 238.49 42385 16133
## - room_type 3 281.43 42428 16149
## - number_of_reviews 1 369.63 42516 16195
## - reviews_per_month 1 382.84 42529 16201
## - neighbourhood 542 3037.14 45183 16330
##
## Step: AIC=16020.77
## sentiment ~ neighbourhood + latitude + longitude + room_type +
## number_of_reviews + reviews_per_month + calculated_host_listings_count +
## city + price
##
## Df Sum of Sq RSS AIC
## - longitude 1 2.09 42149 16020
## - price 1 3.18 42150 16020
## - city 2 8.11 42155 16021
## <none> 42147 16021
## - latitude 1 5.59 42153 16021
## - calculated_host_listings_count 1 238.47 42386 16132
## - room_type 3 281.66 42429 16148
## - number_of_reviews 1 369.77 42517 16194
## - reviews_per_month 1 382.81 42530 16200
## - neighbourhood 542 3036.27 45184 16328
##
## Step: AIC=16019.76
## sentiment ~ neighbourhood + latitude + room_type + number_of_reviews +
## reviews_per_month + calculated_host_listings_count + city +
## price
##
## Df Sum of Sq RSS AIC
## - city 2 7.1 42157 16019
## - price 1 3.1 42153 16019
## <none> 42149 16020
## - latitude 1 5.7 42155 16020
## - calculated_host_listings_count 1 238.2 42388 16130
## - room_type 3 281.9 42431 16147
## - number_of_reviews 1 369.9 42519 16192
## - reviews_per_month 1 384.1 42533 16199
## - neighbourhood 542 3343.6 45493 16463
##
## Step: AIC=16019.13
## sentiment ~ neighbourhood + latitude + room_type + number_of_reviews +
## reviews_per_month + calculated_host_listings_count + price
##
## Df Sum of Sq RSS AIC
## - latitude 1 0.1 42157 16017
## - price 1 3.2 42160 16019
## <none> 42157 16019
## - calculated_host_listings_count 1 241.9 42398 16132
## - room_type 3 282.9 42439 16147
## - number_of_reviews 1 367.7 42524 16191
## - reviews_per_month 1 383.2 42540 16198
## - neighbourhood 544 3579.9 45736 16561
##
## Step: AIC=16017.17
## sentiment ~ neighbourhood + room_type + number_of_reviews + reviews_per_month +
## calculated_host_listings_count + price
##
## Df Sum of Sq RSS AIC
## - price 1 3.2 42160 16017
## <none> 42157 16017
## - calculated_host_listings_count 1 242.4 42399 16130
## - room_type 3 282.9 42439 16145
## - number_of_reviews 1 367.6 42524 16189
## - reviews_per_month 1 383.2 42540 16196
## - neighbourhood 544 3579.9 45737 16559
##
## Step: AIC=16016.69
## sentiment ~ neighbourhood + room_type + number_of_reviews + reviews_per_month +
## calculated_host_listings_count
##
## Df Sum of Sq RSS AIC
## <none> 42160 16017
## - calculated_host_listings_count 1 242.5 42402 16129
## - room_type 3 279.7 42439 16143
## - number_of_reviews 1 369.3 42529 16189
## - reviews_per_month 1 381.8 42542 16195
## - neighbourhood 544 3577.9 45738 16558
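The AIC values in the output above can be checked by hand. For a linear model, R's `step()` compares models using n·log(RSS/n) + 2k (up to an additive constant), where n is the number of observations and k the number of estimated coefficients. The sketch below plugs in numbers read off the tables. The values n = 20000 (the size of the training set), k = 558 for the full model (557 term degrees of freedom from the ANOVA table plus the intercept), and k = 551 for the final model are derived here and are not printed in the output, so treat them as assumptions. The RSS values are the rounded ones shown, so the match is only approximate.

```python
import math


def lm_aic(rss, n, k):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * k."""
    return n * math.log(rss / n) + 2 * k


n = 20000  # rows in the training set

# Full model: residual sum of squares 42146; 557 term df plus the intercept.
aic_full = lm_aic(42146, n, 558)

# Final model: RSS 42160; neighbourhood (544) + room_type (3) + three
# single-df terms + the intercept = 551 coefficients.
aic_final = lm_aic(42160, n, 551)

print(round(aic_full, 2), "vs reported 16024.24")
print(round(aic_final, 2), "vs reported 16016.69")
```

The final model has a slightly larger residual sum of squares but seven fewer coefficients, which is why its AIC is lower.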