Lasso-based feature selection

TF-IDF analysis is a useful tool for detecting whether a word is more frequent among comments on higher-rated or lower-rated accommodations, but it has drawbacks as well. First, it does not filter out words that have a high TF-IDF score even though their global frequency is extremely low. For instance, the word ‘robbery’ appeared only in the comments of low-rated Airbnb flats (so its relative frequency is high), but because its absolute frequency is so low, it cannot be used to predict user ratings. Second, words related to lower-rated flats are often correlated with each other. For example, ‘horrible’ and ‘cancel’ appeared in the same comment:

“I made a reservation. The host didn’t write me, he didn’t respond. After I have called them, they told me they didn’t see the reservation. That the system is broken, and they couldn’t see my reservation. I had to cancel it before arriving and had to look for another accommodation at 9 pm! Horrible” (Airbnb ID: 37383207, average rating: 3.5).

To handle the correlation between words and to find the ones that are truly useful for prediction, we performed a lasso regression-based classification.
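For reference, a logistic lasso in its standard textbook form minimises the penalised negative log-likelihood written out here. The notation is an assumption made for illustration and is not taken from the analysis itself: x_i is the vector of term frequencies of review i, y_i is 0 for a negative and 1 for a positive review, m is the number of terms, and \lambda \ge 0 is the penalty weight, whose value is not reported in this section.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr]
  + \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert,
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}
```

The absolute-value penalty is what drives the coefficients of uninformative or redundant words to exactly zero, which is why the method doubles as a feature selector.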

Comments are categorized as “positive” or “negative” in the same way as in the previous section: a flat is positive if its rating is equal to or above the upper decile, and negative if it is equal to or below the bottom decile. The difference from the previous model is that the intermediate category has been omitted, because the classifier is a logistic regression, which predicts only binary outcomes.
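The labeling rule can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the nearest-rank definition of the deciles, and the ratings are all assumptions.

```python
def decile_cutoffs(ratings):
    """Return the (bottom, top) decile cut-offs using the nearest-rank rule."""
    ordered = sorted(ratings)
    n = len(ordered)
    low_rank = (n + 9) // 10       # integer form of ceil(0.1 * n)
    high_rank = (9 * n + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[low_rank - 1], ordered[high_rank - 1]


def label_flats(ratings):
    """Label each flat 'positive', 'negative', or None (intermediate, omitted)."""
    bottom, top = decile_cutoffs(ratings)
    labels = []
    for rating in ratings:
        if rating >= top:
            labels.append("positive")
        elif rating <= bottom:
            labels.append("negative")
        else:
            labels.append(None)  # intermediate flats are left out of the model
    return labels


# Hypothetical average ratings, for illustration only.
ratings = [4.7, 3.5, 4.9, 4.6, 5.0, 4.2, 4.8, 4.5, 4.7, 4.8]
labels = label_flats(ratings)
```

Flats labeled None correspond to the intermediate category, and their reviews would simply be excluded before fitting the binary classifier.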

The key idea is to draw random samples from the population of reviews and divide each sample into a training set and a test set. On the training set, logistic lasso regression estimates a coefficient for each word and classifies the reviews as “negative” or “positive”. Its performance (that is, whether the frequency of a term is a good predictor of a review’s positive or negative nature) is validated on the test set. After repeating this procedure multiple times, we measure each term’s contribution to the reduction of prediction error (variable importance). The most important terms are reported in Figure 11. We ran 25 repetitions, each with about 9000 reviews in the training set and 3000 in the test set.
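The repeated split, fit and validate loop can be sketched as follows. This is a toy illustration under several assumptions, not the implementation behind Figure 11: the model is fitted with a hand-written proximal gradient routine, the penalty weight `lam` is arbitrary, variable importance is approximated as the average rise in test log-loss when a selected term's coefficient is set to zero, and the reviews are made up. Only the 25 repetitions and the 75/25 split mirror the setup described above.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, w, b):
    """Predicted probability that a review is positive."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_lasso_logistic(X, y, lam=0.05, step=0.5, iters=200):
    """L1-penalised logistic regression fitted by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        b -= step * grad_b  # the intercept is not penalised
        for j in range(p):
            v = w[j] - step * grad_w[j]
            # Soft-thresholding sets small coefficients exactly to zero.
            w[j] = math.copysign(max(abs(v) - step * lam, 0.0), v)
    return w, b


def log_loss(X, y, w, b):
    """Mean negative log-likelihood of the labels under the model."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(xi, w, b), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)


def term_importance(X, y, vocab, repeats=25, test_share=0.25, seed=0):
    """Average rise in test log-loss when a selected term is switched off."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    totals = {term: 0.0 for term in vocab}
    for _ in range(repeats):
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - test_share))
        train, test = idx[:cut], idx[cut:]
        w, b = fit_lasso_logistic([X[i] for i in train], [y[i] for i in train])
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]
        base = log_loss(X_te, y_te, w, b)
        for j, term in enumerate(vocab):
            if w[j] == 0.0:
                continue  # the lasso dropped this term in this repetition
            w_off = w[:j] + [0.0] + w[j + 1:]
            totals[term] += log_loss(X_te, y_te, w_off, b) - base
    return {term: total / repeats for term, total in totals.items()}


# Made-up reviews, for illustration only.
negative = ["the toilet was broken", "broken wifi"] * 10
positive = ["the flat was equipped", "well equipped"] * 10
vocab = ["broken", "equipped", "the"]
reviews = negative + positive
X = [[text.split().count(term) for term in vocab] for text in reviews]
y = [0] * len(negative) + [1] * len(positive)
importance = term_importance(X, y, vocab)
```

In this toy data ‘broken’ and ‘equipped’ separate the two classes, while ‘the’ occurs equally often in both, so its importance should stay close to zero. On a real corpus one would normally rely on an established library implementation and tune the penalty weight by cross-validation.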

This modeling framework outperforms simple TF-IDF-based categorization in two respects. (1) If the relative frequency of a term differs between the comments of high-rated and low-rated properties but its total frequency is low, the word will not contribute significantly to the categorization of a text (here, review comments). (2) If two words frequently occur together, the logistic lasso regression model omits the variable that does not carry new information.

The second property is also one of the main disadvantages of the model, since the more meaningful word may be the one that is excluded. For example, the word ‘bit’ by itself does not provide much information, yet it is selected because it carries the information that the Airbnb flat is located in “bit noisy” surroundings.

Figure 11: Most important positive and negative words based on lasso selection

As Figure 11 shows, the flat’s location and amenities are the attributes that stand out most among the review comments associated with lower guest satisfaction. The literature also supports these findings (Barbosa, 2019). Words related to amenities, such as ‘toilet’, ‘broken [equipment]’, ‘bathroom’, ‘windows’, ‘towels’ and ‘wifi’, mostly appeared in reviews of properties rated 4.5 stars (bottom decile) or below. According to Lee et al. (2019), a bad Wi-Fi connection in an Airbnb flat can ruin the entire customer experience. People tend to value things only after they have lost them, and this is also the case with an internet connection. Earlier in this paper, ‘price’ was not identified as a key determinant of overall user ratings, but in this model it ranks very high. The reason is that the word ‘price’ in customer reviews often refers to value for money.

We also examined which factors contribute favorably to user satisfaction. ‘Equipped’ and ‘furnished [apartments]’ appeared as guests’ top priority. ‘Location’ was also a significant factor. We removed some of the adverbs and adjectives associated with the location, such as “beautifully” and “amazing”, which allowed us to obtain more concrete information. Even so, the presence of positive opinions about location in the comments indicates that it strongly shapes users’ positive experience. Barbosa (2019) noted that people often stay in peer-to-peer accommodation with their families (hence the high frequency of the words ‘family’ and ‘husband’). They rent whole houses with many amenities (‘equipped’, ‘furnished’ apartments), so they have an opportunity to ‘relax’ while enjoying the ‘view’ and sipping ‘wine’ in the ‘garden’. Others received destination ‘tips’ from locals and were more interested in a traditional experience. ‘Hosts’ also play an important role in how customers evaluate their experience.

Figure 12: Graph of bigrams related to the model results

Figure 12 shows the graph of bigrams related to the results of the lasso regression model. The words are classified into three categories: positive words (associated with flats rated 4.9 or above), negative words (associated with flats rated 4.5 stars or below) and neutral words. For example, the word ‘bit’, which also appeared in Figure 11, now becomes interpretable: in some cases, Airbnb users found the environment “bit noisy”. Cheng (2019) confirms that negative sentiment is mostly caused by ‘noise’. Another example is the word ‘ruin’, which is also associated with low user ratings and makes little sense by itself. Combined with the word ‘bars’ or ‘pubs’, however, it is easily understood. The review below shows that apartments located near nightclubs are more likely to receive negative feedback:

“There’s a nightclub on the first floor with an open sky, you can hear the music all night long as the windows aren’t noised canceling… not something we were expecting at all… Otherwise, the location is great, the breakfast is a bit expensive for what they serve but it’s unlimited… If you plan to party all night long it’s a good place, otherwise, be prepared to have trouble sleeping.”
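The bigram lookup that makes words such as ‘bit’ and ‘ruin’ interpretable can be sketched as follows. This is an illustrative assumption about the procedure rather than the code behind Figure 12: the function name, the simple tokenizer, the absence of stop-word removal, and the example reviews are all hypothetical.

```python
import re
from collections import Counter


def bigrams_with(reviews, targets):
    """Count adjacent word pairs that contain at least one target word."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        for left, right in zip(tokens, tokens[1:]):
            if left in targets or right in targets:
                counts[(left, right)] += 1
    return counts


# Made-up reviews, for illustration only.
reviews = [
    "The flat was a bit noisy at night",
    "A bit noisy but close to the ruin bars",
    "Ruin bars next door, so bring earplugs",
]
pairs = bigrams_with(reviews, {"bit", "ruin"})
```

Here `pairs.most_common()` lists the most frequent neighbours of the selected words first, which is the information a bigram graph draws as edges.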