Predicting Yelp Restaurant Ambiance

A natural language processing project: predicting restaurant attributes from customer reviews using Logistic Regression, BERT, and CNNs.



Intro

Predicting restaurant ambiance mirrors the steps a customer takes when searching for the perfect restaurant:

  1. whether the restaurant has the attributes one looks for, and

  2. whether the reviews substantiate the restaurant's listed attributes.

A restaurant may have a specific attribute toggled as "True", but the customer reviews may say otherwise. In this project, my team used natural language processing and deep learning to detect attributes of a restaurant from its user reviews.

Methods

Our goal was to train a model on the restaurants' reviews to predict labels for an attribute. The inputs of our models were reviews grouped and concatenated by restaurant (restaurant-level data), and the outputs of our models were True/False labels for the respective attribute.
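Sketched with pandas (the column names here are assumptions for illustration, not the actual Yelp schema), building the restaurant-level inputs is a group-and-concatenate:

```python
import pandas as pd

# Hypothetical review-level data; the real Yelp dataset has many more columns.
reviews = pd.DataFrame({
    "business_id": ["a", "a", "b"],
    "text": ["Great patio.", "Kids loved it.", "Slow service."],
})

# Concatenate each restaurant's reviews into one document (restaurant-level data).
restaurant_docs = (
    reviews.groupby("business_id")["text"]
    .apply(" ".join)
    .reset_index(name="all_reviews")
)
```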

Since our models were performing binary predictions, we used accuracy as our main metric of model performance. We evaluated the accuracy, precision, and recall of our predictions for different attributes and compared the different deep learning language models.
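For binary labels, these metrics reduce to a few scikit-learn calls; the labels below are toy values for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy ground-truth and predicted attribute labels (True = 1, False = 0).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # fraction of labels predicted correctly
prec = precision_score(y_true, y_pred)  # of predicted True, how many are actually True
rec = recall_score(y_true, y_pred)      # of actual True, how many were found
```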

Building and training details

Using the review and business datasets from the Yelp website, and after exploratory data analysis (EDA) and baseline predictive models, we subsetted the data to restaurants with at least 50 reviews so that each restaurant had enough text for training. We chose the specific attributes to focus on based on minimal class imbalance and sparsity.
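The 50-review cutoff is a simple filter; sketched with pandas (column names are assumptions, not the actual Yelp schema):

```python
import pandas as pd

# Hypothetical review-level data: restaurant "a" has 60 reviews, "b" only 10.
reviews = pd.DataFrame({
    "business_id": ["a"] * 60 + ["b"] * 10,
    "text": ["some review text"] * 70,
})

# Keep only restaurants with at least 50 reviews.
counts = reviews["business_id"].value_counts()
keep = counts[counts >= 50].index
subset = reviews[reviews["business_id"].isin(keep)]
```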

(Figure: example of the subsetted dataset; table: baseline model results.)

To establish a baseline, we trained a Naive Bayes model on Bag-of-Words (BoW) vectors. To improve performance and shrink the vectorized text, we applied RegEx to remove stop words, symbols, and numbers. We then tokenized with NLTK and created features with CountVectorizer, split the data into a 70/30 train-test split, and fed it into a multinomial Naive Bayes model.
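A minimal sketch of that baseline pipeline, with toy reviews and hypothetical GoodForKids labels; scikit-learn's built-in tokenizer and English stop-word list stand in here for the NLTK steps we actually used:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy reviews with hypothetical GoodForKids labels, for illustration only.
docs = [
    "Great patio and the kids menu was fantastic!",
    "My kids loved the pizza and the play area.",
    "Terrible service and a 45 minute wait.",
    "Quiet romantic spot, definitely not for children.",
    "Kids eat free on Sundays; brunch was phenomenal.",
    "Loud bar, mostly cocktails, adults-only vibe.",
    "Family-friendly staff brought crayons for the kids.",
    "Upscale dining with an extensive wine list.",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

def clean(text):
    """Strip symbols and numbers with RegEx, as in our preprocessing."""
    return re.sub(r"[^a-zA-Z\s]", " ", text).lower()

# Bag-of-Words features with English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(clean(d) for d in docs)

# 70/30 train-test split, then multinomial Naive Bayes.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0
)
model = MultinomialNB().fit(X_train, y_train)
acc = model.score(X_test, y_test)
```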

For the next model, we implemented Logistic Regression using the same preprocessing steps described above. We also experimented with Term Frequency-Inverse Document Frequency (TF-IDF) to vectorize the reviews. While the BoW vectorizer only tracks raw word frequency, TF-IDF multiplies a term's frequency in a document by its inverse document frequency, weighting words by how rare, and therefore how informative, they are.
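That combination fits naturally into a scikit-learn pipeline; again the reviews and labels are toy values, not our data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews with hypothetical GoodForKids labels, for illustration only.
docs = [
    "great kids menu and a fun play area",
    "kids eat free on sundays and loved the pizza",
    "family friendly staff brought crayons for the kids",
    "quiet romantic spot with an extensive wine list",
    "loud cocktail bar with an adults only vibe",
    "upscale tasting menu, definitely not for children",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF downweights terms appearing in many documents, so rare,
# informative words carry more weight than in plain Bag-of-Words.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
clf.fit(docs, labels)
prediction = clf.predict(["the kids menu was great and my kids loved it"])
```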

BoW and TF-IDF are count-based approaches, which are not ideal for capturing context. We therefore turned to BERT (Bidirectional Encoder Representations from Transformers), a transformer-based machine learning framework for natural language processing designed to pre-train deep bidirectional representations from unlabeled text by conditioning on both left and right context in all layers. Our first BERT model used the pre-trained 'bert-base-uncased' tokenizer and encoder with all layers frozen except a single dense layer that predicts our labels. Subsequent models experimented with unfreezing varying numbers of layers (to fine-tune on our data) and adding dense layers (for model complexity).
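A PyTorch sketch of that first model's freezing scheme, with the encoder left abstract (in practice it would be something like `transformers.BertModel.from_pretrained('bert-base-uncased')`); the single dense layer contributes 768 weights plus 1 bias, i.e. 769 trainable parameters:

```python
import torch.nn as nn

class FrozenBertClassifier(nn.Module):
    """Pre-trained encoder frozen; only one dense head trains (768 + 1 = 769 params)."""
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False        # freeze all pre-trained layers
        self.head = nn.Linear(hidden_size, 1)  # single dense layer on the CLS embedding

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(cls_embedding)              # logit for the True/False label

def count_trainable(model):
    """Number of parameters that will actually be updated during training."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```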

Because our BERT models had difficulty pushing accuracy beyond a certain threshold, we experimented with CNNs as well. Our reasoning was that BERT's CLS token for each review may not capture information specific to the target attributes, so finding relationships between certain words with a CNN might improve accuracy. In the CNN model, the Keras tokenizer was used, and non-alphabetic tokens, punctuation, stop words, and short tokens were removed during preprocessing. The architecture consisted of three channels, each containing embedding, convolution, dropout, max pooling, and flatten layers, whose outputs were then concatenated.
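A sketch of that three-channel architecture in Keras; the kernel sizes, filter counts, and embedding dimension here are illustrative choices, not necessarily the exact ones we used:

```python
from tensorflow.keras.layers import (
    Concatenate, Conv1D, Dense, Dropout, Embedding, Flatten, Input, MaxPooling1D,
)
from tensorflow.keras.models import Model

def build_multichannel_cnn(vocab_size, seq_len, kernel_sizes=(4, 6, 8)):
    """Three-channel text CNN; hyperparameter values are illustrative."""
    inputs, channels = [], []
    for k in kernel_sizes:
        inp = Input(shape=(seq_len,))
        x = Embedding(vocab_size, 100)(inp)  # learned word embeddings
        x = Conv1D(filters=32, kernel_size=k, activation="relu")(x)
        x = Dropout(0.5)(x)
        x = MaxPooling1D(pool_size=2)(x)
        x = Flatten()(x)
        inputs.append(inp)
        channels.append(x)
    merged = Concatenate()(channels)                 # join the three channels
    hidden = Dense(10, activation="relu")(merged)
    output = Dense(1, activation="sigmoid")(hidden)  # True/False attribute label
    return Model(inputs=inputs, outputs=output)
```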


Results

Across the attributes we predicted, Logistic Regression performed the best. Restaurant reviews detected GoodForKids with the highest accuracy, likely because the models could learn from frequent, explicit mentions of kids in the reviews. For example, some reviews include: “Kids options look good too! My kids love pizza. The Sunday brunch is phenomenal and kids eat free. My wife and I were out without kids and really wanted an awesome steak!” In contrast, OutdoorSeating had the lowest accuracy under the Naive Bayes and Logistic Regression models. This could be because reviews focus more on food, specific dishes, or customer service, so the BoW and TF-IDF vectorizers learn little from the low frequency of words related to OutdoorSeating. GoodForKids and RestaurantsReservations, by comparison, seemed to be discussed more often, which led to better detection.

Context & Variation Constraints

Something to keep in mind is that restaurant attributes are specified by restaurant owners, independently of the review text written by customers. Some of our models' error could be due to inconsistencies between owners' labels and what customers actually experience. Similarly, reviews for the same restaurant can vary widely in sentiment and overall experience.

Why more complex models may have performed worse

Surprisingly, the BERT models could not surpass the simpler Naive Bayes and Logistic Regression models in accuracy. Of all the BERT models, the first one, with frozen pre-trained layers and a single dense layer, performed the best. This model had only 769 trainable parameters, compared to 213,377 in the next frozen model and roughly 109 million in the unfrozen models. We suspect the concatenated reviews exceeded BERT's maximum token length (512 for 'bert-base-uncased') for some restaurants, truncating useful signal. In addition, the CLS token may not be ideal for our task: it captures overall sentiment, which is too broad for predicting specific attributes.

The CNN model trained on the restaurant-level data achieved the lowest accuracy of all our models. Depending on the attribute, it can have more than 170 million trainable parameters, making it the largest model so far and requiring hours per training epoch. A CNN trained on 100,000 samples of review-level data achieved the highest training accuracy (>97%) but the lowest testing accuracy (~50%), a classic case of overfitting when the sampled reviews are not representative of all reviews; those results are not displayed in the table. With more than 100,000 samples, the CNN took too long to train within the allowed timeframe.

Next Steps

Next steps include further implementing and optimizing complex models, such as BERT, other transformer-based models, and CNNs, to learn contextualized embeddings of reviews for predicting a restaurant's ambiance. We hope to fine-tune our models and apply them to make predictions on unlabeled restaurants. Other areas of future research extend to semi-supervised learning and auto-labeling in the training process. In our case study we analyzed a selected subset of binary attributes, but we hope to extend this to all restaurant attributes. Furthermore, we could cluster attributes to define standardized ambiances for our model to predict. A current constraint is the length of our concatenated review texts per restaurant, so we hope to apply BERT variants designed for long texts.

Reference here for full paper and code (Git repository).