Document Theme Classification

The primary objective of the Trustious Restaurants platform is to help people discover where to eat. In more concrete terms, Trustious needs to answer questions like “What are good sushi places nearby?”, “What are good breakfast spots in Heliopolis?”, “Where can I order delivery late at night?”, and “Where can I go for good healthy food?”. An obvious prerequisite is to build a comprehensive model of a food spot that covers aspects like: What cuisine does it offer? When is it open? Is it good for a family outing, breakfast, or large groups? Does it have outdoor seating, a good view? You get the idea. The question now is how to obtain this information, and this is where it gets interesting.

We basically had three approaches:

  1. License an existing data set
  2. Rely on “experts”
  3. Develop a data driven solution

Our first instinct was to see if some third party has clean, fresh data for food spots in Egypt that we could license. After an extended period of research, we arrived at the blunt conclusion that beyond basic location information, none of the available datasets fits our purpose, for two primary reasons: (1) Most of that data is aggregated manually by people calling the food place to ask a set of questions. As you may have guessed, this approach can work for “Where exactly are you located?” kinds of questions but not for “Are you good for breakfast?” kinds of questions. Data acquired this way does not differentiate between a place that merely allows people to eat before noon and a choice breakfast spot. (2) The datasets rarely covered the level of detail we need. For instance, none had information about what a place is really “good for” or when patrons typically visit.

The approach we ultimately developed is a hybrid data-driven pipeline with expert supervision. The idea is to analyze user data to find out what a food spot is really good for as seen by people, not as advertised by the food spot itself. The expert’s role is to guide the data analysis and help spot flaws that we as engineers may not perceive.

In summary, the data driven pipeline works as follows:

  1. Develop a model for each “theme”. For instance, people typically go early to a restaurant that is good for breakfast, and reviews about the place should contain a certain set of keywords (e.g. breakfast, omelette).
  2. For each restaurant, create a document that includes all available data about the place. This includes Trustious user reviews (currently over 10,000), the menu, Foursquare tips, and Foursquare check-in times, among others.
  3. For each theme and restaurant pair, measure the quality of fit between the theme’s model and the restaurant’s document.
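
A minimal sketch of how step 3 could look in Python follows; score_keywords and score_timing are hypothetical helpers standing in for the feature matching described in the rest of this post:

# Hypothetical top-level loop for step 3: score every (theme, restaurant)
# pair and keep the themes whose combined fit exceeds a threshold.
def classify_restaurants(restaurants, themes, threshold=0.6):
    labels = {}
    for restaurant in restaurants:
        document = restaurant.document  # aggregated reviews, menu, tips, check-ins
        labels[restaurant.name] = [
            theme.name for theme in themes
            if score_keywords(document, theme) + score_timing(document, theme) >= threshold
        ]
    return labels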

In the rest of this post, we will cover how that works in some detail.

Theme Modeling

We came up with a list of interesting themes that we wanted to model. The first type of feature associated with a given theme is its relevant keywords.

Keyword Identification

We applied keyword frequency analysis to the documents of all restaurants. This resulted in a list of thousands of keywords sorted by frequency. Here’s a sample preview of the keywords list:

chicken - 1971
grill - 1204
sandwich - 966
salad - 906
sushi - 865
cafe - 750
burger - 741
cheese - 729
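
The counting itself is straightforward. Here is a minimal sketch in Python, assuming each restaurant’s text (reviews, menu, tips) is available as one plain string:

import re
from collections import Counter

def keyword_frequencies(documents):
    counts = Counter()
    for text in documents:
        # Lowercase, then keep runs of Latin or Arabic letters;
        # a production pipeline would also strip stopwords.
        tokens = re.findall(r"[a-z\u0600-\u06FF]+", text.lower())
        counts.update(tokens)
    return counts

# keyword_frequencies(documents).most_common(8) yields pairs
# like ('chicken', 1971), matching the list above.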

We then grouped related keywords using clustering techniques such as:

  • Phonetic Fingerprint: Useful for ignoring common phonetic spelling mistakes that may occur in user reviews or posts.
  • Nearest Neighbor Methods: Assigning an item to a class based on its k-nearest-neighbors (kNN).
  • Levenshtein Distance: The number of edit operations needed to change one string into the other.
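
As an illustration of the Levenshtein step, here is a minimal sketch of the edit distance together with a naive greedy grouping; the grouping strategy is our simplification, not the exact clustering we ran:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only one row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def group_keywords(keywords, max_distance=2):
    # Greedily place each keyword into the first group whose
    # representative is within max_distance edits.
    groups = []
    for keyword in keywords:
        for group in groups:
            if levenshtein(keyword, group[0]) <= max_distance:
                group.append(keyword)
                break
        else:
            groups.append([keyword])
    return groups

# group_keywords(['omelette', 'omlette', 'omlete', 'sushi'])
# puts the three omelette spellings in one group.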

We inspected the clustering results, ending up with a group of keywords for every theme. A sample of the themes extracted is shown here:

romantic = ['romantic', 'nile view', 'late night', 'midnight']
breakfast = ['break-fast', 'breakfast', 'break fast', 'omlette', 'omlete', 'morning', 'فطار', 'افطار']  # misspellings are intentional; the last two are Arabic for "breakfast"
healthy = ['weight watchers', 'healthy food spot', 'diet']

Timing Signals

While exploring the Foursquare venue API, we noticed another valuable theme feature: a restaurant’s opening hours and popular check-in times. These make up the second type of theme feature used in our data-driven approach.

Popular check-in times indicate at which times of the day (or days of the week) a venue gets frequent check-ins. This is offered through the Foursquare API as an array of time windows for each day of the week.

Consider the case where a certain venue consistently gets many check-ins early in the day (where “early” means before noon). This significantly increases the probability that the place is visited often for breakfast, and it is a much stronger signal than simply checking the menu for breakfast items: it lets us conclude with confidence that the place not only offers breakfast but serves good breakfast. The same idea applies to many themes, such as late romantic dinners, breakfast spots, outings after work, or weekend hangouts.

For each theme T, we define a timing curve C covering the time of day or week related to T. C is then added as a feature of T. For example, if T is the late-night romantic dinner theme, then C is the normal probability density function over the time range [9:00PM – 1:00AM].
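
As a concrete example, here is a minimal sketch of that curve in Python; the mean, spread, and midnight wrap-around handling are our illustrative choices:

import math

MEAN_HOUR = 23.0  # midpoint of the 9:00 PM - 1:00 AM window
STD_HOURS = 1.5

def unwrap(hour):
    # Map early-morning hours past midnight (e.g. 1 AM -> 25) so the
    # late-night window is contiguous on the hour axis.
    return hour + 24 if hour < 12 else hour

def timing_curve(hour):
    # Normal probability density centred on the theme's time window.
    x = unwrap(hour)
    return (math.exp(-((x - MEAN_HOUR) ** 2) / (2 * STD_HOURS ** 2))
            / (STD_HOURS * math.sqrt(2 * math.pi)))

def timing_fit(checkin_hours):
    # Average curve value over a venue's observed check-in hours,
    # normalised so a perfect fit scores 1.0.
    peak = timing_curve(MEAN_HOUR)
    return sum(timing_curve(h) for h in checkin_hours) / (len(checkin_hours) * peak)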

Document Analysis

Our Trustious dataset is rich with valuable restaurant-specific information, including user reviews and ratings, restaurant menus, and other associated metadata. Beyond our own data, several apps on the internet provide APIs exposing the results of their data crunching, whether that is the most checked-in restaurants on Foursquare or popular hashtagged photos on Instagram. We aggregated data from several such open APIs.

Challenge: Item Matching

A natural obstacle when merging similar data from multiple independent sources is how to match items. To associate any external information with a Trustious restaurant or item, we try to match the given titles with the names of the items on Trustious; these titles come, for example, from the name of a venue on Foursquare or the title of a blog post. In most cases the names are not an exact match, so simply checking for string equality is not enough.

To solve this problem, we used our existing Elasticsearch cluster to find the best matching item on Trustious. We treat the name of a Foursquare venue as a search query given to our search engine; the result is the best matching document in our search index. Elasticsearch (more specifically, Lucene) makes this easy through fuzzy matching and per-query quality scoring. We chose a quality threshold that favors precision over recall in order to keep the matching pipeline as automated as possible, while a moderator verifies the remaining low-confidence candidate matches.
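
A minimal sketch of that lookup, assuming the official elasticsearch-py client and an index called restaurants (the index name and threshold value are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch()
MIN_SCORE = 8.0  # precision-oriented threshold, tuned on a labelled sample

def match_venue(venue_name):
    # Use the external venue name as a fuzzy full-text query and let
    # Lucene's scoring rank the candidate Trustious items.
    result = es.search(index="restaurants", body={
        "min_score": MIN_SCORE,
        "query": {
            "match": {
                "name": {"query": venue_name, "fuzziness": "AUTO"}
            }
        }
    })
    hits = result["hits"]["hits"]
    return hits[0] if hits else None  # None -> send to a moderator queue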

Restaurant Model

We represent each restaurant as a model composed of the following:

  • Trustious reviews
  • Trustious restaurant menus
  • Foursquare tips
  • Popular check-in times of day/week

We gathered this information from the relevant API endpoints and, for each restaurant, aggregated it into a “Document”, which is the appropriate term used in Information Retrieval. The next step is to find the best matching theme for each document.
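
As a concrete illustration, assembling such a document can be as simple as the following sketch; the field names are ours for illustration, not an actual schema:

def build_document(restaurant, external_data):
    # Bundle everything we know about one restaurant into a single
    # "document" for the classification step.
    return {
        "reviews": [review.text for review in restaurant.reviews],
        "menu": [item.name for item in restaurant.menu_items],
        "tips": external_data.get("tips", []),
        "checkin_hours": external_data.get("checkin_hours", []),
    }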

Document Classification

The final step is to match all pairs of documents and themes. For the textual features, the matching was performed via several methods, such as exact, n-gram, and fuzzy matching. As for comparing a theme’s timing signals with a restaurant’s check-in data, we used a modified p-value normality test to judge how busy a restaurant gets during a given time window. If a document D matches a given theme T, then the restaurant R associated with D is classified under T.
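
Putting the pieces together, a minimal sketch of the matching step might look like this; only exact keyword matching is shown (the n-gram and fuzzy variants follow the same shape), and the weights and threshold are illustrative:

def matches_theme(document, theme, threshold=0.6):
    # Textual fit: count exact keyword occurrences, saturating at 10 hits
    # so a single very wordy review cannot dominate the score.
    text = " ".join(document["reviews"] + document["menu"] + document["tips"]).lower()
    hits = sum(text.count(keyword) for keyword in theme.keywords)
    keyword_score = min(1.0, hits / 10.0)
    # Timing fit: reuse timing_fit from the earlier sketch (already in [0, 1]).
    hours = document["checkin_hours"]
    time_score = timing_fit(hours) if hours else 0.0
    return 0.7 * keyword_score + 0.3 * time_score >= threshold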

Applications

Our data-driven approach works well for associating themes with restaurants. We currently use it to generate lists of exciting top places that share a theme. It can also be applied to different datasets to classify different entities; the definitions above can be adapted to build a suitable model for another domain. If you have any questions about any of the models or techniques mentioned above, please don’t hesitate to let us know in the comments.

4 thoughts on “Document Theme Classification”

  1. Nice and rich post. As a company based in Cairo, you have data in different languages, mainly Arabic and English, and the data from other sources is also mixed. Do you consider matching Arabic and English items? Have you arrived at any useful techniques?

    This is not covered in the post, but I am interested to know if you have reached something, as this problem is at the core of my current research. So you may help ;)

    1. One approach we rely on is the presence of an extra field inside the document that contains translations of the document itself, or of keywords present in it. This field is also matched against the query, and the union of both result sets (English and Arabic) is considered before applying a quality threshold.

      1. So you mainly depend on manual translation for what you have… Best of luck, looking forward to more articles :)

  2. I’m really proud of you guys (Y)
    I think there is huge room for research in this area, especially information retrieval. I think social analysis will be very powerful for your problem; for example, sentiment analysis on tweets can be a good source for recommendations in general.

    But I have a question: how do you collect the bag of words for each theme? Is it done manually, or in an unsupervised manner?

    Why don’t you use Wikipedia as the source for the bag of words, using the categories idea?

    Also, I think the item matching problem can’t be solved with text-based similarity alone, so you could check something like WordNet to help with this problem by applying direct semantic similarity, which could be a feature in your clustering algorithm. You could also start thinking about how to apply word embedding techniques to the data from your datasets.
