Background

Planning a trip is one of my favorite things to do. I find it fun to learn about a destination’s history, culture, and places to visit. Planning a trip properly takes a while, and I found out just how difficult and time-consuming it is while preparing for my trip to the Philippines.

Like many other travelers, I felt unprepared when visiting a new place. I needed a quick way to filter all the activities and restaurants and focus on just the ones that I’ll enjoy. This led me to create a recommendation system based on places and activities that I am already familiar with.

Data

Tripadvisor is the first place I go to learn about potential places to visit. It does a great job of listing the top activities, suggesting hotels, and giving a general overview of the city being visited. Activities are ranked by average customer review, and each listing contains detailed information like hours of operation, the type of traveler that enjoys the location, and the category of the activity.

The best thing about Tripadvisor is the reviews provided by ordinary people who comment on their experiences. These comments make the site an incredible goldmine of data.

The dataframe above shows the data after we’ve scraped the details and the latest 25 reviews for the activity.

Stop Words

To create a recommendation system based on the reviews, we must first focus on the most important words. Stop words are commonly used words (such as “the”, “a”, and “an”) that carry little meaning on their own, so they can be ignored when comparing reviews for similarity.
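As a rough illustration of the idea, here is a minimal sketch that filters a made-up review against scikit-learn's built-in English stop word list. The `remove_stop_words` helper and the sample sentence are hypothetical and not part of the original project, which passes `stop_words='english'` to the vectorizer instead of filtering by hand.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop English stop words."""
    return [word for word in text.lower().split() if word not in ENGLISH_STOP_WORDS]


# Hypothetical review text, used only for illustration.
review = "The museum is a great place to spend an afternoon"
filtered = remove_stop_words(review)
print(filtered)
```

Words like “the”, “a”, and “an” are dropped, while content words such as “museum” remain.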

Building the Recommendation System

TF-IDF Matrix: Find The Most Important Words

TF-IDF stands for term frequency – inverse document frequency. It is a number that reflects how important a word is to one document relative to the whole collection (corpus). If a word appears many times in a specific activity’s comments (the document) but rarely across all the comments (the corpus), that word gets a higher weight for that document.

The ngram_range parameter sets how many consecutive words are grouped into a single term. In the trip recommendation system, I’ve used a range of 1 to 3, so a TF-IDF value is calculated for every single word, every pair of adjacent words, and every run of three adjacent words. This makes it possible to capture common phrases that would be missed if we focused on individual words alone (e.g. “rainy days” vs. “rainy” or “days” individually).

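from sklearn.feature_extraction.text import TfidfVectorizer

# build the TF-IDF matrix: one row per activity, one column per 1- to 3-word term
# min_df=1 (the default) keeps every term; newer scikit-learn releases reject min_df=0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df['comments_test'])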
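To see what the vectorizer extracts, here is a small self-contained sketch on two made-up reviews. The `toy_reviews` list is a hypothetical stand-in for the scraped comments column. The settings mirror the ones above, except that `min_df` is left at its default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reviews standing in for the scraped comments, one string per activity.
toy_reviews = [
    "perfect museum for rainy days",
    "lovely park for sunny days",
]

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
toy_matrix = vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each extracted term (1 to 3 words long) to its column index.
vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(toy_matrix.shape)  # (number of reviews, number of extracted terms)

# Compare the weights of "rainy" and "days" within the first review.
first_row = toy_matrix[0].toarray()[0]
print(first_row[vocab["rainy"]], first_row[vocab["days"]])
```

The phrase “rainy days” shows up as its own term alongside “rainy” and “days”. Because “days” appears in both reviews while “rainy” appears in only one, “rainy” gets the higher weight in the first review.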

Cosine Similarity

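Cosine similarity is one of the most widely used similarity measures in data science because it compares the direction of two vectors rather than their length, so a long set of comments and a short one can still be compared fairly. We can find how similar two activities are by taking the dot product of their TF-IDF vectors and dividing it by the product of the two vectors’ magnitudes. Because TfidfVectorizer scales each row to unit length by default, this works out to the dot product of the TF-IDF matrix with its own transpose.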

Each activity is then compared to every other activity based on its TF-IDF scores, giving a cosine similarity value for every pair of activities. To make a recommendation, the system simply takes a chosen activity and presents the activities with the highest cosine similarity values to it.

from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
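The lookup step can be sketched as follows. This is a minimal example under stated assumptions, not the project's actual code: the `recommend` function, the `names` and `cities` lists, and the toy comments are all hypothetical, and it assumes each row of the similarity matrix lines up with one activity name and its city.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(query_name, names, cities, similarity_matrix, target_city, top_n=5):
    """Return up to top_n (name, score) pairs in target_city most similar to query_name."""
    query_index = names.index(query_name)
    scores = similarity_matrix[query_index]
    # Rank every activity by similarity to the query, highest first.
    ranked = np.argsort(scores)[::-1]
    matches = [
        (names[i], float(scores[i]))
        for i in ranked
        if cities[i] == target_city and i != query_index
    ]
    return matches[:top_n]


# Hypothetical toy data: one NYC activity and two London activities.
names = ["Art Museum A", "History Museum B", "City Park C"]
cities = ["NYC", "London", "London"]
comments = [
    "amazing museum with paintings and sculpture exhibits",
    "huge museum with ancient sculpture exhibits",
    "quiet park with a lake and walking paths",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(comments)
sim = cosine_similarity(tfidf, tfidf)

results = recommend("Art Museum A", names, cities, sim, target_city="London", top_n=2)
print(results)
```

Filtering by city after ranking keeps the function simple. For a large dataset, it may be cheaper to slice the similarity matrix to the target city's columns first.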

Examples

To test how well the suggestions work, I input places I like to visit in NYC and get back a list of similar places in London. For my trip to London, I would like to visit places that are similar to The Metropolitan Museum of Art, Times Square, and Columbia University.

Top Recommendations For The Metropolitan Museum of Art:

The British Museum
Garden Museum
Fashion and Textile Museum
Museum of London
National Maritime Museum

Metropolitan Museum of Art and The British Museum

Top Recommendations For Times Square:

Piccadilly Circus
Leicester Square
The Shard
Dinerama – Street Feast
Gods Own Junkyard

Top Recommendations For Columbia University:

London School of Economics
British Library
Freemasons Hall
Gunnersbury Park & Museum
The Shard

Columbia University and London School Of Economics

The best thing about this recommendation system is that it is based entirely on user-generated comments. Using only combinations of 1, 2, or 3 words within the comments provided pretty accurate results for similar activities in another city on another continent. One way to improve the results would be to create a second similarity matrix from the details (about section, ranking, number of reviews, and category) and combine the two matrices. This is similar to what I’ve done with my fund recommendation system, which will be available for use in the next week.
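One possible way to bring two similarity matrices together is a weighted average, sketched below. This is only an assumption about how the matrices might be combined. The `combine_similarities` function, the `review_weight` parameter, and the toy values are hypothetical, and both matrices are assumed to list the same activities in the same order.

```python
import numpy as np


def combine_similarities(review_sim, detail_sim, review_weight=0.7):
    """Blend two same-shaped similarity matrices with a weighted average."""
    review_sim = np.asarray(review_sim, dtype=float)
    detail_sim = np.asarray(detail_sim, dtype=float)
    if review_sim.shape != detail_sim.shape:
        raise ValueError("both matrices must cover the same activities in the same order")
    return review_weight * review_sim + (1.0 - review_weight) * detail_sim


# Hypothetical 2x2 matrices: similarity from reviews and from activity details.
review_sim = np.array([[1.0, 0.2], [0.2, 1.0]])
detail_sim = np.array([[1.0, 0.8], [0.8, 1.0]])

combined = combine_similarities(review_sim, detail_sim, review_weight=0.5)
print(combined)
```

The weight controls how much the reviews count relative to the details, and it would need tuning against recommendations that are known to be good.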