Natural Language Processing: Find The Most Common Phrases on Twitter

Tazki Anida Asrul
Mar 1, 2019

Last Sunday, one of the biggest award ceremonies was held. Yup, the 2019 Oscars. People all over the world got excited about this event, and on Twitter we could see a ton of tweets posted before, during, and after the ceremony.

So… what were the most discussed topics during the Oscars this year? I tried to figure it out using my basic knowledge of NLP and Python.

First of all, I scraped Twitter data using Tweepy. The Tweepy library lets you use the Twitter Streaming API, so you can capture Twitter messages in real time. In my case, I streamed the data for around 10 minutes with ‘oscar’ as the filter keyword. The result was saved as a JSON file, and only tweets containing the word ‘oscar’ were collected.
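A minimal sketch of that streaming step, assuming Tweepy 3.x (the FileListener class and the placeholder credentials here are just illustrative), could look like this:

import tweepy

# Placeholder credentials -- replace with your own Twitter API keys
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

class FileListener(tweepy.StreamListener):
    # Append each raw tweet (one JSON object per line) to a file
    def on_data(self, raw_data):
        with open('oscar.json', 'a') as f:
            f.write(raw_data)
        return True

    def on_error(self, status_code):
        # Disconnect on rate-limit errors (HTTP 420)
        return status_code != 420

stream = tweepy.Stream(auth, FileListener())
stream.filter(track=['oscar'])  # only tweets containing 'oscar'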

The second step was the most interesting and challenging part of all: preprocessing. In this step, I converted the unstructured JSON data into a more readable, structured format, a Pandas DataFrame.

import pandas as pd     
import numpy as np
import json
import re
from nltk.corpus import stopwords
from nltk.util import ngrams
from collections import Counter
# For visualization:
import matplotlib.pyplot as plt
import seaborn as sns
tweets = []

with open('oscar.json', 'r') as f:
    for line in f:
        try:
            content = json.loads(line)
            content.pop('limit', None)  # rate-limit notices have no 'text' field
            tweets.append(content)
        except ValueError:  # skip malformed or empty lines
            continue

data = pd.DataFrame(tweets, columns=['text'])
data = data.dropna()

At this point, the data still contains a lot of stop words, usernames, and even links. We need to clean them out.

So I ran some cleansing processes, including removing certain patterns, punctuation, and English stop words.
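Here remove_pattern is a small helper; a minimal version, similar to the one in the Analytics Vidhya tutorial listed in the references, could be:

def remove_pattern(input_txt, pattern):
    # Delete every match of `pattern` from the tweet text
    return re.sub(pattern, '', input_txt)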

#remove usernames, links, and RT
data['clean_tweet'] = np.vectorize(remove_pattern)(data['text'], r"https\S*|RT|@\w+")
#remove punctuation and numbers
data['clean_tweet'] = data['clean_tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)
#lowercase everything
data['clean_tweet'] = data['clean_tweet'].str.lower()
#remove English stop words
stop_words = set(stopwords.words('english'))
data['clean_tweet'] = [' '.join([w for w in x.split() if w not in stop_words])
                       for x in data['clean_tweet'].tolist()]
#remove words shorter than 3 characters
data['clean_tweet'] = data['clean_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))
#tokenize each tweet into bigrams
tokenized_tweet = data['clean_tweet'].apply(lambda x: list(ngrams(x.split(), 2)))

After cleaning, only the main words of each tweet remain. I captured phrases of these main words using the ngrams function, which tokenizes a sentence by splitting the string into words and grouping adjacent words into phrases: the second word of each tuple becomes the first word of the next one, and so on. For example, with the sentence “lady gaga watch oscar” as input, the bigram output will be:

[('lady', 'gaga'), ('gaga', 'watch'), ('watch', 'oscar')]
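You can check this quickly in the interpreter (ngrams returns a generator, so we wrap it in list):

from nltk.util import ngrams

print(list(ngrams("lady gaga watch oscar".split(), 2)))
# [('lady', 'gaga'), ('gaga', 'watch'), ('watch', 'oscar')]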

After getting the phrases from all tweets, I combined them into one list and counted how often each phrase appears.

# Flatten the per-tweet bigram lists into a single list of phrases
flatten = [gram for tweet in tokenized_tweet for gram in tweet]

# Count each bigram and load the counts into a DataFrame
counts = Counter(flatten).most_common()
df = pd.DataFrame.from_records(counts, columns=['Phrase', 'Count'])
df['Phrase'] = df['Phrase'].apply(lambda x: ' '.join(x))

In the final step, I used the Matplotlib and Seaborn libraries to visualize the result.

df = df.nlargest(n=10, columns="Count")  # keep the 10 most frequent phrases
plt.figure(figsize=(15, 4))
ax = sns.barplot(data=df, x="Phrase", y="Count")
ax.set(ylabel='Count')
plt.show()

I created a bar chart to present the 10 most common phrases, and the result is…

‘Lady Gaga’ took first place, followed by ‘Oscar Best’ and ‘Green Book’. Overall, the most discussed topics were related to actresses and actors (Lady Gaga, Rami Malek), movies (Green Book), directors (Spike Lee), and nominations (Best Picture, Best Actor). We can also see that some phrases are in Spanish, because we didn’t filter the tweets by language.
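If you only wanted English tweets, one option, assuming you keep the tweet object’s lang field when building the DataFrame, would be:

# Keep the 'lang' field alongside 'text', then drop non-English tweets
data = pd.DataFrame(tweets, columns=['text', 'lang'])
data = data[data['lang'] == 'en'].dropna()

Alternatively, Tweepy’s filter accepts a languages argument (e.g. languages=['en']), which filters at streaming time.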

That said, the experiment above is just a small example of text processing with Twitter data. There’s still a lot of room to improve it and to explore other applications of NLP. Hope it helps!

Reference

  1. https://methodi.ca/recipes/analyzing-repeating-phrases-ngrams-python
  2. https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/
