Text Analysis

Absolute and Weighted Word Count

When analyzing a corpus of documents (I'll simply call it a text list), one of the main tasks to accomplish to start text mining is to first count the words. While there are many text mining techniques and approaches, the word_frequency() function works mainly by counting words in a text list. A "word" is defined as a sequence of characters split by whitespace(s), and stripped of non-word characters (commas, dots, quotation marks, etc.). A "word" is actually a phrase consisting of one word, but you have the option of getting phrases that have two words, or more. This can be done simply by providing a value for the phrase_len parameter.

Absolute vs Weighted Frequency

In social media reports, analytics, keyword reports, url and page reports, we get more information than simply the text. We get numbers describing those posts or page titles, or product names, or whatever the text list might contain. Numbers can be pageviews, shares, likes, retweets, sales, bounces, sales, etc. Since we have numbers to quantify those phrases, we can improve our counting by taking into consideration the number list that comes with the text list.

For example, if you have an e-commerce site that has two products, let's say you have bags and shoes, then your products are split 50:50 between bags and shoes. But what if you learn that shoes generate 80% of your sales? Although shoes form half your products, they generate 80% of your revenue. So the weighted count of your products is 80:20.

Let's say two people post two different posts on a social media platform. One of them says, "It's raining", and the other says, "It's snowing". As in the above example, the content is split 50:50 between "raining" and "snowing", but we get a much more informative picture if we get the number of followers of each of those accounts (or the number of shares, likes, etc.). If one of them has a thousand followers, and other has a million (which is typical on social media, as well as in pageviews report, e-commerce and most other datasets), then you get a completely different picture about your dataset.

These two simple examples contain two posts, and a word each. The word_frequency() function can provide insight on hidden trends especially in large datasets, and when the sentences or phrases are also longer then a word or two each.

Let's take a look at how to use the word_frequency() function, and what the available parameters and options are.

text_list

The list of phrases or documents that you want to analyze. Here are some possible ideas that you might use this for:

keywords, whether in a PPC or SEO report

page titles in an analytics report

social media posts (tweets, Facebook posts, YouTube video titles or descriptions etc.)

e-commerce reports (where the text would be the product names)

num_list

Ideally, if you have more than one column describing text_list you should experiment with different options. Try weighting the words by pageviews, then try by bounce rate and see if you get different interesting findings. With e-commerce reports, you can see which word appears the most, and which word is associated with more revenue.

phrase_len

You should also experiment with different lengths of phrases. In many cases, one-word phrases might not be as meaningful as two-words or three.

regex

The default is to simply split words by whitespace, and provide phrases of length phrase_len. But you may want to count the occurrences of certain patterns of text. Check out the regex module for the available regular expressions that might be interesting. Some of the pre-defined ones are hashtags, mentions, questions, emoji, currencies, and more.

rm_words

A list of words to remove and ignore from the count. Known as stop-words these are the most frequently used words in a language, the most used, but don't add much meaning to the content (a, and, of, the, if, etc.). By default a set of English stopwords is provided (which you can check and possibly may want to modify), or run adv.stopwords.keys() to get a list of all the available stopwords in the available languages. In some cases (like page titles for example), you might get "words" that need to be removed as well, like the pipe "|" character for example.

extra_info

The returned DataFrame contains the default columns [word, abs_freq, wtd_freq, rel_value]. You can get extra columns for percentages and cumulative percentages that add perspective to the other columns. Set this parameter to True if you want that.

Below are all the columns of the returned DataFrame:

`word`	Words in the document list each on its own row. The length of these words is determined by `phrase_len`, essentially phrases if containing more than one word each.
`abs_freq`	The number of occurrences of each word in all the documents.
`wtd_freq`	Every occurrence of `word` multiplied by its respective value in `num_list`.
`rel_value`	`wtd_freq` divided by `abs_freq`, showing the value per occurrence of `word`
`abs_perc`	Absolute frequency percentage.
`abs_perc_cum`	Cumulative absolute percentage.
`wtd_freq_perc`	Weighted frequency percentage.
`wtd_freq_perc_cum`	Cumulative weighted frequency percentage.

import advertools as adv
import pandas as pd
tweets = pd.read_csv('data/tweets.csv')
tweets

	tweet_text	followers_count
0	@AERIALMAGZC @penguinnyyyyy you won't be afraid if I give you a real reason :D	157
1	Vibing in the office to #Metallica when the boss is on a coffee break #TheOffice https://t.co/U5vdYevvfe	4687
2	I feel like Ann says she likes coffee and then gets drinks that are 99% sugar and 1% coffee https://t.co/HfuBV4v3aY	104
3	A venti iced coffee with four pumps of white mocha, sweet cream and caramel drizzle might just be my new favorite drink. Shout out to TikTok lol	126
4	I was never a coffee person until I had kids. ☕️ this cup is a life saver. https://t.co/Zo0CnVuiGj	1595
5	Who's excited about our next Coffee Chat? We know we are!🥳 We're also adding Representative John Bradford to this lineup to discuss redistricting in the area. You won't want to miss it! RSVP: https://t.co/R3YNJjJCUG Join the meeting: https://t.co/Ho4Kx7ZZ24 https://t.co/KfPdR3hupY	5004
6	he paid for my coffee= husband💗	165
7	It's nipply outside, and now I side too :) That sounds like blowjob in front of a fire and visit with coffee after :) I'm still out of coffee I could have green tea instead Hahahahahahaha I want to spend the morning pampering you ...	0
8	Good morning 😃🌞☀️ I hope everyone has a great Tuesday morning. Enjoy your day and coffee ☕️ ♥️❤️💕🥰😘	189
9	@MarvinMilton2 I nearly choked on my coffee 🤪	1160

word_freq = adv.word_frequency(text_list=tweets['tweet_text'],
                               num_list=tweets['followers_count'])

# try sorting by 'abs_freq', 'wtd_freq', and 'rel_value':
word_freq.sort_values(by='abs_freq', ascending=False).head(25)

word_frequency(text_list, num_list=None, phrase_len=1, regex=None, rm_words={'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'both', 'bottom', 'but', 'by', 'ca', 'call', 'can', 'cannot', 'could', 'did', 'do', 'does', 'doing', 'done', 'down', 'due', 'during', 'each', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'if', 'in', 'indeed', 'into', 'is', 'it', 'its', 'itself', 'just', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'made', 'make', 'many', 'may', 'me', 'meanwhile', 'might', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'quite', 'rather', 're', 'really', 'regarding', 'same', 'say', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'under', 'unless', 'until', 'up', 'upon', 'us', 'used', 'using', 'various', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves'}, extra_info=False)

Count the absolute as well as the weighted frequency of words in text_list (based on num_list).

Parameters:

text_list (list) -- Typically short phrases, but could be any list of full blown documents. Usually, you would use this to analyze tweets, book titles, URLs, etc.
num_list (list) -- A list of numbers with the same length as text_list, describing a certain attribute of these 'documents'; views, retweets, sales, etc.
regex (str) -- The regex used to split words. Doesn't need changing in most cases.
phrase_len (int) -- The length in words of each token the text is split into (ngrams), defaults to 1.
rm_words (set) -- Words to remove from the list a.k.a 'stop-words'. The default uses. To get all available languages run adv.stopwords.keys().
extra_info (bool) -- Whether or not to give additional metrics about the frequencies.

Returns:

abs_wtd_df -- Absolute and weighted counts DataFrame.

Return type:

pandas.DataFrame

Examples

>>> import advertools as adv
>>> text_list = ['apple orange', 'apple orange banana',
...              'apple kiwi', 'kiwi mango']
>>> num_list = [100, 100, 100, 400]

>>> adv.word_frequency(text_list, num_list)
     word  abs_freq  wtd_freq  rel_value
  kiwi         2       500      250.0
 mango         1       400      400.0
 apple         3       300      100.0
orange         2       200      100.0
banana         1       100      100.0

Although "kiwi" occurred twice abs_freq, and "apple" occurred three times, the phrases in which "kiwi" appear have a total score of 500, so it beats "apple" on wtd_freq even though "apple" wins on abs_freq. You can sort by any of the columns of course. rel_value shows the value per occurrence of each word, as you can see, it is simply obtained by dividing wtd_freq by abs_freq.

>>> adv.word_frequency(text_list)  # num_list values default to 1 each
     word  abs_freq  wtd_freq  rel_value
 apple         3         3        1.0
orange         2         2        1.0
  kiwi         2         2        1.0
banana         1         1        1.0
 mango         1         1        1.0

>>> text_list2 = ['my favorite color is blue',
... 'my favorite color is green', 'the best color is green',
... 'i love the color black']

Setting phrase_len to 2, "words" become two-word phrases instead. Note that we are setting rm_words to the empty list so we can keep the stopwords and see if that makes sense:

>>> word_frequency(text_list2, phrase_len=2, rm_words=[])
              word  abs_freq  wtd_freq  rel_value
       color is         3         3        1.0
    my favorite         2         2        1.0
 favorite color         2         2        1.0
       is green         2         2        1.0
        is blue         1         1        1.0
       the best         1         1        1.0
     best color         1         1        1.0
         i love         1         1        1.0
       love the         1         1        1.0
      the color         1         1        1.0
   color black         1         1        1.0

The same result as above showing all possible columns by setting extra_info to True:

>>> adv.word_frequency(text_list, num_list, extra_info=True)
     word  abs_freq  abs_perc  abs_perc_cum  wtd_freq  wtd_freq_perc  wtd_freq_perc_cum  rel_value
  kiwi         2  0.222222      0.222222       500       0.333333           0.333333      250.0
 mango         1  0.111111      0.333333       400       0.266667           0.600000      400.0
 apple         3  0.333333      0.666667       300       0.200000           0.800000      100.0
orange         2  0.222222      0.888889       200       0.133333           0.933333      100.0
banana         1  0.111111      1.000000       100       0.066667           1.000000      100.0