Text Analysis

Absolute and Weighted Word Count

When analyzing a corpus of documents (I’ll simply call it a text list), one of the first tasks in text mining is counting the words. While there are many text mining techniques and approaches, the word_frequency() function works mainly by counting words in a text list. A “word” is defined as a sequence of characters split by whitespace and stripped of non-word characters (commas, dots, quotation marks, etc.). A “word” is actually a phrase consisting of one word, but you have the option of getting phrases of two or more words by providing a value for the phrase_len parameter.
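To make the definition of a “word” concrete, here is a minimal sketch of the splitting idea in plain Python. It is not the library's implementation: the helper name split_into_phrases, the lowercasing step, and the use of string.punctuation (ASCII only) as the "non-word characters" are all assumptions made for illustration.

```python
import string


def split_into_phrases(text, phrase_len=1):
    """Split text on whitespace, strip surrounding ASCII punctuation,
    and return every run of `phrase_len` consecutive words as a phrase.

    Hypothetical helper for illustration only.
    """
    # Lowercasing is an assumption; it just makes counting case-insensitive.
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return [' '.join(words[i:i + phrase_len])
            for i in range(len(words) - phrase_len + 1)]


print(split_into_phrases('My favorite color is blue.'))
# ['my', 'favorite', 'color', 'is', 'blue']
print(split_into_phrases('My favorite color is blue.', phrase_len=2))
# ['my favorite', 'favorite color', 'color is', 'is blue']
```

With phrase_len=1 every word is its own phrase; with phrase_len=2 each phrase overlaps the next by one word.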

Absolute vs Weighted Frequency

In social media reports, analytics, keyword reports, URL and page reports, we get more information than simply the text. We get numbers describing those posts, page titles, product names, or whatever the text list might contain. Numbers can be pageviews, shares, likes, retweets, sales, bounces, etc. Since we have numbers to quantify those phrases, we can improve our counting by taking into consideration the number list that comes with the text list.

For example, if you have an e-commerce site that has two products, let’s say you have bags and shoes, then your products are split 50:50 between bags and shoes. But what if you learn that shoes generate 80% of your sales? Although shoes form half your products, they generate 80% of your revenue. So the weighted count of your products is 80:20.

Let’s say two people post two different posts on a social media platform. One of them says, “It’s raining”, and the other says, “It’s snowing”. As in the above example, the content is split 50:50 between “raining” and “snowing”, but we get a much more informative picture if we get the number of followers of each of those accounts (or the number of shares, likes, etc.). If one of them has a thousand followers, and the other has a million (which is typical on social media, as well as in pageview reports, e-commerce and most other datasets), then you get a completely different picture of your dataset.
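The difference between the two views can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how word_frequency() is implemented; the follower numbers are the ones from the example above, and keeping only the last word of each post is a shortcut made for brevity.

```python
from collections import Counter

posts = ["It's raining", "It's snowing"]
followers = [1_000, 1_000_000]  # follower counts from the example above

abs_count = Counter()  # how often each word appears
wtd_count = Counter()  # each appearance weighted by the post's followers
for post, n in zip(posts, followers):
    word = post.split()[-1]  # shortcut: keep only the last word of the post
    abs_count[word] += 1
    wtd_count[word] += n

abs_total = sum(abs_count.values())
wtd_total = sum(wtd_count.values())
for word in abs_count:
    print(word,
          abs_count[word] / abs_total,            # absolute share
          round(wtd_count[word] / wtd_total, 3))  # weighted share
# raining 0.5 0.001
# snowing 0.5 0.999
```

By absolute count the two words are tied, but weighted by followers almost all of the audience saw “snowing”.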

These two simple examples contain two posts, and a word each. The word_frequency() function can provide insight into hidden trends, especially in large datasets, and when the sentences or phrases are longer than a word or two each.

Let’s take a look at how to use the word_frequency() function, and what the available parameters and options are.

text_list

The list of phrases or documents that you want to analyze. Here are some possible ideas that you might use this for:

  • keywords, whether in a PPC or SEO report

  • page titles in an analytics report

  • social media posts (tweets, Facebook posts, YouTube video titles or descriptions etc.)

  • e-commerce reports (where the text would be the product names)

num_list

A list of numbers, with the same length as text_list, describing each item in it (pageviews, sales, retweets, etc.). Ideally, if you have more than one column describing text_list, you should experiment with different options. Try weighting the words by pageviews, then try by bounce rate, and see if you get different interesting findings. With e-commerce reports, you can see which word appears the most, and which word is associated with more revenue.

phrase_len

You should also experiment with different lengths of phrases. In many cases, one-word phrases might not be as meaningful as two- or three-word phrases.

regex

The default is to simply split words by whitespace, and provide phrases of length phrase_len. But you may want to count the occurrences of certain patterns of text. Check out the regex module for the available regular expressions that might be interesting. Some of the pre-defined ones are hashtags, mentions, questions, emoji, currencies, and more.
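As a rough illustration of counting a pattern instead of whitespace-separated words, here is a sketch that counts hashtags with Python's built-in re module. The HASHTAG pattern below is a deliberately simplified, hypothetical one and not the pattern that the regex module provides; the posts and share counts are made up as well.

```python
import re
from collections import Counter

# Simplified, hypothetical pattern: '#' followed by word characters.
# The library's own pre-defined hashtag regex may differ.
HASHTAG = re.compile(r'#\w+')

posts = ['Loving the #sunset at the #beach',
         'Another #sunset photo',
         'No tags here']
shares = [10, 200, 5]  # made-up numbers, one per post

abs_freq = Counter()
wtd_freq = Counter()
for post, n in zip(posts, shares):
    for tag in HASHTAG.findall(post.lower()):
        abs_freq[tag] += 1   # one more occurrence of this hashtag
        wtd_freq[tag] += n   # weighted by the shares of the post

print(abs_freq.most_common())  # [('#sunset', 2), ('#beach', 1)]
print(wtd_freq.most_common())  # [('#sunset', 210), ('#beach', 10)]
```

Posts that contain no match simply contribute nothing to either count.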

rm_words

A list of words to remove and ignore from the count. Known as stop-words, these are the most frequently used words in a language, but they don’t add much meaning to the content (a, and, of, the, if, etc.). By default a set of English stopwords is provided (which you can check and may want to modify), or run adv.stopwords.keys() to get a list of the languages for which stopwords are available. In some cases (page titles, for example), you might get “words” that need to be removed as well, like the pipe “|” character.
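The effect of removing stop-words can be sketched without the library. The STOPWORDS set below is a tiny hypothetical one (the default English set is much larger), and the page titles are invented; note how the pipe character is treated as just another “word” to drop.

```python
from collections import Counter

# Tiny hypothetical stop-word set, including the pipe used in page titles.
STOPWORDS = {'a', 'and', 'of', 'the', 'if', 'to', '|'}

titles = ['The History of Tea | Example Shop',
          'A Guide to Green Tea | Example Shop']

counts = Counter(
    word
    for title in titles
    for word in title.lower().split()
    if word not in STOPWORDS  # skip stop-words and the pipe
)

print(counts.most_common(3))
# [('tea', 2), ('example', 2), ('shop', 2)]
```

Words like “example” and “shop” come from a site name repeated in every title, so you might add them to the set as well.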

extra_info

The returned DataFrame contains the default columns [word, abs_freq, wtd_freq, rel_value]. You can get extra columns for percentages and cumulative percentages that add perspective to the other columns. Set this parameter to True if you want that.

Below are all the columns of the returned DataFrame:

word

The words in the document list, each on its own row. The number of words in each entry is determined by phrase_len, so with a phrase_len greater than one these are phrases rather than single words.

abs_freq

The number of occurrences of each word in all the documents.

wtd_freq

Every occurrence of word multiplied by its respective value in num_list.

rel_value

wtd_freq divided by abs_freq, showing the value per occurrence of word.

abs_perc

Absolute frequency percentage.

abs_perc_cum

Cumulative absolute percentage.

wtd_freq_perc

Weighted frequency percentage.

wtd_freq_perc_cum

Cumulative weighted frequency percentage.
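To see how these columns relate to each other, here is a sketch that derives all of them with the standard library only, using the fruit lists from the example further below. It is an illustration of the arithmetic and not the library's code; sorting by weighted frequency is an assumption inferred from the example output.

```python
from collections import Counter
from itertools import accumulate

text_list = ['apple orange', 'apple orange banana',
             'apple kiwi', 'kiwi mango']
num_list = [100, 100, 100, 400]

abs_freq = Counter()
wtd_freq = Counter()
for text, num in zip(text_list, num_list):
    for word in text.split():
        abs_freq[word] += 1    # occurrences of the word
        wtd_freq[word] += num  # occurrences weighted by num_list

# Assumption: rows are ordered by weighted frequency, highest first.
words = sorted(wtd_freq, key=wtd_freq.get, reverse=True)
abs_total = sum(abs_freq.values())
wtd_total = sum(wtd_freq.values())

abs_perc = [abs_freq[w] / abs_total for w in words]
wtd_perc = [wtd_freq[w] / wtd_total for w in words]
abs_perc_cum = list(accumulate(abs_perc))  # running total of abs_perc
wtd_perc_cum = list(accumulate(wtd_perc))  # running total of wtd_perc

rows = [
    {'word': w,
     'abs_freq': abs_freq[w],
     'abs_perc': abs_perc[i],
     'abs_perc_cum': abs_perc_cum[i],
     'wtd_freq': wtd_freq[w],
     'wtd_freq_perc': wtd_perc[i],
     'wtd_freq_perc_cum': wtd_perc_cum[i],
     'rel_value': wtd_freq[w] / abs_freq[w]}  # value per occurrence
    for i, w in enumerate(words)
]

for row in rows:
    print(row['word'], row['abs_freq'], row['wtd_freq'], row['rel_value'])
# kiwi 2 500 250.0
# mango 1 400 400.0
# apple 3 300 100.0
# orange 2 200 100.0
# banana 1 100 100.0
```

The percentage columns are each frequency divided by the column total, and the cumulative columns are running sums of those percentages in the sorted order.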

word_frequency(text_list, num_list=None, phrase_len=1, regex=None, rm_words={'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'both', 'bottom', 'but', 'by', 'ca', 'call', 'can', 'cannot', 'could', 'did', 'do', 'does', 'doing', 'done', 'down', 'due', 'during', 'each', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'if', 'in', 'indeed', 'into', 'is', 'it', 'its', 'itself', 'just', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'made', 'make', 'many', 'may', 'me', 'meanwhile', 'might', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'quite', 'rather', 're', 'really', 'regarding', 'same', 'say', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 
'sometime', 'sometimes', 'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'under', 'unless', 'until', 'up', 'upon', 'us', 'used', 'using', 'various', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves'}, extra_info=False)

Count the absolute as well as the weighted frequency of words in text_list (based on num_list).

Parameters
  • text_list (list) – Typically short phrases, but could be any list of full blown documents. Usually, you would use this to analyze tweets, book titles, URLs, etc.

  • num_list (list) – A list of numbers with the same length as text_list, describing a certain attribute of these ‘documents’; views, retweets, sales, etc.

  • regex (str) – The regex used to split words. Doesn’t need changing in most cases.

  • phrase_len (int) – The length, in words, of each token the text is split into. Defaults to 1.

  • rm_words (set) – Words to remove from the list, a.k.a. ‘stop-words’. The default uses English stopwords. To get all available languages run adv.stopwords.keys()

  • extra_info (bool) – Whether or not to give additional metrics about the frequencies

Returns abs_wtd_df

absolute and weighted DataFrame.

>>> text_list = ['apple orange', 'apple orange banana',
...              'apple kiwi', 'kiwi mango']
>>> num_list = [100, 100, 100, 400]
>>> adv.word_frequency(text_list, num_list)
     word  abs_freq  wtd_freq  rel_value
0    kiwi         2       500      250.0
1   mango         1       400      400.0
2   apple         3       300      100.0
3  orange         2       200      100.0
4  banana         1       100      100.0

Although “kiwi” occurred twice (abs_freq) and “apple” occurred three times, the phrases in which “kiwi” appears have a total score of 500, so it beats “apple” on wtd_freq even though “apple” wins on abs_freq. You can sort by any of the columns, of course. rel_value shows the value per occurrence of each word; as you can see, it is simply obtained by dividing wtd_freq by abs_freq.

>>> adv.word_frequency(text_list)  # num_list values default to 1 each
     word  abs_freq  wtd_freq  rel_value
0   apple         3         3        1.0
1  orange         2         2        1.0
2    kiwi         2         2        1.0
3  banana         1         1        1.0
4   mango         1         1        1.0
>>> text_list2 = ['my favorite color is blue',
... 'my favorite color is green', 'the best color is green',
... 'i love the color black']

Setting phrase_len to 2, “words” become two-word phrases instead. Note that we are setting rm_words to the empty list so that the stopwords are kept, and we can see whether the resulting phrases make sense:

>>> adv.word_frequency(text_list2, phrase_len=2, rm_words=[])
              word  abs_freq  wtd_freq  rel_value
0         color is         3         3        1.0
1      my favorite         2         2        1.0
2   favorite color         2         2        1.0
3         is green         2         2        1.0
4          is blue         1         1        1.0
5         the best         1         1        1.0
6       best color         1         1        1.0
7           i love         1         1        1.0
8         love the         1         1        1.0
9        the color         1         1        1.0
10     color black         1         1        1.0

The first example again (text_list with num_list), this time showing all possible columns by setting extra_info to True:

>>> adv.word_frequency(text_list, num_list, extra_info=True)
     word  abs_freq  abs_perc  abs_perc_cum  wtd_freq  wtd_freq_perc  wtd_freq_perc_cum  rel_value
0    kiwi         2  0.222222      0.222222       500       0.333333           0.333333      250.0
1   mango         1  0.111111      0.333333       400       0.266667           0.600000      400.0
2   apple         3  0.333333      0.666667       300       0.200000           0.800000      100.0
3  orange         2  0.222222      0.888889       200       0.133333           0.933333      100.0
4  banana         1  0.111111      1.000000       100       0.066667           1.000000      100.0