Extract structured entities from text lists

Structured entities are pattern matches, not inferred entities. Some examples are hashtags, emoji, mentions, questions, and so on. This is in contrast to named entity extraction, where entities (people, companies, brands, and so on) are inferred from the context of the sentence.

All functions start with extract_ and have a descriptive name for the type of entity that they extract.

There is also a generic extract function, which powers all the others and can be used for any pattern not included. It takes a regular expression and returns a dictionary similar to those of the other functions.

Extract Functions

extract()

A generic function that takes a regex to extract any pattern you want.

extract_currency()

Currency symbols together with surrounding text for context. This does not include currency abbreviations (USD, EUR, JPY, etc.), only symbols ($, £, €, etc.).

advertools.emoji.extract_emoji()

The full emoji database: each emoji together with its textual name, group, and sub-group.
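
For a quick feel of what those textual names look like, here is a standard-library sketch (not the advertools implementation) using the unicodedata module to name individual emoji:

```python
import unicodedata

# Standard-library sketch: look up the official Unicode name of each
# emoji, similar in spirit to the textual names extract_emoji returns.
emoji = ['😀', '🚀', '🐍']
names = [unicodedata.name(char).lower() for char in emoji]
print(names)  # ['grinning face', 'rocket', 'snake']
```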

extract_exclamations()

Sentences that end with an exclamation mark!

extract_hashtags()

Extract hashtags with descriptive statistics.

extract_intense_words()

Words that contain a character repeated three or more times, expressing an intense feeling (positive or negative), e.g. “I looooooovvvvee this thing”.

extract_mentions()

User mentions in social media posts. Also useful for network analysis.

extract_numbers()

Any numbers included in the text list. Takes a modifiable list of separators to use (“,”, “.”, “-”, etc.).

extract_questions()

Questions included in the text list.

extract_urls()

URLs in the text list.

extract_words()

Any arbitrary words that you want extracted. Works in two modes: either the word must fully match the pattern, or it can appear as part of a longer word (“rest” can be matched within “restaurant”, or not).

All functions return a dictionary with the entities extracted, along with helpful statistics. Since the entities have different meanings, most of them return additional keys depending on the context.

The recommended way of using these functions:

>>> import advertools as adv
>>> text_list = ['This is the first #text.', 'Second #sentence is here.',
... 'Hello, how are you?', 'This #sentence is the last #sentence']
>>> hashtag_summary = adv.extract_hashtags(text_list)
>>> hashtag_summary.keys()
dict_keys(['hashtags', 'hashtags_flat', 'hashtag_counts', 'hashtag_freq',
           'top_hashtags', 'overview'])

Now you can start exploring:

>>> hashtag_summary['overview']
{'num_posts': 4,
 'num_hashtags': 4,
 'hashtags_per_post': 1.0,
 'unique_hashtags': 2}
>>> hashtag_summary['hashtags']
[['#text'], ['#sentence'], [], ['#sentence', '#sentence']]
>>> hashtag_summary['hashtags_flat']
['#text', '#sentence', '#sentence', '#sentence']
>>> hashtag_summary['hashtag_counts']
[1, 1, 0, 2]
>>> hashtag_summary['hashtag_freq']
[(0, 1), (1, 2), (2, 1)]
>>> hashtag_summary['top_hashtags']
[('#sentence', 3), ('#text', 1)]
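
The *_freq keys pair each per-post count with the number of posts having that count. A minimal sketch of that relationship (not necessarily how advertools computes it internally):

```python
from collections import Counter

# hashtag_counts from the example above: number of hashtags per post
hashtag_counts = [1, 1, 0, 2]

# Count how many posts had 0, 1, 2, ... hashtags
hashtag_freq = sorted(Counter(hashtag_counts).items())
print(hashtag_freq)  # [(0, 1), (1, 2), (2, 1)]
```
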
extract(text_list, regex, key_name, extracted=None, **kwargs)[source]

Return a summary dictionary about arbitrary matches in text_list.

This function is used by other specialized functions to extract certain elements (hashtags, mentions, emojis, etc.). It can be used for other arbitrary elements/matches. You only need to provide your own regex.

Parameters
  • text_list (list) – Any list of strings (social posts, page titles, etc.)

  • regex (str) – The regex pattern to use for extraction.

  • key_name (str) – The name of the object extracted in singular form (hashtag, mention, etc.)

  • extracted (list(list), optional) – If the regex is not straightforward and matches need to be made with special code, provide the extracted words/matches as a list for each element of text_list.

  • kwargs (mapping) – Other kwargs that might be needed.

Returns summary

A dictionary summarizing the extracted data.
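
As a sketch of how a custom pattern might be used (the ticket-ID regex and posts below are made up for illustration, not an advertools built-in), you can preview the per-text matches that extract would summarize:

```python
import re

# Hypothetical pattern: ticket IDs such as "JIRA-101"
ticket_regex = r'\b[A-Z]{2,10}-\d+\b'
posts = ['fixed JIRA-101 and JIRA-205', 'working on PROJ-7', 'no tickets today']

# One list of matches per element of the text list; an empty list if none
matches_per_post = [re.findall(ticket_regex, post) for post in posts]
print(matches_per_post)  # [['JIRA-101', 'JIRA-205'], ['PROJ-7'], []]
```

Passing the same regex to extract(posts, ticket_regex, 'ticket') should produce these per-post lists under the corresponding key, along with the usual counts, frequencies, and overview.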

extract_currency(text_list, left_chars=20, right_chars=20)[source]

Return a summary dictionary about currency symbols in text_list

Get a summary of the number of currency symbols, their frequency, the top ones, and more.

Parameters
  • text_list (list) – A list of text strings.

  • left_chars (int) – The number of characters to extract, to the left of the symbol when getting surrounding_text

  • right_chars (int) – The number of characters to extract, to the right of the symbol when getting surrounding_text

Returns summary

A dictionary with various stats about currencies

>>> posts = ['today ₿1 is around $4k', 'and ₿ in £ & €?', 'no idea']
>>> currency_summary = extract_currency(posts)
>>> currency_summary.keys()
dict_keys(['currency_symbols', 'currency_symbols_flat',
'currency_symbol_counts', 'currency_symbol_freq',
'top_currency_symbols', 'overview', 'currency_symbol_names',
'surrounding_text'])
>>> currency_summary['currency_symbols']
[['₿', '$'], ['₿', '£', '€'], []]

A simple extract of currencies from each of the posts. An empty list if none exist

>>> currency_summary['currency_symbols_flat']
['₿', '$', '₿', '£', '€']

All currency symbols in one flat list.

>>> currency_summary['currency_symbol_counts']
[2, 3, 0]

The count of currency symbols per post.

>>> currency_summary['currency_symbol_freq']
[(0, 1), (2, 1), (3, 1)]

Shows how many posts had 0, 1, 2, 3, etc. currency symbols (number_of_symbols, count)

>>> currency_summary['top_currency_symbols']
[('₿', 2), ('$', 1), ('£', 1), ('€', 1)]
>>> currency_summary['currency_symbol_names']
[['bitcoin sign', 'dollar sign'], ['bitcoin sign', 'pound sign',
'euro sign'], []]
>>> currency_summary['surrounding_text']
[['today ₿1 is around $4k'], ['and ₿ in £ & €?'], []]
>>> extract_currency(posts, 5, 5)['surrounding_text']
[['oday ₿1 is ', 'ound $4k'], ['and ₿ in £', ' & €?'], []]
>>> extract_currency(posts, 0, 3)['surrounding_text']
[['₿1 i', '$4k'], ['₿ in', '£ & ', '€?'], []]
>>> currency_summary['overview']
{'num_posts': 3,
'num_currency_symbols': 5,
'currency_symbols_per_post': 1.6666666666666667,
'unique_currency_symbols': 4}
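
The left_chars/right_chars windows can be understood as simple string slices around each match. A hypothetical helper (not an advertools function) that mimics the surrounding_text behavior:

```python
import re

def surrounding(text, pattern, left_chars=5, right_chars=5):
    """Hypothetical helper mimicking surrounding_text: slice a window
    of characters around each match of pattern in text."""
    return [text[max(0, m.start() - left_chars):m.end() + right_chars]
            for m in re.finditer(pattern, text)]

print(surrounding('today ₿1 is around $4k', r'[₿$]'))
# ['oday ₿1 is ', 'ound $4k']
```
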
extract_exclamations(text_list)[source]

Return a summary dictionary about exclamation marks in text_list

Get a summary of the number of exclamation marks, their frequency, the top ones, as well as the exclamations themselves.

Parameters

text_list (list) – A list of text strings.

Returns summary

A dictionary with various stats about exclamations

>>> posts = ['Who are you!', 'What is this!', 'No exclamation here?']
>>> exclamation_summary = extract_exclamations(posts)
>>> exclamation_summary.keys()
dict_keys(['exclamation_marks', 'exclamation_marks_flat',
'exclamation_mark_counts', 'exclamation_mark_freq',
'top_exclamation_marks', 'overview', 'exclamation_mark_names',
'exclamation_text'])
>>> exclamation_summary['exclamation_marks']
[['!'], ['!'], []]

A simple extract of exclamation marks from each of the posts. An empty list if none exist

>>> exclamation_summary['exclamation_marks_flat']
['!', '!']

All exclamation marks in one flat list.

>>> exclamation_summary['exclamation_mark_counts']
[1, 1, 0]

The count of exclamation marks per post.

>>> exclamation_summary['exclamation_mark_freq']
[(0, 1), (1, 2)]

Shows how many posts had 0, 1, 2, 3, etc. exclamation marks (number_of_symbols, count)

>>> exclamation_summary['top_exclamation_marks']
[('!', 2)]

Might be interesting if you have different types of exclamation marks

>>> exclamation_summary['exclamation_mark_names']
[['exclamation mark'], ['exclamation mark'], []]
>>> exclamation_summary['overview']
{'num_posts': 3,
'num_exclamation_marks': 2,
'exclamation_marks_per_post': 0.6666666666666666,
'unique_exclamation_marks': 1}
>>> posts2 = ["don't go there!", 'مرحبا. لا تذهب!', '¡Hola! ¿cómo estás?',
... 'a few different exclamation marks! make sure you see them!']
>>> exclamation_summary = extract_exclamations(posts2)
>>> exclamation_summary['exclamation_marks']
[['!'], ['!'], ['¡', '!'], ['!', '!']]

A simple extract of exclamation marks from each of the posts. An empty list if none exist. The marks may be displayed in the opposite order on screen because of the right-to-left text.

>>> exclamation_summary['exclamation_marks_flat']
['!', '!', '¡', '!', '!', '!']

All exclamation marks in one flat list.

>>> exclamation_summary['exclamation_mark_counts']
[1, 1, 2, 2]

The count of exclamation marks per post.

>>> exclamation_summary['exclamation_mark_freq']
[(1, 2), (2, 2)]

Shows how many posts had 0, 1, 2, 3, etc. exclamation marks (number_of_symbols, count)

>>> exclamation_summary['top_exclamation_marks']
[('!', 5), ('¡', 1)]

Might be interesting if you have different types of exclamation marks

>>> exclamation_summary['exclamation_mark_names']
[['exclamation mark'], ['exclamation mark'],
['inverted exclamation mark', 'exclamation mark'],
['exclamation mark', 'exclamation mark']]
>>> exclamation_summary['overview']
{'num_posts': 4,
'num_exclamation_marks': 6,
'exclamation_marks_per_post': 1.5,
'unique_exclamation_marks': 2}
extract_hashtags(text_list)[source]

Return a summary dictionary about hashtags in text_list

Get a summary of the number of hashtags, their frequency, the top ones, and more.

Parameters

text_list (list) – A list of text strings.

Returns summary

A dictionary with various stats about hashtags

>>> posts = ['i like #blue', 'i like #green and #blue', 'i like all']
>>> hashtag_summary = extract_hashtags(posts)
>>> hashtag_summary.keys()
dict_keys(['hashtags', 'hashtags_flat', 'hashtag_counts', 'hashtag_freq',
'top_hashtags', 'overview'])
>>> hashtag_summary['hashtags']
[['#blue'], ['#green', '#blue'], []]

A simple extract of hashtags from each of the posts. An empty list if none exist

>>> hashtag_summary['hashtags_flat']
['#blue', '#green', '#blue']

All hashtags in one flat list.

>>> hashtag_summary['hashtag_counts']
[1, 2, 0]

The count of hashtags per post.

>>> hashtag_summary['hashtag_freq']
[(0, 1), (1, 1), (2, 1)]

Shows how many posts had 0, 1, 2, 3, etc. hashtags (number_of_hashtags, count)

>>> hashtag_summary['top_hashtags']
[('#blue', 2), ('#green', 1)]
>>> hashtag_summary['overview']
{'num_posts': 3,
 'num_hashtags': 3,
 'hashtags_per_post': 1.0,
 'unique_hashtags': 2}
extract_intense_words(text_list, min_reps=3)[source]

Return a summary dictionary about intense words in text_list

Get all instances of intense words (positive or negative): words that have a character repeated min_reps or more times. “I looooooveeee youuuuuuu” and “I haaatttteeee youuuuuu” are both intense.

Parameters
  • text_list (list) – A text list from which to extract intense words

  • min_reps (int) – The number of times a character has to be repeated for the word to be considered intense.

Returns summary

A dictionary with various stats about intense words
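
A standard-library sketch of the underlying idea (a simplified regex, not necessarily the actual advertools pattern): match any word in which some character repeats min_reps or more times in a row.

```python
import re

min_reps = 3  # a character followed by at least two more copies of itself
# Simplified sketch: \w*(\w)\1{min_reps-1,}\w* matches a word containing
# any character repeated min_reps or more times in a row.
intense_regex = re.compile(r'\b\w*(\w)\1{%d,}\w*\b' % (min_reps - 1))

posts = ['I looooooovvvvee this thing', 'I haaatttteeee waiting', 'all good']
intense_words = [[match.group(0) for match in intense_regex.finditer(post)]
                 for post in posts]
print(intense_words)
```
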

extract_mentions(text_list)[source]

Return a summary dictionary about mentions in text_list

Get a summary of the number of mentions, their frequency, the top ones, and more.

Parameters

text_list (list) – A list of text strings.

Returns summary

A dictionary with various stats about mentions

>>> posts = ['hello @john and @jenny', 'hi there @john', 'good morning']
>>> mention_summary = extract_mentions(posts)
>>> mention_summary.keys()
dict_keys(['mentions', 'mentions_flat', 'mention_counts', 'mention_freq',
'top_mentions', 'overview'])
>>> mention_summary['mentions']
[['@john', '@jenny'], ['@john'], []]

A simple extract of mentions from each of the posts. An empty list if none exist

>>> mention_summary['mentions_flat']
['@john', '@jenny', '@john']

All mentions in one flat list.

>>> mention_summary['mention_counts']
[2, 1, 0]

The count of mentions per post.

>>> mention_summary['mention_freq']
[(0, 1), (1, 1), (2, 1)]

Shows how many posts had 0, 1, 2, 3, etc. mentions (number_of_mentions, count)

>>> mention_summary['top_mentions']
[('@john', 2), ('@jenny', 1)]
>>> mention_summary['overview']
{'num_posts': 3,
 'num_mentions': 3,
 'mentions_per_post': 1.0,
 'unique_mentions': 2}
extract_numbers(text_list, number_separators=('.', ',', '-'))[source]

Return a summary dictionary about numbers in text_list, separated by any of number_separators

Get a summary of the number of numbers, their frequency, the top ones, and more. Numbers typically contain separators to make them easier to read; common separators are included by default, and you can modify the list.

Parameters
  • text_list (list) – A list of text strings.

  • number_separators (list(str)) – A list of separators that you want to be included as part of the extracted numbers.

Returns summary

A dictionary with various stats about the numbers

>>> posts = ['text before 123', '123,456 text after', 'phone 333-444-555',
...          'multiple 123,456 and 123.456.789']
>>> number_summary = extract_numbers(posts)
>>> number_summary.keys()
dict_keys(['numbers', 'numbers_flat', 'number_counts', 'number_freq',
'top_numbers', 'overview'])
>>> number_summary['numbers']
[['123'], ['123,456'], ['333-444-555'], ['123,456', '123.456.789']]

A simple extract of numbers from each of the posts. An empty list if none exist

>>> number_summary['numbers_flat']
['123', '123,456', '333-444-555', '123,456', '123.456.789']

All numbers in one flat list.

>>> number_summary['number_counts']
[1, 1, 1, 2]

The count of numbers per post.

>>> number_summary['number_freq']
[(1, 3), (2, 1)]

Shows how many posts had 0, 1, 2, 3, etc. numbers (number_of_numbers, count)

>>> number_summary['top_numbers']
[('123,456', 2), ('123', 1), ('333-444-555', 1), ('123.456.789', 1)]
>>> number_summary['overview']
{'num_posts': 4,
 'num_numbers': 5,
 'numbers_per_post': 1.25,
 'unique_numbers': 4}
extract_questions(text_list)[source]

Return a summary dictionary about question marks in text_list

Get a summary of the number of question marks, their frequency, the top ones, as well as the questions asked.

Parameters

text_list (list) – A list of text strings.

Returns summary

A dictionary with various stats about questions

>>> posts = ['How are you?', 'What is this?', 'No question Here!']
>>> question_summary = extract_questions(posts)
>>> question_summary.keys()
dict_keys(['question_marks', 'question_marks_flat',
'question_mark_counts', 'question_mark_freq', 'top_question_marks',
'overview', 'question_mark_names', 'question_text'])
>>> question_summary['question_marks']
[['?'], ['?'], []]

A simple extract of question marks from each of the posts. An empty list if none exist

>>> question_summary['question_marks_flat']
['?', '?']

All question marks in one flat list.

>>> question_summary['question_mark_counts']
[1, 1, 0]

The count of question marks per post.

>>> question_summary['question_mark_freq']
[(0, 1), (1, 2)]

Shows how many posts had 0, 1, 2, 3, etc. question marks (number_of_symbols, count)

>>> question_summary['top_question_marks']
[('?', 2)]

Might be interesting if you have different types of question marks (Arabic, Spanish, Greek, Armenian, or other)

>>> question_summary['question_mark_names']
[['question mark'], ['question mark'], []]
>>> question_summary['overview']
{'num_posts': 3,
'num_question_marks': 2,
'question_marks_per_post': 0.6666666666666666,
'unique_question_marks': 1}
>>> posts2 = ['Πώς είσαι;', 'مرحباً. كيف حالك؟', 'Hola, ¿cómo estás?',
... 'Can you see the new questions? Did you notice the different marks?']
>>> question_summary = extract_questions(posts2)
>>> question_summary['question_marks']
[[';'], ['؟'], ['¿', '?'], ['?', '?']]

A simple extract of question marks from each of the posts. An empty list if none exist. The marks may be displayed in the opposite order on screen because of the right-to-left text.

>>> question_summary['question_marks_flat']
[';', '؟', '¿', '?', '?', '?']

All question marks in one flat list.

>>> question_summary['question_mark_counts']
[1, 1, 2, 2]

The count of question marks per post.

>>> question_summary['question_mark_freq']
[(1, 2), (2, 2)]

Shows how many posts had 0, 1, 2, 3, etc. question marks (number_of_symbols, count)

>>> question_summary['top_question_marks']
[('?', 3), (';', 1), ('؟', 1), ('¿', 1)]

Might be interesting if you have different types of question marks (Arabic, Spanish, Greek, Armenian, or other)

>>> question_summary['question_mark_names']
[['greek question mark'], ['arabic question mark'],
['inverted question mark', 'question mark'],
['question mark', 'question mark']]

The names appear in the correct order, unaffected by right-to-left display.

>>> question_summary['overview']
{'num_posts': 4,
'num_question_marks': 6,
'question_marks_per_post': 1.5,
'unique_question_marks': 4}
extract_urls(text_list)[source]

Return a summary dictionary about URLs in text_list

Get a summary of the number of URLs, their frequency, the top ones, and more. This does NOT validate URLs; www.a.b would count as a URL.

Parameters

text_list (list) – A list of text strings.

Returns summary

A dictionary with various stats about URLs

>>> posts = ['one link http://example.com', 'two: http://a.com www.b.com',
...          'no links here',
...          'long url http://example.com/one/two/?1=one&2=two']
>>> url_summary = extract_urls(posts)
>>> url_summary.keys()
dict_keys(['urls', 'urls_flat', 'url_counts', 'url_freq',
'top_urls', 'overview', 'top_domains', 'top_tlds'])
>>> url_summary['urls']
[['http://example.com'],
 ['http://a.com', 'http://www.b.com'],
 [],
 ['http://example.com/one/two/?1=one&2=two']]

A simple extract of urls from each of the posts. An empty list if none exist

>>> url_summary['urls_flat']
['http://example.com', 'http://a.com', 'http://www.b.com',
 'http://example.com/one/two/?1=one&2=two']

All urls in one flat list.

>>> url_summary['url_counts']
[1, 2, 0, 1]

The count of urls per post.

>>> url_summary['url_freq']
[(0, 1), (1, 2), (2, 1)]

Shows how many posts had 0, 1, 2, 3, etc. urls (number_of_urls, count)

>>> url_summary['top_urls']
[('http://example.com', 1), ('http://a.com', 1), ('http://www.b.com', 1),
 ('http://example.com/one/two/?1=one&2=two', 1)]
>>> url_summary['top_domains']
[('example.com', 2), ('a.com', 1), ('www.b.com', 1)]
>>> url_summary['top_tlds']
[('com', 4)]
>>> url_summary['overview']
{'num_posts': 4,
 'num_urls': 4,
 'urls_per_post': 1.0,
 'unique_urls': 4}
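
The top_domains key can be understood as counting each URL's network location. A sketch using the standard library (not necessarily advertools' internal code):

```python
from collections import Counter
from urllib.parse import urlparse

urls = ['http://example.com', 'http://a.com', 'http://www.b.com',
        'http://example.com/one/two/?1=one&2=two']

# Use each URL's network location as its domain, then count occurrences.
domains = [urlparse(url).netloc for url in urls]
top_domains = Counter(domains).most_common()
print(top_domains)  # [('example.com', 2), ('a.com', 1), ('www.b.com', 1)]
```
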
extract_words(text_list, words_to_extract, entire_words_only=False)[source]

Return a summary dictionary about words_to_extract in text_list.

Get a summary of the number of words, their frequency, the top ones, and more.

Parameters
  • text_list (list) – A list of text strings.

  • words_to_extract (list) – A list of words to extract from text_list.

  • entire_words_only (bool) – Whether to match only complete words (as specified in words_to_extract), or to also match them as parts of longer words.

Returns summary

A dictionary with various stats about the words

>>> posts = ['there is rain, it is raining', 'there is snow and rain',
...          'there is no rain, it is snowing', 'there is nothing']
>>> word_summary = extract_words(posts, ['rain', 'snow'], True)
>>> word_summary.keys()
dict_keys(['words', 'words_flat', 'word_counts', 'word_freq',
'top_words', 'overview'])
>>> word_summary['overview']
{'num_posts': 4,
 'num_words': 4,
 'words_per_post': 1.0,
 'unique_words': 2}
>>> word_summary['words']
[['rain'], ['snow', 'rain'], ['rain'], []]

A simple extract of words from each of the posts. An empty list if none exist

>>> word_summary['words_flat']
['rain', 'snow', 'rain', 'rain']

All words in one flat list.

>>> word_summary['word_counts']
[1, 2, 1, 0]

The count of words per post.

>>> word_summary['word_freq']
[(0, 1), (1, 2), (2, 1)]

Shows how many posts had 0, 1, 2, 3, etc. words (number_of_words, count)

>>> word_summary['top_words']
[('rain', 3), ('snow', 1)]

Check the same posts extracting any occurrence of the specified words with entire_words_only=False:

>>> word_summary = extract_words(posts, ['rain', 'snow'], False)
>>> word_summary['overview']
{'num_posts': 4,
 'num_words': 6,
 'words_per_post': 1.5,
 'unique_words': 4}
>>> word_summary['words']
[['rain', 'raining'], ['snow', 'rain'], ['rain', 'snowing'], []]

Note that the extracted words are the complete words, so you can see where they occurred. If “training” had been mentioned, for example, you would see that it is not related to rain.

>>> word_summary['words_flat']
['rain', 'raining', 'snow', 'rain', 'rain', 'snowing']

All matched words in one flat list.

>>> word_summary['word_counts']
[2, 2, 2, 0]
>>> word_summary['word_freq']
[(0, 1), (2, 3)]

Shows how many posts had 0, 1, 2, 3, etc. words (number_of_words, count)

>>> word_summary['top_words']
[('rain', 3), ('raining', 1), ('snow', 1), ('snowing', 1)]