Tokenize Words (N-grams)
As word counting is an essential step in any text mining task, you first have to split the text into words.
word_tokenize() function achieves that by splitting the text by
whitespace. Another important thing it does after splitting is to trim the
words of any non-word characters (commas, dots, exclamation marks, etc.).
You also have the option of specifying the length of the words that you want.
The default is 2, which can be set through the
This function is mainly a helper function for word_frequency to help with counting words and/or phrases.
- word_tokenize(text_list, phrase_len=2)[source]
text_listinto phrases of length
A "word" is any string between white spaces (or beginning or end of string) with delimiters stripped from both sides. Delimiters include quotes, question marks, parentheses, etc. Any delimiter contained within the string remains. See examples below.
text_list -- List of strings.
phrase_len -- Length of word tokens, defaults to 2.
- Return tokenized
List of lists, split according to
>>> t = ['split me into length-n-words', ... 'commas, (parentheses) get removed!', ... 'commas within text remain $1,000, but not the trailing commas.']
>>> word_tokenize(t, 1) [['split', 'me', 'into', 'length-n-words'], ['commas', 'parentheses', 'get', 'removed'], ['commas', 'within', 'text', 'remain', '$1,000', 'but', 'not', 'the', 'trailing', 'commas']]
The comma inside "$1,000" as well as the dollar sign remain, as they are part of the "word", but the trailing comma is stripped.
>>> word_tokenize(t, 2) [['split me', 'me into', 'into length-n-words'], ['commas parentheses', 'parentheses get', 'get removed'], ['commas within', 'within text', 'text remain', 'remain $1,000', '$1,000 but', 'but not', 'not the', 'the trailing', 'trailing commas']]
>>> word_tokenize(t, 3) [['split me into', 'me into length-n-words'], ['commas parentheses get', 'parentheses get removed'], ['commas within text', 'within text remain', 'text remain $1,000', 'remain $1,000 but', '$1,000 but not', 'but not the', 'not the trailing', 'the trailing commas']]