Stopwords in Several Languages
Lists of stopwords taken from the spaCy [1] package, useful in text mining and in analyzing the content of social media posts, tweets, web pages, keywords, etc.
Each list is accessible as part of a dictionary, stopwords, which is a normal Python dictionary.
Stopword Languages
Arabic
Azerbaijani
Bengali
Catalan
Chinese
Croatian
Danish
Dutch
English
Finnish
French
German
Greek
Hebrew
Hindi
Hungarian
Indonesian
Irish
Italian
Japanese
Kazakh
Nepali
Norwegian
Persian
Polish
Portuguese
Romanian
Russian
Sinhala
Spanish
Swedish
Tagalog
Tamil
Tatar
Telugu
Thai
Turkish
Ukrainian
Urdu
Vietnamese
You can easily explore the available languages and get (and optionally modify) the stopwords by accessing the dictionary as follows:
import advertools as adv
adv.stopwords.keys()
dict_keys(['arabic', 'azerbaijani', 'bengali', 'catalan', 'chinese',
'croatian', 'danish', 'dutch', 'english', 'finnish', 'french',
'german', 'greek', 'hebrew', 'hindi', 'hungarian', 'indonesian',
'irish', 'italian', 'japanese', 'kazakh', 'nepali', 'norwegian',
'persian', 'polish', 'portuguese', 'romanian', 'russian', 'sinhala',
'spanish', 'swedish', 'tagalog', 'tamil', 'tatar', 'telugu', 'thai',
'turkish', 'ukrainian', 'urdu', 'vietnamese'])
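Because stopwords is a normal dictionary, one way to customize a language's list is to build a copy with extra words and leave the shared dictionary untouched. The following is a minimal sketch, not part of the package: the helper name extended_stopwords is hypothetical, and a small stand-in dictionary replaces adv.stopwords so the example runs without advertools installed.

```python
# Stand-in for adv.stopwords so the sketch runs without advertools installed.
# With the package available: import advertools as adv, then pass adv.stopwords.
stopwords = {"english": {"a", "an", "and", "the", "of"}}


def extended_stopwords(language, extra_words, source=stopwords):
    """Return a new set: the language's stopwords plus extra_words (lowercased).

    Hypothetical helper for illustration only.
    """
    # set() copies the original collection, so the shared dictionary
    # is not modified by this call.
    return set(source[language]) | {word.lower() for word in extra_words}


# Add two Twitter-specific terms to the English list.
custom_english = extended_stopwords("english", ["RT", "via"])
print(sorted(custom_english))
```

Working on a copy keeps the original lists intact for other code that relies on them; modifying the dictionary in place is also possible if a global change is what you want.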
You can also access the stopwords of a certain language:
print(sorted(adv.stopwords['english'])[:5])
print(sorted(adv.stopwords['german'])[:5])
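A typical use of these lists is filtering stopwords out of text before counting or analyzing words. The sketch below shows one simple approach under stated assumptions: remove_stopwords is a hypothetical helper, tokenization is a plain regular-expression split on word characters, and english_sample is a tiny hand-written stand-in for adv.stopwords['english'].

```python
import re


def remove_stopwords(text, stopword_collection):
    """Return the lowercase word tokens of text that are not stopwords.

    Hypothetical helper for illustration only.
    """
    # Convert to a set for fast membership tests, whatever the input type.
    stop = set(stopword_collection)
    # Naive tokenization: lowercase, then take runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop]


# Tiny stand-in list; with advertools installed you could pass
# adv.stopwords['english'] instead.
english_sample = {"the", "is", "on", "a", "of"}
print(remove_stopwords("The cat is on the roof of a house", english_sample))
# prints ['cat', 'roof', 'house']
```

The regular-expression split only suits languages that separate words with spaces or punctuation. Languages such as Chinese, Japanese, and Thai would need a dedicated tokenizer before the stopword lists can be applied.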
Footnotes