Stopwords in Several Languages

Lists of stopwords compiled by the spaCy [1] package, useful in text mining and in analyzing the content of social media posts, tweets, web pages, keywords, etc.

Each list is accessible through stopwords, a regular Python dictionary keyed by language name.

Stopword Languages

  • Arabic

  • Azerbaijani

  • Bengali

  • Catalan

  • Chinese

  • Croatian

  • Danish

  • Dutch

  • English

  • Finnish

  • French

  • German

  • Greek

  • Hebrew

  • Hindi

  • Hungarian

  • Indonesian

  • Irish

  • Italian

  • Japanese

  • Kazakh

  • Nepali

  • Norwegian

  • Persian

  • Polish

  • Portuguese

  • Romanian

  • Russian

  • Sinhala

  • Spanish

  • Swedish

  • Tagalog

  • Tamil

  • Tatar

  • Telugu

  • Thai

  • Turkish

  • Ukrainian

  • Urdu

  • Vietnamese

To explore the available languages, and to get (and optionally modify) the stopwords, access the dictionary as follows:

import advertools as adv
adv.stopwords.keys()
# dict_keys(['arabic', 'azerbaijani', 'bengali', 'catalan', 'chinese',
# 'croatian', 'danish', 'dutch', 'english', 'finnish', 'french',
# 'german', 'greek', 'hebrew', 'hindi', 'hungarian', 'indonesian',
# 'irish', 'italian', 'japanese', 'kazakh', 'nepali', 'norwegian',
# 'persian', 'polish', 'portuguese', 'romanian', 'russian', 'sinhala',
# 'spanish', 'swedish', 'tagalog', 'tamil', 'tatar', 'telugu', 'thai',
# 'turkish', 'ukrainian', 'urdu', 'vietnamese'])
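
The "optionally modify" step mentioned above could look something like the sketch below. It uses a small stand-in dictionary so it runs without advertools installed; the words in it and the helper name extended_stopwords are illustrative assumptions, not part of the library. With advertools available, adv.stopwords should be usable in place of the stand-in, assuming each value is an iterable of words.

```python
# Stand-in for adv.stopwords so the sketch runs on its own.
# These words are illustrative only, not the library's actual lists.
stopwords = {
    "english": {"a", "an", "the", "and", "of"},
    "german": {"der", "die", "das", "und"},
}


def extended_stopwords(language, extra_words):
    """Return a new set: the language's stopwords plus extra_words, lowercased.

    Hypothetical helper for illustration; it is not an advertools function.
    """
    # Copy first, so the shared dictionary is left untouched.
    base = set(stopwords[language])
    return base | {word.lower() for word in extra_words}


custom_english = extended_stopwords("english", ["Inc", "Ltd"])
print(sorted(custom_english))
```

Working on a copy is a design choice: it keeps domain-specific additions from leaking into other code that reads the same dictionary.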

You can also access the stopwords of a certain language:

print(sorted(adv.stopwords['english'])[:5])

print(sorted(adv.stopwords['german'])[:5])
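
A typical use of one language's list is to filter tokens before counting or analyzing them. The following is a minimal sketch under stated assumptions: english_stopwords is a small stand-in set (with advertools installed, adv.stopwords['english'] could presumably be passed instead), remove_stopwords is a hypothetical helper, and the regex tokenizer is a deliberately simple choice that will not suit every language.

```python
import re

# Stand-in for adv.stopwords['english']; illustrative words only.
english_stopwords = {"a", "an", "the", "and", "of", "is", "in"}


def remove_stopwords(text, stopword_set):
    """Lowercase the text, split it into word tokens, and drop stopwords.

    Hypothetical helper for illustration; it is not an advertools function.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopword_set]


filtered = remove_stopwords(
    "The rise of the machines is in the news", english_stopwords
)
print(filtered)
```

Membership tests against a set are fast, so converting a list of stopwords to a set before filtering large collections of text is usually worthwhile.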

Footnotes