Alex Gaynor’s Django-Taggit is a great piece of software, but it was missing two features essential for Bucketlist:
1) Users could create both “Music” and “music” as tags, which produced two logically separate taxonomies on the site. That was confusing and messy. I’ve added a settings option to force lowercase. When it is enabled, all tags are lowercased as they’re submitted, which prevents this kind of duplication.
2) Many users don’t read instructions and enter multi-word tags without quotes, so each word is saved as a separate tag and the system fills up with tags like “the”, “of”, “an”, etc. You can now define a set of “stopwords” in settings. Any stopwords detected in a submitted tag set are removed before the tags are saved.
I forked Gaynor’s taggit to add these features. See shacker/django-taggit.
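To make the two behaviors concrete, here is a minimal standalone sketch of the cleanup step. It is not the code from the fork: the names `FORCE_LOWERCASE`, `STOPWORDS`, and `clean_tags` are hypothetical and only illustrate the idea of lowercasing tags and dropping stopwords before saving.

```python
# Hypothetical names for illustration; the fork's real settings may differ.
FORCE_LOWERCASE = True
STOPWORDS = {"the", "of", "an", "a", "and"}


def clean_tags(tags, force_lowercase=FORCE_LOWERCASE, stopwords=STOPWORDS):
    """Return tags with stopwords and duplicates removed, preserving order."""
    cleaned = []
    seen = set()
    for tag in tags:
        tag = tag.strip()
        if force_lowercase:
            tag = tag.lower()
        # Compare against stopwords case-insensitively, so "The" is
        # dropped even when lowercasing is switched off.
        if not tag or tag.lower() in stopwords or tag in seen:
            continue
        seen.add(tag)
        cleaned.append(tag)
    return cleaned


print(clean_tags(["Music", "music", "the", "Lord", "of", "the", "Rings"]))
# ['music', 'lord', 'rings']
```

With lowercasing switched off, “Music” and “music” would both survive as separate tags, which is exactly the duplication described above.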
I also wrote a script to fix all of the duplicate tags that had gone into the system after I installed the original django-taggit but before I switched to my fork. You’ll find that script on the repo’s Wiki page.
Rather than defining your own list of stopwords, you could add support for NLTK, which ships with one:
>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
This could also help you remove a lot more duplication. Converting to lower case is a great start, but what about plurals and other inflected forms? You could end up with tags “walk”, “walks”, “walking”, “walked”, when really a search for any one of them should also return the others. With NLTK you can easily stem or lemmatize tags, which reduces this duplication.
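As a rough sketch of that idea, the snippet below groups tags that share a stem. To keep it free of dependencies it uses `naive_stem`, a deliberately crude suffix stripper written for this example, not NLTK's algorithm; both `naive_stem` and `group_by_stem` are hypothetical names.

```python
from collections import defaultdict


def naive_stem(word):
    """Crude stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(tags, stem=naive_stem):
    """Map each stem to the list of original tags that reduce to it."""
    groups = defaultdict(list)
    for tag in tags:
        groups[stem(tag.lower())].append(tag)
    return dict(groups)


groups = group_by_stem(["walk", "walks", "walking", "walked", "music"])
print(groups)
# {'walk': ['walk', 'walks', 'walking', 'walked'], 'music': ['music']}
```

If NLTK is installed, a real stemmer can be swapped in by passing `stem=nltk.stem.PorterStemmer().stem`, which handles far more cases than the three suffixes above.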
Aaron, this is a great tip, thanks! I hadn’t heard of NLTK before, but will look into it. Nice.
Cool man!
Very useful app.