Natural language processing for the Finnish language is very limited. There are very few libraries, and documentation is almost nonexistent. This problem shows up especially with lemmatization: there are only a few good options for lemmatizing Finnish text. This blog post covers how to lemmatize Finnish text with Python.
Lemmatization is a process where a word is converted to its base form, which is itself a meaningful word. Another option is stemming, where an algorithm strips the last few characters of a word according to a set of rules, so the result is not necessarily a real word. Both methods are very powerful NLP techniques, and which one is better often depends on the use case. The NLTK library already has a Finnish stemmer and it is very well documented, so I don't cover it in this blog post.
Lemmatization with Voikko library
Voikko is a very powerful library, and it is used in most free, open source linguistic tools. But there is a big catch if you are planning to use it in a proprietary software project: Voikko is licensed under GPL v3, so there might be some licensing problems. If you are using it in an open source or hobby project, though, Voikko is the best option.
Installing Voikko is pretty straightforward in Ubuntu.
After installation, you can try Voikko from Python. It is pretty easy and straightforward.
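Here is a minimal sketch of what lemmatizing with Voikko can look like. It assumes the libvoikko Python bindings and a Finnish dictionary are installed. The helper names lemmatize and make_voikko_analyzer are my own for illustration, and picking the first analysis is a simplification, since Voikko can return several candidate base forms for an ambiguous word.

```python
def lemmatize(words, analyze):
    """Return one lemma per word, using an analyzer such as Voikko.analyze.

    `analyze` is any callable that takes a word and returns a list of
    dicts that may contain a "BASEFORM" key.
    """
    lemmas = []
    for word in words:
        analyses = analyze(word)
        if analyses:
            # Simplification: take the first analysis; ambiguous words
            # can have several, and the first is not always the best.
            lemmas.append(analyses[0].get("BASEFORM", word))
        else:
            # Unknown to the analyzer: keep the word unchanged.
            lemmas.append(word)
    return lemmas


def make_voikko_analyzer(language="fi"):
    """Create a Voikko analyzer (assumes libvoikko and a dictionary are installed)."""
    from libvoikko import Voikko  # imported lazily so lemmatize() works without it

    return Voikko(language).analyze
```

With Voikko installed, lemmatize(["kissoja", "talossa"], make_voikko_analyzer()) should give you the base form of each word. Words that Voikko does not recognize are returned as they are.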
Lemmatization with FINNPOS
FINNPOS is an open source toolkit for tagging and lemmatizing morphologically rich languages like Finnish. It has an Apache 2 license, so it is much more forgiving than Voikko and can also be used in closed source projects. FINNPOS is a new project, so installing and using it is not as straightforward as with Voikko. First, clone FINNPOS from GitHub.
Then you have to get a morphological analyzer. You can find it here and put it in the FinnPos/share/finnpos/omorfi/ directory. After that, you can install FINNPOS.
You can try it on the command line. I used subprocess in my Python implementation.
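A rough sketch of the subprocess approach could look like the following. This is not my exact implementation, and it rests on a few assumptions you should check against your own install: that the tagger is available as a command called ftb-label, that it reads one token per line from standard input, and that it writes tab-separated lines with the lemma in the third column. The command and the column index are parameters so you can adjust them.

```python
import subprocess


def parse_lemmas(output, lemma_column=2):
    """Pick the lemma field out of tab-separated tagger output.

    Assumption: one token per line, lemma in column `lemma_column`
    (0-based). Blank lines (sentence breaks) are skipped.
    """
    lemmas = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) > lemma_column and fields[lemma_column]:
            lemmas.append(fields[lemma_column])
    return lemmas


def lemmatize_with_finnpos(tokens, command=("ftb-label",), lemma_column=2):
    """Pipe tokens to the tagger and return the lemmas it reports.

    Assumption: `command` is the name of the installed FINNPOS tagger
    and it accepts one token per line on standard input.
    """
    payload = "\n".join(tokens) + "\n"
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        encoding="utf-8",  # Finnish text needs ä and ö to survive the pipe
        check=True,        # raise if the tagger exits with an error
    )
    return parse_lemmas(result.stdout, lemma_column)
```

Keeping the parsing separate from the subprocess call makes it easy to test the parsing on a saved output file without running the tagger every time.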
So those are two very simple but not so well documented ways to lemmatize Finnish text. I am happy to hear if there are any bugs or if I missed something crucial. You can always contact me by email or on Twitter. Happy hacking.