This is a list of known tools and resources developed specifically for linguistic processing of Icelandic. It is intended to give readers an at-a-glance overview of the ever-growing arsenal of tools for working with Icelandic natural language data.
The list is organized by task for clarity, so some multi-functional tools and toolkits appear more than once. If you notice that a category or resource is missing, or have suggestions for improving the list, please open a pull request.
- Notable papers and reports
- Other resource collections
- Toolkits
- Tokenization and text normalization
- POS tagging
- Syntactic parsing
- Grapheme-to-phoneme
- Stress analysis
Notable papers and reports ↑
- Máltækniáætlun fyrir íslensku 2018-2022 (English version)
- The project plan for an ongoing language technology programme funded by the Icelandic Ministry of Education.
- Short paper describing the programme. Note that the programme has been postponed by a year compared to the original plan.
- Risamálheild: A Very Large Icelandic Text Corpus
- Paper describing the Icelandic Gigaword Corpus, a tagged and lemmatized corpus containing over 10^9 tokens.
- Please send a pull request with additions to this list.
Other resource collections ↑
- CLARIN-IS
- The Icelandic branch of the CLARIN-ERIC language resource initiative. Contains information on and downloads for many tools and datasets.
- malfong.is
- List of language technology resources, maintained by Árnastofnun.
Toolkits ↑
- Greynir
- Python 3 package capable of syntactic parsing, lemmatization, POS tagging, noun phrase inflection and more
- The GitHub repo for this project
- Developed by Miðeind ehf.
- IceNLP
- Java toolkit that does tokenization, POS tagging, lemmatization, parsing and NER
- Developed by Hrafn Loftsson
- LVL-tts-frontend
- TTS frontend designed to work with the Merlin speech synthesis system developed by CSTR
- It contains a pronunciation dictionary, a Sequitur G2P model, a stress analysis component and more. Unfortunately, it does not include any documentation.
- Developed by Anna Björk Nikulásdóttir at LVL
Tokenization and text normalization ↑
- Icelandic tokenizer
- Textahaukur - text normalization toolkit
- Development appears to be suspended, and the project states that it is not yet functional.
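For readers new to the task, here is a minimal sketch of what tokenization involves for Icelandic text. It uses only the Python standard library and is not the API of any tool listed above. The names `TOKEN_RE` and `simple_tokenize` are hypothetical, and the pattern is an assumption that covers only words, numbers with a decimal comma or point, and single punctuation marks. The dedicated tools handle far more (abbreviations, dates, amounts, and so on).

```python
import re
import unicodedata
from typing import List

# Hypothetical pattern, for illustration only:
#   1. numbers, optionally with decimal comma/point groups (e.g. "3,5")
#   2. words, optionally hyphenated (\w matches Icelandic letters such as þ, ð, æ, ö)
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:-\w+)*|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into rough tokens (a sketch, not a full tokenizer)."""
    # Normalize to NFC so accented letters are single code points that \w matches.
    normalized = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(normalized)


if __name__ == "__main__":
    print(simple_tokenize("Ég á 3,5 kg af fiski, þökk sé Guðrúnu."))
    # ['Ég', 'á', '3,5', 'kg', 'af', 'fiski', ',', 'þökk', 'sé', 'Guðrúnu', '.']
```

A regular expression like this breaks down quickly on real text (for example, it splits abbreviations such as "t.d." into several tokens), which is one reason to prefer the purpose-built tools in this section.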
POS tagging ↑
Syntactic parsing ↑
- Neural parsing pipeline for Icelandic
- Greynir, see above
- IceNLP, see above
Grapheme-to-phoneme ↑
- LSTM encoder-decoder sequence-to-sequence models for Icelandic, reference
- g2p-service, a G2P web service, reference
- Icelandic pronunciation dictionary
- Pronunciation dictionary editor
- Thrax G2P grammar for Icelandic, reference
- LVL-tts-frontend
- G2P - Atli Thor's G2P Python module/pip package, reference
- Module for preparing text data for TTS data collections ..., reference
- Althingi ASR g2p, reference
Stress analysis ↑
- LVL-tts-frontend performs stress analysis