Open source AI components

Here we share applications based on artificial intelligence that any interested party in the public or private sector can reuse free of charge and develop further to suit their own needs. The goal is to make at least 5 applications of this type accessible to the public by the end of 2020.

Among other things, we hope to make the following solutions public in the near future: speech recognition, speech synthesis, a keyword extractor for text, and a chatbot.

Translation engine

In cooperation with the University of Tartu, another piece of kratt, i.e. a basic component for applications based on artificial intelligence, has been added to the source code repository. Anyone interested can freely reuse Neurotõlge, a translation engine built for machine translation, and develop it further to suit their own needs.

The translation engine Neurotõlge supports seven languages (Estonian, Latvian, Lithuanian, English, Finnish, German and Russian), and all 42 language pairs are handled by a single neural network-based model. The source language does not need to be selected separately: the system detects it automatically, so the user only has to choose the target language. It is also possible to choose the style of the translation, either conversational or more official. In addition, the engine can adjust the style of a text within the same language and correct spelling mistakes.
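The figure of 42 language pairs follows from counting translation directions: each of the seven languages can be translated into any of the other six. The short sketch below only illustrates this arithmetic with the language list given above; it is not part of the translation engine.

```python
from itertools import permutations

# The seven supported languages, as listed in the text above.
LANGUAGES = [
    "Estonian", "Latvian", "Lithuanian", "English",
    "Finnish", "German", "Russian",
]

# A translation direction is an ordered (source, target) pair of two
# different languages, so Estonian->English and English->Estonian both count.
directions = list(permutations(LANGUAGES, 2))

# 7 source languages * 6 possible targets each = 42 directions.
print(len(directions))
```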

The translation engine can be deployed in an environment of the user's choosing, which also makes it possible to translate documents intended for internal use. The engine runs online at https://translate.ut.ee/, where it can be tried directly as a demo, integrated with translation frameworks, or used through an API.
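As a rough illustration of what a client of such an API has to supply, the sketch below models a translation request with only the inputs described above: the text, a target language, and a style. The class name, field names, language codes and style labels are all assumptions made for this example and are not taken from the actual Neurotõlge API, whose request format should be checked in its own documentation. No source-language field is included, since the engine is described as detecting the source language itself.

```python
import json
from dataclasses import dataclass, asdict

# Assumed ISO 639-1 codes for the seven supported languages (hypothetical choice).
TARGET_LANGUAGES = {"et", "lv", "lt", "en", "fi", "de", "ru"}
# Assumed labels for the two translation styles mentioned in the text.
STYLES = {"conversational", "official"}


@dataclass
class TranslationRequest:
    """Hypothetical request model; not the real API schema."""
    text: str
    target: str
    style: str = "official"

    def __post_init__(self) -> None:
        # Reject values outside the assumed sets before anything is sent.
        if self.target not in TARGET_LANGUAGES:
            raise ValueError(f"unsupported target language: {self.target}")
        if self.style not in STYLES:
            raise ValueError(f"unsupported style: {self.style}")

    def to_json(self) -> str:
        # ensure_ascii=False keeps characters such as õ and ä readable.
        return json.dumps(asdict(self), ensure_ascii=False)


request = TranslationRequest(text="Tere hommikust!", target="en", style="conversational")
payload = request.to_json()
print(payload)
```

Validating the target language and style on the client side is a design choice of this sketch; a real integration would rely on the error responses of the API itself.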

The source code repository of the translation engine solution is accessible at https://koodivaramu.eesti.ee/tartunlp/translate

If you are interested in custom or field-specific machine translation solutions, feel free to contact us by email at kratt@mkm.ee

The development of the translation engine has been supported by the University of Tartu, Enterprise Estonia, the Ministry of Education and Research, the University of Tartu HPC, Interlex Translations, and Luisa Translation Agency.

Text analytics tool
Texta

The first basic component of kratts added to the source code repository is the text analytics toolkit created by Texta OÜ. Several institutions have already used the toolkit to make their work processes more efficient and to automate routine activities. For example, the Ministry of Education and Research uses the Texta Toolkit to conduct a document management audit aimed at identifying documents that have been published without authorisation (e.g. internal documents or documents containing personal data). In collaboration with the Centre of Registers and Information Systems, the Ministry of Justice used Texta to remove personal data from nearly 80,000 court decisions involving deleted punishments and republished the decisions in the Court Information System. The Texta Toolkit arose from applied research at STACC, and its development has been supported by the Estonian Language Technology programme.

A new Texta Toolkit 2.0 has been released (added 4 March 2020)

Toolkit 2.x has a new graphical interface and project-based resources. The back end is new, integration with other systems is faster and easier, and everything the interface offers is also accessible via an API. The data model has been improved, and machine learning models can now be fine-tuned. PyTorch is used for training neural networks and Apache Tika for processing documents; adding documents to the Toolkit is more efficient, and text can be recognised in scanned documents using optical character recognition (OCR).

Put plainly, the Toolkit makes it easier to tag documents and memos, to speed up and streamline communication with customers (through e-mail tagging, automatic redirection or automatically generated responses), and to extract information, for example by exporting the required part of an e-mail, text, document or PDF file into the system.

The text analytics toolkit is available in the national code repository.

Speech synthesis tool

This is a prototype of neural network-based Estonian speech synthesis, trained on a corpus of Estonian news and developed by the Natural Language Processing research group at the University of Tartu. At present, the speech synthesis can reproduce the voices of four speakers, all within a single model. The project is still in development and far from perfect; however, neural network-based speech synthesis sounds more natural than earlier methods.

The strengths of the speech model include the natural sound and intonation of speech, and the pronunciation of numbers, symbols and abbreviations.

In addition to the online demo, an application programming interface (API) is available.

The source code, together with the installation manual, can be accessed in the national code repository.

The development of speech synthesis has been supported by the TalTech Laboratory of Phonetics and Speech Technology, the University of Tartu Phonetics Lab, and the Department of Speech Research and Speech Technology of the Institute of the Estonian Language.

Speech recognition tool

Speech recognition is a technology that converts speech into text. It allows you, for example, to dictate documents, transcribe audio and video recordings, and communicate with computers and devices by voice. The Estonian speech recognition tool is already used in practice, for example by radiologists at the North-Estonian Regional Hospital and by several Estonian media monitoring companies for the automatic transcription of radio and television programmes.

Recently, speech recognition reached the hall of the Riigikogu (Estonian Parliament), where there are plans to start producing transcripts automatically.

The speech recognition system developed at TalTech's language technology laboratory is available to everyone free of charge. The system is being developed by a working group led by Tanel Alumäe, which also includes Ottokar Tilk and Asadullah.

The source code, together with the installation manual, can be accessed in the national code repository.

Transcription is fully automatic: no one listens to these recordings or reads the transcripts. However, the contents of audio files may be used for research purposes.

©2019 kratt@mkm.ee