How to Optimize Your Language Data Through the Power of NLP

Track: Multilingual AI | TA3
Wednesday, October 20, 2021, 12:30pm – 1:15pm
Held in: Jujama
Andras Aponyi - TAUS 
Amir Kamran - TAUS

Massive-scale machine translation is possible if we have access to large volumes of data in all languages and domains. Isn’t it time to bridge this gap and unleash the power of all our language data? Every company sits on a mountain of language data in translation memories and content management systems. But that data is locked up in legacy formats and templates that make it neither very useful nor easily accessible in modern machine translation scenarios. Over the past few decades, companies have stored and organized language data under their own project and product labels, typically without the hygiene of cleaning the data and controlling versions and terminology. Every stakeholder in the translation and global content industry should know by now that, to avoid being left behind, they need to start working on the desilofication and transformation of their language data. During this presentation, we’ll talk about the importance of language data and how it can help you advance your business. We’ll give you insights into data services such as cleaning, anonymization and clustering, and into the different tools and platforms you can use.

Takeaways: Attendees will get an introduction to data services; learn about problems in language data and why cleaning is required; hear the ten steps in cleaning, anonymizing and clustering data; see the available tools and their limitations; learn about cleaning based on sentence embeddings (LASER, LaBSE); see comparisons with examples; and hear about new research and developments.
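
For readers unfamiliar with embedding-based cleaning, the general idea can be sketched as follows: embed the source and target segment of each translation pair, compare the two vectors with cosine similarity, and drop pairs that score below a threshold. This is a minimal illustration, not the presenters' pipeline. The names filter_pairs and toy_embed, the tiny lexicon, and the 0.7 threshold are all assumptions made for this example; in practice the embed function would be a multilingual sentence encoder such as LASER or LaBSE, and the threshold would be tuned on real data.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]

# Hypothetical stand-in for a multilingual encoder: maps English and Dutch
# words to shared "concept" dimensions. A real setup would call a model
# such as LASER or LaBSE here instead.
TOY_LEXICON = {"house": 0, "huis": 0, "red": 1, "rood": 1, "cat": 2, "kat": 2}


def toy_embed(sentence: str) -> List[float]:
    """Count known concepts in the sentence (illustration only)."""
    vector = [0.0] * 3
    for word in sentence.lower().split():
        if word in TOY_LEXICON:
            vector[TOY_LEXICON[word]] += 1.0
    return vector


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.7,  # illustrative value, not a recommendation
) -> List[Tuple[str, str, float]]:
    """Keep (source, target, score) for pairs scoring at or above threshold."""
    kept = []
    for source, target in pairs:
        score = cosine(embed(source), embed(target))
        if score >= threshold:
            kept.append((source, target, score))
    return kept


if __name__ == "__main__":
    sample = [("red house", "rood huis"), ("red house", "kat")]
    for source, target, score in filter_pairs(sample, toy_embed):
        print(f"{score:.2f}\t{source}\t{target}")
```

Passing the encoder in as a function keeps the filtering logic independent of any particular model, so the same loop could be used to compare different encoders on the same translation memory.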