Harvesting Linguistic Data at Scale to Feed AI Models

Track: Multilingual AI | AI2
Thursday, June 6, 2024, 2:30pm – 3:00pm
Held in: Pembroke room
Georg Kirchner - Dell Technologies

This case study describes how AI model training data was harvested from a legacy production environment. The presenters will share how they overcame data silos in a large enterprise's pre-AI-era translation management system (TMS), where translation memories were organized by its many products and product lines. They will discuss the limitations of translation memories and alternative ways to gather and serve AI model training data, enriched with metadata for easy consumption by both linguists and ML engineers.


  • The data points needed to train AI models for translation and quality estimation;
  • Alternatives to translation memories to gather data sustainably;
  • The power of metadata for data curation and optimization of translation operations.