In a bid to help build better translation systems, Facebook has open-sourced FLORES-101, a first-of-its-kind, many-to-many evaluation data set covering 101 languages from all over the world.
For the first time, researchers will be able to reliably measure translation quality across 10,100 different translation directions, for example directly from Hindi to Thai or Swahili.
For context, evaluating only into and out of English would provide just 200 translation directions.
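Both figures follow from simple counting: with 101 languages, every ordered pair of distinct languages is one translation direction, while an English-centric setup only pairs English with each of the other 100 languages, once in each direction. A minimal sketch of that arithmetic (the function names are hypothetical, chosen for illustration):

```python
def many_to_many_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs of distinct languages."""
    return num_languages * (num_languages - 1)


def english_centric_directions(num_languages: int) -> int:
    """Count directions that go into or out of English only."""
    return 2 * (num_languages - 1)


print(many_to_many_directions(101))     # 10100
print(english_centric_directions(101))  # 200
```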
The FLORES-101 data set enables researchers to rapidly test and improve on earlier multilingual translation models such as M2M-100.
"We're making FLORES-101 publicly available because we believe in breaking down language barriers, and that means helping empower researchers to create more diverse (and locally relevant) translation tools — ones that may make it as easy to translate from, say, Bengali to Marathi as it is to translate from English to Spanish," Facebook said in a statement.
FLORES-101 focuses on what are known as low-resource languages, such as Amharic, Mongolian, and Urdu, which currently lack extensive data sets for natural language processing research.
The data set contains the same set of sentences across all languages, enabling researchers to evaluate the performance of any and all translation directions.
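To see why a shared set of sentences makes every direction testable, consider the rough sketch below. It assumes a hypothetical in-memory layout in which each language code maps to a list of sentences and position i holds the same sentence in every language. The layout, the function name and the placeholder strings are illustrative assumptions, not the actual FLORES-101 file format or content.

```python
from itertools import permutations


def evaluation_pairs(corpus):
    """Yield (src_lang, tgt_lang, source_sentence, reference_sentence).

    `corpus` maps a language code to a list of sentences; index i is
    assumed to hold the same sentence in every language.
    """
    for src, tgt in permutations(corpus, 2):
        for source, reference in zip(corpus[src], corpus[tgt]):
            yield src, tgt, source, reference


# Toy placeholder data, not taken from FLORES-101.
toy_corpus = {
    lang: [f"<sentence {i} in {lang}>" for i in range(2)]
    for lang in ("eng", "hin", "tha")
}

pairs = list(evaluation_pairs(toy_corpus))
directions = {(src, tgt) for src, tgt, _, _ in pairs}
print(len(directions))  # 6
print(len(pairs))       # 12
```

Because every language carries the same sentences, any language can serve as the source and any other as the reference, so three toy languages already give six directions without routing through English.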
"I think (FLORES) is a really exciting resource to help improve the representation of many languages within the machine translation community," said Graham Neubig, a professor at Carnegie Mellon University's Language Technologies Institute in the School of Computer Science.
"It is certainly one of the most extensive resources that I know of that covers so many languages from all over the world, in a domain of such relevance to information access as Wikipedia text."