Benchmarking is critical to drive innovation in AI/ML science, increase quality, and instill #trust in users of such models; it turns out that trust is probably the most important factor in deciding whether to apply or integrate a system or model into an application or workflow.

In science, we often optimize our systems toward a single metric that corresponds to quality (such as BLEU or COMET for machine translation), and only secondarily consider other metrics in follow-on stages to check whether a developed system falls within acceptable ranges (e.g., latency and model size).
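As a minimal sketch of this two-stage view, the snippet below scores a machine translation system with corpus BLEU (via the sacrebleu package) as the primary quality metric and then checks average latency against a budget as a secondary metric. The translate() function, the sample sentences, and the 200 ms budget are all placeholder assumptions, not a real system or recommendation.

```python
import time
import sacrebleu

def translate(src: str) -> str:
    # Placeholder standing in for the real MT system under test;
    # here it just echoes the input so the sketch runs end to end.
    return src

sources = ["Das ist ein Test.", "Guten Morgen!"]
references = ["This is a test.", "Good morning!"]

hypotheses, latencies = [], []
for src in sources:
    start = time.perf_counter()
    hypotheses.append(translate(src))
    latencies.append(time.perf_counter() - start)

# Primary metric: corpus-level BLEU against one reference per sentence.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# Secondary metric: average per-sentence latency, checked against a budget.
avg_latency_ms = 1000 * sum(latencies) / len(latencies)

print(f"BLEU: {bleu.score:.1f}")
print(f"Avg latency: {avg_latency_ms:.0f} ms (budget: 200 ms)")
```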

It is already a good step forward that the science community has recently started doubling down on the factors that raise users' trust in the systems we build: we now ask ourselves questions like "Is our technology biased along the lines of gender, race, or cultural background?". Some science teams test whether their models might, for example, produce a #translation that is culturally inappropriate, unsafe for use with children, or written at a comprehension level different from what the audience needs ("reading level").
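A minimal sketch of one such check is below, assuming the textstat package as an illustrative way to estimate reading level; the grade threshold and the sample sentence are made-up examples, not recommendations.

```python
import textstat

def within_reading_level(text: str, max_grade: float = 6.0) -> bool:
    """Return True if the text's estimated US grade level is at or below max_grade."""
    return textstat.flesch_kincaid_grade(text) <= max_grade

translation = "The cat sat on the mat and looked at the bright red ball."
print(within_reading_level(translation))  # e.g., True for a young audience
```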

Benchmarking all of this comprehensively, reliably, at scale, and in a scientifically sound manner is no cakewalk. Hence, most leading science teams invest considerable time, effort, and money in proper benchmarking.

The US government established a federal agency, the National Institute of Standards and Technology (NIST), which has helped science teams build novel baselines that led to many of the innovations we now use daily, including most cloud machine translation services, voice-enabled personal assistants, and many, many more. And NIST is far from "done" with AI: it keeps spawning new use cases, new metrics, and new competitions to drive #innovation in the community, on behalf of many users, whether citizens or other agencies.

Today you will find thousands of papers that deal with the topic of benchmarking; almost every peer-reviewed paper has an experimental section where various metrics have to be discussed. Many of us have had the experience of a paper being rejected because we focused on one metric and forgot to look at another. That will not get easier.

This is where my team at aiXplain, Inc. and I want to help: we want to collaborate with you to understand and benchmark your model, before it hits the market or while it is being used. We want to help surface potential challenges and show what can be done to improve.

On the other hand, we also want to help you understand what existing systems can do on your data, in your domain, and in your use case, and show you the various options you have if you want to integrate a system such as machine translation.

Sign up to join the aiXplain community and use the available tools, and let's work together!
