With the onset of the COVID-19 pandemic, most fields were negatively impacted, but a select few benefited, one of them being AI development. From the use of machine-learning-based techniques to catalyze COVID-related drug discovery to higher participation rates in virtual AI research conferences, the field of AI has boomed, witnessing a 27% increase in AI investments amid the pandemic.

One notable subfield within AI is Natural Language Processing (NLP). According to the 2021 AI Index Report, “progress in NLP has been so swift that technical advances have started to outpace the benchmarks to test for them,” especially with tech giants such as Google and Microsoft having already implemented the BERT language model into their search engines.

The 2021 AI Index Report discusses this in further detail, reporting that “In recent years, advances in natural language processing technology have led to significant changes in large-scale systems that billions of people access. For instance, in late 2019, Google started to deploy its BERT algorithm into its search engine, leading to what the company said was a significant improvement in its in-house quality metrics. Microsoft followed suit, announcing later in 2019 that it was using BERT to augment its Bing search engine.

English Language Understanding


Launched in May 2019, SuperGLUE is a single-metric benchmark that evaluates the performance of a model on a series of language understanding tasks on established datasets. SuperGLUE replaced the prior GLUE benchmark (introduced in 2018) with more challenging and diverse tasks. 

The SuperGLUE score is calculated by averaging scores on a set of tasks. Microsoft’s DeBERTa model now tops the SuperGLUE leaderboard, with a score of 90.3, compared with an average score of 89.8 for SuperGLUE’s “human baselines.” This does not mean that AI systems have surpassed human performance on all SuperGLUE tasks, but it does mean that the average performance across the entire suite has exceeded that of a human baseline. The rapid pace of progress (Figure 2.3.1) suggests that SuperGLUE may need to be made more challenging or replaced by harder tests in the future, just as SuperGLUE replaced GLUE. 
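The single-metric idea described above can be sketched as a simple average over per-task scores. In the snippet below, the task names are real SuperGLUE tasks, but the score values are made up for illustration; they are not actual leaderboard numbers:

```python
# Minimal sketch of SuperGLUE-style aggregation: average per-task
# scores into one overall benchmark score.

def superglue_score(task_scores: dict) -> float:
    """Average the per-task scores into a single overall score."""
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "BoolQ": 90.4,  # hypothetical per-task scores, for illustration only
    "CB": 94.9,
    "COPA": 96.8,
    "RTE": 91.1,
}

overall = superglue_score(scores)  # mean of the four values, ~93.3
```

Note that the official leaderboard first averages multiple metrics within some tasks (e.g., F1 and accuracy) before taking the overall mean; this sketch skips that detail.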


The Stanford Question Answering Dataset, or SQuAD, is a reading-comprehension benchmark that measures how accurately an NLP model can provide short answers to a series of questions pertaining to a small article of text. The SQuAD test makers established a human performance benchmark by having a group of people read Wikipedia articles on a variety of topics and then answer questions about those articles, with each answer being a short span of text from the article. Models are given the same task and are evaluated on the F1 score, the average overlap between the model prediction and the correct answer. Higher scores indicate better performance.
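The overlap-based F1 mentioned above can be sketched at the token level. This is a simplified version of the idea; the official SQuAD evaluation script additionally normalizes case, punctuation, and articles, and takes the maximum score over multiple reference answers:

```python
# Sketch of token-overlap F1, SQuAD-style: precision and recall are
# computed over the tokens shared between the predicted answer and
# the reference answer.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # shared tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# "in the park" vs. "the park": 2 shared tokens out of 3 predicted
# and 2 reference tokens, giving F1 = 0.8.
score = token_f1("in the park", "the park")
```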

Two years after the introduction of the original SQuAD in 2016, SQuAD 2.0 was developed once the initial benchmark revealed rapidly improving performance by participating models (mirroring the trend seen in GLUE and SuperGLUE). SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written by crowd workers to resemble answerable ones. The objective is to test how well systems can answer questions and to determine when systems know that no answer exists.
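The extra requirement SQuAD 2.0 adds, knowing when no answer exists, can be illustrated with a toy scoring function. As in the real dataset, an unanswerable question is represented by an empty gold answer, so a system is rewarded only for abstaining on it; the scoring function and examples below are simplified illustrations, not the official metric:

```python
# Toy exact-match scoring with SQuAD 2.0-style unanswerable questions:
# gold == "" means the question has no answer, so only an empty
# prediction (an abstention) counts as correct.

def score_answer(prediction: str, gold: str) -> float:
    if gold == "":
        return 1.0 if prediction == "" else 0.0  # must abstain
    return 1.0 if prediction.strip() == gold.strip() else 0.0

examples = [
    ("Normandy", "Normandy"),  # answerable, answered correctly
    ("France", ""),            # unanswerable, but the model guessed
    ("", ""),                  # unanswerable, model abstained
]
accuracy = sum(score_answer(p, g) for p, g in examples) / len(examples)
# 2 of 3 correct
```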

As Figure 2.3.2 shows, the F1 score for SQuAD 1.1 improved from 67.75 in August 2016 to surpass the human performance of 91.22 in September 2018, a 25-month period, whereas SQuAD 2.0 took just 10 months to beat human performance (from 66.3 in May 2018 to 89.47 in March 2019), less than half the time. In 2020, the most advanced models on SQuAD 1.1 and SQuAD 2.0 reached F1 scores of 95.38 and 93.01, respectively.”

To learn more about benchmarking within natural language processing, read here. To read about more topics within the realm of AI, visit aixplain.com.
