MS MARCO V1 Leaderboard (Closed)

The MS MARCO dataset was released at NIPS 2016. We appreciate the more than 1,000 downloads from the research community in less than one month, and the progress made so far! It's exciting to see the research community coming together to solve this difficult problem. Below are the ROUGE-L and BLEU-1 scores for the best models evaluated to date on the MS MARCO v1.1 test set. We will update the leaderboard regularly as we receive submissions. Follow us on Twitter for updates.

Rank | Model | Team | Submission Date | ROUGE-L | BLEU-1
1 | MARS | YUANFUDAO research NLP | March 26th, 2018 | 49.72 | 48.02
2 | V-Net | Baidu NLP | February 15th, 2018 | 46.15 | 44.46
3 | S-Net | Microsoft AI and Research [Tan et al. '17] | June 2017 | 45.23 | 43.78
4 | R-Net | Microsoft AI and Research [Wei et al. '16] | May 2017 | 42.89 | 42.22
5 | HieAttnNet | Akaitsuki | March 26th, 2018 | 42.25 | 44.79
6 | BiAttentionFlow+ | ShanghaiTech University GeekPie_HPC team | March 11th, 2018 | 41.45 | 38.12
7 | ReasoNet | Microsoft AI and Research [Shen et al. '16] | April 28th, 2017 | 38.81 | 39.86
8 | Prediction | Singapore Management University [Wang et al. '16] | March 2017 | 37.33 | 40.72
9 | FastQA_Ext | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 33.67 | 33.93
10 | FastQA | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 32.09 | 33.99
11 | Flypaper Model | ZhengZhou University | March 14th, 2018 | 31.74 | 34.15
12 | DCNMarcoNet | Flying Riddlers @ Carnegie Mellon University | March 31st, 2018 | 31.30 | 23.86
13 | ReasoNet Baseline (trained on SQuAD) | Microsoft AI and Research [Shen et al. '16] | December 2016 | 19.20 | 14.83

Human Performance

Can your model read, comprehend, and answer questions better than humans? Below is the current estimate of human performance on the MS MARCO task (which we will refine in future versions). It was obtained by having two judges answer the same question and computing our metrics over their responses.
Model | ROUGE-L | BLEU-1
Human Performance | 47 | 46
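
For reference, here is a minimal sketch of how these two metrics can be computed for a single candidate/reference answer pair. This is not our official evaluation script (submissions are always scored with the official script over the full test set); it only illustrates the standard definitions of BLEU-1 and ROUGE-L, and the function names and the beta value below are illustrative choices of our own.

    # Minimal sketch of the two leaderboard metrics on one candidate/reference
    # pair. NOT the official MS MARCO evaluation script; function names and
    # beta are illustrative assumptions.
    import math
    from collections import Counter

    def bleu_1(candidate, reference):
        """Unigram precision with a brevity penalty (single reference)."""
        cand, ref = candidate.split(), reference.split()
        # Clipped unigram counts: each candidate token is credited at most
        # as many times as it appears in the reference.
        overlap = sum((Counter(cand) & Counter(ref)).values())
        precision = overlap / len(cand) if cand else 0.0
        # Brevity penalty discourages very short candidates.
        bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
        return bp * precision

    def rouge_l(candidate, reference, beta=1.2):
        """LCS-based F-measure, as in Lin (2004)."""
        cand, ref = candidate.split(), reference.split()
        # Dynamic-programming longest common subsequence length.
        dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
        for i, c in enumerate(cand, 1):
            for j, r in enumerate(ref, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
        lcs = dp[-1][-1]
        if lcs == 0:
            return 0.0
        prec, rec = lcs / len(cand), lcs / len(ref)
        return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

    if __name__ == "__main__":
        ref = "the cat sat on the mat"
        cand = "the cat is on the mat"
        print(f"BLEU-1:  {bleu_1(cand, ref):.4f}")   # 0.8333
        print(f"ROUGE-L: {rouge_l(cand, ref):.4f}")  # 0.8333

Leaderboard numbers are these scores averaged over all test questions, reported as percentages.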