MS MARCO Leaderboard

The MS MARCO dataset was released at NIPS 2016. We appreciate the more than 1,000 downloads from the research community in less than one month and the progress made so far! It's exciting to see the research community coming together to solve this difficult problem. Here are the ROUGE-L and BLEU-1 scores for the best models evaluated to date on the MS MARCO v1.1 test set. We will update the leaderboard regularly as we receive submissions. Follow us on Twitter for updates.

| Rank | Model | Submitter | Rouge-L | Bleu-1 |
|------|-------|-----------|---------|--------|
| 1 | S-Net | Microsoft AI and Research [Tan et al. '17] | 45.23 | 43.78 |
| 2 | R-Net | Microsoft AI and Research [Wei et al. '16] | 42.89 | 42.22 |
| 3 | ReasoNet | Microsoft AI and Research [Shen et al. '16] | 38.81 | 39.86 |
| 4 | Prediction | Singapore Management University [Wang et al. '16] | 37.33 | 40.72 |
| 5 | FastQA_Ext | DFKI German Research Center for AI [Weissenborn et al. '17] | 33.67 | 33.93 |
| 6 | FastQA | DFKI German Research Center for AI [Weissenborn et al. '17] | 32.09 | 33.99 |
| 7 | ReasoNet Baseline (trained on SQuAD) | Microsoft AI and Research [Shen et al. '16] | 19.20 | 14.83 |

Human Performance

Can your model read, comprehend, and answer questions better than humans? Below is the current human performance on the MS MARCO task (which we will refine in future versions). It was measured by having two judges answer the same question and computing our metrics over their responses.

| Model | Rouge-L | Bleu-1 |
|-------|---------|--------|
| Human Performance | 47 | 46 |
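For readers unfamiliar with the metrics, here is a minimal sketch of how sentence-level Rouge-L (an LCS-based F-measure) and Bleu-1 (modified unigram precision with a brevity penalty) can be computed over a pair of answers, such as two judges' responses. This is an illustrative simplification with naive whitespace tokenization, not the official MS MARCO evaluation script.

```python
import math
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure; beta > 1 weights recall over precision."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)


def bleu_1(candidate, reference):
    """Clipped unigram precision scaled by a brevity penalty."""
    c, r = candidate.split(), reference.split()
    overlap = sum((Counter(c) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return bp * overlap / len(c)
```

An identical candidate and reference score 1.0 on both metrics; answers with no overlapping tokens score 0.0.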