MS MARCO Leaderboard

The MS MARCO dataset was released at NIPS 2016. We appreciate the more than 1,000 downloads from the research community in less than one month, and the progress made so far! It's exciting to see the research community coming together to solve this difficult problem. Below are the BLEU-1 and ROUGE-L scores for the best models evaluated to date on the MS MARCO v1.1 test set. We will update the leaderboard regularly as we receive submissions. Follow us on Twitter for updates.

| Rank | Model | Submitter | BLEU-1 | ROUGE-L |
|------|-------|-----------|--------|---------|
| 1 | Prediction | Singapore Management University [Wang et al. '17] | 40.72 | 37.33 |
| 2 | FastQA_Ext | DFKI German Research Center for AI | 33.93 | 33.67 |
| 3 | FastQA | DFKI German Research Center for AI | 33.99 | 32.09 |
| 4 | ReasoNet (trained on SQuAD) | Microsoft AI & Research [Shen et al. '16] | 14.83 | 19.20 |
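For intuition about how these scores are computed, here is a minimal, simplified sketch of the two metrics: sentence-level ROUGE-L as an LCS-based F1, and BLEU-1 as clipped unigram precision with a brevity penalty, each against a single reference. This is an illustrative approximation over whitespace tokens, not the official MS MARCO evaluation script, which handles normalization and multiple reference answers.

```python
from collections import Counter
import math

def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    # Simplified sentence-level ROUGE-L F1 over whitespace tokens.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def bleu_1(candidate, reference):
    # Clipped unigram precision times brevity penalty (single reference).
    c, r = candidate.split(), reference.split()
    if not c:
        return 0.0
    overlap = sum((Counter(c) & Counter(r)).values())
    bp = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / len(c))
    return bp * overlap / len(c)
```

A perfect answer scores 1.0 on both metrics; partial overlap with the reference is rewarded proportionally, which is why generated answers can score well without matching the reference word for word.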

Human Performance

Can your model read, comprehend, and answer questions better than humans? Below is the current human performance on the MS MARCO task (a measurement we will refine in future versions). It was obtained by having two judges answer the same question and computing our metrics over their responses.
| Model | BLEU-1 | ROUGE-L |
|-------|--------|---------|
| Human Performance | 46 | 47 |