MS MARCO V2 Leaderboard

First released at NIPS 2016, the MS MARCO dataset was an ambitious, real-world Machine Reading Comprehension dataset. Based on feedback from the community, we designed and released the V2 dataset and its related challenges, ranked by difficulty from easiest to hardest (a sketch of the record format follows the task list). Can your model read, comprehend, and answer questions better than humans?

1. Given a query and 10 passages, provide the best answer available based on the passages (Novice)

2. Given a query and 10 passages, provide the best answer available in natural language that could be used by a smart device/digital assistant (Intermediate)

3. TBD (Expert)
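
For concreteness, each record in the dataset pairs a single query with its retrieved passages and reference answers. The sketch below shows one plausible way to iterate over such records in Python; the field names (`query`, `passages`, `answers`, `wellFormedAnswers`) follow the v2.1 schema, but the file name and the JSON-lines container format are assumptions here, so consult the released files for the authoritative layout.

```python
# Illustrative sketch of reading MS MARCO-style records; NOT the
# official loader. Assumes one JSON record per line; the released
# v2.1 files may package records differently.
import json

def iter_records(path):
    """Yield one question record per line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for record in iter_records("dev_v2.1.jsonl"):  # hypothetical file name
    query = record["query"]        # the user's question
    passages = record["passages"]  # the retrieved passages a system must read
    answers = record["answers"]    # reference answers (Novice task)
    well_formed = record.get("wellFormedAnswers", [])  # natural-language references (Intermediate task)
    # A submission produces one answer string per query from the passages.
```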

Models are ranked by ROUGE-L score.
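
Concretely, sentence-level ROUGE-L is an F-measure over the longest common subsequence (LCS) of candidate and reference tokens, while BLEU-1 is clipped unigram precision scaled by a brevity penalty. The following is a minimal sketch of both metrics, not the official evaluation script, which additionally handles tokenization details, multiple references, and corpus-level aggregation; the beta of 1.2 in the ROUGE-L F-measure is the value used by common implementations and is an assumption here.

```python
# Minimal sketches of the two reported metrics; illustrative only.
import math
from collections import Counter

def lcs_length(a, b):
    """Longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """Sentence-level ROUGE-L F-measure for one candidate/reference pair."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

def bleu_1(candidate, reference):
    """Unigram BLEU: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    clipped = sum(min(n, Counter(ref)[w]) for w, n in Counter(cand).items())
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * clipped / len(cand)

print(rouge_l("rachel carson wrote silent spring",
              "silent spring was written by rachel carson"))  # ~0.32
```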



Novice Task

| Rank | Model | Team | Submission Date | ROUGE-L | BLEU-1 | F1 |
|------|-------|------|-----------------|---------|--------|----|
| 1 | Human Performance | | April 23rd, 2018 | 53.87 | 48.50 | 94.72 |
| 2 | DNET++ | QA Geeks | June 1st, 2018 | 41.91 | 45.80 | 70.93 |
| 3 | SNET+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | June 1st, 2018 | 39.82 | 42.27 | 70.96 |
| 4 | SNET | JY Zhao | May 29th, 2018 | 38.63 | 42.11 | 70.96 |
| 5 | DNET | QA Geeks | May 29th, 2018 | 33.30 | 29.12 | 74.36 |
| 6 | BIDAF+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | May 29th, 2018 | 27.60 | 28.84 | 70.96 |
| 7 | BiDaF Baseline (implemented by the MS MARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 23.96 | 10.64 | 74.93 |

Intermediate Task

| Rank | Model | Team | Submission Date | ROUGE-L | BLEU-1 |
|------|-------|------|-----------------|---------|--------|
| 1 | Human Performance | | April 23rd, 2018 | 63.21 | 53.03 |
| 2 | CZNet | S3R | June 14th, 2018 | 41.68 | 37.52 |
| 3 | Bayes QA | Bin Bi of Alibaba NLP | June 14th, 2018 | 41.11 | 43.54 |
| 4 | SNET+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | June 1st, 2018 | 40.07 | 37.54 |
| 5 | BIDAF+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | May 29th, 2018 | 32.22 | 28.33 |
| 6 | DNET++ | QA Geeks | June 1st, 2018 | 26.15 | 32.24 |
| 7 | DNET | QA Geeks | May 29th, 2018 | 25.19 | 30.73 |
| 8 | SNET | JY Zhao | May 29th, 2018 | 24.66 | 30.78 |
| 9 | BiDaF Baseline (implemented by the MS MARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 16.91 | 9.30 |

MS MARCO V1 Leaderboard (Closed)

The MS MARCO dataset was released at NIPS 2016. We appreciate the more than 2,000 downloads from the research community and the 13 model submissions! It's exciting to see the research community coming together to solve this difficult problem. Here are the BLEU-1 and ROUGE-L scores for the best models we have evaluated to date on the MS MARCO v1.1 test set. If you have a model, please train on the new v2.1 dataset and task.

| Rank | Model | Team | Submission Date | ROUGE-L | BLEU-1 |
|------|-------|------|-----------------|---------|--------|
| 1 | MARS | YUANFUDAO research NLP | March 26th, 2018 | 49.72 | 48.02 |
| 2 | Human Performance | | December 2016 | 47.00 | 46.00 |
| 3 | V-Net | Baidu NLP [Wang et al. '18] | February 15th, 2018 | 46.15 | 44.46 |
| 4 | S-Net | Microsoft AI and Research [Tan et al. '17] | June 2017 | 45.23 | 43.78 |
| 5 | R-Net | Microsoft AI and Research [Wei et al. '16] | May 2017 | 42.89 | 42.22 |
| 6 | HieAttnNet | Akaitsuki | March 26th, 2018 | 42.25 | 44.79 |
| 7 | BiAttentionFlow+ | ShanghaiTech University GeekPie_HPC team | March 11th, 2018 | 41.45 | 38.12 |
| 8 | ReasoNet | Microsoft AI and Research [Shen et al. '16] | April 28th, 2017 | 38.81 | 39.86 |
| 9 | Prediction | Singapore Management University [Wang et al. '16] | March 2017 | 37.33 | 40.72 |
| 10 | FastQA_Ext | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 33.67 | 33.93 |
| 11 | FastQA | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 32.09 | 33.99 |
| 12 | Flypaper Model | ZhengZhou University | March 14th, 2018 | 31.74 | 34.15 |
| 13 | DCNMarcoNet | Flying Riddlers @ Carnegie Mellon University | March 31st, 2018 | 31.30 | 23.86 |
| 14 | BiDaF Baseline for V2 (implemented by the MS MARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 26.82 | 12.93 |
| 15 | ReasoNet Baseline (trained on SQuAD) | Microsoft AI & Research [Shen et al. '16] | December 2016 | 19.20 | 14.83 |