MS MARCO V2 Leaderboard

First released at NIPS 2016, the MS MARCO dataset was an ambitious, real-world machine reading comprehension dataset. Based on feedback from the community, we designed and released the V2 dataset and its related challenges, ranked by difficulty (easiest to hardest). Can your model read, comprehend, and answer questions better than humans?

1. Given a query and 10 passages, provide the best answer available (Novice)

2. Given a query and 10 passages, provide the best answer available in natural language that could be used by a smart device/digital assistant (Intermediate)

3. TBD (Expert)

Models are ranked by ROUGE-L score.
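ROUGE-L scores a candidate answer against a reference by the length of their longest common subsequence (LCS), combining LCS-based precision and recall into an F-measure. The sketch below is an illustration of that idea, not the official evaluation script; the whitespace tokenization and the `beta` weighting are assumptions, and the official scorer may normalize text and weight recall differently.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    # Standard dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure between a candidate and a reference answer.

    Tokenization (lowercase + whitespace split) and beta are illustrative
    assumptions; the official MS MARCO scorer may differ in both.
    """
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

An exact match scores 1.0, and answers sharing no tokens with the reference score 0.0; with `beta > 1`, the F-measure weights recall more heavily than precision.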



Novice Task

| Rank | Model | Team | Submission Date | Rouge-L | Bleu-1 | F1 |
|------|-------|------|-----------------|---------|--------|-----|
| 1 | Human Performance | | April 23rd, 2018 | 53.870 | 48.50 | 94.72 |
| 2 | VNET | Baidu NLP | June 19th, 2018 | 46.72 | 50.45 | 70.96 |
| 3 | SNET + CES2S | Bo Shao of SYSU University | July 24th, 2018 | 44.96 | 46.36 | |
| 4 | DNET++ | QA Geeks | August 1st, 2018 | 43.18 | 47.86 | 70.93 |
| 5 | SNET | JY Zhao | June 26th, 2018 | 42.36 | 46.14 | 70.96 |
| 6 | SNET+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | June 1st, 2018 | 39.82 | 42.27 | 70.96 |
| 8 | DNET | QA Geeks | May 29th, 2018 | 33.30 | 29.12 | 74.36 |
| 9 | Extraction-net | zlsh80826 | July 30th, 2018 | 32.07 | 30.82 | 70.9581 |
| 10 | BIDAF+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | May 29th, 2018 | 27.60 | 28.84 | 70.96 |
| 11 | BiDaF Baseline (implemented by MSMARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 23.96 | 10.64 | 74.93 |

Intermediate Task

| Rank | Model | Team | Submission Date | Rouge-L | Bleu-1 |
|------|-------|------|-----------------|---------|--------|
| 1 | Human Performance | | April 23rd, 2018 | 63.21 | 53.03 |
| 2 | VNET | Baidu NLP | July 4th, 2018 | 46.41 | 43.12 |
| 3 | SNET + CES2S | Bo Shao of SYSU University | July 24th, 2018 | 45.04 | 40.62 |
| 4 | ConZNet | S3R | July 16th, 2018 | 42.14 | 38.62 |
| 5 | Bayes QA | Bin Bi of Alibaba NLP | June 14th, 2018 | 41.11 | 43.54 |
| 6 | SNET+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | June 1st, 2018 | 40.07 | 37.54 |
| 7 | BPG-NET | Zhijie Sang of the Center for Intelligence Science and Technology Research (CIST) of the Beijing University of Posts and Telecommunications (BUPT) | August 1st, 2018 | 38.18 | 34.72 |
| 8 | BIDAF+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | May 29th, 2018 | 32.22 | 28.33 |
| 9 | DNET++ | QA Geeks | August 1st, 2018 | 27.48 | 33.14 |
| 10 | DNET | QA Geeks | May 29th, 2018 | 25.19 | 30.73 |
| 11 | SNET | JY Zhao | May 29th, 2018 | 24.66 | 30.78 |
| 12 | BiDaF Baseline (implemented by MSMARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 16.91 | 9.30 |

MS MARCO V1 Leaderboard (Closed)

The MS MARCO dataset was released at NIPS 2016. We appreciate the more than 2,000 downloads from the research community and the 13 model submissions! It's exciting to see the research community coming together to solve this difficult problem. Here are the BLEU-1 and ROUGE-L scores for the best models we evaluated to date on the MS MARCO v1.1 test set. If you have a model, please train on the new v2.1 dataset and task.

| Rank | Model | Team | Submission Date | Rouge-L | Bleu-1 |
|------|-------|------|-----------------|---------|--------|
| 1 | MARS | YUANFUDAO research NLP | March 26th, 2018 | 49.72 | 48.02 |
| 2 | Human Performance | | December 2016 | 47.00 | 46.00 |
| 3 | V-Net | Baidu NLP [Wang et al. '18] | February 15th, 2018 | 46.15 | 44.46 |
| 4 | S-Net | Microsoft AI and Research [Tan et al. '17] | June 2017 | 45.23 | 43.78 |
| 5 | R-Net | Microsoft AI and Research [Wei et al. '16] | May 2017 | 42.89 | 42.22 |
| 6 | HieAttnNet | Akaitsuki | March 26th, 2018 | 42.25 | 44.79 |
| 7 | BiAttentionFlow+ | ShanghaiTech University GeekPie_HPC team | March 11th, 2018 | 41.45 | 38.12 |
| 8 | ReasoNet | Microsoft AI and Research [Shen et al. '16] | April 28th, 2017 | 38.81 | 39.86 |
| 9 | Prediction | Singapore Management University [Wang et al. '16] | March 2017 | 37.33 | 40.72 |
| 10 | FastQA_Ext | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 33.67 | 33.93 |
| 11 | FastQA | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 32.09 | 33.99 |
| 12 | Flypaper Model | ZhengZhou University | March 14th, 2018 | 31.74 | 34.15 |
| 13 | DCNMarcoNet | Flying Riddlers @ Carnegie Mellon University | March 31st, 2018 | 31.30 | 23.86 |
| 14 | BiDaF Baseline for V2 (implemented by MSMARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 26.82 | 12.93 |
| 15 | ReasoNet Baseline | Microsoft AI & Research [Shen et al. '16], trained on SQuAD | December 2016 | 19.20 | 14.83 |