MS MARCO V2 Leaderboard

First released at NIPS 2016, the MS MARCO dataset was an ambitious, real-world Machine Reading Comprehension dataset. Based on feedback from the community, we designed and released the V2 dataset and its related challenges. Can your model read, comprehend, and answer questions better than humans?

1. Given a query and 1,000 candidate passages, rerank the passages by relevance (Passage Re-Ranking)

2. Given a query and 10 passages, provide the best available answer (Q&A)

3. Given a query and 10 passages, provide the best available answer in natural language that could be used by a smart device/digital assistant (Q&A + Natural Language Generation)

Passage re-ranking submissions are ranked by MRR@10; submissions to the two Q&A tasks are ranked by ROUGE-L score.
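The two ranking metrics can be sketched as follows (a minimal illustration only: function names are ours, and the official MS MARCO evaluation scripts additionally handle tokenization, multiple references, and their own F-measure weighting, assumed here to be beta = 1.2):

```python
def mrr_at_10(ranked_labels):
    """MRR@10: mean over queries of 1/rank of the first relevant
    passage within the top 10 ranked passages (0 if none appears)."""
    total = 0.0
    for labels in ranked_labels:  # one 0/1 relevance list per query, in ranked order
        for rank, relevant in enumerate(labels[:10], start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_labels)


def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L: F-measure over the longest common subsequence (LCS)
    of candidate and reference token lists."""
    # Dynamic-programming LCS length.
    dp = [[0] * (len(reference) + 1) for _ in range(len(candidate) + 1)]
    for i, c in enumerate(candidate, 1):
        for j, r in enumerate(reference, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(candidate)][len(reference)]
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For example, a system whose first relevant passage sits at rank 2 for one query and rank 1 for another scores MRR@10 = (1/2 + 1/1) / 2 = 0.75.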



Passage Re-Ranking

Rank Model Submission Date MRR@10 On Eval MRR@10 On Dev
1 Neural Kernel Match IR (Conv-KNRM) (1)Yifan Qiao, (2)Chenyan Xiong, (3)Zhenghao Liu, (4)Zhiyuan Liu - Tsinghua University (1, 3, 4); Microsoft Research AI (2) [Dai et al. '18] November 28th, 2018 27.12 -
2 Neural Kernel Match IR (KNRM) (1)Chenyan Xiong, (2)Zhuyun Dai, (3)Jamie Callan, (4)Zhiyuan Liu, (5)Russell Power - Carnegie Mellon University (1, 2, 3); Tsinghua University (4); Allen Institute for AI (5) [Xiong et al. '17] December 10th, 2018 - -
3 Feature-based LeToR (1)Chenyan Xiong, (2)Zhuyun Dai, (3)Jamie Callan, (4)Zhiyuan Liu - Carnegie Mellon University (1, 2, 3); Tsinghua University (4) December 10th, 2018 - -
4 Modified DUET Microsoft AI and Research [Mitra et al. '17] November 1st, 2018 - -
5 BM25 November 1st, 2018 16.49 -

Q&A Task

Rank Model Submission Date ROUGE-L BLEU-1
1 Human Performance April 23rd, 2018 53.87 48.50
2 VNET Baidu NLP November 8th, 2018 51.63 54.37
3 Deep Cascade QA Ming Yan October 25th, 2018 46.84 49.42
4 SNET + CES2S Bo Shao of SYSU July 24th, 2018 44.96 46.36
5 Masque NTT Media Intelligence Laboratories September 17th, 2018 44.48 43.72
6 Extraction-net zlsh80826 October 20th, 2018 43.66 44.44
7 SNET JY Zhao August 30th, 2018 43.59 46.29
8 BIDAF+ELMo+SofterMax Wang Changbao November 16th, 2018 43.59 45.86
9 DNETQA Geeks August 1st, 2018 43.18 47.86
10 Reader-Writer Microsoft Business Applications Group AI Research September 16th, 2018 42.07 43.62
11 SNET+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS June 1st, 2018 39.82 42.27
12 KIGN-QA Chenliang Li November 26th, 2018 36.79 35.24
13 BIDAF+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS May 29th, 2018 27.60 28.84
14 BiDaF Baseline (Implemented by MSMARCO Team) Allen Institute for AI & University of Washington [Seo et al. '16] April 23rd, 2018 23.96 10.64
15 TrioNLP + BiDAF Trio.AI of the CCNU September 23rd, 2018 20.45 23.19

Q&A + Natural Language Generation Task

Rank Model Submission Date ROUGE-L BLEU-1
1 Human Performance April 23rd, 2018 63.21 53.03
2 VNET Baidu NLP November 8th, 2018 48.37 46.75
3 Masque NTT Media Intelligence Laboratories September 17th, 2018 46.81 47.64
4 SNET + CES2S Bo Shao of SYSU July 24th, 2018 45.04 40.62
5 Reader-Writer Microsoft Business Applications Group AI Research September 16th, 2018 43.89 42.59
6 ConZNet Samsung Research July 16th, 2018 42.14 38.62
7 Bayes QA Bin Bi of Alibaba NLP June 14th, 2018 41.11 43.54
8 SNET+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS June 1st, 2018 40.07 37.54
9 BPG-NET Zhijie Sang of the Center for Intelligence Science and Technology Research(CIST) of the Beijing University of Posts and Telecommunications (BUPT) August 1st, 2018 38.18 34.72
10 KIGN-QA Chenliang Li November 26th, 2018 37.22 39.39
11 Deep Cascade QA Ming Yan October 25th, 2018 35.14 37.35
12 BIDAF+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS May 29th, 2018 32.22 28.33
13 DNET QA Geeks August 1st, 2018 27.48 33.14
14 BIDAF+ELMo+SofterMax Wang Changbao November 16th, 2018 26.75 34.54
15 SNET JY Zhao May 29th, 2018 24.66 30.78
16 Extraction-net zlsh80826 August 14th, 2018 24.65 32.08
17 BiDaF Baseline (Implemented by MSMARCO Team) Allen Institute for AI & University of Washington [Seo et al. '16] April 23rd, 2018 16.91 9.30
18 TrioNLP + BiDAF Trio.AI of the CCNU September 23rd, 2018 14.22 16.04

MS MARCO V1 Leaderboard (Closed)

The MS MARCO dataset was released at NIPS 2016. We appreciate the more than 2,000 downloads from the research community and the 13 model submissions! It's exciting to see the research community coming together to solve this difficult problem. Here are the BLEU-1 and ROUGE-L scores for the best models we have evaluated to date on the MS MARCO v1.1 test set. If you have a model, please train it on the new v2.1 dataset and submit to its tasks.

Rank Model Submission Date ROUGE-L BLEU-1
1 MARS
YUANFUDAO research NLP
March 26th, 2018 49.72 48.02
2 Human Performance
December 2016 47.00 46.00
3 V-Net
Baidu NLP [Wang et al '18]
February 15th, 2018 46.15 44.46
4 S-Net
Microsoft AI and Research [Tan et al. '17]
June 2017 45.23 43.78
5 R-Net
Microsoft AI and Research [Wei et al. '16]
May 2017 42.89 42.22
6 HieAttnNet
Akaitsuki
March 26th, 2018 42.25 44.79
7 BiAttentionFlow+
ShanghaiTech University GeekPie_HPC team
March 11th, 2018 41.45 38.12
8 ReasoNet
Microsoft AI and Research [Shen et al. '16]
April 28th, 2017 38.81 39.86
9 Prediction
Singapore Management University [Wang et al. '16]
March 2017 37.33 40.72
10 FastQA_Ext
DFKI German Research Center for AI [Weissenborn et al. '17]
March 2017 33.67 33.93
11 FastQA
DFKI German Research Center for AI [Weissenborn et al. '17]
March 2017 32.09 33.99
12 Flypaper Model
ZhengZhou University
March 14th, 2018 31.74 34.15
13 DCNMarcoNet
Flying Riddlers @ Carnegie Mellon University
March 31st, 2018 31.30 23.86
14 BiDaF Baseline for V2 (Implemented by MSMARCO Team)
Allen Institute for AI & University of Washington [Seo et al. '16]
April 23rd, 2018 26.82 12.93
15 ReasoNet Baseline
Trained on SQuAD, Microsoft AI & Research [Shen et al. '16]
December 2016 19.20 14.83