MS MARCO V2 Leaderboard

First released at NIPS 2016, the MS MARCO dataset was an ambitious, real-world Machine Reading Comprehension dataset. Based on feedback from the community, we designed and released the V2 dataset and its related challenges. Can your model read, comprehend, and answer questions better than humans?

1. Given a query and 1,000 candidate passages, rerank the passages by relevance (Passage Re-Ranking)

2. Given a query and 10 passages, provide the best answer available based on the passages (Q&A)

3. Given a query and 10 passages, provide the best answer available in natural language that could be used by a smart device/digital assistant (Q&A + Natural Language Generation)

Q&A models are ranked by ROUGE-L score, and Passage Re-Ranking models are ranked by MRR@10. NOTE: MRR, ROUGE, and BLEU scores have been normalized to be out of 100 instead of 1 for easier reading.
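As a rough illustration of how these scores are computed, here is a minimal Python sketch of MRR@10 and a sentence-level ROUGE-L F-measure, both scaled to 0-100 as on this leaderboard. This is not the official evaluation code (the official scripts handle tokenization, multiple reference answers, and other details), and the `beta = 1.2` recall weighting in ROUGE-L is an assumption for illustration only.

```python
def mrr_at_10(ranked_relevance):
    """MRR@10 over a set of queries, scaled to 0-100.

    ranked_relevance: list of per-query lists of 0/1 relevance flags,
    ordered by the model's ranking (top-ranked passage first).
    Each query contributes 1/rank of its first relevant passage in
    the top 10, or 0 if none is relevant.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags[:10], start=1):
            if relevant:
                total += 1.0 / rank
                break
    return 100.0 * total / len(ranked_relevance)


def rouge_l(candidate, reference, beta=1.2):
    """Sentence-level ROUGE-L F-measure, scaled to 0-100.

    Based on the longest common subsequence (LCS) of the two token
    sequences; beta weights recall vs. precision (assumed value).
    """
    c, r = candidate.split(), reference.split()
    # LCS length via dynamic programming over token prefixes.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    f = (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
    return 100.0 * f
```

For example, a model that places the first relevant passage at rank 2 for one query and rank 1 for another scores `mrr_at_10 = 75.0` on those two queries.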



Passage Re-Ranking

| Rank | Model | Submitters | Submission Date | MRR@10 (Eval) | MRR@10 (Dev) |
|------|-------|------------|-----------------|---------------|--------------|
| 1 | SAN + BERT base | Yu Wang, Xiaodong Liu, Jianfeng Gao - Deep Learning Group, Microsoft Research AI [Xiaodong et al. '18] | January 22nd, 2019 | 35.93 | 36.97 |
| 2 | BERT + Small Training | Rodrigo Nogueira and Kyunghyun Cho - New York University [Nogueira et al. '19] [Code] | January 7th, 2019 | 35.87 | 36.53 |
| 3 | BERT + Projected Matching | anonymous | February 6th, 2019 | 35.61 | - |
| 4 | BERT base+ranking | Hongyin Zhu | February 8th, 2019 | 32.56 | 31.63 |
| 5 | IRNet (Deep CNN/IR Hybrid Network) | Dave DeBarr, Navendu Jain, Robert Sim, Justin Wang, Nirupama Chandrasekaran - Microsoft | January 2nd, 2019 | 28.06 | 27.80 |
| 6 | Neural Kernel Match IR (Conv-KNRM) | (1) Yifan Qiao, (2) Chenyan Xiong, (3) Zhenghao Liu, (4) Zhiyuan Liu - Tsinghua University (1, 3, 4); Microsoft Research AI (2) [Dai et al. '18] | November 28th, 2018 | 27.12 | 29.02 |
| 7 | Neural Kernel Match IR (KNRM) | (1) Yifan Qiao, (2) Chenyan Xiong, (3) Zhenghao Liu, (4) Zhiyuan Liu - Tsinghua University (1, 3, 4); Microsoft Research AI (2) [Xiong et al. '17] | December 10th, 2018 | 19.82 | 21.84 |
| 8 | Feature-based LeToR: simple feature-based RankSVM | (1) Yifan Qiao, (2) Chenyan Xiong, (3) Zhenghao Liu, (4) Zhiyuan Liu - Tsinghua University (1, 3, 4); Microsoft Research AI (2) | December 10th, 2018 | 19.05 | 19.47 |
| 9 | BM25 | Stephen E. Robertson; Steve Walker; Susan Jones; Micheline Hancock-Beaulieu & Mike Gatford (implemented by the MS MARCO team) [Robertson et al. '94] | November 1st, 2018 | 16.49 | 16.70 |

Q&A Task

| Rank | Model | Submitters | Submission Date | ROUGE-L | BLEU-1 |
|------|-------|------------|-----------------|---------|--------|
| 1 | Human Performance | | April 23rd, 2018 | 53.87 | 48.50 |
| 2 | Masque Q&A Style | NTT Media Intelligence Laboratories [Nishida et al. '19] | January 3rd, 2019 | 52.20 | 43.77 |
| 3 | Deep Cascade QA | Ming Yan [Yan et al. '18] | December 12th, 2018 | 52.01 | 54.64 |
| 4 | VNET | Baidu NLP [Wang et al. '18] | November 8th, 2018 | 51.63 | 54.37 |
| 5 | Masque NLGEN Style | NTT Media Intelligence Laboratories [Nishida et al. '19] | January 3rd, 2019 | 48.92 | 48.75 |
| 6 | BERT+ Multi-Pointer-Generator | Tongjun Li of ColorfulClouds Tech and BUPT | December 31st, 2018 | 48.14 | 52.03 |
| 7 | SNET + CES2S | Bo Shao of SYSU University | July 24th, 2018 | 44.96 | 46.36 |
| 8 | Extraction-net | zlsh80826 | October 20th, 2018 | 43.66 | 44.44 |
| 9 | SNET | JY Zhao | August 30th, 2018 | 43.59 | 46.29 |
| 10 | BIDAF+ELMo+SofterMax | Wang Changbao | November 16th, 2018 | 43.59 | 45.86 |
| 11 | DNETQA | Geeks | August 1st, 2018 | 43.18 | 47.86 |
| 12 | Reader-Writer | Microsoft Business Applications Group AI Research | September 16th, 2018 | 42.07 | 43.62 |
| 13 | SNET+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | June 1st, 2018 | 39.82 | 42.27 |
| 14 | KIGN-QA | Chenliang Li | December 10th, 2018 | 38.83 | 36.29 |
| 15 | lightNLP+BiDAF | Enliple AI | February 1st, 2019 | 29.82 | 15.56 |
| 16 | BIDAF+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | May 29th, 2018 | 27.60 | 28.84 |
| 17 | BiDaF Baseline (implemented by the MS MARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 23.96 | 10.64 |
| 18 | TrioNLP + BiDAF | Trio.AI of CCNU | September 23rd, 2018 | 20.45 | 23.19 |
| 19 | BiDAF + LSTM | Meefly | January 15th, 2019 | 15.30 | 11.95 |

Q&A + Natural Language Generation Task

| Rank | Model | Submitters | Submission Date | ROUGE-L | BLEU-1 |
|------|-------|------------|-----------------|---------|--------|
| 1 | Human Performance | | April 23rd, 2018 | 63.21 | 53.03 |
| 2 | Masque NLGEN Style | NTT Media Intelligence Laboratories [Nishida et al. '19] | January 3rd, 2019 | 49.61 | 50.13 |
| 3 | VNET | Baidu NLP [Wang et al. '18] | November 8th, 2018 | 48.37 | 46.75 |
| 4 | BERT+ Multi-Pointer-Generator | Tongjun Li of ColorfulClouds Tech and BUPT | December 31st, 2018 | 47.37 | 45.09 |
| 5 | SNET + CES2S | Bo Shao of SYSU University | July 24th, 2018 | 45.04 | 40.62 |
| 6 | Reader-Writer | Microsoft Business Applications Group AI Research | September 16th, 2018 | 43.89 | 42.59 |
| 7 | ConZNet | Samsung Research | July 16th, 2018 | 42.14 | 38.62 |
| 8 | KIGN-QA | Chenliang Li | November 26th, 2018 | 41.72 | 43.60 |
| 9 | Bayes QA | Bin Bi of Alibaba NLP | June 14th, 2018 | 41.11 | 43.54 |
| 10 | SNET+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | June 1st, 2018 | 40.07 | 37.54 |
| 11 | BPG-NET | Zhijie Sang of the Center for Intelligence Science and Technology Research (CIST), Beijing University of Posts and Telecommunications (BUPT) | August 1st, 2018 | 38.18 | 34.72 |
| 12 | Deep Cascade QA | Ming Yan | October 25th, 2018 | 35.14 | 37.35 |
| 13 | BIDAF+seq2seq | Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS | May 29th, 2018 | 32.22 | 28.33 |
| 14 | Masque Q&A Style | NTT Media Intelligence Laboratories [Nishida et al. '19] | January 3rd, 2019 | 28.53 | 39.87 |
| 15 | DNET QA | Geeks | August 1st, 2018 | 27.48 | 33.14 |
| 16 | BIDAF+ELMo+SofterMax | Wang Changbao | November 16th, 2018 | 26.75 | 34.54 |
| 17 | SNET | JY Zhao | May 29th, 2018 | 24.66 | 30.78 |
| 18 | Extraction-net | zlsh80826 | August 14th, 2018 | 24.65 | 32.08 |
| 19 | lightNLP+BiDAF | Enliple AI | February 1st, 2019 | 21.03 | 10.83 |
| 20 | BiDaF Baseline (implemented by the MS MARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 16.91 | 9.30 |
| 21 | TrioNLP + BiDAF | Trio.AI of CCNU | September 23rd, 2018 | 14.22 | 16.04 |
| 22 | BiDAF + LSTM | Meefly | January 15th, 2019 | 11.94 | 17.34 |

MS MARCO V1 Leaderboard (Closed)

The MS MARCO dataset was released at NIPS 2016. We appreciate the more than 2,000 downloads from the research community and the 13 model submissions! It's exciting to see the research community coming together to solve this difficult problem. Here are the BLEU-1 and ROUGE-L scores for the best models we evaluated to date on the MS MARCO v1.1 test set. If you have a model, please train it on the new v2.1 dataset and submit to the new tasks.

| Rank | Model | Submitters | Submission Date | ROUGE-L | BLEU-1 |
|------|-------|------------|-----------------|---------|--------|
| 1 | MARS | YUANFUDAO research NLP | March 26th, 2018 | 49.72 | 48.02 |
| 2 | Human Performance | | December 2016 | 47.00 | 46.00 |
| 3 | V-Net | Baidu NLP [Wang et al. '18] | February 15th, 2018 | 46.15 | 44.46 |
| 4 | S-Net | Microsoft AI and Research [Tan et al. '17] | June 2017 | 45.23 | 43.78 |
| 5 | R-Net | Microsoft AI and Research [Wei et al. '16] | May 2017 | 42.89 | 42.22 |
| 6 | HieAttnNet | Akaitsuki | March 26th, 2018 | 42.25 | 44.79 |
| 7 | BiAttentionFlow+ | ShanghaiTech University GeekPie_HPC team | March 11th, 2018 | 41.45 | 38.12 |
| 8 | ReasoNet | Microsoft AI and Research [Shen et al. '16] | April 28th, 2017 | 38.81 | 39.86 |
| 9 | Prediction | Singapore Management University [Wang et al. '16] | March 2017 | 37.33 | 40.72 |
| 10 | FastQA_Ext | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 33.67 | 33.93 |
| 11 | FastQA | DFKI German Research Center for AI [Weissenborn et al. '17] | March 2017 | 32.09 | 33.99 |
| 12 | Flypaper Model | ZhengZhou University | March 14th, 2018 | 31.74 | 34.15 |
| 13 | DCNMarcoNet | Flying Riddlers @ Carnegie Mellon University | March 31st, 2018 | 31.30 | 23.86 |
| 14 | BiDaF Baseline for V2 (implemented by the MS MARCO team) | Allen Institute for AI & University of Washington [Seo et al. '16] | April 23rd, 2018 | 26.82 | 12.93 |
| 15 | ReasoNet Baseline | Trained on SQuAD, Microsoft AI & Research [Shen et al. '16] | December 2016 | 19.20 | 14.83 |