The MS MARCO datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The dataset is provided “as is” without warranty, and usage of the data has risks since we may not own the underlying rights in the documents. We will not be liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.



Please contact us at ms-marco@microsoft.com if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

Changes To Dataset

04.23.2018: We have released an update to the dataset. V2.1 includes the following:
1. Over 1 million queries
2. ~182k Well Formed Answers
3. Query type is now included for every query.
4. Bias in the Evaluation set fixed (a small portion of answers in the V2.0 Evaluation set could be found in the V1.1 set and the V2.0 well formed sets; these have been removed from eval and added to train).
5. Utilities and Readme now available.

03.01.2018: We have released an update to the dataset. V2.0 includes the following:
1. ~900,000 unique queries
2. ~160k Well Formed Answers

01.30.2017: We have released an update to the dataset! V1.1 contains the following:
1. Improvements to dataset and evaluation scripts

12.01.2016: We have released our dataset! V1.0 contains the following:
1. 100,000 unique query answer pairs

Follow us on Twitter!

As we improve the quality of our data we will publish updates to the dataset. Follow us on Twitter for updates.