Microsoft's internal research datasets are officially open to the public, covering nine fields including NLP and CV

Planning editor | Natalie

Author | Vani Mandava

Translator | nuclear coke

Editor | Debra

AI Front Line introduction: Microsoft Research has launched a new data project dedicated to fostering broad collaboration across the global research community. Experts say that opening this data to the public will be a game changer for the big data community: by harnessing cloud computing, projects like Microsoft Research Open Data can lower barriers to data sharing and encourage reproducibility.

“The Microsoft Research Outreach group has worked extensively with external research teams and has actively promoted the adoption of cloud research infrastructure over the past few years,” the company said in a blog post. “In the process, we experienced the ubiquity of Jim Gray’s fourth paradigm of data-intensive science: almost all research projects now have a data component. This trend also underscores the need for well-curated, meaningful datasets, not only in computer science but also in interdisciplinary and domain sciences.” That is why Microsoft has opened up this data project.

Today, we are pleased to introduce Microsoft Research Open Data, a new cloud-hosted data repository dedicated to fostering broad collaboration across the global research community. Microsoft Research Open Data provides a convenient cloud hosting platform for datasets that represent years of data curation and research work at Microsoft across a range of projects.

Why open?

Our goal is to provide a simple platform where Microsoft researchers and their partners can share datasets along with related research techniques and tools. Microsoft Research Open Data aims to simplify access to these datasets, facilitate collaboration among researchers using cloud resources, and make research as reproducible as possible. We will continue to shape and evolve the repository, adding new features based on community feedback.

We recognize that researchers already use dozens of data repositories, and we expect this repository's capabilities to complement that existing work.

Figure 1. Datasets in the Microsoft Research Open Data project

“This is going to be a game changer for the big data community. Projects like Microsoft Research Open Data can reduce barriers to data sharing and encourage reproducibility through the power of cloud computing.”

-Sam Madden, Professor, MIT

Data volumes are growing exponentially, and the global total is widely expected to exceed 150 ZB by 2025. At that scale, it makes more sense to bring processing resources to the data than to move massive amounts of data across the Internet. We believe that an offering which places compute alongside the data has real practical value.
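To give a rough sense of why moving compute to the data is attractive, the back-of-the-envelope sketch below estimates how long a bulk transfer would take. The function name, the example size of 1 PB, and the 1 Gbit/s link speed are illustrative assumptions rather than figures from the announcement, and the model ignores protocol overhead and congestion.

```python
def transfer_time_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealized time, in days, to move size_bytes over a link of the given speed."""
    if size_bytes < 0 or link_bits_per_second <= 0:
        raise ValueError("size must be non-negative and link speed must be positive")
    seconds = size_bytes * 8 / link_bits_per_second  # 8 bits per byte
    return seconds / 86_400  # 86,400 seconds per day


# Illustrative values only (assumptions, not taken from the article).
PETABYTE = 10**15              # bytes, decimal definition
GIGABIT_PER_SECOND = 10**9     # bits per second

if __name__ == "__main__":
    days = transfer_time_days(PETABYTE, GIGABIT_PER_SECOND)
    print(f"1 PB over a 1 Gbit/s link: about {days:.1f} days")
```

Even under these idealized assumptions a single petabyte takes months to move, which is the practical argument for running the analysis where the data already lives.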

Features: categorized by field, broad coverage

The datasets in Microsoft Research Open Data are categorized by their primary research area, as shown in Figure 2. Each dataset links to its associated research project or publication. You can browse the available datasets and download them, or copy them directly into an Azure subscription through an automated workflow. As far as possible, the repository follows current best practices for data sharing and is designed to keep datasets findable, accessible, interoperable, and reusable. The entire repository contains no personally identifiable information. We will use feedback from users to guide further development of the site.

Figure 2. Dataset categories

A sneak peek at selected data sets

The repository contains many useful datasets. Here are a few highlights:

Microsoft Machine Reading Comprehension (MS MARCO)

Microsoft Machine Reading Comprehension (MS MARCO) is a new large-scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real, anonymized user queries. The context passages from which answers are derived are extracted from real web documents using the most advanced version of the Bing search engine. Answers to the queries are written by humans whenever the passages allow an answer to be summarized.

File size: 469.03 MB

File type: JSON

License: Microsoft Research Data License Agreement

Last modified: 6/5/18

Categories: Social sciences, social media, etc

Details:

https://msropendata.com/datasets/2bda14a7-ee25-4092-8f2f-9272d48ae903

SigmaDolphin

SigmaDolphin is a computer system for automatically solving math word problems written in natural language. The project was launched at Microsoft Research Asia in early 2013, with the main goal of building an intelligent computer system capable of natural language understanding and reasoning. The team focuses on automatic problem-solving applications that solve problems (especially math problems) written in natural language.

File size: 11.54 MB

File type: JSON, PDF, PKL, PY, TXT

License: Microsoft Research Data License Agreement

Last modified: 6/21/18

Categories: Mathematics, statistics, logic, etc

Details:

https://msropendata.com/datasets/f0e63bb3-717a-4a53-aa79-da339b0d7992

Microsoft Research Social Media Conversation Corpus

This dataset is a collection of 12,696 Tweet IDs representing 4,232 three-step conversational snippets extracted from Twitter logs. Each row in the dataset represents a single context-message-response triple for which crowdsourced annotators rated the quality of the response in context at an average of 4 or higher. The data has been randomly divided into tuning (development) and test sets containing 2,118 and 2,114 triples, respectively. The dataset is made available to the natural language processing community for academic research only. To access the underlying tweets and associated metadata, you need to call the Twitter API.
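The row layout described above can be sketched as a small parser. The sketch assumes, hypothetically, that each line of the TXT file holds three tab-separated Tweet IDs in context, message, response order; the actual column layout should be checked against the downloaded file. The names Triple, parse_triples, and unique_tweet_ids are illustrative and not part of the dataset.

```python
from typing import Iterable, List, NamedTuple, Set


class Triple(NamedTuple):
    """One context-message-response row, stored as Tweet ID strings."""
    context_id: str
    message_id: str
    response_id: str


def parse_triples(lines: Iterable[str]) -> List[Triple]:
    """Parse rows of three tab-separated Tweet IDs (assumed layout)."""
    triples = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        fields = stripped.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 IDs, got {len(fields)}"
            )
        triples.append(Triple(*fields))
    return triples


def unique_tweet_ids(triples: Iterable[Triple]) -> Set[str]:
    """Collect every distinct Tweet ID that would need to be looked up."""
    return {tweet_id for triple in triples for tweet_id in triple}


if __name__ == "__main__":
    # Made-up IDs for illustration; real rows come from the downloaded file.
    sample = ["111\t222\t333\n", "222\t333\t444\n"]
    triples = parse_triples(sample)
    print(len(triples), "triples,", len(unique_tweet_ids(triples)), "distinct Tweet IDs")
```

The set returned by unique_tweet_ids is the list of IDs one would then resolve through the Twitter API to obtain the tweet text and metadata.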

If you use this material in your research, please cite the following paper: Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Meg Mitchell, Jian-Yun Nie, Jianfeng Gao and Bill Dolan, A Neural Network Approach to Context-Sensitive Generation of Conversational Responses, Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2015).

More information about this project is available at http://research.microsoft.com/en-us/projects/convo/

File size: 245.46 KB

File type: TXT

License: Microsoft Research Data License Agreement

Last modified: 6/21/18

Categories: Social sciences, social media, etc

Details:

https://msropendata.com/datasets/2bda14a7-ee25-4092-8f2f-9272d48ae903

NewsQA

With so much written text being generated every second, how do we make sure we have the most up-to-date information available? Microsoft Research Montreal is addressing this problem by building AI systems that can read and understand large amounts of complex text in real time. The NewsQA dataset is designed to help the research community build algorithms that can answer questions requiring human-level comprehension and reasoning skills.

File size: 18.23 MB

File type: CSV, MD, PDF

License: Microsoft Research Data License Agreement

Last modified: 6/21/18

Category: Computer science

Details:

https://msropendata.com/datasets/939b1042-6402-4697-9c15-7a28de7e1321

Dual Word Embeddings Trained on Bing Queries

This data is provided for research purposes only. The DESM Word Embeddings dataset may contain terms that some people consider offensive, indecent, or otherwise objectionable. Microsoft has not reviewed or modified the contents of the dataset. Microsoft provides the dataset as a convenience only and accepts no responsibility for any inappropriate content that results from its use. Use the dataset with discretion and at your own risk. If you have any questions, please contact the authors of the paper.

File size: 10.38 GB

File type: TXT

License: Microsoft Research Data License Agreement

Last modified: 6/21/18

Category: Computer science

Details:

https://msropendata.com/datasets/30a504b0-cff2-4d4a-864f-3bc9a66f9d7e

Other selected datasets, such as Frames and Filling the Blanks for Mad Libs, are not described in detail here.

How to access the Microsoft Research Open Data datasets

Many of the datasets Microsoft has opened this time underpin advanced technologies used inside Microsoft. They span many categories and a wide range of fields, so make good use of these valuable resources. The portal is at:

https://msropendata.com/

In addition to downloading datasets, users can copy them directly to their Azure-based Data Science Virtual Machines, as shown in Figure 3.

Figure 3. Copying data from msropendata.com to an Azure-based Linux virtual machine

The Data Science Virtual Machine comes preinstalled with a variety of development tools that are popular with researchers and practitioners, as shown in Figure 4.

Figure 4. Linux Data Science Virtual Machine

“I often get requests to share research data, and the individual sharing I’ve done in the past has had good results. With Azure, we can coordinate and catalog data sets on a unified platform, helping internal and external researchers access them more easily and encouraging collaboration. This will also provide Microsoft Research with easy access to shared data in the cloud.”

— John Krumm, principal AI researcher at Microsoft Research

The Microsoft Research Open Data project is one of the results of the Microsoft Research Outreach data science initiative. We would like to thank the Microsoft teams, Microsoft researchers, industry partners, and academic advisors who collaborated on it. Without their contributions, the project could not have been completed successfully.

Original link:

https://www.microsoft.com/en-us/research/blog/announcing-microsoft-research-open-data-datasets-by-microsoft-research-now-available-in-the-cloud/
