The Massachusetts Institute of Technology (MIT) has permanently pulled the Tiny Images Dataset offline after it was accused of containing racist and misogynistic content.

MIT has issued an apology announcing that it has permanently removed the Tiny Images Dataset from the Internet. It is calling on the community to stop using the dataset and to delete any copies, so that those who already have it do not pass it on to others.

In the past year, several well-known datasets released by companies and research institutions have been taken down or permanently withdrawn. These include Microsoft’s MS Celeb 1M celebrity dataset, Duke University’s Duke MTMC surveillance dataset for pedestrian recognition, and Stanford University’s Brainwash human head detection dataset.

The Tiny Images Dataset had been an MIT project since 2006. As its name suggests, it is a dataset of tiny images.

It contains 79.3 million 32 × 32 pixel color images, mostly collected from Google Images.

The dataset is large; its images, metadata, and descriptors are stored as binary files, which must be loaded with the accompanying MATLAB toolbox and index data files.

The whole dataset is nearly 400 GB, and that scale helped make it one of the most popular datasets in computer vision research.
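For a concrete sense of the format, here is a minimal sketch of reading one image from the raw binary file, assuming the commonly documented layout (images packed back to back, 3,072 bytes each, in MATLAB’s column-major order); the file name is a hypothetical placeholder.

```python
import numpy as np

IMG_BYTES = 32 * 32 * 3  # one 32 x 32 RGB image = 3,072 bytes

def load_tiny_image(bin_path: str, index: int) -> np.ndarray:
    """Read the index-th image from the raw binary file."""
    raw = np.fromfile(bin_path, dtype=np.uint8,
                      count=IMG_BYTES, offset=index * IMG_BYTES)
    # Reshape in Fortran (column-major) order to match MATLAB's layout.
    return raw.reshape((32, 32, 3), order="F")

# Usage (hypothetical path):
# img = load_tiny_image("tiny_images.bin", 0)
```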

The paper published with this dataset, “80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition,” has been cited 1,718 times.

The paper that set off a self-examination of large datasets

The Tiny Images Dataset was thrust into the spotlight by a recently published paper, “Large Image Datasets: A Pyrrhic Win for Computer Vision?”

The paper raises serious questions about the compliance of these large datasets.

Paper address: https://arxiv.org/pdf/2006.16…

One of the paper’s two authors is Vinay Prabhu, chief scientist at UnifyID, a Silicon Valley artificial intelligence startup that provides user authentication solutions to its customers.

The other author is Abeba Birhane, a PhD candidate at University College Dublin.

The paper mainly takes the ImageNet-ILSVRC-2012 dataset as an example. The authors find that it contains a small number of covertly taken images (such as photos shot of people on beaches without their knowledge, and even images of private parts), and argue that, owing to lax review, these images seriously violate the privacy of the people depicted.

A once-classic dataset, now politically incorrect

Unlike ImageNet, which is accused of privacy violations, the Tiny Images Dataset is criticized for containing tens of thousands of racist and misogynistic images.

The paper also points out that, because it was never audited, the Tiny Images Dataset suffers from even more serious problems of discrimination and privacy invasion.

A selection of images from the Tiny Images Dataset

This brings us to the Tiny Images Dataset itself, which labels its nearly 80 million images into 75,000 categories based on WordNet.

It is precisely some of these WordNet-derived labels that make the dataset problematic.

WordNet takes the blame, and image datasets share it

WordNet was designed by psychologists, linguists, and computer engineers at Princeton University’s Cognitive Science Laboratory. Since its creation in 1985, WordNet has been regarded as the most standardized and comprehensive lexical database of English.

Being normative and comprehensive means objectively collecting the English words that exist in human society and encoding their meanings and interrelations.

In the Tiny Images Dataset, 53,464 different nouns from WordNet are used as image labels.

Statistics on race- and gender-related sensitive words in the dataset

Because of this, directly borrowing the expressions that exist in human society inevitably imports words tied to racial discrimination and sexism.

For example, explicitly insulting or derogatory terms such as “Bi*ch,” “Wh*re,” and “Ni**er” appear as labels, alongside judgmental terms such as “molester” and “pedophile.”
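To see how easily such terms can flow in, the sketch below uses NLTK’s WordNet interface (run `nltk.download('wordnet')` once beforehand) to enumerate every noun lemma, each of which was, in principle, a candidate label for a dataset built directly on WordNet; the exact count is an assumption about the installed WordNet version.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Collect every noun lemma in WordNet: each one is a potential
# category label for a dataset that adopts WordNet wholesale.
noun_lemmas = set()
for synset in wn.all_synsets(pos="n"):
    noun_lemmas.update(synset.lemma_names())

print(f"{len(noun_lemmas):,} distinct noun lemmas")
# WordNet 3.0 contains on the order of 100,000 noun lemmas; nothing
# in this enumeration screens out slurs or other offensive terms.
```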

Social impact needs to be measured before scientific research

The authors argue that many large image datasets were constructed without careful assessment of their social impact, and may therefore threaten and harm individual rights.

Because these datasets are openly available, anyone can run a query through an open API to identify or profile a person appearing in ImageNet or another dataset, which is genuinely dangerous and invasive for the person involved.

The authors offer three remedies. The first is synthetic imagery and dataset distillation, e.g., using (or augmenting with) synthetic images in place of real images during model training. The second is ethics-based dataset filtering. The third is quantitative dataset auditing: the authors conducted a cross-category quantitative analysis of ImageNet to assess the extent of the ethical violations and to gauge the feasibility of model-based annotation approaches.
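As a rough illustration of the second and third remedies, the sketch below first counts how many images carry each blocklisted label (a quantitative audit) and then drops those images (ethical filtering). The blocklist entries and the (label, image_id) pairs are hypothetical placeholders, not the authors’ actual tooling.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical blocklist: a real audit would use a curated list of
# offensive terms rather than these placeholder strings.
BLOCKLIST = {"offensive_term_1", "offensive_term_2"}

def audit(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Quantitative audit: count images per blocklisted label."""
    return Counter(label for label, _ in pairs if label in BLOCKLIST)

def ethical_filter(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Ethical filtering: keep only pairs whose label is not blocklisted."""
    for label, image_id in pairs:
        if label not in BLOCKLIST:
            yield label, image_id
```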

Dataset removal: either due to self-awareness or external pressure

MIT’s is not the first dataset to be taken down under public pressure or out of self-reflection. Microsoft pulled the famous MS Celeb 1M dataset in mid-2019 and announced that it was no longer in use.

The MS Celeb 1M dataset was built by collecting 1 million celebrities from the web, selecting 100,000 of them by popularity, and then using a search engine to retrieve roughly 100 images of each person.

MS Celeb 1M Dataset

MS Celeb 1M is widely used to train facial recognition systems. The dataset was originally used in the MSR IRC competition, one of the highest-level image recognition competitions in the world. Companies including IBM, Panasonic, Alibaba, Nvidia, and Hitachi have also used it.

Researchers pointed out that the facial recognition dataset raises issues of ethics, provenance, and privacy, because the images come from the Internet. Although Microsoft says they were obtained under Creative Commons (CC) licenses, it is the copyright holders who granted those licenses, not necessarily the people in the photos.

The licenses allow the photos to be used for academic research, but once the dataset was released, Microsoft had no effective way to monitor how it was used.

Beyond MS Celeb 1M, the Duke MTMC surveillance dataset for pedestrian recognition, published by Duke University, and the Brainwash human head detection dataset, published by Stanford University, met the same fate.

Download the other datasets while you can; tomorrow they may be taken down as well

The recent Black Lives Matter movement for racial equality has put all walks of life in Europe and the United States on edge, and discussion, dispute, and reflection continue in computer science and engineering as well.

Businesses and organizations such as GitHub and the Go project began revising their naming conventions, avoiding “blacklist” and “whitelist” in favor of neutral terms like “blocklist” and “allowlist,” or changing the default branch name from “master” to “trunk.”

And deep learning pioneer Yann LeCun quit Twitter after his comments were denounced as racist and sexist.

Now, political correctness may be turning its attention to large datasets.

Admittedly, much in the design of large datasets was poorly considered and incomplete. But under current conditions, simply pulling the datasets off the shelves is not the best way to deal with bias.

After all, these biases do not live only in these datasets, and they are not just a few words in WordNet.

Take down the dataset, and the pictures are still all over the Internet; stop using WordNet, and the words are still in people’s minds. To address bias in AI, we need to address long-standing social and cultural biases.

LeCun: a couple of tweets, and I’m out. (shrug)

Download address: https://hyper.ai/datasets/5361

Note: This data set is subject to compliance disputes. Please use it with caution.

— End —