Datasets, humans, and biases

Humans decide what kind of data machines get and what kind of code they run. Machines do just what they’re programmed to do.

Datasets are created and curated by humans.
For example, the labeled images in the gigantic, state-of-the-art database ImageNet were originally scraped from the Internet, ranging from stock photos to selfies and everything in between.

These images were labeled entirely by human labor: workers recruited through an online service viewed images and picked the most appropriate label from a word database (an equally giant, organized list of nouns). This list included neutral words such as “apple” but also very loaded ones, ranging from “loser” to racial slurs. As a result, some ImageNet images ended up with highly controversial labels.

With biased data, we can only expect to repeat the same bias in our analysis.

In order to build a model that generalizes well beyond the training data, the training data needs to contain enough information relevant to the problem at hand. For example, if you create an image classifier that tells you what a given image depicts, and you have trained it only on pictures of dogs and cats, it will classify everything it sees as either a dog or a cat. This would make sense in an environment where the classifier will only ever see cats and dogs, but not if it is expected to see boats, cars, and flowers as well.
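To make this concrete, here is a minimal Python sketch of that closed-world problem. The classifier, its features, and its weights are hypothetical placeholders (not ImageNet or any trained model); the point is only that a model whose output layer knows two classes can never answer with a third.

```python
import numpy as np

CLASSES = ["cat", "dog"]  # the only labels this model has ever seen

rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 64))  # stand-in for learned parameters

def classify(image_features: np.ndarray) -> str:
    """Return the most probable label among the *known* classes."""
    logits = weights @ image_features
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over 2 classes
    return CLASSES[int(np.argmax(probs))]

# A boat image (represented here by random features, purely for
# illustration) is still forced into "cat" or "dog": the model has
# no way to say "none of the above".
boat_features = rng.normal(size=64)
print(classify(boat_features))  # prints "cat" or "dog", never "boat"
```

In practice, a classifier deployed in an open environment therefore needs either training data that covers the kinds of inputs it will actually encounter, or an explicit mechanism for rejecting inputs that match none of its known classes.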
