Biased Data

Update: 2021-04-09
Description

In this episode of Beneficial Intelligence, I discuss biased data. Machine learning depends on large data sets, and unless you take care, ML algorithms will perpetuate any bias in the data they learn from.

The famous ImageNet database contains 14 million labeled images. However, an estimated 6% of these have the wrong label. The labels are provided by humans who are paid very little per image, so they work very fast. Unfortunately, as Nobel Prize winner Daniel Kahneman has shown, when humans work fast, they rely on their fast System 1 thinking, which is highly prone to bias. Thus, a woman in hospital scrubs is likely to be labeled "nurse," while a man in the same clothes is likely to be labeled "doctor."
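One simple way to spot this kind of skew is to compare how often each label is applied within each group. The following is only a minimal sketch in Python, not a description of how ImageNet is audited: the function name `label_rates_by_group` and the sample records are hypothetical and invented for illustration.

```python
from collections import Counter, defaultdict


def label_rates_by_group(records):
    """Return {group: {label: share}} for an iterable of (group, label) pairs."""
    counts = defaultdict(Counter)
    for group, label in records:
        counts[group][label] += 1

    rates = {}
    for group, label_counts in counts.items():
        total = sum(label_counts.values())
        rates[group] = {label: n / total for label, n in label_counts.items()}
    return rates


# Hypothetical annotations of people in hospital scrubs (made-up data).
sample = [
    ("woman", "nurse"), ("woman", "nurse"), ("woman", "nurse"), ("woman", "doctor"),
    ("man", "doctor"), ("man", "doctor"), ("man", "doctor"), ("man", "nurse"),
]

rates = label_rates_by_group(sample)
for group, shares in sorted(rates.items()):
    print(group, shares)
```

A large gap between groups for the same label does not prove the labels are wrong, but it is a signal that someone should review a sample of the images by hand.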

Google Translate showed its bias when translating from Hungarian. Hungarian has only a gender-neutral pronoun, but the English translation was given a gendered one. The original gender-neutral phrases became "she does the dishes" and "he reads" in English.

As CIO or CTO, you need to make sure somebody ensures the quality of the data you use to train your machine learning algorithms. If you don't have a Chief Data Officer, maybe you have a Data Protection Officer who could reasonably be given this purview. But you cannot foist this responsibility on individual development teams under deadline pressure. It is your responsibility to ensure that any machine learning system is learning from clean, unbiased data. 

Beneficial Intelligence is a weekly podcast with stories and pragmatic advice for CIOs, CTOs, and other IT leaders. To get in touch, please contact me at sten@vesterli.com

In Channel

People Shortage (2021-11-26, 05:43)
Data Hoarding (2021-10-29, 07:29)
Monoculture (2021-10-15, 09:04)
Trust, but Verify (2021-10-01, 09:34)
Time to Recover (2021-09-17, 08:28)
Goal Fixation (2021-09-03, 09:10)
Narrow Focus (2021-08-20, 08:28)
Back to the Office (2021-08-06, 08:38)
Humans and Computers (2021-07-23, 06:42)
Competition (2021-07-09, 10:18)
Pseudo-Security (2021-06-25, 07:53)
Good Enough (2021-06-18, 07:55)
Unnecessary Roadblocks (2021-06-04, 09:08)
Expectation Management (2021-05-28, 07:50)
Gaming the Metrics (2021-05-07, 10:31)
Accidental Publication (2021-04-30, 07:55)
Irrational Optimism (2021-04-23, 08:05)
Risk Aversion (2021-04-16, 05:23)
Biased Data (2021-04-09, 07:29)

Sten Vesterli