Representing skin tone, or Google’s hubris versus the simplicity of Crayola

Google wants to “help computers ‘see’ our world”, and one of the ways they say they are battling the biases that current AI and machine learning systems perpetuate is by introducing a more inclusive skin tone scale, the ‘Monk Skin Tone Scale’.

Watch the video to see how Google describes the relevance of their scale:

And here it is, all of the variety of human skin tone reduced to ten ‘orbs’:

10 orbs with different skin tone colours running from light to dark

Google launched the scale in May 2022 and touts this set of colours as a way to create a fairer version of AI. Seems like an important project, right? So how did they make the scale? Was it a collaborative effort between multiple scientists, with lots of testing and validating? Nope.

Instead, they engaged Ellis Monk, a sociologist at Harvard who usually does research into how skin colour affects social stratification. He came up with a scale, for which Google then turned him into an eponym. Nowhere on Google’s website is there any scientific proof that this scale is a sufficient representation of human skin tones, nor is it clear why Monk would actually be qualified to produce the scale.

So it shouldn’t come as a surprise to learn that Google’s scale does not represent the actual diversity in skin tone that exists. The Maryland Test Facility showed this clearly in research presented at the International Face Performance Conference in 2022.

The facility tests the quality of face detection and identification systems to be used at the U.S. border. What is unique about their approach is that they use a device to measure the actual skin tone of the participants in the study. This is the population that they measured (in the U.S.):

Many different skin tones distributed over two axes: Lightness and Hue

It is as Astrid Roemer – one of the finest writers in the Dutch language – says, we are all red-skins (in the literal rather than the pejorative sense):

Many different skin tones placed in a chroma wheel, showing that all the skin tones fall somewhere on the red spectrum

The researchers also asked people to self-identify their race. The results show that people who identify as Black or African-American have a much wider spread in ‘lightness’ than people who identify as White:

Many different skin tones distributed over two axes, Lightness and Hue, with the people who identify as Black in a wide circle and the people who identify as White in a narrower circle

The researchers then go on to check how well people’s actual skin tones match the reference colours of the tools that are used to calibrate the colour perception of cameras. Here, for example, is the mismatch with the X-Rite ColorChecker® Digital SG, which is marketed as having “additional skin-tone reference colors” that “deliver greater accuracy and consistency over a wide variety of skin tones”:

Shows how the colours from the X-Rite calibration tool have sparse, incomplete, and uneven coverage of real skin tones
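To make “coverage” a little more concrete, here is a minimal sketch of how such a mismatch could be quantified. It is not the Maryland Test Facility’s method. It assumes that measured skin tones and reference colours are both expressed as CIELAB triples (L*, a*, b*), takes hue as the angle of (a*, b*), and uses the simple CIE76 colour difference (plain Euclidean distance). The function names are hypothetical, and the numbers are made-up placeholders, not real measurements or real scale colours.

```python
import math

def lab_to_lightness_hue(L, a, b):
    """Return (lightness, hue angle in degrees, chroma) for a CIELAB colour."""
    hue = math.degrees(math.atan2(b, a)) % 360.0
    chroma = math.hypot(a, b)
    return L, hue, chroma

def delta_e76(c1, c2):
    """CIE76 colour difference: Euclidean distance between two CIELAB colours."""
    return math.dist(c1, c2)

def coverage_gaps(measured, references):
    """For each measured tone, the distance to its nearest reference colour."""
    return [min(delta_e76(m, r) for r in references) for m in measured]

# Placeholder values for illustration only (assumption: NOT real data).
measured = [(70.0, 10.0, 17.0), (45.0, 14.0, 20.0), (30.0, 10.0, 12.0)]
references = [(70.0, 10.0, 17.0), (40.0, 14.0, 20.0)]

gaps = coverage_gaps(measured, references)
for tone, gap in zip(measured, gaps):
    lightness, hue, _ = lab_to_lightness_hue(*tone)
    print(f"L*={lightness:.0f} hue={hue:.0f}° nearest reference is {gap:.1f} away")
```

A scale with good coverage would keep every one of these gaps small. Large gaps at one end of the lightness axis would indicate the kind of sparse, uneven coverage described above.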

And this is how Google’s Monk Skin Tone Scale fails:

How can Google’s scale be so terrible? Because it was never Google’s intention to truly represent people. Instead, their actual purpose is to increase the legibility of the world for their AI systems. They need consistent training of their models, and for that they need consistent labeling of skin colour (mostly done by people of color in the Global South). They say as much on their own website:

Scales for makeup can include upwards of 40 skin tones, but larger scales like these can be challenging for ML use cases, because of the difficulty of applying that many tones consistently across a wide variety of content, while maintaining statistical significance in evaluations. For example, it can become difficult for human annotators to differentiate subtle variation in skin tone in images captured in poor lighting conditions.

The Monk Skin Tone Scale should probably be seen as a tool to strengthen Google’s dominance in the AI field, by setting the standard for how ‘fairness’ should be measured. But it is highly unlikely that it will help solve the persistent inability to detect darker faces in the real world, which is partially the result of badly calibrated (web)cameras.

The Maryland Test Facility’s researchers finished their presentation at the conference by describing how – according to their measurements – Crayola has actually managed to create a product that represents the true spectrum of human skin colour. Their Colors of the World line of crayons, markers and pencils beats Google and Monk at representation with ease:

A box of 24 Crayola crayons, in skin tone colours

Let us never forget, though, that even if a better representation of skin tone could be achieved, it would still be no guarantee that technology would treat everyone equally.

See: Skin Tone Research @ Google.