Hi, I'm Joy, and I research how computers detect, recognize, and classify people's faces. In my TED Featured Talk, I spoke about my experience with the coded gaze, my term for algorithmic bias. The system I was using worked well on my lighter-skinned friend's face, but when it came to detecting my face, it didn't do so well, until I put on a white mask.

After my talk was posted, I tested my speaker profile image across different facial analysis demos. Two of the demos didn't detect my face. The other two? Well, they misgendered me. The demos didn't even distinguish between gender identity and biological sex; they just provided two labels, male and female.

Now I wanted to see if these results were just because of my unique facial features, or if this was a pattern across other faces, too. So I began a project that became my MIT thesis: Gender Shades, or by its full title, "Gender Shades: Intersectional Phenotypic and Demographic Evaluation of Face Datasets and Gender Classifiers." I wanted to see how well different gender classification systems worked across different people's faces, and whether the results changed based on somebody's gender or skin type.

I created a dataset of over a thousand images of parliament members from nations ranked among the top ten in the world for their representation of women in power. To get at a range of skin types, I chose three African countries and three European countries, so I could see how each system performed on lighter skin and darker skin. Then I chose three companies to evaluate: IBM, Microsoft, and Face++, which has access to one of the largest datasets of Chinese faces. With the dataset and the companies in place, I ran the test.

Overall, the companies appeared to have relatively high accuracy. Microsoft performed best, achieving 94% accuracy on the whole dataset. But all companies performed better on males than on females, and all companies performed better on lighter subjects than on darker subjects.

When we analyzed the results by four intersectional subgroups, we saw that all companies performed worst on darker females. IBM and Microsoft performed best on lighter males, while Face++ performed best on darker males. IBM had the largest gap in accuracy, with a 34 percentage point difference in error rates between lighter males and darker females. I was surprised to see multiple commercial products failing on over one in three women of color. In fact, as we tested women with darker and darker skin, the chances of being correctly gendered came close to a coin toss.

While more research is needed to explore the specific reasons for these accuracy differences, one general issue is the lack of diversity in training images and benchmark datasets. And reporting only aggregate accuracy, instead of disaggregating results by traits like gender and skin type, makes it harder to spot these differences in the first place; a sketch of such disaggregated scoring follows below.

Companies should do better with commercially sold products, especially since the machine learning techniques that made gender classification possible are applied to other domains of computer vision, like facial recognition, and to other areas of artificial intelligence, like predictive analytics. Predictive systems can help determine who is hired, who is granted a loan, or what information a particular individual sees. These data-centric technologies are vulnerable to bias and abuse. As a result, we must demand more transparency and more accountability. We have entered the age of automation overconfident, yet underprepared.
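To make that disaggregated evaluation concrete, here is a minimal Python sketch of per-subgroup accuracy scoring. This is not the Gender Shades code; the record fields and the toy numbers are hypothetical, standing in for a benchmark labeled by binary gender and by lighter versus darker skin type (Fitzpatrick I-III versus IV-VI).

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Accuracy per (skin_type, gender) subgroup.

    records: iterable of dicts with hypothetical keys
    'skin_type', 'gender', and 'predicted_gender'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = (r["skin_type"], r["gender"])
        total[group] += 1
        if r["predicted_gender"] == r["gender"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

# Toy data: aggregate accuracy looks high (93%), yet the
# darker-female subgroup is misclassified far more often.
records = (
    [{"skin_type": "lighter", "gender": "male", "predicted_gender": "male"}] * 99
    + [{"skin_type": "lighter", "gender": "male", "predicted_gender": "female"}]
    + [{"skin_type": "darker", "gender": "female", "predicted_gender": "female"}] * 13
    + [{"skin_type": "darker", "gender": "female", "predicted_gender": "male"}] * 7
)

overall = sum(r["predicted_gender"] == r["gender"] for r in records) / len(records)
print(f"overall: {overall:.0%}")  # 93%
for group, acc in subgroup_accuracy(records).items():
    print(group, f"{acc:.0%}")  # ('lighter', 'male') 99%, ('darker', 'female') 65%
```

Scoring each subgroup separately, rather than averaging over the whole benchmark, is what surfaces the gap in the first place.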
If we fail to make ethical and inclusive artificial intelligence, we risk losing gains made in civil rights and gender equity under the guise of machine neutrality. The coded gaze reflected in the Gender Shades project must be faced.