Robot Bias - Do We Need New Laws?


Bias in NLP-based chat moderation


Natural Language Processing, or NLP, seeks to have machines converse with people, much like the automated telephone systems we are all so familiar with. NLP aims to eliminate the "press or say one" prompts and replace them with a text chat session that is, well, more natural.

NLP has also expanded into the role of robot chat moderator. The moderator's function is to weed out "toxic" statements. As defined by The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), a "toxic" statement is one that is "rude, disrespectful or otherwise likely to make someone leave a discussion."


The Conversation AI team [goes on to explain] the problem of Hidden Bias. "When [our team] first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. 'gay'), even when those comments were not actually toxic (such as 'I am a gay woman'). This happens because training data was pulled from available sources where unfortunately, certain identities are overwhelmingly referred to in offensive ways. Training a model from data with these imbalances risks simply mirroring those biases back to users."

A Recently Proposed Law


In April of 2019 legislators in the United States introduced a bill called the "Algorithmic Accountability Act of 2019" (AAA19).

One of many news articles about the bill can be found here: [https://techcrunch.com/2019/04/10/algorithmic-accountability-act/]

Details of the proposal


(Disclaimer: these are my interpretations only; please read the bill for yourself.) The proposal centers on requiring an Impact Assessment for every automated decision system that falls under the new law, and not every such system would fall under it.
  • Impact Assessment - ... a study evaluating an automated decision system and ... the development process, including the design and training data ... for impacts on accuracy, fairness, bias, discrimination, privacy, and security.
  • The above also includes a provision for consumers to have access to the results ... and to correct or object to them.
  • Covered by this law - Only systems run by a person or company making more than 50M US Dollars/year, or having more than 1M users, or in the business of data analysis
  • High Risk systems are also targeted - the definition covers four major areas, labeled A through D below, that pose a significant risk
  • High Risk A - privacy or security of personal information and/or contributing to inaccurate, unfair, biased, or discriminatory decisions.
  • High Risk B - extensive evaluation of people's work performance, economic situation, health, preferences, behavior, location, or movements.
  • High Risk C - information about race, color, national origin, political opinions, religion, trade union membership, genetic data, biometric data, health, gender, gender identity, sexuality, sexual orientation, criminal convictions, or arrests.
  • High Risk D - uses systematic monitoring of a large publicly accessible physical place


Quite a bit of ground! Let's see how it might apply to a real world example.

A real world example

We can examine a recently curated data set. The data comes from over 1.8 million comment texts entered by the public. Each comment was rated by a panel of experts with regard to its toxicity, and each comment was also categorized as to whether or not it discussed one of the aforementioned identities. The graphic below was created by going through every word of every comment and building two word-frequency lists, both restricted to comments on the topic of Homosexual, Gay, or Lesbian. The first list came from comments a majority of the experts rated as toxic, the second from comments rated not toxic. By "subtracting" one list from the other, we get the words that are on the not-toxic list but NOT on the toxic list.
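To make the "subtraction" concrete, here is a minimal sketch of how such lists could be built with pandas. The file name and the column names ("comment_text", "target", and "homosexual_gay_or_lesbian") are my assumptions about the curated data set, not a guarantee of its exact schema.

```python
# Sketch only: build two word-frequency lists for comments on one identity
# topic and "subtract" one from the other. Column names are assumptions.
from collections import Counter

import pandas as pd

df = pd.read_csv("train.csv")  # assumed path to the curated comment data

# Keep only comments the raters associated with the identity subgroup.
on_topic = df[df["homosexual_gay_or_lesbian"] >= 0.5]

# Split that subset by the majority toxicity rating (target >= 0.5 means toxic).
toxic_comments = on_topic[on_topic["target"] >= 0.5]["comment_text"]
non_toxic_comments = on_topic[on_topic["target"] < 0.5]["comment_text"]

def word_counts(comments):
    """Naive word-frequency list: lowercase, split on whitespace."""
    counts = Counter()
    for text in comments:
        counts.update(text.lower().split())
    return counts

toxic_counts = word_counts(toxic_comments)
non_toxic_counts = word_counts(non_toxic_comments)

# "Subtract": words that appear in the non-toxic list but NOT in the toxic list.
only_non_toxic = {word: count for word, count in non_toxic_counts.items()
                  if word not in toxic_counts}
```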

Word Cloud


[Word cloud image: words found in (TOXIC = false) and (HGL = true) comments but not in (TOXIC = true) and (HGL = true) comments]

Notice how the words indicating the topic seem to be missing, and therein lies the difficulty of hidden bias in NLP. NLP must therefore go beyond individual words, move towards sequences of words, and determine through machine learning techniques which comments are toxic, regardless of their topic.


Implications of the Proposed Law AAA19

  • If we consider the data set used for training - does it really contain enough information that the words in the word cloud represent words ONLY TYPICALLY used in non-Toxic Comment Texts relating to the topic in general, or do they instead point to a possible problem under situation "High Risk C" - information about race, color, national origin, political opinions, religion, trade union membership, genetic data, biometric data, health, gender, gender identity, sexuality, sexual orientation, criminal convictions, or arrests?
  • How might we scrub such information so that the word cloud only contains words typically used in non-Toxic Comment Texts relating to the topic in general, and that do not contain unnecessary words?
  • If this result were examined by a lawmaker, how might they perceive this particular data set with regard to impacts on accuracy, fairness, bias, discrimination, privacy, and security?
If you'd like to see many more examples of word clouds, grouped into pairs for comparison, feel free to look at my Python software, which has already been run to show its results and which you can modify if you are so inclined.

Additional Details about NLP


Word Cloud Visualizations


The word cloud [sometimes called a Tag Cloud or Weighted List] is a popular data visualization mechanism and is the one used in this post.

Preprocessing



  • The data is first subdivided by criteria (ex: Toxic is either True or False, based on the "target" column being >= 0.5 or not, respectively).
  • After gathering the desired subset, we divide sentences into individual words.
  • Each word is then converted to lower case, and "english" [nltk stopwords] are removed. Some example stopwords that are removed include: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
  • Only [lemmatized] (or root-form) alphabetic words are used. For example, the lemmatized form of "leaves" is "leaf".
  • Once a subset of words is so created, it is converted into a dictionary, which maps each word to its respective count.
  • Finally, two dictionaries are compared to find words that are present in one dictionary but not the other.
Only then can we create a word cloud.
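As an illustration, here is a minimal sketch of those preprocessing steps using NLTK. The function names are my own, and the linked notebook may implement the details differently.

```python
# Sketch only: lowercase, drop stopwords and non-alphabetic tokens,
# lemmatize, and build a "dictionary" (word -> count) per subset.
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def comment_to_words(text):
    """Lowercase the comment, drop stopwords and non-alphabetic tokens, lemmatize."""
    words = []
    for token in text.lower().split():
        if token.isalpha() and token not in STOPWORDS:
            words.append(LEMMATIZER.lemmatize(token))  # e.g. "leaves" -> "leaf"
    return words

def build_dictionary(comments):
    """Word counts for an iterable of comment texts."""
    counts = Counter()
    for text in comments:
        counts.update(comment_to_words(text))
    return counts

def subtract(dict_a, dict_b):
    """Words present in dict_a but not in dict_b, with their counts."""
    return {word: count for word, count in dict_a.items() if word not in dict_b}
```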


Implications on NLP software programming


When writing a software program, or creating a chat moderator robot, what we really care about most is "Accuracy."

Specifically, we want to look at issues of hidden bias and think of how they might be improved.

Take a look at the total word counts. Even though we have more than 1.8 million comment text examples, when we narrow down to topics and toxicity, our training set becomes much smaller. Do we have enough slack to cull from the dictionaries and still classify with consistent accuracy?
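A quick way to see how small these subsets get is simply to count them. This sketch assumes the same data file and column names as the earlier pandas sketch.

```python
# Sketch only: count how many examples survive each narrowing step.
import pandas as pd

df = pd.read_csv("train.csv")  # assumed path, as before

on_topic = df[df["homosexual_gay_or_lesbian"] >= 0.5]
is_toxic = on_topic["target"] >= 0.5

print("total comments:      ", len(df))
print("on-topic comments:   ", len(on_topic))
print("  of which toxic:    ", int(is_toxic.sum()))
print("  of which non-toxic:", int((~is_toxic).sum()))
```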

Word Cloud Pairs


With the word cloud above, we talked about overall accuracy (true versus false when categorizing toxicity). To remove hidden bias, we must also examine accuracy in specific cases, such as "Subgroup positive and negative," "BPSN (Background Positive, Subgroup Negative) AUC," and "BNSP (Background Negative, Subgroup Positive) AUC."

To do this, it is useful to examine word cloud pairs.

For example, in the case of the "Subgroup positive and negative" we restrict the data set to only the examples that mention the specific identity subgroup. A low accuracy value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.
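These metric definitions come from the Conversation AI benchmark write-up; the sketch below shows one way they could be computed with scikit-learn. The arrays y_true (toxic or not), subgroup (mentions the identity or not), and y_pred (model scores) are my own placeholder names.

```python
# Sketch only: per-identity bias metrics computed as restricted AUCs.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(y_true, y_pred, subgroup):
    """AUC restricted to comments that mention the identity subgroup."""
    return roc_auc_score(y_true[subgroup], y_pred[subgroup])

def bpsn_auc(y_true, y_pred, subgroup):
    """Background Positive, Subgroup Negative: non-toxic comments that
    mention the identity plus toxic comments that do not."""
    mask = (subgroup & ~y_true) | (~subgroup & y_true)
    return roc_auc_score(y_true[mask], y_pred[mask])

def bnsp_auc(y_true, y_pred, subgroup):
    """Background Negative, Subgroup Positive: toxic comments that
    mention the identity plus non-toxic comments that do not."""
    mask = (subgroup & y_true) | (~subgroup & ~y_true)
    return roc_auc_score(y_true[mask], y_pred[mask])

# Toy example with made-up labels and scores:
y_true = np.array([True, False, True, False])
subgroup = np.array([True, True, False, False])
y_pred = np.array([0.9, 0.2, 0.7, 0.1])
print(subgroup_auc(y_true, y_pred, subgroup),
      bpsn_auc(y_true, y_pred, subgroup),
      bnsp_auc(y_true, y_pred, subgroup))
```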

Instead of just looking at the word cloud for "Words found in (TOXIC = false) and (HGL = true) but not in (TOXIC = true) and (HGL = true)," as was done above, we might also want to look at the word cloud for "Words found in (TOXIC = true) and (HGL = true) but not in (TOXIC = false) and (HGL = true)." We can then compare these two word clouds and look for hidden bias.
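Here is one way such a pair might be drawn side by side using the wordcloud and matplotlib packages, assuming the two subtracted dictionaries (word-to-count mappings) have already been built as sketched earlier.

```python
# Sketch only: render a pair of subtracted dictionaries as side-by-side clouds.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def show_pair(only_non_toxic, only_toxic):
    """Draw two word clouds, one per subtracted dictionary."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    panels = [
        (axes[0], only_non_toxic, "non-toxic but not toxic (HGL topic)"),
        (axes[1], only_toxic, "toxic but not non-toxic (HGL topic)"),
    ]
    for ax, counts, title in panels:
        cloud = WordCloud(background_color="white").generate_from_frequencies(counts)
        ax.imshow(cloud, interpolation="bilinear")
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```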

What to look for in the word cloud pairs


First, you should perform an overall validity check. Look at the pair of word clouds. Could you, as a human reader, pick which cloud was which from the pair of subdivided data? Many times the answer is "yes" but some pairs are more obvious than others.

Next, look at the specifics of the pairs:
  • Are the words in the word clouds, especially the larger-font words, really representative of words that you would expect to see only in one subset and not the other? Why or why not?
  • If you really don't like a bunch of the words, imagine that you (or machine learning) remove them from the list. Do the word counts leave enough remaining words to do the classification?
  • What other techniques might you use besides these "subtracted dictionaries" to subdivide the space more accurately?
  • LSTM and other recurrent, time-varying models focus on the sequence of words, not just the individual words. Even a Hidden Markov Model could be used as a probability sequencer. How might you sanity check the "sequence dictionaries," just as the word clouds helped you sanity check the word subsets? (See the sketch after this list.)
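One simple sanity check for "sequence dictionaries," as suggested in the last bullet, is to repeat the same subtraction trick with bigrams (adjacent word pairs) instead of single words. The sketch below assumes each comment has already been tokenized into a list of words.

```python
# Sketch only: bigram "sequence dictionaries" built and subtracted the same
# way as the single-word dictionaries above.
from collections import Counter

def build_bigram_dictionary(tokenized_comments):
    """Counts of adjacent word pairs; expects each comment as a list of words."""
    counts = Counter()
    for words in tokenized_comments:
        counts.update(zip(words, words[1:]))
    return counts

def subtract_bigrams(dict_a, dict_b):
    """Bigrams present in dict_a but not in dict_b, with their counts."""
    return {pair: count for pair, count in dict_a.items() if pair not in dict_b}
```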
Want to see examples of these word cloud pairs? Here is the software to do this, as well as the output the software creates:


[https://www.kaggle.com/pnussbaum/benchmark-kernel-and-aaa19-v06]
