Identifying emotions might seem like the easiest thing that speech to text software has to do. We can often tell, just by listening, whether people are angry or happy, surprised or bored. This is just one of many ways we intuitively use and interpret language. However, the intuition we rely on is shaped by our own perspectives and informed by what we personally believe about how language works.
We’ve talked about the challenges that arise from handling spontaneous human speech that is complex and variable (see, for example, Catherine, Gabriel and Neil’s blogs). One area of concern, and the one I want to talk about in this blog, is the bias that informs our perceptions, arising from the differing associations and assumptions we have about the world.
We all bring our own unique perspectives and experiences to a range of activities, including the work that we do. On one hand, it’s why diversity is a strength: it ensures that a variety of perspectives is included. On the other hand, in the absence of diversity, it’s what can perpetuate limited ways of thinking.
If one thing is for sure, it’s that bias is inevitable and permeates all aspects of our society, including technology. It’s an especially prevalent problem in technology like ours, which is built on artificial intelligence. It doesn’t take much of a Google search to find examples: here’s one from Fortune, one from the New York Times, and one from PBS. And that’s just from the first page of results!
There’s a potential danger in the way that technology is viewed. We sometimes think that because a machine or a piece of software said something, it must be more objective or correct than if it had come from a human. But that’s just not how AI technology works: it learns from the humans who train it. Thousands and thousands of pieces of data are gathered together and labelled by humans. If the choice of data was biased, then the AI will produce biased results. If the labelling of data was biased, then, again, the AI will produce biased results.
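To make that concrete, here’s a minimal sketch, in Python, of the kind of sanity check that can surface skew before training. The data, the speaker_group field and the label set are illustrative assumptions of mine, not a description of any real pipeline:

```python
from collections import Counter, defaultdict

# Hypothetical labelled utterances; in a real pipeline these would come from
# an annotation tool, with whatever metadata it actually records.
labelled_data = [
    {"speaker_group": "A", "emotion": "negative"},
    {"speaker_group": "A", "emotion": "negative"},
    {"speaker_group": "A", "emotion": "neutral"},
    {"speaker_group": "B", "emotion": "positive"},
    {"speaker_group": "B", "emotion": "neutral"},
    {"speaker_group": "B", "emotion": "neutral"},
]

def label_shares_by_group(data):
    """For each speaker group, compute the share of each emotion label."""
    counts = defaultdict(Counter)
    for item in data:
        counts[item["speaker_group"]][item["emotion"]] += 1
    return {
        group: {label: n / sum(dist.values()) for label, n in dist.items()}
        for group, dist in counts.items()
    }

# A lopsided distribution doesn't prove bias, but it is a prompt to ask
# whether the data selection or the labelling skewed one way before training.
for group, shares in label_shares_by_group(labelled_data).items():
    print(group, {label: round(share, 2) for label, share in shares.items()})
```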
What about the bias of the humans whose role is to label the data, given that they themselves bring differing and unique perspectives? There is a potential risk in labelling data for emotion if the humans doing the annotating have their own biases about language. Identifying emotions is much harder than it initially seems!
Let me point out that not all bias is bad. Some selectivity is actually needed for the sake of accuracy. In building a language model for a particular customer, we may need to favor a particular language variety (also sometimes called a “dialect”), which is a form of bias. That is, we label the data to make sure that the transcript correctly identifies words in that variety; and, in the process, the language model will not be as accurate for other varieties.
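As a rough illustration of what that deliberate, documented bias can look like (the corpus entries and the variety tag here are hypothetical, purely for the sake of the example), selecting training data for one variety necessarily leaves other varieties under-represented:

```python
# Hypothetical transcribed utterances tagged with the language variety they represent.
corpus = [
    {"text": "pop it in the boot", "variety": "uk_england"},
    {"text": "y'all come back now", "variety": "us_southern"},
    {"text": "she'll be right, mate", "variety": "au_general"},
]

def select_for_variety(corpus, target_variety):
    """Deliberate bias: keep only the target variety's data for training."""
    selected = [item for item in corpus if item["variety"] == target_variety]
    excluded = [item for item in corpus if item["variety"] != target_variety]
    return selected, excluded

selected, excluded = select_for_variety(corpus, "uk_england")
print(f"training on {len(selected)} utterances; "
      f"accuracy will likely drop for the {len(excluded)} excluded varieties")
```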
However, there are forms of bias that are potentially harmful. Bias can have a harmful impact even if there is no malicious intent because it can be implicit in our beliefs and actions. Much of what we have learned about the world is so ingrained that we don’t question the associations we have. Language is certainly an area where these biases come into play, and it is this language bias (or sociolinguistic bias) that poses a risk to the transcription and data labelling processes.
For example, when we identify the emotion of a particular person, based on their speech, we don’t come to the situation as a blank slate with no presuppositions or preconceived ideas. We have a whole huge set of beliefs — some of which we know about, others of which we don’t — which influence how we evaluate the other person’s emotional state.
Consider whether we think of someone’s tone as being rude or playful. Rudeness and humor are heavily dependent on culture and context, shaped by the communities we belong to: both our geographic communities (i.e., where we grew up and where we live) and our social communities (i.e., the groups we identify with because of common beliefs or practices).
Whether an utterance is rude thus depends on whether you, as the person hearing it, think of that kind of phrase, said in that kind of way, as rude. Rudeness isn’t some objective property that can be read off the volume and pitch of the voice, or any other linguistic feature. It would be a mistake to draw a one-to-one correspondence between linguistic form and social meaning. Rudeness depends on social and cultural context, which means that bias can be introduced.
The unfortunate reality is that we can’t entirely eliminate this kind of bias. It is built into the way that language works and influences the way that we interpret expressions. But recognizing that the bias exists, whether explicit or implicit, means we have a responsibility to address it. So how can we account for it, control it, and minimize its impact on the transcripts that speech to text software creates?
It can’t be solved with a set of rules. Guidelines are a very useful device for focusing attention on the relevant characteristics and how to evaluate them. But rules create the impression that there’s a clear answer, in every case, about whether the emotion expressed in speech is positive or negative. And that’s probably not true.
It also can’t be solved by simply creating bureaucratic processes, where someone in a leadership position gets to decide what the emotion is. Experience definitely matters. The more I listen to a particular caller or a particular agent, the easier it gets to figure out whether the emotion is positive, negative or neutral. And the more I work on emotion labelling, the easier it gets. But having more experience doesn’t mean having any less bias.
The best method is to get out of your own individual perspective. Have multiple people working on data labelling and, when there’s a divergence of opinion, discuss it openly. It may be that there are features that one person has overlooked, and consensus can be reached. Or, it may be that this is a genuinely ambiguous case, and we can’t say definitively that the emotion is positive or negative.
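Here’s a minimal sketch of what that could look like in an annotation workflow; the labels, the annotator count and the agreement threshold are illustrative assumptions, not a description of our actual tooling. Each utterance is labelled by several people, a label is accepted only when there is clear agreement, and everything else is queued for open discussion:

```python
from collections import Counter

# Hypothetical: three annotators' emotion labels for the same utterances.
annotations = {
    "utt_001": ["negative", "negative", "negative"],
    "utt_002": ["positive", "neutral", "positive"],
    "utt_003": ["negative", "neutral", "positive"],  # genuinely ambiguous?
}

def resolve(labels, min_agreement=2 / 3):
    """Return the majority label if enough annotators agree;
    otherwise flag the utterance for group discussion."""
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_agreement:
        return label, False
    return None, True

for utt_id, labels in annotations.items():
    label, needs_discussion = resolve(labels)
    if needs_discussion:
        print(f"{utt_id}: no consensus {labels} -> discuss openly")
    else:
        print(f"{utt_id}: agreed label '{label}'")
```

The flagged items aren’t mistakes to be overridden by whoever is most senior; they are exactly the cases worth talking through, and some of them may simply remain ambiguous.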
This process will eventually break down if the group gets too big. The environment needs to be one where differences of opinion can be discussed quickly and openly, which is challenging in very large groups. It will also break down if the group is too small, as there won’t be enough diversity of opinion to help control bias. But being open to the possibility of bias, and to discussing it and working through it, is the best method we’ve found for labelling emotion as accurately as possible.
Overall, there’s no technological solution to bias. The solution isn’t in tech but rather in society. And if we can socialize the problem of bias — in emotion labelling, and other areas of technology — starting with talking about it, and bringing in robust insights outside of technology (e.g., from the humanities and social sciences), we can continue to make progress in controlling it.