Lend Me Your Voice
Season 7: Episode 4
Big Tech’s power over language means power over people. Bridget Todd talks to AI community leaders paving the way for open voice tech in their own languages and dialects.
In this episode: AI builders and researchers in the US, Kenya and New Zealand who say the languages computers learn to recognize today will be the ones that survive tomorrow, as long as communities and local startups can defend their data rights from big AI companies.
Halcyon Lawrence was a researcher of information design at Towson University in Maryland (via Trinidad and Tobago) who did everything Alexa told her to for a year.*
Keoni Mahelona is a leader of Indigenous data rights and chief technology officer of Te Hiku Media, a Māori community media network with 21 local radio stations in New Zealand.
Kathleen Siminyu is an AI grassroots community leader in Kenya and a machine learning fellow with Mozilla’s Common Voice working on Kiswahili voice projects.
IRL: Online Life is Real Life is an original podcast from Mozilla, the non-profit behind Firefox. In Season 7, host Bridget Todd talks to AI builders who put people ahead of profit.
*Sadly, following the recording of this episode, Dr. Halcyon Lawrence passed away. We are glad to have met her and pay tribute to her legacy as a researcher and educator. Thank you, Halcyon.
Bridget Todd: Hey Siri, play IRL Podcast.
Siri: Here’s the podcast, IRL: Online Life is Real Life.
Bridget Todd: Lots of us use virtual assistants. They’re part of our everyday lives. We use them to check the weather or the time. Or if you’re me, you might be like, “Hey, Siri, play Beyoncé.” But speech recognition systems don’t work equally well for everyone. They don’t even exist for many languages. Big Tech has stepped up to offer more diversity in their language models for speech and more, but it comes with a new set of problems.
Keoni Mahelona: How do I feel about Big Tech sort of paying attention to our marginalized or Indigenous languages? I guess the first thing I wonder is, why. Why do they care now? Do they genuinely care to ensure inclusivity online or did they finally realize that being more inclusive is better for them, and their bottom lines?
Bridget Todd: That’s Keoni Mahelona in New Zealand. We’ll hear more from him in a bit. In this episode, we meet technology builders who are reclaiming speech recognition with and for their own language communities.
This is IRL, an original podcast from Mozilla, the non-profit behind Firefox. I’m Bridget Todd. This season we meet people who are building artificial intelligence that puts people over profit. First, let’s make a stop in the US. We’re in Maryland, not far from where I live.
Halcyon Lawrence: I spent a year with Alexa and I allowed the device to do whatever the device heard me say.
Bridget Todd: This is Halcyon Lawrence. She’s an assistant professor of technical communication and information design at Towson University. Three years ago, she conducted an experiment with Amazon’s home assistant, Alexa, which is pretty popular in the US.
Halcyon Lawrence: So for example, I would ask, can you set a 5:30 alarm and the device would hear 5:50. And so I would just wake up at 5:50. I wanted to push and see, what is the level of inconvenience, right, that this device would allow me to do.
Bridget Todd: Halcyon grew up in Trinidad and Tobago. While Caribbean accents can still throw off voice tech by US companies, the tech has improved so much that it altered the focus of Halcyon’s research.
Bridget Todd: So why is it important for technology to be able to understand us?
Halcyon Lawrence: Well, I think, this is where it sort of speaks to the convenience and the question that arises is, convenient for whom? You know, the kinds of interactions that I have with most speech devices, like personal assistants, if they don’t understand me, it’s often very comical and maybe a minor inconvenience. And so that’s sort of part of the thesis. But let’s scale up, because these speech recognition devices are being deployed in a number of other spaces. So in the US, for example, they’re increasingly being used to automatically transcribe court recordings. They’re being used as aggression detectors in prisons, as well as schools. And so, you can well imagine these are spaces where being misheard or misunderstood can have deadly consequences.
Bridget Todd: Language, and how people speak, can be a really important marker of power and class. Halcyon says forcing people to speak a certain language, or a standardized version of a language, is one way colonial powers dominated people in her region and worldwide. She sees parallels in how digital technology pushes people to speak in certain ways just to be understood.
Halcyon Lawrence: One of the things that concerns me is the expectation that you speak with a standard accent, whether it be standard English or standard French, or any sort of standard language, suggests that anybody who does not speak with that standard accent, is misheard or misunderstood. And these are our vulnerable populations who turn up in spaces like prisons and courts of law where they need to be heard and understood accurately. So you know, it’s as important as asking the question why we need to be heard or understood in person, is no less important in the digital space.
Bridget Todd: So Halcyon, are there ways that you think that technology can be designed differently so that folks who maybe don’t speak North American or British English can be understood?
Halcyon Lawrence: So your question hits upon past me and current me. Past me when I started doing this research, the easy answer would’ve been yes, we need more representation in these devices. Right? If I can hear and be heard with a Trinidadian accent, surely that would solve the problem.
Bridget Todd: But recently, on a trip home, she was reminded how language is also used as resistance. For instance, by speaking in ways that cannot be understood by oppressors.
Halcyon Lawrence: I started visiting with friends, and I had forgotten how we have also used language to subvert colonial authority, that other kinds of dialects have emerged, that Patois has emerged as a way of subverting. And so the question then arises, what does it mean to give organizations access to that kind of voice data? What kind of power are we handing over if I am advocating for greater representation of languages and dialects and accents? And so I am in a bit of a conundrum right now thinking about the kind of research that I do, but more importantly, thinking about what I advocate for.
Bridget Todd: Let’s head to New Zealand. That’s the sound of the local radio station for the Indigenous Māori community in Kaitaia.
Keoni Mahelona: Te Hiku Radio is the community voice. Every day we speak to people within the community to tell us about everything, whether it’s to talk to us about the climate, the weather, or to talk to us about what sorts of foods are in season, in terms of hunting and gathering, or fishing, and what’s going on in politics or our health system, or you know, data sovereignty and artificial intelligence.
Bridget Todd: That’s Keoni Mahelona. He’s the chief technology officer of Te Hiku Media. That’s a Māori community media network with 21 local radio stations. It’s been around since the 1990s. Since 2014, Keoni, who is Hawaiian, and his partner Peter-Lucas Jones, who is Māori, have used the internet – and more recently AI – in their efforts to reverse the decline of the Māori Language, te reo Māori. Under colonial rule, speaking the language was forbidden. Now, it’s an official language of New Zealand.
Keoni Mahelona: Speech recognition is just a tool. These AI models are just a tool that enable us to do what we need to do. You know, the mission of our organization is about language revitalization and language promotion and cultural restoration, and promoting te reo Māori and the culture of Māori. So how we do that at our organization is we, we tell stories. We tell stories on the radio, we tell stories through video. We tell stories through live broadcasting. [Sound of broadcast] But we’ve been telling stories for more than 35 years. And a lot of those stories are captured on cassette tapes or VHS tapes. So we’re in this process of digitizing those tapes and now we want to make the content within them available.
Bridget Todd: A few years ago, Te Hiku Media was working on a project to transcribe historic broadcasts with elders who could explain the nuances in language and context. Keoni realized automatic speech recognition — or ASR for short — could help.
Keoni Mahelona: So as we were working on this project, we were like, ‘Wow, this is really hard. If an interview is an hour, it takes at least three hours to transcribe it’, right? So we thought, ‘Oh, why don’t we just train a machine to automatically transcribe this for us’ because, ‘Hey, you know, Siri existed at the time. ASR was a thing.’ So surely we could do it in te reo Māori. From a developer perspective, like we knew the technology existed, we knew there were open source projects out there we could use. But what we also knew is that this was actually a data problem, and that that would be the most important part of this project – was not just sort of getting the data, but we knew we had to gather this data in a way in which we could safeguard it and protect it, and ensure that it would only be used for the betterment of Māori and Māori things.
Bridget Todd: The data is actually voice recordings of short sentences paired with text. This is what a speech recognition engine — in this case, Mozilla’s Deep Speech — uses to decode what sounds go with which letters. For its dataset, Te Hiku Media reached out to community groups, like traditional dance troupes and canoe-racing teams, and soon gathered over 300 hours of speech.
Keoni Mahelona: We mobilized the community to read thousands of utterances to help us collect a corpus that would enable us to train an ASR. In doing that, we learned a lot. And one of the things we learned about the community who were pretty much giving their time to support this project, was that they wanted real-time feedback on their readings.
Bridget Todd: Keoni says they realized they could support language learning by giving people immediate feedback on how they pronounce words at the same time that they’re donating voice data.
Keoni Mahelona: We pretty much hacked Deep Speech and built a real-time pronunciation engine. It’s an app that we have called Rongo. It’s in the Apple and Google stores. Anyone can download it anywhere in the world.
Bridget Todd: Keoni says their speech project will make decades of audio recordings more accessible online.
Keoni Mahelona: One of the things we’re looking at is whether there’s any climate data embedded in our archives and how that can help us to better mitigate some of the effects of climate change. And you need ASR to actually do that, right? To go through all these archives and then transcribe it, and then find the data embedded in that. And unless we can document our knowledge, it won’t be available for our people in the future. I think that’s really the value in what we do with our community, right? We don’t do it for our community. We do this with our community.
Bridget Todd: Many Big Tech companies have been including Indigenous languages in their online services. And on the surface this seems like a good thing. But Keoni’s not so sure.
Keoni Mahelona: These companies don’t really know much about our languages or our cultures, and by simply trying to include us, they could actually do more harm than good to our communities, to our languages, especially languages that are in a state of revitalization. What we’ve seen in the past with tools like Translate from companies like Google and Microsoft, is the translation doesn’t really work very well. But people use the tool, and they treat the tool as sort of a hundred percent accurate. But the truth is, the algorithms they use, or the models they’ve trained, aren’t a hundred percent correct.
Bridget Todd: About five years ago, Indigenous language speakers started getting offers from a language tech company of 45 to 90 dollars an hour for their voice recordings. The recordings were for an unspecified corporate purpose, but the company said the goal was to keep languages alive. Keoni says this approach is extractive and undermines the work of communities. Then, in 2022, OpenAI dropped a new multilingual speech recognition model called Whisper. It was trained on over six hundred thousand hours of audio from the Web, including over 1,300 hours of te reo Māori. How they sourced this data is secret.
Keoni Mahelona: We were very, very concerned when we heard about Whisper, because we thought, ‘Oh, well there we go. You know, no point doing this anymore, right?’ Because, ‘Hey, look, Big Tech has solved it for us. They’ve, they’ve saved our language, thank you.’ But we knew that the model was crap. Like, we knew it wasn’t gonna be good. Even though some of our, like, data scientists kind of had a quick play with it, they’re like, ‘Oh my god, it’s scarily good,’ the ones who had a play with it actually aren’t speakers or fluent speakers of te reo Māori. So when one of our language experts had a quick look, it was obvious it was absolute trash. And then we quantified, like, we quantified that trash.
Bridget Todd: Whisper is open source, but that doesn’t make it feel any less like unfair competition to Te Hiku Media.
Keoni Mahelona: We are absolutely now in competition with these tech companies. When we fine-tuned Whisper with our data, our highly curated data of quality te reo Māori, we were able to create a model that was pretty good at recognizing te reo Māori. And it did perform better than our previous model, but our previous model was built on very old technology. So I think where we’re at now is that we know we can do better than them, despite only having like, you know, a handful of people in our team, not much money, and not much compute. Like we’ve proven we can do better than them for te reo Māori. But there’s still that existential risk of when will they be as good as us or better than us? And understanding that when you also understand how will they achieve that? And the only way they can achieve that is with more language data, more Māori language data. So then we need to ask ourselves, how will they get more language data, or from where will they get that data? And that’s the concern.
Bridget Todd: Te Hiku Media says it’s the guardian, not the owner, of the data it collects and the software it creates for the community. The organization developed a special license called Kaitiakitanga that requires permission for reuse. This way the community has control over how they get benefits back. Keoni says this approach to data sovereignty is modeled after how Indigenous communities traditionally act as guardians of their land, protecting it from colonization for future generations.
Keoni Mahelona: And they’ve taken all our land, right? So what, what left do we have for them to take? Well, it’s our data. I mean, that’s, that’s pretty much it. You know, they’ve taken everything else.
Bridget Todd: Let’s meet someone now who cares deeply about speech recognition in African languages.
Kathleen Siminyu: My name is Kathleen Siminyu and I’m a machine learning fellow at Mozilla Foundation. In my career I’ve worked to build grassroots AI communities.
Bridget Todd: Kathleen lives in Kilifi, Kenya and works with Mozilla on Common Voice. It’s a platform for crowdsourcing open voice data in over a hundred languages. Its mission is to make voice technology more inclusive. Kathleen helps lead efforts to gather data for Kiswahili on Common Voice. This is a language spoken in several East African countries by as many as 200 million people. Until recently, it wasn’t a language open source developers could build speech applications for.
Kathleen Siminyu: So Common Voice is important because it’s an open dataset. Anybody can build on it, everyone can access the data and therefore the communities can start to build for the languages that they care about or they speak, or that those around them speak. My hope is that we open up the path for more voice technology. And by this, I mean, I can tell you a little story. At my first job, I worked at a company in the telco space, and we basically had products like voice and SMS. And I remember in an election year, we needed to be screening messages to make sure inciteful content is not being sent on our platform.
Bridget Todd: In a heated political moment in Kenya, Kathleen wanted to build a tool that would automatically search for messages inciting violence.
Kathleen Siminyu: And in my head, I thought, this is going to be super easy. But then I realized that none of the tools that existed were going to be of use because I needed tools for Kiswahili or other local languages spoken in the country.
Bridget Todd: Kathleen’s experience of not being able to build a tool in her own language inspired her to do more research on her own. She soon discovered Masakhane, a network of researchers working on computer science and linguistics in African languages since 2019.
Kathleen Siminyu: I realized that okay, there’s other people who are interested in these problems, and one of the biggest projects, our first project was a machine translation project. Since then we’ve grown to other tasks. There is a lot of work coming out of this community.
Bridget Todd: Many global companies are gaining a foothold in AI across Africa. Networks like Masakhane and Deep Learning Indaba want to see AI shaped and owned by Africans. For Kathleen, working within communities is an opportunity to create voice technologies that respect language diversity.
Kathleen Siminyu: I think the benefit is the fact that the communities are aware of the nuances of the language. So, taking the context of speech recognition, I’ll give the example that we learn from the West that gender bias is likely, that accent bias is likely, but we then have to look at an East African context and ask ourselves, “Okay, what bias is likely here?”
Bridget Todd: Working with linguists with local knowledge helped Kathleen understand how Kiswahili was standardized by Christian missionaries during colonization.
Kathleen Siminyu: This knowledge, for me, made me realize that we should not make the mistake of only building for standardized Kiswahili. There’s already this growing gap between the standardized version and the other dialects. And if we’re not careful, we’re continuing to push these other dialects to extinction.
Bridget Todd: Extinction. It’s like AI takes on the role of the colonizer, when certain dialects are favored over others. But convincing people to donate their voices isn’t easy.
Kathleen Siminyu: So incentivizing participation has been quite difficult. I think one reason is because AI is very much in the media right now, right? And everybody has this perception that people who are working in AI are making loads of money. So whenever we go into spaces and start talking about the work that we’re doing and why we want people to contribute to the data and tie it to the fact that AI tools can be built, they then want to know, okay, am I going to get paid? But in our program we are not paying people to contribute. So we have to be very creative about how we think about incentives.
Bridget Todd: Like many advocates for open tech in Africa, Kathleen is wrestling with how to build sustainable projects and businesses when the datasets are open. Because Big Tech uses these resources too. So more projects are considering alternatives to completely open licensing. There’s also been talk of creating something like a federation.
Kathleen Siminyu: From the startups, we’re learning that, you know, Big Tech coming into the scene and seeing our tools or our resources are multilingual and they cover, you know, this number of African languages, has meant that for startups it’s harder to get, say, VC funding, right? If you pitch to a VC, and they say, ‘Kiswahili is on OpenAI’s Whisper already, why should we give you money? It’s a problem that’s already solved.’ So these questions are coming up often. How, how can we give startups within our network the advantage? These startups are building with the communities. Can we license the datasets such that the startups get access to them, or maybe not make the datasets open, and have them only open within the network, such that these startups can have access to them, but not Big Tech?
Bridget Todd: With more than 7,000 languages worldwide, decisions about voice data today will influence how people communicate tomorrow. A lot more can be done. This goes for Big Tech, and for the open source communities getting squeezed by their dominance. Speech recognition is about more than just convenience. For people who depend on AI to recognize their voices at home, on the phone, or even in court, these systems and the data they are built with reinforce inequality. This is what can be challenged when communities reclaim a voice in AI to build for themselves.
Before this episode ends — I’ve got some sad news to share. Halcyon Lawrence — the first guest in this episode — passed away a few weeks after we spoke. In honor of her legacy, we’re glad we could still include her voice in this show. We hear you, Halcyon. Thank you for everything. To learn more about Halcyon, and our other guests, please visit our show notes. I’m Bridget Todd. You’ve been listening to IRL: Online Life is Real Life, an original podcast from Mozilla, the non-profit behind Firefox. Mozilla. Reclaim the internet.