I’ve been tracking the adoption of voice-first technology ever since I got my first Echo device around Thanksgiving of 2014 and started 20% of my sentences with “Alexa…”. And every so often I like to have guests join me for this series to see where things stand today with these devices, and how they’re being used. But I haven’t really focused on designing voice content before, which is why I was really excited to speak with Preston So. Preston is Senior Director, Product Strategy at Oracle, but more importantly for this conversation he is also author of the book, “Voice Content and Usability”.
Below is an edited transcript of our recent LinkedIn Live conversation. Click the embedded SoundCloud player to hear the full conversation.
Brent Leary: How has the pandemic impacted the role of voice, from a content development standpoint, in the context of digital transformation?
Preston So: This is a really interesting question. I’ll answer this from two different angles. The first is, and I just realized that I haven’t actually mentioned this case study yet on the show, that 5 or 6 years ago I had the opportunity to work on a team that built AskGeorgia.gov, which was the first ever voice interface for residents of the state of Georgia. It was also really one of the first ever content-driven or informational voice interfaces in existence.
The two reasons why we wanted to build and pilot this project were to serve those demographics which, as I mentioned earlier, are oftentimes ignored by, or not served as well by, the websites that we build. And this is, as we know, a very pressing concern in the public sector, and a very, very pressing concern within local government. The two audiences that we wanted to serve were, number one, elderly Georgians, who might not necessarily be able to use a website as easily, might not necessarily be able to use a computer as quickly, and also might not necessarily have the mobility to travel to a county government office or an agency office. At the same time, we also wanted to focus on disabled Georgians: those who might not be able to use a website as quickly as those who are using it through its visual kind of approach, and also those who, because of those issues of mobility, really don’t have the ability to actually travel to an agency office and get their questions answered there. At the same time we were also dealing with, in those days and still continuing on today, the lack of budget, the cash-strapped nature of state and local governments, where budgets are being slashed left and right, and oftentimes those hotline wait times were growing and growing and growing on the phone.
The reason I brought this case study up is I think the coronavirus pandemic has really magnified how certain audiences face not only these very, very problematic systems of oppression in society, but also really deep barriers to accessing the information, content and transactions that they need. And if you think about who’s been impacted most by the effects of the pandemic, it is people with disabilities and those who are elderly. And especially if you can’t even leave your home, how do you actually get the information you need? So I think we in some ways presaged a lot of the work that’s happening right now with digital transformation, where a lot of organizations are now realizing this. And this is of course modulating through a lot of the work that we have now seen on remote working, on distributed workforces, all of that, but also how best to serve customers in that B-to-C angle: how do we actually make sure that those who are our customers, those who are our users, those who are our actual demographics can interact with our content in ways that don’t require them potentially to do things that put them in danger?
And I think there are several things that have accelerated in this regard. The first is along the voice axis. As we saw, I think it was last year, smart home system and smart speaker sales have gone through the roof. I mean, 35% of Americans now have a smart speaker at home. But by the same token, we’ve also had an incredible amount of growth in gaming headsets and gaming technologies: virtual reality headsets, wearable devices. These really portend, I think, the shift of content away from the written medium, from the visual medium that we have been really used to over the past few decades, into a much more multifaceted kind of context where now we could potentially be interacting with our content through an Oculus Rift or through our smartphones, through our Samsung TV, through our iPhones and our iPads, but also of course through an Amazon Alexa. For me, I think the biggest thing that’s happened with the coronavirus pandemic is that it’s really accelerated the arrival of that time where organizations now have to understand that it’s not just the web anymore.
It’s not just mobile, it’s 15 different things. It’s all of these different considerations, and if you’re just now getting to thinking about web and mobile, you’re already behind.
Progress to date on voice content development
Brent Leary: Are we where you expected us to be with voice being a piece of the interaction channel between consumers and vendors?
Preston So: Yes and no. From the maker standpoint, I think so. And what I mean by that is, as I mentioned earlier, we’ve got these really great tools that are out there, like Botsociety and these new startups that are developing really designer-friendly tools that allow you to take the sort of old Dreamweaver or Microsoft FrontPage approach to building websites. You take that over to a voice interface and suddenly you don’t have to be writing, let’s say, very low-level hardware code, or writing, let’s say, natural language processing or natural language understanding into a bot. At the same time, though, I think we’re a long ways away, and I think that we’re not really quite where I thought that we would be at this point. But I think a lot of that is also because AI itself is not quite as far along as a lot of people necessarily thought.
One of the reasons for that is we are experiencing this time right now where a lot of the voice interfaces that we’ve built are fundamentally still clearly digital and automated, and don’t really have an actual means of communicating in a way that we can hear ourselves in. One example of this is that you look at some of the bilingual communities in South Texas or in New York City and you hear people literally switch between Spanish and English in the middle of a sentence, or people in Mumbai or New Delhi who switch between Hindi and English mid-sentence, or switch between Marathi and English mid-sentence.
And these are populations that don’t hear themselves within these voice interfaces, let alone all the communities of color who also don’t feel that they can hear their own sort of dialects, their own sort of colloquialisms and their own sort of manners of speaking within these voice interfaces. There are some interesting steps in the right direction that kind of go partially there, but not really. The first, of course, is that I’ve been very surprised and happy about what Waze is doing in terms of allowing you to kind of configure those voices that read out those statements like “police reported ahead” or “vehicle on shoulder” or “keep left.”
There are also, of course, new services that are emerging like Amazon Polly. Amazon Polly’s really interesting because it will take some input of written text, like a paragraph or a page or whatever, and it will read it out in a British accent or a South African accent or an American accent, a woman’s voice, and all sorts of various kinds of gauges that you can twist and play around with. But still, fundamentally, of course, that’s written text that’s not necessarily been optimized for speech.
There’s no algorithmic way to turn written text into something that’s written in a more spoken style. But there’s also that kind of big worry that I have, which is that when it comes to voice interfaces actually being great and getting to that point of excellence that we expect, in some ways I think it’s almost impossible. I think it’s almost a paradoxical statement to say that voice interfaces will be at this level of optimum behavior for everybody, because the way that a voice interface sounds to me is going to be very different from the way that a voice interface sounds to somebody else. I think that’s really engendered by the fact that if you look at Alexa or Siri or Cortana or Google Home, generally speaking the default voice, the default identity that comes out of this voice interface, is somebody who sounds a lot like a cisgender straight white woman who speaks with the General American or Middle American dialect.
And there’s not necessarily a whole lot of space for people who are speakers of English as a second language, or people who are code switchers, as I mentioned before, who switch between English and Spanish right in the middle of the sentence, or trans and non-binary communities who switch between straight and other modes of speech in terms of how they actually interact with each other. Until we hear those sorts of toggles, until we hear that sort of reality reflected in those voice interfaces, I don’t think we’ve actually reached that lofty goal.
What worries me today is that we’re facing a situation that’s unprecedented with the pandemic, where a lot of these customer service agents, a lot of these frontline customer service workers, are losing their jobs in favor of a more automated, mechanical voice interface approach. But most of these people who are losing their jobs, who are being laid off, who are being superseded by voice interfaces at these corporations, are generally people who live in the Global South. They’re generally people from the Philippines or Indonesia or India who speak English in ways that should also be reflected in the voice interfaces that we have today, if we so want them to.
Somebody who is Filipino American should be able to hear a voice interface that sounds Filipino American as well. So while I think that in some ways things have gotten really great for voice interface designers, I think for voice interface users we’ve still got a long ways to go, and it’s going to be a few decades, I think, before we can even kind of get to that point.
The near future of voice content design
Brent Leary: What do the next couple of years look like for voice content design?
Preston So: I certainly think that there are going to be improvements in certain regards. There are definitely going to be improvements when it comes to what I call the democratization of voice interface design. If you’re somebody who doesn’t know how to create a website, if you’re somebody who doesn’t write code, if you’re somebody who doesn’t actually do anything that is related to computer science, you can today create a voice interface, which is really the first time that we’ve ever been able to do that.
I think we still are very much focused on the idea of voice interfaces as something that’s used to turn off our lights when we’re done with them, to switch on and start up preheating if you’ve got a smart home system, to let somebody in at the door, which is the most recent commercial I have seen, and to do other things that are not really that sort of complete concierge that voice interfaces were supposed to be, right?
If you look at some of the more aspirational media about voice interfaces, for example, you look at HAL in 2001: A Space Odyssey, or you look at the voice of Majel Barrett in Star Trek, or especially some of the sort of Black Mirror episodes that have come out recently, it’s not just that we want an assistant that can talk to us about doing this transaction or that transaction or doing this task on our behalf.
We also want to be able to have them potentially schedule our day, do things that are much more complex and multifaceted. For example, I don’t want to just buy tickets to a movie. I don’t want to just buy tickets to see Cruella or In the Heights. I want to actually find out about that movie. I want to find out what its score was on Rotten Tomatoes. I want to find out who the cast and crew are. And a lot of times these voice interfaces are still not equipped with that kind of capability.
There’s a paradox, though; there’s a really interesting conflict here, because right now we’ve seen a bit of segmentation happening. For example, let’s say you go to AMC Theatres, right? Or you go to Hilton Hotels or Delta Airlines. If you want to ask Delta about Hilton, or you want to ask AMC Theatres about some other theater chain, they can’t help you.
What we’re seeing here is this interesting conflict between how these voice assistants and voice interfaces are trying to compete against each other to be more and more broad in terms of their coverage of information and transactions across the web, and the fact that AskGeorgia.gov, for example, is only going to answer your questions about the state of Georgia or topics that are relevant to residents of Georgia. So it’s a really interesting question. I think we’re going to see some sort of next phase of voice interfaces here in the very near future that is going to be trying to wash away some of these lines in the sand between topical and transactional considerations. And we’ll also begin to see much more content-driven voice interfaces.
This is part of the One-on-One Interview series with thought leaders. The transcript has been edited for publication. If it's an audio or video interview, click on the embedded player above, or subscribe via iTunes or via Stitcher.