Conversational Interfaces and Where We Should Steer Them

Loren Davie
Published in Anti Patter · Aug 4, 2016


On Monday Peter Morville wrote a piece about his thoughts on Alexa, including some suggestions on how Amazon could make its voice agent platform more usable.

AI has always had problems managing expectations. Unlike a GUI or a command-line interface, it does a good enough impression of humanity that many people automatically assume that it’s capable of fully human levels of interaction. Inevitably these people are disappointed.

I don’t think anyone disputes that voice interfaces have a long way to go, a perception driven in part by the aforementioned expectations. However, before we plunge into the trough of disillusionment, we should take some encouragement from how much progress has been made: in the last five years alone, voice recognition accuracy has improved to the point where voice agents can now understand what you are saying most of the time.

Paradigm Crutches

Morville suggests that systems like Alexa would benefit from the addition of a screen that could gather and display more context, for the purpose of accurately gauging the user’s intent. In fact, Alexa already provides a GUI-like “card” interface in its web and mobile apps. So you could hook Alexa up to an associated GUI on an iPad if you wanted, I suppose.

But I strongly feel that relying on a supplemental visual interface sacrifices some of Alexa’s great strengths. A purely auditory interface supports use cases that couldn’t (or shouldn’t) be served by a visual one, such as driving. We need fewer screens in cars, not more.

Relying on a visual interface is also a punt. It’s the equivalent of suggesting that the GUI on the original Macintosh in 1984 wasn’t sufficiently rich, and that the solution was to append a command-line interface to it. (I’m sure some people said exactly that at the time.) The solution isn’t to create a crutch for the new interface paradigm; rather, it’s to make the new interface richer.

Enriching the Interface

For example, many of the issues Morville cites stem from Alexa lacking sufficient context to properly infer his intent. Alexa doesn’t know which Vienna Teng song he wants, and he can’t remember its name. But what if Alexa could do this:

Morville: Alexa, play that Vienna Teng song I like.

Alexa: Which one? I can list some songs for you, you can tell me which album it’s on, or you could hum a bit of it for me.

Morville: (Hums waif-y piano tune).

Alexa: I think you mean “Warm Strangers”. Is that it?

Morville: Yes

Alexa: Playing Warm Strangers by Vienna Teng.

And the next time he wants that song…

Morville: Alexa, play that Vienna Teng song I like.

Alexa: You mean “Warm Strangers”? That’s what you wanted last time.

Morville: That’s it, yes.

Alexa: Playing Warm Strangers by Vienna Teng.

The secret weapon here is the follow-up interview, which kicks in whenever there is not sufficient context for Alexa to understand Morville’s intent.

This is how we would diagram this exchange, more or less, in CAVE Language:

[CAVE diagram: follow-up interview to build context]
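
To make the follow-up interview concrete, here is a minimal sketch in Python. Everything in it, from the toy catalog to the preference store, is invented for illustration rather than Amazon’s actual API; the point is the shape of the loop: ask narrowing questions until the accumulated context pins down a single intent, then remember the confirmed answer for next time.

```python
# A minimal sketch of the follow-up interview pattern. Every name here
# (the toy catalog, the preference store, the yes/no prompts) is
# hypothetical, not Amazon's actual API.

CATALOG = {  # toy data standing in for a real music library
    "Warm Strangers": "waify piano",
    "Gravity": "slow ballad",
    "Harbor": "bright pop",
}
PREFERENCES = {}  # (user, utterance) -> last confirmed song


def resolve_song(user, utterance):
    """Return one song title, interviewing the user when context is thin."""
    # Past confirmations are context too: offer last time's answer first.
    last = PREFERENCES.get((user, utterance))
    if last and input(f'You mean "{last}"? (yes/no) ') == "yes":
        return last

    # The follow-up interview: keep asking narrowing questions until the
    # accumulated context pins the request down to a single candidate.
    candidates = list(CATALOG)
    while len(candidates) > 1:
        clue = input("Which one? Name the album, or hum a bit of it: ")
        matched = [s for s in candidates if clue.lower() in CATALOG[s]]
        candidates = matched or candidates  # no match? keep interviewing

    song = candidates[0]
    if input(f'I think you mean "{song}". Is that it? (yes/no) ') == "yes":
        PREFERENCES[(user, utterance)] = song  # remember for next time
    return song


print(resolve_song("morville", "play that Vienna Teng song I like"))
```

Because the confirmed choice is stored, the second exchange above collapses to a single confirmation question the next time around.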

Another approach, one being pioneered by Viv, the startup founded by former Siri developers, is to build context through temporal proximity, also known as the thing that we were just talking about. For example:

Morville: Alexa, what’s the capital of Nebraska?

Alexa: Lincoln is the capital of Nebraska.

Morville: What’s its population?

Alexa: 268,738

The second question, “what’s its population?”, would be unanswerable in a context-free environment. But because our hypothetical future Alexa would know that Lincoln was the city we were just talking about, it could use that knowledge to contextualize the follow-up question, inferring that Lincoln is the subject of the population question.

Here’s how we could diagram the temporal proximity context with CAVE Language:

[CAVE diagram: context from temporal proximity]
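
A few lines of Python show how little machinery temporal proximity actually needs. The names below are invented for illustration, not Viv’s or Alexa’s real interfaces; the whole trick is a single piece of conversational state, the entity we were just talking about, substituted in whenever the next utterance uses a pronoun.

```python
# A sketch of context from temporal proximity. All names here are
# hypothetical; the key is one piece of state carried between turns.

FACTS = {  # toy knowledge base
    ("Nebraska", "capital"): "Lincoln",
    ("Lincoln", "population"): "268,738",
}


class Conversation:
    def __init__(self):
        self.focus = None  # the thing we were just talking about

    def ask(self, entity, attribute):
        if entity == "it" and self.focus:
            entity = self.focus        # resolve the pronoun from context
        answer = FACTS.get((entity, attribute))
        if answer is not None:
            self.focus = answer        # the answer becomes the new focus
        return answer or "I don't know."


convo = Conversation()
print(convo.ask("Nebraska", "capital"))   # "Lincoln"; focus is now Lincoln
print(convo.ask("it", "population"))      # "268,738", via temporal proximity
```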

I share Morville’s hope that this is the end of the AI winter, and I agree that AI needs IA (someone, try to use AIIA, I dare you). But the way forward with conversational interfaces is not to hook them up to traditional GUIs, but rather to increase their richness in their native syntax: voice.
