Interview with Jonny Axelsson
26.08.2004
Jonny Axelsson is one of the lead developers of Opera's voice engine and also contributed to the XHTML+Voice specification that it implements.
Please introduce yourself
I'm a split personality between Documentation and Technology & Research at Opera. I'm mostly active at the beginning of a project, when I get in contact with the developers, and at the end of it, when the documentation is done. I was active in the W3C and was one of the authors of the XHTML+Voice specification. Making content available on different devices, and making documents as device-independent as possible, is a topic I'm interested in.
What will change because of Voice?
Voice lets you enrich web pages and interact with a page, sometimes even without looking at the screen. Just as wireless technologies have freed you from having to sit at a particular workstation to be on the web, voice technologies can free you from the computer screen. Voice isn't supposed to replace the screen, keyboard, and mouse; they are faster and more familiar in most situations. Many traditional use cases for Voice have been on mobile devices, as they have tiny screens and lack normal input devices like a mouse or a real keyboard. So it is easier and faster to speak to your PDA or car than to battle with some crippled keyboard or stylus handwriting. Listening also demands less concentration. It's a bit like comparing radio and TV, with Voice as the radio and the normal desktop PC as the TV. When watching TV you can't do anything else, as you have to keep your eyes on the screen. But when listening to the radio you can do other things at the same time.
How did Voice develop?
The XHTML+Voice specification was published in December 2001, based on the existing XHTML, VoiceXML, and XML Events standards. Currently Voice is at the same stage as the World Wide Web was 10 years ago. Back then the Web was very limited: few standards existed and they had to be revised frequently, and the technology, the bandwidth, and the machines were not fully up to the task. Nothing was settled yet. Today we are in a very similar situation. Even if we know how the familiar stuff like HTML works, we don't know how Voice works. This is a good time to play with this new technology. What we learn from this will directly influence what the future of Voice will be. The standards and applications are still in flux and will keep changing to meet people's needs, unlike HTML, which is mostly done. And the technology is at a point where it is acceptable. You have no guarantee that it works perfectly, but at least it works.
Why should one develop for Voice?
Right now Voice is used for special purposes only. It isn't used to enhance ordinary websites, as most people wouldn't be able to use it. Instead it's mainly an alternative input method for situations where you can't use a mouse or keyboard: in phone conversations, at certain workplaces, in a car. So Voice applications and websites have been created by developers who expect their users not to have a keyboard. Of course it can be used just for fun too. People will try to explore Voice like they did with CSS on [http://www.csszengarden.com/ CSS Zen Garden]. They'll also play with Opera and its commands. I would like to be able to present a document in different ways using Voice and CSS. For example, you could create an OperaShow and make Opera display the bullet points on the slides as you speak. But Voice won't be used by "bread and butter" web developers in the next few years.
When will Voice be used by the mainstream?
The technology is getting better and better, standards are developing, everything is getting easier, and more products have Voice support. Voice is developing fast, but designers need time to learn how to handle it. It won't be used by the mainstream before you can find code on the web that you just copy and paste into your site, like you do with HTML and CSS today, and before HTML editors like Dreamweaver add support for it. There is also no reason to believe that people will get used to it fast. Maybe it won't happen on desktop PCs, only on devices you're used to talking to, like mobile phones. People feel more comfortable speaking to their phone than to their PC. And there are places where you don't want to be overheard talking to your device, for example on buses and trains. And of course you need devices and websites or applications that take advantage of the standards. Multimodal pages are at the same stage as CSS was five years ago. They will be commonplace, but it will take years. In some places you'll see them earlier, in others later, if at all. The way you use Voice is similar to JavaScript. It can make a web page faster and more convenient, or it can be needed to make the whole thing work. So there might be places where Voice is just an enhancement you can enable if you please. In other places Voice might be the basis for the whole website.
How is XHTML+Voice supposed to work on websites?
The first word that comes to my mind is reuse. X+V is an extension of the web. You can take any website, add a couple of elements, and it will be voice-enabled. It still looks the same, but there's some nice functionality ready for you in the background. It's easy to use the same Voice component over and over again. Normal HTML sites and voice applications work differently. Websites don't wait for you; you just go from one place to another, and you can see which image belongs to which text. Voice has a time dimension. If something happens, you get a response. And as a designer you have to think about this: how can I let the user do what he wants to do? The Voice component acts on request, and these requests are represented by events. So you first have to decide what you want to do with Voice, and then what should trigger it.
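The idea that a Voice component acts on request, with requests represented by events, can be illustrated with a small sketch. This is only a toy model in Python, not X+V markup and not how Opera implements it: the `Page` class, the `bind` and `fire` methods, the element ids, and the greeting text are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a "voice component" is a function that returns
# the text that would be spoken to the user.
VoiceComponent = Callable[[], str]


@dataclass
class Page:
    # Maps (element id, event name) to the voice component it triggers.
    bindings: Dict[Tuple[str, str], VoiceComponent] = field(default_factory=dict)
    # Records what would have been spoken, in order.
    spoken: List[str] = field(default_factory=list)

    def bind(self, element_id: str, event: str, component: VoiceComponent) -> None:
        """Step 2 of the design: decide what should trigger the component."""
        self.bindings[(element_id, event)] = component

    def fire(self, element_id: str, event: str) -> None:
        """Run the bound component, if any, when an event occurs."""
        component = self.bindings.get((element_id, event))
        if component is not None:
            self.spoken.append(component())


def greeting() -> str:
    """Step 1 of the design: decide what Voice should do."""
    return "Welcome. Say a city name to search."


page = Page()
# The same component is reused for two different triggers.
page.bind("body", "load", greeting)
page.bind("help-button", "click", greeting)

page.fire("body", "load")
page.fire("help-button", "click")
page.fire("logo", "click")  # no binding, so nothing is spoken
```

The component is written once and attached to two triggers, which mirrors the order described above: first decide what Voice should do, then decide which events should start it.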
How does Voice work in Opera?
Opera has two voice contexts. You can interact with the web page and directly with Opera. XHTML+Voice interprets your commands to the page, while your voice setup defines which commands cause Opera itself to perform actions. When you interact with a page, the author has set up responses to the events you trigger. An event could be a click on a button, or just that the page finishes loading. The response will take the form of one or more user-machine dialogs. The outcome of these dialogs can be changes like filling in forms, changing the document, or going to a new page.
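As a rough illustration of a user-machine dialog whose outcome is a filled-in form field, here is a minimal Python sketch. It simulates the recognized speech as a plain string and is not based on Opera's actual engine; `run_dialog`, `on_focus_size_field`, the prompt, and the list of accepted phrases are assumptions made up for the example.

```python
from typing import Dict, Iterable, Optional


def run_dialog(prompt: str, reply: str, accepted: Iterable[str]) -> Optional[str]:
    """One simulated dialog turn: 'speak' the prompt, then accept the reply
    only if it matches one of the phrases the page author listed."""
    print(f"machine: {prompt}")
    print(f"user:    {reply}")
    normalized = reply.strip().lower()
    return normalized if normalized in set(accepted) else None


def on_focus_size_field(form: Dict[str, str], reply: str) -> None:
    """Hypothetical event handler: runs a dialog and fills in the form
    field only when the reply was understood."""
    result = run_dialog(
        "What size would you like?",
        reply,
        accepted={"small", "medium", "large"},
    )
    if result is not None:
        form["size"] = result


form: Dict[str, str] = {}
on_focus_size_field(form, "Large")     # understood: the field is filled in
on_focus_size_field(form, "gigantic")  # not understood: the form is unchanged
```

A real dialog would normally prompt again after an unrecognized reply; the sketch leaves that out to keep the event, dialog, and outcome steps easy to see.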