"Look Who's Learning Who" or "It Takes Two to Learn When It's Zeroes and Ones"
I've been struggling with voice commands and dictation in at least small ways since the day I first figured them out. It's interesting to me that I think of this landmark as something of a birthday, because I feel like I leveled up when I started voice-dictating texts. It's a bit like picturing my life before and after keyboarding class. I went from twenty-five words per minute to seventy-one words per minute with no mistakes. Get at me, 10th grade.
I call it a birthday, though, because it's not just something I achieved through regimented labor; in hindsight, it was periodic and inevitable. I wasn't going to go through life using at most four fingers at a time to type. The necessity of interfacing with computers was going to drag me along for the ride at some point.
My computer use skyrocketed once the time delay on the interface plummeted. So I should expect the same with voice dictation. Although now I stand to gain not only quicker thought-to-action time, but also hands-freedom. Double win, right? Not for me. See, I struggle with typos when it comes to how I speak.
"Hey, so I was thinking I was going to pick up orange juice on the way home. But then I also remembered that you might have some. Already. Did I forget, that I remember that wrong?"
Perhaps not for everybody, but certainly for me, speech patterns are much more fragmented than is obvious. That's not exactly what I said, but it's the best Google voice dictation could do with what I did say. In speech that isn't being interpreted by algorithms, I can throw ums and yaknowwhatImeans in there all day, and the thought goes on.
But knowing I can't do that here, there's correcting for proper sentence structure, speaking the punctuation out loud, vocal pauses causing end punctuation that shouldn't be there, failed corrections. The result is a generally conversational question about groceries that descends into absent-minded word salad.
And it's not my fault. I don't have another person there listening to me. Here's what happens when I do:
"yeah I just talked to jim and he says you are gonna be coming well he's bringing you I guess and you're gonna game with us which is wicked she asked me who was coming over tonight and I was like Dan the family man Murphy I can't believe that's the first time I've thought to call you that in..."
I'll stop there to be polite. It's a corny joke that I riff on semi-effectively. I only mean to illustrate a couple of things. First is that I am in no way responsible for a filter when speaking. I didn't punctuate anything. I didn't say "comma." Ideas stopped and started, I referred to people without using their names, and I broke my poor keyboarding teacher's heart doing it, I'm sure.
Second, I would never send a text by saying those things exactly. It would be horrendously punctuated, "Jim and he" would turn into Jiminy, and half a dozen commas would show up to make it read like a six-year-old telling a go-nowhere story about the time he drank from the red cup. (Which may not be far from the truth.)
So why can I talk that way to a person and not to my phone's voice dictation? It's not just that the person is listening. If listening is hearing and comprehending sound, my phone is doing that. And it's not just that the person is interpreting my words based on knowledge of the real world. Text dictation (hereafter referred to as "Ted") is drawing upon whatever library of words we can put into data form. Which I understand to be a lot these days.
The difference is that Dan is listening to me, and I know Dan. Dan has been my best friend for my entire adult life. He has a wife and two delightful daughters. I have also talked horror movies with Dan for more hours than I can easily count. So when Dan is listening to what I'm saying, he is already picking up on the irony. Unrelated detours, like the gaming and how it's wicked, don't need correcting. I can get through my joke in 10-15 seconds.
Trying to use Ted to send that joke to Dan, though... Five minutes. First trying to dictate it. Then not being sure of the wording. Then switching between voice and touch to delete something that made no sense. Then back to voice. Then seeing a comma where it would be confusing. Third and final proofread, and I've now taken at least 20 times as long and barely accomplished the same thing.
Is the problem that Ted doesn't understand irony? No. There are two problems. One is that Ted hasn't listened to my speech patterns for decades. The second is that I know Ted hasn't listened to my speech patterns for decades the way Dan has. I worry that what I am communicating will be muddied in the process. That worry causes hesitation and mistakes.
To me, a user interface in its most basic form is whatever it takes to go from a human thought of doing something to the completion of the task. The fewer steps, the better. But really, it's the less time, the better. Because if you have 500 steps (typed words) between you and Dan, that's fine if they can be spoken in 30 seconds to Dan's live ear.
But if you don't have enough confidence that the words will make it through Ted's digital ears, there is doubt and even criticism. Critical thinking, in my experience, takes longer than the thinking that drives everyday speech. I might do a little of it to read Dan and bring the joke to a dismount if he doesn't find it as funny as I do. But I have practice there.
This is my great challenge with the Ted interface: confidence. No small order. Fortunately for me, user-centered design means they'll have to make Ted smarter to deal with my specific style of speaking until I'm comfy. Either that, or start teaching "how to talk to entities with a synthetic database of knowledge generated through machine learning" to 10th graders.
"Hey Dan, just got my schedule. I have Home-Ec 3rd period. You?"
"Aw man, I've got HtTTEWaSDoKGTML. You know Hittie-was-doking-got-mill, how to talk to entities with a synthetic database of knowledge generated through machine learning."
"Dang. Yeah, I got Hittie after gym."
Until next time, when we talk about what Hittie 101 might look like.