Let's get started. Thank you guys for coming. I know it's the end of the day and everyone's probably pretty tired, but I really appreciate you coming to this talk. So I'm going to start out with a short poll. Can you raise your hand for me if you've ever used Siri or Google Voice or any other type of voice dictation software? Show of hands. Okay, great. So almost everyone in the room. And then I want you to raise your hand again if you would characterize that voice dictation software as 100% accurate all the time. Anyone? Okay, so that's pretty much what I expected, and that was basically my impression of voice dictation as well. For example, when I was back in San Francisco in my apartment putting together my slides, I tried to get Siri to turn on the lights above my desk, and I had to ask about four different times. And in the end, she didn't even turn on the correct light. But that being said, today we are going to talk about voice dictation, specifically with a technology called the Web Speech API. But first, introductions. Hi, I'm Cameron, and I build expert-use software for Stitch Fix in Ruby on Rails. You may have also heard expert-use software described as internal tools. What this means is that I build applications for other Stitch Fix employees. I'm going to be spending most of our time today talking about a project that I worked on recently at Stitch Fix, so I thought it would be a good idea to give a brief overview of what the company does so that everyone can level set and we're all on the same page. Stitch Fix is an online personalization company currently focused on men's and women's apparel, accessories, and footwear. The way it works is you go online, you fill out a survey with your size and fit preferences, and then we match you with a personal stylist who puts together a box of five items for you. We pick the five items from our warehouse, we send the box to your house, you try everything on at home, and you get to keep what you love and send the rest back free of charge. The previous slide showed a picture of a typical Stitch Fix box, also known as a fix, and here I want to show the life cycle of one of the items in a fix. Before the item gets to the client, there are several different steps it goes through. At the very beginning, the buyer makes a choice to bring in the style to sell to our clients. The buyer places the order for the style to come in on a certain date, and the vendor ships the items of that style to our warehouse on that date. Next, the warehouse receives the items and puts them into inventory, and then they're available for a stylist to send out to a client. Once the stylist picks an item to go in the client's fix, we're back at the warehouse: the warehouse picks the items out of inventory, packs them up, and ships them to the client. Then, like I mentioned before, the client can try on the items at home, and the warehouse processes anything the client returns. We'll come back to that in a second, but now that we've talked a little bit about Stitch Fix, here's a brief overview of what we're going to cover today. First I'm going to go through a case study featuring data entry by associates at our warehouse. Then I'll show you how you can get started with the Web Speech API to experiment with voice dictation on your own.
I'll talk about some voice dictation challenges that we ran into and solutions that we implemented, and then I'll answer the question: is voice the right solution for you? So jumping into the case study: like many retail companies, Stitch Fix takes measurements of the items that we bring into inventory to eventually sell to our clients. This is a diagram of a men's long sleeve woven shirt, and you can see six marks across the shirt; these are called points of measure. They're very specific technical retail measurements that we take on the shirt. There are actually hundreds of these measurements that can be taken, and they range from something as specific as the width of a button to something as generic as the length or width of the shirt. At Stitch Fix, for any given men's shirt, we typically take about 15 to 20 measurements. And if we go back to the life cycle of an item, the part of the process where we take these measurements is at our warehouse, when we receive the inventory from the vendor. The measurements are collected with just a basic sewing measuring tape. Here you can see one of the men's shirts laid out flat on a table, and we're measuring across the shoulders with the measuring tape. When I started working on this project, the goal was to build an application to start capturing the measurements we were taking at our warehouses. The process was already in place before I started: measurements were already being taken and collected, but the team was using Google Sheets to record them. You'll see this as kind of a recurring theme in internal software: we take existing processes and make them more efficient and scalable by building software to support them. And that's exactly what we did in this project. Throughout this project, I got the opportunity to partner with my coworker on the user experience, or UX, design team at Stitch Fix, and we worked together through the entire project, from the user research phase to the prototyping phase to the development phase. Here's a picture from our initial user research session, where we went to the warehouse to observe the current measurement process before figuring out what type of tool we were going to build to support it. We had a couple of main takeaways from that first session. The pictures on the left and the right show some handmade props that the warehouse associates had made to aid them in the measuring process, and you can see from the diagram a couple slides ago that we really took inspiration from those props and carried that through into the application. The middle photo shows one of the warehouse associates actually taking these measurements, and our main takeaway there was that they were recording measurements on very small laptop screens. There was a lot of hunching over, and a lot of shifting in body language back and forth between measuring the garment and entering numbers on the keyboard. Before I talk about the rest of the process we went through to build this application, I wanted to give you some context so you can think about what we ended up building as we go through the rest of the story. So here's a quick demo of the final solution that we came up with: "23, 18 and one over four, nine and three quarters, eight and a half, four and seven eighths, two and three over four, 16 and one half, two and five over eight, save."
So in case you haven't figured it out already, you are at one of the JavaScript talks at RailsConf. We ended up going with voice dictation as our solution. This is a Rails app, but all of the voice dictation is built on the front end. In all honesty, though, this isn't really a talk about JavaScript or even about voice dictation. It's a story about how to leverage the UX design process in engineering to build the best product for our users. So how did we do that? Well, let's finish the story. After our initial user research session, we were pretty focused on the fact that the users were hunched over small laptops and had to switch back and forth between measuring and entering the measurements on the keyboard. So we wanted to test out measuring in pairs. We asked the associates to pair up so that one of them could continuously measure and dictate the measurements aloud while the other typed into the laptop. Our hypothesis was that if one of the associates could spend 100% of their time measuring, they wouldn't have to break their flow or their concentration, they wouldn't have to reset their body language or their hand position on the measuring tape, and they would be more efficient. What we found from this test is that the associates kind of hated the concept of measuring in pairs. The person who was typing on the laptop felt like he was just waiting around, not really doing much, and felt he could have been more efficient if he had grabbed another shirt and started measuring himself. But what we did notice, and what was promising, is that the associate who got to focus completely on measuring did seem to be more efficient: she was able to continuously measure without breaking her flow or her concentration and without shifting her body language. Because of that finding, we wanted to move forward with a voice usability study. These two screenshots show the initial prototypes that we brought to the warehouse. The one on the left is basic keyboard entry, and the one on the right is the voice dictation prototype. They don't look that much different, and I just want to call that out: you're not really supposed to see a difference, because this isn't so much a UI change as an input change. But you can see in the voice dictation prototype there's a click-to-speak button on the top for the associates to press when they're ready to start speaking into the application. Aside from that, the interfaces are pretty similar. In this voice usability study, there were three main questions that we were hoping to answer. The first one was around efficiency: would voice entry affect the overall time to measure a garment? The second question was around accuracy. A little bit of background: our warehouses are pretty noisy environments. The associates often like to play music, sing along to it, and talk to their friends during their shifts, so we were wondering whether that would work out for voice dictation, or whether it would be hard to capture what the user was saying. And the last question was a little bit around culture and workflow: how would the warehouse associates feel about voice entry? For context, any associate working on the measurements usually does it in about a four-hour shift, so about half of a workday, with breaks in between.
And we didn't know if it would feel exhausting to be talking aloud for hours at a time, or if they would prefer to be typing on a keyboard instead. So let's take a look at the results. Here are the results around efficiency. We tested these prototypes with two warehouse associates, and you can see that participant one had a pretty dramatic increase in efficiency, shaving about three minutes off of his measurement time with voice data entry. Participant two also saw a bit of a lift in efficiency, but not quite as dramatic. The interesting thing here is that participant two was already the more experienced person doing the measurements, and he was already ridiculously fast at taking them, which is why he didn't see quite the increase in efficiency that the less experienced associate did. But we thought these were really promising results, especially since we knew we would be onboarding new people onto this process; there seemed to be a huge efficiency gain here. The next thing we wanted to look at was accuracy, and we found that investing in the right headset was really the key: with it, we were able to mitigate the accuracy issues from the noisy environment. This is the headset that we ended up purchasing for our warehouse associates. The microphone has a pretty narrow input range, and, most importantly, the microphone can be flipped up into the headset, and it stops recording when it's flipped up. This was important to us in terms of keeping the culture going in the warehouse. The associates could move seamlessly back and forth between measuring and singing along or talking to their friends, and they didn't have to feel trapped by the voice dictation device. The last thing we wanted to know was how the associates would feel about voice dictation. Here are some photos: the left one shows the keyboard entry prototype and the right one shows voice dictation. This is participant number one in the study, and his main comment was that voice dictation felt a lot better for his back. You can't see it as well in the keyboard picture, but in the dictation picture he's standing up definitely straighter, not as hunched over the laptop. And this is participant two, who was the most experienced and already pretty efficient with the keyboard. His main comment was that he liked that he never had to remove his hands from the measuring tape. You can see in the photo on the left that even when he's using keyboard entry, he has kind of a one-handed approach to typing. Since he's more experienced at measuring, he really capitalized on the fact that if you don't have to completely reset your hands on the measuring tape each time, you can move through the measurements faster. So that was his main callout: with voice data entry he could truly use two hands at all times to do the measurements. Now that you've seen how we utilized voice dictation with our warehouse associates, I want to talk a little bit about how you can get started with the Web Speech API on your own. First, here's a bit of CoffeeScript showing how to initialize the Web Speech API. The really cool thing about this API is that there's no external library or anything you need to pull in; it's available right in the browser's JavaScript environment if you're using Chrome.
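To make that concrete, here's roughly what the whole setup looks like, rewritten in plain JavaScript rather than the CoffeeScript from the slides. This is a simplified sketch, not our production code, and handleTranscript is just a hypothetical stand-in for your own app logic:

```javascript
// Sketch of the Web Speech API setup narrated below. webkitSpeechRecognition
// is Chrome-only, so only initialize it if it's defined.
if (typeof webkitSpeechRecognition !== "undefined") {
  const recognition = new webkitSpeechRecognition();

  // Restart recognition every time it ends, so an associate can dictate
  // through a whole form without clicking anything in the app.
  recognition.onend = () => recognition.start();

  // Results come back as an array-like list; the final, most refined
  // transcript is the last element.
  recognition.onresult = (event) => {
    const result = event.results[event.results.length - 1];
    const transcript = result[0].transcript.trim();
    handleTranscript(transcript); // hypothetical app-side handler
  };

  recognition.start();
}
```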
So it's really just as simple as initializing webkitSpeechRecognition. And on that note, like I mentioned, this is available to use in Chrome with no external libraries, but the flip side is that it's only available in Chrome. That's why internal tools make a really good candidate for the Web Speech API: we can fully control our users' browsers. It probably wouldn't be the best solution for something customer-facing, where you have to support every browser under the sun. Below the screenshot is a little code snippet showing that we only initialize the speech recognition if it's defined. Then pretty much the only other thing you have to do is start the recognition and record voice results. You can also see a bit of code in the middle where we have logic that restarts the recognition every time it ends. This is so that the associates could continuously measure: they could go through every measurement on the form without having to click any button or actively turn the voice recognition on and off. The way they moved in and out of voice dictation was by flipping the microphone up on the headset, as opposed to clicking anything on the keyboard or messing with the app at all. The last step is just getting the results back from the API and returning the transcript. So it's a pretty straightforward setup. Now I want to go into some of the challenges that we ran into with voice dictation and some of the solutions that we implemented. The first challenge was around contextual formatting. You may or may not have noticed a couple of slides ago that the results that come back from the Web Speech API come back as an array. This is because the API is recording context along the way as the user speaks, and it actually returns snippets of speech along with the final result, which is the last element in the array. Let's look at two basic examples. On the top, the user starts to speak and says "two." They continue speaking and say "two and a half," and then they finish off their sentence with "two and a half ice creams." The API determines, okay, this person is speaking in sentence form, there's other context around it, they're talking about ice cream; we're just going to return the words exactly as they said them, and there's no need for additional formatting. In the second example, the user starts out the same: they say "two," they continue on and say "two and a half," and then they stop speaking. That's the case for our application, since our users are only dictating numbers. What the API is supposed to do here is infer, from the lack of context, that the user is speaking a number, and transform the text into the numeric version. And this is pretty awesome; I thought this contextual formatting was one of the most fascinating things you get out of the box with the API. But unfortunately it doesn't work 100% of the time. We saw about a 50-50 success rate with it. What that meant for our users was that they were speaking "two and a half" aloud and seeing the words "two and a half" come onto the screen instead of the number. And that's really confusing when you're using a measuring tape and you're supposed to be entering data in fractions, and sometimes you get a fraction and sometimes you get words.
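The fix we landed on, which I'll walk through next, boils down to a lookup table. Here's a hypothetical sketch of it; our real mapping covered far more words and phrases than the few shown here, and the names are illustrative:

```javascript
// Illustrative word-to-number mapping -- the real table was much larger.
const WORDS_TO_NUMBERS = {
  "one half": "1/2",
  "five eighths": "5/8",
  "two": "2",
  "sixteen": "16"
};

// Swap any spoken-word numbers in a transcript for their numeric form.
// Word boundaries keep "two" from matching inside longer words.
function formatTranscript(transcript) {
  let formatted = transcript;
  for (const [words, numeric] of Object.entries(WORDS_TO_NUMBERS)) {
    formatted = formatted.replace(new RegExp(`\\b${words}\\b`, "gi"), numeric);
  }
  return formatted;
}

// e.g. formatTranscript("two and five eighths") === "2 and 5/8"
```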
But we were able to solve this pretty easily, I think because we had such structured data: we were only expecting our users to dictate numbers. So we were able to account for that and do the contextual formatting ourselves. What we did is set up an object that maps the numbers as words to their numeric counterparts. Every time we got a transcript back from the API, we iterated through the object, checked for matches in the keys, and, if we found a match, replaced it with the value, which is the numeric version. In addition to contextual formatting, another challenge we ran into was dictation errors from our users, which were a little bit harder to solve. Here are two examples. In the first one, the user dictated "35 eighths," which came out as 35/8, pretty much as expected. But what the user was actually trying to say was "30 and 5 eighths"; they just didn't enunciate the "and." So this was more a matter of training the users in how the API would record their results: in this case it was recording exactly what they said, but the user didn't say what they meant to say. Same thing with the example on the bottom: "four quarters" returned 4/4, because that is four quarters, but what they actually meant to say was "four and one quarter." It just didn't come out of their mouth the way they were intending. That's challenging, because both of those results are valid fractions, so it's a little hard to catch these. Luckily, we do have a front-end validation in the application that makes sure users enter their fractions in reduced form, and both of these examples happen to trip it: 4/4 isn't reduced, and 35/8 should be reduced to the mixed number 4 and 3/8. But you could imagine other dictation errors that produce perfectly valid fractions it wouldn't catch, so that one is a little bit harder to plan for. The last challenge we ran into with voice dictation was around reliability. If you go to the MDN documentation for the Web Speech API, you'll see a notice at the top that this is an experimental technology, along with a bunch of caveats: it may not always be backwards compatible, it could have breaking changes, et cetera. After a few weeks of using voice dictation in production on a daily basis, we noticed some unexpected behavior. The main one was that users would get through about half a page of measurements and the recording would just completely stop working altogether. This proved pretty challenging to debug, because there wasn't really a difference between those scenarios and the scenarios where it was working. By that I mean there were no errors in the JavaScript console, nothing indicating that something was wrong, so it was pretty hard to debug and test. We initially turned to hardware as a potential problem. We thought, oh, maybe we made the wrong choice in headsets, so we tested a few different headset options, even just regular earbuds, and didn't really find anything there. We also tested different laptops, Mac versus PC, to see if it could have to do with the laptop or its internal mic. And we made sure our users had up-to-date versions of Chrome, which they all did.
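As an aside, that reduced-fraction check is conceptually simple. Here's a hypothetical sketch of the kind of check I'm describing; the function names are made up, and our real validation lived in the form code:

```javascript
// A fractional part is acceptable when it's proper (numerator smaller
// than denominator) and fully reduced (greatest common divisor of 1).
function gcd(a, b) {
  return b === 0 ? a : gcd(b, a % b);
}

function isValidFraction(numerator, denominator) {
  return numerator < denominator && gcd(numerator, denominator) === 1;
}

isValidFraction(5, 8);  // true
isValidFraction(35, 8); // false -- improper, should be 4 and 3/8
isValidFraction(4, 4);  // false -- not reduced
```

As mentioned above, a slip like 3/8 instead of 5/8 sails right through a check like this, which is why training mattered as much as validation.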
So the reliability issue is still a little bit inconclusive and something that we're digging into more; we're not quite sure what's causing it right now. But the good thing is that, working with an experimental technology, we knew from the outset that we had to have a fallback plan, so we never blocked the users from simply typing the data into the form that you saw in the demo. That's what they're using for the most part right now as we work through these reliability issues. And if you remember, one of the main reasons we wanted to implement voice dictation was for the users' comfort and their body positioning. So what we ended up doing was purchasing monitors with large screens that we could stand up in the warehouse so the associates could clearly see the form of measurements in front of them. They didn't have to hunch over, and it was a much better experience for them. I want to call out one other challenge we've run into that's not related to voice dictation but has more to do with users entering data into the form. You can imagine that if a user is typing on the keyboard and intends to type 10, they might accidentally slip and type an extra 0, and then we essentially have invalid data: a measurement of 100 instead of 10. That's difficult to catch, because 100 is a valid number; it's not any more invalid than 10. What we had to do was implement suggested ranges for each of our measurements. The way we did that is, for every single point of measure (and again, there are hundreds of these, and they differ by the type of item being measured), we added a minimum and maximum value to the database, which let us implement front-end warnings when a measurement is out of range. Here we show the orange warning, but we never block users from submitting the form, because it's definitely possible for a measurement to be out of range; we just want to catch the extreme outliers, like 100 for across-the-shoulder, which would never happen. So, is voice the right solution for you? A few things I would consider if you want to look into voice dictation as a potential solution for your users. The first is browser control. The Web Speech API in particular is only supported in Chrome right now, and because we were building this as an internal tool, that allowed us to experiment with it much more easily than if we were building a customer-facing tool. I think the fact that we had structured data was also really helpful, particularly for the contextual formatting. I'm not really sure how we would have solved that problem if the API was returning unexpected data and we were allowing any words to come through; the fact that we were only expecting numbers, and only allowing numbers as input, let us solve it quickly. And since this is a pretty experimental technology, it's important to have a flexible user base and a fallback plan. Building trust with your users, making sure they're willing to experiment with you, making sure they understand it might not be perfect, especially for the first few iterations, and communicating that there's always a fallback plan, and that everyone's trained on it, is really key.
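Circling back to those suggested ranges for a second, here's a minimal sketch of the idea. The record shape and the numbers here are made up for illustration; in our app the minimum and maximum came from the database per point of measure:

```javascript
// Hypothetical suggested range for one point of measure, in inches.
const acrossShoulder = { min: 14, max: 24 };

// Return a warning message when a value falls outside the suggested
// range -- warn only, never block, since out-of-range can be legitimate.
function rangeWarning(value, { min, max }) {
  if (value < min || value > max) {
    return `Expected between ${min} and ${max} inches. Please double-check.`;
  }
  return null;
}

rangeWarning(100, acrossShoulder); // warning -- probably a slip typing 10
rangeWarning(18, acrossShoulder);  // null -- within range
```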
So when I was putting together my key takeaways slide, I sort of came to the realization that this talk has served as a bit of a postmortem on this project for me. There's a lot to learn here, but there are a couple of things I'd like you to take away from this story, and they're both around UX and engineering collaboration. The first is that the UX and engineering collaboration we had allowed us to empathetically build expert-use software. By that I mean that usually I'm working on software used by people who are sitting at a desk, typing on a keyboard. This was the first time I had thought about things like body language and the user's comfort level while they were using the app, and that's something I hope to bring to a lot more of my applications and products down the line. The collaboration also allowed us to prototype quickly early on, so we could iterate and solve problems fast. The couple of prototypes that I showed at the beginning, the ones we tested with our users, were built 100% in code as a true collaboration between UX and engineering. That was beneficial because we could get a realistic prototype out to our users, test it quickly, and make iterations directly in the code, and some of that code ended up in our production version. So I think it allowed us to move through the process faster, and it also allowed us to look at the problem through both the engineering and the user experience lens. I want to thank a few people for their collaboration on this project. First and foremost, Sarah Poon, my coworker on the UX design team, who was with me every step of the way during this project. Everyone else on this list was also instrumental in getting it off the ground. And with that, thank you guys. So, the first question: why did we choose speech versus something else, and were there any other options? Well, the only options we were really considering were speech versus traditional keyboard entry. We haven't really looked into any technology like a smart measuring tape or something like that, but it sounds like a cool idea. Are there limits to how long we can be recording for? Our impression was no, but the sort of intermittent recording that we're getting might imply otherwise. It's really hard to duplicate the problem if you're just testing at your laptop, or even in the warehouse, but we have noticed that it becomes more of a problem with continuous use, hours at a time. So you're right, there could be something there, although every time you submit the form, it stops recording and then starts back up again when they go to a new shirt to measure. So yeah, that's a good point, and definitely something to look into. The next question was whether there was any reason we used the Web Speech API versus some other speech API, and whether we've played around with any others. Like I said, this is still the very early stages of the project, and we went with the Web Speech API pretty much because it was available quickly and easily for our initial prototyping, and we didn't run into the issues early on. So we figured, let's not fix what's not broken. But now we're at the point where we probably have to evaluate some other options. That's a good question. Oh yes, good question.
So the question was: if the warehouse associates saw something they didn't expect, or made an error and wanted to go back, did we implement anything to help them do that? Yes. I didn't mention it in the talk, but we did implement some keyword triggers that would move the cursor around on the page. The one in the demo was "save," which submits the form, and we also implemented a "back" and a "forward," so that if they did make a mistake, they didn't have to touch the keyboard to go back and change it. Do we have a warehouse in France? The question was whether we implemented localization for different languages. We only have warehouses in the US right now, so that hasn't been an issue, but you might have seen on the initialization slide that you can set a language option. We haven't tried it yet, but I'm assuming it would work. Oh yes, am I able to share the model of headset we were using? It's made by Jabra, which I think is a pretty common company for call centers. Anyway, that's how we found it. How many garments per day does the company measure? I'm not quite sure, and we're not fully ramped up, since this is such an early project. We started with men's clothing because it was a little simpler to wrap our minds around how we would measure it: men's clothing is primarily based on a box or a square. Not to make any implications, but it's a little bit tougher to measure women's clothing because of the extreme silhouette variation and the different styles of necklines and sleeves, et cetera. So we rolled this out with men's, which is why most of the examples were for men's, plus a small subset of women's blouses. Right now we have a handful of associates measuring for a few hours a day, but I'm not sure quite how many garments that comes out to. Anyone else? All right. Thank you guys.