If you've been following my blog for a while, you may remember that last year I found that YouTube's automatic captions didn't work as well for some dialects, or for women. The effects I found were pretty robust, but I wanted to replicate them for a couple of reasons:

- I only looked at one system, YouTube's automatic captions, and even that was over a period of several years instead of at just one point in time. I controlled for time-of-upload in my statistical models, but it wasn't the fairest system evaluation.
- I didn't control for the audio quality, and since speech recognition is pretty sensitive to things like background noise and microphone quality, that could have had an effect.
- The only demographic information I had was where someone was from.

Given recent results that find that natural language processing tools don't work as well for African American English, I was especially interested in looking at automatic speech recognition (ASR) accuracy for African American English speakers. With that in mind, I did a second analysis on both YouTube's automatic captions and Bing's speech API (that's the same tech that's inside Microsoft's Cortana, as far as I know).

Speech Data

For this project, I used speech data from the International Dialects of English Archive. It's a collection of English speech from all over, originally collected to help actors sound more realistic. I used speech data from four varieties: the South (speakers from Alabama), the Northern Cities (Michigan), California (California) and General American. "General American" is the sort of newscaster style of speech that a lot of people consider unaccented, even though it's just as much an accent as any of the others! You can hear a sample here.

For each variety, I did an acoustic analysis to make sure that the speakers I'd selected actually did use the variety I thought they should, and they all did.

Systems

For the YouTube captions, I just uploaded the speech files to YouTube as videos and then downloaded the subtitles. (I would have used the API instead, but when I was doing this analysis there was no Python Google Speech API, even though very thorough documentation had already been released.)

Bing's speech API was a little more complex. For this one, my co-author built a custom Android application that sent the files to the API and requested a long-form transcript back. For some reason, a lot of our sound files were returned as only partial transcriptions. My theory is that there is a running confidence function for the accuracy of the transcription, and once the overall confidence drops below a certain threshold, you get back whatever was transcribed up to there. I don't know if that's the case, though, since I don't have access to their source code. Whatever the reason, the Bing transcriptions were less accurate overall than the YouTube transcriptions, even when we account for the fact that fewer words were returned.
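To make that theory a little more concrete, here's a toy sketch of the kind of cutoff I have in mind. To be clear, this is pure speculation on my part: the `decode_step` function, the running-product confidence, and the 0.5 threshold are all invented for illustration, not anything from the actual service.

```python
# Purely hypothetical sketch of a "running confidence" cutoff.
# decode_step() and the threshold value are invented for illustration;
# this is a guess at the behaviour, not Microsoft's actual code.

CONFIDENCE_THRESHOLD = 0.5  # made-up cutoff


def transcribe_with_cutoff(audio_frames, decode_step):
    """Return whatever was transcribed before overall confidence drops too low.

    decode_step(frame) is assumed to return a (word, word_confidence) pair.
    """
    words = []
    running_confidence = 1.0
    for frame in audio_frames:
        word, word_confidence = decode_step(frame)
        # Track an overall confidence for the transcript so far
        # (here a simple running product, chosen arbitrarily).
        running_confidence *= word_confidence
        if running_confidence < CONFIDENCE_THRESHOLD:
            break  # give up and return only the partial transcript
        words.append(word)
    return " ".join(words)
```

Something like this would explain why longer or harder-to-recognize files come back truncated more often, but again, that's just my best guess.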
As you might be able to tell from the graphs below, there were pretty big differences between the two systems we looked at. In general, there was more variation in the word error rate for Bing, and overall the error rate tended to be a bit higher (although that could be due to the incomplete transcriptions we mentioned above). YouTube's captions were generally more accurate and more consistent.

Differences in Word Error Rate (WER) by dialect were not robust enough to be significant for Bing (under a one-way ANOVA) (F = 1.6, p = 0.21), but they were for YouTube's automatic captions (F = 3.45, p < 0.05). That said, both systems had different error rates across dialects, with the lowest average error rates for General American English.
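In case it's useful, here's roughly what a word error rate calculation looks like in Python. This isn't the exact script from the analysis, just a generic sketch: WER is the word-level edit distance (substitutions, insertions, and deletions) between the ASR output and a reference transcript, divided by the number of words in the reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("she had your dark suit", "she had a dark suit"))  # 0.2
```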
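And the one-way ANOVA itself is a one-liner with SciPy. The numbers below are made up purely to show the shape of the call; the real test was run on the measured per-file error rates, not these.

```python
from scipy import stats

# Invented per-speaker WER values, one list per dialect, just to
# illustrate the call; these are NOT the real numbers from the study.
wer_by_dialect = {
    "General American": [0.18, 0.22, 0.20, 0.25],
    "California":       [0.24, 0.28, 0.21, 0.30],
    "Northern Cities":  [0.27, 0.31, 0.26, 0.33],
    "South":            [0.30, 0.35, 0.28, 0.38],
}

# One-way ANOVA: does mean WER differ across dialect groups?
f_stat, p_value = stats.f_oneway(*wer_by_dialect.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```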