Case Studies: Smart Speaker and Self-Driving Car
Apply concepts to real-world systems to see tradeoffs and decisions in action.
Speech Recognition Pipeline — The Secret Sauce Behind Your Smart Speaker (and the Car That Won't Text Back)
"Speech recognition isn't just a model — it's an orchestra. If one violin is out of tune, the whole symphony sounds like a cat singing Adele." — your slightly dramatic AI TA
You're already familiar with the wake-word basics and the high-level smart speaker problem framing we covered earlier. Now that the device has said "Hey, listen to me" (wake-word detected), what happens next? This is the speech recognition pipeline: the multi-stage journey from raw audio to a meaningful transcript and then to an action. We’ll build on the team & toolchain coordination principles from "Working with AI Teams and Tools" so you know who does what and when.
Quick roadmap (so you don't get lost mid-rant)
- Capture & front-end (audio in)
- Preprocessing (clean-up, voice activity)
- Feature extraction (numbers, not emojis)
- Acoustic + Language modeling (brains)
- Decoding & post-processing (polish and punctuation)
- NLU / intent -> action
- Evaluation, deployment, and team responsibilities
1) Capture & Front-end: Where the story begins
Analogy: imagine a very needy microphone in a crowded café trying to hear your friend telling the secret recipe.
- Multi-mic arrays & beamforming: Focus the listener. Smart speakers use arrays to amplify directionally. Cars use microphone arrays to isolate driver commands amid road noise.
- Sampling rate & bit depth: fidelity matters. Speech systems commonly run at 16 kHz, well below the 44.1 kHz used for music; higher rates buy little recognition accuracy but cost bandwidth and compute.
Questions: How far is the speaker from the device? How noisy is the environment? These drive hardware choices and pre-processing needs.
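To make beamforming concrete, here is a minimal delay-and-sum sketch: each microphone channel is shifted by its known arrival delay toward the look direction, then the channels are averaged so the aligned voice adds up while uncorrelated noise partially cancels. Everything here (the two-mic setup, the 3-sample delay, the noise levels) is a toy assumption for illustration; real arrays estimate delays from geometry or adaptively.

```python
import numpy as np

def delay_and_sum(mics: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Align each mic channel by its integer sample delay, then average.

    mics: (n_mics, n_samples) array of synchronized recordings.
    delays: per-mic arrival delay in samples for the desired look direction.
    """
    n_mics, n_samples = mics.shape
    out = np.zeros(n_samples)
    for channel, d in zip(mics, delays):
        out += np.roll(channel, -int(d))  # advance the channel to cancel its delay
    return out / n_mics

# Toy demo: the "voice" reaches mic 1 three samples after mic 0.
rng = np.random.default_rng(0)
voice = rng.standard_normal(1000)
mic0 = voice + 0.5 * rng.standard_normal(1000)             # uncorrelated noise
mic1 = np.roll(voice, 3) + 0.5 * rng.standard_normal(1000)
steered = delay_and_sum(np.stack([mic0, mic1]), delays=np.array([0, 3]))
```

After steering, the output correlates with the clean voice more strongly than any single mic does, which is exactly the SNR boost the pipeline is after.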
2) Preprocessing: Clean-up crew
- Voice Activity Detection (VAD): Trim the silent bits. Keeps models from choking on dead air.
- Noise reduction & dereverberation: Beamforming + adaptive filters. The goal: give the model something intelligible.
- Normalizing levels: So the loud aunt and whispery sibling both get handled.
Why it matters: Garbage in, garbage out. Bad preproc means higher WER (Word Error Rate) and cranky users.
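The simplest flavor of VAD is energy thresholding: chop the waveform into frames and flag any frame whose RMS energy clears a threshold. The frame length and threshold below are illustrative assumptions; production systems typically use a small learned model instead.

```python
import numpy as np

def energy_vad(wave: np.ndarray, frame_len: int = 400, threshold: float = 0.01) -> np.ndarray:
    """Return one boolean per frame: True where frame RMS energy exceeds threshold."""
    n_frames = len(wave) // frame_len
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold

# Toy demo at 16 kHz: half a second of silence followed by half a second of tone.
sr = 16_000
silence = np.zeros(sr // 2)
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr)
flags = energy_vad(np.concatenate([silence, tone]))
# First half of the frames are silent, second half are "speech".
```

Energy-only VAD fails exactly where it matters most (whispery siblings, loud cafés), which is why the threshold is the first thing a data team ends up tuning per device.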
3) Feature extraction: Turning audio into math that the model eats
Historically:
- MFCCs (Mel-Frequency Cepstral Coefficients) — classic, compact features inspired by human hearing
- Filterbanks / log-mel spectrograms — richer, favored in deep-learning pipelines
Code-ish pseudocode:
wave = load_audio(file)              # raw PCM samples
wave = resample(wave, 16_000)        # standardize to 16 kHz for speech
spectrogram = abs(stft(wave)) ** 2   # short-time power spectrum
mel = mel_filterbank(spectrogram)    # pool frequency bins onto the mel scale
log_mel = log(mel + epsilon)         # compress dynamic range; epsilon avoids log(0)
This is where model choice (end-to-end vs hybrid) starts to matter.
4) The Brains: Acoustic Model + Language Model (or one big brain)
Two main approaches:
- Hybrid (traditional): acoustic model -> pronunciation lexicon -> decoding with a probabilistic language model (n-gram). Modular, and works well with limited data.
- End-to-end (E2E): Single neural model (CTC, LAS, RNN-T, Transformer-based) that maps audio directly to text. Fewer moving parts but needs tons of data.
Tradeoffs:
- Latency: Cars need low-latency responses; smart speakers often tolerate a small round trip but privacy demands push toward local inference.
- Data: End-to-end wants gargantuan datasets. Hybrid can leverage smaller corpora + lexicons.
- Robustness: Hybrid systems are sometimes better on rare words and out-of-domain speech; well-trained E2E models now match or beat them.
5) Decoding & Post-processing: Making the output human-friendly
- Beam search / decoding stitches acoustic and language cues into plausible text.
- Post-processing: punctuation insertion, capitalization, number normalization, profanity filtering.
- Error correction: small rescoring models, contextual biasing (favor device-specific command phrases).
Imagine a producer running the transcript through a stylist before letting it onto the stage.
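For a flavor of what "decoding" means in a CTC-trained E2E model, here is the simplest case, greedy best-path decoding: take the most likely symbol per frame, collapse consecutive repeats, then drop the blank symbol. The string-based blank and character labels are simplifying assumptions; real systems operate on token indices and usually use beam search rather than greedy decoding.

```python
BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_greedy_decode(frame_labels: list[str]) -> str:
    """Collapse repeated frame labels, then remove blanks (CTC best-path decoding)."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:  # collapse runs, then drop blanks
            out.append(label)
        prev = label
    return "".join(out)

# Frame-level outputs "hh_ee_ll_lo" decode to "hello";
# the blank between the two l's is what keeps them from collapsing into one.
print(ctc_greedy_decode(list("hh_ee_ll_lo")))
```

Beam search generalizes this by keeping several candidate collapses alive and rescoring them with a language model — that rescoring step is also where contextual biasing hooks in.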
6) NLU / Intent extraction -> Action
Raw text becomes intent. This is where the system asks: is it a music request, a search query, or a car HVAC command?
- Slot-filling: extract parameters like artist name, destination address
- Dialog manager: decide follow-up questions (“Do you mean ‘Starbucks on 5th’?”)
Remember: wake-word just opened the channel. Intent detection closes the loop.
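A bare-bones way to see intent classification plus slot-filling in one place is pattern matching with named capture groups. The two intents and their regexes below are hypothetical examples invented for illustration; a production NLU stack learns this mapping from labeled utterances instead of hand-written rules.

```python
import re

# Hypothetical patterns for two intents (illustration only).
PATTERNS = {
    "play_music": re.compile(r"play (?P<artist>.+)", re.IGNORECASE),
    "set_hvac": re.compile(r"set (?:the )?temperature to (?P<degrees>\d+)", re.IGNORECASE),
}

def parse_intent(text: str):
    """Return (intent, slots) for the first matching pattern, or (None, {})."""
    for intent, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            return intent, match.groupdict()
    return None, {}

print(parse_intent("play Adele"))                 # music request with an artist slot
print(parse_intent("set the temperature to 72"))  # car HVAC command with a degrees slot
```

The dialog manager's follow-up question kicks in exactly when a required slot comes back empty or ambiguous.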
7) Evaluation, deployment & team roles (tie to "Working with AI Teams and Tools")
Metrics to watch:
- WER/CER (Word/Character Error Rate)
- Latency (user-perceived response time)
- Real-time factor (RTF) — compute speed relative to audio length
- Robustness tests: accents, overlapping speech, environment noise, adversarial triggers
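WER is just word-level Levenshtein edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch, using the classic dynamic-programming recurrence:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("turn on the lights", "turn off the light"))  # 2 substitutions / 4 words = 0.5
```

Note that WER can exceed 1.0 when the hypothesis hallucinates extra words — one reason to watch it alongside latency and RTF rather than alone.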
Team responsibilities (short and sweet):
- Product Manager: defines UX/requirements (latency budget, privacy constraints)
- Data Engineers: ingest and pipeline audio, anonymize, store
- Annotators / Labeling Team: produce transcripts & semantic labels
- ML Engineers: build/train models, choose architectures (hybrid vs E2E)
- MLOps / SRE: CI/CD, deployment, monitoring (drift, performance)
- Security/Privacy Officer: ensures on-device policies, consent, GDPR compliance
Toolchain examples: audio ingestion -> labeling UI -> training infra (GPU/TPU) -> model serving (on-device runtime, server API) -> monitoring (WER drift, latency)
Smart Speaker vs Self-Driving Car — Quick Comparison
| Concern | Smart Speaker | Self-Driving Car |
|---|---|---|
| Primary environment | Home (reverberant, familiar voices) | Noisy, multi-source road environment |
| Latency tolerance | Moderate | Very low |
| Privacy trend | On-device favored | Local often required, backup cloud for complex tasks |
| Wake word | Crucial | Usually explicit button or close-talk mic |
| Vocabulary | Broad consumer queries | Domain-specific (navigation, safety-critical) |
Common misunderstandings (and why people keep getting them wrong)
- People assume a single giant model solves everything. Nope. It's a pipeline with many components; improving one part doesn't always fix user experience.
- "More data fixes it." More diverse and labeled data helps, but you also need good preproc, modeling choices, and domain adaptation.
Engaging question: If you could sacrifice either 50% of latency or 50% of WER for a safety-critical car command system, which would you choose? (Answer: depends on the command — “stop” requires low latency; “navigate to” benefits from lower WER.)
Closing — TL;DR + Takeaways
- Speech recognition is a system, not a single model. Each stage (capture → preprocess → features → model → decode → NLU) matters.
- Product constraints shape architecture. Privacy, latency, compute, and vocabulary determine hybrid vs end-to-end and on-device vs cloud choices.
- Teams must coordinate. Data, ML, DevOps, and product must work together with shared metrics and toolchains (we talked about this in the previous module — use those roles).
- Test like your life depends on it. For cars, it kind of does.
Final unhinged-but-true thought: You can have the fanciest transformer with the best WER on paper, but if your mic sucks and your VAD chops off half the sentence, it’s still going to sound like your device has stage fright. Build the whole pipeline, and build it with humans in mind.
Go forth and listen closely — literally and metaphorically. Your next step: map your product's constraints to a concrete pipeline and sketch a short checklist for data collection (noise profiles, accents, edge cases). Then bring that checklist to your data team and make glorious, annotated noise.