Computer Vision Techniques
Learn about computer vision, a field of AI that enables machines to interpret and process visual information.
Image Processing — The Good, the Bad, and the Pixels
"If NLP taught machines to read our messy sentences, image processing teaches them to see our messy world." — Probably me, dramatically.
You're coming off the NLP module where we taught models to wrangle words, connotations, and the occasional emoji. Now flip the script: instead of tokens and embeddings, we have pixels, color channels, and the eternal question — how do we turn a messy photo into useful data for a model? That's image processing: the preprocessing and low-level ops that make computer vision possible.
Why this matters (quick, no fluff)
- Machine learning models don't like noise, uneven lighting, or JPG artifacts. They especially hate inconsistency.
- Image processing standardizes, denoises, and extracts structure so higher-level CV methods (like object detection or segmentation) can actually work.
- Think of it as grooming raw visual data into something your model will swipe right on.
Big ideas at a glance
- Spatial vs Frequency domain — you can operate on pixels directly or transform to a domain where patterns are easier to manipulate.
- Filtering & kernels — small matrices that act like tiny, opinionated painters dragging their brush across the image.
- Thresholding & segmentation — deciding which pixels belong together (like telling drama kids to form a chorus line).
- Morphology — grow, shrink, clean up regions.
- Feature extraction — edges, corners, blobs: the primitives of visual understanding.
Spatial vs Frequency: Two ways of being messy-clean
| Domain | What you operate on | Good for | Analogy |
|---|---|---|---|
| Spatial | Pixel values | Local smoothing, sharpening, blurring | Brushing hair in small strokes |
| Frequency | Sine/cosine coefficients (FFT) | Removing periodic noise, analyzing textures | Saying "let's remove any wiggles at 60Hz" like a sound engineer |
Use a Fourier transform when patterns repeat and you want to filter by frequency. Use spatial filters for local neighborhood effects like blur or edge detection.
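To make that concrete, here's a minimal NumPy sketch that removes periodic stripe noise by zeroing high frequencies in the FFT. The image is synthetic and every size and variable name here is made up for illustration:

```python
import numpy as np

# Synthetic 64x64 image: a smooth gradient plus periodic stripe noise.
h, w = 64, 64
y, x = np.mgrid[0:h, 0:w]
clean = x / w                                   # smooth horizontal ramp
noise = 0.5 * np.sin(2 * np.pi * x * 16 / w)    # 16-cycle vertical stripes
img = clean + noise

# Move to the frequency domain and keep only the low frequencies.
F = np.fft.fftshift(np.fft.fft2(img))
cy, cx = h // 2, w // 2
mask = np.zeros(F.shape, dtype=bool)
mask[cy - 8:cy + 8, cx - 8:cx + 8] = True       # low-pass "keep" window
restored = np.real(np.fft.ifft2(np.fft.ifftshift(np.where(mask, F, 0))))

# The stripes live at frequency 16, outside the kept band, so they vanish;
# the ramp's energy is mostly low-frequency, so it survives (minus some ringing).
err_before = np.abs(img - clean).mean()
err_after = np.abs(restored - clean).mean()
```

This is exactly the "remove wiggles at 60Hz" move from the table: the stripes occupy one narrow frequency band, so deleting that band deletes the noise while barely touching the gradient.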
Filters and kernels — the tiny dictators
A kernel (or filter) is a small matrix that slides over the image and computes a weighted sum.
- Smoothing (Gaussian blur) — reduces noise, softens edges. Great before thresholding.
- Sharpening (Laplacian, unsharp mask) — emphasizes changes.
- Edge detection (Sobel, Prewitt, Canny) — finds boundaries. Canny is like the polite bouncer: smooths, finds gradients, then applies non-maximum suppression and hysteresis thresholding.
Pseudo-pseudocode (because you asked nicely):
for each pixel p in image:
    neighborhood = get_pixels_around(p, kernel_size)
    new_value = sum(neighborhood * kernel)
    output[p] = new_value
Question: Why does smoothing before edge detection help? Because otherwise noise looks like a bunch of tiny edges and your model gets existential dread.
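Here is the same sliding-kernel idea as real (if naive) NumPy, applied with a Sobel kernel to a toy image. The loop and names are illustrative, not from any library:

```python
import numpy as np

def convolve2d(img, kernel):
    """Naive sliding-window filter: weighted sum per neighborhood."""
    kh, kw = kernel.shape
    pad = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image with one vertical edge: left half dark, right half bright.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Sobel-x kernel: responds to horizontal intensity changes.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
edges = np.abs(convolve2d(img, sobel_x))
# Strong response hugs the boundary (columns 3-4); flat regions read zero.
```

Pedantic footnote: this loop is technically cross-correlation (the kernel isn't flipped), which is what most CV libraries compute anyway.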
Thresholding & segmentation — making choices like a judge on a talent show
- Global thresholding: one value for the whole image (Otsu's method finds an optimal threshold automatically).
- Adaptive thresholding: threshold varies by local neighborhood (handy for uneven lighting).
- Morphological ops: "erode" removes small bits, "dilate" grows regions. Combine them: "open" cleans small specks; "close" fills tiny holes.
Real-world example: OCR. You threshold a scanned page to get crisp black text on white background, then run morphological ops to close tiny ink gaps so characters are recognized cleanly.
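For the curious, Otsu's method fits in a dozen lines of NumPy. A sketch on a toy "scanned page" (the image and intensity values are made up for illustration):

```python
import numpy as np

def otsu_threshold(gray):
    """Scan all thresholds; keep the one maximizing between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0 = hist[:t].sum() / total              # background weight
        w1 = 1.0 - w0                            # foreground weight
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (np.arange(t) * hist[:t]).sum() / hist[:t].sum()
        mu1 = (np.arange(t, 256) * hist[t:]).sum() / hist[t:].sum()
        var = w0 * w1 * (mu0 - mu1) ** 2         # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Toy "scanned page": dark ink (~30) on a bright background (~220).
page = np.full((32, 32), 220, dtype=np.uint8)
page[8:12, 4:28] = 30                            # one stroke of ink
t = otsu_threshold(page)
binary = page < t                                # True where there's ink
```

Because the toy histogram is cleanly bimodal, any threshold between the two peaks separates ink from page; Otsu finds one automatically by maximizing how far apart the two classes sit.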
Feature extraction — edges, corners, and the cursed SIFT
Features are compact descriptions of important points.
- Edges: where intensity changes (Sobel, Canny).
- Corners: Harris corner detector — finds interest points (useful for tracking/stitching).
- Blobs: Laplacian of Gaussian / Difference of Gaussian (DoG) — detects regions of interest.
Local descriptors like SIFT or ORB summarize a patch so you can match it across images (stitch panoramas, track objects). Modern pipelines often combine these with learned features from CNNs.
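To see why corners are detectable at all, here is a minimal Harris response sketched in plain NumPy. The 3x3 window and k = 0.04 are the usual textbook defaults; the toy image is ours:

```python
import numpy as np

def harris_response(gray, k=0.04):
    """Minimal Harris: gradient products summed over a 3x3 window."""
    Iy, Ix = np.gradient(gray.astype(float))      # axis 0 = rows = y

    def box3(a):                                  # 3x3 box-filter sum
        p = np.pad(a, 1, mode="edge")
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2                   # R > 0 suggests a corner

# Bright square on a dark background: four corners, four strong responses.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
R = harris_response(img)
peak = np.unravel_index(np.argmax(R), R.shape)    # lands on a corner
```

The intuition: along an edge, intensity changes in only one direction, so the determinant term stays near zero; at a corner it changes in both, and the response spikes.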
Transforms & geometry: rotate, scale, and pretend nothing changed
- Affine transforms — rotation, translation, scale, shear. Preserve lines and parallelism.
- Perspective transforms — map between planes (useful for rectifying a picture of a whiteboard).
- Image pyramids — multi-scale representations (Gaussian and Laplacian pyramids). Useful for detecting objects at different sizes or building coarse-to-fine algorithms.
Mini quiz: Why use pyramids for a detector? Because something that looks like a cat at 100px might be a dog at 400px — scale matters.
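An affine transform is just a 2x3 matrix applied to points in homogeneous form. A small NumPy sketch (the angle and translation are chosen arbitrarily):

```python
import numpy as np

# 2x3 affine matrix: rotate 90 degrees counterclockwise, then shift x by 5.
theta = np.pi / 2
A = np.array([[np.cos(theta), -np.sin(theta), 5.0],
              [np.sin(theta),  np.cos(theta), 0.0]])

# Points in homogeneous form (x, y, 1), one column per point.
pts = np.array([[1.0, 2.0, 1.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 1.0]])
warped = A @ pts           # 2xN: the transformed (x, y) coordinates

# (1,0) and (2,0) sit on a horizontal line; their images (5,1) and (5,2)
# sit on a vertical one: rotated and shifted, but still a straight line.
```

Warping a whole image works the same way, just applied (inversely) to every pixel coordinate; perspective transforms add a third matrix row so straight lines survive but parallelism need not.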
Practical pipeline (example): From raw photo to model-ready input
- Resize to a standard size (consistent input dims).
- Convert color space if needed (RGB -> grayscale or HSV).
- Denoise (Gaussian or median filter).
- Normalize pixel values (0–1 or mean subtraction).
- (Optional) Augment: rotate, flip, crop, color jitter — helps generalization.
- Extract features or feed to a CNN.
Code snippet (conceptual, shown here with real OpenCV calls):
import cv2

img = cv2.imread('photo.jpg')                  # loads as BGR, uint8
img = cv2.resize(img, (224, 224))              # consistent input dims
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (5, 5), 0)       # 5x5 Gaussian kernel
norm = (blur - blur.mean()) / blur.std()       # zero mean, unit variance
# feed norm into model
Where this meets NLP (because you just came from that party)
- In NLP we normalize text (lowercase, remove punctuation); in vision we normalize pixels and remove noise.
- Tokenization in NLP ≈ feature detection in vision (both chop raw input into meaningful bits).
- Data augmentation in vision ≈ data augmentation in NLP (synonym replacement, back-translation). Both help models avoid overfitting to freaky examples.
Connecting disciplines helps: the pre-processing mindset is the same — make the model's life easier.
Common pitfalls (aka what students mess up)
- Over-blurring and losing detail.
- Using global thresholding on unevenly lit images.
- Forgetting to normalize input, causing training instability.
- Relying on handcrafted features when a learned representation is warranted (though handcrafted features are still useful in small-data settings).
Closing — TL;DR and next moves
- Image processing is the toolkit that turns messy pixel soup into structured inputs: smoothing, filtering, thresholds, morphology, transforms, and features.
- It's the bridge between raw images and higher-level CV tasks like detection and segmentation.
Big takeaway: invest time learning the basics — they're cheap, fast, interpretable, and often solve problems that would otherwise require heavy ML models.
Next logical step in the course: apply these preprocessing steps to a small dataset and then compare performance of a simple classifier on raw vs processed images. Spoiler alert: the processed pipeline usually wins.
"Polish the pixels, then teach the network. Don't expect the network to polish for you."
Version note: This builds on our NLP lessons by showing how preprocessing and feature extraction play the same stabilizing role across modalities. Now go play with OpenCV, and make those pixels behave.