Data Science and AI
Exploring the intersection of data science and AI technologies.
Data Collection Methods
Where does the data actually come from? (Spoiler: not from a magic USB)
You already know what Data Science is (we covered that) and you've wrestled with NLP's messier corners — biased corpora, slippery tokenization, and speech recognition models that choke on accents. Now we ask the practical, slightly less glamorous question: how do we get the raw material that makes models sing (or at least not embarrass us in production)? This lesson is the bridge from "I know what a model is" to "I can feed it believable, useful, legal data."
TL;DR (aka the part you can bring to standups)
- Data collection method shapes everything: representativeness, bias, label quality, cost, and legal risk.
- Match method to problem: you wouldn’t use Twitter scraping as the backbone for clinical diagnostic data. Unless you like lawsuits.
- For NLP and speech work: consider diversity (dialects, devices, noise), annotation clarity, and consent.
What is "Data Collection" (brief refresher)
Data collection = deliberate steps to acquire observations that will later be cleaned, labeled, and fed into models. It's not just grabbing stuff off the internet and calling it a day. The method determines your dataset's strengths and weaknesses.
Expert take: "Garbage in is a model's favorite excuse." If your data is biased or low-quality, your model will be charmingly terrible in predictable ways.
Major Data Collection Methods: What they are, when to use them, and a meme-worthy take
Primary vs Secondary
- Primary: you collect it yourself (surveys, experiments, sensors). Pros: control. Cons: expensive.
- Secondary: you use existing data (public datasets, APIs). Pros: fast. Cons: possibly misaligned with your problem.
Observational / Passive Logging
- Examples: server logs, clickstreams, mobile telemetry.
- Great for real user behavior. Watch for sampling biases (active users ≠ average users).
Surveys & Interviews
- Structured data, great for subjective phenomena (preferences, satisfaction).
- Be careful: question phrasing = world-shaper. Avoid leading questions unless you want leading results.
Experiments / A/B Tests
- Best for causal inference. If you want to know whether change A causes outcome B, randomize.
- Requires design, monitoring, and sometimes ethics approval.
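Randomization is the whole point of an A/B test. A common trick is deterministic bucket assignment by hashing a user ID, so the same user always sees the same variant across sessions. A minimal sketch (the hash scheme and names here are illustrative, not from any specific experimentation framework):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant by hashing their ID.

    The same user always lands in the same bucket, keeping the
    experience consistent across sessions; different experiment names
    reshuffle users independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Because assignment is a pure function of `(experiment, user_id)`, you don't need to store a lookup table, and bucketing stays roughly uniform across users.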
Web Scraping & Public APIs
- Fast and massive, but messy and legally gray. Respect robots.txt, terms of service, and copyright.
- Example (Python):

```python
import requests  # third-party: pip install requests

# Hypothetical endpoint -- swap in the real API for your project.
r = requests.get("https://api.example.com/sentences", timeout=10)
r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
data = r.json()
```
Crowdsourcing & Human Annotation
- Platforms like Mechanical Turk, Labelbox, or Prolific. Essential for NLP labels (intent, entity spans, coreference).
- Requires clear guidelines and quality checks (gold-standard checks, consensus, inter-annotator agreement).
Sensors & IoT (time-series)
- For audio (speech recognition), accelerometers, environmental sensors.
- Ensure synchronized timestamps and calibration.
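The timestamp-alignment step can be sketched in a few lines. This is a hedged illustration, assuming each stream is a sorted list of `(timestamp_seconds, value)` pairs and the function name is made up:

```python
from bisect import bisect_left

def align_nearest(reference, other, max_gap=0.05):
    """Pair each reference sample with the nearest-in-time sample from
    another stream, dropping pairs further apart than max_gap seconds.

    Both inputs are lists of (timestamp_seconds, value), sorted by time.
    """
    other_ts = [t for t, _ in other]
    pairs = []
    for t, v in reference:
        i = bisect_left(other_ts, t)
        # candidate neighbours: the sample just before and just after t
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other)]
        j = min(candidates, key=lambda k: abs(other_ts[k] - t))
        if abs(other_ts[j] - t) <= max_gap:
            pairs.append((t, v, other[j][1]))
    return pairs
```

The `max_gap` threshold is what catches clock drift: unsynchronized streams silently produce fewer matched pairs instead of wrongly paired samples.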
Third-party / Purchased Data
- Quick expansion, but bring a lawyer. Check licensing and provenance.
Simulated / Synthetic Data & Augmentation
- Use TTS for synthetic speech to augment low-resource accents; use data augmentation for images/text.
- Good for addressing class imbalance, but synthetic distribution mismatch is a real thing.
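One crude but common text-augmentation trick is random word dropout. A sketch of the idea (function name and parameters are invented for illustration):

```python
import random

def augment_dropout(sentence: str, p: float = 0.1, seed: int = 0) -> str:
    """Randomly drop each word with probability p to generate noisy
    variants of a sentence -- a cheap way to pad out a small training set.
    Falls back to the original sentence if everything gets dropped."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else sentence
```

Vary the seed to generate multiple distinct variants per sentence; just remember the caveat above, since heavily perturbed text may no longer match the distribution your model sees in production.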
Quick comparison (super condensed)
| Method | Best for | Speed / Cost | Main risk |
|---|---|---|---|
| Observational logs | Real behavior | Fast / Low cost | Bias (who's logged) |
| Surveys | Attitudes, labels | Medium / Medium cost | Question bias |
| APIs / Scraping | Large corpora | Fast / Low cost | Legal + noise |
| Crowdsourcing | Labels at scale | Fast / Variable cost | Quality control |
| Experiments | Causality | Slow / High cost | Ethical/operational complexity |
| Synthetic | Augmentation | Fast / Medium cost | Domain mismatch |
Sampling, bias, and the dark arts of who you forgot to include
- Types of sampling: random, stratified, convenience, systematic. Don’t default to convenience if representativeness matters.
- Bias sources: selection bias, measurement bias, survivorship bias, reporting bias.
- NLP-specific trap: training a speech recognizer on studio-recorded English and expecting it to handle subway announcements. That’s called optimism.
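Stratified sampling in miniature: this illustrative sketch assumes records are dicts and the stratum key (here `dialect`) is whatever coverage dimension you care about, such as device or noise condition:

```python
import random
from collections import defaultdict

def stratified_sample(records, key, n_per_stratum, seed=0):
    """Draw up to n_per_stratum records from each stratum (e.g. dialect,
    device type) so minority groups aren't drowned out by the majority."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = min(n_per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample
```

Note the cap at the stratum size: if a group has fewer records than you want, that is a collection problem to fix upstream, not something sampling can paper over.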
Red flags to watch for:
- No documentation for how data was collected.
- Over-representation of a demographic group.
- Labeler disagreement not measured.
Labeling & Annotation — the unsung hero (or villain)
- Create a labeling guide with examples and edge cases.
- Measure agreement (Cohen's kappa, Krippendorff's alpha). If annotators disagree, the task might be ill-defined.
- Use active learning to prioritize labeling high-value examples — saves time and money.
- Tools: Labelbox, Prodigy, Doccano, custom UIs.
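Cohen's kappa is simple enough to compute without a library. A self-contained sketch for the two-annotator case: observed agreement corrected for the agreement you'd expect by chance:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    1.0 = perfect agreement; 0.0 = no better than chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

If kappa comes out low, resist the urge to blame the annotators first; more often the labeling guide is ambiguous or the task itself is ill-defined.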
Privacy, legality, and ethics (yes, you must care)
- Get consent where required. Anonymize PII. Store minimal necessary information.
- Consider differential privacy or aggregation for sensitive analytics.
- For scraping: respect copyright and terms of service.
- For speech: record with explicit consent and note device/environment.
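A minimal sketch of pseudonymizing identifiers with a keyed hash (the salt value and function name are placeholders). Important hedge: this is pseudonymization, not anonymization; anyone holding the salt can re-link records:

```python
import hashlib
import hmac

# Placeholder secret -- in practice, store this in a secrets manager
# and rotate it according to your retention policy.
SECRET_SALT = b"rotate-me-and-store-me-securely"

def pseudonymize(user_id: str) -> str:
    """Replace a raw identifier with a keyed hash (HMAC-SHA256) so
    records can still be joined across tables without storing the
    original ID in the analytics dataset."""
    return hmac.new(SECRET_SALT, user_id.encode(), hashlib.sha256).hexdigest()[:16]
```

The keyed construction matters: a plain unsalted hash of a small ID space (emails, phone numbers) can be reversed by brute force.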
Practical mini-checklist (copy-paste into your project plan)
- Define the prediction task and required data modalities.
- Choose collection methods aligned to coverage needs (demographics, noise conditions, devices).
- Design sampling strategy and pilot collection (small batch first).
- Create annotation guidelines and pilot labels; measure inter-annotator agreement.
- Run quality checks, de-duplicate, and document provenance.
- Ensure legal/ethical compliance and secure storage.
- Monitor dataset drift and maintain data pipelines.
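The de-duplication step in the checklist can be as simple as hashing a normalized form of each record. An illustrative sketch for text data:

```python
import hashlib

def dedupe(texts):
    """Drop near-verbatim duplicates by hashing a normalized form
    (lowercased, whitespace-collapsed) of each record, keeping the
    first occurrence in the original order."""
    seen, unique = set(), []
    for t in texts:
        key = hashlib.md5(" ".join(t.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```

This catches casing and whitespace variants; fuzzier duplicates (paraphrases, truncations) need similarity-based methods like MinHash, which are beyond this sketch.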
Closing: Key takeaways and a final mic drop
- Method matters more than you think: your model's behavior is a mirror of how you collected the data.
- For NLP and speech, diversity and clear annotation are non-negotiable.
- Document everything. If you can't explain how the data was collected, you can't defend the model.
Final thought: Good data collection is boring in the best way — rules, checks, and documentation. Bad data collection is exciting — people say "we scraped the web!" — until the model fails in production and suddenly you're very interested in ethics. Be the boring hero.
Version notes: This piece builds on our previous discussions about what Data Science is and the tricky realities of NLP (like annotation and speech diversity). Apply these methods thoughtfully and your models will repay you with predictable, testable behavior instead of surprises.