Data Science and AI
Exploring the intersection of data science and AI technologies.
Data Collection Methods
Where does the data actually come from? (Spoiler: not from a magic USB)
You already know what Data Science is (we covered that) and you've wrestled with NLP's messier corners — biased corpora, slippery tokenization, and speech recognition models that choke on accents. Now we ask the practical, slightly less glamorous question: how do we get the raw material that makes models sing (or at least not embarrass us in production)? This lesson is the bridge from "I know what a model is" to "I can feed it believable, useful, legal data."
TL;DR (aka the part you can bring to standups)
- Data collection method shapes everything: representativeness, bias, label quality, cost, and legal risk.
- Match method to problem: you wouldn’t use Twitter scraping as the backbone for clinical diagnostic data. Unless you like lawsuits.
- For NLP and speech work: consider diversity (dialects, devices, noise), annotation clarity, and consent.
What is "Data Collection" (brief refresher)
Data collection = deliberate steps to acquire observations that will later be cleaned, labeled, and fed into models. It's not just grabbing stuff off the internet and calling it a day. The method determines your dataset's strengths and weaknesses.
Expert take: "Garbage in is a model's favorite excuse." If your data is biased or low-quality, your model will be charmingly terrible in predictable ways.
Major Data Collection Methods: What they are, when to use them, and a meme-worthy take
Primary vs Secondary
- Primary: you collect it yourself (surveys, experiments, sensors). Pros: control. Cons: expensive.
- Secondary: you use existing data (public datasets, APIs). Pros: fast. Cons: possibly misaligned with your problem.
Observational / Passive Logging
- Examples: server logs, clickstreams, mobile telemetry.
- Great for real user behavior. Watch for sampling biases (active users ≠ average users).
Surveys & Interviews
- Structured data, great for subjective phenomena (preferences, satisfaction).
- Be careful: question phrasing = world-shaper. Avoid leading questions unless you want leading results.
Experiments / A/B Tests
- Best for causal inference. If you want to know whether change A causes outcome B, randomize.
- Requires design, monitoring, and sometimes ethics approval.
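Randomization is the whole point of an A/B test. A common trick is deterministic bucket assignment by hashing a user ID, so the same user always sees the same variant across sessions. A minimal sketch (the hash scheme and names here are illustrative, not from any specific experimentation framework):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant by hashing their ID.

    The same user always lands in the same bucket, keeping the
    experience consistent across sessions; different experiment names
    reshuffle users independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Because assignment is a pure function of `(experiment, user_id)`, you don't need to store a lookup table, and bucketing stays roughly uniform across users.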
Web Scraping & Public APIs
- Fast and massive, but messy and legally gray. Respect robots.txt, terms of service, and copyright.
- Example (Python):

```python
import requests  # third-party: pip install requests

# Hypothetical endpoint -- swap in the real API for your project.
r = requests.get("https://api.example.com/sentences", timeout=10)
r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
data = r.json()
```
Crowdsourcing & Human Annotation
- Platforms like Mechanical Turk, Labelbox, or Prolific. Essential for NLP labels (intent, entity spans, coreference).
- Requires clear guidelines and quality checks (gold-standard checks, consensus, inter-annotator agreement).
Sensors & IoT (time-series)
- For audio (speech recognition), accelerometers, environmental sensors.
- Ensure synchronized timestamps and calibration.
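The timestamp-alignment step can be sketched in a few lines. This is a hedged illustration, assuming each stream is a sorted list of `(timestamp_seconds, value)` pairs and the function name is made up:

```python
from bisect import bisect_left

def align_nearest(reference, other, max_gap=0.05):
    """Pair each reference sample with the nearest-in-time sample from
    another stream, dropping pairs further apart than max_gap seconds.

    Both inputs are lists of (timestamp_seconds, value), sorted by time.
    """
    other_ts = [t for t, _ in other]
    pairs = []
    for t, v in reference:
        i = bisect_left(other_ts, t)
        # candidate neighbours: the sample just before and just after t
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other)]
        j = min(candidates, key=lambda k: abs(other_ts[k] - t))
        if abs(other_ts[j] - t) <= max_gap:
            pairs.append((t, v, other[j][1]))
    return pairs
```

The `max_gap` threshold is what catches clock drift: unsynchronized streams silently produce fewer matched pairs instead of wrongly paired samples.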
Third-party / Purchased Data
- Quick expansion, but bring a lawyer. Check licensing and provenance.
Simulated / Synthetic Data & Augmentation
- Use TTS for synthetic speech to augment low-resource accents; use data augmentation for images/text.
- Good for addressing class imbalance, but synthetic distribution mismatch is a real thing.
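One crude but common text-augmentation trick is random word dropout. A sketch of the idea (function name and parameters are invented for illustration):

```python
import random

def augment_dropout(sentence: str, p: float = 0.1, seed: int = 0) -> str:
    """Randomly drop each word with probability p to generate noisy
    variants of a sentence -- a cheap way to pad out a small training set.
    Falls back to the original sentence if everything gets dropped."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else sentence
```

Vary the seed to generate multiple distinct variants per sentence; just remember the caveat above, since heavily perturbed text may no longer match the distribution your model sees in production.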
Quick comparison (super condensed)
| Method | Best for | Speed / Cost | Main risk |
|---|---|---|---|
| Observational logs | Real behavior | Fast / Low cost | Bias (who's logged) |
| Surveys | Attitudes, labels | Medium / Medium cost | Question bias |
| APIs / Scraping | Large corpora | Fast / Low cost | Legal + noise |
| Crowdsourcing | Labels at scale | Fast / Variable cost | Quality control |
| Experiments | Causality | Slow / High cost | Ethical/operational complexity |
| Synthetic | Augmentation | Fast / Medium cost | Domain mismatch |
Sampling, bias, and the dark arts of who you forgot to include
- Types of sampling: random, stratified, convenience, systematic. Don’t default to convenience if representativeness matters.
- Bias sources: selection bias, measurement bias, survivorship bias, reporting bias.
- NLP-specific trap: training a speech recognizer on studio-recorded English and expecting it to handle subway announcements. That’s called optimism.
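Stratified sampling in miniature: this illustrative sketch assumes records are dicts and the stratum key (here `dialect`) is whatever coverage dimension you care about, such as device or noise condition:

```python
import random
from collections import defaultdict

def stratified_sample(records, key, n_per_stratum, seed=0):
    """Draw up to n_per_stratum records from each stratum (e.g. dialect,
    device type) so minority groups aren't drowned out by the majority."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = min(n_per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample
```

Note the cap at the stratum size: if a group has fewer records than you want, that is a collection problem to fix upstream, not something sampling can paper over.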
Red flags to watch for:
- No documentation for how data was collected.
- Over-representation of a demographic group.
- Labeler disagreement not measured.
Labeling & Annotation — the unsung hero (or villain)
- Create a labeling guide with examples and edge cases.
- Measure agreement (Cohen's kappa, Krippendorff's alpha). If annotators disagree, the task might be ill-defined.
- Use active learning to prioritize labeling high-value examples — saves time and money.
- Tools: Labelbox, Prodigy, Doccano, custom UIs.
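Cohen's kappa is simple enough to compute without a library. A self-contained sketch for the two-annotator case: observed agreement corrected for the agreement you'd expect by chance:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    1.0 = perfect agreement; 0.0 = no better than chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

If kappa comes out low, resist the urge to blame the annotators first; more often the labeling guide is ambiguous or the task itself is ill-defined.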
Privacy, legality, and ethics (yes, you must care)
- Get consent where required. Anonymize PII. Store minimal necessary information.
- Consider differential privacy or aggregation for sensitive analytics.
- For scraping: respect copyright and terms of service.
- For speech: record with explicit consent and note device/environment.
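A minimal sketch of pseudonymizing identifiers with a keyed hash (the salt value and function name are placeholders). Important hedge: this is pseudonymization, not anonymization; anyone holding the salt can re-link records:

```python
import hashlib
import hmac

# Placeholder secret -- in practice, store this in a secrets manager
# and rotate it according to your retention policy.
SECRET_SALT = b"rotate-me-and-store-me-securely"

def pseudonymize(user_id: str) -> str:
    """Replace a raw identifier with a keyed hash (HMAC-SHA256) so
    records can still be joined across tables without storing the
    original ID in the analytics dataset."""
    return hmac.new(SECRET_SALT, user_id.encode(), hashlib.sha256).hexdigest()[:16]
```

The keyed construction matters: a plain unsalted hash of a small ID space (emails, phone numbers) can be reversed by brute force.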
Practical mini-checklist (copy-paste into your project plan)
- Define the prediction task and required data modalities.
- Choose collection methods aligned to coverage needs (demographics, noise conditions, devices).
- Design sampling strategy and pilot collection (small batch first).
- Create annotation guidelines and pilot labels; measure inter-annotator agreement.
- Run quality checks, de-duplicate, and document provenance.
- Ensure legal/ethical compliance and secure storage.
- Monitor dataset drift and maintain data pipelines.
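The de-duplication step in the checklist can be as simple as hashing a normalized form of each record. An illustrative sketch for text data:

```python
import hashlib

def dedupe(texts):
    """Drop near-verbatim duplicates by hashing a normalized form
    (lowercased, whitespace-collapsed) of each record, keeping the
    first occurrence in the original order."""
    seen, unique = set(), []
    for t in texts:
        key = hashlib.md5(" ".join(t.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```

This catches casing and whitespace variants; fuzzier duplicates (paraphrases, truncations) need similarity-based methods like MinHash, which are beyond this sketch.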
Closing: Key takeaways and a final mic drop
- Method matters more than you think: your model's behavior is a mirror of how you collected the data.
- For NLP and speech, diversity and clear annotation are non-negotiable.
- Document everything. If you can't explain how the data was collected, you can't defend the model.
Final thought: Good data collection is boring in the best way — rules, checks, and documentation. Bad data collection is exciting — people say "we scraped the web!" — until the model fails in production and suddenly you're very interested in ethics. Be the boring hero.
Version notes: This piece builds on our previous discussions about what Data Science is and the tricky realities of NLP (like annotation and speech diversity). Apply these methods thoughtfully and your models will repay you with predictable, testable behavior instead of surprises.