Data Sources, Engineering, and Deployment
Acquire data from files, web, and databases; then test, package, version, and deploy reliable services.
REST APIs and requests — fetch data like a pro (Python)
You already learned web scraping and parsing JSON/XML — now imagine skipping the scraping drama and getting structured data straight from the source.
Why REST APIs matter for data science and engineering
APIs are the clean, official way to get data. While web scraping is a toolbox for when the gatekeeper refuses to cooperate, REST APIs are the golden door: documented, versioned, and usually predictable. In modern data pipelines and ML deployments, you'll use REST APIs to:
- Ingest datasets (e.g., financial ticks, weather, social media)
- Talk to microservices (feature stores, preprocessing, inference endpoints)
- Push results (logging, dashboards, third-party services)
If you enjoyed parsing JSON/XML before, think of REST as “JSON-on-demand” most of the time. Your parsing skills from earlier lessons are going to be used constantly.
Quick REST & HTTP refresher (in plain English)
- Endpoint: a URL that represents a resource or action, e.g. https://api.example.com/v1/users
- HTTP methods: GET (read, idempotent), POST (create/submit), PUT/PATCH (update), DELETE (remove)
- Status codes: 200 OK, 201 Created, 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 429 Too Many Requests, 500 Server Error
- Content-Type: tells how data is encoded (application/json, multipart/form-data, image/png)
- Headers, params, body: headers carry metadata (auth, content-type), params are query-string filters, the body holds JSON/form data for POST/PUT
Remember: GET = ask nicely for data. POST = hand over data or ask the server to run something.
The requests library — your everyday fetch tool
Install: pip install requests
Basic GET
import requests
resp = requests.get('https://api.example.com/data', params={'q': 'nyc', 'limit': 50})
resp.raise_for_status() # raises HTTPError on bad status
data = resp.json() # parsed JSON (tie-in: you learned JSON parsing earlier)
POST with JSON
payload = {'text': 'analyze this', 'lang': 'en'}
resp = requests.post('https://api.example.com/analyze', json=payload)
print(resp.status_code, resp.json())
Use a session for connection pooling and default headers
s = requests.Session()
s.headers.update({'Authorization': 'Bearer YOUR_TOKEN', 'Accept': 'application/json'})
resp = s.get('https://api.example.com/me')
Authentication patterns
- API keys: Authorization: ApiKey xxxxxxxxx or ?api_key=... (some APIs accept query params, but prefer headers)
- Bearer tokens / OAuth: Authorization: Bearer <token> (common for user-scoped APIs)
- Basic auth: username/password in headers (rare for modern public APIs)
Tip: Never hardcode keys. Use environment variables or secret managers (especially in deployments).
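Following that tip, here is a minimal sketch of reading a key from an environment variable and turning it into a header. The variable name EXAMPLE_API_KEY and the Bearer scheme are illustrative; check your provider's docs for the exact header format.

```python
import os

def auth_headers(env_var: str = "EXAMPLE_API_KEY") -> dict:
    """Build an Authorization header from an environment variable,
    so the secret never appears in source code."""
    token = os.environ.get(env_var)
    if token is None:
        raise RuntimeError(f"Set {env_var} before running")
    return {"Authorization": f"Bearer {token}"}
```

Pass the result to s.headers.update(...) once per session rather than on every call.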
Pagination, rate limits, and polite scraping
APIs often split results into pages.
- Cursor-based pagination: the API returns a next cursor URL; follow it until next is null.
- Page/limit: you pass page and limit params.
Example cursor loop:
items = []
url = 'https://api.example.com/v1/items'
params = {'limit': 100}
while url:
    resp = s.get(url, params=params)
    resp.raise_for_status()
    body = resp.json()
    items.extend(body['data'])
    url = body.get('next')  # next is a full URL or None
    params = None           # the next URL already encodes the cursor and limit
Rate limiting: servers might return 429 or headers like X-RateLimit-Remaining. Implement exponential backoff and respect Retry-After.
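Before reaching for the adapter below, it helps to see the logic spelled out by hand. This is a minimal sketch of a GET with exponential backoff that honors Retry-After on 429 responses; the max_attempts default and the (3, 30) timeout are illustrative choices.

```python
import time

def get_with_backoff(session, url, max_attempts=5, **kwargs):
    """GET a URL, retrying on 429 with exponential backoff.

    Honors the server's Retry-After header when present; otherwise
    waits 1, 2, 4, ... seconds between attempts."""
    for attempt in range(max_attempts):
        resp = session.get(url, timeout=(3, 30), **kwargs)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Prefer the server's hint; fall back to exponential backoff.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

In production you would usually let the Retry adapter shown next do this for you, but the manual version makes the policy explicit and easy to customize.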
Example: retry with urllib3 Retry adapter
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
retry_strategy = Retry(
    total=5,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    # Caution: retrying POST can duplicate writes; drop it unless
    # the endpoint is idempotent.
    allowed_methods=["HEAD", "GET", "OPTIONS", "POST"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
s.mount("https://", adapter)
s.mount("http://", adapter)
Error handling, timeouts, and robustness
- Always set a timeout: requests.get(url, timeout=(3, 30)) means (connect, read) seconds
- Use resp.raise_for_status() or handle status codes explicitly
- For intermittent errors, prefer retries with exponential backoff
- Log responses and response bodies for debugging (avoid logging secrets)
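Putting those points together, here is a minimal sketch of explicit error handling around a JSON fetch. The helper name and the print-based logging are illustrative; in a real service you would use a logger.

```python
import requests

def fetch_json(url, timeout=(3, 30)):
    """Fetch JSON with explicit error handling; returns None on failure."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout:
        print(f"timed out: {url}")
    except requests.exceptions.HTTPError as err:
        # Log the status and a short slice of the body -- never secrets.
        print(f"HTTP {err.response.status_code}: {err.response.text[:200]}")
    except requests.exceptions.RequestException as err:
        print(f"request failed: {err}")
    return None
```

Returning None (or raising a domain-specific exception) keeps the caller's control flow simple: check once, proceed or skip.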
Streaming large responses & downloading files
For large binary responses (images, datasets), stream to avoid memory bloat.
with s.get('https://cdn.example.com/large.csv', stream=True) as r:
    r.raise_for_status()
    with open('large.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
Calling ML inference endpoints — tie-in with Deep Learning Foundations
You trained a PyTorch model and deployed it as a REST endpoint (e.g., FastAPI, Flask, TorchServe). Here's how to call it from Python:
Example: image classification endpoint that returns JSON probs
with open('dog.jpg', 'rb') as img:  # context manager closes the file handle
    files = {'image': img}
    resp = s.post('https://inference.example.com/predict', files=files)
resp.raise_for_status()
result = resp.json()  # e.g. {'predictions': [{'label': 'dog', 'score': 0.98}, ...]}
print(result['predictions'][0])
Alternative: send base64-encoded image in JSON (useful for text-only APIs):
import base64

with open('dog.jpg', 'rb') as img:
    b64 = base64.b64encode(img.read()).decode('ascii')
resp = s.post('https://inference.example.com/predict', json={'image_b64': b64})
Design note: when deploying models, expose a small JSON contract (input schema, output schema). Your client code becomes a tiny, robust consumer of that contract.
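On the client side of that contract, here is a minimal sketch of parsing the prediction response defensively, assuming the {'predictions': [{'label': ..., 'score': ...}]} shape from the example above. The helper name is hypothetical.

```python
def top_prediction(body: dict) -> tuple:
    """Safely extract the best (label, score) pair from an inference
    response, tolerating missing or empty fields."""
    preds = body.get("predictions") or []
    if not preds:
        raise ValueError("response contained no predictions")
    best = max(preds, key=lambda p: p.get("score", 0.0))
    return best.get("label", "unknown"), best.get("score", 0.0)
```

Using .get() with defaults instead of direct indexing means a malformed response raises one clear error rather than a KeyError deep in your pipeline.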
Async requests for high concurrency
If you need massive parallelism in data ingestion, use aiohttp or httpx (async). Requests is synchronous, which is fine for many ETL jobs, but async scales better for many small API calls.
Best practices checklist (engineering-ready)
- Use sessions for connection reuse
- Set timeouts and retries (with exponential backoff)
- Respect rate limits and Retry-After headers
- Keep API keys out of source code (env vars or a secrets manager)
- Validate API responses and handle missing fields gracefully
- Document the contract (input/output) used with ML endpoints
- Prefer streaming for large downloads
- Monitor latency/error rates and add observability when deployed
Quick summary — what to remember
- REST + requests = predictable, structured data ingestion. Use it before you scrape HTML.
- Handle pagination, timeouts, retries, and rate limits like a responsible engineer.
- For ML workflows, calling inference endpoints is just another API call — be explicit about formats, and parse JSON safely.
"APIs are like the polite neighbors of the web — they’ll give you what you need if you follow their rules. Treat rate limits like shared parking spots: nobody wins by hogging them."
Key takeaways:
- Master requests basics (GET/POST/headers/params)
- Implement robust retries + backoff
- Stream large responses and handle authentication securely
- Tie calls into your ML pipeline: fetch features, call inference, log results
Go build a tiny client that hits a public API, stores JSON results, and feeds them into a model you trained in the Deep Learning Foundations lesson. That's where these lessons start to sing.
Tags: beginner, intermediate, humorous, computer-science, data-engineering