DataGigant
Data collection infrastructure with specialized scrapers for LinkedIn profiles, financial data, and web technology detection. Each scraper runs as an isolated microservice with its own API, orchestrated by Airflow. Monorepo structure with shared types across Python and TypeScript services.
Retrospective
The Good
Airflow orchestrates all the data pipelines. Each scraper (LinkedIn, finance, webtech) is its own microservice with its own API, which keeps failures isolated. The monorepo structure means shared types and configs without the npm package publishing dance.
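The failure-isolation idea can be sketched in plain Python. This is a hypothetical illustration, not code from the repo: `run_pipeline` and the scraper names are made up, and in the real setup Airflow provides this per-task isolation rather than a hand-rolled loop.

```python
# Sketch of the orchestration pattern: each scraper is an independent
# service, and a failure in one must not abort the others. In the real
# system Airflow gives this isolation per task; names here are illustrative.
from typing import Callable, Dict


def run_pipeline(scrapers: Dict[str, Callable[[], dict]]) -> Dict[str, dict]:
    """Run each scraper and record its outcome independently."""
    results: Dict[str, dict] = {}
    for name, fetch in scrapers.items():
        try:
            results[name] = {"status": "ok", "data": fetch()}
        except Exception as exc:
            # A dead LinkedIn session shouldn't block the finance pipeline.
            results[name] = {"status": "failed", "error": str(exc)}
    return results
```

The payoff is the same as in the Airflow DAG: a CAPTCHA wall on the LinkedIn task leaves the finance and webtech results intact and retryable on their own schedule.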
The Bad
LinkedIn actively fights scrapers. Rate limits, CAPTCHAs, and session invalidation mean the LinkedIn pipeline needs constant babysitting. What works today breaks tomorrow.
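The usual coping strategy for this churn is backoff plus session rebuilds. Below is a minimal sketch of that pattern, assuming the scraper surfaces LinkedIn's countermeasures as two hypothetical exceptions (`RateLimited`, `SessionInvalid`); the function and parameter names are illustrative, not the project's actual API.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical: raised when LinkedIn throttles the client."""


class SessionInvalid(Exception):
    """Hypothetical: raised when cookies/session are revoked."""


def fetch_with_backoff(fetch, reauth, max_attempts=5, base_delay=1.0):
    """Retry a flaky scrape with exponential backoff and jitter.

    On an invalidated session, re-authenticate before retrying;
    on a rate limit, just wait longer. Jitter scales with base_delay
    so tests can run with base_delay=0.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except SessionInvalid:
            reauth()  # rebuild cookies/session, then fall through to backoff
        except RateLimited:
            pass  # nothing to fix; just back off below
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

This buys resilience against transient throttling, but it is exactly the "constant babysitting" problem: when LinkedIn changes what a revoked session looks like, the exception mapping itself has to be rewritten.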
The Ugly
The VPN rotation setup is held together with shell scripts and prayer. IP rotation works 80% of the time. The other 20% of the time, you're debugging iptables rules at midnight, wondering why traffic is routing through the wrong exit node.
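One cheap guard against the silent-failure mode is to verify the exit IP actually changed before sending any scrape traffic. A minimal sketch, with an injected `get_public_ip` callable (in real use it would hit an external echo service over the tunnel; everything here is hypothetical, not from the repo's scripts):

```python
def verify_exit_ip(get_public_ip, last_ip):
    """Sanity-check that rotation changed the exit node.

    get_public_ip: callable returning the current public IP as seen
    from outside the tunnel (injected so the check is testable
    without network access). Raises if the exit IP didn't move,
    which is exactly the 20% case worth catching before midnight.
    """
    current = get_public_ip()
    if current == last_ip:
        raise RuntimeError(f"rotation failed: still exiting via {current}")
    return current
```

Failing fast here turns "traffic quietly routed through the wrong exit node" into a loud, attributable error at rotation time instead of a banned account later.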