PROJECT #018

DataGigant

Data collection infrastructure with specialized scrapers for LinkedIn profiles, financial data, and web technology detection. Each scraper runs as an isolated microservice with its own API, orchestrated by Airflow. Monorepo structure with shared types across Python and TypeScript services.

Shipped: TBD
Python · TypeScript · Airflow · PostgreSQL · Docker

Retrospective

The Good

Airflow orchestrates all the data pipelines. Each scraper (LinkedIn, finance, webtech) is its own microservice with its own API, which keeps failures isolated. The monorepo structure means shared types and configs without the npm package publishing dance.
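The isolation pattern boils down to: fan out to each scraper service, and never let one failure take down the run. A minimal sketch of that idea in plain Python (the real thing is an Airflow DAG hitting each service's API; the callables here stand in for those HTTP calls):

```python
from concurrent.futures import ThreadPoolExecutor

def run_isolated(scrapers):
    """Run each scraper callable in parallel; one failure never blocks the rest.

    scrapers: dict of name -> zero-arg callable (stand-in for a service API call).
    Returns dict of name -> ("ok", result) or ("failed", error message).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(scrapers)) as pool:
        futures = {name: pool.submit(fn) for name, fn in scrapers.items()}
        for name, fut in futures.items():
            try:
                results[name] = ("ok", fut.result())
            except Exception as exc:
                results[name] = ("failed", str(exc))
    return results
```

In Airflow terms, each entry becomes its own task, so the LinkedIn scraper blowing up leaves the finance and webtech runs green.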

The Bad

LinkedIn actively fights scrapers. Rate limits, CAPTCHAs, and session invalidation mean the LinkedIn pipeline needs constant babysitting. What works today breaks tomorrow.
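The only mitigation that consistently helps is backing off instead of hammering: retry with exponential delay plus jitter so rate limits and transient session errors get a chance to clear. A sketch of that wrapper (the `fetch` callable and the injectable `sleep` are illustrative, not the actual pipeline code):

```python
import random
import time

def with_backoff(fetch, max_tries=5, base=2.0, sleep=time.sleep):
    """Retry a flaky fetch with exponential backoff and jitter.

    fetch should raise on rate-limit / invalid-session responses.
    sleep is injectable so tests don't actually wait.
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries; surface the error to the DAG
            # 1s, 2s, 4s, ... plus jitter to avoid thundering-herd retries
            sleep(base ** attempt + random.random())
```

It doesn't fix CAPTCHAs or invalidated sessions, but it turns "pipeline dies at 3 a.m." into "pipeline limps through and flags the run".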

The Ugly

The VPN rotation setup is held together with shell scripts and prayer. IP rotation works 80% of the time. The other 20% you're debugging iptables rules at midnight wondering why traffic is routing through the wrong exit node.
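The lesson from those midnight sessions: never trust that a rotation actually happened. A small guard that rotates and then verifies the public IP really changed before any scraper traffic flows would catch the wrong-exit-node case up front. A sketch, with `rotate` and `get_public_ip` as hypothetical injectable hooks (in practice a shell-out to the VPN CLI and an external IP-echo service):

```python
def rotate_and_verify(rotate, get_public_ip, max_attempts=3):
    """Rotate the VPN exit node and confirm the public IP actually changed.

    rotate: zero-arg callable that triggers an exit-node switch.
    get_public_ip: zero-arg callable returning the current public IP.
    Raises if the IP is unchanged after max_attempts rotations.
    """
    before = get_public_ip()
    for _ in range(max_attempts):
        rotate()
        after = get_public_ip()
        if after != before:
            return after  # rotation confirmed; safe to start scraping
    raise RuntimeError(f"exit IP stuck at {before} after {max_attempts} rotations")
```

Failing loudly here is the point: a scraper running through the wrong exit node is worse than a scraper that refuses to start.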

Screenshots