Scraping & Data Collection
Overview
The Observatorio scrapes legislative data from two chambers, each requiring a fundamentally different approach. The Chamber of Deputies (Diputados) exposes an open data portal with no anti-bot protection, so a straightforward HTTP client suffices. The Senate (Senado) sits behind Imperva’s Incapsula Web Application Firewall (WAF), which detects and blocks automated requests via TLS fingerprinting. Getting data out of the Senado required building a specialized anti-WAF client.
Two chambers, two stacks:
| Chamber | Client | Difficulty | Reason |
|---|---|---|---|
| Diputados | httpx | Low | Open data portal, no protection |
| Senado | curl_cffi | High | Incapsula WAF with TLS fingerprinting |
Data Sources
| Source | Chamber | Base URL | Method | Volume |
|---|---|---|---|---|
| Datos Abiertos | Diputados | datos.abiertos.diputados.gob.mx | httpx + delay | ~4,600 vote events |
| Portal LXVI | Senado | senado.gob.mx/66/ | curl_cffi + TLS impersonation | 5,047 vote events |
| Directorio XLS | Senado | Official XLS files | pandas read_excel | LVIII-LXV legislatures |
Diputados Scraper
The Diputados scraper targets the open data portal at datos.abiertos.diputados.gob.mx. No anti-bot protection is present, so the stack is minimal:
- HTTP client: httpx with a configurable delay between requests (default 2.0 seconds)
- Parser: BeautifulSoup for HTML parsing where JSON endpoints are unavailable
- Data scope: voting records keyed by SITL IDs, legislator profiles, and party affiliations
The scraper pulls approximately 4,600 vote events. All data loads into a SQLite database with deduplication handled by the source_id field on each vote_event record.
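The configurable delay can be sketched as a small throttle wrapped around whatever fetch function the scraper uses. This is an illustration rather than the project's code: `PoliteFetcher` and its parameters are hypothetical names, and the `fetch` callable stands in for something like an `httpx.Client` call that returns the response text.

```python
import time
from typing import Callable, Optional


class PoliteFetcher:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        delay: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # Wait only for the part of the delay that has not already elapsed.
        if self._last_request is not None:
            remaining = self._delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()
        return self._fetch(url)
```

Injecting `clock` and `sleep` keeps the throttle testable without real waiting; in production the defaults apply.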
Senado Scraper — The Anti-WAF Case Study
The Problem
The Senado portal at senado.gob.mx is protected by Incapsula (Imperva WAF). Standard HTTP clients (Python requests, httpx, even curl) get blocked immediately. The WAF detects automated traffic through three mechanisms:
- TLS fingerprinting: The JA3 hash of the TLS handshake identifies non-browser clients
- JavaScript challenges: Incapsula serves JS that real browsers execute automatically
- Behavioral analysis: Request patterns, timing, and cookie behavior are monitored
The first two iterations of the Senado scraper were blocked within minutes of starting a scrape run.
The Solution
The SenadoLXVIClient in senado/scrapers/shared/client.py uses curl_cffi with TLS fingerprint impersonation. This library wraps libcurl-impersonate, which can reproduce the exact TLS handshake of real browsers, matching JA3 hashes, cipher suites, and extensions.

    class SenadoLXVIClient:
        _IMPERSONATE_TARGETS: tuple[BrowserTypeLiteral, ...] = (
            "chrome",
            "safari",
            "chrome116",
            "chrome131",
            "edge",
            "chrome_android",
        )

        MAX_REQUESTS_PER_SESSION: int = 10
        WAF_CONSECUTIVE_THRESHOLD = 2

Fingerprint Pool
Six browser impersonation targets rotate across sessions. Each target presents a distinct JA3 hash to the WAF:
| Target | Profile |
|---|---|
| chrome | Latest Chrome desktop |
| safari | Safari desktop |
| chrome116 | Chrome 116 desktop |
| chrome131 | Chrome 131 desktop |
| edge | Edge desktop |
| chrome_android | Chrome mobile |
Session Management Strategy
The client uses a layered session strategy designed to minimize WAF detection while recovering gracefully from blocks:
- Active session: Fixed fingerprint from the pool, shared persistent cookies across requests within the session
- WAF block detected: Close the session immediately, discard all cookies (burned cookies carry WAF flags)
- New session: Rotate to the next fingerprint from the pool, perform a warm-up GET request to populate fresh cookies before scraping
- Proactive rotation: Rotate the session every 10 requests (MAX_REQUESTS_PER_SESSION) before the WAF has a chance to flag the pattern
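The rotation logic in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual SenadoLXVIClient: `RotatingSession`, `session_factory`, and `warmup_url` are hypothetical names, and the factory stands in for something like `curl_cffi.requests.Session(impersonate=target)`. Whether the warm-up request counts toward the 10-request budget is not stated in this document; the sketch assumes it does not.

```python
from itertools import cycle
from typing import Any, Callable, Iterable

TARGETS = ("chrome", "safari", "chrome116", "chrome131", "edge", "chrome_android")


class RotatingSession:
    """Rotate fingerprint and cookies proactively (illustrative sketch)."""

    def __init__(
        self,
        session_factory: Callable[[str], Any],
        warmup_url: str,
        targets: Iterable[str] = TARGETS,
        max_requests: int = 10,
    ) -> None:
        self._factory = session_factory
        self._warmup_url = warmup_url
        self._targets = cycle(targets)
        self._max_requests = max_requests
        self._session: Any = None
        self._count = 0

    def rotate(self) -> None:
        """Drop the current session (and its cookies) and warm up a new one."""
        if self._session is not None:
            self._session.close()
        self._session = self._factory(next(self._targets))
        # Warm-up request; assumed here not to count as a data request.
        self._session.get(self._warmup_url)
        self._count = 0

    def get(self, url: str) -> Any:
        # Rotate on first use and again once the per-session budget is spent.
        if self._session is None or self._count >= self._max_requests:
            self.rotate()
        self._count += 1
        return self._session.get(url)
```

Calling `rotate()` directly after a detected block reuses the same path: the burned session is closed, its cookies are discarded with it, and the next fingerprint in the pool is warmed up.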
Circuit Breaker
A circuit breaker tracks consecutive WAF blocks. After WAF_CONSECUTIVE_THRESHOLD (2) consecutive blocks, the session is declared burned: the client raises SessionBurnedError, a mandatory pause follows, and the scraper must be restarted with a fresh session.
This prevents the scraper from hammering the WAF with requests that will never succeed.
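A counter of this kind might look like the sketch below. `WafCircuitBreaker` is a hypothetical name, and the `SessionBurnedError` defined here is a stand-in for the project's own exception, whose real definition is not shown in this document.

```python
class SessionBurnedError(RuntimeError):
    """Stand-in exception: too many consecutive WAF blocks were seen."""


class WafCircuitBreaker:
    """Count consecutive WAF blocks and trip at a threshold (illustrative)."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_blocks = 0

    def record_success(self) -> None:
        # Any successful response resets the streak.
        self.consecutive_blocks = 0

    def record_block(self) -> None:
        self.consecutive_blocks += 1
        if self.consecutive_blocks >= self.threshold:
            raise SessionBurnedError(
                f"{self.consecutive_blocks} consecutive WAF blocks"
            )
```

With a threshold of 2, a single block still allows a retry on a fresh session, while a second block in a row stops the run.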
Warm-up Procedure
After creating a new session, the client issues a dummy GET request to the portal before making any real data requests. This warm-up request allows Incapsula to set its challenge cookies. A cold session without cookies gets blocked far more aggressively than one that has already passed the initial JS challenge.
Results
| Metric | Count |
|---|---|
| Vote events scraped | 5,047 |
| Senator profiles scraped | 1,754 |
| Iterations to get right | 3 |
Anti-WAF Strategy Diagram

    Request → Check Cache
    ├─ Cache Hit → Return cached data
    └─ Cache Miss → Send via curl_cffi
       ├─ Response OK → Cache + Return
       └─ WAF Detected (Incapsula markers)
          ├─ Consecutive < 2 → New session, rotate fingerprint, warm-up, retry
          └─ Consecutive ≥ 2 → SessionBurnedError → Pause + restart

The cache layer is not optional. Every cached page is one fewer request to the Senado portal, which means one fewer opportunity for the WAF to detect and block the scraper. For repeated scrape runs, the cache dramatically reduces exposure.
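One simple way to implement such a cache is to store each page on disk under a hash of its URL. The sketch below assumes that layout; `PageCache` and its file naming are hypothetical, and the project's actual cache format is not described in this document.

```python
import hashlib
from pathlib import Path
from typing import Callable


class PageCache:
    """Store fetched pages on disk, keyed by a hash of the URL (illustrative)."""

    def __init__(self, directory: Path) -> None:
        self._dir = directory
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self._dir / f"{digest}.html"

    def get_or_fetch(self, url: str, fetch: Callable[[str], str]) -> str:
        path = self._path(url)
        if path.exists():
            # Cache hit: no request reaches the portal.
            return path.read_text(encoding="utf-8")
        # Cache miss: exactly one request, then persist the result.
        body = fetch(url)
        path.write_text(body, encoding="utf-8")
        return body
```

Only successful responses should be passed through `fetch` here; a WAF block page written to the cache would be served back on every later run.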
Data Quality and Processing
Deduplication
Each vote_event record carries a source_id field that maps back to the original identifier from the source portal. This enables idempotent scraping: running the scraper multiple times does not create duplicate records. The SQLite INSERT OR IGNORE pattern on source_id handles this at the database level.
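The pattern depends on a UNIQUE constraint on source_id; without one, INSERT OR IGNORE has no conflict to ignore. A minimal sketch with Python's built-in sqlite3 module follows. The table layout beyond source_id, the `save_vote_event` helper, and the sample identifiers are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vote_event (
        id        INTEGER PRIMARY KEY,
        source_id TEXT NOT NULL UNIQUE,  -- INSERT OR IGNORE relies on this constraint
        title     TEXT
    )
    """
)


def save_vote_event(conn: sqlite3.Connection, source_id: str, title: str) -> bool:
    """Insert a vote event; return True only if a new row was created."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO vote_event (source_id, title) VALUES (?, ?)",
        (source_id, title),
    )
    conn.commit()
    # rowcount is 0 when the row was skipped because source_id already exists.
    return cur.rowcount == 1
```

Note that INSERT OR IGNORE keeps the first version of a record; if the portal later corrects a vote event, the stored row is not updated.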
Profile Enrichment
Legislator profiles are enriched with demographic and electoral data:
- Gender: 480 female / 598 male (across all loaded records)
- Seat type: MR (majority-relative) or PL (proportional-list)
- Circunscripción: Electoral district assignment for PL seats
Party Normalization
The normalize_party() function maps the mixed vote.group values returned by the portals to canonical organization IDs. Raw party names from the source data are inconsistent: abbreviations vary, coalitions create compound names, and historical parties have multiple labels. Normalization collapses all variants to a single canonical ID.
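The general shape of such a normalizer is a text-folding step followed by an alias lookup. The sketch below is not the project's normalize_party(): the function name `normalize_party_sketch`, the alias table, and the `org:` ID scheme are all hypothetical, since the real mapping is not shown in this document.

```python
import unicodedata
from typing import Optional

# Hypothetical alias table; the real labels and canonical IDs are assumptions.
_PARTY_ALIASES = {
    "morena": "org:morena",
    "movimiento regeneracion nacional": "org:morena",
    "pan": "org:pan",
    "partido accion nacional": "org:pan",
    "pri": "org:pri",
}


def _fold(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def normalize_party_sketch(raw: str) -> Optional[str]:
    """Return a canonical organization ID, or None when the label is unknown."""
    return _PARTY_ALIASES.get(_fold(raw))
```

Returning None for unknown labels, rather than guessing, makes unmapped variants easy to find and add to the table.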
Membership Resolution
Some legislators have multi-party memberships across their career. The scraper resolves this by vote frequency: the legislator is assigned to the party where they cast the most votes. This is a pragmatic heuristic that handles party switches and expulsions without requiring manual disambiguation.
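The vote-frequency heuristic reduces to counting the party label attached to each of a legislator's votes and taking the most common one. A minimal sketch, assuming each vote already carries a normalized party ID (`resolve_party` and the sample IDs are hypothetical):

```python
from collections import Counter
from typing import Iterable, Optional


def resolve_party(vote_parties: Iterable[str]) -> Optional[str]:
    """Pick the party under which a legislator cast the most votes."""
    counts = Counter(vote_parties)
    if not counts:
        return None
    # most_common(1) yields the highest (party, count) pair;
    # ties go to the party encountered first.
    party, _ = counts.most_common(1)[0]
    return party
```

The tie-breaking rule here is an artifact of `Counter` ordering, not something the source specifies; a real implementation may want an explicit rule such as preferring the most recent party.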
Lessons Learned
- TLS fingerprinting is the primary bot detection mechanism for WAFs like Incapsula. Headers and user-agent strings are easy to spoof; the JA3 hash of the TLS handshake is not. Libraries like curl_cffi that can impersonate real browser TLS stacks are essential.
- Proactive rotation beats reactive rotation. Rotating sessions before the WAF detects a pattern is far more effective than rotating after a block. The 10-request limit per session is conservative but reliable.
- Cookie management matters. Burned cookies carry WAF flags. Discarding them entirely and starting fresh is better than trying to “fix” a flagged session.
- Warm-up requests are essential. A cold session without Incapsula challenge cookies gets blocked on the first real request. The warm-up GET populates the necessary cookies.
- Caching reduces exposure. Each cached page is one fewer request to the portal. For a scraper operating behind a WAF, minimizing total requests is a survival strategy, not just a performance optimization.
- The Senado scraper took three iterations to get right. The first two were blocked within minutes. Iteration three introduced curl_cffi, fingerprint rotation, and proactive session management, and has been running reliably since.