Senatorial extractor architecture

camara-senadores-mex is a standalone Scrapy extractor that turns pages from the Mexican Senate portal into a local SQLite database. Its responsibility ends at downloading, parsing, and persisting the roll-call votes and profiles that are actually available; it does not attempt to produce a Popolo representation or to fill historical gaps artificially.

The architecture is organized around a hostile institutional source: the portal publishes useful content, but it does not expose it as a stable open-data API.

Senate portal
senado.gob.mx/66/
        │
        ├── /66/votacion/{id}
        │       │
        │       └── POST AJAX viewTableVot.php
        │
        └── /66/senador/{id}
                │
                ▼
        Scrapy + scrapy-impersonate
                │
                ▼
        temporal / vote / profile parsing
                │
                ▼
        local SQLite: senado.db

Layers

Layer | Role | Evidence produced
Institutional source | HTML pages and AJAX view under /66/. | Vote HTML, AJAX fragments, profile pages.
Scrapy client | Traverses vote or profile IDs and preserves request context. | Responses associated with vote_id or senador_id.
Anti-WAF mitigation | Uses scrapy-impersonate / curl_cffi for TLS fingerprinting. | Requests with browser impersonation.
Parsing | Extracts temporal metadata, roll-call votes, and available profiles. | VotacionItem, VotoNominalItem, SenadorItem items.
Persistence | Inserts/upserts into local SQLite. | Tables for voting events, roll-call votes, and senators.
Validation | Reads the database and counts anomalies without rewriting it. | Auditable metrics and warnings.

Vote flow

For each vote, the spider starts from the HTML page:

https://www.senado.gob.mx/66/votacion/{id}

That first HTML response is used to recover navigation context, cookies, and temporal metadata when present. The initial HTML is not trusted to contain the complete roll-call table: the current code always issues a second request to the AJAX endpoint to fetch the roll-call votes.

The current operational endpoint is:

POST https://www.senado.gob.mx/66/app/votaciones/functions/viewTableVot.php

with an application/x-www-form-urlencoded body equivalent to:

action=ajax&cell=1&order=DESC&votacion={id}&q=

and relevant headers:

Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Referer: <url of /66/votacion/{id}>
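As an illustration of the request described above, the following minimal sketch assembles the URL, body, and headers for one vote using only the standard library. The helper name build_roll_call_request and the returned dict layout are hypothetical and not taken from the repository; only the endpoint, form fields, and headers come from this document.

```python
from urllib.parse import urlencode

BASE = "https://www.senado.gob.mx/66"
AJAX_URL = f"{BASE}/app/votaciones/functions/viewTableVot.php"


def build_roll_call_request(vote_id: int) -> dict:
    """Assemble the AJAX POST for one vote (hypothetical helper, not repo code)."""
    # Form fields in the order documented above; "q" is sent empty.
    body = urlencode({
        "action": "ajax",
        "cell": "1",
        "order": "DESC",
        "votacion": str(vote_id),
        "q": "",
    })
    return {
        "method": "POST",
        "url": AJAX_URL,
        "body": body,
        "headers": {
            "Content-Type": "application/x-www-form-urlencoded",
            "X-Requested-With": "XMLHttpRequest",
            # The Referer points back at the HTML page of the same vote.
            "Referer": f"{BASE}/votacion/{vote_id}",
        },
    }
```

In a Scrapy callback, these fields could plausibly be passed to scrapy.Request(url=..., method="POST", body=..., headers=...) together with meta={"impersonate": "chrome131"}; how the actual spider builds the request is not shown here.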

This sets an important boundary: in the version documented here, viewTableVot.php is called with POST and must not be documented as a GET endpoint. Any historical references to GET describe earlier exploration or older implementations, not the current operational contract.

Profile flow

Profiles are obtained from IDs detected in roll-call votes:

https://www.senado.gob.mx/66/senador/{id}

The profile spider does not traverse an invented universal catalog. It reads the senador_id values present in votos_nominales that are still absent from the senadores table, tries to open the corresponding page, and stores only profiles with a valid information section.
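The selection rule above can be sketched as a single query. This is an assumption-laden illustration: the table names votos_nominales and senadores and the column senador_id in votos_nominales come from this document, while the key column of senadores (assumed here to also be senador_id) and the function name pending_profile_urls are hypothetical.

```python
import sqlite3

# IDs seen in roll-call votes that have no row yet in senadores.
# Assumption: senadores is keyed by a column named senador_id.
PENDING_PROFILES_SQL = """
SELECT DISTINCT v.senador_id
FROM votos_nominales AS v
LEFT JOIN senadores AS s ON s.senador_id = v.senador_id
WHERE v.senador_id IS NOT NULL
  AND s.senador_id IS NULL
ORDER BY v.senador_id
"""


def pending_profile_urls(conn: sqlite3.Connection) -> list:
    """Return profile URLs for senators seen in votes but not yet stored."""
    rows = conn.execute(PENDING_PROFILES_SQL).fetchall()
    return [f"https://www.senado.gob.mx/66/senador/{row[0]}" for row in rows]
```

Driving the crawl from this query keeps the profile spider tied to IDs that were actually observed, so no catalog of senators has to be guessed.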

Anti-WAF friction

The portal operates behind Incapsula. For that reason the project uses Scrapy with scrapy-impersonate, which integrates curl_cffi and enables requests with a browser TLS fingerprint.

The current configuration includes:

  • scrapy_impersonate.ImpersonateDownloadHandler download handlers for HTTP and HTTPS;
  • scrapy_impersonate.RandomBrowserMiddleware middleware;
  • meta={"impersonate": "chrome131"} on vote and AJAX requests;
  • cookies enabled, retries, and manual throttling.
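A settings fragment along these lines would match the list above. It is a sketch, not the project's actual settings.py: the handler and middleware paths come from this document, the asyncio reactor line follows the scrapy-impersonate setup instructions, and every numeric value, the middleware priority, and the throttling choices are illustrative assumptions.

```python
# settings.py (sketch; values marked "assumed" are not from the repository)

# Route HTTP and HTTPS downloads through scrapy-impersonate (curl_cffi).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}
# scrapy-impersonate expects the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_impersonate.RandomBrowserMiddleware": 1000,  # assumed priority
}

COOKIES_ENABLED = True
RETRY_ENABLED = True
RETRY_TIMES = 3                      # assumed
# "Manual throttling" is read here as a fixed delay with AutoThrottle off.
DOWNLOAD_DELAY = 2.0                 # assumed, in seconds
AUTOTHROTTLE_ENABLED = False         # assumed
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # assumed
```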

Anti-WAF mitigation does not make the source stable. It only makes access sufficiently consistent for extraction, persistence, and auditing.

Output contract

The extractor's output is an operational artifact: senado.db. Reading it correctly requires respecting the limits of the source:

  • IDs with no content are not filled artificially.
  • Missing profiles are not invented.
  • Empty values are preserved as extraction evidence.
  • The Popolo layer remains outside this repository and consumes the database as a later input.
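A consumer or validator can respect these limits by opening senado.db read-only and counting gaps instead of repairing them. The sketch below assumes nothing about the schema beyond the table and column names passed in; count_empty is a hypothetical helper, not part of the repository.

```python
import sqlite3
from pathlib import Path


def _quote(identifier: str) -> str:
    """Quote a table or column name for safe interpolation into SQL."""
    return '"' + identifier.replace('"', '""') + '"'


def count_empty(db_path: str, table: str, column: str) -> int:
    """Count rows whose column is NULL or an empty string, without writing."""
    # mode=ro makes SQLite refuse any write on this connection.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        col = _quote(column)
        sql = (
            f"SELECT COUNT(*) FROM {_quote(table)} "
            f"WHERE {col} IS NULL OR {col} = ''"
        )
        (count,) = conn.execute(sql).fetchone()
        return count
    finally:
        conn.close()
```

For example, count_empty("senado.db", "votos_nominales", "senador_id") would report how many roll-call rows lack a senator ID, leaving the rows themselves untouched as extraction evidence.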