Architecture
High-Level Overview
The Observatorio del Congreso is a quantitative analysis platform for Mexico’s legislative branch (Cámara de Diputados + Senado de la República). It uses a unified Popolo-Graph schema stored in SQLite to model legislators, parties, votes, and informal power networks across seven legislatures (LX through LXVI, 2006-2026). The dataset covers approximately 3.5 million individual votes, 8,000 vote events, and 3,800 legislators.
Pipeline
┌─────────────────────────────────────────────────────────────────────┐
│                           DATA COLLECTION                           │
│                                                                     │
│  ┌──────────────────────┐     ┌──────────────────────────────────┐  │
│  │ Senado Scraper       │     │ Diputados Scraper                │  │
│  │ curl_cffi + TLS      │     │ httpx + BeautifulSoup            │  │
│  │ fingerprint          │     │                                  │  │
│  │ (Anti-WAF:           │     │ SITL / INFOPAL open portal       │  │
│  │  Incapsula bypass)   │     │ + datos.abiertos API             │  │
│  └──────────┬───────────┘     └──────────────┬───────────────────┘  │
│             │                                │                      │
└─────────────┼────────────────────────────────┼──────────────────────┘
              │                                │
              ▼                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            PARSE & LOAD                             │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ Transformers → Loaders (deduplication via source_id)         │   │
│  └──────────────────────────────┬───────────────────────────────┘   │
│                                 │                                   │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            STORAGE LAYER                            │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ SQLite (WAL mode): congreso.db                               │   │
│  │ Popolo-Graph Schema: 12 tables                               │   │
│  │   area · organization · person · membership · post           │   │
│  │   motion · vote_event · vote · count                         │   │
│  │   actor_externo · relacion_poder · evento_politico           │   │
│  └──────────────────────────────┬───────────────────────────────┘   │
│                                 │                                   │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           ANALYSIS LAYER                            │
│                                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │
│  │ W-NOMINATE   │  │ Co-voting    │  │ Community Detection      │   │
│  │ (scipy,      │  │ Matrix       │  │ (networkx,               │   │
│  │  numpy)      │  │ & Graph      │  │  python-louvain)         │   │
│  └──────┬───────┘  └──────┬───────┘  └───────────┬──────────────┘   │
│         │                 │                      │                  │
│  ┌──────┴───────┐  ┌──────┴───────┐  ┌───────────┴──────────────┐   │
│  │ Centrality   │  │ Power        │  │ Empirical Power          │   │
│  │ (degree,     │  │ Indices      │  │ (from real voting        │   │
│  │  betweenness)│  │ (Shapley-    │  │  coalitions)             │   │
│  │              │  │  Shubik,     │  │                          │   │
│  │              │  │  Banzhaf)    │  │                          │   │
│  └──────┬───────┘  └──────┬───────┘  └───────────┬──────────────┘   │
│         │                 │                      │                  │
└─────────┼─────────────────┼──────────────────────┼──────────────────┘
          │                 │                      │
          ▼                 ▼                      ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            EXPORT LAYER                             │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ JSON files → public/data/observatorio/                       │   │
│  │ Pre-aggregated, static, no server-side computation           │   │
│  └──────────────────────────────┬───────────────────────────────┘   │
│                                 │                                   │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         VISUALIZATION LAYER                         │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ CachorroSpace (Astro + Starlight)                            │   │
│  │ ECharts 6 via React islands                                  │   │
│  │ Interactive charts: NOMINATE maps, co-voting graphs,         │   │
│  │ power indices, community structures                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Scraping (Senado) | curl_cffi + TLS fingerprint impersonation | Anti-WAF evasion for Incapsula-protected portal |
| Scraping (Diputados) | httpx + BeautifulSoup | Open data portal scraping (SITL / INFOPAL) |
| Database | SQLite (WAL mode) | Unified Popolo-Graph storage |
| Analysis — NOMINATE | scipy, numpy, matplotlib | Ideal point estimation (W-NOMINATE algorithm) |
| Analysis — Networks | networkx, python-louvain | Co-voting graphs, community detection |
| Exports | JSON (static) | Pre-aggregated data for visualizations |
| Visualizations | ECharts 6 (React islands) | Interactive charts on CachorroSpace |
Data Sources
| Source | URL | Chamber | Data |
|---|---|---|---|
| Cámara de Diputados | datos_abiertos / SITL / INFOPAL | Diputados | Voting records, legislator profiles, composition |
| Senado de la República | senado.gob.mx/66/ | Senado | Voting records, senator profiles, directorio |
Data Flow
The pipeline processes data in four stages:
1. Scrape and Parse
Each chamber has a dedicated scraper with its own HTTP client, parser, and transformer modules:
- Senado: A curl_cffi session with TLS impersonation retrieves voting pages. Parsers extract vote data from HTML. Transformers normalize data into Popolo-Graph format.
- Diputados: An httpx client with file-based caching and rate limiting queries the SITL/INFOPAL systems. Parsers handle both XML and HTML responses.
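The file-based caching and rate limiting mentioned for the Diputados client can be sketched roughly as below. This is a minimal illustration, not the project's client.py: the `CachedFetcher` class, its parameters, and the example URL are hypothetical, and the network call is injected as a plain function so the sketch runs without httpx or a live portal.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Callable


class CachedFetcher:
    """File-cached, rate-limited wrapper around any fetch(url) -> str callable."""

    def __init__(self, fetch: Callable[[str], str], cache_dir: Path,
                 min_interval: float = 1.0) -> None:
        self.fetch = fetch                # in real use, e.g. lambda url: httpx.get(url).text
        self.cache_dir = cache_dir
        self.min_interval = min_interval  # minimum seconds between real requests
        self._last_request = float("-inf")
        cache_dir.mkdir(parents=True, exist_ok=True)

    def _cache_path(self, url: str) -> Path:
        # Hash the URL so any query string maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.html"

    def get(self, url: str) -> str:
        path = self._cache_path(url)
        if path.exists():                 # cache hit: no request, no delay
            return path.read_text(encoding="utf-8")
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:                      # only real requests are rate limited
            time.sleep(wait)
        body = self.fetch(url)
        self._last_request = time.monotonic()
        path.write_text(body, encoding="utf-8")
        return body


# Demonstration with a fake fetch function (assumption: no network access).
calls: list[str] = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


fetcher = CachedFetcher(fake_fetch, Path(tempfile.mkdtemp()), min_interval=0.0)
first = fetcher.get("https://example.org/votacion/1")
second = fetcher.get("https://example.org/votacion/1")  # served from disk
```

Keeping the cache check ahead of the rate limiter means re-running a scrape over already-fetched pages costs no waiting time, which matters when a run is resumed after a failure.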
2. Load into SQLite
Data flows through loaders that insert records into congreso.db with deduplication via the source_id column on the vote_event table. The id_generator module produces human-readable IDs with prefixes (P01, O01, VE01, etc.).
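One way to picture deduplication via source_id together with prefixed IDs is the sketch below. It assumes a UNIQUE constraint on source_id and uses a reduced stand-in for the vote_event table; the `next_id` and `load_vote_event` helpers and the sample source_id values are hypothetical and do not reproduce the project's loader or id_generator module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for vote_event; the real table has more columns.
conn.execute(
    """
    CREATE TABLE vote_event (
        id        TEXT PRIMARY KEY,
        source_id TEXT NOT NULL UNIQUE,  -- assumed UNIQUE so duplicates are rejected
        title     TEXT
    )
    """
)


def next_id(conn: sqlite3.Connection, table: str, prefix: str, width: int = 2) -> str:
    """Return the next prefixed ID (VE01, VE02, ...) for a table (hypothetical helper)."""
    # The table name comes from trusted code here, never from user input.
    rows = conn.execute(f"SELECT id FROM {table} WHERE id LIKE ?", (prefix + "%",))
    numbers = [int(row[0][len(prefix):]) for row in rows]
    return f"{prefix}{max(numbers, default=0) + 1:0{width}d}"


def load_vote_event(conn: sqlite3.Connection, source_id: str, title: str) -> bool:
    """Insert a vote event once; return True only if a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO vote_event (id, source_id, title) VALUES (?, ?, ?)",
        (next_id(conn, "vote_event", "VE"), source_id, title),
    )
    return cur.rowcount == 1  # 0 when the source_id already exists


inserted = [
    load_vote_event(conn, "sitl-lxvi-0001", "Reforma A"),
    load_vote_event(conn, "sitl-lxvi-0001", "Reforma A"),  # re-scraped duplicate
    load_vote_event(conn, "sitl-lxvi-0002", "Reforma B"),
]
ids = [row[0] for row in conn.execute("SELECT id FROM vote_event ORDER BY id")]
```

With this approach a scraper can be re-run over the same pages safely: the second insert of the same source_id is silently skipped and no ID is consumed.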
3. Analysis
Analysis scripts read from SQLite and compute:
- W-NOMINATE: Ideal point estimation placing legislators on a 2D ideological map
- Co-voting matrix: Pairwise agreement rates between legislators, exported as weighted graphs
- Community detection: Louvain algorithm identifies voting blocs within co-voting networks
- Centrality: Degree and betweenness centrality measures on co-voting graphs
- Power indices: Shapley-Shubik and Banzhaf indices based on seat distributions
- Empirical power: Measured from real voting coalition data, not just seat counts
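Two of the measures listed above are simple enough to illustrate in a few lines. The sketch below computes pairwise agreement rates and a normalized Banzhaf index using only the standard library. The vote records, seat counts, and quota are made-up illustration values, and the function names are hypothetical rather than taken from covotacion.py or poder_partidos.py.

```python
from itertools import combinations
from typing import Optional


def agreement_rate(votes_a: dict[str, str], votes_b: dict[str, str]) -> Optional[float]:
    """Share of jointly voted events on which two legislators cast the same vote."""
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return None  # no overlapping events, so the rate is undefined
    return sum(votes_a[e] == votes_b[e] for e in shared) / len(shared)


def banzhaf(seats: dict[str, int], quota: int) -> dict[str, float]:
    """Normalized Banzhaf index: each party's share of all swing positions."""
    parties = list(seats)
    swings = dict.fromkeys(parties, 0)
    for size in range(1, len(parties) + 1):
        for coalition in combinations(parties, size):
            total = sum(seats[p] for p in coalition)
            if total < quota:
                continue  # losing coalition: nobody in it is critical
            for p in coalition:
                if total - seats[p] < quota:  # coalition loses without p
                    swings[p] += 1
    all_swings = sum(swings.values())
    if all_swings == 0:  # quota can never be reached
        return {p: 0.0 for p in parties}
    return {p: swings[p] / all_swings for p in parties}


# Made-up votes: legislator ID -> {vote event ID -> option}.
votes = {
    "P01": {"VE01": "yes", "VE02": "no", "VE03": "yes"},
    "P02": {"VE01": "yes", "VE02": "yes", "VE03": "yes"},
    "P03": {"VE02": "no"},
}
rates = {
    (a, b): agreement_rate(votes[a], votes[b])
    for a, b in combinations(sorted(votes), 2)
}

# Made-up seat distribution with a simple-majority quota of 51 out of 100.
power = banzhaf({"A": 50, "B": 30, "C": 20}, quota=51)
```

In the seat example, party A holds 50% of the seats but 60% of the swing positions, which is the kind of gap between seat share and voting power that these indices are meant to expose. Enumerating all coalitions is exponential in the number of parties, which is acceptable for a handful of parliamentary groups.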
4. Export and Visualize
The export_observatorio_json.py script reads analysis CSV outputs and produces static JSON files consumed by ECharts 6 visualizations embedded as React islands in CachorroSpace.
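The CSV-to-JSON step could look roughly like the following sketch. It is not the actual export_observatorio_json.py: the `export_csv_to_json` function, the column names, and the sample rows are assumptions chosen for illustration, and the demonstration writes to a temporary directory.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_csv_to_json(csv_path: Path, json_path: Path,
                       numeric: tuple[str, ...] = ()) -> int:
    """Convert one analysis CSV into a JSON array of records; return the row count."""
    with csv_path.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    for row in rows:
        for col in numeric:  # CSV cells are strings; charts need numbers
            row[col] = float(row[col])
    json_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps accents and ñ readable in the output file.
    json_path.write_text(
        json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(rows)


# Demonstration with made-up NOMINATE coordinates in a temporary directory.
root = Path(tempfile.mkdtemp())
src = root / "analysis" / "output" / "nominate.csv"
src.parent.mkdir(parents=True)
src.write_text(
    "person_id,name,dim1,dim2\nP01,Peña,-0.42,0.10\nP02,López,0.35,-0.27\n",
    encoding="utf-8",
)
dst = root / "public" / "data" / "observatorio" / "nominate.json"
count = export_csv_to_json(src, dst, numeric=("dim1", "dim2"))
records = json.loads(dst.read_text(encoding="utf-8"))
```

Because the output is a plain static file, the charts need no server-side computation: the browser fetches the JSON and passes the records to ECharts.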
analysis/output/*.csv
        │
        ▼
export_observatorio_json.py
        │
        ▼
public/data/observatorio/*.json
        │
        ▼
React ECharts islands (CachorroSpace)
Project Structure
observatorio-congreso/
├── db/
│   ├── schema.sql  # Popolo-Graph schema (12 tables)
│   ├── senado_schema.sql  # Senado-specific schema extensions
│   ├── init_db.py  # Database initialization + seed data
│   ├── helpers.py  # SQLite helper functions
│   ├── id_generator.py  # Human-readable ID generation (P01, O01...)
│   ├── constants.py  # Legislature mappings and constants
│   ├── migrations/  # Schema migrations and data fixes
│   └── congreso.db  # SQLite database (~3.5M votes)
│
├── diputados/
│   └── scraper/
│       ├── client.py  # httpx HTTP client with cache
│       ├── config.py  # Scraper configuration
│       ├── pipeline.py  # Main scraping pipeline
│       ├── loader.py  # SQLite loader (dedup via source_id)
│       ├── models.py  # Data models
│       ├── legislatura.py  # Legislature range logic
│       └── parsers/
│           ├── votaciones.py  # Vote event parser
│           ├── nominal.py  # Nominal (roll-call) vote parser
│           ├── desglose.py  # Vote breakdown parser
│           ├── diputado.py  # Legislator profile parser
│           └── composicion.py  # Chamber composition parser
│
├── senado/
│   └── scrapers/
│       ├── shared/
│       │   ├── client.py  # Anti-WAF client (curl_cffi)
│       │   ├── config.py  # Scraper configuration
│       │   └── models.py  # Shared data models
│       ├── votaciones/
│       │   ├── __main__.py  # CLI entry point
│       │   ├── cli.py  # Command-line interface
│       │   ├── transformers.py  # Data normalization
│       │   ├── congreso_loader.py  # SQLite loader
│       │   └── parsers/
│       │       └── lxvi_portal.py  # LXVI portal parser
│       └── perfiles/
│           ├── __main__.py  # CLI entry point
│           ├── scraper.py  # Profile scraper
│           └── parsers/
│               └── perfil_parser.py  # Senator profile parser
│
├── analysis/
│   ├── nominate.py  # W-NOMINATE implementation
│   ├── covotacion.py  # Co-voting matrix and graph
│   ├── covotacion_dinamica.py  # Dynamic (time-windowed) co-voting
│   ├── comunidades.py  # Louvain community detection
│   ├── centralidad.py  # Degree and betweenness centrality
│   ├── poder_partidos.py  # Shapley-Shubik and Banzhaf indices
│   ├── poder_empirico.py  # Empirical power from real votes
│   ├── run_analysis.py  # Run all analyses
│   ├── run_nominate.py  # Run NOMINATE only
│   ├── run_covotacion_dinamica.py  # Run dynamic co-voting
│   ├── visualizacion.py  # General visualization exports
│   ├── visualizacion_nominate.py  # NOMINATE chart data
│   ├── visualizacion_dinamica.py  # Dynamic co-voting chart data
│   ├── visualizacion_poder.py  # Power indices chart data
│   ├── visualizacion_articulo.py  # Article-specific visualizations
│   ├── scripts/
│   │   └── export_observatorio_json.py  # CSV → JSON for ECharts
│   ├── analisis-diputados/  # Diputados-specific analysis outputs
│   ├── analisis-senado/  # Senado-specific analysis outputs
│   └── analisis-bicameral/  # Cross-chamber analysis outputs
│
├── utils/
│   ├── db_utils.py  # Database utility functions
│   ├── text_utils.py  # Text normalization utilities
│   └── tests/
│       └── test_text_utils.py  # Text utility tests
│
├── cache/  # HTTP response cache
├── logs/  # Scraper logs
├── pyproject.toml  # Project dependencies (uv)
└── scrape_diputados_all.sh  # Batch scraper script
Database Configuration
SQLite is configured for safe concurrent access and data integrity:
PRAGMA foreign_keys = ON;
PRAGMA encoding = "UTF-8";
PRAGMA journal_mode = WAL;
PRAGMA busy_timeout = 5000;

| Setting | Value | Purpose |
|---|---|---|
| journal_mode | WAL | Concurrent reads without blocking writes |
| foreign_keys | ON | Enforce referential integrity between tables |
| busy_timeout | 5000 ms | Wait up to 5 seconds if the database is locked |
| encoding | UTF-8 | Correct handling of Spanish characters (accents, ñ) |
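From Python, these settings could be applied at connection time roughly as follows. This is a sketch with the standard sqlite3 module. The `connect` helper and the temporary database path are illustrative and are not the project's db/helpers.py.

```python
import sqlite3
import tempfile
from pathlib import Path


def connect(db_path: Path) -> sqlite3.Connection:
    """Open a connection and apply the four pragmas from the table above."""
    conn = sqlite3.connect(db_path)
    # encoding only takes effect before the database file has any content.
    conn.execute('PRAGMA encoding = "UTF-8"')
    # journal_mode = WAL is persistent: it is recorded in the database file.
    conn.execute("PRAGMA journal_mode = WAL")
    # foreign_keys and busy_timeout are per-connection and must be set every time.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("PRAGMA busy_timeout = 5000")
    return conn


# WAL needs a real file (an in-memory database reports "memory" instead).
conn = connect(Path(tempfile.mkdtemp()) / "congreso.db")
settings = {
    name: conn.execute(f"PRAGMA {name}").fetchone()[0]
    for name in ("journal_mode", "foreign_keys", "busy_timeout", "encoding")
}
```

The practical consequence is that every scraper, loader, and analysis script should open the database through one shared helper. A connection opened without it would silently skip foreign key enforcement.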
Schema Overview
The Popolo-Graph schema contains 12 tables, organized into two groups:
Core Popolo entities (legislative data standard):
| Table | Purpose |
|---|---|
| area | Geographic divisions (states, districts, constituencies) |
| organization | Political parties, blocs, coalitions, institutions |
| person | Legislators and political actors |
| membership | Person-to-organization relationships with roles and dates |
| post | Legislative positions within organizations and areas |
| motion | Bills and legislative initiatives |
| vote_event | Specific voting instances (chamber + date) |
| vote | Individual legislator votes per event |
| count | Aggregated vote counts per group per event |
Power network extensions (beyond standard Popolo):
| Table | Purpose |
|---|---|
| actor_externo | External actors (governors, party leaders, judges) |
| relacion_poder | Informal power relationships (loyalty, pressure, alliances) |
| evento_politico | Political events that affect power dynamics |
Indexes
The schema includes indexes on the most common query patterns:
- membership queries by person and by organization
- vote_event lookups by motion and by source_id (deduplication)
- vote queries by voter and by event
- count queries by event and by group
- relacion_poder queries by source, target, and type
- person filtering by internal faction (corriente_interna)
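To check that a query actually benefits from such an index, SQLite's EXPLAIN QUERY PLAN can be inspected from Python. The sketch below uses a reduced stand-in for the vote table. The column names (vote_event_id, voter_id, choice) and index names are assumptions for illustration, not the definitions in schema.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced stand-in for the vote table; column names are assumed.
    CREATE TABLE vote (
        id            TEXT PRIMARY KEY,
        vote_event_id TEXT NOT NULL,
        voter_id      TEXT NOT NULL,
        choice        TEXT NOT NULL
    );
    CREATE INDEX idx_vote_voter ON vote (voter_id);
    CREATE INDEX idx_vote_event ON vote (vote_event_id);
    """
)


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[str]:
    """Return the human-readable steps of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description


# Votes cast by one legislator: should search via the voter index.
by_voter = query_plan(conn, "SELECT choice FROM vote WHERE voter_id = ?", ("P01",))
# Votes in one event: should search via the event index.
by_event = query_plan(conn, "SELECT choice FROM vote WHERE vote_event_id = ?", ("VE01",))
```

A plan step that mentions the index name indicates an index search. A step that only scans the table suggests the query pattern is not covered and may be slow on a table with millions of rows.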
Integrity Triggers
Date validation triggers ensure end_date >= start_date on person and membership tables for both inserts and updates. These fire at the SQLite level to prevent data corruption regardless of which loader writes the data.
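A date validation trigger of this kind can be sketched as follows. The table is a reduced stand-in for membership, and the trigger names, error message, and sample rows are hypothetical. The sketch also assumes dates are stored as ISO 8601 text, which is what makes the plain string comparison valid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced stand-in for membership; dates assumed to be ISO 8601 text.
    CREATE TABLE membership (
        id         TEXT PRIMARY KEY,
        person_id  TEXT NOT NULL,
        start_date TEXT,
        end_date   TEXT
    );

    CREATE TRIGGER membership_dates_insert
    BEFORE INSERT ON membership
    WHEN NEW.start_date IS NOT NULL AND NEW.end_date IS NOT NULL
         AND NEW.end_date < NEW.start_date
    BEGIN
        SELECT RAISE(ABORT, 'end_date must be >= start_date');
    END;

    CREATE TRIGGER membership_dates_update
    BEFORE UPDATE OF start_date, end_date ON membership
    WHEN NEW.start_date IS NOT NULL AND NEW.end_date IS NOT NULL
         AND NEW.end_date < NEW.start_date
    BEGIN
        SELECT RAISE(ABORT, 'end_date must be >= start_date');
    END;
    """
)

# A valid row is accepted.
conn.execute("INSERT INTO membership VALUES ('M01', 'P01', '2021-09-01', '2024-08-31')")

errors = []
bad_statements = [
    "INSERT INTO membership VALUES ('M02', 'P01', '2024-09-01', '2021-09-01')",
    "UPDATE membership SET end_date = '2020-01-01' WHERE id = 'M01'",
]
for sql in bad_statements:
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError as exc:  # raised by RAISE(ABORT, ...)
        errors.append(str(exc))

stored = conn.execute("SELECT id, end_date FROM membership").fetchall()
```

Rows with an open-ended membership (a NULL end_date) pass through untouched, since the WHEN clause only fires when both dates are present.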