High-Level Overview

The Observatorio del Congreso is a quantitative analysis platform for Mexico’s legislative branch (Cámara de Diputados and Senado de la República). It uses a unified Popolo-Graph schema stored in SQLite to model legislators, parties, votes, and informal power networks across seven legislatures (LX through LXVI, 2006-2027). The dataset covers approximately 3.5 million individual votes, 9,437 vote events, and 4,840 persons. The codebase has 302 passing tests and runs on Python 3.12.

Pipeline

┌─────────────────────────────────────────────────────────────────────┐
│                         DATA COLLECTION                             │
│                                                                     │
│  ┌──────────────────────┐      ┌──────────────────────────────────┐ │
│  │  Senado Scraper      │      │  Diputados Scraper               │ │
│  │  curl_cffi + TLS     │      │  httpx + BeautifulSoup           │ │
│  │  fingerprint         │      │                                  │ │
│  │  (Anti-WAF:          │      │  SITL / INFOPAL open portal      │ │
│  │   Incapsula bypass)  │      │  + datos.abiertos API            │ │
│  └──────────┬───────────┘      └──────────────┬───────────────────┘ │
│             │                                  │                    │
└─────────────┼──────────────────────────────────┼────────────────────┘
              │                                  │
              ▼                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         PARSE & LOAD                                │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  Transformers → Loaders (deduplication via source_id)        │   │
│  └──────────────────────────────┬───────────────────────────────┘   │
│                                 │                                   │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      STORAGE LAYER                                  │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  SQLite (WAL mode) — congreso.db                             │   │
│  │  Popolo-Graph Schema: 12 tables                              │   │
│  │  area · organization · person · membership · post            │   │
│  │  motion · vote_event · vote · count                          │   │
│  │  actor_externo · relacion_poder · evento_politico            │   │
│  └──────────────────────────────┬───────────────────────────────┘   │
│                                 │                                   │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      ANALYSIS LAYER                                 │
│                                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐  │
│  │  W-NOMINATE   │  │  Co-voting   │  │  Community Detection    │  │
│  │  (scipy,      │  │  Matrix      │  │  (nx.community,         │  │
│  │   numpy)      │  │  & Graph     │  │   built-in Louvain)     │  │
│  └──────┬───────┘  └──────┬───────┘  └───────────┬──────────────┘  │
│         │                  │                       │                 │
│  ┌──────┴───────┐  ┌──────┴───────┐  ┌───────────┴──────────────┐  │
│  │  Centrality   │  │  Power       │  │  Empirical Power         │  │
│  │  (degree,     │  │  Indices     │  │  (from real voting       │  │
│  │   betweenness)│  │  (Shapley-   │  │   coalitions)            │  │
│  │               │  │   Shubik,    │  │                          │  │
│  │               │  │   Banzhaf)   │  │                          │  │
│  └──────┬───────┘  └──────┬───────┘  └───────────┬──────────────┘  │
│         │                  │                       │                 │
└─────────┼──────────────────┼───────────────────────┼─────────────────┘
          │                  │                       │
          ▼                  ▼                       ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      EXPORT LAYER                                   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  JSON files → public/data/observatorio/                      │   │
│  │  Pre-aggregated, static, no server-side computation          │   │
│  └──────────────────────────────┬───────────────────────────────┘   │
│                                 │                                   │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   VISUALIZATION LAYER                               │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  CachorroSpace (Astro + Starlight)                           │   │
│  │  ECharts 6 via React islands                                 │   │
│  │  Interactive charts: NOMINATE maps, co-voting graphs,        │   │
│  │  power indices, community structures                         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Scraping (Senado) | curl_cffi + TLS fingerprint impersonation | Anti-WAF evasion for Incapsula-protected portal |
| Scraping (Diputados) | httpx + BeautifulSoup | Open data portal scraping (SITL / INFOPAL) |
| Database | SQLite (WAL mode) | Unified Popolo-Graph storage |
| Build system | hatchling | Installable package via pyproject.toml |
| Analysis (NOMINATE) | scipy, numpy, matplotlib | Ideal point estimation (W-NOMINATE algorithm) |
| Analysis (Networks) | networkx (built-in Louvain) | Co-voting graphs, community detection |
| Analysis (Power) | numpy, scipy | Shapley-Shubik O(n²W) DP, Banzhaf indices |
| Exports | JSON (static) | Pre-aggregated data for visualizations |
| Visualizations | ECharts 6 (React islands) | Interactive charts on CachorroSpace |
| Logging | Python logging (centralized) | Structured logging via runner_utils.setup_logging() |
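The O(n²W) dynamic program named in the table can be illustrated with a short sketch. This is not the code in poder_partidos.py; it is one standard way to reach that complexity, assuming integer seat weights and a fixed quota. The function name `shapley_shubik` and the toy weights are illustrative, not taken from the project or from real seat counts.

```python
from fractions import Fraction
from math import factorial


def shapley_shubik(weights: list[int], quota: int) -> list[Fraction]:
    """Exact Shapley-Shubik indices for a weighted voting game (sketch)."""
    n, total = len(weights), sum(weights)

    # c[k][w] = number of coalitions with k members and combined weight w,
    # filled with one 0/1-knapsack pass per player.
    c = [[0] * (total + 1) for _ in range(n + 1)]
    c[0][0] = 1
    for wt in weights:
        for k in range(n - 1, -1, -1):
            for w in range(total - wt + 1):
                if c[k][w]:
                    c[k + 1][w + wt] += c[k][w]

    n_fact = factorial(n)
    indices = []
    for wt in weights:
        # d[k][w] = the same counts over coalitions that exclude this player,
        # recovered by undoing the player's knapsack step: O(nW) per player.
        d = [[0] * (total + 1) for _ in range(n)]
        d[0][0] = 1
        for k in range(1, n):
            for w in range(total + 1):
                d[k][w] = c[k][w] - (d[k - 1][w - wt] if w >= wt else 0)

        # The player is pivotal when the k predecessors in an ordering
        # weigh w with w < quota <= w + wt.
        lo = max(0, quota - wt)
        score = Fraction(0)
        for k in range(n):
            swings = sum(d[k][lo:quota])
            score += Fraction(swings * factorial(k) * factorial(n - 1 - k), n_fact)
        indices.append(score)
    return indices


# Toy game: three blocs with 50, 49 and 1 seats, majority quota of 51.
result = shapley_shubik([50, 49, 1], 51)
print([str(x) for x in result])  # ['2/3', '1/6', '1/6']
```

The toy result shows why seat share and voting power differ: the 49-seat bloc and the 1-seat bloc are pivotal equally often.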

:::tip
All analysis runs offline against the SQLite database. There is no server-side computation at visualization time: JSON exports are pre-computed and served as static files.
:::

Data Sources

| Source | URL | Chamber | Data |
| --- | --- | --- | --- |
| Cámara de Diputados | datos_abiertos / SITL / INFOPAL | Diputados | Voting records, legislator profiles, composition |
| Senado de la República | senado.gob.mx/66/ | Senado | Voting records, senator profiles, directorio |

:::note
The Senado portal is protected by Incapsula WAF. The scraper uses curl_cffi with impersonate="chrome" to bypass TLS fingerprint detection. The Diputados portal is open-access and uses standard HTTP requests via httpx.
:::

Data Flow

The pipeline processes data in four stages:

1. Scrape and Parse

Each chamber has a dedicated scraper with its own HTTP client, parser, and transformer modules:

  • Senado: curl_cffi session with TLS impersonation retrieves voting pages. Parsers extract vote data from HTML. Transformers normalize data into Popolo-Graph format.
  • Diputados: httpx client with file-based caching and rate limiting queries the SITL/INFOPAL systems. Parsers handle both XML and HTML responses.
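The file-based caching and rate limiting described for the Diputados client can be sketched with the standard library alone. The class below is hypothetical (`CachedFetcher`, `min_interval`, and the `.bin` suffix are not names from the project). It assumes the cache key is the SHA-256 of the URL and that any callable returning bytes can act as the HTTP layer.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Callable


class CachedFetcher:
    """Wrap any fetch(url) -> bytes callable with a file cache and a rate limit."""

    def __init__(self, fetch: Callable[[str], bytes], cache_dir: Path,
                 min_interval: float = 1.0) -> None:
        self.fetch = fetch
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_interval = min_interval
        self._last_request: float | None = None

    def _cache_path(self, url: str) -> Path:
        # SHA-256 of the URL gives a fixed-length, filesystem-safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.bin"

    def get(self, url: str) -> bytes:
        path = self._cache_path(url)
        if path.exists():
            return path.read_bytes()  # cache hit: no request, no delay
        if self._last_request is not None:
            # Sleep until min_interval has elapsed since the last real request.
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
        body = self.fetch(url)
        self._last_request = time.monotonic()
        path.write_bytes(body)
        return body


# Demo with a fake fetcher so the sketch runs without network access.
calls: list[str] = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"<html>ok</html>"


with tempfile.TemporaryDirectory() as tmp:
    fetcher = CachedFetcher(fake_fetch, Path(tmp), min_interval=0.0)
    first = fetcher.get("https://example.org/votacion/1")
    second = fetcher.get("https://example.org/votacion/1")  # served from disk
```

In a real client the `fetch` argument could be something like `lambda url: httpx.get(url).content`; the cache then makes repeated pipeline runs cheap and keeps load on the source portal low.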

2. Load into SQLite

Data flows through loaders that insert records into congreso.db with deduplication via the source_id column on the vote_event table. The id_generator module produces human-readable IDs with prefixes (P01, O01, VE01, etc.).
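A minimal sketch of this load step, assuming a UNIQUE source_id column and an ID scheme of prefix plus zero-padded counter. The helper names `next_id` and `load_vote_event`, the `title` column, and the sample source IDs are hypothetical; the project's loaders and id_generator may work differently.

```python
import sqlite3


def next_id(conn: sqlite3.Connection, table: str, prefix: str) -> str:
    """Hypothetical ID scheme: prefix + zero-padded running count (VE01, VE02, ...)."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return f"{prefix}{count + 1:02d}"


def load_vote_event(conn: sqlite3.Connection, source_id: str, title: str) -> str:
    """Insert a vote event unless its source_id is already stored; return its id."""
    row = conn.execute(
        "SELECT id FROM vote_event WHERE source_id = ?", (source_id,)
    ).fetchone()
    if row:
        return row[0]  # already loaded: re-running the scraper is a no-op
    new_id = next_id(conn, "vote_event", "VE")
    conn.execute(
        "INSERT INTO vote_event (id, source_id, title) VALUES (?, ?, ?)",
        (new_id, source_id, title),
    )
    return new_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vote_event (id TEXT PRIMARY KEY, source_id TEXT UNIQUE, title TEXT)"
)
first = load_vote_event(conn, "sitl-123", "Dictamen A")
again = load_vote_event(conn, "sitl-123", "Dictamen A")  # duplicate scrape
other = load_vote_event(conn, "sitl-456", "Dictamen B")
print(first, again, other)  # VE01 VE01 VE02
```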

3. Analysis

Analysis scripts read from SQLite and compute:

  • W-NOMINATE: Ideal point estimation placing legislators on a 2D ideological map
  • Co-voting matrix: Pairwise agreement rates between legislators, exported as weighted graphs
  • Community detection: Louvain algorithm (via nx.community) identifies voting blocs within co-voting networks
  • Centrality: Degree and betweenness centrality measures on co-voting graphs
  • Power indices: Shapley-Shubik and Banzhaf indices based on seat distributions
  • Empirical power: Measured from real voting coalition data, not just seat counts
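As an illustration of the co-voting step, the sketch below computes pairwise agreement rates from roll-call data in plain Python. The input shape, the function name `covoting_rates`, and the option labels are assumptions for the example; the project's covotacion.py may filter abstentions or absences differently.

```python
from itertools import combinations


def covoting_rates(votes: dict[str, dict[str, str]]) -> dict[tuple[str, str], float]:
    """Map each legislator pair to the share of shared vote events where they agree.

    votes has the shape {vote_event_id: {person_id: option}}.
    """
    agree: dict[tuple[str, str], int] = {}
    shared: dict[tuple[str, str], int] = {}
    for ballots in votes.values():
        # Only legislators who both voted in an event count toward the pair.
        for a, b in combinations(sorted(ballots), 2):
            shared[(a, b)] = shared.get((a, b), 0) + 1
            if ballots[a] == ballots[b]:
                agree[(a, b)] = agree.get((a, b), 0) + 1
    return {pair: agree.get(pair, 0) / n for pair, n in shared.items()}


# Toy roll calls, not real data.
toy_votes = {
    "VE01": {"P01": "yes", "P02": "yes", "P03": "no"},
    "VE02": {"P01": "no", "P02": "no", "P03": "no"},
    "VE03": {"P01": "yes", "P02": "no"},
}
rates = covoting_rates(toy_votes)
print(rates[("P01", "P03")])  # 0.5
```

Each rate can then serve as an edge weight in a networkx graph, which is the input that a Louvain call such as `nx.community.louvain_communities(G, weight="weight", seed=42)` would take.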

4. Export and Visualize

The export_observatorio_json.py script reads analysis CSV outputs and produces static JSON files consumed by ECharts 6 visualizations embedded as React islands in CachorroSpace.

analysis/output/*.csv
        │
        ▼
export_observatorio_json.py
        │
        ▼
public/data/observatorio/*.json
        │
        ▼
React ECharts islands (CachorroSpace)
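The CSV-to-JSON step can be sketched as follows. This is an assumption-laden illustration, not the contents of export_observatorio_json.py: the function name `export_csv_to_json`, the file name `poder.csv`, and its columns are made up for the demo.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_csv_to_json(csv_path: Path, json_path: Path) -> int:
    """Convert one analysis CSV into a static JSON array of row objects."""
    with csv_path.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))  # values stay strings in this sketch
    json_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps accented characters readable in the output.
    json_path.write_text(
        json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "analysis" / "output" / "poder.csv"
    src.parent.mkdir(parents=True)
    src.write_text("party,index\nA,0.5\nB,0.5\n", encoding="utf-8")
    dst = Path(tmp) / "public" / "data" / "observatorio" / "poder.json"
    n_rows = export_csv_to_json(src, dst)
    exported = json.loads(dst.read_text(encoding="utf-8"))
```

Because the output is a plain file, the site can serve it from any static host, and the charts never query the database.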

Project Structure

observatorio-congreso/
├── pyproject.toml               # hatchling build-system, deps, ruff config
│
├── scraper_congreso/            # Installable package (pip install -e .)
│   ├── __init__.py
│   ├── diputados/               # Chamber of Deputies scraper
│   │   ├── __init__.py
│   │   ├── __main__.py          # python -m scraper_congreso.diputados
│   │   ├── client.py            # httpx HTTP client with SHA256 cache
│   │   ├── config.py            # Legislatures + party mappings
│   │   ├── models.py            # Pydantic data models
│   │   ├── pipeline.py          # Main scraping pipeline
│   │   ├── loader.py            # SQLite loader (dedup via source_id)
│   │   ├── legislatura.py       # Legislature range logic
│   │   ├── transformers.py      # SITL → Popolo-Graph normalization
│   │   └── parsers/
│   │       ├── votaciones.py    # Vote event parser
│   │       ├── nominal.py       # Roll-call vote parser
│   │       ├── desglose.py      # Vote breakdown parser
│   │       ├── diputado.py      # Legislator profile parser
│   │       └── composicion.py   # Chamber composition parser
│   │
│   ├── senadores/               # Senate scraper
│   │   ├── __init__.py
│   │   ├── client.py            # Anti-WAF client (curl_cffi, 6 fingerprints)
│   │   ├── config.py            # Scraper configuration
│   │   ├── models.py            # Shared data models
│   │   ├── votaciones/          # Voting records scraper
│   │   │   ├── __init__.py
│   │   │   ├── __main__.py      # python -m scraper_congreso.senadores.votaciones
│   │   │   ├── cli.py           # CLI entry point
│   │   │   ├── loader.py        # SQLite loader
│   │   │   ├── transformers.py  # Data normalization
│   │   │   └── parsers/
│   │   │       └── lxvi_portal.py  # Portal /66/ parser (GET + POST AJAX)
│   │   └── perfiles/            # Senator profiles scraper
│   │       ├── __init__.py
│   │       ├── __main__.py      # python -m scraper_congreso.senadores.perfiles
│   │       ├── scraper.py       # Profile scraper logic
│   │       └── parsers/
│   │           └── perfil_parser.py
│   │
│   └── utils/                   # Shared utilities
│       ├── __init__.py
│       ├── base_loader.py       # BaseLoader (shared SQLite patterns)
│       ├── db_helpers.py        # DB helper functions
│       ├── db_utils.py          # DB utility functions
│       ├── id_generator.py      # Human-readable IDs (P01, O01, VE01...)
│       ├── text_utils.py        # Text normalization
│       ├── config.py            # Shared config
│       └── logging_config.py    # Logging configuration
│
├── analysis/                    # 28 modules (~13.8K lines)
│   ├── constants.py             # PARTY_COLORS, ORG_TO_SHORT, PARTY_ORDER, COLORES_WEB
│   ├── config.py                # 8 tuneable parameters (thresholds, seeds, IDs)
│   ├── db.py                    # Data access layer (get_connection + 5 parametrized queries)
│   ├── runner_utils.py          # Shared logging, argparse, run_for_cameras
│   ├── nominate.py              # W-NOMINATE implementation
│   ├── covotacion.py            # Co-voting matrix and graph
│   ├── covotacion_dinamica.py   # Dynamic time-windowed co-voting (829 lines)
│   ├── comunidades.py           # Louvain via nx.community (seed=42)
│   ├── centralidad.py           # Degree and betweenness centrality
│   ├── poder_partidos.py        # Shapley-Shubik O(n²W) DP + Banzhaf
│   ├── poder_empirico.py        # Empirical power from real votes
│   ├── evolucion_partidos.py    # Party evolution analysis
│   ├── efecto_genero.py         # Gender effect analysis
│   ├── efecto_curul_tipo.py     # Seat type effect analysis
│   ├── trayectorias.py          # Individual legislator trajectories
│   ├── visualizacion.py         # General visualization exports
│   ├── visualizacion_nominate.py
│   ├── visualizacion_dinamica.py
│   ├── visualizacion_poder.py
│   ├── visualizacion_articulo.py
│   ├── run_analysis.py          # Run all analyses
│   ├── run_nominate.py          # Run NOMINATE only
│   ├── run_covotacion_dinamica.py
│   ├── run_evolucion_partidos.py
│   ├── run_efecto_genero.py
│   ├── run_efecto_curul_tipo.py
│   └── run_trayectorias.py
│
├── db/
│   ├── schema.sql               # Synchronized schema (18 indexes, 14 FKs, corrected CHECKs)
│   ├── init_db.py               # PRAGMA FK ON + seed data
│   ├── constants.py             # LEGISLATURAS_ORDERED, CAMARA_IDS, party mappings
│   ├── congreso.db              # SQLite database (~337MB)
│   ├── migrations/              # 25 documented migrations (all applied, idempotent)
│   │   └── README.md            # Migration docs
│   └── archived/                # Obsolete files (senado_schema.sql, legacy helpers)
│
├── tests/                       # 302 tests (passing)
│
├── scripts/
│   ├── mantener.sh              # Project maintenance script
│   ├── backup_db.sh             # Database backup
│   └── clean_cache.sh           # Cache cleanup
│
└── cache/                       # HTTP response cache

Database Configuration

SQLite is configured for safe concurrent access and data integrity:

PRAGMA foreign_keys = ON;
PRAGMA encoding    = "UTF-8";
PRAGMA journal_mode = WAL;
PRAGMA busy_timeout = 5000;

SQLite applies foreign_keys per connection and leaves it off by default, so PRAGMA foreign_keys = ON is set in both db/init_db.py and analysis/db.py. This ensures referential integrity regardless of entry point.

| Setting | Value | Purpose |
| --- | --- | --- |
| journal_mode | WAL | Concurrent reads without blocking writes |
| foreign_keys | ON | Enforce referential integrity between tables |
| busy_timeout | 5000ms | Wait up to 5 seconds if database is locked |
| encoding | UTF-8 | Correct handling of Spanish characters (accents, ñ) |
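A connection helper that applies these settings might look like the sketch below. The name get_connection appears in analysis/db.py, but this body is an assumption about how such a helper could be written, not a copy of the project's code.

```python
import os
import sqlite3
import tempfile


def get_connection(db_path: str) -> sqlite3.Connection:
    """Open a SQLite connection with the settings from the table above."""
    conn = sqlite3.connect(db_path)
    # encoding only takes effect on a database that has not been created yet.
    conn.execute('PRAGMA encoding = "UTF-8"')
    # foreign_keys is per connection, so every entry point must set it.
    conn.execute("PRAGMA foreign_keys = ON")
    # WAL is stored in the database file and persists across connections.
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA busy_timeout = 5000")
    return conn


# Demo against a throwaway file (WAL is not available for in-memory databases).
tmp_dir = tempfile.mkdtemp()
conn = get_connection(os.path.join(tmp_dir, "demo.db"))
fk_on = conn.execute("PRAGMA foreign_keys").fetchone()[0]
journal = conn.execute("PRAGMA journal_mode").fetchone()[0]
timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
conn.close()
print(fk_on, journal, timeout_ms)  # 1 wal 5000
```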

The schema defines 14 foreign keys with explicit ON DELETE / ON UPDATE actions: 3 use CASCADE (for dependent records that should propagate deletions) and 11 use RESTRICT (to prevent orphaned references).
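The difference between the two actions can be seen in a small in-memory example. Which foreign keys use CASCADE and which use RESTRICT in the real schema is not stated here, so the assignment below (vote to vote_event cascades, vote to person restricts) and the column names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (id TEXT PRIMARY KEY);
CREATE TABLE vote_event (id TEXT PRIMARY KEY);
CREATE TABLE vote (
    id TEXT PRIMARY KEY,
    vote_event_id TEXT NOT NULL REFERENCES vote_event(id) ON DELETE CASCADE,
    voter_id TEXT NOT NULL REFERENCES person(id) ON DELETE RESTRICT
);
INSERT INTO person VALUES ('P01');
INSERT INTO vote_event VALUES ('VE01');
INSERT INTO vote VALUES ('V01', 'VE01', 'P01');
""")

# RESTRICT: a person who still has votes cannot be deleted.
try:
    conn.execute("DELETE FROM person WHERE id = 'P01'")
    restricted = False
except sqlite3.IntegrityError:
    restricted = True

# CASCADE: deleting the vote event removes its dependent votes.
conn.execute("DELETE FROM vote_event WHERE id = 'VE01'")
remaining_votes = conn.execute("SELECT COUNT(*) FROM vote").fetchone()[0]
print(restricted, remaining_votes)  # True 0
```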

Schema Overview

The Popolo-Graph schema contains 12 tables with 18 indexes, 14 foreign keys, and 5 corrected CHECK constraints. It is organized into two groups:

Core Popolo entities (legislative data standard):

| Table | Purpose |
| --- | --- |
| area | Geographic divisions (states, districts, constituencies) |
| organization | Political parties, blocs, coalitions, institutions |
| person | Legislators and political actors |
| membership | Person-to-organization relationships with roles and dates |
| post | Legislative positions within organizations and areas |
| motion | Bills and legislative initiatives |
| vote_event | Specific voting instances (chamber + date) |
| vote | Individual legislator votes per event |
| count | Aggregated vote counts per group per event |

Power network extensions (beyond standard Popolo):

| Table | Purpose |
| --- | --- |
| actor_externo | External actors (governors, party leaders, judges) |
| relacion_poder | Informal power relationships (loyalty, pressure, alliances) |
| evento_politico | Political events that affect power dynamics |

:::note
All tables use human-readable IDs with prefixes (P01 for person, O01 for organization, VE01 for vote event, etc.). This makes debugging and manual queries significantly easier than opaque integer primary keys.
:::

Schema Maintenance

The db/migrations/ directory contains 25 documented migration scripts, all applied and idempotent. Obsolete schema files (such as the former senado_schema.sql and legacy helper scripts) are preserved in db/archived/ for reference.
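One common way to make a migration idempotent is to inspect the current schema before altering it. The sketch below assumes a migration that adds a column; the helper name `add_column_if_missing` is hypothetical and the project's migration scripts may use a different mechanism.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str,
                          column: str, decl: str) -> bool:
    """Add a column only when it is absent; return True if the schema changed."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT)")
first_run = add_column_if_missing(conn, "person", "corriente_interna", "TEXT")
second_run = add_column_if_missing(conn, "person", "corriente_interna", "TEXT")
print(first_run, second_run)  # True False
```

Running the script twice leaves the schema unchanged the second time, which is what makes it safe to re-apply the whole migrations directory.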

Indexes

The schema includes 18 indexes covering the most common query patterns:

  • membership queries by person and by organization
  • vote_event lookups by motion and by source_id (deduplication)
  • vote queries by voter and by event
  • count queries by event and by group
  • relacion_poder queries by source, target, and type
  • person filtering by internal faction (corriente_interna)
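Whether a query actually uses one of these indexes can be checked with EXPLAIN QUERY PLAN. The sketch below assumes an index named idx_vote_event_source_id on a reduced vote_event table; the real index names in db/schema.sql may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vote_event (id TEXT PRIMARY KEY, motion_id TEXT, source_id TEXT);
CREATE INDEX idx_vote_event_source_id ON vote_event(source_id);
""")

# Each plan row is (id, parent, notused, detail); detail is the readable step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM vote_event WHERE source_id = ?",
    ("sitl-123",),
).fetchall()
details = [row[3] for row in plan]
uses_index = any("idx_vote_event_source_id" in d for d in details)
print(details)
```

The exact wording of the detail text varies between SQLite versions, so it is safer to look for the index name than to compare the whole string.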

Integrity Constraints

Date validation CHECK constraints ensure end_date >= start_date on person and membership tables for both inserts and updates. These constraints enforce data integrity at the SQLite level regardless of which loader writes the data.
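A reduced example shows the constraint rejecting both a bad insert and a bad update. It assumes dates are stored as ISO 8601 text (which compares correctly as strings) and that open-ended memberships use NULL for end_date; the column set and the M01 prefix are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE membership (
    id TEXT PRIMARY KEY,
    start_date TEXT,
    end_date TEXT,
    CHECK (end_date IS NULL OR start_date IS NULL OR end_date >= start_date)
)
""")
conn.execute("INSERT INTO membership VALUES ('M01', '2018-09-01', '2021-08-31')")


def rejected(sql: str) -> bool:
    """Return True when SQLite refuses the statement with a constraint error."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


bad_insert = rejected("INSERT INTO membership VALUES ('M02', '2021-09-01', '2018-09-01')")
bad_update = rejected("UPDATE membership SET end_date = '2018-01-01' WHERE id = 'M01'")
open_ended = rejected("INSERT INTO membership VALUES ('M03', '2024-09-01', NULL)")
print(bad_insert, bad_update, open_ended)  # True True False
```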

Data Volumes

| Metric | Value |
| --- | --- |
| Individual votes | 3,510,053 |
| Vote events | 9,437 |
| Persons | 4,840 |
| Organizations | 20+ |
| Legislatures | 7 (LX through LXVI, 2006-2027) |
| Tests | 302 passing |
| Migration scripts | 25 (all applied) |

Build System

The project uses pyproject.toml with hatchling as its build backend, so the scraper can be installed as a package in editable mode by running pip install -e . from the repository root.

Entry Points

python -m scraper_congreso.diputados           # Scrape Diputados
python -m scraper_congreso.senadores.votaciones # Scrape Senate votes
python -m scraper_congreso.senadores.perfiles   # Scrape Senate profiles

Dependencies

Core (scraper): curl_cffi, httpx, beautifulsoup4, lxml, pydantic

Dev: pytest, ruff

Analysis: numpy, pandas, scipy, networkx, matplotlib, polars