RTI public-authority universe scrapers

Scrapers that build the universe of public authorities registered on Indian state RTI (Right to Information) portals — every department, head of department, and sub-office down to each leaf — so the offices can be screened (e.g. for relevance to reservation data) and used to file RTI applications.

Most states' lists can be assembled by hand. Some portals (e.g. Tamil Nadu) nest authorities by level, with a separate link per node, and list thousands of offices, so they need automation. This repo collects one self-contained scraper per state.
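To illustrate why a nested portal calls for automation, here is a minimal sketch of a breadth-first walk over a tree where each node exposes its own link. It is not the code in scrape.py: the function name crawl, the fetch_children callable, and the toy portal are all hypothetical stand-ins, and a real crawl would fetch and parse HTML instead of reading a dict.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable, Optional


def crawl(
    root: str,
    fetch_children: Callable[[str], Iterable[str]],
) -> list[tuple[str, Optional[str]]]:
    """Breadth-first walk; returns (node, parent) pairs in visit order.

    fetch_children is a hypothetical callable that returns the links
    found on a node's page. Each node is visited at most once.
    """
    seen = {root}
    order: list[tuple[str, Optional[str]]] = [(root, None)]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in fetch_children(node):
            if child in seen:
                continue  # already reached via another parent
            seen.add(child)
            order.append((child, node))
            queue.append(child)
    return order


# Toy stand-in for a portal: node -> links on that node's page.
TOY_PORTAL = {
    "root": ["dept-a", "dept-b"],
    "dept-a": ["office-1"],
    "dept-b": ["office-1", "office-2"],
}

pairs = crawl("root", lambda node: TOY_PORTAL.get(node, []))
```

Recording the parent alongside each node is what lets a later stage rebuild the hierarchy without touching the network again.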

Layout

rti/
├── README.md            # this file
├── requirements.txt     # shared dependencies for all states
└── states/
    └── tamil_nadu/      # one self-contained scraper per state
        ├── README.md    # state-specific docs (site quirks, usage)
        ├── scrape.py    # Stage 1: crawl the portal, save raw HTML + manifest
        ├── parse.py     # Stage 2: rebuild the tree offline, write CSVs
        ├── categorize.py  # Stage 3: keyword-screen offices for relevance
        ├── common.py    # shared identity + HTML-parsing helpers
        └── data/        # outputs (mostly gitignored — see below)

Each state directory follows the same three-stage shape: scrape → parse → categorize. See states/tamil_nadu/README.md for the worked example.
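As a rough picture of what the offline parse stage does, the sketch below rebuilds each office's level and full path from manifest-style records. The record fields (id, parent, name) are an assumed schema for illustration only; the actual manifest.jsonl format is documented with the state scraper, and the rebuild function is hypothetical.

```python
import json


def rebuild(manifest_lines):
    """Turn JSONL records into rows with a depth level and a readable path.

    Assumes (hypothetically) that each record has "id", "parent" and
    "name", with "parent" set to null for the root.
    """
    records = [json.loads(line) for line in manifest_lines if line.strip()]
    by_id = {r["id"]: r for r in records}
    rows = []
    for record in records:
        path = []
        visited = set()
        cur = record
        # Walk up to the root, guarding against cycles in bad data.
        while cur is not None and cur["id"] not in visited:
            visited.add(cur["id"])
            path.append(cur["name"])
            cur = by_id.get(cur["parent"])
        path.reverse()
        rows.append(
            {"id": record["id"], "level": len(path) - 1, "path": " > ".join(path)}
        )
    return rows


sample = [
    '{"id": 1, "parent": null, "name": "TAMIL NADU (ROOT)"}',
    '{"id": 2, "parent": 1, "name": "Dept A"}',
    '{"id": 3, "parent": 2, "name": "Office X"}',
]
rows = rebuild(sample)
```

Because this stage reads only saved records, it can be re-run as often as needed while the parsing logic is adjusted, without re-crawling the portal.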

Setup

python3 -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt

Then cd into a state directory and follow its README.

Adding a new state

Copy states/tamil_nadu/ to states/<state>/ and adapt the portal-specific knobs — the per-state logic is concentrated in a few constants:

common.py — BASE / ENTRY / INDEX URLs, the page-title prefixes (_TITLE_PREFIXES), the leaf-detection regex (_LEAF_RE), and the row/HTML selectors in parse_rows.
scrape.py — the root office name (e.g. "TAMIL NADU (ROOT)") and any geo-fencing / preflight checks specific to the portal.
parse.py — LEVEL_LABELS for the state's organizational hierarchy.
categorize.py — the KEYWORDS list for that state's social categories.
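The keyword screen is the simplest of these knobs to picture. A minimal sketch, assuming a case-insensitive whole-word match against the office name; the keywords shown are illustrative placeholders rather than the repo's actual KEYWORDS list, and the screen function is hypothetical.

```python
import re

# Illustrative placeholders only; each state supplies its own list.
KEYWORDS = ["backward classes", "tribal", "welfare"]

# One case-insensitive pattern that matches any keyword as a whole word/phrase.
_KEYWORD_RE = re.compile(
    "|".join(r"\b" + re.escape(keyword) + r"\b" for keyword in KEYWORDS),
    re.IGNORECASE,
)


def screen(office_name):
    """Return the sorted, lowercased keywords found in an office name."""
    return sorted({m.group(0).lower() for m in _KEYWORD_RE.finditer(office_name)})


hits = screen("Directorate of Tribal Welfare")
misses = screen("Public Works Department")
```

An office with no hits gets an empty list, which makes it easy to filter the universe down to candidates for manual review.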

Data in git

Only the final deliverable, each state's data/universe_categorized.csv, is committed. The large regenerable artifacts (raw HTML, manifest.jsonl, intermediate CSVs, logs) are gitignored; re-run the scrapers to reproduce them.
