Multithreaded URL Downloader

A concurrent URL downloader written in Python. The project implements a production-style job processing pipeline with retry scheduling, lifecycle management, and real-time observability.

Architecture

The system is built around a producer-consumer pipeline:

Main Thread (producer)
    ↓ task_queue (bounded, maxsize=10)
Worker Threads (x4)
    ↓ on failure
Retry Queue (PriorityQueue)
    ↓
Retry Scheduler Thread → task_queue (re-enqueues after delay)

Monitor Thread (observability)
Job Registry (lifecycle tracking)

Main Thread reads URLs from a file, creates DownloadJob objects, registers them in the job registry, and enqueues them into the bounded task queue. Backpressure is applied naturally — if workers cannot keep up, the producer blocks on task_queue.put().
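
The blocking behaviour described above can be illustrated with a minimal sketch built on the standard library's queue.Queue. The function name run_backpressure_demo, the integer jobs, and the short sleep standing in for a download are assumptions for illustration, not the project's code.

```python
import queue
import threading
import time


def run_backpressure_demo(num_jobs=30, maxsize=10, work_seconds=0.002):
    """Feed num_jobs through a bounded queue and report the deepest queue seen."""
    task_queue = queue.Queue(maxsize=maxsize)
    processed = []
    max_depth_seen = 0

    def worker():
        while True:
            job = task_queue.get()
            if job is None:  # sentinel: no more work
                break
            time.sleep(work_seconds)  # stand-in for a slow download
            processed.append(job)

    consumer = threading.Thread(target=worker)
    consumer.start()

    for job in range(num_jobs):
        # put() blocks while the queue is full, so the producer is
        # throttled to the pace of the worker.
        task_queue.put(job)
        max_depth_seen = max(max_depth_seen, task_queue.qsize())

    task_queue.put(None)
    consumer.join()
    return processed, max_depth_seen
```

However slow the worker is, the queue depth observed by the producer never exceeds maxsize, which is the bound on buffered work.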

Worker Threads consume jobs from the task queue, perform HTTP downloads with streaming, classify responses, and route each job to either a terminal state (SUCCESS/FAILED) or the retry queue.

Retry Scheduler Thread monitors the retry queue and re-enqueues jobs into the task queue after their backoff delay has elapsed.
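
One way such a scheduler could work is sketched below, assuming retry entries are ordered by the time at which they become due. The names schedule_retry and retry_scheduler, the tuple layout, and the polling interval are hypothetical; the project's own scheduler may be structured differently.

```python
import itertools
import queue
import threading
import time

# Tie-breaker so two entries with the same due time never compare the jobs.
_sequence = itertools.count()


def schedule_retry(retry_queue, job, delay_seconds):
    """Store the job keyed by the monotonic time at which it becomes due."""
    ready_at = time.monotonic() + delay_seconds
    retry_queue.put((ready_at, next(_sequence), job))


def retry_scheduler(retry_queue, task_queue, stop_event, poll_seconds=0.01):
    """Move due jobs from the retry queue back into the task queue."""
    while not stop_event.is_set():
        try:
            ready_at, seq, job = retry_queue.get(timeout=poll_seconds)
        except queue.Empty:
            continue
        if ready_at > time.monotonic():
            # Earliest entry is not due yet: put it back and wait briefly.
            retry_queue.put((ready_at, seq, job))
            stop_event.wait(poll_seconds)
            continue
        task_queue.put(job)
```

Because PriorityQueue always returns the smallest entry first, the scheduler only ever has to inspect the job with the earliest due time.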

Monitor Thread logs queue sizes, metrics, and job registry state every second.

Features

Networking

Streaming HTTP downloads with chunked file writing
Content-Length-aware download validation — handles both declared-size and chunked transfer encoding responses
Configurable per-job timeout
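
The size validation can be sketched independently of the HTTP client. The helpers below, write_stream and parse_content_length, are hypothetical names; they assume the caller supplies an iterable of byte chunks and, when the server declared one, the expected size. When no size is declared (chunked transfer encoding), the check is skipped.

```python
import os
import tempfile


def parse_content_length(headers):
    """Return the declared size as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")
    if value is not None and value.isdigit():
        return int(value)
    return None


def write_stream(chunks, dest_path, expected_size=None):
    """Write byte chunks to dest_path and validate the total if a size is known."""
    written = 0
    with open(dest_path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                fh.write(chunk)
                written += len(chunk)
    if expected_size is not None and written != expected_size:
        os.remove(dest_path)  # do not leave a truncated file behind
        raise IOError(
            f"incomplete download: expected {expected_size} bytes, got {written}"
        )
    return written
```

With requests, the chunks would typically come from response.iter_content(chunk_size=...) on a response obtained with requests.get(url, stream=True, timeout=...). Note that requests decodes gzip and deflate bodies while iterating, so for compressed responses the decoded byte count may not equal Content-Length.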

Reliability

Exponential backoff with full jitter to prevent retry storms
Response classification: 2xx is a success, 4xx is a terminal failure (no retry), and 5xx responses and network errors are retried
Configurable max retries per job
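
The classification and retry-limit rules above could be expressed roughly as follows. Outcome, classify, and next_status are hypothetical names, and the handling of status codes outside 2xx, 4xx, and 5xx is an assumption, since the rules above do not cover them.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    TERMINAL_FAILURE = "terminal_failure"
    RETRY = "retry"


def classify(status_code=None, network_error=False):
    """Map an HTTP status (or a network error) to an outcome."""
    if network_error or status_code is None:
        return Outcome.RETRY
    if 200 <= status_code < 300:
        return Outcome.SUCCESS
    if 400 <= status_code < 500:
        return Outcome.TERMINAL_FAILURE
    if 500 <= status_code < 600:
        return Outcome.RETRY
    # Assumption: anything else (1xx, unfollowed 3xx) is not retried.
    return Outcome.TERMINAL_FAILURE


def next_status(outcome, retries_used, max_retries):
    """Decide the next job status, honouring the per-job retry limit."""
    if outcome is Outcome.SUCCESS:
        return "SUCCESS"
    if outcome is Outcome.RETRY and retries_used < max_retries:
        return "RETRY_SCHEDULED"
    return "FAILED"
```

A retryable error therefore only becomes FAILED once the job has used up its retries, while a 4xx response fails immediately.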

Concurrency

Bounded task queue with backpressure
Thread-safe metrics collector using locks
Semaphore-based lifecycle accounting — main thread blocks until every job reaches a terminal state, not until queues appear empty
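
A lock-protected metrics collector might look like the sketch below. The class name Metrics and its increment/snapshot methods are assumptions for illustration; the contents of metrics.py are not shown here.

```python
import threading


class Metrics:
    """Counter store that is safe to update from many threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def increment(self, name, amount=1):
        # The read-modify-write must happen under the lock, otherwise
        # concurrent updates can overwrite each other.
        with self._lock:
            self._counts[name] = self._counts.get(name, 0) + amount

    def snapshot(self):
        # Return a copy so callers never observe a dict mid-update.
        with self._lock:
            return dict(self._counts)
```

Returning a copy from snapshot() lets a monitor thread log the counters without holding the lock while it formats output.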

Lifecycle Management

JobStatus state machine: PENDING → RUNNING → SUCCESS | FAILED | RETRY_SCHEDULED; a RETRY_SCHEDULED job returns to RUNNING when it is retried
JobRegistry tracks every job's state, attempt count, timestamps, bytes downloaded, and last error
Transition validation — illegal state transitions raise immediately
Phased shutdown sequence: wait for terminal states → stop scheduler → stop workers → stop monitor → drain results
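
Transition validation for the state machine listed above can be sketched with a table of allowed moves. The JobStatus members follow the states named in this README; the ALLOWED table, the IllegalTransition exception, and the transition function are hypothetical names, not the project's API.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    RETRY_SCHEDULED = "RETRY_SCHEDULED"


# Each status maps to the set of statuses it may move to next.
ALLOWED = {
    JobStatus.PENDING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {
        JobStatus.SUCCESS,
        JobStatus.FAILED,
        JobStatus.RETRY_SCHEDULED,
    },
    JobStatus.RETRY_SCHEDULED: {JobStatus.RUNNING},
    JobStatus.SUCCESS: set(),  # terminal
    JobStatus.FAILED: set(),   # terminal
}


class IllegalTransition(Exception):
    pass


def transition(current, new):
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise IllegalTransition(f"{current.name} -> {new.name}")
    return new
```

Failing loudly on an illegal move, for example SUCCESS to RUNNING, surfaces bookkeeping bugs at the point where they occur rather than in the final summary.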

Observability

Real-time monitoring of queue depths, throughput, and failure counts
Per-job registry queryable at any point during execution
Final metrics and registry summary logged at shutdown

Project Structure

url_downloader/
├── main.py               # Entry point, argument parsing
├── process_jobs.py       # Worker pool, scheduler, download pipeline
├── input_parser.py       # URL file parsing and job creation
├── metrics.py            # Thread-safe metrics collector
├── shared/
│   ├── models.py         # DownloadJob, JobResult, JobStatus, JobRecord
│   └── job_registry.py   # JobRegistry with transition validation
├── downloads/            # Output directory for downloaded files
└── test_urls.txt         # Sample URL list

Usage

python main.py test_urls.txt
python main.py test_urls.txt --max_workers 8
python main.py test_urls.txt --verbose

Arguments:

file_path — path to a file containing one URL per line
-w, --max_workers — number of worker threads (default: 4)
-v, --verbose — enable debug logging
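
For reference, the arguments listed above correspond to an argparse setup along these lines. This is a sketch that mirrors the documented flags and default; the help strings and the build_parser name are assumptions, and main.py may differ.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Multithreaded URL downloader")
    parser.add_argument(
        "file_path",
        help="path to a file containing one URL per line",
    )
    parser.add_argument(
        "-w", "--max_workers",
        type=int,
        default=4,
        help="number of worker threads (default: 4)",
    )
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="enable debug logging",
    )
    return parser
```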

Requirements

requests
validators

Key Design Decisions

Semaphore for completion tracking rather than joining on queue emptiness. A queue being empty is a transient observation — jobs in flight have already been dequeued but haven't reached a terminal state. The semaphore counter increments only on SUCCESS or FAILED, giving the main thread a precise signal that all work is genuinely complete.
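
The idea can be sketched as follows: workers release the semaphore once per job that reaches a terminal state, and the main thread acquires it once per submitted job. The function process_all, the attempt_fn callback, and the string statuses are hypothetical and stand in for the real pipeline; retries are re-enqueued immediately here to keep the example short.

```python
import queue
import threading


def process_all(jobs, attempt_fn, max_retries=2, num_workers=4):
    """Run every job to a terminal state and return {job: status}."""
    task_queue = queue.Queue()
    done = threading.Semaphore(0)
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = task_queue.get()
            if item is None:  # sentinel: shut down
                return
            job, attempt = item
            ok = attempt_fn(job, attempt)
            if ok or attempt >= max_retries:
                with results_lock:
                    results[job] = "SUCCESS" if ok else "FAILED"
                done.release()  # released only on a terminal state
            else:
                # Not terminal: the queue may look empty at this moment,
                # but the job is still outstanding, so nothing is released.
                task_queue.put((job, attempt + 1))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        task_queue.put((job, 0))
    for _ in jobs:
        done.acquire()  # one acquire per job, not per attempt
    for _ in threads:
        task_queue.put(None)
    for t in threads:
        t.join()
    return results
```

Because the main thread counts terminal states rather than inspecting the queue, a job that is between dequeue and re-enqueue cannot cause an early exit.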

Bounded task queue deliberately limits how many jobs can be enqueued ahead of workers. This creates backpressure — if workers are slow, the producer slows down rather than buffering unbounded work in memory.

Jitter on retry backoff prevents retry storms. Without jitter, all jobs that fail simultaneously will retry simultaneously, creating a synchronized wave of load. Randomizing the backoff spreads retries across time.
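
Full jitter is commonly computed by drawing the delay uniformly between zero and an exponentially growing, capped ceiling. The sketch below assumes a base of 1 second and a cap of 30 seconds; these values and the function names are illustrative, not the project's configuration.

```python
import random


def backoff_ceiling(attempt, base=1.0, cap=30.0):
    """Upper bound for the delay: base * 2**attempt, limited by cap."""
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Pick a delay uniformly in [0, ceiling] so retries spread out in time."""
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))
```

Two jobs that fail at the same instant on the same attempt number will, with high probability, draw different delays, which is what breaks up the synchronized wave.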

Phased shutdown ensures no work is abandoned. The scheduler is stopped before workers, so no new jobs can be re-enqueued after workers begin shutting down. Workers are stopped before the monitor, so the final metrics snapshot reflects all completed work.
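
The ordering can be enforced by stopping and joining one stage at a time. The sketch below covers only the stop scheduler, stop workers, stop monitor portion of the sequence and assumes each stage is controlled by its own threading.Event; phased_shutdown and the stage tuples are hypothetical names.

```python
import threading


def phased_shutdown(stages, join_timeout=5.0):
    """Stop stages in order; each stage is (name, stop_event, threads).

    A stage is signalled only after every thread of the previous stage
    has exited, which is what guarantees the ordering.
    """
    stopped = []
    for name, stop_event, threads in stages:
        stop_event.set()
        for t in threads:
            t.join(timeout=join_timeout)
            if t.is_alive():
                raise RuntimeError(f"{name} thread did not stop in time")
        stopped.append(name)
    return stopped
```

Called with the stages ordered scheduler, workers, monitor, the monitor keeps running until every worker has exited, so its last snapshot can include all completed work.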
