A concurrent URL downloading system built in Python. The project implements a production-grade job processing pipeline with retry scheduling, lifecycle management, and real-time observability.
The system is built around a producer-consumer pipeline:
Main Thread (producer)
↓ task_queue (bounded, maxsize=10)
Worker Threads (x4)
↓ on failure
Retry Queue (PriorityQueue)
↓
Retry Scheduler Thread → task_queue (re-enqueues after delay)
Monitor Thread (observability)
Job Registry (lifecycle tracking)
Main Thread reads URLs from a file, creates DownloadJob objects, registers them in the job registry, and enqueues them into the bounded task queue. Backpressure is applied naturally — if workers cannot keep up, the producer blocks on task_queue.put().
Worker Threads consume jobs from the task queue, perform HTTP downloads with streaming, classify responses, and route jobs to either terminal state (SUCCESS/FAILED) or the retry queue.
Retry Scheduler Thread monitors the retry queue and re-enqueues jobs into the task queue after their backoff delay has elapsed.
Monitor Thread logs queue sizes, metrics, and job registry state every second.
- Streaming HTTP downloads with chunked file writing
Content-Length-aware download validation — handles both declared-size and chunked transfer encoding responses- Configurable per-job timeout
- Exponential backoff with full jitter to prevent retry storms
- Response classification: 2xx success, 4xx terminal failure (no retry), 5xx and network errors are retried
- Configurable max retries per job
- Bounded task queue with backpressure
- Thread-safe metrics collector using locks
- Semaphore-based lifecycle accounting — main thread blocks until every job reaches a terminal state, not until queues appear empty
JobStatusstate machine:PENDING → RUNNING → SUCCESS | FAILED | RETRY_SCHEDULED → RUNNINGJobRegistrytracks every job's state, attempt count, timestamps, bytes downloaded, and last error- Transition validation — illegal state transitions raise immediately
- Phased shutdown sequence: wait for terminal states → stop scheduler → stop workers → stop monitor → drain results
- Real-time monitoring of queue depths, throughput, and failure counts
- Per-job registry queryable at any point during execution
- Final metrics and registry summary logged at shutdown
url_downloader/
├── main.py # Entry point, argument parsing
├── process_jobs.py # Worker pool, scheduler, download pipeline
├── input_parser.py # URL file parsing and job creation
├── metrics.py # Thread-safe metrics collector
├── shared/
│ ├── models.py # DownloadJob, JobResult, JobStatus, JobRecord
│ └── job_registry.py # JobRegistry with transition validation
├── downloads/ # Output directory for downloaded files
└── test_urls.txt # Sample URL list
python main.py test_urls.txt
python main.py test_urls.txt --max_workers 8
python main.py test_urls.txt --verboseArguments:
file_path— path to a file containing one URL per line-w, --max_workers— number of worker threads (default: 4)-v, --verbose— enable debug logging
requests
validators
Semaphore for completion tracking rather than joining on queue emptiness. A queue being empty is a transient observation — jobs in flight have already been dequeued but haven't reached a terminal state. The semaphore counter increments only on SUCCESS or FAILED, giving the main thread a precise signal that all work is genuinely complete.
Bounded task queue deliberately limits how many jobs can be enqueued ahead of workers. This creates backpressure — if workers are slow, the producer slows down rather than buffering unbounded work in memory.
Jitter on retry backoff prevents retry storms. Without jitter, all jobs that fail simultaneously will retry simultaneously, creating a synchronized wave of load. Randomizing the backoff spreads retries across time.
Phased shutdown ensures no work is abandoned. The scheduler is stopped before workers, so no new jobs can be re-enqueued after workers begin shutting down. Workers are stopped before the monitor, so the final metrics snapshot reflects all completed work.