Skip to content

Zobam7/Multithreaded-URL-Downloader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Multithreaded URL Downloader

A concurrent URL downloading system built in Python. The project implements a production-grade job processing pipeline with retry scheduling, lifecycle management, and real-time observability.

Architecture

The system is built around a producer-consumer pipeline:

Main Thread (producer)
    ↓ task_queue (bounded, maxsize=10)
Worker Threads (x4)
    ↓ on failure
Retry Queue (PriorityQueue)
    ↓
Retry Scheduler Thread → task_queue (re-enqueues after delay)

Monitor Thread (observability)
Job Registry (lifecycle tracking)

Main Thread reads URLs from a file, creates DownloadJob objects, registers them in the job registry, and enqueues them into the bounded task queue. Backpressure is applied naturally — if workers cannot keep up, the producer blocks on task_queue.put().

Worker Threads consume jobs from the task queue, perform HTTP downloads with streaming, classify responses, and route jobs to either terminal state (SUCCESS/FAILED) or the retry queue.

Retry Scheduler Thread monitors the retry queue and re-enqueues jobs into the task queue after their backoff delay has elapsed.

Monitor Thread logs queue sizes, metrics, and job registry state every second.

Features

Networking

  • Streaming HTTP downloads with chunked file writing
  • Content-Length-aware download validation — handles both declared-size and chunked transfer encoding responses
  • Configurable per-job timeout

Reliability

  • Exponential backoff with full jitter to prevent retry storms
  • Response classification: 2xx success, 4xx terminal failure (no retry), 5xx and network errors are retried
  • Configurable max retries per job

Concurrency

  • Bounded task queue with backpressure
  • Thread-safe metrics collector using locks
  • Semaphore-based lifecycle accounting — main thread blocks until every job reaches a terminal state, not until queues appear empty

Lifecycle Management

  • JobStatus state machine: PENDING → RUNNING → SUCCESS | FAILED | RETRY_SCHEDULED → RUNNING
  • JobRegistry tracks every job's state, attempt count, timestamps, bytes downloaded, and last error
  • Transition validation — illegal state transitions raise immediately
  • Phased shutdown sequence: wait for terminal states → stop scheduler → stop workers → stop monitor → drain results

Observability

  • Real-time monitoring of queue depths, throughput, and failure counts
  • Per-job registry queryable at any point during execution
  • Final metrics and registry summary logged at shutdown

Project Structure

url_downloader/
├── main.py               # Entry point, argument parsing
├── process_jobs.py       # Worker pool, scheduler, download pipeline
├── input_parser.py       # URL file parsing and job creation
├── metrics.py            # Thread-safe metrics collector
├── shared/
│   ├── models.py         # DownloadJob, JobResult, JobStatus, JobRecord
│   └── job_registry.py   # JobRegistry with transition validation
├── downloads/            # Output directory for downloaded files
└── test_urls.txt         # Sample URL list

Usage

python main.py test_urls.txt
python main.py test_urls.txt --max_workers 8
python main.py test_urls.txt --verbose

Arguments:

  • file_path — path to a file containing one URL per line
  • -w, --max_workers — number of worker threads (default: 4)
  • -v, --verbose — enable debug logging

Requirements

requests
validators

Key Design Decisions

Semaphore for completion tracking rather than joining on queue emptiness. A queue being empty is a transient observation — jobs in flight have already been dequeued but haven't reached a terminal state. The semaphore counter increments only on SUCCESS or FAILED, giving the main thread a precise signal that all work is genuinely complete.

Bounded task queue deliberately limits how many jobs can be enqueued ahead of workers. This creates backpressure — if workers are slow, the producer slows down rather than buffering unbounded work in memory.

Jitter on retry backoff prevents retry storms. Without jitter, all jobs that fail simultaneously will retry simultaneously, creating a synchronized wave of load. Randomizing the backoff spreads retries across time.

Phased shutdown ensures no work is abandoned. The scheduler is stopped before workers, so no new jobs can be re-enqueued after workers begin shutting down. Workers are stopped before the monitor, so the final metrics snapshot reflects all completed work.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages