CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a dual-component system for scraping and serving Taiwan Legislative Yuan IVOD (Internet Video on Demand) transcripts:

Crawler Component (crawler/): Python-based scraper that extracts transcripts from the IVOD service and stores them in a relational database (SQLite, PostgreSQL, or MySQL) and in Elasticsearch
Web Application (app/): Next.js React application providing search and browsing interface for transcripts

Architecture

Crawler Architecture (`crawler/`)

Core Module: ivod/core.py - Database setup, ORM models, HTTP/scraping utilities
Task Workflows: ivod/tasks.py - Main execution workflows (full/incremental/retry)
Entry Points: ivod_full.py, ivod_incremental.py, ivod_retry.py - CLI wrappers
Database Models: Single IVODTranscript model with status tracking for processing states
Multi-backend Support: SQLite, PostgreSQL, MySQL via SQLAlchemy
Elasticsearch Integration: Full-text search indexing with Chinese analysis support

Web Application Architecture (`app/`)

Framework: Next.js with TypeScript, API routes for backend logic
Database: Prisma ORM with multi-backend support (SQLite/PostgreSQL/MySQL)
Search: Elasticsearch integration with fallback to database search
MCP Server: Model Context Protocol integration for AI service access
UI Components: Modular React components for list, search, pagination, transcript viewing
Video Download: Streaming download component for IVOD video files with memory optimization
Styling: Tailwind CSS v4 with responsive design and Chinese font support
Testing: Jest + React Testing Library + Cypress E2E

Development Commands

Crawler Commands (`crawler/`)

# Setup environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For testing
cp .env.example .env                  # Configure DB_BACKEND and connection parameters
mkdir -p ../db                        # For shared SQLite database

# Data collection workflows
./ivod_full.py          # Full data capture (first run or reset)
./ivod_incremental.py   # Incremental update (daily)
./ivod_retry.py         # Retry failed records

# Elasticsearch indexing
./ivod_es.py            # Update search index

# Testing
pytest --cov=ivod --cov-report=term-missing
pytest -m integration  # Run integration tests only
TEST_SQLITE_PATH=../db/ivod_test.db python integration_test.py

# Comprehensive test coverage (2025-06 improvements)
pytest tests/tasks/test_tasks_comprehensive.py        # Workflow coverage tests
pytest tests/db/test_db_comprehensive.py             # Database coverage tests
pytest tests/db/test_database_env_comprehensive.py   # Environment config tests
pytest tests/crawler/test_crawler_comprehensive.py   # Crawler coverage tests

Web Application Commands (`app/`)

# Setup and development
npm install
cp .env.example .env  # Configure database and Elasticsearch
npm run prisma:prepare     # Update schema based on .env DB_BACKEND
npm run prisma:generate     # Generate Prisma client
npm run dev                 # Development server (localhost:3000)

# Production
npm run build              # Build for production
npm start                  # Start production server

# Testing
npm run test              # Jest unit tests (watch mode)
npm run test:ci           # Jest tests (CI mode)
npm run cypress:open      # Cypress E2E tests (interactive)
npm run cypress:run       # Cypress E2E tests (headless)

# Linting
npm run lint              # ESLint checks

Key Implementation Details

Database Schema

The system uses a single shared database with one main table:

IVODTranscript: Stores transcript metadata, content, and processing status
Status field tracks: 'success', 'failed', 'pending' for retry logic
Retry count prevents infinite retry loops (MAX_RETRIES=5)
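
A minimal sketch of how the status and retry-count fields could gate the retry workflow. The `TranscriptRecord` class, its `retry_count` field, and both helper functions are hypothetical stand-ins for illustration, not the actual `IVODTranscript` model; only the three status values and `MAX_RETRIES=5` come from the description above.

```python
from dataclasses import dataclass

MAX_RETRIES = 5  # limit stated in the schema notes above


@dataclass
class TranscriptRecord:
    """Hypothetical stand-in for the IVODTranscript model."""
    ivod_id: int
    status: str = "pending"   # one of: 'success', 'failed', 'pending'
    retry_count: int = 0      # assumed field name


def should_retry(record, max_retries=MAX_RETRIES):
    """A record is retried only if it failed and still has attempts left."""
    return record.status == "failed" and record.retry_count < max_retries


def record_attempt(record, succeeded):
    """Update status after one attempt; failures increment the retry count."""
    if succeeded:
        record.status = "success"
    else:
        record.status = "failed"
        record.retry_count += 1
    return record
```

Under these assumptions, a record that fails five times stops being selected, which is what keeps the retry job from looping forever on a permanently broken transcript.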

Environment Configuration

Both components share similar .env configuration:

DB_BACKEND: "sqlite", "postgresql", or "mysql"
Database connection parameters specific to chosen backend
SKIP_SSL: Boolean for SSL verification bypass
Elasticsearch settings (ES_HOST, ES_PORT, ES_INDEX, etc.)
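
As an illustration of how `DB_BACKEND` might select a connection string, here is a sketch that builds a SQLAlchemy-style database URL from environment values. Only `DB_BACKEND` and `SKIP_SSL` are named in this file; `SQLITE_PATH`, `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`, the default SQLite path, and the `pymysql` driver are assumptions. Check `.env.example` for the real variable names.

```python
import os
from urllib.parse import quote_plus

# Default ports for the server-based backends.
DEFAULT_PORTS = {"postgresql": "5432", "mysql": "3306"}
# SQLAlchemy URL schemes; the pymysql driver is an assumption.
SCHEMES = {"postgresql": "postgresql", "mysql": "mysql+pymysql"}


def parse_bool(value):
    """Interpret common truthy strings, e.g. for SKIP_SSL."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def build_db_url(env=None):
    """Build a database URL from a mapping of environment variables."""
    env = os.environ if env is None else env
    backend = env.get("DB_BACKEND", "sqlite").lower()
    if backend == "sqlite":
        # Hypothetical variable name and default path.
        return "sqlite:///" + env.get("SQLITE_PATH", "../db/ivod.db")
    if backend not in SCHEMES:
        raise ValueError(f"Unsupported DB_BACKEND: {backend!r}")
    # Quote credentials so characters like '@' do not break the URL.
    user = quote_plus(env.get("DB_USER", ""))
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", DEFAULT_PORTS[backend])
    name = env.get("DB_NAME", "ivod")
    return f"{SCHEMES[backend]}://{user}:{password}@{host}:{port}/{name}"
```

Passing a plain dict instead of reading `os.environ` directly keeps the function easy to test across the three backends.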

Data Flow

Crawler fetches IVOD data and stores it in the database with status tracking
Failed records can be retried via ivod_retry.py
Elasticsearch indexing runs separately via ivod_es.py
Web app reads from same database and searches via Elasticsearch (with DB fallback)
MCP server provides standardized API for AI services to access transcript data

Error Handling Patterns

HTTP failures are caught and marked with status="failed"
Empty/invalid content results in failed status for retry
SSL issues can be bypassed with skip_ssl=True
Database connection errors are handled through SQLAlchemy's connection pooling
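
The first two patterns can be sketched as a single function that turns one fetch attempt into a status. This is an illustrative outline, not the code in `ivod/core.py`: the `fetch_transcript` name, the result dictionary layout, and the injected `fetch` callable are all assumptions made so the example runs without network access.

```python
def fetch_transcript(url, fetch):
    """Run one fetch attempt and classify the outcome.

    `fetch` is any callable that takes a URL and returns text; in real
    code it would wrap an HTTP client call.
    """
    try:
        text = fetch(url)
    except OSError as exc:
        # Network-level errors are OSError subclasses; requests'
        # RequestException also derives from IOError (alias of OSError).
        return {"url": url, "status": "failed", "error": str(exc), "content": None}
    if not text or not text.strip():
        # Empty or whitespace-only content is treated as a failure
        # so the record is picked up again by the retry workflow.
        return {"url": url, "status": "failed", "error": "empty content", "content": None}
    return {"url": url, "status": "success", "error": None, "content": text}
```

Returning a status instead of raising lets the caller persist every outcome, which is what makes a later retry pass possible.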

Testing Strategy

Crawler Testing

Unit tests with pytest framework
Integration tests marked with @pytest.mark.integration
Mock HTTP responses with requests-mock
In-memory SQLite for database testing
Coverage reporting included
Comprehensive Test Suite (2025-06): 1600+ lines of comprehensive tests for improved coverage
- Database availability testing with user feedback messages
- Environment configuration testing for development/production/testing scenarios
- Workflow logic testing including error handling and retry mechanisms
- Comprehensive coverage of HTTP scraping and SSL handling
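
For orientation, a test against an in-memory SQLite database might look like the sketch below. It uses the standard-library `sqlite3` module rather than the project's SQLAlchemy models, and the `ivod_transcripts` table, its columns, and the `failed_ids` helper are hypothetical.

```python
import sqlite3


def make_memory_db():
    """Create a throwaway in-memory database with an assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ivod_transcripts ("
        "ivod_id INTEGER PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "retry_count INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def failed_ids(conn, max_retries=5):
    """Return IDs of failed records that still have retries left."""
    rows = conn.execute(
        "SELECT ivod_id FROM ivod_transcripts "
        "WHERE status = 'failed' AND retry_count < ? ORDER BY ivod_id",
        (max_retries,),
    )
    return [row[0] for row in rows]


def test_failed_ids_respects_retry_limit():
    conn = make_memory_db()
    conn.executemany(
        "INSERT INTO ivod_transcripts (ivod_id, status, retry_count) VALUES (?, ?, ?)",
        [(1, "success", 0), (2, "failed", 1), (3, "failed", 5), (4, "pending", 0)],
    )
    # Only record 2 is failed and below the retry limit.
    assert failed_ids(conn) == [2]
    conn.close()
```

Because the database lives only in memory, each test starts from a clean state and leaves no files behind.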

Web Application Testing

Jest + React Testing Library for component tests
Cypress for E2E testing of user workflows
API route testing with mocked dependencies
Tests located in __tests__/ and cypress/integration/
Comprehensive Test Coverage (2025-06): 89.94% overall coverage achieved
- Comprehensive test suites for logger-client.ts (100% coverage)
- API middleware testing with full error scenario coverage (96.82%)
- Utilities testing with cross-database compatibility (93.75%)
- Edge case testing and browser environment compatibility
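Video Download Feature Testing: Complete unit and integration tests
- VideoDownloader component tests cover a range of scenarios
- Tests against real M3U8 URLs verify actual download behavior
- Memory efficiency tests confirm that large file downloads are safe
- Performance and stability validation of the streaming download technique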

Production Deployment Notes

Crawler Production

Deployed on Ubuntu with dedicated user account
Cron scheduling: incremental daily at 02:00, retry at 03:00, full monthly
Logs stored in logs/ directory
Python virtual environment isolation

Web Application Production

Next.js production build with npm run build
Can be deployed to Vercel, Docker, or Ubuntu+nginx+systemd
Prisma generates client based on DB_BACKEND environment variable
Static asset optimization included

Common Development Issues

Database Backend Switching

Update .env DB_BACKEND setting
Run npm run prisma:prepare to update schema provider
Regenerate Prisma client with npm run prisma:generate

SSL/HTTPS Issues

Set SKIP_SSL=True in .env if encountering certificate problems
Crawler includes openssl.cnf for SSL configuration

Date Range Configuration

Full capture start date is configurable in ivod/tasks.py run_full()
Incremental updates cover last 2 weeks by default
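
The two-week default can be expressed as a small date-window helper. This is a sketch only: the `incremental_window` and `dates_in_window` names are hypothetical, and whether the real workflow includes the current day is an assumption.

```python
from datetime import date, timedelta


def incremental_window(today=None, days=14):
    """Return (start, end) covering the last `days` days up to `today`."""
    end = today or date.today()
    return end - timedelta(days=days), end


def dates_in_window(start, end):
    """List every date from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

For example, `incremental_window(date(2025, 6, 15))` returns a window starting on `date(2025, 6, 1)`.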

Elasticsearch Setup

Install Chinese analysis plugins: analysis-ik or analysis-smartcn
Index configuration handled automatically by ivod_es.py
Web app falls back to database search if Elasticsearch unavailable
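
The fallback behavior can be outlined as follows. The web application itself is written in TypeScript, so this Python version only illustrates the control flow; `search_with_fallback` and the two injected search callables are hypothetical names.

```python
def search_with_fallback(query, es_search, db_search):
    """Try Elasticsearch first; fall back to the database on any failure.

    `es_search` and `db_search` are callables that take a query string
    and return a list of hits.
    """
    try:
        return {"source": "elasticsearch", "hits": es_search(query)}
    except Exception:
        # Any Elasticsearch failure (refused connection, timeout, missing
        # index) triggers the database search instead.
        return {"source": "database", "hits": db_search(query)}
```

Reporting the `source` alongside the hits makes it easy to see, in logs or in the UI, when results come from the slower database path.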

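Video Download Feature

Overview

The web application lets users download Legislative Yuan meeting videos from IVOD directly in the browser. The feature uses streaming downloads, which keep memory usage low and make it suitable for large files.

Core Features

🚀 Streaming Download

Batch processing: Segments are processed 5 at a time, so large amounts of data are never loaded at once
Memory optimization: Downloading a 180 MB video needs only about 10 MB of memory (roughly a 17:1 ratio)
On-the-fly merging: Segments are processed as they arrive, without waiting for the whole set to finish
Garbage collection: The downloader pauses between batches so the browser can reclaim memory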

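📱 Cross-Platform Compatibility

Desktop browsers: Large file downloads are fully supported
Mobile devices: Memory-friendly design works on phones and tablets
Older devices: Low memory requirements improve compatibility
Format support: Output is a standard TS file, playable in VLC and similar players

🔧 Smart Parsing and Processing

Auto-detection: Distinguishes master playlists from media (segment) playlists
Recursive parsing: Supports nested M3U8 playlist structures
Path handling: Converts relative paths to absolute URLs
Error recovery: Skips corrupted segments automatically and continues with the rest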
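
The detection, recursion, and path handling described above can be sketched in a few lines. This is not the component's `parseM3U8()` implementation (which lives in TypeScript); it is a Python outline of the same idea, with an injected `fetch_text` callable, a `max_depth` guard, and a "follow the first variant" rule that are all assumptions.

```python
from urllib.parse import urljoin


def parse_m3u8(url, fetch_text, max_depth=3):
    """Return absolute segment URLs for the playlist at `url`.

    `fetch_text` is any callable that takes a URL and returns the
    playlist body as a string.
    """
    lines = [ln.strip() for ln in fetch_text(url).splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF"):
            # Master playlist: the variant URI is on the line after the tag.
            if max_depth <= 0 or i + 1 >= len(lines):
                raise ValueError(f"Cannot resolve variant playlist in {url}")
            # Assumption: follow the first variant listed.
            variant_url = urljoin(url, lines[i + 1])
            return parse_m3u8(variant_url, fetch_text, max_depth - 1)
    # Media playlist: every non-tag line is a segment URI, resolved
    # against the playlist's own URL.
    return [urljoin(url, ln) for ln in lines if not ln.startswith("#")]
```

Resolving each URI against the URL of the playlist that contains it, rather than the original master URL, is what keeps relative paths correct in nested playlists.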

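📊 Real-Time Progress Display

Detailed progress: Shows percentage, current segment count, and downloaded size
Visuals: Gradient progress bar and animated loading indicator
Status feedback: Clearly shows download status and any error messages
File information: Estimates total size and download time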
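
One plausible way to derive the displayed numbers is to extrapolate from the segments finished so far. The sketch below assumes segments are of roughly equal size and duration; the `progress_snapshot` function and its field names are hypothetical, not taken from the component.

```python
def progress_snapshot(segments_done, segments_total, bytes_done, elapsed_seconds):
    """Compute progress figures from counts gathered during a download."""
    if segments_total <= 0:
        raise ValueError("segments_total must be positive")
    snapshot = {
        "percent": round(100 * segments_done / segments_total, 1),
        "segments": f"{segments_done}/{segments_total}",
        "downloaded_mb": round(bytes_done / 1_000_000, 1),
        "estimated_total_mb": None,
        "estimated_remaining_s": None,
    }
    if segments_done > 0:
        # Extrapolate linearly from the average segment so far.
        per_segment_bytes = bytes_done / segments_done
        per_segment_seconds = elapsed_seconds / segments_done
        remaining = segments_total - segments_done
        snapshot["estimated_total_mb"] = round(per_segment_bytes * segments_total / 1_000_000, 1)
        snapshot["estimated_remaining_s"] = round(per_segment_seconds * remaining, 1)
    return snapshot
```

The estimates stay `None` until at least one segment has finished, since there is nothing to extrapolate from before that.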

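Technical Implementation

Component Architecture

VideoDownloader.tsx             // Main download component
├── parseM3U8()                 // M3U8 playlist parsing
├── downloadVideoStreaming()    // Core streaming download logic
├── Batch processing logic      // Memory-optimized batch downloads
└── Progress and error handling // UI state management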

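Key Technical Points

Streams API: Uses ReadableStream for stream processing
Batched parallelism: Segments within each batch are downloaded in parallel for better throughput
Memory management: Pauses between batches to allow garbage collection
Error handling: Fault-tolerant design; a single failed segment does not abort the whole download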
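
The batching, ordering, and fault-tolerance rules can be outlined with `asyncio`. The real component runs in the browser on the Streams API, so this Python sketch only mirrors the control flow; `download_in_batches`, `fetch_segment`, `write_chunk`, and `pause_seconds` are hypothetical names, and only the batch size of 5 comes from this document.

```python
import asyncio

BATCH_SIZE = 5  # segments per batch, as described in this document


async def download_in_batches(segment_urls, fetch_segment, write_chunk,
                              batch_size=BATCH_SIZE, pause_seconds=0.1):
    """Download segments batch by batch, writing them out in order.

    `fetch_segment` is an async callable returning bytes for one URL;
    `write_chunk` receives each successful segment. Returns the list of
    URLs that were skipped because their download failed.
    """
    skipped = []
    for start in range(0, len(segment_urls), batch_size):
        batch = segment_urls[start:start + batch_size]
        # Fetch one batch in parallel; gather keeps results in input order.
        results = await asyncio.gather(
            *(fetch_segment(url) for url in batch), return_exceptions=True
        )
        for url, result in zip(batch, results):
            if isinstance(result, BaseException):
                skipped.append(url)  # tolerate a bad segment and move on
                continue
            write_chunk(result)      # only one batch is held in memory
        # Brief pause between batches; in the browser this gives the
        # garbage collector a chance to reclaim the previous batch.
        await asyncio.sleep(pause_seconds)
    return skipped
```

Because each batch is written out before the next one is requested, peak memory is bounded by the batch size rather than by the total file size.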

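Performance Metrics

Memory efficiency: 17.05x (data downloaded / increase in memory use)
Download speed: Parallel downloading gives a 3-5x speedup
Success rate: Corrupted segments are skipped automatically to keep the success rate high
File integrity: Segments are merged strictly in order to preserve playback quality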
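
As a quick check on how the memory figures relate, the ratio is simply downloaded size divided by the increase in memory use. The helper names below are hypothetical; the 180 MB and 17.05x values are the ones quoted in this document.

```python
def memory_efficiency(downloaded_mb, memory_increase_mb):
    """Ratio of data downloaded to additional memory used."""
    return downloaded_mb / memory_increase_mb


def expected_memory_mb(downloaded_mb, efficiency):
    """Memory increase implied by a download size and an efficiency ratio."""
    return downloaded_mb / efficiency


print(round(expected_memory_mb(180, 17.05), 2))  # 10.56
```

So a 17.05x ratio on a 180 MB download corresponds to about 10.56 MB of extra memory, which matches the "about 10 MB" figure once rounded.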

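Usage

1. Basic Usage

Open the detail page of any IVOD video
Confirm that the video has a valid M3U8 URL
Click the orange 「下載影片」 (Download Video) button
Watch the real-time download progress
When the download finishes, the file is saved automatically in .ts format

2. System Requirements

Browser: Chrome 88+, Firefox 87+, Safari 14+, Edge 88+
Memory: At least 1 GB of free memory recommended
Storage: Reserve enough space for the size of the video
Network: A stable internet connection

3. Troubleshooting

Download fails: Check the network connection and that the video URL is valid
Out of memory: Close other browser tabs to free memory
Corrupted file: Download again or open the file with a different player
Slow download: Check network bandwidth and server response time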

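Testing and Validation

Automated Tests

# Unit tests
npm test VideoDownloader

# Integration tests
npm test VideoDownloader.integration

# Real URL tests
node scripts/manual-video-download-test.js

# Memory efficiency tests
node scripts/test-streaming-download.js

Test Coverage

✅ Parsing tests for various M3U8 formats
✅ Network error and retry mechanism tests
✅ Memory usage monitoring and validation
✅ Large file download stability tests
✅ Cross-browser compatibility tests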
