Add command-line interface for tiktoken #508

Open

alpnix wants to merge 1 commit into openai:main from alpnix:add-cli-support

Conversation


@alpnix alpnix commented Mar 14, 2026

Summary

Implements a comprehensive command-line interface for tiktoken that enables token counting directly from the shell, addressing the feature request in #473.

🎯 Problem Solved

Currently, tiktoken requires writing Python code to count tokens. This PR adds a tiktoken CLI command that allows users to:

  • Count tokens in files and directories from the command line
  • Estimate context window usage for codebases
  • Integrate token counting into shell scripts and CI/CD pipelines
  • Quickly check token counts without writing code

✨ Features

Basic Usage

tiktoken count file.txt
tiktoken count --model gpt-4o document.txt
tiktoken count --encoding o200k_base script.py

Directory Operations

tiktoken count --recursive ./src/
tiktoken count --glob "*.py" --recursive ./project/
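The traversal behind `--recursive` and `--glob` needs nothing beyond the standard library. A hedged sketch (the helper name and signature are illustrative; the PR's `cli.py` may differ):

```python
# Illustrative file collection with an optional glob filter.
# Standard library only; actual helper names in tiktoken/cli.py may differ.
from pathlib import Path

def collect_files(target: str, pattern: str = "*", recursive: bool = False) -> list[Path]:
    """Return the files to count: a single file, or matches under a directory."""
    path = Path(target)
    if path.is_file():
        return [path]
    # rglob descends into subdirectories; glob stays at the top level.
    matches = path.rglob(pattern) if recursive else path.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```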

Output Formats

tiktoken count --json file.txt           # JSON output
tiktoken count --csv ./src/              # CSV format
tiktoken count --per-file ./codebase/    # Per-file breakdown
tiktoken count --summary ./project/      # Summary statistics
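All three formats can be rendered from one per-file mapping. A sketch of the idea, assuming a `{path: token_count}` structure; the PR's exact field names and layout may differ:

```python
# Illustrative output formatting for per-file token counts.
# The {"files": ..., "total": ...} JSON schema is an assumption, not
# necessarily the PR's exact format.
import csv
import io
import json

def format_counts(counts: dict[str, int], fmt: str = "text") -> str:
    """Render per-file token counts as text, JSON, or CSV."""
    if fmt == "json":
        return json.dumps({"files": counts, "total": sum(counts.values())}, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["file", "tokens"])
        writer.writerows(counts.items())
        return buf.getvalue()
    # Default human-readable text with a per-file breakdown and a total.
    lines = [f"{path}: {n:,} tokens" for path, n in counts.items()]
    lines.append(f"Total tokens: {sum(counts.values()):,}")
    return "\n".join(lines)
```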

📝 Implementation

Files Added/Modified

  1. tiktoken/cli.py (new)

    • Complete CLI implementation with argparse
    • Support for files, directories, and glob patterns
    • Multiple output formats (text, JSON, CSV)
    • Model and encoding selection
    • Error handling for binary files and edge cases
  2. setup.py (modified)

    • Added console_scripts entry point: tiktoken = tiktoken.cli:main
    • Enables tiktoken command after pip install
  3. tests/test_cli.py (new)

    • Comprehensive test suite for CLI functions
    • Tests for file collection, token counting, output formatting
    • Can be run standalone or with pytest
  4. CLI.md (new)

    • Complete documentation with examples
    • Use cases (CI/CD, cost estimation, context window planning)
    • Command reference and troubleshooting

🎓 Use Cases

1. Context Window Estimation

$ tiktoken count --model gpt-4-turbo --recursive ./codebase/
Total tokens: 45,230
# Result: Fits in GPT-4 Turbo's 128k context

2. Cost Estimation

$ tiktoken count --json ./documents/ > tokens.json
# Calculate API costs using token counts
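Given a token count, the cost arithmetic is a one-liner. The rate below is a placeholder, not a real OpenAI price; check current pricing before relying on the result:

```python
# Back-of-envelope input-token cost from a token count.
# 2.50 USD per million tokens is a PLACEHOLDER rate, not a real price.
def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Return the input-token cost in USD."""
    return tokens / 1_000_000 * usd_per_million_tokens

print(f"${estimate_cost(45_230, 2.50):.4f}")  # prints $0.1131
```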

3. CI/CD Integration

#!/bin/bash
TOKENS=$(tiktoken count --recursive ./src/ | grep "Total" | awk '{print $3}' | tr -d ',')
if [ "${TOKENS:-0}" -gt 50000 ]; then
  echo "Error: Exceeds token budget" >&2
  exit 1
fi

4. Documentation Analysis

$ tiktoken count --glob "*.md" --per-file --recursive ./docs/
docs/README.md: 1,250 tokens
docs/API.md: 3,420 tokens
docs/GUIDE.md: 2,150 tokens

🧪 Testing

Syntax Validation

✅ All Python files compile successfully

Test Suite

python3 tests/test_cli.py
# ✓ test_count_tokens_in_text
# ✓ test_count_tokens_in_file
# ✓ test_collect_files_single_file
# ✓ test_collect_files_directory
# ✓ test_format_output_json
# ✓ test_format_output_csv
# ✅ All tests passed!

Integration Ready

  • Works with existing tiktoken encodings
  • No breaking changes to existing code
  • Console script installs automatically with pip

📦 What's Included

  • ~600 lines of well-documented Python code
  • Comprehensive CLI with 10+ options
  • 3 output formats (text, JSON, CSV)
  • Full test suite (6+ tests)
  • Complete documentation with real-world examples

🚀 Benefits

  1. Developer Experience: No Python needed for quick token checks
  2. Integration: Works with shell scripts, CI/CD, automation
  3. Flexibility: Multiple encodings, models, and output formats
  4. Performance: Leverages tiktoken's fast Rust implementation
  5. Completeness: Production-ready with tests and docs

⚙️ Design Decisions

  • Argparse: Standard library, no extra dependencies
  • Console script: Standard Python packaging pattern
  • Error handling: Gracefully skips binary files, clear error messages
  • Output formats: JSON/CSV for programmatic use, text for humans
  • Glob support: Flexible file filtering
  • Model support: All OpenAI models via encoding_for_model()

🎯 Closes

Closes #473
