Add command-line interface for tiktoken #508

Open

alpnix wants to merge 1 commit into openai:main from alpnix:add-cli-support

Conversation


@alpnix alpnix commented Mar 14, 2026

Summary

Implements a comprehensive command-line interface for tiktoken that enables token counting directly from the shell, addressing the feature request in #473.

🎯 Problem Solved

Currently, tiktoken requires writing Python code to count tokens. This PR adds a tiktoken CLI command that allows users to:

  • Count tokens in files and directories from the command line
  • Estimate context window usage for codebases
  • Integrate token counting into shell scripts and CI/CD pipelines
  • Quickly check token counts without writing code

✨ Features

Basic Usage

tiktoken count file.txt
tiktoken count --model gpt-4o document.txt
tiktoken count --encoding o200k_base script.py

Directory Operations

tiktoken count --recursive ./src/
tiktoken count --glob "*.py" --recursive ./project/
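The traversal behind `--recursive` and `--glob` needs nothing beyond the standard library. A hedged sketch (the helper name and signature are illustrative; the PR's `cli.py` may differ):

```python
# Illustrative file collection with an optional glob filter.
# Standard library only; actual helper names in tiktoken/cli.py may differ.
from pathlib import Path

def collect_files(target: str, pattern: str = "*", recursive: bool = False) -> list[Path]:
    """Return the files to count: a single file, or matches under a directory."""
    path = Path(target)
    if path.is_file():
        return [path]
    # rglob descends into subdirectories; glob stays at the top level.
    matches = path.rglob(pattern) if recursive else path.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```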

Output Formats

tiktoken count --json file.txt           # JSON output
tiktoken count --csv ./src/              # CSV format
tiktoken count --per-file ./codebase/    # Per-file breakdown
tiktoken count --summary ./project/      # Summary statistics
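All three formats can be rendered from one per-file mapping. A sketch of the idea, assuming a `{path: token_count}` structure; the PR's exact field names and layout may differ:

```python
# Illustrative output formatting for per-file token counts.
# The {"files": ..., "total": ...} JSON schema is an assumption, not
# necessarily the PR's exact format.
import csv
import io
import json

def format_counts(counts: dict[str, int], fmt: str = "text") -> str:
    """Render per-file token counts as text, JSON, or CSV."""
    if fmt == "json":
        return json.dumps({"files": counts, "total": sum(counts.values())}, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["file", "tokens"])
        writer.writerows(counts.items())
        return buf.getvalue()
    # Default human-readable text with a per-file breakdown and a total.
    lines = [f"{path}: {n:,} tokens" for path, n in counts.items()]
    lines.append(f"Total tokens: {sum(counts.values()):,}")
    return "\n".join(lines)
```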

📝 Implementation

Files Added/Modified

  1. tiktoken/cli.py (new)

    • Complete CLI implementation with argparse
    • Support for files, directories, and glob patterns
    • Multiple output formats (text, JSON, CSV)
    • Model and encoding selection
    • Error handling for binary files and edge cases
  2. setup.py (modified)

    • Added console_scripts entry point: tiktoken = tiktoken.cli:main
    • Enables tiktoken command after pip install
  3. tests/test_cli.py (new)

    • Comprehensive test suite for CLI functions
    • Tests for file collection, token counting, output formatting
    • Can be run standalone or with pytest
  4. CLI.md (new)

    • Complete documentation with examples
    • Use cases (CI/CD, cost estimation, context window planning)
    • Command reference and troubleshooting

🎓 Use Cases

1. Context Window Estimation

$ tiktoken count --model gpt-4-turbo --recursive ./codebase/
Total tokens: 45,230
# Result: Fits in GPT-4 Turbo's 128k context

2. Cost Estimation

$ tiktoken count --json ./documents/ > tokens.json
# Calculate API costs using token counts
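Given a token count, the cost arithmetic is a one-liner. The rate below is a placeholder, not a real OpenAI price; check current pricing before relying on the result:

```python
# Back-of-envelope input-token cost from a token count.
# 2.50 USD per million tokens is a PLACEHOLDER rate, not a real price.
def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Return the input-token cost in USD."""
    return tokens / 1_000_000 * usd_per_million_tokens

print(f"${estimate_cost(45_230, 2.50):.4f}")  # prints $0.1131
```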

3. CI/CD Integration

#!/bin/bash
TOKENS=$(tiktoken count --recursive ./src/ | grep "Total" | awk '{print $3}' | tr -d ',')
if [ "${TOKENS:-0}" -gt 50000 ]; then
  echo "Error: Exceeds token budget" >&2
  exit 1
fi

4. Documentation Analysis

$ tiktoken count --glob "*.md" --per-file --recursive ./docs/
docs/README.md: 1,250 tokens
docs/API.md: 3,420 tokens
docs/GUIDE.md: 2,150 tokens

🧪 Testing

Syntax Validation

✅ All Python files compile successfully

Test Suite

python3 tests/test_cli.py
# ✓ test_count_tokens_in_text
# ✓ test_count_tokens_in_file
# ✓ test_collect_files_single_file
# ✓ test_collect_files_directory
# ✓ test_format_output_json
# ✓ test_format_output_csv
# ✅ All tests passed!

Integration Ready

  • Works with existing tiktoken encodings
  • No breaking changes to existing code
  • Console script installs automatically with pip

📦 What's Included

  • ~600 lines of well-documented Python code
  • Comprehensive CLI with 10+ options
  • 3 output formats (text, JSON, CSV)
  • Full test suite (6+ tests)
  • Complete documentation with real-world examples

🚀 Benefits

  1. Developer Experience: No Python needed for quick token checks
  2. Integration: Works with shell scripts, CI/CD, automation
  3. Flexibility: Multiple encodings, models, and output formats
  4. Performance: Leverages tiktoken's fast Rust implementation
  5. Completeness: Production-ready with tests and docs

⚙️ Design Decisions

  • Argparse: Standard library, no extra dependencies
  • Console script: Standard Python packaging pattern
  • Error handling: Gracefully skips binary files, clear error messages
  • Output formats: JSON/CSV for programmatic use, text for humans
  • Glob support: Flexible file filtering
  • Model support: All OpenAI models via encoding_for_model()

🎯 Closes

Closes #473
