A Byte-Level Byte Pair Encoding (BPE) tokenizer library written in Rust.
This project implements a byte-level tokenizer capable of:
- Training merge rules from raw text
- Encoding runtime input into merged byte tokens
- Supporting special tokens
- Working directly on UTF-8 bytes
- Providing deterministic tokenization behavior
The implementation is designed for learning, experimentation, and building custom tokenizer pipelines for LLMs or NLP systems.
- Byte-level tokenization
- Trainable BPE merge rules
- UTF-8 safe byte processing
- Runtime encoding
- Configurable special tokens
- Deterministic merges
- Minimal dependencies
- Rust-first implementation
Byte-level BPE operates directly on raw bytes instead of words or Unicode characters.
Advantages:
- No unknown Unicode characters
- Language agnostic
- Handles emojis and symbols naturally
- Compatible with arbitrary binary-safe text
- Similar approach used by GPT-style tokenizers
Example:
Input:
"playing"
Initial bytes:
[p, l, a, y, i, n, g]
Merged tokens:
[play, ing]