Skip to content

j-p-d-e-v/byte-pair-encoding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Byte Pair Encoding (BPE) in Rust

A Byte-Level Byte Pair Encoding (BPE) tokenizer library written in Rust.

This project implements a byte-level tokenizer capable of:

  • Training merge rules from raw text
  • Encoding runtime input into merged byte tokens
  • Supporting special tokens
  • Working directly on UTF-8 bytes
  • Providing deterministic tokenization behavior

The implementation is designed for learning, experimentation, and building custom tokenizer pipelines for LLMs or NLP systems.


Features

  • Byte-level tokenization
  • Trainable BPE merge rules
  • UTF-8 safe byte processing
  • Runtime encoding
  • Configurable special tokens
  • Deterministic merges
  • Minimal dependencies
  • Rust-first implementation

Why Byte-Level BPE?

Byte-level BPE operates directly on raw bytes instead of words or Unicode characters.

Advantages:

  • No unknown Unicode characters
  • Language agnostic
  • Handles emojis and symbols naturally
  • Compatible with arbitrary binary-safe text
  • Similar approach used by GPT-style tokenizers

Example:

Input:
"playing"

Initial bytes:
[p, l, a, y, i, n, g]

Merged tokens:
[play, ing]

About

A Byte-Level Byte Pair Encoding (BPE) tokenizer library written in Rust.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages