
bug fixes #507

Open

eeea2222 wants to merge 9 commits into openai:main from eeea2222:main

Conversation

@eeea2222

There was a CPU worker performance loss, and I fixed it.
I also found an unused placeholder and fixed that as well.

Copilot AI and others added 8 commits March 12, 2026 19:52
- Fix is_special_token: reference self._special_tokens.values() instead of
  undefined self._special_token_values (AttributeError bug)
- Fix encode_to_numpy: add missing UnicodeEncodeError handling for surrogate pairs
- Fix encode_batch: make disallowed_special frozenset wrapping consistent with encode()
- Fix registry.py: replace assert with proper RuntimeError for python -O compatibility
- Fix _encode_only_native_bpe: rename misleading _unused_pat variable to pat
- Improve CPU utilization: use os.cpu_count() for default thread count in batch methods
- Fix typo in Rust doc comment (gauranteed -> guaranteed)
- Add tests for is_special_token, _MAX_THREADS default, and python -O compatibility

Co-authored-by: eeea2222 <209839587+eeea2222@users.noreply.github.com>
Added mention of CPU performance enhancements.
Updated README to include project description and enhancements.
Updated README formatting and improved clarity.
Copilot AI review requested due to automatic review settings March 14, 2026 00:50
@eeea2222
Author

Please review my bug fixes; they are important, and if you dive into the changes you can see why.


Copilot AI left a comment


Pull request overview

This PR aims to address performance/robustness issues in tiktoken by tuning default threading for batch operations, improving error handling when loading encodings, and cleaning up a few correctness/documentation issues.

Changes:

  • Update batch encode/decode defaults to use a CPU-based thread count cap and add Unicode-surrogate fallback for encode_to_numpy.
  • Replace assert-based constructor assumptions in the registry with explicit runtime errors (and add related tests).
  • Fix a special-token check bug, remove an unused regex placeholder, and correct minor spelling/docs text.
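The assert-to-exception change in the registry can be sketched as follows: bare `assert` statements are stripped under `python -O`, so a check that an encoding constructor exists must raise explicitly. The function and variable names here are hypothetical; only the `RuntimeError`-instead-of-`assert` pattern comes from the PR.

```python
from typing import Callable

# Hypothetical registry sketch; tiktoken's real registry differs.
def get_constructor(
    name: str, constructors: dict[str, Callable[[], object]]
) -> Callable[[], object]:
    constructor = constructors.get(name)
    # Previously a bare `assert constructor is not None`, which is a
    # no-op under `python -O`; raising keeps the check in optimized mode.
    if constructor is None:
        raise RuntimeError(f"Unknown encoding {name!r}")
    return constructor
```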

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tiktoken/registry.py Replaces assert on constructor loading with explicit error handling.
tiktoken/core.py Adds _MAX_THREADS, updates batch defaults, adds encode_to_numpy Unicode fallback, fixes is_special_token, removes unused regex placeholder.
tests/test_misc.py Adds tests for special-token detection, _MAX_THREADS, and running with -O.
src/lib.rs Fixes spelling in a doc comment.
README.md Adds an extra top-level heading describing the repo as a fork.


Comment thread: README.md (@@ -1,5 +1,7 @@)

```
# ⏳ tiktoken

# **Tiktoken Fork for Test and Bug Fixes, CPU Performance Enhancement and More.**
```
Comment thread: tiktoken/registry.py (Outdated)

Comment thread: tests/test_misc.py (lines +57 to +58)

```python
assert len(names) > 0
assert "gpt2" in names
```
Comment thread: tiktoken/core.py (lines 374 to +376)

```diff
 def is_special_token(self, token: int) -> bool:
     assert isinstance(token, int)
-    return token in self._special_token_values
+    return token in self._special_tokens.values()
```
Comment thread: tiktoken/core.py (lines +162 to 167)

```python
try:
    buffer = self._core_bpe.encode_to_tiktoken_buffer(text, allowed_special)
except UnicodeEncodeError:
    text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
    buffer = self._core_bpe.encode_to_tiktoken_buffer(text, allowed_special)
return np.frombuffer(buffer, dtype=np.uint32)
```
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
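The UTF-16 round-trip in the fallback above can be isolated as a standalone sketch: `surrogatepass` lets lone or paired surrogate code points survive the encode, and decoding with `replace` turns any lone surrogate into U+FFFD while rejoining valid pairs, so a subsequent UTF-8 encode cannot fail. The `sanitize_surrogates` helper name is an assumption for illustration.

```python
def sanitize_surrogates(text: str) -> str:
    """Illustrative helper: make `text` safely UTF-8 encodable.

    Lone surrogates become U+FFFD (the replacement character); valid
    surrogate pairs are rejoined into the code point they represent.
    """
    try:
        text.encode("utf-8")  # fast path: already clean
        return text
    except UnicodeEncodeError:
        # surrogatepass tolerates surrogates on encode; "replace" on
        # decode substitutes U+FFFD for anything still invalid.
        return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
```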


3 participants