Open
Conversation
- Fix is_special_token: reference self._special_tokens.values() instead of undefined self._special_token_values (AttributeError bug) - Fix encode_to_numpy: add missing UnicodeEncodeError handling for surrogate pairs - Fix encode_batch: make disallowed_special frozenset wrapping consistent with encode() - Fix registry.py: replace assert with proper RuntimeError for python -O compatibility - Fix _encode_only_native_bpe: rename misleading _unused_pat variable to pat - Improve CPU utilization: use os.cpu_count() for default thread count in batch methods - Fix typo in Rust doc comment (gauranteed -> guaranteed) - Add tests for is_special_token, _MAX_THREADS default, and python -O compatibility Co-authored-by: eeea2222 <209839587+eeea2222@users.noreply.github.com>
Added mention of CPU performance enhancements.
Updated README to include project description and enhancements.
Updated README formatting and improved clarity.
Author
|
please review my bug fixes because its very important if you five into it you can see it. |
There was a problem hiding this comment.
Pull request overview
This PR aims to address performance/robustness issues in tiktoken by tuning default threading for batch operations, improving error handling when loading encodings, and cleaning up a few correctness/documentation issues.
Changes:
- Update batch encode/decode defaults to use a CPU-based thread count cap and add Unicode-surrogate fallback for
encode_to_numpy. - Replace assert-based constructor assumptions in the registry with explicit runtime errors (and add related tests).
- Fix a special-token check bug, remove an unused regex placeholder, and correct minor spelling/docs text.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tiktoken/registry.py | Replaces assert on constructor loading with explicit error handling. |
| tiktoken/core.py | Adds _MAX_THREADS, updates batch defaults, adds encode_to_numpy Unicode fallback, fixes is_special_token, removes unused regex placeholder. |
| tests/test_misc.py | Adds tests for special-token detection, _MAX_THREADS, and running with -O. |
| src/lib.rs | Fixes spelling in a doc comment. |
| README.md | Adds an extra top-level heading describing the repo as a fork. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -1,5 +1,7 @@ | |||
| # ⏳ tiktoken | |||
|
|
|||
| # **Tiktoken Fork for Test and Bug Fixes, CPU Performance Enhancement and More.** | |||
Comment on lines
+57
to
+58
| assert len(names) > 0 | ||
| assert "gpt2" in names |
Comment on lines
374
to
+376
| def is_special_token(self, token: int) -> bool: | ||
| assert isinstance(token, int) | ||
| return token in self._special_token_values | ||
| return token in self._special_tokens.values() |
Comment on lines
+162
to
167
| try: | ||
| buffer = self._core_bpe.encode_to_tiktoken_buffer(text, allowed_special) | ||
| except UnicodeEncodeError: | ||
| text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace") | ||
| buffer = self._core_bpe.encode_to_tiktoken_buffer(text, allowed_special) | ||
| return np.frombuffer(buffer, dtype=np.uint32) |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
There is Cpu worker performance loss and i fix it.
and i found a unused placeholder and i fix it too