fix: update space handling in token_split and improve ltrim/rtrim fun… #1422
maikebing wants to merge 1 commit into leejet:master
Conversation
Pull request overview
Updates whitespace handling in string trimming and tokenization to avoid undefined behavior and improve Unicode/whitespace correctness in tokenizer logic.
Changes:
- Use `std::isspace(static_cast<unsigned char>(ch))` in `ltrim`/`rtrim` to avoid UB with negative `char` values.
- Replace `isspace(...)` checks on Unicode codepoints in `token_split` with the tokenizer's `is_space(...)` helper for consistent whitespace detection.
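The first bullet can be sketched in isolation. This is a minimal standalone version of the `ltrim`/`rtrim` pattern being fixed, not the exact code from `src/util.cpp`: passing a plain `char` to `std::isspace` is undefined behavior when `char` is signed and the byte is negative (any UTF-8 continuation or lead byte ≥ 0x80), so the argument is cast to `unsigned char` first.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Strip leading whitespace in place. The static_cast keeps the
// argument of std::isspace in [0, 255], avoiding UB for UTF-8
// bytes that are negative when char is signed.
static void ltrim(std::string& s) {
    s.erase(s.begin(), std::find_if(s.begin(), s.end(), [](char ch) {
                return !std::isspace(static_cast<unsigned char>(ch));
            }));
}

// Strip trailing whitespace in place, scanning from the back.
static void rtrim(std::string& s) {
    s.erase(std::find_if(s.rbegin(), s.rend(), [](char ch) {
                return !std::isspace(static_cast<unsigned char>(ch));
            }).base(),
            s.end());
}
```

Note this still uses byte-wise `std::isspace`, which is fine for trimming ASCII whitespace around a UTF-8 string; codepoint-level whitespace detection is the separate `is_space(...)` change.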
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/util.cpp | Makes ltrim/rtrim whitespace checks safe for non-ASCII bytes by casting to unsigned char. |
| src/tokenize_util.cpp | Switches tokenization whitespace checks from isspace to is_space to correctly handle codepoints. |
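The second row is the core of the fix: `std::isspace` only accepts values representable as `unsigned char` (or `EOF`), so calling it on a decoded Unicode codepoint such as U+3000 (ideographic space) is undefined behavior, and even where it happens to work it cannot recognize non-ASCII whitespace. The tokenizer's actual `is_space(...)` implementation is not shown in this diff; a hypothetical codepoint-aware version might look like:

```cpp
#include <cstdint>

// Hypothetical codepoint-level whitespace check (the real helper in
// the tokenizer may differ). It tests the codepoint value directly
// instead of funneling it through std::isspace.
static bool is_space(uint32_t cp) {
    switch (cp) {
        case 0x09: case 0x0A: case 0x0B: case 0x0C: case 0x0D: case 0x20:
        case 0x85:    // next line
        case 0xA0:    // no-break space
        case 0x2028:  // line separator
        case 0x2029:  // paragraph separator
        case 0x3000:  // ideographic space
            return true;
        default:
            // en quad (U+2000) through hair space (U+200A)
            return cp >= 0x2000 && cp <= 0x200A;
    }
}
```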
```diff
  std::string token;
- if (!is_letter(cps[i]) && !is_number(cps[i]) && !isspace(cps[i])) {
+ if (!is_letter(cps[i]) && !is_number(cps[i]) && !is_space(cps[i])) {
      std::string token = codepoint_to_utf8(cp);
```
token is declared twice here (outer std::string token; and then an inner std::string token = ...), which shadows the outer variable and leaves it unused. This can trigger -Wshadow/-Wunused-variable warnings and makes the control flow harder to follow. Prefer declaring token only once (either remove the outer declaration, or assign to it inside the if).
```suggestion
token = codepoint_to_utf8(cp);
```
wbruna left a comment
This PR shouldn't change the ggml submodule.
The change itself looks reasonable, but I had to dig a bit to understand it. A better explanation in the commit message would have been appreciated, because it's not an 'update': the tokenizer was calling what looks like the wrong function. Even more so because the fix for the standard isspace theoretically behaving badly looks related, but it's actually a totally separate issue.
No description provided.