Skip to content

fix: update space handling in token_split and improve ltrim/rtrim fun…#1422

Open
maikebing wants to merge 1 commit intoleejet:masterfrom
IoTSharp:master
Open

fix: update space handling in token_split and improve ltrim/rtrim fun…#1422
maikebing wants to merge 1 commit intoleejet:masterfrom
IoTSharp:master

Conversation

@maikebing
Copy link
Copy Markdown

No description provided.

Copilot AI review requested due to automatic review settings April 15, 2026 04:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates whitespace handling in string trimming and tokenization to avoid undefined behavior and improve Unicode/whitespace correctness in tokenizer logic.

Changes:

  • Use std::isspace(static_cast<unsigned char>(ch)) in ltrim/rtrim to avoid UB with negative char values.
  • Replace isspace(...) checks on Unicode codepoints in token_split with the tokenizer’s is_space(...) helper for consistent whitespace detection.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/util.cpp Makes ltrim/rtrim whitespace checks safe for non-ASCII bytes by casting to unsigned char.
src/tokenize_util.cpp Switches tokenization whitespace checks from isspace to is_space to correctly handle codepoints.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/tokenize_util.cpp
std::string token;
if (!is_letter(cps[i]) && !is_number(cps[i]) && !isspace(cps[i])) {
if (!is_letter(cps[i]) && !is_number(cps[i]) && !is_space(cps[i])) {
std::string token = codepoint_to_utf8(cp);
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

token is declared twice here (outer std::string token; and then an inner std::string token = ...), which shadows the outer variable and leaves it unused. This can trigger -Wshadow/-Wunused-variable warnings and makes the control flow harder to follow. Prefer declaring token only once (either remove the outer declaration, or assign to it inside the if).

Suggested change
std::string token = codepoint_to_utf8(cp);
token = codepoint_to_utf8(cp);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@wbruna wbruna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR shouldn't change the ggml submodule.

The change itself look reasonable, though, but I had to dig up a bit to understand it. A better explanation on the commit message would have been appreciated, because it's not an 'update': we have what looks like the wrong function being called on the tokenizer. Even more so, because the fix for the standard isspace theoretically behaving badly looks related, but it's actually a totally separate issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants