fix: update space handling in token_split and improve ltrim/rtrim fun… #1422
maikebing wants to merge 1 commit into leejet:master
Conversation
Pull request overview
Updates whitespace handling in string trimming and tokenization to avoid undefined behavior and improve Unicode/whitespace correctness in tokenizer logic.
Changes:
- Use `std::isspace(static_cast<unsigned char>(ch))` in `ltrim`/`rtrim` to avoid UB with negative `char` values.
- Replace `isspace(...)` checks on Unicode codepoints in `token_split` with the tokenizer's `is_space(...)` helper for consistent whitespace detection.
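The first bullet can be sketched in isolation. This is a minimal standalone version of the `ltrim`/`rtrim` pattern being fixed, not the exact code from `src/util.cpp`: passing a plain `char` to `std::isspace` is undefined behavior when `char` is signed and the byte is negative (any UTF-8 continuation or lead byte ≥ 0x80), so the argument is cast to `unsigned char` first.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Strip leading whitespace in place. The static_cast keeps the
// argument of std::isspace in [0, 255], avoiding UB for UTF-8
// bytes that are negative when char is signed.
static void ltrim(std::string& s) {
    s.erase(s.begin(), std::find_if(s.begin(), s.end(), [](char ch) {
                return !std::isspace(static_cast<unsigned char>(ch));
            }));
}

// Strip trailing whitespace in place, scanning from the back.
static void rtrim(std::string& s) {
    s.erase(std::find_if(s.rbegin(), s.rend(), [](char ch) {
                return !std::isspace(static_cast<unsigned char>(ch));
            }).base(),
            s.end());
}
```

Note this still uses byte-wise `std::isspace`, which is fine for trimming ASCII whitespace around a UTF-8 string; codepoint-level whitespace detection is the separate `is_space(...)` change.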
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/util.cpp | Makes ltrim/rtrim whitespace checks safe for non-ASCII bytes by casting to unsigned char. |
| src/tokenize_util.cpp | Switches tokenization whitespace checks from isspace to is_space to correctly handle codepoints. |
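The second row is the core of the fix: `std::isspace` only accepts values representable as `unsigned char` (or `EOF`), so calling it on a decoded Unicode codepoint such as U+3000 (ideographic space) is undefined behavior, and even where it happens to work it cannot recognize non-ASCII whitespace. The tokenizer's actual `is_space(...)` implementation is not shown in this diff; a hypothetical codepoint-aware version might look like:

```cpp
#include <cstdint>

// Hypothetical codepoint-level whitespace check (the real helper in
// the tokenizer may differ). It tests the codepoint value directly
// instead of funneling it through std::isspace.
static bool is_space(uint32_t cp) {
    switch (cp) {
        case 0x09: case 0x0A: case 0x0B: case 0x0C: case 0x0D: case 0x20:
        case 0x85:    // next line
        case 0xA0:    // no-break space
        case 0x2028:  // line separator
        case 0x2029:  // paragraph separator
        case 0x3000:  // ideographic space
            return true;
        default:
            // en quad (U+2000) through hair space (U+200A)
            return cp >= 0x2000 && cp <= 0x200A;
    }
}
```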
```diff
  std::string token;
- if (!is_letter(cps[i]) && !is_number(cps[i]) && !isspace(cps[i])) {
+ if (!is_letter(cps[i]) && !is_number(cps[i]) && !is_space(cps[i])) {
      std::string token = codepoint_to_utf8(cp);
```
token is declared twice here (outer std::string token; and then an inner std::string token = ...), which shadows the outer variable and leaves it unused. This can trigger -Wshadow/-Wunused-variable warnings and makes the control flow harder to follow. Prefer declaring token only once (either remove the outer declaration, or assign to it inside the if).
```suggestion
token = codepoint_to_utf8(cp);
```
wbruna left a comment
This PR shouldn't change the ggml submodule.
The change itself looks reasonable, but I had to dig a bit to understand it. A better explanation in the commit message would have been appreciated, because it's not an 'update': the tokenizer was calling what looks like the wrong function. Even more so because the fix for the standard isspace theoretically behaving badly looks related, but it's actually a totally separate issue.
No description provided.