Commit 8e4fe08

rascani and claude authored
Bump tokenizers submodule to fix sentencepiece GCC 15 build (#20135)
### Summary
Updates extension/llm/tokenizers to include meta-pytorch/tokenizers#193, which bumps the sentencepiece submodule to pick up a missing `#include <cstdint>` (google/sentencepiece#1109). Without this, `pytorch_tokenizers` fails to compile inside the `executorch-ubuntu-26.04-gcc15` docker image, blocking the RISC-V baremetal CI (#19917).

### Test plan
CI

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0d8f437 commit 8e4fe08

2 files changed

Lines changed: 6 additions & 2 deletions

examples/models/parakeet/tokenizer_utils.cpp

Lines changed: 5 additions & 1 deletion
@@ -8,6 +8,10 @@
 
 namespace {
 
+// SentencePiece's word-boundary marker, spelled as UTF-8 bytes so this remains
+// a const char[] literal when compiled as C++20.
+constexpr char kSentencePieceWordBoundary[] = "\xE2\x96\x81";
+
 bool is_whitespace_only(const std::string& token) {
   if (token.empty()) {
     return true;
@@ -36,7 +40,7 @@ bool is_special_token(const std::string& token) {
   if (token.rfind("##", 0) == 0) {
     return true;
   }
-  if (token.rfind(u8"▁", 0) == 0) {
+  if (token.rfind(kSentencePieceWordBoundary, 0) == 0) {
     return true;
   }
   if (is_whitespace_only(token)) {
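
The comment in the diff points at a C++20 change: a `u8"..."` literal there has type `const char8_t[N]` rather than `const char[N]`, and `std::string::rfind` has no overload that accepts it. The sketch below is a minimal standalone illustration of the byte-escaped replacement and of the `rfind(prefix, 0) == 0` prefix idiom. It is not the file's actual code: `kWordBoundary` and `starts_with_prefix` are hypothetical names chosen for this example.

```cpp
#include <string>

// UTF-8 encoding of U+2581 (the SentencePiece word-boundary marker), written
// as escaped bytes so the literal stays a plain char array under any standard.
// (Hypothetical name; the commit calls its constant kSentencePieceWordBoundary.)
constexpr char kWordBoundary[] = "\xE2\x96\x81";

// Three UTF-8 bytes plus the null terminator.
static_assert(sizeof(kWordBoundary) == 4, "expected a 3-byte UTF-8 sequence");

// Hypothetical helper. rfind(prefix, 0) can only report a match that starts
// at index 0, so comparing the result with 0 works as a prefix test.
bool starts_with_prefix(const std::string& token, const char* prefix) {
  return token.rfind(prefix, 0) == 0;
}
```

For example, `starts_with_prefix(std::string(kWordBoundary) + "hello", kWordBoundary)` is true, while `starts_with_prefix("hello", kWordBoundary)` is false. Naming the constant once also keeps the raw byte sequence out of the call site.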