Skip to content

Fix tagged template literal with unicode#15047

Open
pfgithub wants to merge 8 commits intomainfrom
pfg/fix-tagged-template-literal
Open

Fix tagged template literal with unicode#15047
pfgithub wants to merge 8 commits intomainfrom
pfg/fix-tagged-template-literal

Conversation

@pfgithub
Copy link
Contributor

@pfgithub pfgithub commented Nov 8, 2024

What does this PR do?

Fixes #8745, fixes #15492, fixes #15929, fixes #16763, fixes #18115, fixes #8207

console.log(String.raw`æ™`); // -> "æ™" instead of "\u00E6\u2122"
console.log(/æ™/.source); // -> "æ™" instead of "\u00E6\u2122"

Changes ascii_only to prefers_ascii. It will try to emit mostly ascii, but if a non-ascii character is encountered in a tagged template, it will emit it. Updates execution to scan the file to see if it contains any non-ascii and if it does, load it as utf-8 instead.

Performance impact:

  • 1.01x ± 0.02 slower for small files vs main (even using String.raw with unicode)
  • 1.02x ± 0.03 slower for hello world
  • 1.4x slower on a 11mb file with one unicode character in String.raw
Benchmark
const lotsofiles = Array.from({length: 10000}, (_, i) => "f"+i+".js");
const lotsofilesu = Array.from({length: 10000}, (_, i) => "u"+i+".js");
const benches = {
    "lotsofilesu.js": lotsofiles.map(f => `import "./${f}";`).join("\n"),
    "lotsofiles.js": lotsofiles.map(f => `import "./${f}";`).join("\n"),
    "helloworld.js": `console.log("Hello, world!");`,
    "longraw1.js": `console.log(String.raw\`${"Hello, world".repeat(1000000)}!\`);`,
    "longraw2.js": `console.log(String.raw\`${"Hello, world".repeat(1000000)}!™\`);`,
};
const versions = [
    "bun-d9a867a4b9d2711336260aab4ab946cc13f5b078",
    "bun-15047",
];

for (const name of lotsofiles) {
    await Bun.write(`benches/${name}`, "console.log(String.raw`lotsofiles "+name+"TM`)");
}
for (const name of lotsofilesu) {
    await Bun.write(`benches/${name}`, "console.log(String.raw`lotsofiles "+name+"™`)");
}
for (const [name, value] of Object.entries(benches)) {
    await Bun.write(`benches/${name}`, value);
}
for (const [name] of Object.entries(benches)) {
    Bun.spawnSync({
        cmd: ["hyperfine", "--warmup", "100", ...versions.map(version => `${version} benches/${name}`)],
        stdio: ["inherit", "inherit", "inherit"],
    });
}

mkdir -p benches && bun ./bench.js

  • when we encouter a character that must print as utf-8, convert the writer to utf-16. now the writer will convert text to utf-16 before printing. then, save it in the runtime transpiler cache as utf-16. in the case of printing a // @bun file, don't do this. in the case of loading a // @bun file, use cloneUTF8

@robobun
Copy link
Collaborator

robobun commented Nov 8, 2024

@pfgithub pfgithub changed the title Fix tagged template literal Fix tagged template literal with unicode Nov 8, 2024
@pfgithub pfgithub marked this pull request as ready for review November 8, 2024 20:14
@pfgithub pfgithub marked this pull request as draft November 9, 2024 02:08
@pfgithub pfgithub force-pushed the pfg/fix-tagged-template-literal branch from ea67f97 to 66eae13 Compare November 12, 2024 00:03
@cirospaciari
Copy link
Member

Maybe related: #17657

@pfgithub pfgithub force-pushed the pfg/fix-tagged-template-literal branch from 7f7102e to eb11922 Compare February 26, 2025 02:23
@Electroid
Copy link
Contributor

@claude Please revive this PR and rebase it to the current main. Make sure to build and run the changed tests to make sure everything works.

@claude

This comment was marked as off-topic.

@pfgithub pfgithub reopened this Jun 24, 2025
@pfgithub pfgithub force-pushed the pfg/fix-tagged-template-literal branch from eb11922 to 7cdb0bb Compare July 26, 2025 04:44
@pfgithub pfgithub marked this pull request as ready for review July 28, 2025 22:03
@brainkim
Copy link
Contributor

Is there something blocking this PR? I just did a protracted debugging session where I looked through Markdown source files, a markdown to HTML pipeline, a template tag implementation, ESBuild, and finally Bun. It’s 2025 please use Unicode 🙃🥰😝

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 6, 2025

Walkthrough

Switches Latin‑1 cloning to UTF‑8 in module loader/cache and Postgres message handling; refactors js_printer to use a new prefers_ascii field and updated APIs; and adds/adjusts tests for String.raw/tagged templates, RegExp sources, Unicode handling, shell behavior, and several regressions.

Changes

Cohort / File(s) Summary of changes
UTF-8 cloning in loaders/cache
src/bun.js/ModuleLoader.zig, src/bun.js/RuntimeTranspilerCache.zig
Replaced bun.String.cloneLatin1(...) uses with bun.String.cloneUTF8(...) when constructing/cloning source and output code; updated logging to use UTF‑8 length; bumped RuntimeTranspilerCache version expectation to 17.
Printer prefers_ascii refactor
src/js_printer.zig
Replaced ascii_only with prefers_ascii across APIs and internals; updated function signatures (canPrintWithoutEscape, estimateLengthForUTF8, writePreQuotedString, quoteForJSON, etc.), removed ascii_only from NewPrinter params, added prefers_ascii field, and updated all call sites and printing logic to propagate the new preference.
Postgres field message UTF-8 clone
src/sql/postgres/protocol/FieldMessage.zig
Changed localized_severity initialization from String.createUTF8(...) to String.cloneUTF8(...).
Bundler / transpiler tests: strings & templates
test/bundler/bundler_string.test.ts, test/bundler/transpiler/transpiler.test.js
Added tests covering String.raw/tagged template literals, escapes, Unicode, and RegExp source access; noted duplicated block in transpiler tests.
Shell test: Unicode expectation
test/js/bun/shell/bunshell.test.ts
Simplified the "escape unicode" assertion to expect the literal escaped sequence "\弟\氣\n".
Regression tests added/enabled
test/regression/issue/14976/14976.test.ts, test/regression/issue/8207/8207.test.ts, test/regression/issue/8745/8745.test.ts
Enabled previously skipped test (14976); added regex source/UTF‑16 handling tests (8207); added end‑to‑end UTF‑8/RTC build/run tests validating stdout/stderr/exit codes (8745).

Suggested reviewers

  • Jarred-Sumner

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description provides substantial detail about what the PR does, including specific problem examples, the implementation approach (changing ascii_only to prefers_ascii), and comprehensive performance benchmarking information. However, the description is missing the second required section from the template: "How did you verify your code works?" While the raw_summary shows that multiple test files were added or modified to cover the changes, this verification information is not documented in the PR description itself as the template requires. The description addresses only the first of two explicitly required template sections.
✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "Fix tagged template literal with unicode" is concise, specific, and accurately summarizes the primary objective of the pull request. It clearly identifies the main issue being addressed (tagged template literals with Unicode characters) without being vague or overly broad. The title directly corresponds to the linked issues (#8745, #15492, #15929, #16763, #18115, #8207) which all relate to Unicode handling in templates and raw strings.
Linked Issues Check ✅ Passed All six linked issues are addressed by the code changes. The core refactoring from ascii_only to prefers_ascii in js_printer.zig, combined with UTF-8 handling in ModuleLoader.zig and RuntimeTranspilerCache.zig, directly resolves Unicode preservation issues in tagged templates (#8745), BOM handling (#15492), shell interpolation (#15929), emoji in transpilation (#16763), String.raw iteration encoding (#18115), and regex source handling (#8207). The raw summary confirms new regression tests were added for issues #8745 and #8207, and the PR description explicitly lists all six fixed issues with context showing how the prefers_ascii approach enables non-ASCII character emission in templates and regex sources.
Out of Scope Changes Check ✅ Passed All code changes are directly scoped to implementing the Unicode handling fix. The primary changes (js_printer.zig ascii_only refactoring, ModuleLoader.zig UTF-8 cloning, RuntimeTranspilerCache.zig version update and UTF-8 logic) directly support the objective of preserving non-ASCII characters in templates and regex sources. The FieldMessage.zig change (String.createUTF8 to String.cloneUTF8) aligns with the broader UTF-8 conversion effort. Test additions in multiple files are regression tests validating the fixes for the linked issues, and no unrelated functional changes or cleanup work appear in the changeset.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/js_printer.zig (1)

4948-4989: Bug: invalid escape for non‑BMP identifiers in printIdentifierUTF16

For code points > 0xFFFF, this prints \u followed by raw UTF‑8 bytes, which is invalid JS. Use \u{...} or surrogate pairs.

Apply:

@@
-                if (p.prefers_ascii and c > last_ascii) {
-                    switch (c) {
-                        0...0xFFFF => {
-                            p.print(
-                                [_]u8{
-                                    '\\',
-                                    'u',
-                                    hex_chars[c >> 12],
-                                    hex_chars[(c >> 8) & 15],
-                                    hex_chars[(c >> 4) & 15],
-                                    hex_chars[c & 15],
-                                },
-                            );
-                        },
-                        else => {
-                            p.print("\\u");
-                            var buf_ptr = p.writer.reserve(4) catch unreachable;
-                            p.writer.advance(strings.encodeWTF8RuneT(buf_ptr[0..4], CodeUnitType, c));
-                        },
-                    }
-                    continue;
-                }
+                if (p.prefers_ascii and c > last_ascii) {
+                    if (c <= 0xFFFF) {
+                        p.print([_]u8{
+                            '\\','u',
+                            hex_chars[(c >> 12) & 0xF],
+                            hex_chars[(c >> 8) & 0xF],
+                            hex_chars[(c >> 4) & 0xF],
+                            hex_chars[c & 0xF],
+                        });
+                    } else {
+                        // Use ES2015 identifier escape
+                        p.print("\\u{");
+                        std.fmt.formatInt(c, 16, .lower, .{}, p) catch unreachable;
+                        p.print("}");
+                    }
+                    continue;
+                }
🧹 Nitpick comments (1)
src/sql/postgres/protocol/FieldMessage.zig (1)

58-59: Dropping localized_severity — confirm intent

The initializer no longer constructs .localized_severity; such fields are now skipped. If intentional, add a brief comment explaining why it’s ignored and confirm downstream doesn’t display it.

Optionally keep the case (using String.cloneUTF8(message)) to preserve parity with severity if there’s no downside.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3c232b0 and 4eaca8e.

📒 Files selected for processing (10)
  • src/bun.js/ModuleLoader.zig (5 hunks)
  • src/bun.js/RuntimeTranspilerCache.zig (2 hunks)
  • src/js_printer.zig (24 hunks)
  • src/sql/postgres/protocol/FieldMessage.zig (1 hunks)
  • test/bundler/bundler_string.test.ts (1 hunks)
  • test/bundler/transpiler/transpiler.test.js (1 hunks)
  • test/js/bun/shell/bunshell.test.ts (1 hunks)
  • test/regression/issue/14976/14976.test.ts (1 hunks)
  • test/regression/issue/8207/8207.test.ts (1 hunks)
  • test/regression/issue/8745/8745.test.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (17)
test/**

📄 CodeRabbit inference engine (.cursor/rules/writing-tests.mdc)

Place all tests under the test/ directory

Files:

  • test/regression/issue/8207/8207.test.ts
  • test/bundler/transpiler/transpiler.test.js
  • test/js/bun/shell/bunshell.test.ts
  • test/regression/issue/14976/14976.test.ts
  • test/bundler/bundler_string.test.ts
  • test/regression/issue/8745/8745.test.ts
test/regression/issue/+([0-9])/**

📄 CodeRabbit inference engine (.cursor/rules/writing-tests.mdc)

Place regression tests under test/regression/issue/[number] (a numeric directory per issue)

Files:

  • test/regression/issue/8207/8207.test.ts
  • test/regression/issue/14976/14976.test.ts
  • test/regression/issue/8745/8745.test.ts
test/**/*.{js,ts}

📄 CodeRabbit inference engine (.cursor/rules/writing-tests.mdc)

test/**/*.{js,ts}: Write tests in JavaScript or TypeScript using Bun’s Jest-style APIs (test, describe, expect) and run with bun test
Prefer data-driven tests (e.g., test.each) to reduce boilerplate
Use shared utilities from test/harness.ts where applicable

Files:

  • test/regression/issue/8207/8207.test.ts
  • test/bundler/transpiler/transpiler.test.js
  • test/js/bun/shell/bunshell.test.ts
  • test/regression/issue/14976/14976.test.ts
  • test/bundler/bundler_string.test.ts
  • test/regression/issue/8745/8745.test.ts
test/**/*.test.ts

📄 CodeRabbit inference engine (test/CLAUDE.md)

test/**/*.test.ts: Name test files *.test.ts and use bun:test
Do not write flaky tests: never wait for arbitrary time; wait for conditions instead
Never hardcode port numbers in tests; use port: 0 to get a random port
When spawning Bun in tests, use bunExe() and bunEnv from harness
Prefer async/await in tests; for a single callback, use Promise.withResolvers()
Do not set explicit test timeouts; rely on Bun’s built-in timeouts
Use tempDir/tempDirWithFiles from harness for temporary files and directories in tests
For large/repetitive strings in tests, prefer Buffer.alloc(count, fill).toString() over "A".repeat(count)
Import common test utilities from harness (e.g., bunExe, bunEnv, tempDirWithFiles, tmpdirSync, platform checks, GC helpers)
In error tests, assert non-zero exit codes for failing processes and use toThrow for synchronous errors
Use describe blocks for grouping, describe.each for parameterized tests, snapshots with toMatchSnapshot, and lifecycle hooks (beforeAll, beforeEach, afterEach); track resources for cleanup in afterEach
Use using/await using with Bun resources (e.g., Bun.listen/connect/spawn/serve) to ensure cleanup in tests

Files:

  • test/regression/issue/8207/8207.test.ts
  • test/js/bun/shell/bunshell.test.ts
  • test/regression/issue/14976/14976.test.ts
  • test/bundler/bundler_string.test.ts
  • test/regression/issue/8745/8745.test.ts
test/regression/issue/**

📄 CodeRabbit inference engine (test/CLAUDE.md)

Place regression tests under test/regression/issue/ and organize by issue number

Place regression tests under test/regression/issue/ (one per bug fix)

Files:

  • test/regression/issue/8207/8207.test.ts
  • test/regression/issue/14976/14976.test.ts
  • test/regression/issue/8745/8745.test.ts
test/**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

test/**/*.test.{ts,tsx}: Test files must live under test/ and end with .test.ts or .test.tsx
In tests, always use port: 0; do not hardcode ports or roll your own random port
Prefer normalizeBunSnapshot for snapshotting test output instead of asserting raw strings
Do not write tests that assert absence of crashes (e.g., 'no panic' or 'no uncaught exception')
Use Bun’s Jest-compatible runner (import { test, expect } from "bun:test") for tests
Avoid shell commands like find or grep in tests; use Bun’s Glob and built-in tools instead
Prefer running tests via bun bd test and use provided harness utilities (bunEnv, bunExe, tempDir)
Use Bun.spawn with proper stdio handling and await proc.exited in process-spawning tests

Files:

  • test/regression/issue/8207/8207.test.ts
  • test/js/bun/shell/bunshell.test.ts
  • test/regression/issue/14976/14976.test.ts
  • test/bundler/bundler_string.test.ts
  • test/regression/issue/8745/8745.test.ts
test/bundler/**/*

📄 CodeRabbit inference engine (.cursor/rules/writing-tests.mdc)

Place bundler/transpiler/CSS/bun build tests under test/bundler/

Files:

  • test/bundler/transpiler/transpiler.test.js
  • test/bundler/bundler_string.test.ts
test/bundler/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place bundler/transpiler tests under test/bundler/

Files:

  • test/bundler/transpiler/transpiler.test.js
  • test/bundler/bundler_string.test.ts
test/js/**/*.{js,ts}

📄 CodeRabbit inference engine (.cursor/rules/writing-tests.mdc)

Place JavaScript and TypeScript tests under test/js/

Files:

  • test/js/bun/shell/bunshell.test.ts
test/js/bun/**/*.{js,ts}

📄 CodeRabbit inference engine (.cursor/rules/writing-tests.mdc)

Place Bun API tests under test/js/bun/, separated by category (e.g., test/js/bun/glob/)

Files:

  • test/js/bun/shell/bunshell.test.ts
test/js/**

📄 CodeRabbit inference engine (test/CLAUDE.md)

Organize unit tests for specific features under test/js/ by module

Files:

  • test/js/bun/shell/bunshell.test.ts
test/js/bun/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place Bun-specific API tests under test/js/bun/

Files:

  • test/js/bun/shell/bunshell.test.ts
src/**/*.zig

📄 CodeRabbit inference engine (.cursor/rules/building-bun.mdc)

Implement debug logs in Zig using const log = bun.Output.scoped(.${SCOPE}, false); and invoking log("...", .{})

In Zig code, manage memory carefully and use defer for cleanup of allocations/resources

src/**/*.zig: In Zig source files, place all @import statements at the bottom of the file
Use @import("bun") instead of @import("root").bun when importing Bun in Zig

Files:

  • src/bun.js/RuntimeTranspilerCache.zig
  • src/bun.js/ModuleLoader.zig
  • src/sql/postgres/protocol/FieldMessage.zig
  • src/js_printer.zig
**/*.zig

📄 CodeRabbit inference engine (.cursor/rules/javascriptcore-class.mdc)

**/*.zig: Declare the extern C symbol in Zig and export a Zig-friendly alias for use
Wrap the Bun____toJS extern in a Zig method that takes a JSGlobalObject and returns JSC.JSValue

Files:

  • src/bun.js/RuntimeTranspilerCache.zig
  • src/bun.js/ModuleLoader.zig
  • src/sql/postgres/protocol/FieldMessage.zig
  • src/js_printer.zig
src/bun.js/**/*.zig

📄 CodeRabbit inference engine (.cursor/rules/zig-javascriptcore-classes.mdc)

src/bun.js/**/*.zig: In Zig binding structs, expose generated bindings via pub const js = JSC.Codegen.JS and re-export toJS/fromJS/fromJSDirect
Constructors and prototype methods should return bun.JSError!JSC.JSValue to integrate Zig error handling with JS exceptions
Use parameter name globalObject (not ctx) and accept (*JSC.JSGlobalObject, *JSC.CallFrame) in binding methods/constructors
Implement getters as get(this, globalObject) returning JSC.JSValue and matching the .classes.ts interface
Provide deinit() for resource cleanup and finalize() that calls deinit(); use bun.destroy(this) or appropriate destroy pattern
Access JS call data via CallFrame (argument(i), argumentCount(), thisValue()) and throw errors with globalObject.throw(...)
For properties marked cache: true, use the generated Zig accessors (NameSetCached/GetCached) to work with GC-owned values
In finalize() for objects holding JS references, release them using .deref() before destroy

Files:

  • src/bun.js/RuntimeTranspilerCache.zig
  • src/bun.js/ModuleLoader.zig
src/**/js_*.zig

📄 CodeRabbit inference engine (.cursor/rules/registering-bun-modules.mdc)

src/**/js_*.zig: Implement JavaScript bindings in a Zig file named with a js_ prefix (e.g., js_smtp.zig, js_your_feature.zig)
Handle reference counting correctly with ref()/deref() in JS-facing Zig code
Always implement proper cleanup in deinit() and finalize() for JS-exposed types

Files:

  • src/js_printer.zig
src/{**/js_*.zig,bun.js/api/**/*.zig}

📄 CodeRabbit inference engine (.cursor/rules/registering-bun-modules.mdc)

Use bun.JSError!JSValue for proper error propagation in JS-exposed Zig functions

Files:

  • src/js_printer.zig
🧠 Learnings (16)
📚 Learning: 2025-10-04T09:51:30.294Z
Learnt from: CR
PR: oven-sh/bun#0
File: CLAUDE.md:0-0
Timestamp: 2025-10-04T09:51:30.294Z
Learning: Applies to test/**/*.test.{ts,tsx} : Prefer normalizeBunSnapshot for snapshotting test output instead of asserting raw strings

Applied to files:

  • test/js/bun/shell/bunshell.test.ts
📚 Learning: 2025-08-30T00:13:36.815Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/zig-javascriptcore-classes.mdc:0-0
Timestamp: 2025-08-30T00:13:36.815Z
Learning: Applies to src/bun.js/bindings/generated_classes_list.zig : Update src/bun.js/bindings/generated_classes_list.zig to include new classes

Applied to files:

  • src/bun.js/ModuleLoader.zig
📚 Learning: 2025-10-04T09:51:30.294Z
Learnt from: CR
PR: oven-sh/bun#0
File: CLAUDE.md:0-0
Timestamp: 2025-10-04T09:51:30.294Z
Learning: Applies to test/**/*.test.{ts,tsx} : Prefer running tests via bun bd test <file> and use provided harness utilities (bunEnv, bunExe, tempDir)

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-10-04T09:51:30.294Z
Learnt from: CR
PR: oven-sh/bun#0
File: CLAUDE.md:0-0
Timestamp: 2025-10-04T09:51:30.294Z
Learning: Applies to test/**/*.test.{ts,tsx} : Use Bun.spawn with proper stdio handling and await proc.exited in process-spawning tests

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-08-30T00:12:56.803Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/writing-tests.mdc:0-0
Timestamp: 2025-08-30T00:12:56.803Z
Learning: Applies to test/**/*.{js,ts} : Write tests in JavaScript or TypeScript using Bun’s Jest-style APIs (test, describe, expect) and run with bun test

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-09-03T17:10:13.486Z
Learnt from: CR
PR: oven-sh/bun#0
File: test/CLAUDE.md:0-0
Timestamp: 2025-09-03T17:10:13.486Z
Learning: Applies to test/**/*.test.ts : When spawning Bun in tests, use `bunExe()` and `bunEnv` from `harness`

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-10-04T09:51:30.294Z
Learnt from: CR
PR: oven-sh/bun#0
File: CLAUDE.md:0-0
Timestamp: 2025-10-04T09:51:30.294Z
Learning: Applies to test/**/*.test.{ts,tsx} : Use Bun’s Jest-compatible runner (import { test, expect } from "bun:test") for tests

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-09-03T17:10:13.486Z
Learnt from: CR
PR: oven-sh/bun#0
File: test/CLAUDE.md:0-0
Timestamp: 2025-09-03T17:10:13.486Z
Learning: Applies to test/**/*.test.ts : Name test files `*.test.ts` and use `bun:test`

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-08-30T00:12:56.803Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/writing-tests.mdc:0-0
Timestamp: 2025-08-30T00:12:56.803Z
Learning: Applies to test/js/bun/**/*.{js,ts} : Place Bun API tests under test/js/bun/, separated by category (e.g., test/js/bun/glob/)

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-09-03T17:10:13.486Z
Learnt from: CR
PR: oven-sh/bun#0
File: test/CLAUDE.md:0-0
Timestamp: 2025-09-03T17:10:13.486Z
Learning: Applies to test/**/*-fixture.ts : Name test fixture files that are spawned by tests with the suffix `-fixture.ts`

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-08-30T00:12:56.803Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/writing-tests.mdc:0-0
Timestamp: 2025-08-30T00:12:56.803Z
Learning: Applies to test/cli/**/*.{js,ts} : When testing Bun as a CLI, use spawn with bunExe() and bunEnv from harness, and capture stdout/stderr via pipes

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-08-30T00:09:39.100Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/dev-server-tests.mdc:0-0
Timestamp: 2025-08-30T00:09:39.100Z
Learning: Applies to test/bake/dev/bundle.test.ts : bundle.test.ts should contain DevServer-specific bundling tests

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-09-03T17:10:13.486Z
Learnt from: CR
PR: oven-sh/bun#0
File: test/CLAUDE.md:0-0
Timestamp: 2025-09-03T17:10:13.486Z
Learning: Applies to test/**/*.test.ts : Import common test utilities from `harness` (e.g., `bunExe`, `bunEnv`, `tempDirWithFiles`, `tmpdirSync`, platform checks, GC helpers)

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-08-30T00:09:39.100Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/dev-server-tests.mdc:0-0
Timestamp: 2025-08-30T00:09:39.100Z
Learning: Applies to test/bake/bake-harness.ts : Provide and maintain shared test utilities: devTest, prodTest, devAndProductionTest, Dev, Client, and helpers in the harness

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-08-30T00:09:39.100Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/dev-server-tests.mdc:0-0
Timestamp: 2025-08-30T00:09:39.100Z
Learning: Applies to test/bake/{dev/*.test.ts,dev-and-prod.ts} : Import testing utilities (devTest, prodTest, devAndProductionTest, Dev, Client) from test/bake/bake-harness.ts

Applied to files:

  • test/regression/issue/8745/8745.test.ts
📚 Learning: 2025-08-30T00:09:39.100Z
Learnt from: CR
PR: oven-sh/bun#0
File: .cursor/rules/dev-server-tests.mdc:0-0
Timestamp: 2025-08-30T00:09:39.100Z
Learning: Applies to test/bake/dev/*.test.ts : Write Dev Server and HMR tests in test/bake/dev/*.test.ts using devTest from the shared harness

Applied to files:

  • test/regression/issue/8745/8745.test.ts
🧬 Code graph analysis (1)
test/regression/issue/8745/8745.test.ts (2)
test/harness.ts (2)
  • tempDirWithFiles (260-267)
  • bunExe (103-106)
test/js/bun/shell/bunshell.test.ts (1)
  • bunEnv (15-28)
🪛 ast-grep (0.39.5)
test/regression/issue/8207/8207.test.ts

[warning] 17-17: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(replaced, "gu")
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html

(regexp-from-variable)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Format
🔇 Additional comments (19)
src/bun.js/ModuleLoader.zig (1)

781-781: LGTM: Systematic encoding upgrade from Latin-1 to UTF-8.

The change from cloneLatin1 to cloneUTF8 across all source code string operations correctly enables preservation of non-ASCII characters in templates and regex sources, aligning with the PR objectives to fix Unicode handling in tagged template literals and RegExp sources.

Also applies to: 1079-1079, 1123-1123, 1292-1292, 1425-1425, 1863-1863, 1897-1897, 2519-2519, 2599-2599

src/bun.js/RuntimeTranspilerCache.zig (2)

16-17: LGTM: Cache version bump for UTF-8 encoding change.

The version increment and comment accurately reflect the encoding change. Bumping the cache version ensures cached Latin-1 content won't be incorrectly interpreted as UTF-8.


635-635: LGTM: Consistent UTF-8 encoding changes.

The switch from cloneLatin1 to cloneUTF8 and the debug logging adjustment from .latin1().len to .length() correctly implement UTF-8 handling in the transpiler cache, consistent with the ModuleLoader changes.

Also applies to: 643-643

test/regression/issue/8207/8207.test.ts (2)

3-20: LGTM: Comprehensive test for regex source UTF-16 preservation.

The test correctly validates that non-ASCII characters (¶, U+00B6) are preserved in regex sources and can be manipulated via string operations, addressing the parsel-js compatibility issue mentioned in #8207.

Note: The static analysis warning about ReDoS is a false positive—this is a test file using a known safe pattern, not production code processing untrusted input.


22-34: LGTM: Test validates regex matching behavior.

The test correctly validates that the regex with UTF-16 characters works at runtime, complementing the source string manipulation test.

test/regression/issue/8745/8745.test.ts (4)

1-10: LGTM: Well-structured test setup.

The test correctly:

  • Imports harness utilities as per coding guidelines
  • Uses UTF-8 byte arrays for expected output to validate Unicode handling
  • Creates fixtures including a large file to trigger runtime transpiler cache (RTC)

12-28: LGTM: Direct run test follows best practices.

The test correctly uses Bun.spawnSync with harness utilities (bunExe(), bunEnv) and proper stdio handling to validate that non-ASCII characters in template literals are preserved when running TypeScript directly.


30-80: LGTM: Build-then-run tests provide comprehensive coverage.

The tests validate that non-ASCII characters are correctly preserved through the build pipeline for both normal and minified builds, ensuring the UTF-8 handling works across all transpilation paths.


82-169: LGTM: Runtime transpiler cache tests validate caching path.

The tests correctly exercise the runtime transpiler cache (RTC) code path with large files, validating that UTF-8 handling works consistently whether cached or not. The comment honestly notes the uncertainty about the exact threshold.

The RTC threshold is defined in RuntimeTranspilerCache.zig:20 as MINIMUM_CACHE_SIZE = 50 * 1024, so the 16 MB fixture definitely triggers it.

src/js_printer.zig (9)

26-32: Escape policy switch matches objectives

Switching to prefers_ascii and allowing direct print of non‑ASCII when ASCII is not preferred while still escaping BOM/LS/PS is sound.

Please confirm BOM handling: raw template parts bypass canPrintWithoutEscape and print as-is. Is that intentional for Node/Deno parity for String.raw? Based on objectives.


109-143: String emission path updated coherently

estimateLengthForUTF8 and writePreQuotedString now consult prefers_ascii, preserving ASCII bias yet emitting UTF‑8 when allowed. Control characters, ${ in templates, and LS/PS are handled.

Also applies to: 145-156, 171-183, 183-207, 291-335


338-345: JSON quoting path OK

quoteForJSON propagates prefers_ascii and uses JSON‑safe escaping. No issues.


1537-1543: Propagation to UTF8/UTF16 string printers

Passing e.prefers_ascii into writePreQuotedString is correct and consistent.

Also applies to: 1548-1553


1909-1916: Identifier classification respects ASCII preference

Using p.prefers_ascii to choose Latin‑1 vs UTF‑16 identifier checks aligns with the “ASCII unless needed” goal.


2742-2750: Tagged template raw: flip to UTF‑8 emission when needed

Detecting non‑ASCII in .raw parts and flipping p.prefers_ascii = false ensures the file can be loaded as UTF‑8 so raw chars are preserved (fixes String.raw/tagged template issues).

This flip is global for the remainder of the file. Confirm this is intended (subsequent identifiers/strings may now print with Unicode instead of escapes), matching the loader’s “scan for non‑ASCII, load as UTF‑8” behavior described in the PR.

Also applies to: 2758-2766


3125-3130: RegExp .source correctness

For non‑ASCII regex, switching off ASCII preference and printing the raw literal preserves .source. Matches Node/Deno behavior.


5085-5094: Printer initialization change LGTM

Plumbing prefers_ascii into Printer.init is correct; no lifetime/alloc issues spotted.


5579-5673: Constructor sites consistent; one nuance in CJS path

  • printAst/printJSON propagate flags correctly.
  • In printCommonJS, NewPrinter(... is_bun_platform=false, ...) with a runtime prefers_ascii is fine (avoids bun‑specific branches) while still honoring ASCII preference.

Double‑check any CJS callers pass prefers_ascii=true when targeting Bun’s ASCII‑preferred output to retain historical behavior.

Also applies to: 5748-5768, 5933-5956

test/regression/issue/14976/14976.test.ts (1)

73-76: Good coverage for Unicode in tagged templates

Validates String.raw with non‑ASCII and ensures shell echo returns Unicode unchanged.

Comment on lines +95 to +96
TaggedTemplate7: { expr: 'String.raw`\xE6${"one"}`', print: true },
TaggedTemplate8: { expr: 'String.raw`\u{10334}${"two"}→`', print: true },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix the \xE6 escape so the test exercises the raw escape path
Line 95 currently reads 'String.raw\xE6${"one"}', but inside a JavaScript string literal \xE6 is parsed immediately into the character æ. That means this entry never emits a backslash in the generated source and ends up testing the same “literal æ” case covered by TaggedTemplate5, instead of the intended “written as \x… escape” scenario. Please escape the backslash so the bundler sees the actual \xE6 sequence.

-  TaggedTemplate7: { expr: 'String.raw`\xE6${"one"}`', print: true },
+  TaggedTemplate7: { expr: 'String.raw`\\xE6${"one"}`', print: true },
🤖 Prompt for AI Agents
In test/bundler/bundler_string.test.ts around lines 95-96, the test entry uses
the JavaScript string literal 'String.raw`\xE6${"one"}`' which causes \xE6 to be
interpreted as the character æ at parse time; change the literal to contain an
escaped backslash so the bundler sees the actual backslash escape sequence
(e.g., use '\\xE6' inside the string) so the test exercises the raw-escape path
rather than the pre-parsed æ character.

@VityaSchel
Copy link

this is a blocker for me, I use emojis in sql syntax with drizzle :)

Copy link
Contributor

@markovejnovic markovejnovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I guess, but I don't know enough to be comfortable being the sole stamp on this PR.

}

pub fn canPrintWithoutEscape(comptime CodePointType: type, c: CodePointType, comptime ascii_only: bool) bool {
pub fn canPrintWithoutEscape(comptime CodePointType: type, c: CodePointType, prefers_ascii: bool) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to drop the comptime?

.severity => FieldMessage{ .severity = String.cloneUTF8(message) },
// Ignore this one for now.
// .localized_severity => FieldMessage{ .localized_severity = String.createUTF8(message) },
// .localized_severity => FieldMessage{ .localized_severity = String.cloneUTF8(message) },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a correct/necessary change?

@lcswillems
Copy link

@markovejnovic Any chance it could get pushed for review internally? 🙏 This issue is such a pain for non-english people. I've given more details here: #15929 (comment)

@marcelxpfeifer
Copy link

Is there any why we can help with getting this PR merged? It would solve quite a few problems for me and my team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

10 participants