Add Unicode-focused tests for string length and pattern handling #829

@Shristibot

Summary

Right now, string-related keywords like minLength, maxLength, pattern, and propertyNames don’t have much coverage for Unicode strings that clearly exercise the spec’s “length in Unicode code points” semantics. Most Unicode-related cases live in optional/non-bmp-regex.json and optional/ecmascript-regex.json, and those focus more on regex engine features than on core length and basic Unicode handling.

Motivation

  • The validation spec and accompanying docs define minLength/maxLength in terms of Unicode code points, not bytes, and they use Unicode examples to illustrate this.
  • In practice, implementations often differ when non‑ASCII text is involved, so having a few explicit tests helps confirm that validators are using code‑point length as required.
  • The existing optional regex tests show that engine‑specific behaviour is already isolated under optional/, which leaves room for a small number of portable Unicode examples in the core suite.
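To make the code-point versus byte distinction concrete, here is a minimal Python sketch comparing three common ways of measuring a string. The helper name `length_measures` is hypothetical and is not part of the test suite; it only illustrates why implementations that count bytes or UTF-16 code units can disagree with the spec on non-ASCII input.

```python
def length_measures(s: str) -> dict:
    """Return the length of s under three different counting rules."""
    return {
        # Python's len() on str counts Unicode code points,
        # which is the measure minLength/maxLength are defined in.
        "code_points": len(s),
        # Byte count of the UTF-8 encoding.
        "utf8_bytes": len(s.encode("utf-8")),
        # Number of 16-bit code units in the UTF-16 encoding.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }


samples = [
    "abc",                  # ASCII
    "\u65e5\u672c\u8a9e",   # three CJK characters (BMP)
    "\U0001F4A9",           # one emoji outside the BMP
]

for sample in samples:
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(sample), length_measures(sample))
```

All three measures agree for ASCII, byte counts diverge for BMP text, and both byte and UTF-16 counts diverge for non-BMP characters, so boundary tests with each kind of string can catch different mistakes.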

Proposal

  1. Required tests (core)
  • Extend tests/*/minLength.json and maxLength.json (for drafts like draft2020‑12 and later) with cases where:

      - Non‑ASCII strings (for example, simple emoji or common non‑Latin text) sit right at or around the length boundary, making the code‑point counting explicit.
      - The same numeric minLength/maxLength is applied to both ASCII and non‑ASCII examples, so it’s clear that both are measured in code points rather than bytes.
    
  • Extend pattern.json and propertyNames.json with a few straightforward Unicode examples where:

      - Patterns match literal Unicode characters without relying on flags or advanced Unicode properties.
      - propertyNames includes keys that contain Unicode letters, again without depending on engine‑specific behaviour.
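As a rough illustration of what such required cases might look like, the sketch below builds a few test groups in a description/schema/tests shape and checks the expected results with a deliberately tiny checker. The group descriptions, sample strings, and the helpers `make_group` and `is_valid` are all hypothetical; this is not a proposed final wording for the suite, and `is_valid` covers only the keywords used here, so it is not a validator.

```python
import json
import re


def make_group(description, schema, cases):
    """Build one test group from (description, data, valid) tuples."""
    return {
        "description": description,
        "schema": schema,
        "tests": [
            {"description": d, "data": data, "valid": valid}
            for d, data, valid in cases
        ],
    }


groups = [
    make_group(
        "maxLength counts code points, not bytes (hypothetical)",
        {"maxLength": 2},
        [
            ("two ASCII characters", "ab", True),
            ("three ASCII characters", "abc", False),
            ("two non-BMP code points", "\U0001F4A9\U0001F4A9", True),
            ("three CJK code points", "\u65e5\u672c\u8a9e", False),
        ],
    ),
    make_group(
        "pattern with a literal non-ASCII character (hypothetical)",
        {"pattern": "^caf\u00e9$"},
        [
            ("precomposed e-acute matches", "caf\u00e9", True),
            ("plain ASCII e does not match", "cafe", False),
        ],
    ),
    make_group(
        "propertyNames with Unicode keys (hypothetical)",
        {"propertyNames": {"maxLength": 3}},
        [
            ("three-code-point key", {"\u65e5\u672c\u8a9e": 1}, True),
            ("four-code-point key", {"\u65e5\u672c\u8a9e\u3060": 1}, False),
        ],
    ),
]


def is_valid(schema, data):
    """Tiny checker for the keywords above only; lengths are code points."""
    if isinstance(data, str):
        if "minLength" in schema and len(data) < schema["minLength"]:
            return False
        if "maxLength" in schema and len(data) > schema["maxLength"]:
            return False
        if "pattern" in schema and re.search(schema["pattern"], data) is None:
            return False
    if isinstance(data, dict) and "propertyNames" in schema:
        return all(is_valid(schema["propertyNames"], key) for key in data)
    return True


# Sanity-check that every expected result agrees with code-point counting.
for group in groups:
    for test in group["tests"]:
        assert is_valid(group["schema"], test["data"]) == test["valid"]

# Emit the groups as JSON (non-ASCII is escaped by default).
print(json.dumps(groups, indent=2))
```

Pairing an ASCII case and a non-ASCII case under the same numeric limit is the point of the first group: a validator that counts bytes or UTF-16 code units would wrongly reject the two-emoji string.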
    

  2. Optional tests

  • Under tests/*/optional/ (next to non-bmp-regex.json and ecmascript-regex.json), add a small set of tests for more advanced Unicode scenarios, such as zero‑width characters, combining marks, or RTL sequences, where behaviour is more closely tied to the regex engine.
  • These would be clearly marked as optional and aimed at implementations that want to exercise richer Unicode/regex support beyond what the core spec strictly requires.

Required tests should stay within behaviour clearly mandated by the spec (no grapheme‑cluster rules or engine‑specific flags), with more advanced Unicode and regex behaviour covered under optional/ instead.
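A short Python sketch of why grapheme-cluster cases are better kept out of the required tests: a string that a reader sees as one character can span several code points, and its code-point length can change under normalization. The sample strings are illustrative choices, not cases taken from the suite.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
decomposed = "e\u0301"
# NFC normalization composes the pair into U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)
# Man, ZWJ, woman, ZWJ, girl: typically rendered as a single family emoji.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"

print(len(decomposed))   # code points in the decomposed form
print(len(precomposed))  # code points in the precomposed form
print(len(family))       # code points in the ZWJ sequence
```

Under code-point counting, a schema with maxLength 1 accepts the precomposed form and rejects the decomposed one, even though the two render identically. That behaviour follows from the code-point definition, but tests built around it can surprise readers, so such cases seem a better fit for optional/.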
