Fix: numbers in snake_case are merged with the preceding word by D-Walther · Pull Request #74 · hit9/bitproto

D-Walther · 2025-09-05T08:27:54Z

E.g. an enum value called MY_123_ENUM gets split into my123_enum, which is then uppercased to MY123_ENUM.

I refactored the function to be easier to understand and debug, as each processing step can now be inspected individually.

- E.g. MY_123_ENUM gets split into my123_enum
- Refactor to make it easier to debug the intermediate steps
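
To illustrate what "each processing step can be inspected individually" can look like, here is a minimal sketch. It is not the code from this PR: the name `snake_case_steps` and the two regular expressions are assumptions chosen for illustration, and the sketch deliberately does not try to split digits from letters.

```python
import re


def snake_case_steps(name: str) -> list[str]:
    """Return the input followed by the result of every processing step.

    Hypothetical helper for illustration only; not bitproto's implementation.
    """
    steps = [name]
    # Step 1: split where a lowercase letter or digit is followed by an
    # uppercase letter, e.g. "getHTTP" -> "get_HTTP".
    steps.append(re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", steps[-1]))
    # Step 2: split an acronym from a following capitalized word,
    # e.g. "HTTPServer" -> "HTTP_Server".
    steps.append(re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", steps[-1]))
    # Step 3: lowercase everything; existing underscores are left untouched,
    # so a digit token such as "_123_" stays separate.
    steps.append(steps[-1].lower())
    return steps


for step in snake_case_steps("MY_123_ENUM"):
    print(step)
```

Because existing underscores are never removed in this sketch, `MY_123_ENUM` ends up as `my_123_enum`, and each intermediate string is available for debugging.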

D-Walther · 2025-09-05T08:43:10Z

Building on top of this, I've experimented a bit with splitting numbers from words as well. I think I've found a fairly intuitive approach, but it relies on a few assumptions and there could be edge-cases I've not thought of. So I'd be interested in your opinion on it before I open a PR. @hit9
Here is my proposal.

hit9 · 2025-09-20T09:12:19Z

Thanks!

However, there are still some bad cases:

In [4]: snake_case("HTTPServer")
Out[4]: 'httpserver'               # expected: 'http_server'; worked before but fails now

In [5]: snake_case("getHTTPResponseCode")
Out[5]: 'get_httpresponse_code'    # expected: 'get_http_response_code'; worked before but fails now

In [7]: snake_case("Snake42Case")
Out[7]: 'snake42case'              # expected: 'snake_42_case'; fails both before ('snake42_case') and after the change

In [15]: snake_case("__init__")
Out[15]: 'init'                    # expected: '__init__' (unchanged)

In [16]: snake_case("GPU3DModel")
Out[16]: 'gpu3dmodel'              # expected: 'gpu_3_d_model' (ideally 'gpu_3d_model', but that is hard for a non-human program, and rules of thumb are innumerable)
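
The cases reported above could be pinned down as a small regression table. The sketch below is an assumption about how one might do that, not part of the PR: `REPORTED_CASES` and `failing_cases` are hypothetical names, and the expected values are copied from the comment.

```python
from typing import Callable

# (input, expected) pairs taken from the review comment.
REPORTED_CASES = [
    ("HTTPServer", "http_server"),
    ("getHTTPResponseCode", "get_http_response_code"),
    ("Snake42Case", "snake_42_case"),
    ("__init__", "__init__"),
    ("GPU3DModel", "gpu_3_d_model"),
]


def failing_cases(convert: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Return (input, expected, actual) for every reported case `convert` gets wrong."""
    failures = []
    for raw, expected in REPORTED_CASES:
        actual = convert(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


# Stand-in converter for demonstration; pass the real snake_case function instead.
for raw, expected, actual in failing_cases(str.lower):
    print(f"{raw!r}: expected {expected!r}, got {actual!r}")
```

An empty list from `failing_cases` would mean the converter handles all five reported inputs.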

Also, thanks for the proposal — I agree with the principles.

One design approach that ChatGPT and I came up with is this:

Snake case: practical rules (with one example each)

- Camel boundaries: split at …x|X…, …0|X…, and …X|y…
  e.g.: snakeCase → snake_case
- Letter↔digit boundaries: split at …a|1… and …1|a…
  e.g.: snake2Word → snake_2_word
- Respect explicit tokenization: if the original has both _ and mixed case, don't split letters/digits inside those tokens
  e.g.: MyMessage_v1 → my_message_v1
- ALL-UPPER tokens are atomic for digits: for ^[A-Z0-9]+$ tokens, don't split letters and digits
  e.g.: TI82 → ti82
- Preserve edges; normalize core: keep leading/trailing _, convert - to _, collapse internal __+, then lowercase
  e.g.: __Init__ → __init__
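
The five rules above can be sketched roughly as follows. This is one possible reading, not the implementation in #75: the function name `snake_case_sketch` is hypothetical, the "…X|y…" boundary is interpreted as "split an acronym from a following capitalized word", and "tokens" is taken to mean the underscore-delimited pieces of the original name.

```python
import re

_ALL_UPPER = re.compile(r"[A-Z0-9]+")
# Zero-width positions between a letter and a digit, in either order.
_LETTER_DIGIT = re.compile(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])")


def snake_case_sketch(name: str) -> str:
    """Illustrative interpretation of the five rules; assumptions noted inline."""
    # Preserve edges: set leading/trailing underscores aside.
    lead, core, trail = re.fullmatch(r"(_*)(.*?)(_*)", name, flags=re.DOTALL).groups()
    # Normalize core: hyphens become underscores.
    core = core.replace("-", "_")
    # Explicit tokenization: the name has "_" and mixes upper and lower case.
    explicit = (
        "_" in core
        and any(c.islower() for c in core)
        and any(c.isupper() for c in core)
    )
    parts = []
    for token in core.split("_"):
        if not token:  # skipping empty pieces collapses internal "__"
            continue
        # ALL-UPPER tokens (and explicitly tokenized names) keep digits attached.
        keep_digits = explicit or _ALL_UPPER.fullmatch(token) is not None
        # Camel boundaries: x|X and 0|X ...
        token = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", token)
        # ... and acronym|Word (assumed meaning of "X|y").
        token = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", token)
        # Letter/digit boundaries: a|1 and 1|a.
        if not keep_digits:
            token = _LETTER_DIGIT.sub("_", token)
        parts.append(token)
    return lead + "_".join(parts).lower() + trail


for sample in ["snakeCase", "snake2Word", "MyMessage_v1", "TI82", "__Init__"]:
    print(sample, "->", snake_case_sketch(sample))
```

Under these assumptions the sketch reproduces the one example given for each rule, as well as the expected outputs listed for `HTTPServer`, `getHTTPResponseCode`, `Snake42Case`, `__init__`, and `GPU3DModel`. Inputs outside those examples may behave differently from #75.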

The changes based on this approach are implemented in PR #75.

D-Walther added 2 commits September 4, 2025 15:00

Add failing test

0509b26

Fix: numbers in uppercase are merged with previous word. Refactor.

c970668

- E.g. MY_123_ENUM gets split into my123_enum
- Refactor to make it easier to debug the intermediate steps

hit9 merged commit c970668 into hit9:master Sep 20, 2025
10 checks passed

hit9 merged 2 commits into hit9:master from D-Walther:master
