Fix: numbers in snake_case are merged with the preceding word #74

Merged
hit9 merged 2 commits into hit9:master from D-Walther:master
Sep 20, 2025

Conversation

@D-Walther
Contributor

E.g. an enum value called MY_123_ENUM gets split into my123_enum, which is then uppercased to MY123_ENUM.

I refactored the function to be easier to understand and debug, as each processing step can now be inspected individually.

- E.g. MY_123_ENUM gets split into my123_enum
- Refactor to make it easier to debug the intermediate steps
@D-Walther
Contributor Author

Building on top of this, I've experimented a bit with splitting numbers from words as well. I think I've found a fairly intuitive approach, but it relies on a few assumptions and there could be edge cases I've not thought of. So I'd be interested in your opinion on it before I open a PR, @hit9.
Here is my proposal.

@hit9
Owner

hit9 commented Sep 20, 2025

Thanks!

However, there are still some bad cases:

In [4]: snake_case("HTTPServer")
Out[4]: 'httpserver'             # expected: 'http_server'; worked before but fails now

In [5]: snake_case("getHTTPResponseCode")
Out[5]: 'get_httpresponse_code'     # expected: 'get_http_response_code'; worked before but fails now

In [7]: snake_case("Snake42Case")
Out[7]: 'snake42case'   # expected: 'snake_42_case'; fails both before ('snake42_case') and after the change

In [15]: snake_case("__init__")
Out[15]: 'init'  # expected: '__init__' (unchanged)

In [16]: snake_case("GPU3DModel")
Out[16]: 'gpu3dmodel' # expected: 'gpu_3_d_model' (ideally 'gpu_3d_model', but that is hard for a non-human program, and rules of thumb are innumerable)
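
The cases above could be kept as a small regression table. The following is only a sketch: `EXPECTED` and `failing_cases` are hypothetical names that are not part of the library, and `str.lower` is used purely as a stand-in for the real `snake_case` function.

```python
# Expected outputs copied from the cases listed above.
EXPECTED = {
    "HTTPServer": "http_server",
    "getHTTPResponseCode": "get_http_response_code",
    "Snake42Case": "snake_42_case",
    "__init__": "__init__",
    "GPU3DModel": "gpu_3_d_model",
}


def failing_cases(convert):
    """Return (input, got, expected) for every case where convert() disagrees."""
    failures = []
    for name, expected in EXPECTED.items():
        got = convert(name)
        if got != expected:
            failures.append((name, got, expected))
    return failures


if __name__ == "__main__":
    # str.lower is only a placeholder; pass the real snake_case function instead.
    for name, got, expected in failing_cases(str.lower):
        print(f"{name!r}: got {got!r}, expected {expected!r}")
```

Any candidate implementation can be passed to `failing_cases`; an empty list means all five cases match.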

Also, thanks for the proposal — I agree with the principles.

One design approach that ChatGPT and I came up with is this:


Snake case — practical rules (with one example each)

  1. Camel boundaries: split at …x|X…, …0|X…, and …X|y…
    e.g.: snakeCase → snake_case

  2. Letter↔digit boundaries: split at …a|1… and …1|a…
    e.g.: snake2Word → snake_2_word

  3. Respect explicit tokenization: if the original has both _ and mixed case, don’t split letters/digits inside those tokens
    e.g.: MyMessage_v1 → my_message_v1

  4. ALL-UPPER tokens are atomic for digits: for ^[A-Z0-9]+$ tokens, don’t split letters and digits
    e.g.: TI82 → ti82

  5. Preserve edges; normalize core: keep leading/trailing _, convert - to _, collapse internal __+, then lowercase
    e.g.: __Init__ → __init__
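
One possible reading of the five rules above, as a minimal Python sketch. This is not the code from #75: the name `snake_case_sketch`, the regular expressions, and the interpretation of "mixed case" as "contains both an uppercase and a lowercase letter" are all assumptions made for illustration.

```python
import re

# Rule 1: lower-or-digit|Upper, plus the end of an acronym (HTTP|Server).
_CAMEL = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")
# Rule 2: letter|digit and digit|letter.
_LETTER_DIGIT = re.compile(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])")
# Rule 4: tokens made only of uppercase letters and digits.
_ALL_UPPER = re.compile(r"[A-Z0-9]+")


def snake_case_sketch(name: str) -> str:
    """Hypothetical implementation of the five rules; not the library's code."""
    core = name.strip("_")
    if not core:
        return name  # nothing but underscores (or empty): leave untouched

    # Rule 5: remember leading/trailing underscores, normalize '-' to '_'.
    lead = name[: len(name) - len(name.lstrip("_"))]
    trail = name[len(name.rstrip("_")):]
    core = core.replace("-", "_")

    # Rule 3: explicit tokenization = underscores plus mixed case (assumed meaning).
    mixed = any(c.islower() for c in core) and any(c.isupper() for c in core)
    explicit = "_" in core and mixed

    words = []
    for token in core.split("_"):
        if not token:
            continue  # Rule 5: collapse runs of internal underscores
        if _ALL_UPPER.fullmatch(token):
            words.append(token)  # Rule 4: keep e.g. TI82 in one piece
            continue
        token = _CAMEL.sub("_", token)  # Rule 1
        if not explicit:
            token = _LETTER_DIGIT.sub("_", token)  # Rule 2, unless Rule 3 applies
        words.append(token)

    return lead + "_".join(words).lower() + trail
```

Under these assumptions the sketch reproduces the example attached to each rule, and it maps `HTTPServer` to `http_server`, `Snake42Case` to `snake_42_case`, and `MY_123_ENUM` to `my_123_enum`. It still yields `gpu_3_d_model` for `GPU3DModel`, since nothing in the rules identifies `3D` as a unit.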


The changes based on this approach are implemented in this PR: #75

@hit9 hit9 merged commit c970668 into hit9:master Sep 20, 2025
10 checks passed