
Word Frequency CLI Tool

A CLI tool for producing English word-frequency datasets as .csv files from local .txt files.

Existing solutions produced too many nonsense/malformed tokens and included unnecessary metadata in their outputs. I wanted a "batteries-included", easy-to-use word-frequency counter that could be pointed at individual text files.

Note: Claude Code was used extensively for implementation.

Description

  • Uses spaCy's transformer model (en_core_web_trf) for lemmatization, with a custom fallback for known lemmatization errors (see the sketch below)
  • Replaces spaCy's default tokenizer with a custom tokenizer
  • Relies on known English-language heuristics (such as the fun fact that the longest English word, "Pneumonoultramicroscopicsilicovolcanoconiosis", is 45 letters)
  • Removes all tokens containing non-English Unicode characters, such as diacritics
  • Supports batched and parallel processing
  • Aggregates counts on disk using sqlite3
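For illustration, here is a minimal sketch of how lemmatization with an override table and the token filtering described above could look. This is not the project's actual code; the LEMMA_OVERRIDES entries and the exact filter rules are assumptions.

import spacy

MAX_WORD_LEN = 45  # length of the longest English word, per the heuristic above
LEMMA_OVERRIDES = {"ca": "can", "wo": "will"}  # hypothetical fixes for known lemma errors

nlp = spacy.load("en_core_web_trf")

def count_lemmas(text: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for tok in nlp(text):
        lemma = LEMMA_OVERRIDES.get(tok.lemma_.lower(), tok.lemma_.lower())
        # keep only plain ASCII-alphabetic tokens of plausible length
        # (drops diacritics and non-English scripts)
        if lemma.isascii() and lemma.isalpha() and len(lemma) <= MAX_WORD_LEN:
            counts[lemma] = counts.get(lemma, 0) + 1
    return counts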

Installation

Prerequisites

  • Python 3.13+

Setup

pipx install word-frequency

CLI Usage

Basic Usage:

word-frequency --input_filepath="input.txt" --output_filepath="output.csv"

An example word-frequency dataset, derived from a Chinese webfiction novel, is available at data/sample_ebook_word_freq.csv.

Advanced Options:

word-frequency \
    --input_filepath="input.txt" \
    --output_filepath="output.csv" \
    --batch_size=8 \
    --n_process=4 \
    --chunk_size=1000000

Parameters:

  • --input_filepath: Path to the text file to process
  • --output_filepath: Path to the output CSV file
  • --chunk_size: Size of each text chunk in characters (default: 500,000)
  • --batch_size: Number of text chunks to process as a single batch (default: 4)
  • --n_process: Number of parallel processes (default: 2); see the sketch below for how these three parameters fit together
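As a rough illustration only (not the tool's internals; the function name iter_chunks is hypothetical), the three parameters could map onto file chunking and spaCy's nlp.pipe() like this:

import spacy

def iter_chunks(path: str, chunk_size: int = 500_000):
    # Yield successive chunk_size-character slices of the input file
    with open(path, encoding="utf-8") as f:
        while chunk := f.read(chunk_size):
            yield chunk

nlp = spacy.load("en_core_web_trf")
# batch_size chunks are grouped per batch; n_process worker processes run in parallel
for doc in nlp.pipe(iter_chunks("input.txt"), batch_size=4, n_process=2):
    ...  # count lemmas per doc and aggregate the results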

Note on memory pressure:

total_memory_footprint ~= n_process x batch_size x max_seq_len x embedding_dim x 4 bytes (activations)
                          + n_process x model_size (one model copy per process)

where max_seq_len is bounded by chunk_size (characters per chunk).

# spaCy's models represent each token with a 300-dimensional vector of FP32 values (4 bytes each).
# Each process also loads a ~1.5 GB model into memory.
total_memory_footprint_in_gb = (n_process x batch_size x 500,000 x 300 x 4 / (1024 ^ 3)) + (n_process x 1.5)

This will vary by system, but you should leave roughly 6 GB of overhead for the OS. If you run out of memory, reduce batch_size first, then chunk_size.
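Plugging the defaults (n_process=2, batch_size=4, chunk_size=500,000) into the estimate above gives a rough working figure:

n_process, batch_size, chunk_size = 2, 4, 500_000
activations_gb = n_process * batch_size * chunk_size * 300 * 4 / 1024**3  # ~4.5 GB
models_gb = n_process * 1.5                                               # ~3.0 GB
print(round(activations_gb + models_gb, 1))                               # ~7.5 GB total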

Contributing

Contributions are welcome; see CONTRIBUTING.md.
