
Add --tokens and --top-n options with hierarchical tree structure and token-based ordering #31

Merged
simone-viozzi merged 7 commits into main from copilot/fix-29
Jul 31, 2025

Copilot AI commented Jul 31, 2025

This PR adds integrated token counting to the main gpt-copy command. Users can display token counts for each file and directory in the tree, and limit the output to the files with the highest token counts while keeping the hierarchical organization.

New Features

--tokens Option

Display token counts for each file and directory in the tree structure:

gpt-copy . --tokens

Output example:

project (15000 tokens)
├── src/ (12000 tokens)
│   ├── large_file.py (8000 tokens)
│   └── medium_file.py (4000 tokens)
├── docs/ (2500 tokens)
│   └── readme.md (2500 tokens)
└── small_file.py (500 tokens)

--top-n Option

When used with --tokens, show only the top N files by token count while preserving the hierarchical tree structure:

gpt-copy . --tokens --top-n 3

This filters to show only the 3 files with the highest token counts but maintains the directory structure, making it easy to understand where the most content-heavy files are located:

project (15000 tokens)
├── src/ (12000 tokens)
│   ├── large_file.py (8000 tokens)
│   └── medium_file.py (4000 tokens)
└── docs/ (2500 tokens)
    └── readme.md (2500 tokens)

Showing top 3 files by token count

Key Features

  • Hierarchical Structure: Maintains tree organization instead of flattening to a simple list
  • Directory Token Counts: Shows the sum of all files within each directory and its subdirectories
  • Token-Based Sorting: Files within each directory are ordered by token count (highest to lowest)
  • Smart Filtering: When using --top-n, only the top N leaf files are included while preserving their directory context

Integration with Existing Features

The token counting works seamlessly with existing file filtering options:

gpt-copy . --tokens --include "*.py" --top-n 5

Implementation Details

  • Robust Token Counting: Uses OpenAI's tiktoken library with GPT-4o encoding when available, with a fallback to character-based estimation when network access is unavailable
  • Efficient Processing: Only processes files that pass existing filters (gitignore, include/exclude patterns)
  • Hierarchical Display: Maintains tree structure even with --top-n filtering, showing directory token counts and proper nesting
  • Non-Breaking: All existing functionality remains unchanged; new options are purely additive
  • Comprehensive Testing: Includes both unit tests and CLI integration tests with hierarchical structure verification
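A rough sketch of the tiktoken-with-fallback idea follows. The PR names the function `count_tokens_safe()`, but its body is not shown here, so everything below is an assumption: in particular the four-characters-per-token heuristic and the broad `except` are illustrative choices, not the merged implementation.

```python
def count_tokens_safe(text: str) -> int:
    """Count tokens with tiktoken when possible, otherwise estimate from length."""
    try:
        import tiktoken  # optional dependency; may be missing

        # Loading the encoding can trigger a download, which fails offline.
        encoding = tiktoken.encoding_for_model("gpt-4o")
        # disallowed_special=() treats special-token text as ordinary text.
        return len(encoding.encode(text, disallowed_special=()))
    except Exception:
        # Assumed heuristic: roughly four characters per token.
        return len(text) // 4


print(count_tokens_safe("def main():\n    print('hello world')\n"))
```

Catching a broad exception keeps the CLI usable when the encoding file cannot be downloaded, which is the situation the firewall warnings later in this thread describe.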

Technical Changes

  • Added count_tokens_safe() function with tiktoken integration and fallback
  • Added collect_file_info_with_tokens() to gather file information with token counts
  • Added calculate_directory_tokens() to compute directory token sums recursively
  • Modified generate_tree_with_tokens() to maintain hierarchical structure with token-based ordering
  • Updated main CLI interface with new --tokens and --top-n options
  • Added comprehensive test coverage including hierarchical tree structure validation
  • Updated documentation with usage examples
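To make the recursive directory sum concrete, here is a small sketch. The signature of the real `calculate_directory_tokens()` is not shown in this PR description, so the nested-dict tree shape and the name `directory_tokens` are hypothetical.

```python
def directory_tokens(node) -> int:
    """Sum token counts under a node.

    Assumed tree shape: a file is an int (its token count) and a
    directory is a dict mapping child names to child nodes.
    """
    if isinstance(node, int):
        return node
    return sum(directory_tokens(child) for child in node.values())


# The example tree from the description above.
project = {
    "src": {"large_file.py": 8000, "medium_file.py": 4000},
    "docs": {"readme.md": 2500},
    "small_file.py": 500,
}
print(directory_tokens(project))
print(directory_tokens(project["src"]))
```

Each directory's total includes all nested subdirectories because the recursion bottoms out only at files.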

The implementation provides an intuitive way to understand the token distribution across a project's file structure, helping users identify content-heavy files while maintaining the familiar tree organization.

Fixes #29.



Copilot AI changed the title [WIP] add option to display the subtree that holds the most tokens Add --tokens and --top-n options to display token counts in tree structure Jul 31, 2025
Copilot AI requested a review from simone-viozzi July 31, 2025 17:20
@simone-viozzi
Owner

can you rebase on dev?

Contributor Author

Copilot AI commented Jul 31, 2025

can you rebase on dev?

I don't see a dev branch in the repository. The available branches are main and feature branches. Did you mean to rebase on main instead? I can rebase the changes on the latest main branch if that's what you're looking for.

@simone-viozzi
Owner

yes please

Contributor Author

Copilot AI commented Jul 31, 2025

yes please

Rebased successfully on main! The branch now incorporates all the latest changes including the packaging system updates (uv/hatchling). All 3 commits have been cleanly rebased without conflicts. Commit: 4274603

@simone-viozzi
Owner

bump a minor please

Contributor Author

Copilot AI commented Jul 31, 2025

bump a minor please

Version bumped to 2.4.0 in abc984b

Copilot AI and others added 5 commits July 31, 2025 19:37
…cture

Co-authored-by: simone-viozzi <19674168+simone-viozzi@users.noreply.github.com>
…ality

Co-authored-by: simone-viozzi <19674168+simone-viozzi@users.noreply.github.com>
Co-authored-by: simone-viozzi <19674168+simone-viozzi@users.noreply.github.com>
@simone-viozzi
Owner

❯ gpt-copy . --tokens --top-n 10
Starting script for directory: /home/simone/workspace/concatenate-files
Collecting file information with token counts...
Generating folder structure tree with token counts...

Folder Structure

concatenate-files (directory)
├── .gitignore (1145 tokens)
├── readme.md (1318 tokens)
├── uv.lock (57260 tokens)
├── src/ (directory)
│   └── gpt_copy/ (directory)
│       └── gpt_copy.py (5853 tokens)
└── tests/ (directory)
    ├── test_git_tracked.py (1374 tokens)
    ├── test_manual_gitignore.py (1048 tokens)
    ├── test_subfolder_execution.py (561 tokens)
    ├── test_tokens_cli_integration.py (1570 tokens)
    ├── test_tokens_functionality.py (1261 tokens)
    └── test_tree_only.py (780 tokens)

Showing top 10 files by token count

All files merged into stdout

good! it works! but they are not in order!

can you make them in order?

Co-authored-by: simone-viozzi <19674168+simone-viozzi@users.noreply.github.com>
Contributor Author

Copilot AI commented Jul 31, 2025

❯ gpt-copy . --tokens --top-n 10
Starting script for directory: /home/simone/workspace/concatenate-files
Collecting file information with token counts...
...

Fixed! The files are now correctly ordered by token count (highest to lowest) when using --top-n. The issue was that the tree generation was maintaining directory structure which conflicted with token count ordering. Now when --top-n is specified, it shows files in a flat structure ordered by token count as expected. Commit: ef80dce

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • openaipublic.blob.core.windows.net
    • Triggering command: python /tmp/test_tokens_functionality.py (dns block)
    • Triggering command: /usr/bin/python3 /home/REDACTED/.local/bin/gpt-copy . --tokens --top-n 5 (dns block)
    • Triggering command: /usr/bin/python /tmp/tmpsik5ca85.py /tmp/tmpvyynemhe --tokens (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title Add --tokens and --top-n options to display token counts in tree structure Add --tokens and --top-n options with ordered display by token count Jul 31, 2025
@simone-viozzi
Owner

❯ gpt-copy . --tokens --top-n 10
Starting script for directory: /home/simone/workspace/concatenate-files
Collecting file information with token counts...
Generating folder structure tree with token counts...
# Folder Structure

```
concatenate-files (directory)
├── uv.lock (57260 tokens)
├── src/gpt_copy/gpt_copy.py (5954 tokens)
├── tests/test_tokens_cli_integration.py (1570 tokens)
├── tests/test_tokens_functionality.py (1441 tokens)
├── readme.md (1392 tokens)
├── tests/test_git_tracked.py (1374 tokens)
├── .gitignore (1145 tokens)
├── tests/test_manual_gitignore.py (1048 tokens)
├── tests/test_tree_only.py (780 tokens)
└── tests/test_subfolder_execution.py (561 tokens)

Showing top 10 files by token count
```

All files merged into stdout

i see how you implemented this,
but this loses the tree structure

what i would like is to keep the tree and order each subtree by tokens

like this:

concatenate-files (directory)
├── .gitignore (1145 tokens)
├── readme.md (1318 tokens)
├── uv.lock (57260 tokens)
├── src/ (directory) <-- add the tokens of the directory
│   └── gpt_copy/ (directory)
│       └── gpt_copy.py (5853 tokens)
└── tests/ (directory) <-- add the total tokens of the directory
    ├── test_tokens_cli_integration.py (1570 tokens)
    ├── test_git_tracked.py (1374 tokens)
    ├── test_tokens_functionality.py (1261 tokens)
    ├── test_manual_gitignore.py (1048 tokens)
    ├── test_tree_only.py (780 tokens)
    └── test_subfolder_execution.py (561 tokens)

Showing top 10 files by token count

so like, you build the tree with all the nodes that have a token count,
then you sort the tree so that the nodes with the most tokens are at the top

then to cut to a top n, you display only the top n leaves (without counting directory nodes)
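The three steps described above (build the tree, sort each level by tokens, keep only the top n leaves) could be sketched roughly as follows. This is an illustration, not the code that was merged: `build_tree`, `total`, and `render` are hypothetical names, directory totals here count only the files that were kept, and files and directories are sorted together at each level, all of which are assumptions.

```python
def build_tree(file_tokens, n=None):
    """Nest 'a/b/c.py' paths into dicts, keeping only the n largest files."""
    ranked = sorted(file_tokens.items(), key=lambda kv: -kv[1])
    if n is not None:
        ranked = ranked[:n]  # directories never count toward n
    root = {}
    for path, tokens in ranked:
        *dirs, name = path.split("/")
        node = root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = tokens
    return root


def total(node):
    """Token count of a file (int) or the sum over a directory (dict)."""
    return node if isinstance(node, int) else sum(total(c) for c in node.values())


def render(node, prefix=""):
    """Return tree lines with the heaviest child first at every level."""
    lines = []
    children = sorted(node.items(), key=lambda kv: -total(kv[1]))
    for i, (name, child) in enumerate(children):
        last = i == len(children) - 1
        label = name + ("/" if isinstance(child, dict) else "")
        lines.append(f"{prefix}{'└── ' if last else '├── '}{label} ({total(child)} tokens)")
        if isinstance(child, dict):
            lines.extend(render(child, prefix + ("    " if last else "│   ")))
    return lines


files = {
    "uv.lock": 57260,
    "src/gpt_copy/gpt_copy.py": 5853,
    "tests/test_tokens_cli_integration.py": 1570,
    "readme.md": 1318,
    "tests/test_subfolder_execution.py": 561,
}
lines = render(build_tree(files, n=3))
print("\n".join(lines))
```

Pruning before building the tree means no empty directories ever appear, so no separate cleanup pass is needed.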

…-top-n option

Co-authored-by: simone-viozzi <19674168+simone-viozzi@users.noreply.github.com>
Contributor Author

Copilot AI commented Jul 31, 2025

❯ gpt-copy . --tokens --top-n 10
Starting script for directory: /home/simone/workspace/concatenate-files

...

Fixed! The --tokens --top-n option now maintains the hierarchical tree structure as requested:

  • Keeps tree structure instead of flattening to a list
  • Adds directory token counts (sum of all files in that directory and subdirectories)
  • Orders files within each directory by token count (highest to lowest)
  • Shows only top N leaf files while preserving the directory hierarchy

Example output:

gpt-copy (42593 tokens)
├── src/ (7171 tokens)
│   └── gpt_copy/ (7171 tokens)
│       └── gpt_copy.py (7171 tokens)  
├── tests/ (4885 tokens)
│   ├── test_tokens_cli_integration.py (1752 tokens)
│   ├── test_tokens_functionality.py (1713 tokens)
│   └── test_git_tracked.py (1420 tokens)
└── uv.lock (30537 tokens)

The implementation correctly builds the tree with all nodes that have token counts, sorts files within directories by token count (highest first), and when using --top-n, displays only the top N leaf files while maintaining the tree structure.

Commit: 576a4d2

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • openaipublic.blob.core.windows.net
    • Triggering command: /usr/bin/python3 /home/REDACTED/.local/bin/gpt-copy . --tokens --top-n 5 (dns block)
    • Triggering command: /usr/bin/python3 /home/REDACTED/.local/bin/gpt-copy . --tokens (dns block)
    • Triggering command: python -m pytest tests/test_tokens_functionality.py -v (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title Add --tokens and --top-n options with ordered display by token count Add --tokens and --top-n options with hierarchical tree structure and token-based ordering Jul 31, 2025
@simone-viozzi simone-viozzi deleted the copilot/fix-29 branch July 31, 2025 20:15
@simone-viozzi simone-viozzi restored the copilot/fix-29 branch July 31, 2025 20:15
@simone-viozzi simone-viozzi reopened this Jul 31, 2025
@simone-viozzi simone-viozzi marked this pull request as ready for review July 31, 2025 20:15
@simone-viozzi simone-viozzi merged commit 3d7d658 into main Jul 31, 2025
@simone-viozzi simone-viozzi deleted the copilot/fix-29 branch July 31, 2025 20:15

Development

Successfully merging this pull request may close these issues.

add option to display the subtree that holds the most tokens

2 participants