
Performance profiling of zstash update #414

Closed
forsyth2 wants to merge 4 commits into main from zstash-update-performance-profile

Conversation

@forsyth2
Collaborator

@forsyth2 forsyth2 commented Dec 16, 2025

Summary

Objectives:

  • Reduce database time in zstash update

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Big Change

  • To merge, I will use "Create a merge commit". That is, this change is large enough to require multiple units of work (i.e., it should be multiple commits).

1. Does this do what we want it to do?

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added at least one automated test. Every objective above is represented in at least one test.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • Testing: this pull request adds at least one new possible command line option. I have tested using this option with and without any other option that may interact with it.
    • No command line options are added.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zstash/conda, not just an import statement.
    • No new dependencies are added.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.
    • There are no changes to how users interact with zstash. This is only a performance improvement.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commits checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@forsyth2 forsyth2 self-assigned this Dec 16, 2025
@forsyth2 forsyth2 force-pushed the zstash-update-performance-profile branch from 92e0b69 to 40b0b51 on December 16, 2025 00:52
@forsyth2
Collaborator Author

Setup
cd ~/ez/zstash
git status
# On branch issue-374-tar-deletion-rebased20251124
# nothing to commit, working tree clean
git fetch upstream main
git checkout -b zstash-update-performance-profile upstream/main
lcrc_conda # Activate conda. Alternatives: nersc_conda, compy_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-performance-profile-20251215
conda activate zstash-performance-profile-20251215
# Make changes here
pre-commit run --all-files
python -m pip install .
cd tests/integration/bash_tests/run_from_any/
chmod 755 performance_for_update.bash
./performance_for_update.bash --work-dir /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance --num-files 1000 --num-dirs 50

Performance output for ./performance_for_update.bash --work-dir /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance --num-files 1000 --num-dirs 50:

[SUCCESS] Update profiling completed in 0 seconds

[INFO] ==========================================
[INFO] Performance Analysis
[INFO] ==========================================

[INFO] File Gathering Breakdown:
INFO: PERFORMANCE (walk): Completed filesystem walk
INFO: PERFORMANCE (sort): Sorted 2002 entries: 0.00 seconds
INFO: PERFORMANCE (normalize): Normalized paths: 0.00 seconds
INFO: PERFORMANCE (get_files_to_archive): TOTAL TIME: 0.01 seconds

[INFO] Database Comparison Breakdown:
INFO: PERFORMANCE: Total comparison time: 0.22 seconds
INFO: PERFORMANCE: Files checked: 2000
INFO: PERFORMANCE: New files to archive: 1000
INFO: PERFORMANCE: Average rate: 8977.6 files/sec
INFO:   - stat operations: 0.06s (26.9%)
INFO:   - database queries: 0.13s (56.2%)
INFO:   - comparison logic: 0.04s (16.5%)

[INFO] ==========================================
[INFO] Analysis Commands
[INFO] ==========================================
[SUCCESS] View full logs:
  cat /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance/zstash_profile_logs_20251215_184046/create.log       # Initial archive creation
  cat /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance/zstash_profile_logs_20251215_184046/update.log       # Update profiling

[SUCCESS] Extract specific metrics:
  grep 'PERFORMANCE' /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance/zstash_profile_logs_20251215_184046/update.log | less
  grep 'PERFORMANCE (walk)' /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance/zstash_profile_logs_20251215_184046/update.log
  grep 'database queries' /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance/zstash_profile_logs_20251215_184046/update.log

[SUCCESS] Compare times:
  grep 'TOTAL TIME' /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance/zstash_profile_logs_20251215_184046/update.log

[INFO] ==========================================
[WARNING] Cleaning up test data...
[SUCCESS] Test directory removed
[INFO] Logs preserved in: /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance/zstash_profile_logs_20251215_184046

[INFO] ==========================================
[SUCCESS] Profiling Complete!
[INFO] ==========================================
Claude's analysis & suggestions

Excellent! The profiling is working. Even with just 2,000 files, we can already see the bottleneck clearly:

Key Findings from Your Test:

Database queries are the bottleneck - Taking 56.2% of the comparison time (0.13s out of 0.22s total).

With only 2,000 files, this isn't a huge problem (0.22 seconds total). But let's extrapolate to a real simulation:

Scaling Analysis:

Current: 2,000 files → 0.22 seconds (8,978 files/sec)

Projected for real simulations:
- 50,000 files  → ~5.6 seconds
- 100,000 files → ~11 seconds
- 500,000 files → ~56 seconds
- 1,000,000 files → ~1.9 minutes
- 5,000,000 files → ~9.3 minutes

But this assumes linear scaling. In practice, database queries often slow down with larger databases, so the problem gets worse at scale.
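The projections above are just the file count divided by the measured rate. A quick sketch of that arithmetic (assuming, optimistically, that the 8,977.6 files/sec rate from the 2,000-file test holds at scale):

```python
# Rough extrapolation: projected time = file count / measured rate.
# Assumes linear scaling, which (as noted above) is optimistic.
MEASURED_RATE = 8977.6  # files/sec, from the 2,000-file test

def projected_seconds(num_files: int, rate: float = MEASURED_RATE) -> float:
    """Projected database-comparison time for num_files at the given rate."""
    return num_files / rate

for n in (50_000, 100_000, 500_000, 1_000_000, 5_000_000):
    t = projected_seconds(n)
    if t < 60:
        print(f"{n:>9,} files -> ~{t:.1f} seconds")
    else:
        print(f"{n:>9,} files -> ~{t / 60:.1f} minutes")
```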

Recommendations:

1. Test with More Files

Try running with a larger dataset to see if performance degrades non-linearly:

# Medium test
./zstash_profile.sh --work-dir /scratch/myuser --num-files 50000 --num-dirs 500

# Large test  
./zstash_profile.sh --work-dir /scratch/myuser --num-files 100000 --num-dirs 1000

2. Optimization Strategies

Since database queries are 56.2% of the time, here are the best optimizations:

A. Add a database index (easiest fix):

# In update_database(), after opening the database:
cur.execute("CREATE INDEX IF NOT EXISTS idx_files_name ON files(name)")

B. Batch database queries instead of one query per file:

# Instead of querying in a loop, load all relevant data once
cur.execute("SELECT name, size, mtime FROM files")
db_cache = {row[0]: (row[1], row[2]) for row in cur.fetchall()}
# Then check against the in-memory dict

C. Use in-memory database for comparison:

# Load entire files table into memory
# Compare in Python instead of SQLite queries
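Option B can be sketched end-to-end with the standard sqlite3 module. The schema here (a `files` table with `name`, `size`, `mtime`) follows the snippet above and is an assumption about the actual zstash `index.db` layout, not a copy of it:

```python
import sqlite3

# Sketch of option B: one bulk query loads the whole table into a dict,
# so each file lookup is an O(1) dict hit instead of a per-file query.
# Schema is illustrative, not the exact zstash index.db schema.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE files (name TEXT PRIMARY KEY, size INTEGER, mtime REAL)")
cur.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [("a.txt", 10, 1.0), ("b.txt", 20, 2.0)],
)

# One query instead of N single-row queries:
cur.execute("SELECT name, size, mtime FROM files")
db_cache = {name: (size, mtime) for name, size, mtime in cur.fetchall()}

def needs_archive(name, size, mtime):
    """A file is new or changed if it's absent or its (size, mtime) differ."""
    return db_cache.get(name) != (size, mtime)

print(needs_archive("a.txt", 10, 1.0))  # False: unchanged
print(needs_archive("c.txt", 5, 3.0))   # True: not in the database
```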

@chengzhuzhang As predicted, the database querying appears to be the biggest bottleneck. Options for next steps:

  1. I run this profiling on a larger test directory, to gather more information.
  2. Have @wlin7 create a dev environment and do a production run with these performance logs included.
  3. Go ahead and implement one of Claude's proposed optimizations. (Option A seems invalid -- we already have index.db after all...).

Please let me know what you would prefer.

@chengzhuzhang
Collaborator

@forsyth2 thanks for setting this up. The next steps sound reasonable to me. Please start by testing with a real case.

@forsyth2
Collaborator Author

@wlin7 This branch adds performance profiling information. Can you use this branch when next running zstash update for a real use case? Thanks! Directions below:

Set up git

# If you don't already have a zstash repo, get it with:
git clone git@github.com:E3SM-Project/zstash.git

cd zstash # Enter zstash repo.
git status
# Make sure the output includes: nothing to commit, working tree clean
# If not, you'll need to clean up the git workspace first.
git remote -v
# You should see:
# origin	git@github.com:E3SM-Project/zstash.git (fetch)
# origin	git@github.com:E3SM-Project/zstash.git (push)
git fetch origin zstash-update-performance-profile
git checkout -b zstash-update-performance-profile origin/zstash-update-performance-profile # Checkout this PR's branch.

Set up conda

# Activate conda.
# I have a function in ~/.bashrc to do this. You probably have a different method.

# Set up conda environment.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-update-performance-profile-env
conda activate zstash-update-performance-profile-env
pre-commit run --all-files # Optional, but ensures the files are formatted & linted.
python -m pip install . # Install the code of this branch into the newly created environment.

Test

Here, run zstash update as you would normally.

Make sure:

  • You're in the conda environment we created above.
  • You include the -v flag for verbosity.
  • You tee to an update log.
zstash update ... -v 2>&1 | tee "$UPDATE_LOG" # Name the update log whatever you want

Review results

After running zstash update, you can review the update log directly or run a script like the following to summarize the performance profiling:

# print_info is assumed to be defined elsewhere in the script, e.g.:
print_info() { echo "[INFO] $1"; }

# File gathering metrics
if grep -q "PERFORMANCE (get_files_to_archive)" "$UPDATE_LOG"; then
    echo ""
    print_info "File Gathering Breakdown:"
    grep "PERFORMANCE (walk):" "$UPDATE_LOG" | tail -5
    grep "PERFORMANCE (sort):" "$UPDATE_LOG"
    grep "PERFORMANCE (normalize):" "$UPDATE_LOG"
    grep "PERFORMANCE (get_files_to_archive): TOTAL TIME:" "$UPDATE_LOG"
fi

# Database comparison metrics
if grep -q "PERFORMANCE: Database comparison completed" "$UPDATE_LOG"; then
    echo ""
    print_info "Database Comparison Breakdown:"
    grep "PERFORMANCE: Total comparison time:" "$UPDATE_LOG"
    grep "PERFORMANCE: Files checked:" "$UPDATE_LOG"
    grep "PERFORMANCE: New files to archive:" "$UPDATE_LOG"
    grep "PERFORMANCE: Average rate:" "$UPDATE_LOG"
    grep "stat operations:" "$UPDATE_LOG"
    grep "database queries:" "$UPDATE_LOG"
    grep "comparison logic:" "$UPDATE_LOG"
fi

# Overall summary
if grep -q "PERFORMANCE: Update complete - Summary:" "$UPDATE_LOG"; then
    echo ""
    print_info "Overall Time Breakdown:"
    grep "PERFORMANCE: Update complete - Summary:" -A 6 "$UPDATE_LOG" | tail -6
fi

That will produce output similar to the above comment.

@chengzhuzhang
Collaborator

@forsyth2 I mean you can start by testing a real case before handing off to @wlin7, and can attempt fixes. This would be more streamlined than having Wuyin test.

@forsyth2
Collaborator Author

testing a real case

@chengzhuzhang Please define "real case"; that sounds like you mean a production run. Do you just mean running this script with an even larger number of auto-generated files? And/or without the dry-run option?

@chengzhuzhang
Collaborator

chengzhuzhang commented Dec 16, 2025

Maybe I should have clarified that after testing with synthetic data, please test on a real zstash archive.

@forsyth2
Collaborator Author

Set up new script

Setup details

Set up zstash with performance profiling:

cd ~/ez/zstash
git status
# On branch zstash-update-performance-profile
# nothing to commit, working tree clean
git log
# Good, last commit: Performance profiling of zstash update
# Good, matches https://github.com/E3SM-Project/zstash/pull/414/commits
lcrc_conda # Activate conda.
conda activate zstash-performance-profile-20251215
python -m pip install .

Make a new testing script:

cd tests/integration/bash_tests/run_from_any/
touch performance_update_existing_archive.bash
# Make edits to `performance_update_existing_archive.bash`
chmod 755 performance_update_existing_archive.bash

cd ~/ez/zstash
git add -A # Add new test script to git
pre-commit run --all-files # Iterate until passing
python -m pip install .

The new script was included in the second commit.

Find/make an archive to update

We have a few requirements:

  1. I need to have write permissions for DIR_THAT_WAS_ARCHIVED (i.e. src directory), so I can add/update files.
  2. I need to have write permissions for EXISTING_ARCHIVE (i.e., dst directory), so zstash update can function.
  3. It needs to be a "real zstash archive", meaning actual production data, not just files with toy data.
  4. It can't be a directory that's actually being used (e.g., by scientists or for zppy testing), because we'd be modifying files for no valid reason, which could interfere with runs. This would seem to directly conflict with point (3).
Details of creating an archive to update

How about we just create a new archive from scratch that looks more "real" than the toy data for testing?
We know from the token timeout testing that:

du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T	/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201

We certainly don't want to transfer all that, but this is the only directory I have write permissions for that is in fact simulation data -- i.e., presumably a close enough substitute for "real" production data.

Let's archive just a small portion of that, update some files, and then run zstash update.

cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
ls
du -sh archive
# 12T	archive
du -sh post
# 166G	post
du -sh run
# 11G	run
du -sh init
# 6.9G	init
du -sh build
# 1.2G	build
du -sh case_scripts
# 1.5M	case_scripts
du -sh mpas-ocean
# 147K	mpas-ocean
du -sh mpas-seaice
# 147K	mpas-seaice

Let's use the build directory as a decent-sized but not overly large directory. Let's copy it so we're not modifying the original. From this comment, we know we can expect a transfer rate of roughly 0.43 TB/hour = 430 GB/hour from LCRC Improv DTN to NERSC HPSS, so 1 GB should be very quick.

cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance
cp -r /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/build build
mkdir cache
mkdir logs

# Set up Globus
# https://app.globus.org/file-manager?two_pane=true > For "Collection", choose LCRC Improv DTN, NERSC HPSS
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all

# On Perlmutter:
# hsi
# mkdir zstash_performance_20251216_try2 # mkdir: /home/f/forsyth/zstash_performance_20251216_try2
# exit

# Create an archive in the first place so we have something to update.
zstash create --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2 --cache=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/cache -v build/ 2>&1 | tee /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs/create.log
# Enter auth code
# ...
# INFO: Globus transfer 0187fe5f-dadd-11f0-a88e-0eb0b913a0ab, from 15288284-7006-4041-ba1a-6b52501e49f1 to 9cd89cfd-6d04-11e5-ba46-22000b92c6ec: zstash_performance_20251216_try2 index succeeded

That took about 15 minutes -- longer than one minute, but still not too bad.

Run the performance profiling

First run details

Now, in performance_update_existing_archive.bash, we can set:

DIR_THAT_WAS_ARCHIVED=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
EXISTING_ARCHIVE=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216

Then:

# Perlmutter:
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar   index.db
# exit

cd /home/ac.forsyth2/ez/zstash/tests/integration/bash_tests/run_from_any
./performance_update_existing_archive.bash
# Need to change some logic in the performance script to have it print the summary.

# Perlmutter
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar   index.db
# exit

# So, `zstash update` completed but 000001.tar wasn't transferred.
# https://app.globus.org/activity doesn't show any hanging activity.
# That's because we set 
# EXISTING_ARCHIVE=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216
# not:
# EXISTING_ARCHIVE=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2
# Perlmutter:
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar   index.db
# exit

cd /home/ac.forsyth2/ez/zstash/tests/integration/bash_tests/run_from_any
./performance_update_existing_archive.bash

# Perlmutter:
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar   000002.tar   index.db
# exit
# Good, the tar was transferred! (000001.tar was skipped because of a path error on the first run)

Output

[SUCCESS] Update profiling completed in 75 seconds

[STEP] Step 3: Analyzing performance metrics from update log...
[INFO] ==========================================
[INFO] Performance Analysis
[INFO] ==========================================

[INFO] File Gathering Breakdown:
INFO: PERFORMANCE (walk): Completed filesystem walk
INFO: PERFORMANCE (sort): Sorted 8073 entries: 0.00 seconds
INFO: PERFORMANCE (normalize): Normalized paths: 0.01 seconds
INFO: PERFORMANCE (get_files_to_archive): TOTAL TIME: 4.05 seconds

[INFO] Database Comparison Breakdown:
INFO: PERFORMANCE: Total comparison time: 34.66 seconds
INFO: PERFORMANCE: Files checked: 8073
INFO: PERFORMANCE: New files to archive: 1000
INFO: PERFORMANCE: Average rate: 232.9 files/sec
INFO:   - stat operations: 30.46s (87.9%)
INFO:   - database queries: 2.24s (6.5%)
INFO:   - comparison logic: 1.95s (5.6%)

[INFO] Overall Time Breakdown:
INFO:   - Database open/config: 0.04s (0.1%)
INFO:   - File gathering: 4.05s (7.0%)
INFO:   - Database comparison: 34.66s (60.0%)
INFO:   - Tar preparation: 0.00s (0.0%)
INFO:   - Add files: 19.06s (33.0%)
INFO:   - TOTAL TIME: 57.82 seconds

[INFO] ==========================================
[INFO] Analysis Commands
[INFO] ==========================================
[SUCCESS] View full logs:
  cat /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs/update.log       # Update profiling

[SUCCESS] Extract specific metrics:
  grep 'PERFORMANCE' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs/update.log | less
  grep 'PERFORMANCE (walk)' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs/update.log
  grep 'database queries' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs/update.log

[SUCCESS] Compare times:
  grep 'TOTAL TIME' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs/update.log

[INFO] ==========================================
[SUCCESS] Profiling Complete!
[INFO] ==========================================

Key takeaways

  • "Database comparison" took 60.0% of "Overall Time"
    • "stat operations" took 87.9% of the time spent on "Database comparison"
  • "Add files" took 33.0% of "Overall Time"
  • "File gathering" took 7.0% of "Overall Time"

@chengzhuzhang Assuming this was a "realistic" test run, we can use this information to optimize zstash update.

What was tested here?

test on a real zstash archive.

@chengzhuzhang Unfortunately, I still found this rather ambiguous. "Real" could take on a number of different meanings:

  • Use an archive on NERSC HPSS that scientists are actually using. (Not a good idea to mess with an archive in active use.)
  • Use an archive that more closely approximates real simulation output. This itself could mean a number of things. The archive either 1) starts off with... or 2) should be updated with...
    • a certain (i.e., realistic) number of files. (Here, we had 8,000 files and added 1,000 more)
    • certain (i.e., realistic) size files.
    • a certain (i.e., realistic) total size of data. (Here, the total data size was a little over 1 GB, and the amount updated was minimal).

The above analysis was completed using these steps:

  1. Create a copy of a subdir of a real simulation directory -- i.e., use real data but not so much as to make testing unwieldy.
  2. Create a zstash archive of that copied subdir.
  3. Update that copied subdir with 1000 auto-generated files. This isn't exactly "realistic" but it's unclear to me what would count as a "realistic" update.
  4. Run zstash update to update the archive of that copied subdir.

Please let me know if this was an appropriate test and if not, which specific steps should be changed. Thanks!

@chengzhuzhang
Collaborator

@forsyth2 this is a good test! And in line with something I would suggest. It clearly shows that database comparison took much longer than file gathering. Within database comparison, stat operations took most of the time; what operation is this exactly?
If possible, can we run a larger test to confirm the timing breakdown? An example would be to add the ./run directory and run zstash update again.

@forsyth2
Collaborator Author

this is a good test! And in line with something I would suggest.

Great, thanks!

Within database comparison, stat operations took most of the time; what operation is this exactly?

That's this block in update.py:

        # Time stat operations
        stat_op_start = time.time()
        statinfo: os.stat_result = os.lstat(file_path)
        mdtime_new: datetime = datetime.utcfromtimestamp(statinfo.st_mtime)
        mode: int = statinfo.st_mode
        # For symbolic links or directories, size should be 0
        size_new: int
        if stat.S_ISLNK(mode) or stat.S_ISDIR(mode):
            size_new = 0
        else:
            size_new = statinfo.st_size
        stat_time += time.time() - stat_op_start

More specifically, the relevant call is to the os.lstat function.

can we run a larger test to confirm the timing breakdown? An example would be to add the ./run directory and run zstash update again.

Yes, I can do that. ./run is about 10x larger than ./build so that would be a good scaling test.

@forsyth2
Collaborator Author

forsyth2 commented Dec 17, 2025

Stat operations took most of the time

Claude confirms my thinking here -- there are just so many small files. We can try to batch those operations. First, though, let me run the large update (adding the run/ subdir) and see whether that changes the profiling information.


The Problem

You're calling os.lstat() 8,073 times in a loop, and it's dominating your performance. At ~3.77ms per call, this is way too slow for a stat operation, which suggests you're likely on a network filesystem (NFS, Lustre, GPFS, etc.) or slow storage.
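The ~3.77 ms figure is just the Test 2 stat total divided by the number of files checked:

```python
# Back-of-envelope check: per-call lstat cost in Test 2
# = total stat time / files checked.
stat_total_s = 30.46   # seconds, from the Test 2 breakdown
files_checked = 8073
per_call_ms = stat_total_s / files_checked * 1000
print(f"~{per_call_ms:.2f} ms per os.lstat call")  # ~3.77 ms
```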

The Solution: Use os.scandir()

Instead of walking the tree with os.walk() and then separately calling lstat() on each file, use os.scandir() which caches the stat information during directory traversal. This is exactly what it was designed for.

@forsyth2
Collaborator Author

Setup for larger test
# We saw previously:
cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh run
# 11G	run

# We want to run `zstash update` (with performance logging) to add this subdir.
# We had set: DIR_THAT_WAS_ARCHIVED=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
# So we're actually going to need to place `run/` inside `build/`,
# but that's fine since this isn't a production use case.

cd ~/ez/zstash
git status
# On branch zstash-update-performance-profile
# nothing to commit, working tree clean
git log
# Add second profiling script
# Performance profiling of zstash update
cd tests/integration/bash_tests/run_from_any/
# Copying will retain the run permissions:
cp performance_update_existing_archive.bash performance_zstash_update_after_manual_change.bash
# Make necessary edits
cd ~/ez/zstash
git add -A # Add new test script to git
lcrc_conda # Activate conda.
conda activate zstash-performance-profile-20251215
pre-commit run --all-files
python -m pip install .

cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
cp -r /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/run/ run/ # Very fast copy of 11GB
du -sh run
# 11G	run
cd ~/ez/zstash/tests/integration/bash_tests/run_from_any/
./performance_zstash_update_after_manual_change.bash
# tee: /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log: No such file or directory
# => Forgot to create the new logs directory
# That's not a huge problem though since it's also printing to the command line
mkdir /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/
touch /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log
# Copy command line output to /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log
cp performance_zstash_update_after_manual_change.bash get_profile_summary_from_log.bash
# Make necessary edits
./get_profile_summary_from_log.bash

Results from larger test

[STEP] Analyzing performance metrics from update log...
[INFO] ==========================================
[INFO] Performance Analysis
[INFO] ==========================================

[INFO] File Gathering Breakdown:
INFO: PERFORMANCE (walk): Completed filesystem walk
INFO: PERFORMANCE (sort): Sorted 8187 entries: 0.00 seconds
INFO: PERFORMANCE (normalize): Normalized paths: 0.01 seconds
INFO: PERFORMANCE (get_files_to_archive): TOTAL TIME: 4.61 seconds

[INFO] Database Comparison Breakdown:
INFO: PERFORMANCE: Total comparison time: 52.90 seconds
INFO: PERFORMANCE: Files checked: 8187
INFO: PERFORMANCE: New files to archive: 8187
INFO: PERFORMANCE: Average rate: 154.8 files/sec
INFO:   - stat operations: 47.91s (90.6%)
INFO:   - database queries: 4.98s (9.4%)
INFO:   - comparison logic: 0.00s (0.0%)

[INFO] Overall Time Breakdown:
INFO:   - Database open/config: 0.06s (0.0%)
INFO:   - File gathering: 4.61s (1.0%)
INFO:   - Database comparison: 52.90s (11.6%)
INFO:   - Tar preparation: 0.00s (0.0%)
INFO:   - Add files: 398.99s (87.4%)
INFO:   - TOTAL TIME: 456.56 seconds

[INFO] ==========================================
[INFO] Analysis Commands
[INFO] ==========================================
[SUCCESS] View full logs:
  cat /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log       # Update profiling

[SUCCESS] Extract specific metrics:
  grep 'PERFORMANCE' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log | less
  grep 'PERFORMANCE (walk)' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log
  grep 'database queries' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log

[SUCCESS] Compare times:
  grep 'TOTAL TIME' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log

[INFO] ==========================================
[SUCCESS] Profiling Complete!
[INFO] ==========================================

Here we see stat operations: 47.91s (90.6%).

Summary of tests

Tests done so far:

  • Test 1: created an archive of 1,000 toy files, then added 1,000 new toy files. Also used --dry-run parameter.
  • Test 2: created an archive from a ~1 GB build/ subdir and then added 1,000 new toy files.
  • Test 3 (this comment): Used the src dir from Test 2 and added the ~11 GB run/ subdir.
| Code block | Test 1 | Test 2 | Test 3 |
| --- | --- | --- | --- |
| **File Gathering Breakdown:** | | | |
| PERFORMANCE (sort): Sorted n entries | 2002 entries, 0.00 seconds | 8073 entries, 0.00 seconds | 8187 entries, 0.00 seconds |
| PERFORMANCE (normalize): Normalized paths | 0.00 seconds | 0.01 seconds | 0.01 seconds |
| PERFORMANCE (get_files_to_archive): TOTAL TIME | 0.01 seconds | 4.05 seconds | 4.61 seconds |
| **Database Comparison Breakdown:** | | | |
| PERFORMANCE: Total comparison time | 0.22 seconds | 34.66 seconds | 52.90 seconds |
| PERFORMANCE: Files checked | 2000 | 8073 | 8187 |
| PERFORMANCE: New files to archive | 1000 | 1000 | 8187 |
| PERFORMANCE: Average rate | 8977.6 files/sec | 232.9 files/sec | 154.8 files/sec |
| stat operations | 0.06s (26.9%) | 30.46s (87.9%) | 47.91s (90.6%) |
| database queries | 0.13s (56.2%) | 2.24s (6.5%) | 4.98s (9.4%) |
| comparison logic | 0.04s (16.5%) | 1.95s (5.6%) | 0.00s (0.0%) |
| **Overall Time Breakdown:** | | | |
| Database open/config | N/A | 0.04s (0.1%) | 0.06s (0.0%) |
| File gathering | N/A | 4.05s (7.0%) | 4.61s (1.0%) |
| Database comparison | N/A | 34.66s (60.0%) | 52.90s (11.6%) |
| Tar preparation | N/A | 0.00s (0.0%) | 0.00s (0.0%) |
| Add files | N/A | 19.06s (33.0%) | 398.99s (87.4%) |
| TOTAL TIME | N/A | 57.82 seconds | 456.56 seconds |
Claude's suggestions based on the above aggregate data

Looking at this performance data, here are the key optimization priorities:

Primary Bottleneck: stat operations (Database Comparison)

The most critical issue is the stat operations during database comparison, which becomes the dominant bottleneck as dataset size grows:

  • Test 1: 0.06s (27%)
  • Test 2: 30.46s (88%)
  • Test 3: 47.91s (91%)

This shows quadratic-like scaling that will only get worse with larger datasets. The stat operations are likely checking file metadata (size, mtime) for every file to determine what needs archiving.

Optimization strategies:

  • Batch stat operations instead of individual calls
  • Cache stat results that are reused
  • Use OS-level optimizations (e.g., os.scandir() which returns stat info during directory traversal)
  • Consider parallelizing stat calls across multiple threads
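The last bullet can be sketched with a standard-library thread pool. This is a sketch only: `stat_many` and the 16-worker count are illustrative assumptions, and the benefit depends on the filesystem (syscalls like os.lstat release the GIL, so on a high-latency network filesystem threads can overlap the round trips):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def stat_many(paths, max_workers=16):
    """Return {path: os.stat_result}, calling os.lstat from a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(os.lstat, paths)))

# Example: stat everything in the current directory.
results = stat_many([entry.path for entry in os.scandir(".")])
print(f"statted {len(results)} paths")
```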

Secondary Bottleneck: Add files (Tar Creation)

This becomes the dominant cost in Test 3 (87.4%, 399 seconds) when archiving large amounts of actual data (8,187 files, ~11 GB).

However, this may be unavoidable I/O cost. Worth investigating:

  • Is compression a bottleneck?
  • Can tar creation be parallelized?
  • Are there inefficient file read patterns?

Lower Priority Optimizations:

Database queries (4-5 seconds in Tests 2-3): Already quite efficient at ~6-9% of runtime. Only optimize if other issues are solved.

File gathering (4-5 seconds): Minimal and scales well. Not a priority.

Summary Recommendation:

Focus on stat operations first - this is where you'll see the biggest gains, especially as datasets grow. The 48 seconds spent on stats in Test 3 is nearly pure overhead that could potentially be reduced by 50-80% with batching and caching strategies.

@forsyth2
Collaborator Author

We can see above that when adding 10 GB to a 1 GB archive (Test 3), "Add files" overtakes "Database comparison" (which includes "stat operations"), which I imagine is expected.

@golaz
Copy link
Collaborator

golaz commented Dec 17, 2025

@forsyth2 : thanks for the nice profiling work. It seems to me that the first level of optimization would be to remove the stat (os.lstat()) operation since it accounts for such a large fraction of the database comparison time. We can probably do that by modifying get_files_to_archive to use os.scandir() and return the file list and their stats that we need.

If we need further improvements, it might also be possible to speed up other aspects of the database comparison.
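A minimal sketch of that idea, for discussion. The `(path, stat_result)` return shape and the function name `gather_files_with_stats` are assumptions about how the refactor might look, not the final API:

```python
import os

# Sketch: traverse with os.scandir() and keep each DirEntry's stat result,
# so the comparison loop needn't call os.lstat per file afterwards.
def gather_files_with_stats(root):
    """Yield (path, os.stat_result) for every entry under root."""
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                # DirEntry caches the stat result; on some platforms
                # (e.g. Windows) no extra system call is needed at all.
                yield entry.path, entry.stat(follow_symlinks=False)
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
```

Even where a per-entry syscall is still required, scandir avoids re-resolving each path and never stats the same entry twice.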

@forsyth2
Collaborator Author

forsyth2 commented Dec 18, 2025

Setup for optimization test
# We saw previously:
cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh init
# 6.9G	init

# We want to run `zstash update` (with performance logging) to add this subdir.
# We had set: DIR_THAT_WAS_ARCHIVED=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
# So we're actually going to need to place `init/` inside `build/`,
# but that's fine since this isn't a production use case.

cd ~/ez/zstash
git status
# On branch zstash-update-performance-profile
# modified:   tests/integration/bash_tests/run_from_any/get_profile_summary_from_log.bash

# Conda env: zstash-performance-profile-20251215

git log
# 3 profiling commits, 3 optimization commits
git add -A
pre-commit run --all-files
python -m pip install .

cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
cp -r /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/init/ init/
du -sh init
# 6.9G	init
mkdir /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization

cd ~/ez/zstash/tests/integration/bash_tests/run_from_any/
# DIR_THAT_WAS_ARCHIVED=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
# EXISTING_ARCHIVE=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2
#
# WORK_DIR=/lcrc/group/e3sm/ac.forsyth2/zstash_performance
# CACHE_DIR=${WORK_DIR}/cache

cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
zstash update --hpss="globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2" --cache="/lcrc/group/e3sm/ac.forsyth2/zstash_performance/cache" -v 2>&1 | tee "/lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log"

cd ~/ez/zstash/tests/integration/bash_tests/run_from_any/
./get_profile_summary_from_log.bash

On Perlmutter:

hsi
ls zstash_performance_20251216_try2/ 
# zstash_performance_20251216_try2/:
# 000000.tar   000002.tar   000003.tar   000004.tar   index.db 

# Test 2 added 000002.tar 
# Test 3 added 000003.tar
# This optimized run added 000004.tar
exit

Results from optimization test

[STEP] Analyzing performance metrics from update log...
[INFO] ==========================================
[INFO] Performance Analysis (OPTIMIZED VERSION)
[INFO] ==========================================

[INFO] File Gathering Breakdown (with stats collection):
INFO: PERFORMANCE (scandir): Completed filesystem walk with stats
INFO: PERFORMANCE (sort): Sorted 8198 entries: 0.00 seconds
INFO: PERFORMANCE (get_files_to_archive_with_stats): TOTAL TIME: 37.99 seconds

[INFO] Database Comparison Breakdown (OPTIMIZED):
INFO: PERFORMANCE: Total comparison time: 0.18 seconds
INFO: PERFORMANCE: Files checked: 8198
INFO: PERFORMANCE: New files to archive: 152
INFO: PERFORMANCE: Average rate: 44863.1 files/sec
INFO:   - database load: 0.18s (97.6%)
INFO:   - comparison (in-memory): 0.00s (2.3%)

[INFO] Optimization Impact:
INFO:   - stat operations eliminated: 8198 (100%)
INFO:   - All stats performed during initial filesystem walk

[INFO] Overall Time Breakdown:
INFO:   - Database open/config: 0.10s (0.0%)
INFO:   - File gathering: 37.99s (9.6%)
INFO:   - Database comparison: 0.18s (0.0%)
INFO:   - Tar preparation: 0.00s (0.0%)
INFO:   - Add files: 356.99s (90.3%)
INFO:   - TOTAL TIME: 395.26 seconds

[INFO] ==========================================
[INFO] Analysis Commands
[INFO] ==========================================
[SUCCESS] View full logs:
  cat /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log       # Update profiling

[SUCCESS] Extract specific metrics:
  grep 'PERFORMANCE' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log | less
  grep 'PERFORMANCE (scandir)' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log
  grep 'database load' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log
  grep 'Optimization impact' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log

[SUCCESS] Compare times:
  grep 'TOTAL TIME' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log

[SUCCESS] View optimization benefits:
  grep 'stat operations eliminated' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log

[INFO] ==========================================
[SUCCESS] Profiling Complete!
[INFO] ==========================================

Review

@chengzhuzhang @golaz The optimizations are covered by the last 4 commits; they can be viewed in aggregate here. Can you do a high-level review of Claude's changes to update.py and utils.py?

If you agree with the general implementation logic, I can go ahead with:

  • Cleaning up the code (one thing I want to change is bringing a nested function out).
  • Running existing tests to make sure nothing breaks.
  • Turning performance logging statements into debug lines so they can be turned on/off based on logging level.
  • Consolidating the performance profiling scripts and placing them under a new performance/ directory, since they're not actually functionality tests like everything else under tests/.

@golaz
Collaborator

golaz commented Dec 18, 2025

Took a very brief initial look and left some comments. The structure of the changes looks reasonable, but it would be useful if you walked us through the changes once you understand them better yourself.

@chengzhuzhang
Collaborator

@forsyth2 Copilot helped summarize that the code change contains two mechanisms for boosting performance:

  1. os.scandir() retrieves both file paths and metadata (size, mtime) in a single pass while walking directories. By combining path traversal and stat retrieval, redundant system calls (and their overhead) are eliminated.
  2. Minimized database queries. All database rows are preloaded into an in-memory dictionary in a single bulk operation. File paths are compared directly in memory, which is much faster than per-file database lookups.

This seems reasonable, but I didn't review the code line by line. Please review and test on your side to confirm both the correctness and the performance results.

@forsyth2
Collaborator Author

Yes, thank you both, I just wanted high level approvals before proceeding with further improvements. I will move on to the next steps now.

@forsyth2
Collaborator Author

Update 2025-12-18 end-of-day:

Test 5 profiling results

[INFO] ==========================================
[INFO] Performance Analysis (OPTIMIZED VERSION)
[INFO] ==========================================

[INFO] File Gathering Breakdown (with stats collection):
DEBUG: PERFORMANCE (scandir): Completed filesystem walk with stats
DEBUG: PERFORMANCE (sort): Sorted 7087 entries: 0.00 seconds
DEBUG: PERFORMANCE (get_files_to_archive_with_stats): TOTAL TIME: 43.42 seconds

[INFO] Database Comparison Breakdown (OPTIMIZED):
DEBUG: PERFORMANCE: Total comparison time: 0.08 seconds
DEBUG: PERFORMANCE: Files checked: 7087
DEBUG: PERFORMANCE: New files to archive: 14
DEBUG: PERFORMANCE: Average rate: 86685.4 files/sec
DEBUG:   - database load: 0.08s (95.9%)
DEBUG:   - comparison (in-memory): 0.00s (4.0%)

[INFO] Optimization Impact:
DEBUG:   - stat operations eliminated: 7087 (100%)
DEBUG:   - All stats performed during initial filesystem walk

[INFO] Overall Time Breakdown:
DEBUG:   - Database open/config: 0.04s (0.0%)
DEBUG:   - File gathering: 43.42s (19.2%)
DEBUG:   - Database comparison: 0.08s (0.0%)
DEBUG:   - Tar preparation: 0.00s (0.0%)
DEBUG:   - Add files: 182.81s (80.8%)
DEBUG:   - TOTAL TIME: 226.36 seconds

[INFO] ==========================================
[INFO] Analysis Commands
[INFO] ==========================================
[SUCCESS] View full logs:
  cat /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs//create.log       # Create profiling
  cat /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs//update.log       # Update profiling

[SUCCESS] Extract specific metrics:
  grep 'PERFORMANCE' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs/update.log | less
  grep 'PERFORMANCE (scandir)' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs/update.log
  grep 'database load' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs/update.log
  grep 'Optimization impact' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs/update.log

[SUCCESS] Compare times:
  grep 'TOTAL TIME' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs/update.log

[SUCCESS] View optimization benefits:
  grep 'stat operations eliminated' /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs/update.log


[INFO] ==========================================
[SUCCESS] Profiling Complete!
[INFO] ==========================================
[INFO] Working directory: /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/
[INFO] Logs available at: /lcrc/group/e3sm/ac.forsyth2/zstash_performance/profile_update_20251218_run2/logs/

Overall time breakdown

| Test | Archived... | Added... | File gathering | Database comparison | Add files |
| --- | --- | --- | --- | --- | --- |
| Test 1 | 1,000 toy files | 1,000 toy files | 0.01s | 0.22s | N/A |
| Test 2 | ~1 GB build/ | 1,000 toy files | 4.05s (7.0%) | 34.66s (60.0%) | 19.06s (33.0%) |
| Test 3 | ~1 GB from Test 2 | ~11 GB run/ | 4.61s (1.0%) | 52.90s (11.6%) | 398.99s (87.4%) |
| Test 4 (optimized code, specifically 7th commit "Modify profiling script for optimizations") | ~12 GB from Test 3 | ~7 GB init/ | 37.99s (9.6%) | 0.18s (0.0%) | 356.99s (90.3%) |
| Test 5 (see above; optimized code, specifically 14th commit "Use logger directly") | ~1 GB build/ | ~7 GB init/ | 43.42s (19.2%) | 0.08s (0.0%) | 182.81s (80.8%) |

Code review guide

Claude's guide:


Claude's walk-through

Code Review Guide: Zstash Update Performance Optimization

Executive Summary

This diff addresses a critical performance bottleneck in zstash update where the operation spent hours (or days for large simulations) identifying files that need archiving. The optimization achieves dramatic speedup by eliminating redundant filesystem stat operations through a single-pass directory walk that collects all file metadata upfront.

The Core Problem

Original Workflow (Inefficient)

  1. Walk filesystem to get list of file paths
  2. For each file in the list:
    • Query database to check if file exists
    • Call os.lstat() to get size/mtime
    • Compare with database values
    • Problem: This results in N stat operations for N files

Optimized Workflow

  1. Walk filesystem once using os.scandir(), collecting paths + stats simultaneously
  2. Load entire database into memory (single operation)
  3. Compare in-memory structures (no I/O)
  4. Result: Zero redundant stat operations, O(1) database lookups

Key Optimization Strategies

1. Collect Stats During Filesystem Walk (utils.py)

Before:

# Collected only paths
for root, dirnames, filenames in os.walk("."):
    for filename in filenames:
        file_tuples.append((root, filename))

After:

# DirectoryScanner class uses os.scandir()
stat_info = entry.stat(follow_symlinks=False)
size = 0 if stat_module.S_ISLNK(mode) else stat_info.st_size
mtime = datetime.utcfromtimestamp(stat_info.st_mtime)
self.file_stats[normalized_path] = (size, mtime)

Why it matters:

  • os.scandir() provides stat info "for free" during directory traversal
  • Eliminates the need to stat files again during database comparison
  • For 100,000 files, this saves 100,000 os.lstat() system calls

2. In-Memory Database Comparison (update.py)

Before:

for file_path in files:
    statinfo = os.lstat(file_path)  # ❌ Stat operation per file
    cur.execute("select * from files where name = ?", (file_path,))  # ❌ DB query per file
    while True:
        match = cur.fetchone()
        # ... comparison logic

After:

# Load database once into dictionary
archived_files: Dict[str, Tuple[int, datetime]] = {}
cur.execute("SELECT name, size, mtime FROM files")
for name, size, mtime in cur.fetchall():
    archived_files[name] = (size, mtime)

# Compare using pre-collected stats
for file_path in files:
    size_new, mdtime_new = file_stats[file_path]  # ✓ No I/O
    if file_path not in archived_files:  # ✓ O(1) lookup
        newfiles.append(file_path)

Why it matters:

  • Database queries reduced from N queries to 1 query
  • Dictionary lookup is O(1) vs O(log N) for database query
  • All data in RAM = orders of magnitude faster than disk I/O
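
The in-memory comparison described above can be sketched end-to-end as follows. The function name and `TIME_TOL` value are illustrative stand-ins, not zstash's actual identifiers:

```python
from datetime import datetime
from typing import Dict, List, Tuple

TIME_TOL = 1.0  # seconds; stand-in for zstash's actual tolerance constant


def find_new_files(
    file_stats: Dict[str, Tuple[int, datetime]],
    archived_files: Dict[str, Tuple[int, datetime]],
) -> List[str]:
    """Return paths that are new or changed relative to the archive."""
    newfiles = []
    for path, (size_new, mtime_new) in file_stats.items():
        if path not in archived_files:
            newfiles.append(path)  # never archived before
            continue
        size_old, mtime_old = archived_files[path]
        # Flag the file when the size differs or the mtime moved by more
        # than the tolerance; otherwise it is considered unchanged.
        if size_new != size_old or abs((mtime_new - mtime_old).total_seconds()) > TIME_TOL:
            newfiles.append(path)
    return newfiles
```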

Minimal Codebase Impact

Design Principle: Backward Compatibility

The refactor maintains the existing API surface:

# New optimized version
def get_files_to_archive_with_stats(cache, include, exclude) -> Dict[str, Tuple[int, datetime]]

# Legacy wrapper for backward compatibility
def get_files_to_archive(cache, include, exclude) -> List[str]:
    file_stats = get_files_to_archive_with_stats(cache, include, exclude)
    return list(file_stats.keys())

Benefit: Any code still using get_files_to_archive() continues to work unchanged while benefiting from the optimization internally.

Surgical Changes to update_database()

Only three sections modified in the critical path:

  1. File gathering (lines 388-392):

    - files = get_files_to_archive(cache, args.include, args.exclude)
    + file_stats = get_files_to_archive_with_stats(cache, args.include, args.exclude)
    + files = list(file_stats.keys())
  2. Database loading (lines 399-415):

    • Single SELECT query loads all archived files into dictionary
  3. Comparison loop (lines 420-448):

    • Replaced per-file stat + DB query with dictionary lookups

Result: Core logic remains unchanged; only data access patterns optimized.


Performance Instrumentation

Comprehensive Logging Without Runtime Overhead

The diff adds two performance logger classes that track timing at key phases:

FileGatheringPerformanceLogger (utils.py)

  • Tracks filesystem walk time
  • Monitors filtering operations
  • Reports sorting overhead
  • Provides rate metrics (files/sec, dirs/sec)

UpdatePerformanceLogger (update.py)

  • Phases: database open, file gathering, comparison, tar prep, add files
  • Detailed breakdown with percentages
  • Progress logging every 1000 files
  • Optimization impact reporting

Example output:

PERFORMANCE: Database comparison completed (OPTIMIZED)
PERFORMANCE: Total comparison time: 2.34 seconds
PERFORMANCE: Files checked: 50000
PERFORMANCE: Average rate: 21367.5 files/sec
PERFORMANCE: Optimization impact:
  - stat operations eliminated: 50000 (100%)

Profiling Script Integration

The profile_performance_for_update.bash script automates:

  • Fresh Globus authentication setup
  • Directory copying and preparation
  • Timed execution with logging
  • Automatic performance metric extraction

Review focus: Ensures performance measurements are consistent and reproducible across test runs.


Edge Cases & Correctness

1. Empty Directories

if not has_contents and path != ".":
    self.empty_dir_count += 1
    stat_info = os.lstat(path)
    mtime = datetime.utcfromtimestamp(stat_info.st_mtime)
    self.file_stats[normalized_path] = (0, mtime)
  • Maintains original behavior of tracking empty directories
  • Preserves mtime for empty dirs

2. Symbolic Links

stat_info = entry.stat(follow_symlinks=False)  # Matches os.lstat()
if stat_module.S_ISLNK(mode):
    size = 0
  • Uses follow_symlinks=False to match os.lstat() semantics exactly
  • Symbolic links recorded with size=0 (original behavior)

3. Cache Directory Exclusion

if entry.path == self.cache_path or entry.path.startswith(self.cache_path + os.sep):
    continue
  • Skips cache directory during scan (preserves original filtering)
  • Path normalization ensures consistent comparison

4. Duplicate Files in Database

if file_path in archived_files:
    existing_mtime = archived_files[file_path][1]
    if mtime > existing_mtime:
        archived_files[file_path] = (size, mtime)
  • Keeps the latest version if file appears multiple times
  • Maintains data integrity

Testing Strategy

Unit Tests (test_update.py, test_utils.py)

Coverage areas:

  1. Performance logger initialization and timing

    • Validates timing accuracy (within 0.1s tolerance)
    • Tests all logging methods complete without error
  2. In-memory comparison correctness

    • Unchanged files correctly identified
    • Changed files (size/mtime) detected
    • New files properly flagged
    • mtime tolerance (1 hour) respected
  3. File gathering with stats

    • Basic directory structure scanning
    • Include/exclude pattern filtering
    • Sorted order preservation
    • Empty directory handling
    • Symlink processing
    • Cache directory exclusion

Review checklist:

  • Do tests cover the optimization's correctness claims?
  • Are edge cases (empty dirs, symlinks) tested?
  • Does test_in_memory_comparison verify the new comparison logic?
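
A unit test for the tolerance behavior might look like the sketch below; `compare_entry` is a hypothetical stand-in for zstash's real comparison, and the 1-hour `TIME_TOL` follows the tolerance mentioned in the checklist above.

```python
import unittest
from datetime import datetime, timedelta

TIME_TOL = 3600.0  # 1-hour mtime tolerance, per the guide above


def compare_entry(new, old):
    """True if a (size, mtime) pair is considered unchanged (illustrative)."""
    return new[0] == old[0] and abs((new[1] - old[1]).total_seconds()) <= TIME_TOL


class TestMtimeTolerance(unittest.TestCase):
    def test_within_tolerance(self):
        t = datetime(2025, 12, 18, 12, 0, 0)
        self.assertTrue(compare_entry((5, t), (5, t + timedelta(minutes=30))))

    def test_outside_tolerance(self):
        t = datetime(2025, 12, 18, 12, 0, 0)
        self.assertFalse(compare_entry((5, t), (5, t + timedelta(hours=2))))

    def test_size_change(self):
        t = datetime(2025, 12, 18, 12, 0, 0)
        self.assertFalse(compare_entry((5, t), (6, t)))
```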

Expected Performance Gains

Theoretical Analysis

For a simulation with:

  • N = 100,000 files
  • Database query time: ~0.01ms per query
  • Stat operation: ~0.1ms per file

Before optimization:

  • Stat operations: 100,000 × 0.1ms = 10,000ms (10 seconds)
  • DB queries: 100,000 × 0.01ms = 1,000ms (1 second)
  • Total: ~11 seconds (just for comparison)

After optimization:

  • Stat operations: 0 (collected during walk)
  • DB queries: 1 (load entire database)
  • In-memory comparison: 100,000 × 0.000001ms = 0.1ms
  • Total: ~0.1 seconds

Speedup: ~110x for comparison phase alone

Real-World Impact

For the reported issue (hours/days for large simulations):

  • With millions of files, the O(N) stat operations were dominating
  • Filesystem latency amplifies the problem on network storage
  • Optimization reduces comparison from O(N × filesystem_latency) to O(N × memory_access)

Code Review Checklist

Correctness

  • Does DirectoryScanner match os.walk() behavior exactly?
  • Are stat semantics (follow_symlinks=False) preserved?
  • Is mtime tolerance (TIME_TOL) applied correctly?
  • Are empty directories handled identically to original code?
  • Does path normalization prevent duplicate entries?

Performance

  • Are all redundant stat operations eliminated?
  • Is the database loaded into memory in a single query?
  • Are dictionary lookups used for O(1) comparison?
  • Does os.scandir() provide stat info efficiently?
  • Is progress logging frequent enough without overhead?

Maintainability

  • Are performance loggers well-documented?
  • Is the legacy API preserved for backward compatibility?
  • Are variable names clear (file_stats, archived_files)?
  • Is error handling comprehensive (PermissionError, OSError)?
  • Do comments explain why optimizations work?

Testing

  • Do unit tests verify correctness of new comparison logic?
  • Are edge cases (symlinks, empty dirs) tested?
  • Does profiling script enable reproducible benchmarking?
  • Are performance metrics logged at appropriate granularity?

Potential Concerns & Mitigations

1. Memory Usage

Concern: Loading entire database into memory could be problematic for massive archives.

Mitigation:

  • For 1M files, memory overhead is ~100MB (one path string plus a (size, mtime) tuple per entry)
  • Modern systems easily handle this
  • Database was already loaded incrementally in original code
  • Net memory increase is minimal
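
The ~100MB figure can be sanity-checked with a rough, shallow measurement. `approx_entry_bytes` is a hypothetical helper using synthetic paths; it undercounts slightly because it ignores the int and datetime objects inside the tuples.

```python
import sys
from datetime import datetime


def approx_entry_bytes(n=100_000):
    """Shallow per-entry bytes of a {path: (size, mtime)} dict (rough estimate)."""
    now = datetime.now()
    d = {"archive/path/to/file_%07d.nc" % i: (1024, now) for i in range(n)}
    total = sys.getsizeof(d)  # the dict's own table
    total += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())
    return total / n
```
On CPython this typically lands in the low hundreds of bytes per entry, consistent with ~100MB for 1M files.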

2. Dictionary Overhead

Concern: Python dict memory overhead vs database.

Analysis:

  • Dict provides O(1) lookup vs O(log N) for database B-tree
  • Memory trade-off is favorable given speed improvement
  • Database still used for persistence; dict is temporary

3. Code Duplication

Concern: DirectoryScanner reimplements directory walking.

Justification:

  • Necessary to use os.scandir() for stat collection
  • os.walk() doesn't expose stat info during traversal
  • Encapsulation in class keeps complexity contained

Recommendations for Reviewers

  1. Focus on the comparison logic (update.py, lines 420-448)

    • This is where the optimization's correctness is critical
    • Verify the in-memory comparison matches original SQL semantics
  2. Validate test coverage (test_update.py, test_utils.py)

    • Ensure test_in_memory_comparison covers all edge cases
    • Check that test_mtime_tolerance verifies TIME_TOL correctly
  3. Run profiling script on real data

    • Use profile_performance_for_update.bash on actual simulation data
    • Compare before/after metrics to validate claimed speedup
  4. Review performance logging verbosity

    • Ensure debug logs don't spam output in production
    • Verify metrics are actionable for performance tuning
  5. Check backward compatibility

    • Confirm get_files_to_archive() still works for existing callers
    • Test with include/exclude patterns to ensure filtering unchanged

Conclusion

This optimization achieves dramatic performance improvements (potentially 100x+ for large simulations) through a fundamental algorithmic change: collect file metadata once during the initial walk, then perform all comparisons in memory. The implementation is surgical, maintaining API compatibility while eliminating the O(N) stat operations that caused the hours-long delays reported in the issue.

The comprehensive performance instrumentation ensures the optimization's benefits are measurable and reproducible, while extensive unit tests verify correctness across edge cases.

@forsyth2
Collaborator Author

forsyth2 commented Dec 19, 2025

@chengzhuzhang @golaz The above comment is my latest status on this. I need to take some time to absorb all the information in it, but figured I should make it available in case either of you are interested / want to add suggestions. Thanks!

A few things come to mind so far:

  • It's interesting that File Gathering time jumped from Test 4 to Test 5 -- I don't know if that's because of 1) the sizes of the initial archive vs. what's being added, or 2) code changes between commits 7 and 14.
  • I'm a little concerned about Claude's note re: "Concern: Loading entire database into memory could be problematic for massive archives", but perhaps it's decently mitigated.

Action items for me:

  • Visually inspect code changes in this PR, using Claude's guide in the above comment.
  • Run ./profile_performance_for_update.bash with different-sized initial archives and different-sized additions, to see how performance numbers change.
  • Once confident no more code changes are required, rerun the entire test suite. Note: I did do this earlier, but since then, I have made a few code changes, mostly refactoring (which theoretically should give equivalent behavior).

@forsyth2
Collaborator Author

forsyth2 commented Dec 19, 2025

It's interesting that File Gathering jumped up in time spent for Test 5 from Test 4 -- I don't know if that's because of 1) the sizes of the initial archive vs what's being added, 2) code changes between commits 7 and 14.

Run ./profile_performance_for_update.bash with different-sized initial archives and different-sized additions, to see how performance numbers change.

I'm also going to need to look at number of files being archived, not just the sizes. I assume the performance profiling will depend on both the number and size of files both initially and being added. I was trying to see if Claude could come up with a good analysis, but I think I need to collect the different combinations of data first.

@forsyth2
Collaborator Author

I think I need to collect the different combinations of data first.

I've added an updated performance profiling script. Results below.

========================================
Analyzing: /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/build/
========================================
Total size: 1.2GiB
Number of files: 7046
Average file size: 169KB

========================================
Analyzing: /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/run/
========================================
Total size: 11GiB
Number of files: 111
Average file size: 96MB

========================================
Analyzing: /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/init/
========================================
Total size: 6.9GiB
Number of files: 14
Average file size: 501MB

========================================
SUMMARY
========================================
Cumulative size of all directories: 19GiB
Number of directories: 3
Number of permutations we will test: 3
Space needed: 56GiB

Test configurations:

  • Test 123: create=build/, update1=run/, update2=init/
  • Test 231: create=run/, update1=init/, update2=build/
  • Test 312: create=init/, update1=build/, update2=run/
| Test/Update | File Gathering | Database Comparison | Add Files |
| --- | --- | --- | --- |
| Test_123_Update1 (+run) | 5.59s (2.9%) | 0.04s (0.0%) | 186.16s (97.0%) |
| Test_123_Update2 (+init) | 6.92s (6.0%) | 0.04s (0.0%) | 108.85s (94.0%) |
| Test_231_Update1 (+init) | 0.19s (0.1%) | 0.00s (0.0%) | 139.91s (99.9%) |
| Test_231_Update2 (+build) | 3.68s (4.1%) | 0.02s (0.0%) | 85.11s (95.8%) |
| Test_312_Update1 (+build) | 13.17s (12.0%) | 0.05s (0.0%) | 96.10s (87.9%) |
| Test_312_Update2 (+run) | 11.46s (4.8%) | 0.12s (0.0%) | 224.71s (95.1%) |
Notes from Claude

Looking at your profiling data, here are the key relationships between directory characteristics and performance:

File Gathering Time

  • Strongly correlated with number of files: The build/ directory (7,046 files) consistently takes the longest to gather (3.68s-13.17s), while init/ (14 files) is fastest (0.19s)
  • Weakly affected by total size: Despite run/ being the largest directory (11GiB), it gathers quickly (5.59s-6.92s) because it only has 111 files

Add Files Time

  • Primarily driven by total data size: Larger directories take longer to add regardless of file count
    • run/ (11GiB): 186-225s
    • init/ (7GiB): 109-140s
    • build/ (1.2GiB): 85-96s
  • File count has minimal impact: build/ has roughly 500× more files than init/ (7,046 vs. 14) but adds faster due to much smaller total size

Database Comparison

  • Negligible across all tests (<0.12s), essentially independent of directory characteristics

Bottom line: File gathering is O(n) with file count, while adding files is O(n) with total data size. For your backup operations, total data volume dominates overall runtime (~95% of time), but large file counts will slow down the metadata scanning phase.

@forsyth2 forsyth2 force-pushed the zstash-update-performance-profile branch from d3ae23b to 589ac40 on December 19, 2025 at 23:15
@forsyth2
Collaborator Author

forsyth2 commented Dec 19, 2025

Cleaned up commit history by consolidating related commits; the 4 commits currently are:

  1. Initial performance profiling
  2. Initial optimizations
  3. Tests known to be passing
  4. Further improvements

@forsyth2 forsyth2 added the "semver: small improvement" (will increment patch version) label on Dec 19, 2025
@forsyth2
Collaborator Author

forsyth2 commented Dec 19, 2025

Now that the PR is approaching reviewable status, I have updated the top description to indicate this is no longer just a performance profiling PR that should never be merged.

Collaborator Author

@forsyth2 forsyth2 left a comment


My self-review for this PR

I've only reviewed update.py and utils.py here.

Remaining action items:

  • Have others review the design decisions/architecture changes of update.py and utils.py.
    • Decide if we want to keep the performance logging code & scripts. Pro: available for future use, off by default, Con: adds a lot of superfluous code
  • Self-review non-user-facing code (performance_profiling/, tests/) changes.
  • Rerun the entire test suite.

Claude's code review guide below, based on:

  • the issue description from #409.
  • asking if it's necessary to update create.py as well.
  • the performance results from earlier comment.
  • The diff for update.py and utils.py

Code review guide

Code Review Guide: Zstash Update Performance Optimization

Overview

This diff optimizes the zstash update operation by eliminating redundant filesystem stat calls during database comparison. The key insight is to collect file stats during the initial filesystem walk and reuse them during comparison, rather than calling os.lstat() on every file a second time.

Performance Impact

Based on the test results, the optimizations show:

  • File Gathering: 0.1% - 12% of total time (varies by file count)
  • Database Comparison: ~0.0% of total time (dramatically reduced from previous implementation)
  • Add Files: 87.9% - 99.9% of total time (unchanged, still the bottleneck)

The optimization successfully addresses the file gathering and comparison bottleneck, though archiving itself remains the dominant time consumer.


Key Changes

1. New Function: get_files_to_archive_with_stats() (utils.py)

Purpose: Combines file discovery with stat collection in a single pass.

What to Review:

def get_files_to_archive_with_stats(
    cache: str, include: str, exclude: str
) -> Dict[str, Tuple[int, datetime]]:
  • Check: Returns Dict[str, Tuple[int, datetime]] mapping file_path → (size, mtime)
  • Check: Uses os.scandir() instead of os.walk() for better performance
  • Check: Collects stats during traversal (no second pass needed)
  • ⚠️ Verify: Empty directory handling - ensures they get (0, mtime) entries
  • ⚠️ Verify: Symlink handling - correctly sets size to 0 for symlinks
  • ⚠️ Verify: Cache directory exclusion works correctly

Critical Logic:

# For symbolic links or directories, size should be 0
if stat_module.S_ISLNK(mode):
    size = 0
else:
    size = stat_info.st_size
  • ✅ Matches original behavior in update.py lines 393-397

2. DirectoryScanner Class (utils.py)

Purpose: Encapsulates recursive directory scanning with stat collection.

What to Review:

class DirectoryScanner:
    def scan_directory(self, path: str):
  • Check: Properly handles PermissionError exceptions
  • Check: Skips cache directory correctly
  • Check: Uses entry.stat(follow_symlinks=False) to match os.lstat() behavior
  • ⚠️ Verify: Empty directory detection logic is sound
  • ⚠️ Verify: Path normalization is consistent with original implementation
  • 🔍 Test: Progress logging every 1000 directories doesn't cause performance issues

3. Optimized Database Comparison (update.py)

MAJOR CHANGE: Eliminated the os.lstat() call inside the file comparison loop.

Original Code (lines 210-234):

for file_path in files:
    statinfo: os.stat_result = os.lstat(file_path)  # ← EXPENSIVE!
    mdtime_new: datetime = datetime.utcfromtimestamp(statinfo.st_mtime)
    # ... comparison logic

New Code (lines 420-447):

# Load database once into memory
archived_files: Dict[str, Tuple[int, datetime]] = {}
cur.execute("SELECT name, size, mtime FROM files")
# Build lookup dictionary

# Compare using pre-collected stats
for file_path in files:
    size_new, mdtime_new = file_stats[file_path]  # ← O(1) lookup, no I/O!
    if file_path not in archived_files:
        newfiles.append(file_path)
    # ... comparison logic

What to Review:

  • Check: Database loaded once into memory (O(1) lookups vs O(n) queries)
  • Check: Comparison logic remains identical to original
  • Check: Handles duplicate database entries (keeps latest mtime)
  • ⚠️ Verify: Time tolerance check (TIME_TOL) works identically
  • 🔍 Test: Memory usage acceptable for large databases

Critical Invariant:

# Original comparison
if (size_new == match.size) and (
    abs((mdtime_new - match.mtime).total_seconds()) <= TIME_TOL
):

# New comparison  
if not (
    (size_new == archived_size)
    and (abs((mdtime_new - archived_mtime).total_seconds()) <= TIME_TOL)
):
  • ✅ Logic is equivalent (note the not inversion for newfiles condition)
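
The claimed equivalence can be spot-checked mechanically, including the boundary where the mtime delta equals the tolerance. The names and `TIME_TOL` value below are illustrative stand-ins:

```python
TIME_TOL = 1.0  # illustrative tolerance value


def unchanged(size_new, size_old, dt):
    # Original condition: sizes equal and mtime delta within tolerance
    return (size_new == size_old) and (abs(dt) <= TIME_TOL)


def is_newfile(size_new, size_old, dt):
    # New condition: flag the file when the match condition fails
    return not ((size_new == size_old) and (abs(dt) <= TIME_TOL))


# Spot-check edge cases, including dt exactly at the tolerance boundary
for case in [(10, 10, 0.0), (10, 10, 1.0), (10, 10, 1.0001), (10, 11, 0.0)]:
    assert is_newfile(*case) == (not unchanged(*case))
```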

4. Performance Logging Classes

Two new classes added for detailed performance tracking:

  • UpdatePerformanceLogger (in update.py)
  • FileGatheringPerformanceLogger (in utils.py)

What to Review:

  • Check: Logging only occurs at DEBUG level (no performance impact in production)
  • Check: Progress updates every 1000 items to avoid log spam
  • ⚠️ Verify: Timing measurements don't introduce overhead
  • 💡 Consider: Could this logging be made optional via a flag?

Backward Compatibility

Legacy Function Maintained

def get_files_to_archive(cache: str, include: str, exclude: str) -> List[str]:
    """LEGACY VERSION: For backward compatibility."""
    file_stats = get_files_to_archive_with_stats(cache, include, exclude)
    return list(file_stats.keys())

What to Review:

  • Check: Returns identical output format as before
  • ⚠️ Verify: create.py still works with this function (see question below)

Testing Checklist

Functional Tests

  • Empty directories are correctly identified and archived
  • Symlinks have size=0 in the database
  • Cache directory is properly excluded from archiving
  • Include/exclude patterns work identically to before
  • File ordering is deterministic and matches original
  • Time tolerance (TIME_TOL) comparisons work correctly
  • Database entries with duplicate paths handle latest mtime

Performance Tests

  • File gathering time reduced (confirmed in test results ✅)
  • Database comparison time near-zero (confirmed in test results ✅)
  • Memory usage acceptable for large file sets
  • No performance regression in add_files operation

Edge Cases

  • Permission denied errors handled gracefully
  • Very large file counts (millions of files)
  • Very large databases (millions of archived files)
  • Concurrent file modifications during scanning
  • Files with special characters in names
  • Deep directory hierarchies

Answer to Your Question: Does create.py Need Modification?

Short answer: No immediate modification needed, but it could benefit from optimization.

Analysis:

  1. create.py currently uses the legacy get_files_to_archive() function
  2. The legacy function now calls the optimized get_files_to_archive_with_stats() internally
  3. However, create.py will still call os.lstat() on each file during archiving

Recommendation:

# In create.py, if you want to optimize further:
# Instead of:
files = get_files_to_archive(cache, args.include, args.exclude)

# Consider:
file_stats = get_files_to_archive_with_stats(cache, args.include, args.exclude)
# Then pass file_stats to add_files() and modify add_files() to accept the stats

Why this matters: Based on your test results, the majority of time is spent in add_files (87-99%). If add_files is calling os.lstat() again, you could eliminate those redundant calls too.

Action Item: Review hpss_utils.add_files() to see if it's stating files again. If so, modify it to accept pre-collected stats.


Potential Issues to Watch

1. Memory Usage

archived_files: Dict[str, Tuple[int, datetime]] = {}

For very large databases (millions of files), this dictionary could consume significant memory, on the order of a few hundred bytes per entry. Monitor memory usage in production.
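
A rough back-of-the-envelope estimate (the 80-character path length is an assumption; dict-slot overhead is approximated):

```python
import sys
from datetime import datetime

# Per-entry payload of the archived_files dict: key string plus a
# 2-tuple of (int size, datetime mtime). Path length is assumed ~80.
path = "a" * 80
entry = (123456, datetime(2026, 1, 1))
per_entry = (
    sys.getsizeof(path)
    + sys.getsizeof(entry)
    + sys.getsizeof(entry[0])
    + sys.getsizeof(entry[1])
)
# Add a rough allowance for dict hash-table overhead per slot.
est_total_gb = (per_entry + 100) * 5_000_000 / 1e9
print(f"~{per_entry} bytes payload per entry")
print(f"~{est_total_gb:.1f} GB estimated for 5M entries")
```

On CPython this lands in the low single-digit GB range for 5M entries, which is why monitoring at scale is worthwhile but a streaming rewrite is probably not urgent.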

2. Path Normalization Consistency

normalized_path = os.path.normpath(entry.path)

Ensure all paths are normalized consistently between:

  • File gathering
  • Database storage
  • Database lookup
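
The risk is that the same file reached via different spellings produces different dict keys, so the database lookup silently misses. A minimal check (paths are made up):

```python
import os

# Several spellings of the same relative path must collapse to one key.
variants = [
    "run/./rest/file.nc",
    "run/rest/file.nc",
    "run//rest/file.nc",
    "run/sub/../rest/file.nc",
]
normalized = {os.path.normpath(p) for p in variants}
assert normalized == {os.path.join("run", "rest", "file.nc")}
```

If the database was populated by an older zstash version with a different normalization, lookups against freshly normalized paths could still miss; that is the case worth auditing.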

3. Empty Directory Mtime Fallback

except (OSError, PermissionError):
    # Fallback if we can't stat the directory
    self.file_stats[normalized_path] = (0, datetime.utcnow())

Using datetime.utcnow() as a fallback could cause issues. Consider whether this should fail instead or use a different sentinel value.

4. Duplicate Database Entries

if file_path in archived_files:
    existing_mtime = archived_files[file_path][1]
    if mtime > existing_mtime:
        archived_files[file_path] = (size, mtime)

This assumes keeping the latest mtime is correct. Verify this matches the original behavior when the database has duplicates.


Recommendations

High Priority

  1. ✅ Test with actual large E3SM datasets (millions of files)
  2. ✅ Monitor memory usage with large databases
  3. ⚠️ Add unit tests for edge cases (empty dirs, symlinks, permissions)
  4. ⚠️ Document the performance characteristics in user documentation

Medium Priority

  1. 💡 Consider optimizing add_files() to accept pre-collected stats
  2. 💡 Add configuration flag to enable/disable detailed performance logging
  3. 💡 Consider adding a database index on the name column if not present

Low Priority

  1. 💡 Profile memory usage and consider streaming approach for very large datasets
  2. 💡 Add regression tests comparing old vs new output for identical inputs

Summary

This is a well-designed optimization that successfully addresses the performance bottleneck in file gathering and database comparison. The code maintains backward compatibility while achieving significant speedups. The main concerns are memory usage at scale and ensuring the modified create.py workflow is also considered for optimization.

Approval Recommendation: ✅ Approve with minor testing verification of edge cases.

# Get actual mtime for empty directory
try:
    stat_info = os.lstat(path)
    mtime = datetime.utcfromtimestamp(stat_info.st_mtime)

⚠️ Verify: Empty directory handling - ensures they get (0, mtime) entries


# For symbolic links or directories, size should be 0
if stat_module.S_ISLNK(mode):
    size = 0

⚠️ Verify: Symlink handling - correctly sets size to 0 for symlinks

Comment on lines +142 to +145
# Skip the cache directory entirely
if entry.path == self.cache_path or entry.path.startswith(
    self.cache_path + os.sep
):

⚠️ Verify: Cache directory exclusion works correctly

Comment on lines +192 to +197
# Progress logging every 1000 directories
if self.dir_count % 1000 == 0:
    elapsed = time.time() - self.walk_start_time
    self.perf_logger.log_scandir_progress(
        self.dir_count, self.file_count, elapsed
    )

🔍 Test: Progress logging every 1000 directories doesn't cause performance issues

✅ All of the logger statements are only run in -v mode. (verbose mode sets logging level to debug). That said, see point in this review's top-level comment: "Decide if we want to keep the performance logging code & scripts"

Comment on lines +440 to +442
if not (
    (size_new == archived_size)
    and (abs((mdtime_new - archived_mtime).total_seconds()) <= TIME_TOL)

⚠️ Verify: Time tolerance check (TIME_TOL) works identically

Previously: Looped through requested files, exiting early if the TIME_TOL condition was True, in which case we set new = False and file_path was not appended to newfiles.

Now: we already have the corresponding rows from the files table, so for each requested file we simply check whether the TIME_TOL condition is False, in which case we append `file_path` to `newfiles`.

✅ Thus, we have TIME_TOL True => do NOT append, TIME_TOL False => append. It seems the logic is consistent then.

Comment on lines +413 to +419
# If file appears multiple times, keep the one with latest mtime
if file_path in archived_files:
    existing_mtime = archived_files[file_path][1]
    if mtime > existing_mtime:
        archived_files[file_path] = (size, mtime)
else:
    archived_files[file_path] = (size, mtime)

[ ] Database entries with duplicate paths handle latest mtime

✅ Presumably this wasn't a problem in the previous implementation because we were handling the files themselves, so of course we were getting their latest accessible mtime.

Comment on lines +487 to +488
# Add files
perf.start_add_files()

[ ] No performance regression in add_files operation

We haven't changed anything about add_files, so I don't imagine this would be an issue.

try:
    entries = list(os.scandir(path))
except PermissionError:
    logger.warning(f"Permission denied: {path}")

[ ] Permission denied errors handled gracefully

Comment on lines +149 to +152
# Get stat info - scandir provides this efficiently
# Use entry.stat(follow_symlinks=False) to match os.lstat() behavior
stat_info = entry.stat(follow_symlinks=False)
mode = stat_info.st_mode

[ ] Concurrent file modifications during scanning
[ ] Files with special characters in names

Presumably these things are handled by stat??

Comment on lines +155 to +157
# Recursively scan subdirectory
self.scan_directory(entry.path)
has_contents = True

[ ] Deep directory hierarchies

This recursive call should handle deep hierarchies fine, though pathologically deep trees could approach Python's default recursion limit (~1000 frames).

@forsyth2

forsyth2 commented Dec 22, 2025

@chengzhuzhang @golaz Please review just the changes in zstash/update.py & zstash/utils.py. EDIT: It would also be good to take a look at the performance results above.

Design decisions questions I have for you:

  1. Should we also update create to have these performance improvements? (We could also decide to implement that later on, not part of this PR).
  2. Should we keep the profiling logs & scripts? Pro: available for future use, off by default, Con: adds a lot of superfluous code

The key logic changes to review:

  • zstash/update.py update_database() lines 405-447: Get all the files at once from the database, and use a Python dict to access them.
  • zstash/utils.py DirectoryScanner lines 113-190: Class with a method that scans directories recursively. This logic replaces the original, slower lstat functionality.
  • zstash/utils.py get_files_to_archive_with_stats() lines 263-341: re-implementation of the original get_files_to_archive to use the directory-scanning technique.

Remaining action items for me, after your review:

  • Address your code review comments
  • Self-review non-user-facing code (performance_profiling/, tests/) changes.
  • Rerun the entire test suite.

@forsyth2 forsyth2 marked this pull request as ready for review December 22, 2025 19:39
@chengzhuzhang

Thank you @forsyth2 ! Let's use the EZ meeting this afternoon to walk through this PR.

@chengzhuzhang

@forsyth2 I'm trying to find the performance comparison: performance data before this PR vs. after. Which table(s) should I look at?

@forsyth2

performance data before this PR vs after

For that, you'll want to look at this comment "Overall time breakdown". Tests 1-3 are before; tests 4-5 are after.


The comment I referenced in

It would also be good to take a look at the performance results above.

is for different combinations of number of files & file sizes, all on the post-performance-update code.

@forsyth2 forsyth2 left a comment

Action items:

  • Add same performance code to before-version. Then, do apples-to-apples comparison.
  • Do memory profiling too
    • How efficient is the dictionary access? Does it remain O(1) lookup even on a large number of keys?
  • For every test, start from the same identical index.db, Then, we're only looking at the changes from update.
    • Retrieve from an existing simulation. Will need an existing archive and extract it to disk. run zstash update, discard index.db, replace with old one. Restore index.db before each test.
    • Eventually, we can scale up to 20 TB test. @wlin7 can run that test.
  • Try running the same tests multiple times, to get an average.
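
The dictionary-access question above can be sanity-checked with a quick micro-benchmark (a sketch, not part of the PR; key names and probe counts are arbitrary):

```python
import time

def avg_lookup_ns(n, probes=200_000):
    """Average dict lookup time, in ns, for a dict with n keys."""
    d = {f"path/{i:07d}.nc": (i, i) for i in range(n)}
    keys = [f"path/{i % n:07d}.nc" for i in range(probes)]
    t0 = time.perf_counter()
    for k in keys:
        d[k]
    return (time.perf_counter() - t0) / probes * 1e9

# Expect the same order of magnitude for 1K vs 1M keys (hash table,
# amortized O(1)); any slowdown at 1M is cache effects, not algorithm.
small = avg_lookup_ns(1_000)
large = avg_lookup_ns(1_000_000)
print(f"1K keys: {small:.0f} ns/lookup; 1M keys: {large:.0f} ns/lookup")
```

Timings are machine-dependent, but the ratio should stay small even at millions of keys.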

@forsyth2

Performance comparison

Methodology

I created a new branch zstash-issue409-after-changes that took the changes from this PR and removed any superfluous changes. See diff here.

I also created a branch that applies the same simple performance checking code to the main branch, zstash-issue409-before-changes. See diff here

To profile, I used the script that I've included in both of those branches. To have some consistency with previous profile runs, I used the same three subdirectories:

  • build/ -- a lot of small files
  • init/ -- a few large files
  • run/ -- a mid-number of mid-size files; largest total directory size of the 3

There are 12 test cases: for each of those 3 subdirs, create an archive and update it with one of the other subdirs, then test with both hpss=none and with Globus. That's 6x2=12 cases.

========================================
Analyzing: /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/build/
========================================
Total size: 1.2GiB
Number of files: 7046
Average file size: 169KB

========================================
Analyzing: /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/init/
========================================
Total size: 6.9GiB
Number of files: 14
Average file size: 501MB

========================================
Analyzing: /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/run/
========================================
Total size: 11GiB
Number of files: 111
Average file size: 96MB

========================================
SUMMARY
========================================
Cumulative size of all directories: 19GiB

Results

Summary tables

Before changes:

| Test Case | Create Dir | Update Dir | HPSS | File Gathering (Time/Memory) | Database Comparison (Time/Memory) | Add Files (Time/Memory) | Total Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| test_create0_update1_hpss0 | build/ | init/ | none | 0.27s / 1.99MB | 8.90s / 0.00MB | 27.08s / 33.60MB | 0m37.463s |
| test_create0_update1_hpss1 | build/ | init/ | globus | 0.25s / 1.99MB | 10.60s / 0.00MB | 80.38s / 33.60MB | 1m45.768s |
| test_create0_update2_hpss0 | build/ | run/ | none | 0.33s / 2.01MB | 9.91s / 0.01MB | 44.07s / 33.68MB | 0m55.759s |
| test_create0_update2_hpss1 | build/ | run/ | globus | 0.29s / 2.01MB | 10.64s / 0.01MB | 188.62s / 33.68MB | 3m34.381s |
| test_create1_update0_hpss0 | init/ | build/ | none | 0.40s / 1.99MB | 6.00s / 0.06MB | 58.49s / 39.60MB | 1m8.566s |
| test_create1_update0_hpss1 | init/ | build/ | globus | 0.42s / 1.99MB | 6.38s / 0.06MB | 85.69s / 39.60MB | 1m48.248s |
| test_create1_update2_hpss0 | init/ | run/ | none | 0.00s / 0.04MB | 0.02s / 0.01MB | 42.69s / 33.68MB | 0m43.102s |
| test_create1_update2_hpss1 | init/ | run/ | globus | 0.00s / 0.04MB | 0.02s / 0.01MB | 176.05s / 33.68MB | 3m8.159s |
| test_create2_update0_hpss0 | run/ | build/ | none | 0.39s / 2.01MB | 6.35s / 0.06MB | 62.33s / 39.60MB | 1m13.135s |
| test_create2_update0_hpss1 | run/ | build/ | globus | 0.30s / 2.01MB | 5.34s / 0.06MB | 84.63s / 39.60MB | 1m45.531s |
| test_create2_update1_hpss0 | run/ | init/ | none | 0.00s / 0.04MB | 0.02s / 0.00MB | 27.87s / 33.60MB | 0m28.145s |
| test_create2_update1_hpss1 | run/ | init/ | globus | 0.00s / 0.04MB | 0.02s / 0.00MB | 131.36s / 33.60MB | 2m23.624s |

After changes:

| Test Case | Create Dir | Update Dir | HPSS | File Gathering (Time/Memory) | Database Comparison (Time/Memory) | Add Files (Time/Memory) | Total Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| test_create0_update1_hpss0 | build/ | init/ | none | 6.27s / 2.58MB | 0.23s / 2.46MB | 28.34s / 33.60MB | 0m36.178s |
| test_create0_update1_hpss1 | build/ | init/ | globus | 4.27s / 2.58MB | 0.24s / 2.46MB | 102.67s / 33.60MB | 2m0.986s |
| test_create0_update2_hpss0 | build/ | run/ | none | 4.31s / 2.60MB | 0.24s / 2.46MB | 43.56s / 33.68MB | 0m49.498s |
| test_create0_update2_hpss1 | build/ | run/ | globus | 5.32s / 2.60MB | 0.25s / 2.46MB | 226.25s / 33.68MB | 4m5.803s |
| test_create1_update0_hpss0 | init/ | build/ | none | 3.48s / 2.58MB | 0.02s / 0.07MB | 63.42s / 39.71MB | 1m9.522s |
| test_create1_update0_hpss1 | init/ | build/ | globus | 3.46s / 2.58MB | 0.05s / 0.07MB | 86.04s / 39.71MB | 1m45.198s |
| test_create1_update2_hpss0 | init/ | run/ | none | 0.00s / 0.13MB | 0.00s / 0.01MB | 42.56s / 33.68MB | 0m43.014s |
| test_create1_update2_hpss1 | init/ | run/ | globus | 0.00s / 0.13MB | 0.00s / 0.01MB | 86.17s / 33.68MB | 1m38.072s |
| test_create2_update0_hpss0 | run/ | build/ | none | 3.36s / 2.60MB | 0.06s / 0.10MB | 62.90s / 39.71MB | 1m10.506s |
| test_create2_update0_hpss1 | run/ | build/ | globus | 3.80s / 2.60MB | 0.02s / 0.10MB | 88.34s / 39.71MB | 1m47.698s |
| test_create2_update1_hpss0 | run/ | init/ | none | 0.00s / 0.14MB | 0.00s / 0.04MB | 27.63s / 33.60MB | 0m27.947s |
| test_create2_update1_hpss1 | run/ | init/ | globus | 0.00s / 0.14MB | 0.00s / 0.04MB | 69.70s / 33.60MB | 1m22.806s |

Visualizations

Claude was able to generate code to visualize this data.

  1. fig1_grouped_bars.png - Time breakdown by phase (before vs after)
  2. fig2_before_after_scatter.png - Before/after comparison scatter plot
  3. fig3_heatmaps.png - Performance matrix heatmaps
  4. fig4_memory_analysis.png - Memory profiling analysis

Fig. 1:

[fig1_grouped_bars.png]

Key takeaways:

  • File gathering after changes never takes longer than database comparisons before changes.
  • Total time decreases greatly in 2 cases (both combinations of init/, the few large files directory, and run/, the largest directory)
  • Total time increases greatly in 2 cases (both where we started with build/ -- a directory with a lot of small files).

Fig. 2:

[fig2_before_after_scatter.png]

Key takeaways:

  • Total time decreases greatly in 2 cases (both combinations of init/, the few large files directory, and run/, the largest directory)
  • Total time increases greatly in 2 cases (both where we started with build/ -- a directory with a lot of small files).

Fig. 3:

[fig3_heatmaps.png]

Key takeaways:

  • I didn't find this one as useful, but it does show the create with build/, update with run/ case really degrades in the after-changes version.

Fig. 4:

[fig4_memory_analysis.png]

Key takeaways:

  • Top left subplot: 4 cases use more memory than others -- all of which involve updating with build/. This implies adding a lot of files will increase the memory usage. That is, the number of files is more important to memory than the size of the files. However, none of the cases show a major change in memory usage from before changes to after changes.
  • Top right subplot: 4 cases do use more memory after than before -- all of which involve updating an already archived build/ directory. This implies a lot of small files already in the archive will require more memory after changes during the database comparison phase than before changes.
  • I found the bottom 2 subplots less useful.

@forsyth2

@chengzhuzhang @golaz I've conducted a comprehensive performance profiling covering 12 test cases before & after code changes. See comment above. I've listed my key takeaways as well.

Note I didn't use the restoring index.db method discussed earlier, because I found it important to vary the contents of the directory that originally exists (i.e., before applying update).

@wlin7 @TonyB9000 You may also be interested in the above performance analysis.

@chengzhuzhang

@forsyth2 thanks for the analysis!
Figure 1 shows that the operations (file gathering and dataset comparison) are faster with this PR, with the largest reduction roughly cutting the time in half (?). The file-addition time is consistent for non-Globus cases. We see more variation in the Globus cases, likely due to uncertainty in network response and other things. The memory profiling is useful and may be more informative when applied to a larger, production-like test. I'm wondering, at this point, if we are ready for such a test? @wlin7 @golaz any advice on next steps?

@forsyth2

Thanks! Just for further clarity, adding some more description:

operations (file gathering and dataset comparison) are faster with this PR

Correct, the green bar (add files) starts sooner in all visible cases.

largest reduction roughly cutting the time in half

Correct, we see a large reduction for test_create1_update2_hpss1 (create with init/, update with run/, use Globus). We can confirm with the total-time column in the tables: total time drops from 3m8.159s to 1m38.072s. That's a 188.159s / 98.072s = 1.919x speedup.
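
That arithmetic, spelled out:

```python
def to_seconds(minutes, seconds):
    """Convert an m/s pair like '3m8.159s' to total seconds."""
    return minutes * 60 + seconds

before = to_seconds(3, 8.159)   # 3m8.159s  -> 188.159 s
after = to_seconds(1, 38.072)   # 1m38.072s ->  98.072 s
speedup = before / after
print(f"{speedup:.3f}x")
```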

The file-addition time is consistent for non-Globus cases.

Correct, the left and right bars (= before & after) for each non-Globus case (from left-to-right: cases 1,3,5,7,9,11) show little variation.

We see more variation in the Globus cases, likely due to uncertainty in network response and other things.

Yes, those factors make it more difficult to directly compare the before/after behavior for Globus. Unfortunately, the Globus cases are the ones we care about the most (to resolve #409).

The memory profiling is useful and may be more informative when applied to a larger, production-like test. I'm wondering, at this point, if we are ready for such a test?

Yes, I would think so too. Indeed, it would be safe to run on a production-like case (as opposed to actual production use), since I haven't run the entire test suite on these specific versions (before-changes with profiling, after-changes with profiling) of the code. That said, clearly these 12 cases run.

@forsyth2

forsyth2 commented Jan 5, 2026

@wlin7 @golaz Happy new year! Following up from the high-res meeting, here is the link to the performance profiling comment above. We can discuss it further at today's EZ meeting as well.

@forsyth2

forsyth2 commented Jan 5, 2026

Action items:

  • Run on production case (e.g., a directory > ~n TB). We could copy a simulation that was archived. We could extract a simulation, add one file, run zstash update, and see how long it takes. In that case, add-files should be trivial. Could run without Globus; we're interested exclusively in the time to find a new file. Can save index.db, remove the new tar, and restore it the second time around, then run the after-case the same way. Start with v3.LR.
  • Confirm the list of files for add-files is the same in the before- and after- cases.

@forsyth2

forsyth2 commented Jan 8, 2026

Performance Comparison 2026-01-08

Methodology

Before- and after- branches

I used the before and after branches from the initial performance profiling:

  • zstash-issue409-after-changes that took the changes from this PR and removed any superfluous changes. See diff here.
  • branch that applies the same simple performance checking code to the main branch, zstash-issue409-before-changes. See diff here

--hpss setting

Since we are interested in the non-"Add Files" time, we don't need to use Globus and can instead focus on a non-Globus case.

Finding data to use


In the initial performance profiling, "Add files" was by far the largest component of the time profile. For this round of performance profiling, we will attempt to shrink it considerably by adding one file to a very large archive -- a v3.LR archive (all listed here).

We'll use v3.LR.amip_0151, which is one of the smaller archives, containing only 6 TB of data. For comparison, the largest directory used in the previous profile was only 11 GiB.

Let's check available space using Perlmutter's showquota:

+-------------+------------+-------------+----------------+------------+-------------+----------------+
| File system | Space used | Space quota | Space used (%) | Inode used | Inode quota | Inode used (%) |
+-------------+------------+-------------+----------------+------------+-------------+----------------+
|        home |   16.99GiB |    40.00GiB |          42.5% |    456.26K |       1.00M |          45.6% |
|    pscratch |  125.92GiB |    20.00TiB |           0.6% |     216.00 |      10.00M |           0.0% |
+-------------+------------+-------------+----------------+------------+-------------+----------------+

We can see that to use 6 TB of space, we must use pscratch, not home. Furthermore, we would not be able to extract an archive larger than ~10 TB, as many of the historical v3.LR runs are, since we'd need to store ~20 TB in total (cache + extraction).

Extracting data


We'll extract the data using zstash from E3SM Unified.

screen # This could run for a while, so let's use `screen`
source ~/.bash_profile # Activate my aliases
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Activate E3SM Unified
cd /pscratch/sd/f/forsyth # We MUST be in `pscratch`
screen -ls # We need to know where this screen is, so we can come back to it.
# There is a screen on:
#         726620.pts-70.login04   (Attached)
# 1 Socket in /run/uscreens/S-forsyth.
mkdir performance_profiling
cd performance_profiling
mkdir cache
mkdir extraction
mkdir logs

zstash settings:

hpss: /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151
cache: /pscratch/sd/f/forsyth/performance_profiling/cache
log: /pscratch/sd/f/forsyth/performance_profiling/logs/extract.log
cd extraction
zstash extract --hpss=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151 --cache=/pscratch/sd/f/forsyth/performance_profiling/cache -v 2>&1 | tee /pscratch/sd/f/forsyth/performance_profiling/logs/extract.log
# CTRL A D

screen -ls
# There is a screen on:
# 	726620.pts-70.login04	(Detached)
# 1 Socket in /run/uscreens/S-forsyth.

# Picking back up 2026-01-06 ##################################################
screen -ls
# No Sockets found in /run/uscreens/S-forsyth.

# We need to get back on the right login node...
hostname
#login17
ssh login04
hostname
# login04
screen -ls
# There is a screen on:
# 	726620.pts-70.login04	(Detached)
# 1 Socket in /run/uscreens/S-forsyth.
tail -n 1 /pscratch/sd/f/forsyth/performance_profiling/logs/extract.log
# INFO: Transferring file from HPSS: /pscratch/sd/f/forsyth/performance_profiling/cache/00000e.tar

# hex to decimal: 00000e -> 14
showquota
# +-------------+------------+-------------+----------------+------------+-------------+----------------+
# | File system | Space used | Space quota | Space used (%) | Inode used | Inode quota | Inode used (%) |
# +-------------+------------+-------------+----------------+------------+-------------+----------------+
# |        home |   16.99GiB |    40.00GiB |          42.5% |    456.26K |       1.00M |          45.6% |
# |    pscratch |    1.90TiB |    20.00TiB |           9.5% |     67.47K |      10.00M |           0.7% |
# +-------------+------------+-------------+----------------+------------+-------------+----------------+

# We're only up to 1.9 TiB after approximately 18 hours.
# We're extracting 6 TB of data, so we're about 1/3 of the way through.
# So, ~36 hours left?

# Picking back up 2026-01-07 ##################################################
screen -ls
# No Sockets found in /run/uscreens/S-forsyth.

# We need to get back on the right login node...
hostname
#login03
ssh login04
hostname
# login04
screen -ls
# There is a screen on:
# 	726620.pts-70.login04	(Detached)
# 1 Socket in /run/uscreens/S-forsyth.
tail -n 1 /pscratch/sd/f/forsyth/performance_profiling/logs/extract.log
# INFO: Transferring file from HPSS: /pscratch/sd/f/forsyth/performance_profiling/cache/00002a.tar

# hex to decimal: 00002a -> 42
showquota
# +-------------+------------+-------------+----------------+------------+-------------+----------------+
# | File system | Space used | Space quota | Space used (%) | Inode used | Inode quota | Inode used (%) |
# +-------------+------------+-------------+----------------+------------+-------------+----------------+
# |        home |   17.10GiB |    40.00GiB |          42.8% |    462.89K |       1.00M |          46.3% |
# |    pscratch |    5.23TiB |    20.00TiB |          26.1% |     71.10K |      10.00M |           0.7% |
# +-------------+------------+-------------+----------------+------------+-------------+----------------+

# We're up to 5.23 TiB, so we're getting close to done.

# End-of-day
tail -n 1 /pscratch/sd/f/forsyth/performance_profiling/logs/extract.log
# INFO: No failures detected when extracting the files. If you have a log file, run "grep -i Exception <log-file>" to double check.
showquota
# +-------------+------------+-------------+----------------+------------+-------------+----------------+
# | File system | Space used | Space quota | Space used (%) | Inode used | Inode quota | Inode used (%) |
# +-------------+------------+-------------+----------------+------------+-------------+----------------+
# |        home |   17.10GiB |    40.00GiB |          42.8% |    462.89K |       1.00M |          46.3% |
# |    pscratch |    5.66TiB |    20.00TiB |          28.3% |     89.62K |      10.00M |           0.9% |
# +-------------+------------+-------------+----------------+------------+-------------+----------------+

# Ok, ready to go. So, roughly 2 days to extract.

# Picking back up 2026-01-08 ##################################################
hostname
# login31
ssh login04
screen -ls
# There is a screen on:
# 	726620.pts-70.login04	(Detached)
# 1 Socket in /run/uscreens/S-forsyth.
screen -R
exit # Exit screen

ls /pscratch/sd/f/forsyth/performance_profiling/cache
# index.db

# Tars are missing because we didn't run with `--keep`.

# Let's review index.db
cd /pscratch/sd/f/forsyth/performance_profiling
mkdir index_copies
cp cache/index.db index_copies/index.db
cd index_copies
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Unified
sqlite3 index.db
sqlite> .tables
config  files   tars  
sqlite> .schema config
CREATE TABLE config (
  arg text primary key,
  value text
);
sqlite> .schema files
CREATE TABLE files (
  id integer primary key,
  name text,
  size integer,
  mtime timestamp,
  md5 text,
  tar text,
  offset integer
);
sqlite> .schema tars
CREATE TABLE tars (
  id integer primary key,
  name text,
  size integer,
  md5 text
);
sqlite> SELECT 'config' as table_name, COUNT(*) as count FROM config
UNION ALL
SELECT 'files', COUNT(*) FROM files
UNION ALL
SELECT 'tars', COUNT(*) FROM tars;
config|3
files|88689
tars|46
sqlite> SELECT * FROM config;
hpss|globus://nersc/~/E3SMv3/AMIP/v3.LR.amip_0151
maxsize|137438953472
path|/lcrc/group/e3sm2/ac.wlin/E3SMv3/AMIP/v3.LR.amip_0151
sqlite> .quit
# Give this index.db a more descriptive name
mv index.db index_after_extraction.db
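
As a reference for what the optimized update path does against this schema, here is a throwaway sketch using an in-memory database (the rows and paths are made up; only the schema mirrors index.db):

```python
import sqlite3
from datetime import datetime

# Build a tiny in-memory files table with the index.db schema above.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "CREATE TABLE files (id integer primary key, name text, size integer,"
    " mtime timestamp, md5 text, tar text, offset integer)"
)
rows = [
    (1, "run/a.nc", 100, "2026-01-01 00:00:00", "x", "000000.tar", 0),
    (2, "run/a.nc", 100, "2026-01-02 00:00:00", "y", "000001.tar", 0),
    (3, "run/b.nc", 200, "2026-01-01 00:00:00", "z", "000000.tar", 512),
]
cur.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# One query, one pass; keep the latest mtime for duplicate names.
archived = {}
for name, size, mtime in cur.execute("SELECT name, size, mtime FROM files"):
    mtime = datetime.strptime(mtime, "%Y-%m-%d %H:%M:%S")
    if name not in archived or mtime > archived[name][1]:
        archived[name] = (size, mtime)

assert len(archived) == 2
assert archived["run/a.nc"][1] == datetime(2026, 1, 2)  # latest kept
```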

Run

Run Before-case


Setup:

cd ~/ez/zstash
git status
# On branch issue-374-refactor-tar-deletion
# nothing to commit, working tree clean

# Before-case
git checkout zstash-issue409-before-changes
git log --oneline | head -n 3
# 4a50a57 Add profiling script
# a6848e9 Add performance profiling for control case
# 01ffe34 Add test for tar deletion (#404)

# Good, matches commits on https://github.com/E3SM-Project/zstash/compare/main...zstash-issue409-before-changes
nersc_conda # Activate conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-profile-before-case-20260108
conda activate zstash-profile-before-case-20260108
pre-commit run --all-files
python -m pip install .

cd /pscratch/sd/f/forsyth/performance_profiling/extraction
ls /pscratch/sd/f/forsyth/performance_profiling/extraction
# archive  build  case_scripts  init  mpas-seaice  post  run  tests

# Let's add a few files
echo "top level" > added_file0.txt
echo "build" > build/added_file1.txt
echo "init" > init/added_file2.txt
echo "run" > run/added_file3.txt

# Run `zstash update` from INSIDE the directory being archived!
zstash update --hpss=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151 --cache=/pscratch/sd/f/forsyth/performance_profiling/cache -v 2>&1 | tee /pscratch/sd/f/forsyth/performance_profiling/logs/before_case_update.log 
# Less than 50 minutes to run

Output:

For help, please see https://e3sm-project.github.io/zstash. Ask questions at https://github.com/E3SM-Project/zstash/discussions/categories/q-a.
DEBUG: Opening index database
DEBUG: Running zstash update
DEBUG: Local path : /lcrc/group/e3sm2/ac.wlin/E3SMv3/AMIP/v3.LR.amip_0151
DEBUG: HPSS path  : /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151
DEBUG: Max size  : 137438953472
DEBUG: Keep local tar files  : False
INFO: Gathering list of files to archive
DEBUG: TIME PROFILE -- FILE GATHERING: 1.76 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=10.10 MB, peak=23.09 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 966.21 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=0.00 MB, peak=0.00 MB
INFO: 20260108_163059_224926: Creating new tar archive 00002e.tar
INFO: Archiving added_file0.txt
INFO: Archiving build/added_file1.txt
INFO: Archiving init/added_file2.txt
INFO: Archiving run/added_file3.txt
DEBUG: 20260108_163059_243794: Closing tar archive 00002e.tar
INFO: 20260108_163059_244562: (add_files): Completed archive file 00002e.tar
INFO: Contents of the cache prior to `hpss_put`: ['index.db', '00002e.tar']
INFO: 20260108_163059_245460: DIVING: (add_files): Calling hpss_put to dispatch archive file 00002e.tar [keep, non_blocking] = [False, False]
INFO: 20260108_163059_245589: in hpss_transfer, prev_transfers is starting as []
INFO: Transferring file to HPSS: /pscratch/sd/f/forsyth/performance_profiling/cache/00002e.tar
ERROR: Error=Transferring file to HPSS: 00002e.tar, Command was `hsi -q "cd /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151; put 00002e.tar"`. This command includes `hsi`. Be sure that you have logged into `hsi`. This command includes `cd`. Check that this directory exists and contains the needed files
DEBUG: stdout:
b''
DEBUG: stderr:
b'*** hpss_Open (create): Access denied [-13: HPSS_EACCES] \n    /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151/00002e.tar\n*** put: Error -1 on transfer. 00002e.tar to /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151/00002e.tar\n'
Traceback (most recent call last):
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/bin/zstash", line 7, in <module>
    sys.exit(main())
             ~~~~^^
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/lib/python3.13/site-packages/zstash/main.py", line 67, in main
    update()
    ~~~~~~^^
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/lib/python3.13/site-packages/zstash/update.py", line 35, in update
    result: Optional[List[str]] = update_database(args, cache)
                                  ~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/lib/python3.13/site-packages/zstash/update.py", line 317, in update_database
    failures = add_files(
        cur,
    ...<8 lines>...
        overwrite_duplicate_tars=args.overwrite_duplicate_tars,
    )
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/lib/python3.13/site-packages/zstash/hpss_utils.py", line 163, in add_files
    hpss_put(hpss, os.path.join(cache, tfname), cache, keep, non_blocking)
    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/lib/python3.13/site-packages/zstash/hpss.py", line 159, in hpss_put
    hpss_transfer(hpss, file_path, "put", cache, keep, non_blocking, is_index)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/lib/python3.13/site-packages/zstash/hpss.py", line 125, in hpss_transfer
    run_command(command, error_str)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/f/forsyth/miniforge3/envs/zstash-profile-before-case-20260108/lib/python3.13/site-packages/zstash/utils.py", line 71, in run_command
    raise RuntimeError(error_str)
RuntimeError: Error=Transferring file to HPSS: 00002e.tar, Command was `hsi -q "cd /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151; put 00002e.tar"`. This command includes `hsi`. Be sure that you have logged into `hsi`. This command includes `cd`. Check that this directory exists and contains the needed files

Of course, we can't add the tar file to the archive on HPSS because we don't have write permissions to /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151!

> hsi -q "ls /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151"
                             
/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151:
000000.tar                   000008.tar                   000010.tar                   000018.tar                   000020.tar                   000028.tar                   
000001.tar                   000009.tar                   000011.tar                   000019.tar                   000021.tar                   000029.tar                   
000002.tar                   00000a.tar                   000012.tar                   00001a.tar                   000022.tar                   00002a.tar                   
000003.tar                   00000b.tar                   000013.tar                   00001b.tar                   000023.tar                   00002b.tar                   
000004.tar                   00000c.tar                   000014.tar                   00001c.tar                   000024.tar                   00002c.tar                   
000005.tar                   00000d.tar                   000015.tar                   00001d.tar                   000025.tar                   00002d.tar                   
000006.tar                   00000e.tar                   000016.tar                   00001e.tar                   000026.tar                   index.db                     
000007.tar                   00000f.tar                   000017.tar                   00001f.tar                   000027.tar                   zstash_create_20241223.log 

We can see that 00002e.tar wasn't added to the HPSS archive, so write-protection was enforced correctly.

To bypass this error, we'd have to run `zstash update --hpss=none`, which requires the tar files to be present locally -- and that in turn would mean rerunning the 2-day extraction step as `zstash extract --keep`. However, we're more interested in the FILE GATHERING and DATABASE COMPARISON steps, and both of those complete before a `--dry-run` would exit. ADD_FILES is the only step we won't be able to profile.

In fact, we don't even need to rerun with `--dry-run` here; we already have the profiling data we need:

DEBUG: TIME PROFILE -- FILE GATHERING: 1.76 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=10.10 MB, peak=23.09 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 966.21 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=0.00 MB, peak=0.00 MB
# Let's review index.db
cd /pscratch/sd/f/forsyth/performance_profiling
cp cache/index.db index_copies/index_after_update_before_case.db
cd index_copies
sqlite3 index_after_update_before_case.db
sqlite> SELECT 'config' as table_name, COUNT(*) as count FROM config
UNION ALL
SELECT 'files', COUNT(*) FROM files
UNION ALL
SELECT 'tars', COUNT(*) FROM tars;
config|3
files|88689
tars|46
sqlite> .quit

The `files` count didn't increase from its post-extraction value: 88689 -> 88689. That makes sense because the file addition never completed; we errored out at:

# hpss.py hpss_transfer():
error_str: str = "Transferring file {} HPSS: {}".format(transfer_word, name)

which occurs before:

# hpss_utils.py add_files():
cur.executemany("insert into files values (NULL,?,?,?,?,?,?)", archived)
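
The ordering matters, and a minimal self-contained sketch shows the same effect (hypothetical table schema and function names, not zstash's actual code): the transfer runs before the insert, so a failed put leaves the table untouched.

```python
import sqlite3

def fake_hpss_put(tar_name: str) -> None:
    # Stand-in for hpss_put(); always fails, like the permission-denied
    # `hsi put` above.
    raise RuntimeError(f"Transferring file to HPSS: {tar_name}")

def add_files_sketch(cur: sqlite3.Cursor, tar_name: str, archived: list) -> None:
    # Transfer first, insert second: when the put raises, the
    # executemany() below never runs.
    fake_hpss_put(tar_name)
    cur.executemany("INSERT INTO files VALUES (NULL, ?, ?)", archived)

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, size INTEGER)")
cur.execute("INSERT INTO files VALUES (NULL, 'existing.txt', 1)")

try:
    add_files_sketch(cur, "00002e.tar", [("added_file0.txt", 10)])
except RuntimeError:
    pass

# Still 1 row: the failed transfer left the files table unchanged,
# matching the 88689 -> 88689 observation above.
count = cur.execute("SELECT COUNT(*) FROM files").fetchone()[0]
print(count)  # 1
```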

Reset for after-case:

# We want to test the exact same file addition, so we don't need to delete the added files.
# We just want to change out the index.db
cd /pscratch/sd/f/forsyth/performance_profiling/
rm cache/index.db
cp index_copies/index_after_extraction.db cache/index.db

Run After-case

cd ~/ez/zstash
git status
# On branch zstash-issue409-before-changes
# nothing to commit, working tree clean

# After-case
git checkout zstash-issue409-after-changes
# We don't have this branch locally yet
git fetch upstream zstash-issue409-after-changes
git checkout -b zstash-issue409-after-changes upstream/zstash-issue409-after-changes
git log --oneline | head -n 7
# 0f6ac29 Add profiling script
# 38b9da1 Streamline performance profiling
# 589ac40 Further improvements
# c33f978 Tests known to be passing
# 5f1f1ed Initial optimizations
# 41ec29e Initial performance profiling
# 01ffe34 Add test for tar deletion (#404)

# Good, matches commits on https://github.com/E3SM-Project/zstash/compare/main...zstash-issue409-after-changes
nersc_conda # Activate conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-profile-after-case-20260108
conda activate zstash-profile-after-case-20260108
pre-commit run --all-files
python -m pip install .

# We already added the files in the before-case
# And restored the index.db.
# So, we can proceed.

cd /pscratch/sd/f/forsyth/performance_profiling/extraction
# Run `zstash update` from INSIDE the directory being archived!
# Use dry-run this time!
zstash update --hpss=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151 --cache=/pscratch/sd/f/forsyth/performance_profiling/cache -v --dry-run 2>&1 | tee /pscratch/sd/f/forsyth/performance_profiling/logs/after_case_update.log # 09:10

Output:

For help, please see https://e3sm-project.github.io/zstash. Ask questions at https://github.com/E3SM-Project/zstash/discussions/categories/q-a.
DEBUG: Opening index database
DEBUG: Running zstash update
DEBUG: Local path : /lcrc/group/e3sm2/ac.wlin/E3SMv3/AMIP/v3.LR.amip_0151
DEBUG: HPSS path  : /home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151
DEBUG: Max size  : 137438953472
DEBUG: Keep local tar files  : False
DEBUG: TIME PROFILE -- FILE GATHERING: 24.24 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=29.26 MB, peak=80.27 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 6.06 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=30.71 MB, peak=32.56 MB
List of files to be updated
added_file0.txt
build/added_file1.txt
init/added_file2.txt
run/added_file3.txt

We have the necessary data:

DEBUG: TIME PROFILE -- FILE GATHERING: 24.24 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=29.26 MB, peak=80.27 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 6.06 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=30.71 MB, peak=32.56 MB

Results

Before-case:

DEBUG: TIME PROFILE -- FILE GATHERING: 1.76 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=10.10 MB, peak=23.09 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 966.21 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=0.00 MB, peak=0.00 MB

After-case:

DEBUG: TIME PROFILE -- FILE GATHERING: 24.24 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=29.26 MB, peak=80.27 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 6.06 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=30.71 MB, peak=32.56 MB
| Step | Time ratio (Before over After) |
|------|--------------------------------|
| FILE GATHERING | 1.76/24.24 => 0.07x faster (i.e., 13.77x slower) |
| DATABASE COMPARISON | 966.21/6.06 => 159.44x faster |
| Together | 967.97/30.3 => 31.95x faster |

So, the file gathering + database comparison step is ~32x faster, based on this profiling.
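As a sanity check on the arithmetic, the ratios above can be recomputed directly from the logged timings:

```python
# Before/after timings from the two runs above
before_fg, before_db = 1.76, 966.21
after_fg, after_db = 24.24, 6.06

fg_ratio = before_fg / after_fg       # < 1: file gathering got slower
fg_slowdown = after_fg / before_fg
db_ratio = before_db / after_db       # database comparison speedup
combined = (before_fg + before_db) / (after_fg + after_db)

print(f"FILE GATHERING: {fg_ratio:.2f}x ({fg_slowdown:.2f}x slower)")
print(f"DATABASE COMPARISON: {db_ratio:.2f}x faster")
print(f"Together: {combined:.2f}x faster")
```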

Next steps

  • Write script to repeat this profiling, but with different types of additions (e.g., lots of small files, a few big files). We're really not profiling memory well by adding only a few files.
  • (optional) I think we most care about file gathering & database comparison. If we do want to profile total time (i.e., including ADD_FILES), I would need to rerun without --dry-run, which would require re-extracting using --keep since I don't have write permissions to the archive on HPSS. showquota has me at 5.66/20.00 TiB, so in theory I have enough space to re-extract a 6 TB directory (plus the tars, so 12 TB total).

@forsyth2

forsyth2 commented Jan 8, 2026

Performance Comparison 2026-01-08 follow-up

Profiling script


Script written by me based on the previous profiling run, then extended by Claude. Note that it is imperfect; several manual steps were needed to fix malformed data summaries.

#!/bin/bash

# Run with:
# screen
# cd /global/homes/f/forsyth/ez/performance_profiling
# source ~/.bashrc
# source ~/.bash_profile
# nersc_conda
# time ./zstash_performance_profiling.bash 2>&1 | tee zstash_performance_profiling.log
# CTRL A D
# tail -f zstash_performance_profiling.log

# Create conda environments by:
# cd ~/ez/zstash
# git checkout <before-branch or after-branch>
# rm -rf build
# conda clean --all --y
# conda env create -f conda/dev.yml -n env-name
# conda activate env-name
# pre-commit run --all-files
# python -m pip install .
BEFORE_CASE_CONDA_ENV=zstash-profile-before-case-20260108
AFTER_CASE_CONDA_ENV=zstash-profile-after-case-20260108

HPSS_PATH=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151

PROFILE_PATH=/pscratch/sd/f/forsyth/performance_profiling/
EXTRACTED_DIRECTORY_PATH=${PROFILE_PATH}extraction/
LOG_PATH=${PROFILE_PATH}logs/
CACHE_PATH=${PROFILE_PATH}cache/
INDEX_COPIES_PATH=${PROFILE_PATH}index_copies/
ORIGINAL_INDEX=${PROFILE_PATH}original_index/index_after_extraction.db
ANALYSIS_PATH=${PROFILE_PATH}analysis/

###############################################################################
source ~/.bashrc
source ~/.bash_profile
nersc_conda # Activate Conda using function defined in ~/.bashrc

# Create analysis directory if it doesn't exist
mkdir -p ${ANALYSIS_PATH}

reset_index()
{
    local case_name="${1}"

    mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_${case_name}.db
    cp ${ORIGINAL_INDEX} ${CACHE_PATH}index.db
}

extract_metrics()
{
    local log_file="${1}"
    local output_file="${2}"
    
    echo "Extracting metrics from ${log_file}..."
    
    # Extract timing data
    file_gathering_time=$(grep "TIME PROFILE -- FILE GATHERING:" ${log_file} | sed -n 's/.*: \([0-9.]*\) seconds/\1/p')
    db_comparison_time=$(grep "TIME PROFILE -- DATABASE COMPARISON:" ${log_file} | sed -n 's/.*: \([0-9.]*\) seconds/\1/p')
    
    # Extract memory data
    file_gathering_mem=$(grep "MEMORY PROFILE -- FILE GATHERING:" ${log_file} | sed -n 's/.*current=\([0-9.]*\) MB, peak=\([0-9.]*\) MB/\1,\2/p')
    db_comparison_mem=$(grep "MEMORY PROFILE -- DATABASE COMPARISON:" ${log_file} | sed -n 's/.*current=\([0-9.]*\) MB, peak=\([0-9.]*\) MB/\1,\2/p')
    
    # Extract total time from 'time' command output
    total_time=$(grep "^real" ${log_file} | awk '{print $2}')
    
    # Extract list of files to be updated
    grep -A 1000 "List of files to be updated" ${log_file} | tail -n +2 | grep -v "^$" | grep -v "^DEBUG:" | grep -v "^INFO:" > ${output_file}.files
    
    # Write metrics to output file
    cat > ${output_file} << EOF
FILE_GATHERING_TIME=${file_gathering_time}
DB_COMPARISON_TIME=${db_comparison_time}
FILE_GATHERING_MEM=${file_gathering_mem}
DB_COMPARISON_MEM=${db_comparison_mem}
TOTAL_TIME=${total_time}
EOF
    
    echo "Metrics extracted to ${output_file}"
}

compare_and_report()
{
    local case_name="${1}"
    local before_metrics="${ANALYSIS_PATH}${case_name}_before_metrics.txt"
    local after_metrics="${ANALYSIS_PATH}${case_name}_after_metrics.txt"
    local before_files="${ANALYSIS_PATH}${case_name}_before_metrics.txt.files"
    local after_files="${ANALYSIS_PATH}${case_name}_after_metrics.txt.files"
    local report="${ANALYSIS_PATH}${case_name}_report.txt"
    
    echo "Generating comparison report..."
    
    # Source the metrics files
    source ${before_metrics}
    before_fg_time=$FILE_GATHERING_TIME
    before_db_time=$DB_COMPARISON_TIME
    before_fg_mem=$FILE_GATHERING_MEM
    before_db_mem=$DB_COMPARISON_MEM
    before_total=$TOTAL_TIME
    
    source ${after_metrics}
    after_fg_time=$FILE_GATHERING_TIME
    after_db_time=$DB_COMPARISON_TIME
    after_fg_mem=$FILE_GATHERING_MEM
    after_db_mem=$DB_COMPARISON_MEM
    after_total=$TOTAL_TIME
    
    # Calculate total processing time (sum of two steps)
    before_total_steps=$(echo "$before_fg_time + $before_db_time" | bc)
    after_total_steps=$(echo "$after_fg_time + $after_db_time" | bc)
    
    # Generate report
    cat > ${report} << EOF
========================================
Performance Comparison Report
Case: ${case_name}
========================================

TIMING COMPARISON
-----------------

| Step | Before | After | Ratio |
|------|--------|-------|-------|
| FILE GATHERING | ${before_fg_time}s | ${after_fg_time}s | $(echo "scale=2; $before_fg_time / $after_fg_time" | bc)x $(if (( $(echo "$before_fg_time > $after_fg_time" | bc -l) )); then echo "slower"; else echo "faster"; fi) |
| DATABASE COMPARISON | ${before_db_time}s | ${after_db_time}s | $(echo "scale=2; $before_db_time / $after_db_time" | bc)x $(if (( $(echo "$before_db_time > $after_db_time" | bc -l) )); then echo "slower"; else echo "faster"; fi) |
| Together | ${before_total_steps}s | ${after_total_steps}s | $(echo "scale=2; $before_total_steps / $after_total_steps" | bc)x $(if (( $(echo "$before_total_steps > $after_total_steps" | bc -l) )); then echo "slower"; else echo "faster"; fi) |
| Total Time | ${before_total} | ${after_total} | - |

MEMORY COMPARISON
-----------------

FILE GATHERING (current, peak):
  Before: ${before_fg_mem} MB
  After:  ${after_fg_mem} MB

DATABASE COMPARISON (current, peak):
  Before: ${before_db_mem} MB
  After:  ${after_db_mem} MB

FILE LIST COMPARISON
--------------------

EOF
    
    # Compare file lists
    if diff -q ${before_files} ${after_files} > /dev/null 2>&1; then
        echo "✓ File lists MATCH" >> ${report}
    else
        echo "✗ File lists DIFFER" >> ${report}
        echo "" >> ${report}
        echo "Differences:" >> ${report}
        diff ${before_files} ${after_files} >> ${report}
    fi
    
    echo "" >> ${report}
    echo "Files identified for update:" >> ${report}
    cat ${after_files} >> ${report}
    
    # Display report
    cat ${report}
    echo ""
    echo "Full report saved to: ${report}"
}

run_profiles()
{
    local case_name="${1}"

    conda activate ${BEFORE_CASE_CONDA_ENV}
    log_file=${LOG_PATH}update_${case_name}_before_changes.log
    cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
    time zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${log_file}
    
    extract_metrics ${log_file} ${ANALYSIS_PATH}${case_name}_before_metrics.txt

    reset_index ${case_name}_before_changes

    conda activate ${AFTER_CASE_CONDA_ENV}
    log_file=${LOG_PATH}update_${case_name}_after_changes.log
    cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
    time zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${log_file}
    
    extract_metrics ${log_file} ${ANALYSIS_PATH}${case_name}_after_metrics.txt

    reset_index ${case_name}_after_changes
    
    # Generate comparison report
    compare_and_report ${case_name}
}

generate_summary_table()
{
    local summary_file="${ANALYSIS_PATH}summary_table.txt"
    
    echo "Generating summary table across all cases..."
    
    # Header
    cat > ${summary_file} << 'EOF'
========================================
PERFORMANCE SUMMARY - ALL CASES
========================================

EOF
    
    # Collect all case names
    cases=($(ls ${ANALYSIS_PATH}*_report.txt 2>/dev/null | sed 's/.*\///;s/_report.txt//'))
    
    if [ ${#cases[@]} -eq 0 ]; then
        echo "No cases found to summarize."
        return
    fi
    
    # Build timing ratio table
    echo "| Step | ${cases[@]} |" >> ${summary_file}
    echo "|------|$(printf '%s|' "${cases[@]/#/---}")" >> ${summary_file}
    
    # FILE GATHERING row
    printf "| FILE GATHERING |" >> ${summary_file}
    for case in "${cases[@]}"; do
        source ${ANALYSIS_PATH}${case}_before_metrics.txt
        before_time=$FILE_GATHERING_TIME
        source ${ANALYSIS_PATH}${case}_after_metrics.txt
        after_time=$FILE_GATHERING_TIME
        
        if [ -n "$before_time" ] && [ -n "$after_time" ] && [ "$after_time" != "0" ]; then
            ratio=$(echo "scale=2; $before_time / $after_time" | bc)
            if (( $(echo "$before_time > $after_time" | bc -l) )); then
                direction="slower"
            else
                direction="faster"
            fi
            printf " ${before_time}/${after_time} = ${ratio}x ${direction} |" >> ${summary_file}
        else
            printf " N/A |" >> ${summary_file}
        fi
    done
    echo "" >> ${summary_file}
    
    # DATABASE COMPARISON row
    printf "| DATABASE COMPARISON |" >> ${summary_file}
    for case in "${cases[@]}"; do
        source ${ANALYSIS_PATH}${case}_before_metrics.txt
        before_time=$DB_COMPARISON_TIME
        source ${ANALYSIS_PATH}${case}_after_metrics.txt
        after_time=$DB_COMPARISON_TIME
        
        if [ -n "$before_time" ] && [ -n "$after_time" ] && [ "$after_time" != "0" ]; then
            ratio=$(echo "scale=2; $before_time / $after_time" | bc)
            if (( $(echo "$before_time > $after_time" | bc -l) )); then
                direction="slower"
            else
                direction="faster"
            fi
            printf " ${before_time}/${after_time} = ${ratio}x ${direction} |" >> ${summary_file}
        else
            printf " N/A |" >> ${summary_file}
        fi
    done
    echo "" >> ${summary_file}
    
    # Together row
    printf "| Together |" >> ${summary_file}
    for case in "${cases[@]}"; do
        source ${ANALYSIS_PATH}${case}_before_metrics.txt
        before_fg=$FILE_GATHERING_TIME
        before_db=$DB_COMPARISON_TIME
        before_total=$(echo "$before_fg + $before_db" | bc)
        
        source ${ANALYSIS_PATH}${case}_after_metrics.txt
        after_fg=$FILE_GATHERING_TIME
        after_db=$DB_COMPARISON_TIME
        after_total=$(echo "$after_fg + $after_db" | bc)
        
        if [ -n "$before_total" ] && [ -n "$after_total" ] && [ "$after_total" != "0" ]; then
            ratio=$(echo "scale=2; $before_total / $after_total" | bc)
            if (( $(echo "$before_total > $after_total" | bc -l) )); then
                direction="slower"
            else
                direction="faster"
            fi
            printf " ${before_total}/${after_total} = ${ratio}x ${direction} |" >> ${summary_file}
        else
            printf " N/A |" >> ${summary_file}
        fi
    done
    echo "" >> ${summary_file}
    
    echo "" >> ${summary_file}
    echo "Individual case reports available at: ${ANALYSIS_PATH}*_report.txt" >> ${summary_file}
    
    # Display summary
    cat ${summary_file}
    echo ""
    echo "Summary table saved to: ${summary_file}"
}

###############################################################################
# Test Cases
###############################################################################

add1()
{
    cd ${EXTRACTED_DIRECTORY_PATH}
    echo "top level" > added_file0.txt
    echo "build" > build/added_file1.txt
    echo "init" > init/added_file2.txt
    echo "run" > run/added_file3.txt
    cd -
}

undo_add1()
{
    cd ${EXTRACTED_DIRECTORY_PATH}
    rm added_file0.txt
    rm build/added_file1.txt
    rm init/added_file2.txt
    rm run/added_file3.txt
    cd -
}

add2()
{
    cd ${EXTRACTED_DIRECTORY_PATH}
    cp -r /global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201/build/ copied_build/
    cd -
}

undo_add2()
{
    cd ${EXTRACTED_DIRECTORY_PATH}
    rm -rf copied_build/
    cd -
}

add3()
{
    cd ${EXTRACTED_DIRECTORY_PATH}
    cp -r /global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201/init/ copied_init/
    cd -
}

undo_add3()
{
    cd ${EXTRACTED_DIRECTORY_PATH}
    rm -rf copied_init/
    cd -
}

# Case 1: Add a few small files
case_name=add_few_small_files
add1
run_profiles ${case_name}
undo_add1

# Case 2: Add many small files (build/)
case_name=add_many_small_files
add2
run_profiles ${case_name}
undo_add2

# Case 3: Add a few large files (init/)
case_name=add_few_large_files
add3
run_profiles ${case_name}
undo_add3

###############################################################################
# Generate final summary table
###############################################################################

generate_summary_table

Results

Performance

Manual fixes to script output
> cd /pscratch/sd/f/forsyth/performance_profiling/analysis
> cat summary_table.txt 
========================================
PERFORMANCE SUMMARY - ALL CASES
========================================

| Step | add_few_large_files add_few_small_files add_many_small_files |
|------|---add_few_large_files|---add_few_small_files|---add_many_small_files|
| FILE GATHERING | 1.26/28.56 = .04x faster | 1.13/34.79 = .03x faster | 2.09/48.07 = .04x faster |
| DATABASE COMPARISON | 1102.95/4.74 = 232.68x slower | 1118.40/4.76 = 234.95x slower | 1244.14/6.54 = 190.23x slower |
| Together | 1104.21/33.30 = 33.15x slower | 1119.53/39.55 = 28.30x slower | 1246.23/54.61 = 22.82x slower |

Individual case reports available at: /pscratch/sd/f/forsyth/performance_profiling/analysis/*_report.txt

Unfortunately, the script's table generation was imperfect: the header expands all case names into a single cell (with a matching malformed separator row), and the faster/slower labels are inverted because the direction test compares the before and after times the wrong way around. No matter -- we can get the data from the logs manually.

> cd /pscratch/sd/f/forsyth/performance_profiling/logs

> cat update_add_few_small_files_before_changes.log
DEBUG: TIME PROFILE -- FILE GATHERING: 1.13 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=10.10 MB, peak=23.09 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 1118.40 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=0.00 MB, peak=0.00 MB
> cat update_add_few_small_files_after_changes.log 
DEBUG: TIME PROFILE -- FILE GATHERING: 34.79 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=29.26 MB, peak=80.27 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 4.76 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=30.71 MB, peak=32.56 MB

> cat update_add_many_small_files_before_changes.log
DEBUG: TIME PROFILE -- FILE GATHERING: 2.09 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=11.11 MB, peak=25.04 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 1244.14 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=0.06 MB, peak=0.06 MB
> cat update_add_many_small_files_after_changes.log
DEBUG: TIME PROFILE -- FILE GATHERING: 48.07 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=31.27 MB, peak=80.26 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 6.54 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=30.77 MB, peak=32.56 MB

> cat update_add_few_large_files_before_changes.log
DEBUG: TIME PROFILE -- FILE GATHERING: 1.26 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=10.10 MB, peak=23.09 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 1102.95 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=0.00 MB, peak=0.00 MB
> cat update_add_few_large_files_after_changes.log
DEBUG: TIME PROFILE -- FILE GATHERING: 28.56 seconds
DEBUG: MEMORY PROFILE -- FILE GATHERING: current=29.26 MB, peak=80.27 MB
DEBUG: TIME PROFILE -- DATABASE COMPARISON: 4.74 seconds
DEBUG: MEMORY PROFILE -- DATABASE COMPARISON: current=30.71 MB, peak=32.56 MB

Let's construct two tables manually:

Time (sec) profile (simple before/after factor):

| Component | add_few_small_files | add_many_small_files | add_few_large_files |
|-----------|---------------------|----------------------|---------------------|
| File gathering | 1.13/34.79 = 0.03x faster (30.79x slower) | 2.09/48.07 = 0.04x faster (23.00x slower) | 1.26/28.56 = 0.04x faster (22.67x slower) |
| Database comparison | 1118.40/4.76 = 234.96x faster | 1244.14/6.54 = 190.24x faster | 1102.95/4.74 = 232.69x faster |
| Both steps | 1119.53/39.55 = 28.30x faster | 1246.23/54.61 = 22.82x faster | 1104.21/33.30 = 33.16x faster |

Memory (at-peak, MB) profile (simple after/before factor, reversed from time profile for readability):

| Component | add_few_small_files | add_many_small_files | add_few_large_files |
|-----------|---------------------|----------------------|---------------------|
| File gathering | 80.27/23.09 = 3.48x more | 80.26/25.04 = 3.21x more | 80.27/23.09 = 3.48x more |
| Database comparison | 32.56/0.00 = undefined | 32.56/0.06 = 542.67x more | 32.56/0.00 = undefined |

We obtained speedups of 22.82x to 33.16x in combined file gathering + database comparison time. The tradeoff is ~3.5x more peak memory during the file gathering phase, plus a consistent 32.56 MB peak during database comparison (up from effectively zero).
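
One plausible pattern behind this time/memory tradeoff (purely an illustration, not taken from zstash's actual code): replacing a per-file SELECT against the files table with a single pass that loads the table into a dict. The per-file version pays one query round trip per file on disk; the dict version pays one full scan plus the memory to hold the rows, which would be consistent with the flat ~32 MB peak seen for DATABASE COMPARISON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE files (name TEXT PRIMARY KEY, size INTEGER, mtime REAL)")
cur.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [(f"file_{i}.nc", i, 0.0) for i in range(1000)],
)

# What's currently on disk: one unchanged file, one new file
on_disk = {"file_3.nc": (3, 0.0), "new_file.nc": (1, 0.0)}

def changed_per_file_query(on_disk):
    # Before-style: one SELECT round trip per file on disk
    changed = []
    for name, (size, mtime) in on_disk.items():
        row = cur.execute(
            "SELECT size, mtime FROM files WHERE name = ?", (name,)
        ).fetchone()
        if row != (size, mtime):
            changed.append(name)
    return changed

def changed_dict_lookup(on_disk):
    # After-style: load the whole table once, then O(1) dict lookups;
    # faster per file, but the dict itself occupies memory
    db_files = {
        name: (size, mtime)
        for name, size, mtime in cur.execute("SELECT name, size, mtime FROM files")
    }
    return [n for n, meta in on_disk.items() if db_files.get(n) != meta]

print(changed_per_file_query(on_disk))  # ['new_file.nc']
print(changed_dict_lookup(on_disk))    # ['new_file.nc']
```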

File lists

cd /pscratch/sd/f/forsyth/performance_profiling/analysis

wc -l add_few_small_files_before_metrics.txt.files # 4
wc -l add_few_small_files_after_metrics.txt.files # 4
diff add_few_small_files_before_metrics.txt.files add_few_small_files_after_metrics.txt.files
# Good, no diff

wc -l add_many_small_files_before_metrics.txt.files # 1000
wc -l add_many_small_files_after_metrics.txt.files # 1000
diff add_many_small_files_before_metrics.txt.files add_many_small_files_after_metrics.txt.files
# Large diff
# Same number of files though. Are they simply in a different order?

wc -l add_few_large_files_before_metrics.txt.files # 14
wc -l add_few_large_files_after_metrics.txt.files # 14
diff add_few_large_files_before_metrics.txt.files add_few_large_files_after_metrics.txt.files
# Good, no diff

Iterating with a file comparison function gets us to:

> compare_files("add_many_small_files_before_metrics.txt.files", "add_many_small_files_after_metrics.txt.files")
Counts: both=50, list1=950, list2=950

Let's try checking the logs themselves. Maybe the script incorrectly filtered files.

cd /pscratch/sd/f/forsyth/performance_profiling/logs
diff update_add_many_small_files_before_changes.log update_add_many_small_files_after_changes.log > diff_many_small_files.txt

Unfortunately, that's still showing a diff.

Let's expand the comparison function to a script:

from pathlib import Path
from collections import Counter

def analyze_paths(file_list, name):
    print(f"\n=== Analysis for {name} ===")
    
    # Get all unique directories
    dirs = set(Path(f).parent.as_posix() for f in file_list)
    print(f"Unique directories: {len(dirs)}")
    
    # Check depth distribution
    depths = Counter(len(Path(f).parts) for f in file_list)
    print(f"Depth distribution: {sorted(depths.items())}")
    
    # Check if files are in CMakeFiles subdirs
    in_cmakefiles = sum(1 for f in file_list if 'CMakeFiles' in f)
    print(f"Files in CMakeFiles dirs: {in_cmakefiles}")
    
    # Check specific patterns
    has_mod_stamp = sum(1 for f in file_list if f.endswith('.mod.stamp'))
    has_mod = sum(1 for f in file_list if f.endswith('.mod') and not f.endswith('.mod.stamp'))
    has_o = sum(1 for f in file_list if f.endswith('.o'))
    
    print(f"Files ending in .mod.stamp: {has_mod_stamp}")
    print(f"Files ending in .mod (not .stamp): {has_mod}")
    print(f"Files ending in .o: {has_o}")
    
    # Sample some paths
    print(f"First 3 paths: {file_list[:3]}")
    print(f"Last 3 paths: {file_list[-3:]}")

def compare_files(file1_path, file2_path):
    # Read the first file into a list
    with open(file1_path, 'r') as f1:
        lines1 = [line.strip() for line in f1]
    
    # Read the second file into a list
    with open(file2_path, 'r') as f2:
        lines2 = [line.strip() for line in f2]
    
    # Convert to sets for comparison
    set1 = set(lines1)
    set2 = set(lines2)
    
    # Find lines in both, only in list1, and only in list2
    in_both = sorted(set1 & set2)
    only_in_1 = sorted(set1 - set2)
    only_in_2 = sorted(set2 - set1)
    
    d = {
        'files_in_both': in_both,
        'files_in_list1': only_in_1,
        'files_in_list2': only_in_2
    }

    print(f"Counts: both={len(d['files_in_both'])}, list1={len(d['files_in_list1'])}, list2={len(d['files_in_list2'])}")

    # Check for patterns in what changed
    list1_extensions = set(f.split('.')[-1] for f in d['files_in_list1'])
    list2_extensions = set(f.split('.')[-1] for f in d['files_in_list2'])

    print("Extensions in both lists:", list1_extensions & list2_extensions)
    print("Extensions only in list1:", list1_extensions - list2_extensions)
    print("Extensions only in list2:", list2_extensions - list1_extensions)

    # Check path patterns
    list1_has_src = sum('copied_build/' in f for f in d['files_in_list1'])
    list2_has_src = sum('copied_build/' in f for f in d['files_in_list2'])
    print(f"Files with 'copied_build/' in path: list1={list1_has_src}, list2={list2_has_src}")

    # Check basenames
    list1_basenames = {Path(f).stem for f in d['files_in_list1']}
    list2_basenames = {Path(f).stem for f in d['files_in_list2']}

    common_basenames = list1_basenames & list2_basenames
    print(f"Files with same base name: {len(common_basenames)}")
    print(f"Examples: {list(common_basenames)[:10]}")

    analyze_paths(d['files_in_list1'], "lstat")
    analyze_paths(d['files_in_list2'], "scandir")
    analyze_paths(d['files_in_both'], "both")

if __name__ == "__main__":
    compare_files(
        "/pscratch/sd/f/forsyth/performance_profiling/analysis/add_many_small_files_before_metrics.txt.files",
        "/pscratch/sd/f/forsyth/performance_profiling/analysis/add_many_small_files_after_metrics.txt.files",
    )

Unfortunately, this still isn't giving much to go off of.

Let's try going back to the logs:

cd /pscratch/sd/f/forsyth/performance_profiling/logs
wc -l update_add_many_small_files_before_changes.log # 7,086
wc -l update_add_many_small_files_after_changes.log # 7,085
# The one-line diff is simply because `INFO: Gathering list of files to archive` doesn't get printed in the after-case
# That's 7,086-13=7,073 and 7,085-12=7,073 files each.
# Which means two things:
# 1. Both cases added 7,073 files.
# 2. The summarizing script extracted only 1,000 out of the 7,073 files from each log.

That last point is because of

grep -A 1000 "List of files to be updated" ${log_file} | tail -n +2 | grep -v "^$" | 

in the script. The `-A 1000` flag caps extraction at 1,000 lines after the match, so the remaining files were never captured.
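If the full list is needed, an awk-based extraction has no such line cap. A sketch (only the marker string is taken from the script above; the demo uses an inline stand-in log — point `log_file` at the real update log):

```shell
# Print every non-empty line *after* the marker, with no 1000-line cap.
log_file=$(mktemp)
printf 'INFO: Gathering list of files to archive\nList of files to be updated\ndir/file_a\ndir/file_b\n\ndir/file_c\n' > "$log_file"
awk 'found && NF {print} /List of files to be updated/ {found=1}' "$log_file"
rm -f "$log_file"
```

The `found` flag flips on at the marker line; from then on, every line with at least one field (`NF`) is printed, however many there are.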

So, are these 7,073 files identical once sorted?

tail -n 7073 update_add_many_small_files_before_changes.log > ../analysis/add_many_small_files_before_metrics.txt.files_manual
tail -n 7073 update_add_many_small_files_after_changes.log > ../analysis/add_many_small_files_after_metrics.txt.files_manual
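An order-insensitive check can also be run directly in the shell: sort both lists, then diff. A sketch with stand-in paths (substitute the `_manual` files above); empty `diff` output means the two lists match as sets:

```shell
# Compare two file lists ignoring ordering: sort both, then diff.
printf 'b\na\nc\n' > /tmp/list_before.txt
printf 'c\nb\na\n' > /tmp/list_after.txt
diff <(sort /tmp/list_before.txt) <(sort /tmp/list_after.txt) \
  && echo "identical up to ordering"
```

The process substitution (`<(...)`) requires bash rather than plain sh.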

Now we find:

> python compare_file_lists.py 
Counts: both=7073, list1=0, list2=0

We find that the diff in the build/ file lists is actually because (1) the profiling script only processed the first 1,000 files to add, and (2) each list contains 7,073 files that are identical if order does not matter.

Summary

Performance

  • Time: file gathering + database comparison is now 22-33x faster. The least speedup comes when adding many small files; the biggest speedups come when adding a few files (large or small).
  • Memory: roughly 3x more — a consistent ~80 MB in the file-gathering phase, plus a consistent ~30 MB in the database-comparison phase.

Accuracy (i.e., do the lists of files match up?)

  • In most cases, the file lists do in fact match up. I did run into a small problem when updating with a copied build directory: both file lists (before-case, after-case) show 7,073 files added. Bad news: the lists are not identical. Good news: the sorted lists are identical. So the question is: do we care about list order? We could always add a sorted call in the after-changes (scandir) version.
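If deterministic ordering matters, the scandir-based walk could simply sort each directory's entries as it descends. A minimal sketch (the function name and structure are illustrative, not zstash's actual implementation):

```python
import os

def walk_sorted(top):
    """Yield file paths under `top` with entries sorted at each level,
    so the scandir-based traversal order is deterministic and reproducible."""
    for entry in sorted(os.scandir(top), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_sorted(entry.path)
        else:
            yield entry.path
```

This keeps scandir's cached `DirEntry` type information (no extra stat calls) while guaranteeing the same ordering on every run.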

@forsyth2
Collaborator Author

forsyth2 commented Jan 8, 2026

@chengzhuzhang @golaz @wlin7 @TonyB9000 I've done further performance profiling (here & here)

Conclusions, copied from above:


Performance

  • Time: file gathering + database comparison is now 22-33x faster. The least speedup comes when adding many small files; the biggest speedups come when adding a few files (large or small).
  • Memory: roughly 3x more — a consistent ~80 MB in the file-gathering phase, plus a consistent ~30 MB in the database-comparison phase.

Accuracy (i.e., do the lists of files match up?)

  • In most cases, the file lists do in fact match up. I did run into a small problem when updating with a copied build directory: both file lists (before-case, after-case) show 7,073 files added. Bad news: the lists are not identical. Good news: the sorted lists are identical. So the question is: do we care about list order? We could always add a sorted call in the after-changes (scandir) version.

@chengzhuzhang
Collaborator

chengzhuzhang commented Jan 8, 2026

@forsyth2 Nice results! The speedup is much more significant than in the small tests. This is great news.
Re: added file order, I'm not sure if it is important, but Wuyin or Chris might have a preference.

@forsyth2
Collaborator Author

Closing in favor of #420.


Labels

semver: small improvement (will increment patch version)

Development

Successfully merging this pull request may close these issues.

[Bug]: speed up zstash update
