Conversation
Force-pushed 92e0b69 to 40b0b51
Setup

cd ~/ez/zstash
git status
# On branch issue-374-tar-deletion-rebased20251124
# nothing to commit, working tree clean
git fetch upstream main
git checkout -b zstash-update-performance-profile upstream/main
lcrc_conda # Activate conda. Alternatives: nersc_conda, compy_conda
rm -rf build
conda clean --all -y
conda env create -f conda/dev.yml -n zstash-performance-profile-20251215
conda activate zstash-performance-profile-20251215
# Make changes here
pre-commit run --all-files
python -m pip install .
cd tests/integration/bash_tests/run_from_any/
chmod 755 performance_for_update.bash
./performance_for_update.bash --work-dir /home/ac.forsyth2/ez/zstash/tests/utils/test_update_performance --num-files 1000 --num-dirs 50

Performance output for Claude's analysis & suggestions

Excellent! The profiling is working. Even with just 2,000 files, we can already see the bottleneck clearly.

Key findings from your test: Database queries are the bottleneck, taking 56.2% of the comparison time (0.13 s out of 0.22 s total). With only 2,000 files, this isn't a huge problem (0.22 seconds total). But let's extrapolate to a real simulation.

Scaling analysis: This assumes linear scaling. In practice, database queries often slow down with larger databases, so the problem gets worse at scale.

Recommendations:

1. Test with more files. Try running with a larger dataset to see if performance degrades non-linearly:

# Medium test
./zstash_profile.sh --work-dir /scratch/myuser --num-files 50000 --num-dirs 500
# Large test
./zstash_profile.sh --work-dir /scratch/myuser --num-files 100000 --num-dirs 1000

2. Optimization strategies. Since database queries are 56.2% of the time, here are the best optimizations:

A. Add a database index (easiest fix):

# In update_database(), after opening the database:
cur.execute("CREATE INDEX IF NOT EXISTS idx_files_name ON files(name)")

B. Batch database queries instead of one query per file:

# Instead of querying in a loop, load all relevant data once
cur.execute("SELECT name, size, mtime FROM files")
db_cache = {row[0]: (row[1], row[2]) for row in cur.fetchall()}
# Then check against the in-memory dictC. Use in-memory database for comparison: # Load entire files table into memory
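A minimal, self-contained sketch of this batched approach (toy `files` schema and made-up rows, purely illustrative, not the actual zstash index layout):

```python
import sqlite3

# Toy stand-in for the zstash index database; the real schema has more columns.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE files (name TEXT, size INTEGER, mtime REAL)")
cur.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [("a.nc", 100, 1000.0), ("b.nc", 200, 2000.0)],
)

# One query for the whole table instead of one query per file:
cur.execute("SELECT name, size, mtime FROM files")
db_cache = {row[0]: (row[1], row[2]) for row in cur.fetchall()}

# Then check each on-disk file against the in-memory dict: O(1) each, no SQL.
on_disk = {"a.nc": (100, 1000.0), "b.nc": (250, 2000.0), "c.nc": (5, 3000.0)}
changed_or_new = [
    name for name, meta in on_disk.items()
    if name not in db_cache or db_cache[name] != meta
]
print(changed_or_new)  # ['b.nc', 'c.nc']
```

Here `b.nc` is flagged because its size changed and `c.nc` because it is not in the database yet.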
# Compare in Python instead of SQLite queries

@chengzhuzhang As predicted, the database querying appears to be the biggest bottleneck. Options for next steps:
Please let me know what you would prefer. |
|
@forsyth2 thanks for setting this up. The next steps sound reasonable to me. Please start by testing with a real case. |
|
@wlin7 This branch adds performance profiling information. Can you use this branch when next running?

Set up
|
@chengzhuzhang Please define "real case"; that sounds like you mean a production run. Do you just mean running this script with an even larger number of auto-generated files? And/or without the |
|
Maybe I should have clarified that after testing with synthetic data, please test on a real zstash archive. |
Set up new script

Setup details

cd ~/ez/zstash
git status
# On branch zstash-update-performance-profile
# nothing to commit, working tree clean
git log
# Good, last commit: Performance profiling of zstash update
# Good, matches https://github.com/E3SM-Project/zstash/pull/414/commits
lcrc_conda # Activate conda.
conda activate zstash-performance-profile-20251215
python -m pip install .

Make a new testing script:

cd tests/integration/bash_tests/run_from_any/
touch performance_update_existing_archive.bash
# Make edits to `performance_update_existing_archive.bash`
chmod 755 performance_update_existing_archive.bash
cd ~/ez/zstash
git add -A # Add new test script to git
pre-commit run --all-files # Iterate until passing
python -m pip install .

The new script was included in the second commit.

Find/make an archive to update

We have a few requirements:
Details of creating an archive to update

How about we just create a new archive from scratch that looks more "real" than the toy data for testing?

du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201

We certainly don't want to transfer all that, but this is the only directory I have write permissions for that is in fact simulation data -- i.e., presumably a close enough substitute for "real" production data. Let's archive just a small portion of that, update some files, and then run the update.

cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
ls
du -sh archive
# 12T archive
du -sh post
# 166G post
du -sh run
# 11G run
du -sh init
# 6.9G init
du -sh build
# 1.2G build
du -sh case_scripts
# 1.5M case_scripts
du -sh mpas-ocean
# 147K mpas-ocean
du -sh mpas-seaice
# 147K mpas-seaice

Let's use the `build` directory:

cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance
cp -r /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/build build
mkdir cache
mkdir logs
# Set up Globus
# https://app.globus.org/file-manager?two_pane=true > For "Collection", choose LCRC Improv DTN, NERSC HPSS
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all
# On Perlmutter:
# hsi
# mkdir zstash_performance_20251216_try2 # mkdir: /home/f/forsyth/zstash_performance_20251216_try2
# exit
# Create an archive in the first place so we have something to update.
zstash create --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2 --cache=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/cache -v build/ 2>&1 | tee /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs/create.log
# Enter auth code
# ...
# INFO: Globus transfer 0187fe5f-dadd-11f0-a88e-0eb0b913a0ab, from 15288284-7006-4041-ba1a-6b52501e49f1 to 9cd89cfd-6d04-11e5-ba46-22000b92c6ec: zstash_performance_20251216_try2 index succeeded

That took longer than one minute, but still not too bad; about 15 minutes.

Run the performance profiling

First run details

Now, in the new script, make the needed edits. Then:

# Perlmutter:
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar index.db
# exit
cd /home/ac.forsyth2/ez/zstash/tests/integration/bash_tests/run_from_any
./performance_update_existing_archive.bash
# Need to change some logic in the performance script to have it print the summary.
# Perlmutter
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar index.db
# exit
# So, `zstash update` completed but 000001.tar wasn't transferred.
# https://app.globus.org/activity doesn't show any hanging activity.
# That's because we set
# EXISTING_ARCHIVE=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216
# not:
# EXISTING_ARCHIVE=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2

# Perlmutter:
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar index.db
# exit
cd /home/ac.forsyth2/ez/zstash/tests/integration/bash_tests/run_from_any
./performance_update_existing_archive.bash
# Perlmutter:
# hsi
# ls /home/f/forsyth/zstash_performance_20251216_try2 # 000000.tar 000002.tar index.db
# exit
# Good, the tar was transferred! (000001.tar was skipped because of a path error on the first run)

Output

Key takeaways
@chengzhuzhang Assuming this was a "realistic" test run, we can use this information to guide optimization.

What was tested here?
@chengzhuzhang Unfortunately, I still found this rather ambiguous. "Real" could take on a number of different meanings:
The above analysis was completed using these steps:
Please let me know if this was an appropriate test and, if not, which specific steps should be changed. Thanks!
|
@forsyth2 this is a good test! And in line with something I would suggest. It shows clearly that dataset comparison took much longer than file gathering. Within dataset comparison, stat operations took most of the time; what operation is this exactly?
Great, thanks!
That's this block in `update.py`:

# Time stat operations
stat_op_start = time.time()
statinfo: os.stat_result = os.lstat(file_path)
mdtime_new: datetime = datetime.utcfromtimestamp(statinfo.st_mtime)
mode: int = statinfo.st_mode
# For symbolic links or directories, size should be 0
size_new: int
if stat.S_ISLNK(mode) or stat.S_ISDIR(mode):
size_new = 0
else:
size_new = statinfo.st_size
stat_time += time.time() - stat_op_start

More specifically, the relevant call is to the os.lstat function.
Yes, I can do that. |
Claude confirms my thinking here -- there are just so many small files. We can try to batch those operations. First though, let me run the large update (adding `run/`).

The Problem

You're calling `os.lstat()` on every file individually.

The Solution: Use `os.scandir()`
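A minimal sketch of the `os.scandir()` approach (a hypothetical helper, not the actual zstash code): collect sizes and mtimes in the same pass that discovers the files, instead of a separate `os.lstat()` per path later.

```python
import os
import tempfile
from datetime import datetime, timezone

def collect_stats(root: str) -> dict:
    """Single-pass walk with os.scandir, recording (size, mtime) per file.

    scandir batches the directory listing, and entry.stat() can reuse
    information from it (always on Windows, partially on POSIX), cutting
    per-file syscalls versus calling os.lstat() on every path separately.
    """
    stats = {}
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)  # recurse iteratively
                else:
                    info = entry.stat(follow_symlinks=False)
                    mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                    stats[entry.path] = (info.st_size, mtime)
    return stats

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "x.txt"), "w") as f:
        f.write("hi")
    result = collect_stats(d)
    print(len(result))  # 1
```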
|
Setup for larger test# We saw previously:
cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh run
# 11G run
# We want to run `zstash update` (with performance logging) to add this subdir.
# We had set: DIR_THAT_WAS_ARCHIVED=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
# So we're actually going to need to place `run/` inside `build/`,
# but that's fine since this isn't a production use case.
cd ~/ez/zstash
git status
# On branch zstash-update-performance-profile
# nothing to commit, working tree clean
git log
# Add second profiling script
# Performance profiling of zstash update
cd tests/integration/bash_tests/run_from_any/
# Copying will retain the run permissions:
cp performance_update_existing_archive.bash performance_zstash_update_after_manual_change.bash
# Make necessary edits
cd ~/ez/zstash
git add -A # Add new test script to git
lcrc_conda # Activate conda.
conda activate zstash-performance-profile-20251215
pre-commit run --all-files
python -m pip install .
cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
cp -r /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/run/ run/ # Very fast copy of 11GB
du -sh run
# 11G run
cd ~/ez/zstash/tests/integration/bash_tests/run_from_any/
./performance_zstash_update_after_manual_change.bash
# tee: /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log: No such file or directory
# => Forgot to create the new logs directory
# That's not a huge problem though since it's also printing to the command line
mkdir /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/
touch /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log
# Copy command line output to /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_after_manual_change/update.log
cp performance_zstash_update_after_manual_change.bash get_profile_summary_from_log.bash
# Make necessary edits
./get_profile_summary_from_log.bash

Results from larger test

Here we see the profiling summary.

Summary of tests

Tests done so far:
Claude's suggestions based on above aggregate data

Looking at this performance data, here are the key optimization priorities:

Primary Bottleneck: stat operations (Database Comparison)

The most critical issue is the stat operations during database comparison, which becomes the dominant bottleneck as dataset size grows:
This shows quadratic-like scaling that will only get worse with larger datasets. The stat operations are likely checking file metadata (size, mtime) for every file to determine what needs archiving.

Optimization strategies:
Secondary Bottleneck: Add files (Tar Creation)

This becomes the dominant cost in Test 3 (87.4%, 399 seconds) when archiving large amounts of actual data (8,187 files, ~11 GB). However, this may be unavoidable I/O cost.

Worth investigating:
Lower Priority Optimizations

Database queries (4-5 seconds in Tests 2-3): Already quite efficient at ~6-9% of runtime. Only optimize if other issues are solved.

File gathering (4-5 seconds): Minimal and scales well. Not a priority.

Summary Recommendation

Focus on stat operations first - this is where you'll see the biggest gains, especially as datasets grow. The 48 seconds spent on stats in Test 3 is nearly pure overhead that could potentially be reduced by 50-80% with batching and caching strategies.
|
We can see above that when adding 10 GB to a 1 GB archive (Test 3), "Add files" overtakes "Database comparison" (which includes "stat operations"), which I imagine is expected. |
|
@forsyth2: thanks for the nice profiling work. It seems to me that the first level of optimization would be to remove the stat (`os.lstat`) calls.

If we need further improvements, it might also be possible to speed up other aspects of the database comparison.
Setup for optimization test

# We saw previously:
cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh init
# 6.9G init
# We want to run `zstash update` (with performance logging) to add this subdir.
# We had set: DIR_THAT_WAS_ARCHIVED=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
# So we're actually going to need to place `init/` inside `build/`,
# but that's fine since this isn't a production use case.
cd ~/ez/zstash
git status
# On branch zstash-update-performance-profile
# modified: tests/integration/bash_tests/run_from_any/get_profile_summary_from_log.bash
# Conda env: zstash-performance-profile-20251215
git log
# 3 profiling commits, 3 optimization commits
git add -A
pre-commit run --all-files
python -m pip install .
cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
cp -r /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/init/ init/
du -sh init
# 6.9G init
mkdir /lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization
cd ~/ez/zstash/tests/integration/bash_tests/run_from_any/
# DIR_THAT_WAS_ARCHIVED=/lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
# EXISTING_ARCHIVE=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2
#
# WORK_DIR=/lcrc/group/e3sm/ac.forsyth2/zstash_performance
# CACHE_DIR=${WORK_DIR}/cache
cd /lcrc/group/e3sm/ac.forsyth2/zstash_performance/build/
zstash update --hpss="globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_performance_20251216_try2" --cache="/lcrc/group/e3sm/ac.forsyth2/zstash_performance/cache" -v 2>&1 | tee "/lcrc/group/e3sm/ac.forsyth2/zstash_performance/logs_update_optimization/update.log"
cd ~/ez/zstash/tests/integration/bash_tests/run_from_any/
./get_profile_summary_from_log.bash

On Perlmutter:

hsi
ls zstash_performance_20251216_try2/
# zstash_performance_20251216_try2/:
# 000000.tar 000002.tar 000003.tar 000004.tar index.db
# Test 2 added 000002.tar
# Test 3 added 000003.tar
# This optimized run added 000004.tar
exit

Results from optimization test

Review

@chengzhuzhang @golaz The optimizations are covered by the last 4 commits; they can be viewed in aggregate here. Can you do a high-level review of Claude's changes?

If you agree with the general implementation logic, I can go ahead with:
|
|
Took a very brief initial look and left some comments. The structure of the changes looks reasonable. But it would be useful if you walk us through the changes when you better understand them yourself.
|
@forsyth2 Copilot helped to summarize that the code change contains two mechanisms for attempting a performance boost:
This seems reasonable, but I didn't review the code line by line. Please review and test on your side to confirm the correctness and the performance results.
|
Yes, thank you both, I just wanted high level approvals before proceeding with further improvements. I will move on to the next steps now. |
|
Update 2025-12-18 end-of-day: Test 5 profiling results

Overall time breakdown
Code review guide

Claude's walk-through

Code Review Guide: Zstash Update Performance Optimization

Executive Summary

This diff addresses a critical performance bottleneck in `zstash update`.

The Core Problem

Original Workflow (Inefficient)
Optimized Workflow
Key Optimization Strategies

1. Collect Stats During Filesystem Walk (`utils.py`)
|
|
@chengzhuzhang @golaz The above comment is my latest status on this. I need to take some time to absorb all the information in it, but figured I should make it available in case either of you are interested / want to add suggestions. Thanks! A few things come to mind so far:
Action items for me:
|
I'm also going to need to look at the number of files being archived, not just the sizes. I assume the performance profiling will depend on both the number and the size of files, both in the initial archive and in the files being added. I was trying to see if Claude could come up with a good analysis, but I think I need to collect the different combinations of data first.
I've added an updated performance profiling script. Results below. Test configurations:
Notes from Claude

Looking at your profiling data, here are the key relationships between directory characteristics and performance:

File Gathering Time
Add Files Time
Database Comparison
Bottom line: File gathering is O(n) with file count, while adding files is O(n) with total data size. For your backup operations, total data volume dominates overall runtime (~95% of time), but large file counts will slow down the metadata scanning phase. |
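The two scaling claims above can be captured in a toy cost model; the per-unit constants below are made up for illustration, not measured:

```python
# Toy model: file gathering scales with file count; adding files scales
# with total bytes. Constants are illustrative only.
def estimated_runtime_s(n_files: int, total_bytes: int,
                        per_file_s: float = 1e-4,
                        per_byte_s: float = 1e-8) -> float:
    gathering = per_file_s * n_files   # metadata scanning phase
    adding = per_byte_s * total_bytes  # tar creation / transfer
    return gathering + adding

# Same 1 GB of data as many small files vs. a few large files:
many_small = estimated_runtime_s(100_000, 10**9)
few_large = estimated_runtime_s(100, 10**9)
print(many_small > few_large)  # True: file count inflates the scanning phase
```

Under these assumed constants the data-volume term dominates both cases, but the many-small-files case pays a visibly larger gathering cost, matching the qualitative picture above.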
Force-pushed d3ae23b to 589ac40
|
Cleaned up commit history by consolidating related commits; the 4 commits currently are:
|
|
Now that the PR is approaching reviewable status, I have updated the top description to indicate this is no longer just a performance profiling PR that should never be merged. |
forsyth2
left a comment
My self-review for this PR
I've only reviewed update.py and utils.py here.
Remaining action items:
- Have others review the design decisions/architecture changes of `update.py` and `utils.py`.
- Decide if we want to keep the performance logging code & scripts. Pro: available for future use, off by default. Con: adds a lot of superfluous code.
- Self-review non-user-facing code (`performance_profiling/`, `tests/`) changes.
- Rerun the entire test suite.
Claude's code review guide below, based on:
- the issue description from #409.
- asking if it's necessary to update `create.py` as well.
- The diff for `update.py` and `utils.py`
Code review guide
Code Review Guide: Zstash Update Performance Optimization
Overview
This diff optimizes the zstash update operation by eliminating redundant file system stat calls during database comparison. The key insight is to collect file stats during the initial filesystem walk and reuse them during comparison, rather than calling os.lstat() on every file twice.
Performance Impact
Based on the test results, the optimizations show:
- File Gathering: 0.1% - 12% of total time (varies by file count)
- Database Comparison: ~0.0% of total time (dramatically reduced from previous implementation)
- Add Files: 87.9% - 99.9% of total time (unchanged, still the bottleneck)
The optimization successfully addresses the file gathering and comparison bottleneck, though archiving itself remains the dominant time consumer.
Key Changes
1. New Function: get_files_to_archive_with_stats() (utils.py)
Purpose: Combines file discovery with stat collection in a single pass.
What to Review:
def get_files_to_archive_with_stats(
cache: str, include: str, exclude: str
) -> Dict[str, Tuple[int, datetime]]:

- ✅ Check: Returns `Dict[str, Tuple[int, datetime]]` mapping `file_path → (size, mtime)`
- ✅ Check: Uses `os.scandir()` instead of `os.walk()` for better performance
- ✅ Check: Collects stats during traversal (no second pass needed)
- ⚠️ Verify: Empty directory handling - ensures they get `(0, mtime)` entries
- ⚠️ Verify: Symlink handling - correctly sets size to 0 for symlinks
- ⚠️ Verify: Cache directory exclusion works correctly
Critical Logic:
# For symbolic links or directories, size should be 0
if stat_module.S_ISLNK(mode):
size = 0
else:
    size = stat_info.st_size

- ✅ Matches original behavior in `update.py` lines 393-397
2. DirectoryScanner Class (utils.py)
Purpose: Encapsulates recursive directory scanning with stat collection.
What to Review:
class DirectoryScanner:
    def scan_directory(self, path: str):

- ✅ Check: Properly handles `PermissionError` exceptions
- ✅ Check: Skips cache directory correctly
- ✅ Check: Uses `entry.stat(follow_symlinks=False)` to match `os.lstat()` behavior
- ⚠️ Verify: Empty directory detection logic is sound
- ⚠️ Verify: Path normalization is consistent with original implementation
- 🔍 Test: Progress logging every 1000 directories doesn't cause performance issues
3. Optimized Database Comparison (update.py)
MAJOR CHANGE: Eliminated the os.lstat() call inside the file comparison loop.
Original Code (lines 210-234):
for file_path in files:
statinfo: os.stat_result = os.lstat(file_path) # ← EXPENSIVE!
mdtime_new: datetime = datetime.utcfromtimestamp(statinfo.st_mtime)
    # ... comparison logic

New Code (lines 420-447):
# Load database once into memory
archived_files: Dict[str, Tuple[int, datetime]] = {}
cur.execute("SELECT name, size, mtime FROM files")
# Build lookup dictionary
# Compare using pre-collected stats
for file_path in files:
size_new, mdtime_new = file_stats[file_path] # ← O(1) lookup, no I/O!
if file_path not in archived_files:
newfiles.append(file_path)
    # ... comparison logic

What to Review:
- ✅ Check: Database loaded once into memory (O(1) lookups vs O(n) queries)
- ✅ Check: Comparison logic remains identical to original
- ✅ Check: Handles duplicate database entries (keeps latest mtime)
- ⚠️ Verify: Time tolerance check (`TIME_TOL`) works identically
- 🔍 Test: Memory usage acceptable for large databases
Critical Invariant:
# Original comparison
if (size_new == match.size) and (
abs((mdtime_new - match.mtime).total_seconds()) <= TIME_TOL
):
# New comparison
if not (
(size_new == archived_size)
and (abs((mdtime_new - archived_mtime).total_seconds()) <= TIME_TOL)
):

- ✅ Logic is equivalent (note the `not` inversion for the `newfiles` condition)
4. Performance Logging Classes
Two new classes added for detailed performance tracking:
- `UpdatePerformanceLogger` (in `update.py`)
- `FileGatheringPerformanceLogger` (in `utils.py`)
What to Review:
- ✅ Check: Logging only occurs at DEBUG level (no performance impact in production)
- ✅ Check: Progress updates every 1000 items to avoid log spam
- ⚠️ Verify: Timing measurements don't introduce overhead
- 💡 Consider: Could this logging be made optional via a flag?
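The flag idea in the last point can lean on logging levels that already exist; a minimal sketch of the DEBUG-gated pattern the review describes (the logger name and helper are hypothetical, not the actual zstash code):

```python
import logging
import time

logger = logging.getLogger("zstash.perf_sketch")  # hypothetical name

def timed(label, fn):
    """Run fn(); emit timing only when DEBUG logging is enabled.

    logger.isEnabledFor() makes the disabled path nearly free, so
    production runs pay almost nothing for the instrumentation.
    """
    start = time.time()
    result = fn()
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("%s took %.3fs", label, time.time() - start)
    return result

print(timed("demo", lambda: sum(range(10))))  # 45
```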
Backward Compatibility
Legacy Function Maintained
def get_files_to_archive(cache: str, include: str, exclude: str) -> List[str]:
"""LEGACY VERSION: For backward compatibility."""
file_stats = get_files_to_archive_with_stats(cache, include, exclude)
    return list(file_stats.keys())

What to Review:
- ✅ Check: Returns identical output format as before
- ⚠️ Verify: `create.py` still works with this function (see question below)
Testing Checklist
Functional Tests
- Empty directories are correctly identified and archived
- Symlinks have size=0 in the database
- Cache directory is properly excluded from archiving
- Include/exclude patterns work identically to before
- File ordering is deterministic and matches original
- Time tolerance (`TIME_TOL`) comparisons work correctly
- Database entries with duplicate paths handle latest mtime
Performance Tests
- File gathering time reduced (confirmed in test results ✅)
- Database comparison time near-zero (confirmed in test results ✅)
- Memory usage acceptable for large file sets
- No performance regression in the `add_files` operation
Edge Cases
- Permission denied errors handled gracefully
- Very large file counts (millions of files)
- Very large databases (millions of archived files)
- Concurrent file modifications during scanning
- Files with special characters in names
- Deep directory hierarchies
Answer to Your Question: Does create.py Need Modification?
Short answer: No immediate modification needed, but it could benefit from optimization.
Analysis:
create.pycurrently uses the legacyget_files_to_archive()function- The legacy function now calls the optimized
get_files_to_archive_with_stats()internally - However,
create.pywill still callos.lstat()on each file during archiving
Recommendation:
# In create.py, if you want to optimize further:
# Instead of:
files = get_files_to_archive(cache, args.include, args.exclude)
# Consider:
file_stats = get_files_to_archive_with_stats(cache, args.include, args.exclude)
# Then pass file_stats to add_files() and modify add_files() to accept the stats

Why this matters: Based on your test results, the majority of time is spent in add_files (87-99%). If add_files is calling os.lstat() again, you could eliminate those redundant calls too.
Action Item: Review hpss_utils.add_files() to see if it's stating files again. If so, modify it to accept pre-collected stats.
Potential Issues to Watch
1. Memory Usage
archived_files: Dict[str, Tuple[int, datetime]] = {}

For very large databases (millions of files), this could consume significant memory. Monitor memory usage in production.
2. Path Normalization Consistency
normalized_path = os.path.normpath(entry.path)

Ensure all paths are normalized consistently between:
- File gathering
- Database storage
- Database lookup
3. Empty Directory Mtime Fallback
except (OSError, PermissionError):
# Fallback if we can't stat the directory
    self.file_stats[normalized_path] = (0, datetime.utcnow())

Using datetime.utcnow() as a fallback could cause issues. Consider whether this should fail instead or use a different sentinel value.
4. Duplicate Database Entries
if file_path in archived_files:
existing_mtime = archived_files[file_path][1]
if mtime > existing_mtime:
        archived_files[file_path] = (size, mtime)

This assumes keeping the latest mtime is correct. Verify this matches the original behavior when the database has duplicates.
Recommendations
High Priority
- ✅ Test with actual large E3SM datasets (millions of files)
- ✅ Monitor memory usage with large databases
- ⚠️ Add unit tests for edge cases (empty dirs, symlinks, permissions)
- ⚠️ Document the performance characteristics in user documentation
Medium Priority
- 💡 Consider optimizing
add_files()to accept pre-collected stats - 💡 Add configuration flag to enable/disable detailed performance logging
- 💡 Consider adding a database index on the
namecolumn if not present
Low Priority
- 💡 Profile memory usage and consider streaming approach for very large datasets
- 💡 Add regression tests comparing old vs new output for identical inputs
Summary
This is a well-designed optimization that successfully addresses the performance bottleneck in file gathering and database comparison. The code maintains backward compatibility while achieving significant speedups. The main concerns are memory usage at scale and ensuring the modified create.py workflow is also considered for optimization.
Approval Recommendation: ✅ Approve with minor testing verification of edge cases.
# Get actual mtime for empty directory
try:
    stat_info = os.lstat(path)
    mtime = datetime.utcfromtimestamp(stat_info.st_mtime)
⚠️ Verify: Empty directory handling - ensures they get `(0, mtime)` entries
✅
|
|
# For symbolic links or directories, size should be 0
if stat_module.S_ISLNK(mode):
    size = 0
⚠️ Verify: Symlink handling - correctly sets size to 0 for symlinks
✅
# Skip the cache directory entirely
if entry.path == self.cache_path or entry.path.startswith(
    self.cache_path + os.sep
):
⚠️ Verify: Cache directory exclusion works correctly
✅
# Progress logging every 1000 directories
if self.dir_count % 1000 == 0:
    elapsed = time.time() - self.walk_start_time
    self.perf_logger.log_scandir_progress(
        self.dir_count, self.file_count, elapsed
    )
🔍 Test: Progress logging every 1000 directories doesn't cause performance issues
✅ All of the logger statements are only run in -v mode (verbose mode sets the logging level to DEBUG). That said, see the point in this review's top-level comment: "Decide if we want to keep the performance logging code & scripts"
if not (
    (size_new == archived_size)
    and (abs((mdtime_new - archived_mtime).total_seconds()) <= TIME_TOL)
⚠️ Verify: Time tolerance check (TIME_TOL) works identically
Previously: Looped through requested files, exiting early if the TIME_TOL condition was True, in which case we set new = False and file_path was not appended to newfiles.
Now: we already have the corresponding rows from the files table, so we simply check for each row whether the TIME_TOL condition is False, in which case we do append file_path to newfiles.
✅ Thus, we have TIME_TOL True => do NOT append, TIME_TOL False => append. It seems the logic is consistent then.
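The equivalence can also be checked mechanically; a quick sketch with an illustrative `TIME_TOL` value (the real tolerance lives in the zstash source):

```python
TIME_TOL = 1.0  # illustrative tolerance in seconds

def old_is_new(size_new, size_db, dt_seconds):
    # Old logic: new = False when size matches AND mtime is within tolerance
    match = (size_new == size_db) and (abs(dt_seconds) <= TIME_TOL)
    return not match

def new_is_new(size_new, size_db, dt_seconds):
    # New logic: append when NOT (size matches AND mtime within tolerance)
    return not ((size_new == size_db) and (abs(dt_seconds) <= TIME_TOL))

cases = [(10, 10, 0.5), (10, 10, 2.0), (10, 11, 0.0), (10, 11, 5.0)]
print(all(old_is_new(*c) == new_is_new(*c) for c in cases))  # True
```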
# If file appears multiple times, keep the one with latest mtime
if file_path in archived_files:
    existing_mtime = archived_files[file_path][1]
    if mtime > existing_mtime:
        archived_files[file_path] = (size, mtime)
else:
    archived_files[file_path] = (size, mtime)
[ ] Database entries with duplicate paths handle latest mtime
✅ Presumably this wasn't a problem in the previous implementation because we were handling the files themselves, so of course we were getting their latest accessible mtime.
# Add files
perf.start_add_files()
[ ] No performance regression in the `add_files` operation
We haven't changed anything about add_files, so I don't imagine this would be an issue.
try:
    entries = list(os.scandir(path))
except PermissionError:
    logger.warning(f"Permission denied: {path}")
[ ] Permission denied errors handled gracefully
✅
# Get stat info - scandir provides this efficiently
# Use entry.stat(follow_symlinks=False) to match os.lstat() behavior
stat_info = entry.stat(follow_symlinks=False)
mode = stat_info.st_mode
[ ] Concurrent file modifications during scanning
[ ] Files with special characters in names
Presumably these things are handled by stat??
# Recursively scan subdirectory
self.scan_directory(entry.path)
has_contents = True
[ ] Deep directory hierarchies
This recursive call should handle deep hierarchies fine.
|
@chengzhuzhang @golaz Please review just the changes in `update.py` and `utils.py`.

Design decisions questions I have for you:
The key logic changes to review:
Remaining action items for me, after your review:
|
|
Thank you @forsyth2 ! Let's use the EZ meeting this afternoon to walk though this PR. |
|
@forsyth2 I'm trying to find the performance comparison: performance data before this PR vs. after. Which table(s) should I look at?
For that, you'll want to look at this comment "Overall time breakdown". Tests 1-3 are before; tests 4-5 are after. The comment I referenced above is for different combinations of number of files & file sizes, all on the post-performance-update code.
Action items:
- Add same performance code to before-version. Then, do apples-to-apples comparison.
- Do memory profiling too
- How efficient is the dictionary access? Does it remain O(1) lookup even on a large number of keys?
- For every test, start from the same identical `index.db`. Then, we're only looking at the changes from `update`.
  - Retrieve from an existing simulation. Will need an existing archive; extract it to disk, run `zstash update`, discard `index.db`, and replace it with the old one. Restore `index.db` before each test.
- Retrieve from an existing simulation. Will need an existing archive and extract it to disk. run
- Try running the same tests multiple times, to get an average.
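On the dictionary-access question in the action items: CPython dicts are open-addressing hash tables, so lookups stay amortized O(1) regardless of key count. A rough check (timings are machine-dependent, so only the rough magnitudes matter):

```python
import timeit

small = {f"file_{i}": i for i in range(10_000)}
big = {f"file_{i}": i for i in range(1_000_000)}

# Per-lookup time should be roughly flat as the table grows 100x.
t_small = timeit.timeit(lambda: small["file_5000"], number=50_000)
t_big = timeit.timeit(lambda: big["file_500000"], number=50_000)
print(f"10k keys: {t_small:.4f}s  1M keys: {t_big:.4f}s")
```

Memory, rather than lookup time, is the more realistic concern at millions of entries, which matches the "memory profiling" action item above.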
Performance comparison

Methodology

I created a new branch with the changes. I also created a branch that applies the same simple performance checking code to the pre-change code.

To profile, I used the script that I've included in both of those branches. To have some consistency with previous profile runs, I used the same three subdirectories:
There are 12 test cases: for each of those 3 subdirs, create an archive and update it with one of the other subdirs, and then test it both with and without Globus.

Results

Summary tables

Before changes:
After changes:
Visualizations

Claude was able to generate code to visualize this data.
Key takeaways:
Key takeaways:
Key takeaways:
Key takeaways:
@chengzhuzhang @golaz I've conducted a comprehensive performance profiling covering 12 test cases before & after code changes. See comment above. I've listed my key takeaways as well. Note I didn't use the restoring

@wlin7 @TonyB9000 You may also be interested in the above performance analysis.
@forsyth2 thanks for the analysis!
Thanks! Just for further clarity, adding some more description:
Correct, the green bar (add files) starts sooner in all visible cases.
Correct, we see a large reduction for
Correct, the left and right bars (= before & after) for each non-Globus case (from left-to-right: cases 1,3,5,7,9,11) show little variation.
Yes, those factors make it more difficult to directly compare the before/after behavior for Globus. Unfortunately, the Globus cases are the ones we care about the most (to resolve #409).
Yes, I would think so too. Indeed, it would be safe to run on a production-like case (as opposed to actual production use), since I haven't run the entire test suite on these specific versions (before-changes with profiling, after-changes with profiling) of the code. That said, clearly these 12 cases run.
Action items:
Performance Comparison 2026-01-08

Methodology

Before- and after-branches

I used the
| Step | Time ratio (Before over After) |
|---|---|
| FILE GATHERING | 1.76/24.24 => 0.07x faster (i.e., 13.77x slower) |
| DATABASE COMPARISON | 966.21/6.06 => 159.44x faster |
| Together | 967.97/30.3 => 31.95x faster |
So, the file gathering + database comparison step is ~32x faster, based on this profiling.
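The DATABASE COMPARISON speedup comes from replacing per-file SQL queries with one bulk load into a dict, followed by O(1) lookups. A minimal sketch of the idea (the schema and column names here are illustrative, not necessarily zstash's actual `files` table):

```python
# Sketch: load the whole index with one query, then compare on-disk files
# against an in-memory dict instead of issuing one SELECT per file.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (name TEXT, size INTEGER, mtime REAL)")
con.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [(f"run/file_{i}.nc", 100, 1.0) for i in range(5)],
)

# One query loads the entire index into a dict keyed by file name.
indexed = {
    name: (size, mtime)
    for name, size, mtime in con.execute("SELECT name, size, mtime FROM files")
}

# Each on-disk file is now a single dict lookup.
on_disk = [("run/file_0.nc", 100, 1.0), ("run/file_1.nc", 100, 2.0), ("run/new.nc", 50, 3.0)]
to_update = [
    name for name, size, mtime in on_disk
    if indexed.get(name) != (size, mtime)  # new file or changed size/mtime
]
print(to_update)  # -> ['run/file_1.nc', 'run/new.nc']
```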
Next steps
- Write script to repeat this profiling, but with different types of additions (e.g., lots of small files, a few big files). We're really not profiling memory well by adding only a few files.
- (optional) I think we most care about file gathering & database comparison. If we do want to profile total time (i.e., including `ADD_FILES`), I would need to rerun without `--dry-run`, which would require re-extracting using `--keep`, since I don't have write permissions to the archive on HPSS. `showquota` has me at 5.66/20.00 TiB, so in theory I have enough space to re-extract a 6 TB directory (plus the tars, so 12 TB total).
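For reference, the `TIME PROFILE` / `MEMORY PROFILE` lines that the follow-up analysis script greps for could be produced by a small wrapper like this. This is a hedged sketch: the function name, workload, and label format are inferred from the grep patterns in the analysis script, not taken from the actual branches.

```python
# Sketch of per-phase instrumentation emitting lines in the format the
# analysis script parses (TIME PROFILE / MEMORY PROFILE).
import time
import tracemalloc

def profile_phase(label, func, *args, **kwargs):
    tracemalloc.start()
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    current, peak = tracemalloc.get_traced_memory()  # bytes
    tracemalloc.stop()
    print(f"TIME PROFILE -- {label}: {elapsed:.2f} seconds")
    print(f"MEMORY PROFILE -- {label}: current={current / 1e6:.2f} MB, peak={peak / 1e6:.2f} MB")
    return result

# Example with a dummy "file gathering" workload:
files = profile_phase("FILE GATHERING", lambda: [f"file_{i}" for i in range(10_000)])
```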
Performance Comparison 2026-01-08 follow-up

Profiling script

Script written by me based on the previous profiling run, then extended by Claude. Note it is imperfect; several manual steps were needed to fix malformed data summaries.

#!/bin/bash
# Run with:
# screen
# cd /global/homes/f/forsyth/ez/performance_profiling
# source ~/.bashrc
# source ~/.bash_profile
# nersc_conda
# time ./zstash_performance_profiling.bash 2>&1 | tee zstash_performance_profiling.log
# CTRL A D
# tail -f zstash_performance_profiling.log
# Create conda environments by:
# cd ~/ez/zstash
# git checkout <before-branch or after-branch>
# rm -rf build
# conda clean --all --y
# conda env create -f conda/dev.yml -n env-name
# conda activate env-name
# pre-commit run --all-files
# python -m pip install .
BEFORE_CASE_CONDA_ENV=zstash-profile-before-case-20260108
AFTER_CASE_CONDA_ENV=zstash-profile-after-case-20260108
HPSS_PATH=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151
PROFILE_PATH=/pscratch/sd/f/forsyth/performance_profiling/
EXTRACTED_DIRECTORY_PATH=${PROFILE_PATH}extraction/
LOG_PATH=${PROFILE_PATH}logs/
CACHE_PATH=${PROFILE_PATH}cache/
INDEX_COPIES_PATH=${PROFILE_PATH}index_copies/
ORIGINAL_INDEX=${PROFILE_PATH}original_index/index_after_extraction.db
ANALYSIS_PATH=${PROFILE_PATH}analysis/
###############################################################################
source ~/.bashrc
source ~/.bash_profile
nersc_conda # Activate Conda using function defined in ~/.bashrc
# Create analysis directory if it doesn't exist
mkdir -p ${ANALYSIS_PATH}
reset_index()
{
local case_name="${1}"
mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_${case_name}.db
cp ${ORIGINAL_INDEX} ${CACHE_PATH}index.db
}
extract_metrics()
{
local log_file="${1}"
local output_file="${2}"
echo "Extracting metrics from ${log_file}..."
# Extract timing data
file_gathering_time=$(grep "TIME PROFILE -- FILE GATHERING:" ${log_file} | sed -n 's/.*: \([0-9.]*\) seconds/\1/p')
db_comparison_time=$(grep "TIME PROFILE -- DATABASE COMPARISON:" ${log_file} | sed -n 's/.*: \([0-9.]*\) seconds/\1/p')
# Extract memory data
file_gathering_mem=$(grep "MEMORY PROFILE -- FILE GATHERING:" ${log_file} | sed -n 's/.*current=\([0-9.]*\) MB, peak=\([0-9.]*\) MB/\1,\2/p')
db_comparison_mem=$(grep "MEMORY PROFILE -- DATABASE COMPARISON:" ${log_file} | sed -n 's/.*current=\([0-9.]*\) MB, peak=\([0-9.]*\) MB/\1,\2/p')
# Extract total time from 'time' command output
total_time=$(grep "^real" ${log_file} | awk '{print $2}')
# Extract list of files to be updated
grep -A 1000 "List of files to be updated" ${log_file} | tail -n +2 | grep -v "^$" | grep -v "^DEBUG:" | grep -v "^INFO:" > ${output_file}.files
# Write metrics to output file
cat > ${output_file} << EOF
FILE_GATHERING_TIME=${file_gathering_time}
DB_COMPARISON_TIME=${db_comparison_time}
FILE_GATHERING_MEM=${file_gathering_mem}
DB_COMPARISON_MEM=${db_comparison_mem}
TOTAL_TIME=${total_time}
EOF
echo "Metrics extracted to ${output_file}"
}
compare_and_report()
{
local case_name="${1}"
local before_metrics="${ANALYSIS_PATH}${case_name}_before_metrics.txt"
local after_metrics="${ANALYSIS_PATH}${case_name}_after_metrics.txt"
local before_files="${ANALYSIS_PATH}${case_name}_before_metrics.txt.files"
local after_files="${ANALYSIS_PATH}${case_name}_after_metrics.txt.files"
local report="${ANALYSIS_PATH}${case_name}_report.txt"
echo "Generating comparison report..."
# Source the metrics files
source ${before_metrics}
before_fg_time=$FILE_GATHERING_TIME
before_db_time=$DB_COMPARISON_TIME
before_fg_mem=$FILE_GATHERING_MEM
before_db_mem=$DB_COMPARISON_MEM
before_total=$TOTAL_TIME
source ${after_metrics}
after_fg_time=$FILE_GATHERING_TIME
after_db_time=$DB_COMPARISON_TIME
after_fg_mem=$FILE_GATHERING_MEM
after_db_mem=$DB_COMPARISON_MEM
after_total=$TOTAL_TIME
# Calculate total processing time (sum of two steps)
before_total_steps=$(echo "$before_fg_time + $before_db_time" | bc)
after_total_steps=$(echo "$after_fg_time + $after_db_time" | bc)
# Generate report
cat > ${report} << EOF
========================================
Performance Comparison Report
Case: ${case_name}
========================================
TIMING COMPARISON
-----------------
| Step | Before | After | Ratio |
|------|--------|-------|-------|
| FILE GATHERING | ${before_fg_time}s | ${after_fg_time}s | $(echo "scale=2; $before_fg_time / $after_fg_time" | bc)x $(if (( $(echo "$before_fg_time > $after_fg_time" | bc -l) )); then echo "slower"; else echo "faster"; fi) |
| DATABASE COMPARISON | ${before_db_time}s | ${after_db_time}s | $(echo "scale=2; $before_db_time / $after_db_time" | bc)x $(if (( $(echo "$before_db_time > $after_db_time" | bc -l) )); then echo "slower"; else echo "faster"; fi) |
| Together | ${before_total_steps}s | ${after_total_steps}s | $(echo "scale=2; $before_total_steps / $after_total_steps" | bc)x $(if (( $(echo "$before_total_steps > $after_total_steps" | bc -l) )); then echo "slower"; else echo "faster"; fi) |
| Total Time | ${before_total} | ${after_total} | - |
MEMORY COMPARISON
-----------------
FILE GATHERING (current, peak):
Before: ${before_fg_mem} MB
After: ${after_fg_mem} MB
DATABASE COMPARISON (current, peak):
Before: ${before_db_mem} MB
After: ${after_db_mem} MB
FILE LIST COMPARISON
--------------------
EOF
# Compare file lists
if diff -q ${before_files} ${after_files} > /dev/null 2>&1; then
echo "✓ File lists MATCH" >> ${report}
else
echo "✗ File lists DIFFER" >> ${report}
echo "" >> ${report}
echo "Differences:" >> ${report}
diff ${before_files} ${after_files} >> ${report}
fi
echo "" >> ${report}
echo "Files identified for update:" >> ${report}
cat ${after_files} >> ${report}
# Display report
cat ${report}
echo ""
echo "Full report saved to: ${report}"
}
run_profiles()
{
local case_name="${1}"
conda activate ${BEFORE_CASE_CONDA_ENV}
log_file=${LOG_PATH}update_${case_name}_before_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
time zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${log_file}
extract_metrics ${log_file} ${ANALYSIS_PATH}${case_name}_before_metrics.txt
reset_index ${case_name}_before_changes
conda activate ${AFTER_CASE_CONDA_ENV}
log_file=${LOG_PATH}update_${case_name}_after_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
time zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${log_file}
extract_metrics ${log_file} ${ANALYSIS_PATH}${case_name}_after_metrics.txt
reset_index ${case_name}_after_changes
# Generate comparison report
compare_and_report ${case_name}
}
generate_summary_table()
{
local summary_file="${ANALYSIS_PATH}summary_table.txt"
echo "Generating summary table across all cases..."
# Header
cat > ${summary_file} << 'EOF'
========================================
PERFORMANCE SUMMARY - ALL CASES
========================================
EOF
# Collect all case names
cases=($(ls ${ANALYSIS_PATH}*_report.txt 2>/dev/null | sed 's/.*\///;s/_report.txt//'))
if [ ${#cases[@]} -eq 0 ]; then
echo "No cases found to summarize."
return
fi
# Build timing ratio table
echo "| Step | ${cases[@]} |" >> ${summary_file}
echo "|------|$(printf '%s|' "${cases[@]/#/---}")" >> ${summary_file}
# FILE GATHERING row
printf "| FILE GATHERING |" >> ${summary_file}
for case in "${cases[@]}"; do
source ${ANALYSIS_PATH}${case}_before_metrics.txt
before_time=$FILE_GATHERING_TIME
source ${ANALYSIS_PATH}${case}_after_metrics.txt
after_time=$FILE_GATHERING_TIME
if [ -n "$before_time" ] && [ -n "$after_time" ] && [ "$after_time" != "0" ]; then
ratio=$(echo "scale=2; $before_time / $after_time" | bc)
if (( $(echo "$before_time > $after_time" | bc -l) )); then
direction="slower"
else
direction="faster"
fi
printf " ${before_time}/${after_time} = ${ratio}x ${direction} |" >> ${summary_file}
else
printf " N/A |" >> ${summary_file}
fi
done
echo "" >> ${summary_file}
# DATABASE COMPARISON row
printf "| DATABASE COMPARISON |" >> ${summary_file}
for case in "${cases[@]}"; do
source ${ANALYSIS_PATH}${case}_before_metrics.txt
before_time=$DB_COMPARISON_TIME
source ${ANALYSIS_PATH}${case}_after_metrics.txt
after_time=$DB_COMPARISON_TIME
if [ -n "$before_time" ] && [ -n "$after_time" ] && [ "$after_time" != "0" ]; then
ratio=$(echo "scale=2; $before_time / $after_time" | bc)
if (( $(echo "$before_time > $after_time" | bc -l) )); then
direction="slower"
else
direction="faster"
fi
printf " ${before_time}/${after_time} = ${ratio}x ${direction} |" >> ${summary_file}
else
printf " N/A |" >> ${summary_file}
fi
done
echo "" >> ${summary_file}
# Together row
printf "| Together |" >> ${summary_file}
for case in "${cases[@]}"; do
source ${ANALYSIS_PATH}${case}_before_metrics.txt
before_fg=$FILE_GATHERING_TIME
before_db=$DB_COMPARISON_TIME
before_total=$(echo "$before_fg + $before_db" | bc)
source ${ANALYSIS_PATH}${case}_after_metrics.txt
after_fg=$FILE_GATHERING_TIME
after_db=$DB_COMPARISON_TIME
after_total=$(echo "$after_fg + $after_db" | bc)
if [ -n "$before_total" ] && [ -n "$after_total" ] && [ "$after_total" != "0" ]; then
ratio=$(echo "scale=2; $before_total / $after_total" | bc)
if (( $(echo "$before_total > $after_total" | bc -l) )); then
direction="slower"
else
direction="faster"
fi
printf " ${before_total}/${after_total} = ${ratio}x ${direction} |" >> ${summary_file}
else
printf " N/A |" >> ${summary_file}
fi
done
echo "" >> ${summary_file}
echo "" >> ${summary_file}
echo "Individual case reports available at: ${ANALYSIS_PATH}*_report.txt" >> ${summary_file}
# Display summary
cat ${summary_file}
echo ""
echo "Summary table saved to: ${summary_file}"
}
###############################################################################
# Test Cases
###############################################################################
add1()
{
cd ${EXTRACTED_DIRECTORY_PATH}
echo "top level" > added_file0.txt
echo "build" > build/added_file1.txt
echo "init" > init/added_file2.txt
echo "run" > run/added_file3.txt
cd -
}
undo_add1()
{
cd ${EXTRACTED_DIRECTORY_PATH}
rm added_file0.txt
rm build/added_file1.txt
rm init/added_file2.txt
rm run/added_file3.txt
cd -
}
add2()
{
cd ${EXTRACTED_DIRECTORY_PATH}
cp -r /global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201/build/ copied_build/
cd -
}
undo_add2()
{
cd ${EXTRACTED_DIRECTORY_PATH}
rm -rf copied_build/
cd -
}
add3()
{
cd ${EXTRACTED_DIRECTORY_PATH}
cp -r /global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201/init/ copied_init/
cd -
}
undo_add3()
{
cd ${EXTRACTED_DIRECTORY_PATH}
rm -rf copied_init/
cd -
}
# Case 1: Add a few small files
case_name=add_few_small_files
add1
run_profiles ${case_name}
undo_add1
# Case 2: Add many small files (build/)
case_name=add_many_small_files
add2
run_profiles ${case_name}
undo_add2
# Case 3: Add a few large files (init/)
case_name=add_few_large_files
add3
run_profiles ${case_name}
undo_add3
###############################################################################
# Generate final summary table
###############################################################################
generate_summary_table

Results

Performance

Manual fixes to script output

Unfortunately, it appears the table generation in the script was rather imperfect. No matter, we can manually get the data from the logs. Let's construct two tables manually:

Time (sec) profile (simple
Memory (at-peak, MB) profile (simple
We obtained speedups from 22.82x to 33.32x in combined file gathering + database comparison time. The tradeoff is ~3.5x more memory usage during the file gathering phase and a very consistent 32.56 MB used for database comparison.

File lists

cd /pscratch/sd/f/forsyth/performance_profiling/analysis
wc -l add_few_small_files_before_metrics.txt.files # 4
wc -l add_few_small_files_after_metrics.txt.files # 4
diff add_few_small_files_before_metrics.txt.files add_few_small_files_after_metrics.txt.files
# Good, no diff
wc -l add_many_small_files_before_metrics.txt.files # 1000
wc -l add_many_small_files_after_metrics.txt.files # 1000
diff add_many_small_files_before_metrics.txt.files add_many_small_files_after_metrics.txt.files
# Large diff
# Same number of files though. Are they simply in a different order?
wc -l add_few_large_files_before_metrics.txt.files # 14
wc -l add_few_large_files_after_metrics.txt.files # 14
diff add_few_large_files_before_metrics.txt.files add_few_large_files_after_metrics.txt.files
# Good, no diff

Manual fixes to script output

Iterating with a file comparison function gets us to:

Let's try checking the logs themselves. Maybe the script incorrectly filtered files.

cd /pscratch/sd/f/forsyth/performance_profiling/logs
diff update_add_many_small_files_before_changes.log update_add_many_small_files_after_changes.log > diff_many_small_files.txt

Unfortunately, that's still showing a diff. Let's expand the comparison function to a script:

from pathlib import Path
from collections import Counter
def analyze_paths(file_list, name):
print(f"\n=== Analysis for {name} ===")
# Get all unique directories
dirs = set(Path(f).parent.as_posix() for f in file_list)
print(f"Unique directories: {len(dirs)}")
# Check depth distribution
depths = Counter(len(Path(f).parts) for f in file_list)
print(f"Depth distribution: {sorted(depths.items())}")
# Check if files are in CMakeFiles subdirs
in_cmakefiles = sum(1 for f in file_list if 'CMakeFiles' in f)
print(f"Files in CMakeFiles dirs: {in_cmakefiles}")
# Check specific patterns
has_mod_stamp = sum(1 for f in file_list if f.endswith('.mod.stamp'))
has_mod = sum(1 for f in file_list if f.endswith('.mod') and not f.endswith('.mod.stamp'))
has_o = sum(1 for f in file_list if f.endswith('.o'))
print(f"Files ending in .mod.stamp: {has_mod_stamp}")
print(f"Files ending in .mod (not .stamp): {has_mod}")
print(f"Files ending in .o: {has_o}")
# Sample some paths
print(f"First 3 paths: {file_list[:3]}")
print(f"Last 3 paths: {file_list[-3:]}")
def compare_files(file1_path, file2_path):
# Read the first file into a list
with open(file1_path, 'r') as f1:
lines1 = [line.strip() for line in f1]
# Read the second file into a list
with open(file2_path, 'r') as f2:
lines2 = [line.strip() for line in f2]
# Convert to sets for comparison
set1 = set(lines1)
set2 = set(lines2)
# Find lines in both, only in list1, and only in list2
in_both = sorted(set1 & set2)
only_in_1 = sorted(set1 - set2)
only_in_2 = sorted(set2 - set1)
d = {
'files_in_both': in_both,
'files_in_list1': only_in_1,
'files_in_list2': only_in_2
}
print(f"Counts: both={len(d['files_in_both'])}, list1={len(d['files_in_list1'])}, list2={len(d['files_in_list2'])}")
# Check for patterns in what changed
list1_extensions = set(f.split('.')[-1] for f in d['files_in_list1'])
list2_extensions = set(f.split('.')[-1] for f in d['files_in_list2'])
print("Extensions in both lists:", list1_extensions & list2_extensions)
print("Extensions only in list1:", list1_extensions - list2_extensions)
print("Extensions only in list2:", list2_extensions - list1_extensions)
# Check path patterns
list1_has_src = sum('copied_build/' in f for f in d['files_in_list1'])
list2_has_src = sum('copied_build/' in f for f in d['files_in_list2'])
print(f"Files with 'copied_build/' in path: list1={list1_has_src}, list2={list2_has_src}")
# Check basenames
list1_basenames = {Path(f).stem for f in d['files_in_list1']}
list2_basenames = {Path(f).stem for f in d['files_in_list2']}
common_basenames = list1_basenames & list2_basenames
print(f"Files with same base name: {len(common_basenames)}")
print(f"Examples: {list(common_basenames)[:10]}")
analyze_paths(d['files_in_list1'], "lstat")
analyze_paths(d['files_in_list2'], "scandir")
analyze_paths(d['files_in_both'], "both")
if __name__ == "__main__":
compare_files("/pscratch/sd/f/forsyth/performance_profiling/analysis/add_many_small_files_before_metrics.txt.files", "/pscratch/sd/f/forsyth/performance_profiling/analysis/add_many_small_files_after_metrics.txt.files")

Unfortunately, this still isn't giving much to go on. Let's try going back to the logs:

cd /pscratch/sd/f/forsyth/performance_profiling/logs
wc -l update_add_many_small_files_before_changes.log # 7,086
wc -l update_add_many_small_files_after_changes.log # 7,085
# The one-line diff is simply because `INFO: Gathering list of files to archive` doesn't get printed in the after-case
# That's 7,086-13=7,073 and 7,085-12=7,073 files each.
# Which means two things:
# 1. Both cases added 7,073 files.
# 2. The summarizing script extracted only 1,000 out of 7,073 files from each log.

That last point is because of the `grep -A 1000` in the script. It didn't account for needing to check more than 1,000 lines later. So, are these 7,073 files identical once sorted?

tail -n 7073 update_add_many_small_files_before_changes.log > ../analysis/add_many_small_files_before_metrics.txt.files_manual
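As an aside, the 1,000-line cap from `grep -A 1000` in `extract_metrics()` could be avoided entirely with a marker-based `awk` pass that prints every qualifying line after the marker, no matter how many there are. A sketch using the same filters as the script (the sample log content is illustrative):

```shell
# Sample log mimicking the zstash -v output format (illustrative content):
printf 'INFO: start\nList of files to be updated\nfile_a\nDEBUG: noise\n\nfile_b\n' > sample.log
# Print all non-empty, non-DEBUG/INFO lines after the marker, with no fixed
# line limit (unlike `grep -A 1000`):
awk 'found && !/^(DEBUG|INFO):/ && NF { print } /List of files to be updated/ { found = 1 }' sample.log
# -> file_a
#    file_b
```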
tail -n 7073 update_add_many_small_files_after_changes.log > ../analysis/add_many_small_files_after_metrics.txt.files_manual

Now we find: We find that the diff in the

Summary

Performance
Accuracy (i.e., do the list of files match up?)
@chengzhuzhang @golaz @wlin7 @TonyB9000 I've done further performance profiling (here & here).

Conclusions, copied from above:

Performance
Accuracy (i.e., do the list of files match up?)
@forsyth2 Nice results! The speedup is much more significant compared to those from the small tests. This is great news.
Closing in favor of #420. |




Summary
Objectives:
`zstash update`

Issue resolution:
Select one: This pull request is...
Big Change
1. Does this do what we want it to do?
Required:
If applicable:
2. Are the implementation details accurate & efficient?
Required:
If applicable:
`zstash/conda`, not just an `import` statement.
Required:
zstash. This is only a performance improvement.4. Is this code clean?
Required:
If applicable: