Skip to content

Use KDTree instead of mdtraj.compute_contacts in _filter_unphysical_traj_masks to mitigate MemoryError and improve performance#158

Merged
sarahnlewis merged 7 commits intomicrosoft:mainfrom
ahmedselim2017:main
Sep 26, 2025
Merged

Conversation

@ahmedselim2017
Copy link
Copy Markdown
Contributor

Hi, first of all thanks for the project!

I was running bioemu for a structure with 994 residues for 10,000 samples. The model and side chain generation ran successfully but I got MemoryError: Unable to allocate 445. GiB for an array with shape (10000, 11954402) and data type float32 in the save_pdb_and_xtc function.

After some digging, I found that the error was caused by the mdtraj.compute_contacts function that is used to find clashes in the _filter_unphysical_traj_masks function. As the mdtraj.compute_contacts function works by first calculating the distances between atoms for each frame of the trajectory (ignoring i+1th and i+2th residues for residue i), it tries to create a huge array with a shape (10000, 11954402). To avoid creating this huge array, I processed each frame one by one.

Also, since we are not directly interested in the distances between all atom pairs but whether if there are any clashes, I have used KDTree to check if there are any clashes without directly computing the distance matrices, which improved the performance of finding clashes dramatically as you can see from the figure below where I found the clashes for the first N frames of the original trajectory. The mdtraj.compute_contacts can only find clashes up to 4000 frames as it gaves memory errors after that.

figure

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

@microsoft-github-policy-service agree

@franknoe
Copy link
Copy Markdown
Collaborator

Thank you for your contribution @ahmedselim2017 ! We'll review this asap.

@franknoe
Copy link
Copy Markdown
Collaborator

I am guessing that for small systems, computing all pairs is more efficient and this approach wins beyond a certain system size. Can we test for a couple of systems of different sizes, and implement a switchover if this assumption is correct?

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

I have performed the same benchmark for another protein with 214 residues and KDTree was still faster (although the difference was smaller compared to the larger structure). If you have any other proteins in mind I can test them too, but I would need to run bioemu & side-chain generation first which may take some time.

figure2

@franknoe
Copy link
Copy Markdown
Collaborator

Thanks! For a simple benchmark you can just run repeats of "GGS", which is extremely flexible, and will also produce a lot of clashes for longer chains. Maybe 50 and 100 residues will be informative - looking at these plots I'm getting the impression that 100 residues may be a good switchover point.

Which GPU are you using?

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

No problem! For the other proteins, I have used A6000 but as it has some other queued jobs, I am using RTX4080 for 50 and 100 residue test runs.

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

The MD equilibrium phase of the GGS repeated sequences did not progressed 1 step in >~10 hours probably because of the unnatural sequence. I will restart the runs with real proteins but it will probably take a similar time.

@franknoe
Copy link
Copy Markdown
Collaborator

franknoe commented Aug 4, 2025

Oh, that's weird. Any idea why that may happen @josejimenezluna ?

I think the code shouldn't just get stuck when encountering a weird sequence. In case if this is a problem of not finding an alignment at least an error message should be returned. Can we create a separate issue for this?

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

ahmedselim2017 commented Aug 6, 2025

The sidechain generation was crashed because of some issues with the mdtraj.join call in reconstruct_sidechains function. I have solved the issue in #160 but now need to regenerate sidechains, which will again take some time...

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

Hi again, I have ran the same benchmark with 50 and 100 residue proteins and as you have predicted for 100 residues or less KDTree was slower so we can use 100 residues as a cutoff.

100 50

Oh, that's weird. Any idea why that may happen @josejimenezluna ?

I think the code shouldn't just get stuck when encountering a weird sequence. In case if this is a problem of not finding an alignment at least an error message should be returned. Can we create a separate issue for this?

By the way I have came across the same issue in other runs that uses --filter_samples False too. Maybe the MD runs do not converge on the nonphysical samples?

@franknoe
Copy link
Copy Markdown
Collaborator

Sounds good! Leaving this to reviewed by @josejimenezluna after his return to office.

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

Hi, hope you are doing well! Are there any updates about the review? If you like, I can run some more tests but 100 residues looks like a good lower bound cutoff as KDTree is nearly the same with mdtraj.compute_contacts.

@sarahnlewis
Copy link
Copy Markdown
Collaborator

Thank you for the great work @ahmedselim2017. I wanted to add a test so I have made a branch from yours, where I added a test checking that the kdtree and mdtraj versions give the same answers for some junk coordinates. The branch is here: https://github.com/microsoft/bioemu/tree/sarahlewis/kdtree-tests. Could you please check if you're happy with those changes, if so, incorporate them in your branch, and I'll approve and merge your PR? I don't want to just make a PR directly from my branch because it feels like taking credit for your work.

One small thing I noticed (because of the hacky way I constructed the test trajectory) is that if a trajectory's 'time' attribute is too short, 'enumerate(traj)' does not iterate through all the frames. So I have changed your kdtree implementation to use 'range(len(traj))' instead.

@ahmedselim2017
Copy link
Copy Markdown
Contributor Author

Of course! Thanks for the feedback, added your changes.

Copy link
Copy Markdown
Collaborator

@sarahnlewis sarahnlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice work

@sarahnlewis sarahnlewis enabled auto-merge (squash) September 22, 2025 20:02
@sarahnlewis sarahnlewis merged commit ccba8db into microsoft:main Sep 26, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants