Skip to content

Add mmCIF -> PDB conversion#107

Merged
rasbt merged 11 commits intoBioPandas:mainfrom
a-r-j:main
Jan 4, 2023
Merged

Add mmCIF -> PDB conversion#107
rasbt merged 11 commits intoBioPandas:mainfrom
a-r-j:main

Conversation

@a-r-j
Copy link
Contributor

@a-r-j a-r-j commented Aug 6, 2022

Fixes #97 #108

Description

Adds support for writing PDB files to filestreams.

I started adding the conversion method proposed by @mrauha.

I ran into an issue writing the test:

from pandas.testing import assert_frame_equal

def test_mmcif_pdb_conversion():
    pdb =PandasPdb().fetch_pdb("3EIY")
    mmcif = PandasMmcif().fetch_mmcif("3EIY")
    mmcif_pdb = mmcif.get_pandas_pdb()
    
    assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
    assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))

Which gives an error relating to atom numbering in the HETATM dataframes:

AssertionError                            Traceback (most recent call last)
/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in <cell line: 11>()
      [8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7)     assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
      [9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8)     assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))
---> [11](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=10) test_mmcif_pdb_conversion()

/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in test_mmcif_pdb_conversion()
      [6](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=5) mmcif_pdb = mmcif.get_pandas_pdb()
      [8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7) assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
----> [9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8) assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))

    [... skipping hidden 2 frame]

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_testing/asserters.py:682, in raise_assert_detail(obj, message, left, right, diff, index_values)
    679 if diff is not None:
    680     msg += f"\n[diff]: {diff}"
--> 682 raise AssertionError(msg)

AssertionError: DataFrame.iloc[:, 1] (column name="atom_number") are different

DataFrame.iloc[:, 1] (column name="atom_number") values are different (100.0 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[left]:  [1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, ...]
[right]: [1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, ...]

Any idea of the source of the discrepancy?

Update: This seems to be fixed now! 🥳

Related issues or pull requests

#97

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./biopandas/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under biopandas/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./biopandas -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./biopandas/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./biopandas

@rasbt
Copy link
Member

rasbt commented Aug 6, 2022

Thanks for adding that! For the test, just curious, does this happen with other structures as well?

@a-r-j
Copy link
Contributor Author

a-r-j commented Aug 6, 2022

It's consistent across a few examples. I narrowed it down to TER records being the problem.

E.g.

# 3eiy.pdb
ATOM   1329  NZ  LYS A 175       8.861  36.709 -13.496  1.00 63.20           N  
ATOM   1330  OXT LYS A 175      10.451  35.432 -10.086  1.00 55.71           O  
TER    1331      LYS A 175                                                      
HETATM 1332  K     K A 176      24.990  43.276   0.005  0.50 24.45           K  
HETATM 1333 NA    NA A 177       1.633  34.181  11.897  1.00 26.73          NA  
# 3eiy.cif
ATOM   1329 N  NZ  . LYS A 1 196 ? 8.861   36.709 -13.496 1.00 63.20 ? 175 LYS A NZ  1 
ATOM   1330 O  OXT . LYS A 1 196 ? 10.451  35.432 -10.086 1.00 55.71 ? 175 LYS A OXT 1 
HETATM 1331 K  K   . K   B 2 .   ? 24.990  43.276 0.005   0.50 24.45 ? 176 K   A K   1 
HETATM 1332 NA NA  . NA  C 3 .   ? 1.633   34.181 11.897  1.00 26.73 ? 177 NA  A NA  1 

Now, I'm not sure how reliably TER records are placed in PDB files (some info here ) and if we can be super comfortable just ignoring it. I imagine it would cause a bit of a headache trying to make selections consistent between the same structure in PDB & .cif formats.

@rasbt
Copy link
Member

rasbt commented Aug 6, 2022

Hmm... I think that we could have it without the TER records and just a note about that in the docs for now. So that people are aware. Otherwise, inferring the TER records would be some additional work (I guess that could be done based on the CIF structure index? But I don't have that much experience with TER). My guess is the primary usecase of BioPandas is data analysis, and the TER records are not super necessary anyways.

So, for now I'd say let's warn people about the TER entries but not worry about it.

@pep8speaks
Copy link

pep8speaks commented Aug 7, 2022

Hello @a-r-j! Thanks for updating this PR.

Line 534:89: E501 line too long (95 > 88 characters)
Line 536:89: E501 line too long (106 > 88 characters)

Comment last updated at 2022-09-03 21:28:28 UTC

@a-r-j
Copy link
Contributor Author

a-r-j commented Aug 7, 2022

Sure, I added an offset arg so people at least have an access point. Hopefully that's a sufficient remedy for most usecases.

@rasbt
Copy link
Member

rasbt commented Aug 7, 2022

I think that works! Do you have a unit test handy? Otherwise, I can help looking into this!

@a-r-j
Copy link
Contributor Author

a-r-j commented Aug 8, 2022

Think I managed to infer it automatically based on the number of chains. There was also an issue in ATOM atom numbering for multichain proteins (again, TER records). So I figured we can infer the appropriate offsets for the ATOM and HETATM dfs from the number of chains.

Added appropriate unittest and seems to check out.

@rasbt
Copy link
Member

rasbt commented Aug 13, 2022

Wow this looks good! Currently traveling and will take one more close look in a calm moment, when I am not out and about. Thanks a lot!

@rasbt
Copy link
Member

rasbt commented Aug 18, 2022

Arg totally missed this and am currently traveling (on mobile). I will make a reminder to check this out as soon as I am back on my computer again!

Copy link
Member

@rasbt rasbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, there is just this one little thing regarding the keys I noticed.

"""
pandaspdb = PandasPdb()
# for a in ["ATOM", "HETATM"]:
for a in self.df.keys():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but doesn't that include the OTHERS key as well? Maybe that could be more specific to the keys we actually care about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, good point. We could add an arg that defaults to ["ATOM", "HETATM"] to allow some more control. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that's the "safer" choice

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@rasbt
Copy link
Member

rasbt commented Sep 3, 2022

Just did a little documentation update (sorry for the delay, I was traveling during the last couple of weeks, and there's tons of stuff to catch up on)

@a-r-j
Copy link
Contributor Author

a-r-j commented Jan 4, 2023

Happy New Year @rasbt ! Anything I can do to get this PR merged? :)

Copy link
Member

@rasbt rasbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks awesome. Thanks so much. And sorry for the long delay. Somehow it slipped through the cracks and I totally missed this PR.

There's just a small comment about the naming above.

self.code = self.data["entry"]["id"][0].lower()
return self

def get_pandas_pdb(self, offset_chains: bool = True, records: List[str] = ["ATOM", "HETATM"]) -> PandasPdb:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this perhaps be called convert_to_pandas_pdb? Just so that it doesn't indicate it's fetching the PDB


##### New Features

- Added a new `PandasMmcif.get_pandas_pdb()` class that converts the mmCIF file into a PDB structure. (Via [Arian Jamasb](https://github.com/a-r-j), PR #[107](https://github.com/rasbt/biopandas/pull/107/files))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above about the method naming

#

__version__ = "0.4.2"
__version__ = "0.5.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry to come back to this, but

__version__ = "0.5.0" should probably be __version__ = "0.5.0dev" so that we can collect some of the other things in the PR, and then we can change it to 0.5.0 when the the next release version goes out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep good point, I'll try to wrap the other two PRs tonight too :)

@rasbt rasbt merged commit da45886 into BioPandas:main Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mmCIF -> PDB conversion method

3 participants