mmCIF coordinate parser
Edoardo Sarti edited this page Mar 11, 2022 · 1 revision
This function parses the coordinate section of mmCIF files obtained from Protein Data Bank resources.
The function returns a data structure with all the collected information, and a data structure with some statistics on the parsed file.
Remark: this parser contains filters that select and modify the information of the original record.
Here is a list of the operations performed by the parser:
- In case of multiple models, it keeps only the first model.
- It keeps only ATOM and HETATM records (note: TER records do not exist in mmCIF).
- It eliminates duplicate records. Duplicate records are infrequent errors in PDB files and should be deleted.
- It classifies AltLoc records. AltLoc records are alternative locations of some residues: to obtain a single physical structure, one AltLoc must be chosen and the others discarded. The AltLoc ID resets at each chain, but not at each residue: this means that, within a chain, each residue can carry only a subset of the chain's alternative locations. Thus, for each chain, AltLoc residues are recorded together with their atoms' occupancy. The AltLoc corresponding to the highest per-residue occupancy is chosen. If there is a tie, the AltLoc with the highest number of residues is chosen. If there is still a tie, the one recorded first is chosen.
- It checks whether the file can be translated into PDB format. For this, the number of atoms must be <100,000 and all chain IDs must be one character long.
- It keeps UNK residues, but only records backbone atoms and the C_beta, if present.
- Residues that are not recognized by the residue parser (i.e., they are not among the 20 standard residues, and they are not UNK) are treated as UNK if the chain's entity is a polypeptide, and are discarded otherwise. To perform this check, the "entity" loop in the header of the mmCIF file has to be accessed: if the parser does not find this information, all these residues are discarded, regardless of the nature of the chain.
- If the line is a HETATM and the residue is an MSE (selenomethionine), it is edited into a MET residue (ATOM record). Editing includes replacing the SE (selenium) atom with the S atom of the MET residue (placed at the same coordinates). All other HETATM records are discarded.
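Three of the filters above (AltLoc selection, the PDB-compatibility check, and the MSE-to-MET edit) can be illustrated with a minimal Python sketch. This is not the parser's actual code: the function names, the tuple and dictionary layouts, and the field names are hypothetical and chosen only for illustration. In particular, "highest per-residue occupancy" is interpreted here as the mean, over residues, of each residue's mean atom occupancy, which is an assumption. The sketch also assumes that the S atom of MET carries the standard name SD.

```python
from collections import defaultdict


def choose_altloc(atoms):
    """Pick one AltLoc ID for a single chain (illustrative sketch).

    atoms: iterable of (altloc_id, residue_id, occupancy) tuples, in file
    order, containing only the atoms that carry an AltLoc ID.
    Returns the chosen AltLoc ID, or None if there are no AltLoc atoms.
    """
    occ = defaultdict(lambda: defaultdict(list))  # altloc -> residue -> occupancies
    order = []  # AltLoc IDs in order of first appearance
    for altloc, resid, occupancy in atoms:
        if altloc not in occ:
            order.append(altloc)
        occ[altloc][resid].append(occupancy)
    if not order:
        return None

    def score(altloc):
        residues = occ[altloc]
        # ASSUMPTION: per-residue occupancy = mean of residue-wise mean occupancies
        per_res = [sum(v) / len(v) for v in residues.values()]
        mean_occ = round(sum(per_res) / len(per_res), 6)
        # Tie-breakers: more residues first, then earliest recorded
        return (mean_occ, len(residues), -order.index(altloc))

    return max(order, key=score)


def fits_pdb_format(n_atoms, chain_ids):
    """True if the structure respects the two PDB-format limits listed above."""
    return n_atoms < 100_000 and all(len(c) == 1 for c in chain_ids)


def mse_to_met(record):
    """Apply the MSE -> MET edit to one atom record (hypothetical dict layout).

    record keys: 'group' (ATOM/HETATM), 'resname', 'atom_name', 'element'.
    Returns an edited copy, the unchanged record for ATOM lines, or None
    for any other HETATM record (which is discarded).
    """
    if record["group"] != "HETATM":
        return record
    if record["resname"] != "MSE":
        return None
    edited = dict(record, group="ATOM", resname="MET")
    if edited["atom_name"] == "SE":
        # Selenium becomes the MET sulfur (SD); coordinates are left untouched
        edited["atom_name"] = "SD"
        edited["element"] = "S"
    return edited
```

For example, `choose_altloc([("A", 1, 0.5), ("B", 1, 0.5), ("B", 2, 0.5)])` selects `"B"`: both AltLocs have the same mean occupancy, but `B` covers more residues.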