Arpeggia

This is a port of the Arpeggio library to Rust, with a focus on identifying certain protein-protein interactions in PDB and mmCIF files.

Features

  • Parse PDB and mmCIF files
  • Parse user selection of chain groups
  • Extract protein chains and residues
  • Calculate distances between residues
  • Identify protein-protein interactions
    • Steric clashes
    • VdW interactions
    • Hydrophobic interactions
    • Aromatic interactions
    • Cation-pi interactions
    • Ionic interactions
    • Hydrogen bonds
    • Weak hydrogen bonds
    • Disulfide bonds
    • Covalent bonds
  • Calculate SASA (Solvent Accessible Surface Area) at atom, residue, and chain levels
  • Calculate relative SASA (RSA) normalized by MaxASA values
  • Calculate SAP (Spatial Aggregation Propensity) scores for aggregation prediction
  • Calculate Shape Complementarity (SC) scores at protein-protein interfaces
  • Filter calculations to specific chains
  • Output results in various formats (e.g., JSON, CSV, Parquet)
  • Python bindings via PyO3
  • Returns Polars DataFrames for efficient data manipulation

Installation

Python Package (Recommended)

Install using pip:

pip install arpeggia

Or install from source using maturin:

git clone https://github.com/y1zhou/arpeggia.git
cd arpeggia
pip install maturin
maturin develop -v --release --features python

Rust Binary

For the command-line tool, you can install pre-built binaries from the GitHub Releases page, or build from source:

git clone https://github.com/y1zhou/arpeggia.git
cd arpeggia
cargo install --path .

This will install the arpeggia binary to your Cargo binary directory (usually ~/.cargo/bin).
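After either installation route, a quick check can confirm what is actually available. The following is a minimal sketch using only the Python standard library; the helper name `check_arpeggia_install` is hypothetical and not part of arpeggia.

```python
import importlib.util
import shutil


def check_arpeggia_install() -> dict:
    """Report whether the Python package and the CLI binary can be found."""
    return {
        # True if `import arpeggia` would locate a module in this environment
        "python_package": importlib.util.find_spec("arpeggia") is not None,
        # Full path to the `arpeggia` executable on PATH, or None if absent
        "cli_binary": shutil.which("arpeggia"),
    }


if __name__ == "__main__":
    print(check_arpeggia_install())
```

If `cli_binary` is `None` after `cargo install`, the Cargo binary directory is probably missing from `PATH`.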

Usage

Python API

import arpeggia

# Analyze protein contacts
contacts_df = arpeggia.contacts(
    "structure.pdb",
    groups="/",                    # All-to-all chain interactions
    vdw_comp=0.1,                 # VdW radii compensation
    dist_cutoff=6.5,              # Distance cutoff in Ångströms
    ignore_zero_occupancy=False   # Set True to ignore zero occupancy atoms
)
print(f"Found {len(contacts_df)} contacts")
print(contacts_df.head())

# Calculate solvent accessible surface area
# Atom-level (default)
sasa_df = arpeggia.sasa("structure.pdb", level="atom", probe_radius=1.4, n_points=100, model_num=0)
print(f"Calculated SASA for {len(sasa_df)} atoms")

# Residue-level SASA
residue_sasa = arpeggia.sasa("structure.pdb", level="residue")
print(f"Calculated SASA for {len(residue_sasa)} residues")

# Chain-level SASA for specific chains only
chain_sasa = arpeggia.sasa("structure.pdb", level="chain", chains="A,B")
print("Calculated SASA for chains A and B")

# Calculate relative SASA (RSA) normalized by Tien et al. (2013) MaxASA values
rsa_df = arpeggia.relative_sasa("structure.pdb")
print(f"Calculated RSA for {len(rsa_df)} residues")

# Calculate Spatial Aggregation Propensity (SAP) scores for aggregation prediction
sap_df = arpeggia.sap_score("antibody.pdb", level="residue")
print(f"Calculated SAP for {len(sap_df)} residues")

# SAP for specific chains (e.g., antibody heavy and light chains)
sap_hl = arpeggia.sap_score("antibody.pdb", chains="H,L", sap_radius=5.0)
print("Calculated SAP for H and L chains")

# Calculate buried surface area at the interface
bsa = arpeggia.dsasa("structure.pdb", groups="A,B/C,D")
print(f"Buried surface area: {bsa:.2f} Ų")

# Calculate Shape Complementarity at an interface
sc_score = arpeggia.sc("antibody_antigen.pdb", groups="H,L/A")
print(f"Shape Complementarity: {sc_score:.3f}")  # Typical values: 0.5-0.7

# Extract protein sequences
sequences = arpeggia.pdb2seq("structure.pdb")
for chain_id, seq in sequences.items():
    print(f"Chain {chain_id}: {seq}")

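Functions that produce tables, such as contacts, sasa, relative_sasa, and sap_score, return Polars DataFrames for efficient data manipulation. In the example above, dsasa and sc return a single number, and pdb2seq returns a mapping from chain ID to sequence. A DataFrame can be converted to pandas if needed: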

import polars as pl

# Convert to pandas (requires pandas and pyarrow to be installed)
contacts_pd = contacts_df.to_pandas()

# Or save directly to various formats
contacts_df.write_csv("contacts.csv")
contacts_df.write_parquet("contacts.parquet")

Command-Line Interface

The CLI provides the same functionality:

# Analyze contacts
arpeggia contacts -i structure.pdb -o output_dir -g "A,B/C,D" -t csv

# Analyze contacts, ignoring atoms with zero occupancy
arpeggia contacts -i structure.pdb -o output_dir --ignore-zero-occupancy

# Calculate SASA at different levels (atom, residue, chain)
arpeggia sasa -i structure.pdb -o output_dir --level atom
arpeggia sasa -i structure.pdb -o output_dir --level residue
arpeggia sasa -i structure.pdb -o output_dir --level chain

# Calculate SASA for specific chains only
arpeggia sasa -i structure.pdb -o output_dir --level residue --chains "A,B"

# Calculate relative SASA (RSA) for each residue
arpeggia relative-sasa -i structure.pdb -o output_dir

# Calculate SAP scores for aggregation prediction
arpeggia sap -i antibody.pdb -o output_dir --level residue

# Calculate SAP for specific chains (e.g., antibody H and L chains)
arpeggia sap -i antibody.pdb -o output_dir --chains "H,L"

# Calculate buried surface area at the interface
arpeggia dsasa -i structure.pdb -g "A,B/C,D"

# Calculate Shape Complementarity at an interface
arpeggia sc -i antibody_antigen.pdb -g "H,L/A"

# Extract sequences
arpeggia seq structure.pdb

To see all available options:

arpeggia help
arpeggia contacts --help
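To run the contacts command over many structures from a script, the invocation shown above can be assembled and launched with `subprocess`. This is a minimal sketch that uses only the flags appearing in this README (`-i`, `-o`, `-g`, `-t`). The helper names are hypothetical, and creating the output directory beforehand is an assumption, since it is not stated whether the CLI creates it.

```python
import subprocess
from pathlib import Path


def contacts_command(pdb_path, out_dir, groups, file_type="csv"):
    """Build the argument list for one `arpeggia contacts` run."""
    return [
        "arpeggia", "contacts",
        "-i", str(pdb_path),
        "-o", str(out_dir),
        "-g", groups,
        "-t", file_type,
    ]


def run_contacts(pdb_paths, out_root, groups, dry_run=True):
    """Build (and optionally run) one command per input structure."""
    commands = []
    for pdb in map(Path, pdb_paths):
        # One output directory per structure, named after the file stem
        out_dir = Path(out_root) / pdb.stem
        cmd = contacts_command(pdb, out_dir, groups)
        commands.append(cmd)
        if not dry_run:
            # Assumption: create the directory ourselves to be safe
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands without executing anything
    for cmd in run_contacts(["a.pdb", "b.pdb"], "results", "A,B/C,D"):
        print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with the `/` and `,` characters in the groups string.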

Chain Groups Specification

The groups parameter allows you to specify which chains interact with each other:

  • "/" - All chains interact with all chains (including self)
  • "A,B/C,D" - Chains A,B interact with chains C,D
  • "A/" - Chain A interacts with all other chains
  • "A,B/" - Chains A,B interact with all remaining chains
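The rules above can be illustrated with a small parser that expands a groups string into two explicit chain lists. This is a sketch of the semantics as described in the list, not arpeggia's own parser; `parse_groups` is a hypothetical helper, and the handling of an empty left side (for example "/C") is an assumption made by symmetry, since that form is not documented here.

```python
def parse_groups(spec: str, all_chains: list[str]) -> tuple[list[str], list[str]]:
    """Expand a groups string into (left_chains, right_chains)."""
    left_raw, sep, right_raw = spec.partition("/")
    if not sep:
        raise ValueError(f"groups string must contain '/': {spec!r}")

    left = [c.strip() for c in left_raw.split(",") if c.strip()]
    right = [c.strip() for c in right_raw.split(",") if c.strip()]

    if not left and not right:
        # "/" means all chains against all chains, including self
        return list(all_chains), list(all_chains)
    if not right:
        # "A/" means the listed chains against all remaining chains
        right = [c for c in all_chains if c not in left]
    elif not left:
        # Assumption: "/C" mirrors the rule above (not documented)
        left = [c for c in all_chains if c not in right]
    return left, right


if __name__ == "__main__":
    chains = ["A", "B", "C", "D"]
    for spec in ["/", "A,B/C,D", "A/", "A,B/"]:
        print(spec, parse_groups(spec, chains))
```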

Development

To build the Python package in development mode:

pip install maturin polars
maturin develop -v --release --features python
python python/test_arpeggia.py

To run Rust tests:

cargo test

License

MIT License - see LICENSE file for details.

Credit

This project would not be possible without the following resources:

  • Arpeggio: Original Python library for protein-protein interaction analysis.
  • pdbtbx: The structural file parser doing all the heavy lifting.
  • RustSASA: Library for calculating solvent accessible surface area.
  • sc-rs: Library for calculating the Shape Complementarity statistic of Lawrence & Colman (1993).
  • Rosetta: The inspiration for the Spatial Aggregation Propensity (SAP) score calculations.
