CaPGNN: Optimizing Parallel Graph Neural Network Training with Joint Caching and Resource-Aware Graph Partitioning
Repository structure:

```
|-- CaPGNN
|   |-- communicator
|   |-- config        # offline configurations of experiments
|   |-- helper
|   |-- manager
|   |-- model         # customized PyTorch modules
|   |-- trainer
|   `-- util
|-- exp               # experiment results
`-- utils
```
News:
- CaPGNN can be easily extended to distributed systems; we have released a demo of the distributed version on the `dist` branch.
- The architecture of the distributed version is shown in the figure in the repository.
Software environment:
- Ubuntu 20.04.6 LTS
- Python 3.9.15
- CUDA 12.1
- PyTorch 2.3.0
- DGL 2.3.0
The following hardware configuration is recommended but not mandatory:
- CPU: 2x Intel® Xeon® Gold 6230
- RAM: 768 GB
- GPUs: 2x NVIDIA Tesla A40, 2x NVIDIA RTX 3090, 2x NVIDIA RTX 3060, 2x NVIDIA GTX 1660Ti.
- PCIe 3.0 x16
Run the following commands to create the virtual environment and install the dependencies. conda can be downloaded from the Anaconda website.
```bash
conda clean --all -y
pip cache purge
conda create -n capgnn python=3.9.12
conda activate capgnn
pip install -r requirements.txt \
    --extra-index-url https://download.pytorch.org/whl/cu121 \
    -f https://data.dgl.ai/wheels/torch-2.3/cu121/repo.html \
    -f https://data.pyg.org/whl/torch-2.3.0+cu121.html
```
Note that the DGL version must be compatible with the installed PyTorch and Python versions. Additionally, if you have a GPU, the PyTorch build must also match your CUDA version.
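To verify the installation, a quick check of the installed stack (a minimal sketch using standard PyTorch and DGL attributes) should report versions matching the list above:

```python
# Minimal sanity check of the installed software stack.
import torch
import dgl

print("PyTorch:", torch.__version__)           # expected: 2.3.0+cu121
print("CUDA (build):", torch.version.cuda)     # expected: 12.1
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
print("DGL:", dgl.__version__)                 # expected: 2.3.0
```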
Datasets are downloaded automatically if they are missing; the benchmark datasets are listed below (a loading sketch follows the table).
| Dataset | Nodes \|V\| | Edges \|E\| | Feature dim | Classes |
|---|---|---|---|---|
| Yelp | 716847 | 13954819 | 300 | 100 |
| Reddit | 232965 | 114615892 | 602 | 41 |
| Flickr | 89250 | 899756 | 500 | 7 |
| CoraFull | 19793 | 126842 | 8710 | 7 |
| AmazonProducts | 1569960 | 264339468 | 200 | 107 |
| CoauthorPhysics | 34493 | 495924 | 8415 | 5 |
| ogbn-products | 2449029 | 61859140 | 100 | 47 |
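For reference, most of these graphs are available as built-in DGL datasets; the sketch below loads the Flickr graph with `dgl.data.FlickrDataset` and prints the statistics above. It is an illustration only, not the repository's own data-loading code:

```python
# Minimal sketch: load one benchmark graph with DGL's built-in datasets
# (illustrative only; CaPGNN's own loaders may differ).
from dgl.data import FlickrDataset

dataset = FlickrDataset()                 # downloaded automatically on first use
g = dataset[0]                            # a single DGLGraph

print("nodes:", g.num_nodes())            # 89250
print("edges:", g.num_edges())            # 899756
print("feature dim:", g.ndata["feat"].shape[1])  # 500
print("classes:", dataset.num_classes)    # 7
```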
Evaluate the GPU communication capabilities (H2D, D2H, and inter-device transfer, IDT):
```bash
python utils/eval_bw.py
```

The output in the console will look like:

```
07:29:28.949017 [0] Rank 0: NVIDIA GeForce RTX 3090
07:30:05.132368 [0] Size: 512M Repeat: 50
07:31:10.345209 [0] HtoD 512M 50/50
07:31:10.346265 [0] Size: 512M Repeat: 50
07:31:17.811319 [0] DtoH 512M 50/50
07:31:17.859560 [0] Size: 512M Repeat: 50
07:31:19.343551 [0] IDT 512M 50/50
Timer Summary:
Key              Total      Ave        Std        Count
--------------------------------------------------
HtoD-512M        3.8012     0.0760     0.0194     50
DtoH-512M        5.3154     0.1063     0.0050     50
IDT-512M         0.0686     0.0014     0.0000     50
total            74.2611    74.2611    0.0000     1
```
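For intuition about what this benchmark measures, host-to-device and device-to-host bandwidth can be probed with plain PyTorch as sketched below; this illustrates the idea only and is not the actual `utils/eval_bw.py`:

```python
# Minimal sketch of an H2D/D2H bandwidth probe with PyTorch
# (illustrative only; utils/eval_bw.py is the benchmark used by CaPGNN).
import time
import torch

def measure(size_mb=512, repeat=50, device="cuda:0"):
    n = size_mb * 1024 * 1024 // 4                      # number of float32 elements
    host = torch.empty(n, dtype=torch.float32, pin_memory=True)
    dev = torch.empty(n, dtype=torch.float32, device=device)

    def timed(fn):
        torch.cuda.synchronize(device)
        t0 = time.perf_counter()
        for _ in range(repeat):
            fn()
        torch.cuda.synchronize(device)
        return (time.perf_counter() - t0) / repeat

    h2d = timed(lambda: dev.copy_(host, non_blocking=True))   # host -> device
    d2h = timed(lambda: host.copy_(dev, non_blocking=True))   # device -> host
    gb = size_mb / 1024
    print(f"HtoD {gb / h2d:.2f} GB/s, DtoH {gb / d2h:.2f} GB/s")

measure()
```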
Evaluate the GPU computation capabilities (SpMM and dense MM):
```bash
python utils/eval_mm.py
```

The output in the console will look like:

```
07:35:59.305771 [0] Rank 0: NVIDIA GeForce RTX 3060
>> spmm
07:35:59.307674 [0] Size: 512M Repeat: 50
07:36:12.598796 [0] 512M 50/50
**********
07:36:12.599498 [0] Timer Summary:
Key              Total      Ave        Std        Count
--------------------------------------------------
matmul-512M      9.7713     0.1954     0.0078     50
total            13.2912    13.2912    0.0000     1
```
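Likewise, the dense (MM) and sparse (SpMM) products that dominate GNN layers can be timed with a short PyTorch sketch; again, this is an illustration only and not the actual `utils/eval_mm.py`:

```python
# Minimal sketch of timing dense MM and SpMM on one GPU
# (illustrative only; utils/eval_mm.py is the benchmark used by CaPGNN).
import time
import torch

device = "cuda:0"
n, f = 8192, 512
dense = torch.randn(n, f, device=device)
weight = torch.randn(f, f, device=device)

# Random adjacency-like sparse matrix with ~0.1% density.
nnz = n * n // 1000
idx = torch.randint(0, n, (2, nnz), device=device)
adj = torch.sparse_coo_tensor(idx, torch.ones(nnz, device=device), (n, n)).coalesce()

def timed(fn, repeat=50):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repeat):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / repeat

print(f"MM:   {timed(lambda: dense @ weight) * 1e3:.3f} ms")
print(f"SpMM: {timed(lambda: torch.sparse.mm(adj, dense)) * 1e3:.3f} ms")
```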
Before training, run cspart.py to partition the corresponding graph into several subgraphs:

```bash
python cspart.py --dataset_index=6 --part_num=4 --our_partition=1 --gpus_index=0
```

Then run main.py to start training:
```bash
python main.py --dataset_index=6 --part_num=4 --gpus_index=0
```

The results are written to the exp directory.
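As background for the partitioning step above, a plain METIS partitioning of a DGL graph looks roughly like the sketch below; it illustrates the general mechanism only and does not reproduce the resource-aware partitioning implemented in `cspart.py`:

```python
# Minimal sketch of plain METIS partitioning with DGL (baseline illustration
# only; cspart.py implements CaPGNN's resource-aware partitioning, which this
# sketch does not reproduce).
import dgl
from dgl.data import FlickrDataset

g = FlickrDataset()[0]
part_num = 4

# Assign every node to one of `part_num` parts, then materialize the subgraphs.
assignment = dgl.metis_partition_assign(g, part_num)
for p in range(part_num):
    nodes = (assignment == p).nonzero(as_tuple=True)[0]
    sub = dgl.node_subgraph(g, nodes)
    print(f"part {p}: {sub.num_nodes()} nodes, {sub.num_edges()} edges")
```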
Adjust the configurations in CaPGNN/config/*.yaml to customize the dataset, model, and training hyperparameters, or to add new configurations.
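The configuration files can be inspected before a run with a short sketch like the following; it assumes they are standard YAML files readable with PyYAML and makes no assumption about specific keys:

```python
# Minimal sketch of inspecting the YAML configurations before a run.
# Assumes the files under CaPGNN/config/ are standard YAML and PyYAML is
# installed; no specific keys are assumed.
import glob
import yaml

for path in sorted(glob.glob("CaPGNN/config/*.yaml")):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    keys = list(cfg.keys()) if isinstance(cfg, dict) else type(cfg).__name__
    print(path, "->", keys)
```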
Copyright (c) 2025. All rights reserved.
Licensed under the MIT License.
If you use this repository in your work, please cite our paper:
```bibtex
@article{SONG2026132978,
  title   = {CaPGNN: Optimizing parallel graph neural network training with joint caching and resource-aware graph partitioning},
  author  = {Xianfeng Song and Yi Zou and Zheng Shi},
  journal = {Neurocomputing},
  volume  = {675},
  pages   = {132978},
  year    = {2026},
  issn    = {0925-2312},
  doi     = {10.1016/j.neucom.2026.132978},
  url     = {https://www.sciencedirect.com/science/article/pii/S0925231226003759},
}
```
