This is the repository for the submission *Augmenting Knowledge Graphs for Better Link Prediction*. The repository is structured as follows:
- `augment`. This directory contains the code to run the literal graph augmentation for the input graph. All inputs are tab-separated files. `augment_lp.py` is used to produce a graph for link prediction, and `augment_np.py` is used to produce a graph for numeric prediction.
  - To augment the dataset for link prediction, make sure the directory `data/{dataset}` contains at least four files:
    - `train.txt`: The training entity triples.
    - `valid.txt`: The validation entity triples.
    - `test.txt`: The testing entity triples.
    - `numerical_literals.txt`: The literal triples.
  - Once you have the above files, augment the graph with `python augment/augment_lp.py --dataset {dataset} --bins {bins}`.
  - To augment the dataset for numeric prediction, make sure the directory `data/{dataset}` contains at least four files:
    - `train_kge`: The entity triples.
    - `train_100`: The training literal triples.
    - `dev`: The validation literal triples.
    - `test`: The test literal triples.
  - Once you have the above files, augment the graph with `python augment/augment_np.py --dataset {dataset} --bins {bins}`.
- `pbg`. This directory contains the code to run Darpa Wikidata using PyTorch-BigGraph. To run the code, we recommend installing PyTorch-BigGraph as documented in https://github.com/facebookresearch/PyTorch-BigGraph, and Faiss as documented in https://github.com/facebookresearch/faiss.
- `rotate`. This directory contains the code to run TransE and RotatE with negative sampling. To run the code, we recommend installing the environment documented in https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding.
- `tucker`. This directory contains the code to run DistMult, ComplEx, ConvE, and TuckER with k-N sampling. To run the code, we recommend installing the environment documented in https://github.com/ibalazevic/TuckER.
- `data`. This directory is the default location for the input graphs. The data for this project can be found at: https://drive.google.com/drive/folders/14XtfAsfchsS-gPUZ1_YtP1X3bFiCaOS6?usp=sharing.
- `out`. This directory is the default location for output logs.
- `numeric`. This directory is the default location for numeric prediction logs. Since augmenting the graph for link prediction and for numeric prediction produces different data, we recommend using the directory `data` to store the augmented graph for link prediction, and the directory `numeric` to store the augmented graph for numeric prediction.
The root directory also contains the scripts that serve as entry points to the program:
- `run.py` is the script to run the base embedding models, KBLN, and LiteralE on a given graph.
  - To run link prediction on the dataset: `python run.py --dataset {dataset} --model {model}`
  - To run link prediction on the dataset with PyTorch-BigGraph: `python run.py --dataset {dataset} --model {model} --use_pbg`
  - You can change the input folder if you saved the augmented graphs in another location, as well as the output folder if needed. To run numeric prediction on the dataset: `python run.py --dataset {dataset} --model {model} --input {input_path} --output {output_path}`
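Running all six base models one after another can be scripted with a small driver. A minimal sketch; the model names below are illustrative, so check `run.py` for the identifiers it actually accepts:

```python
import subprocess

# Illustrative model names, not necessarily what run.py expects.
MODELS = ["transe", "rotate", "distmult", "complex", "conve", "tucker"]

def build_commands(dataset):
    """Build one run.py invocation per base model."""
    return [["python", "run.py", "--dataset", dataset, "--model", m] for m in MODELS]

def run_all(dataset):
    """Run the models sequentially, stopping on the first failure."""
    for cmd in build_commands(dataset):
        subprocess.run(cmd, check=True)
```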
- `run_batch.py` is a wrapper around `run.py` for running all six base models on an input graph.
- `summary.py` is the script to collect the best results from the logs. Run `python summary.py --dataset {dataset} --model {model}` to get the best metric for a given model. The program goes through all iterations of the log files and prints the metrics of the epoch with the best validation MRR. The results are ordered by model (if no model is specified), then by MRR, Hits@1, and finally Hits@10.
- `numeric_eval.py` reports the numeric prediction performance (in terms of MAE) for the given dataset and model. To enable numeric prediction, make sure to include the `--save_best` flag so that each model produces an embedding file. `numeric_eval.py` will then call the `numeric_eval.py` script inside `rotate` or `tucker`, depending on the model chosen. Adjust the `--input` flag to point to where the augmented dataset and the embeddings are stored.
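For reference, picking the epoch with the best validation MRR from a training log can be done with a scan like the one below. The log line format here is an assumption made for illustration; the format actually written by this repo's training scripts may differ:

```python
import re

# Assumed per-epoch log line, e.g.:
#   "epoch 12 valid MRR 0.331 test MRR 0.325 hits@1 0.24 hits@10 0.49"
LINE = re.compile(r"epoch (\d+) valid MRR ([\d.]+) test MRR ([\d.]+) hits@1 ([\d.]+) hits@10 ([\d.]+)")

def best_epoch(log_text):
    """Return (epoch, test_mrr, hits1, hits10) for the epoch with the best validation MRR."""
    best = None
    for m in LINE.finditer(log_text):
        epoch, valid_mrr, test_mrr, h1, h10 = m.groups()
        if best is None or float(valid_mrr) > best[0]:
            best = (float(valid_mrr), int(epoch), float(test_mrr), float(h1), float(h10))
    if best is None:
        raise ValueError("no epoch lines found in log")
    return best[1:]
```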