This repository contains the supplemental materials for the JCDL 2022 paper *Specialized Document Embeddings for Aspect-based Similarity of Research Papers* (PDF on ArXiv). Trained models and datasets can be downloaded from GitHub releases and the 🤗 Huggingface model hub.
Try your own papers on 🤗 Huggingface spaces.
We provide a SciBERT-based model for each of the three aspects:
- 🎯 `malteos/aspect-scibert-task`
- 🔨 `malteos/aspect-scibert-method`
- 🏷️ `malteos/aspect-scibert-dataset`
To use these models, you need to install 🤗 Transformers first via `pip install transformers`.

```python
import torch
from transformers import AutoTokenizer, AutoModel
# load model and tokenizer (replace with `aspect-scibert-method` or `aspect-scibert-dataset`)
tokenizer = AutoTokenizer.from_pretrained('malteos/aspect-scibert-task')
model = AutoModel.from_pretrained('malteos/aspect-scibert-task')
papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
{'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
# concatenate title and abstract
title_abs = [d['title'] + ': ' + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
# inference
output = model(**inputs)
# mean-pool the token-level embeddings to get document-level embeddings
mask = inputs['attention_mask'].unsqueeze(-1).float()
embeddings = torch.sum(
    output.last_hidden_state * mask, dim=1
) / torch.clamp(mask.sum(dim=1), min=1e-9)
```
Requirements:

- Python 3.7
- CUDA GPU (for Transformers)
Create a new virtual environment for Python 3.7 with Conda:

```bash
conda create -n aspect-document-embeddings python=3.7
conda activate aspect-document-embeddings
```

Clone the repository and install dependencies:

```bash
git clone https://github.com/malteos/aspect-document-embeddings
cd aspect-document-embeddings
pip install -r requirements.txt
```

The datasets are compatible with 🤗 Huggingface datasets and are downloaded automatically.
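For illustration, a dataset could be loaded with 🤗 `datasets` as in the following sketch; the identifier `paperswithcode_aspects` is taken from the `data_cli.py` call below and may need to be replaced by the actual loading script or Hub name:

```python
from datasets import load_dataset

# assumed identifier; replace with the repository's dataset script or Hub name
dataset = load_dataset('paperswithcode_aspects')
print(dataset)
```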
To create the datasets directly from the Papers With Code data, run the following commands:

```bash
# Download PWC files (for the paper, we downloaded the files on 2020-10-27)
wget https://paperswithcode.com/media/about/papers-with-abstracts.json.gz
wget https://paperswithcode.com/media/about/evaluation-tables.json.gz
wget https://paperswithcode.com/media/about/methods.json.gz
# Build dataset
python -m paperswithcode.dataset save_dataset <input_dir> <output_dir>
```

To reproduce our experiments, follow these steps:
Avg. FastText:

```bash
# Train fastText word vectors
./data_cli.py train_fasttext paperswithcode_aspects ./output/pwc
# Build avg. fastText document vectors
./sbin/paperswithcode/avg_fasttext.sh
```
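As an illustration of the averaging step (the actual pipeline is implemented in `./sbin/paperswithcode/avg_fasttext.sh`), a document vector can be built as the mean of the fastText vectors of its tokens; the model path below is an assumption:

```python
import fasttext
import numpy as np

# illustrative sketch only; the model path is assumed to be the output of the training step above
model = fasttext.load_model('./output/pwc/fasttext.bin')

def avg_fasttext_vector(text: str) -> np.ndarray:
    tokens = text.lower().split()
    vectors = [model.get_word_vector(t) for t in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.get_dimension())

doc_vector = avg_fasttext_vector('BERT: We introduce a new language representation model called BERT')
```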
SciBERT:

```bash
./sbin/paperswithcode/scibert_mean.sh
```
SPECTER:

```bash
./sbin/paperswithcode/specter.sh
```

For retrofitting, we utilize Explicit Retrofitting.
Please follow their instructions to install it and update the `EXPLIREFIT_DIR` variable in the shell scripts accordingly.
Then, you can run these scripts:

```bash
# Create constraints from dataset
./sbin/paperswithcode/explirefit_prepare.sh
# Train retrofitting models
./sbin/paperswithcode/explirefit_avg_fasttext.sh
./sbin/paperswithcode/explirefit_specter.sh
./sbin/paperswithcode/explirefit_scibert_mean.sh
# Generate and evaluate retrofitted embeddings
./sbin/paperswithcode/explirefit_convert_and_evaluate.sh
```
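Conceptually, `explirefit_prepare.sh` derives constraints from the dataset by pairing documents that share an aspect label; the following is a purely hypothetical sketch (field names and output format are assumptions, not the actual schema used by the scripts):

```python
import itertools

# hypothetical schema: papers sharing a task label form positive pairs
papers = [
    {'paper_id': 'p1', 'tasks': {'question-answering'}},
    {'paper_id': 'p2', 'tasks': {'question-answering'}},
    {'paper_id': 'p3', 'tasks': {'machine-translation'}},
]

positive_pairs = [
    (a['paper_id'], b['paper_id'])
    for a, b in itertools.combinations(papers, 2)
    if a['tasks'] & b['tasks']
]

# write assumed tab-separated constraint pairs
with open('constraints_task.tsv', 'w') as f:
    for x, y in positive_pairs:
        f.write(f'{x}\t{y}\n')
```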
The fine-tuned models can be trained with the following scripts:

```bash
# SciBERT
./sbin/paperswithcode/pairwise/scibert.sh
# SPECTER
./sbin/paperswithcode/specter_fine_tuned.sh
# Sentence-SciBERT
./sbin/paperswithcode/sentence_transformer_scibert.sh
```

After generating the document representations for all aspects and systems, the results can be computed and viewed with a Jupyter notebook. Figures and tables from the paper are part of the notebook.

```bash
# Run evaluations for all systems
./eval_cli.py reevaluate
# Open notebook for Tables and Figures
jupyter notebook evaluation.ipynb
# Open notebook for sample recommendations
jupyter notebook samples.ipynb
```
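To illustrate the underlying evaluation idea (this is not the repository's evaluation code), aspect-specific embeddings can be assessed by cosine nearest-neighbour retrieval, e.g. with a simple precision@k:

```python
import numpy as np

# illustrative sketch only: how many of the top-k neighbours share the query's aspect label
def precision_at_k(doc_vectors: np.ndarray, labels: list, k: int = 5) -> float:
    normed = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # do not retrieve the query itself
    hits = 0
    for i, label in enumerate(labels):
        top_k = np.argsort(-sims[i])[:k]
        hits += sum(labels[j] == label for j in top_k)
    return hits / (len(labels) * k)
```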
If you are using our code or data, please cite our paper:

```bibtex
@InProceedings{Ostendorff2022,
title = {Specialized Document Embeddings for Aspect-based Similarity of Research Papers},
booktitle = {Proceedings of the {ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries} ({JCDL})},
author = {Ostendorff, Malte and Blume, Till and Ruas, Terry and Gipp, Bela and Rehm, Georg},
year = {2022},
}
```

License: MIT