How to accelerate the Representation step? #2158

shizhediao · 2024-09-24T20:47:20Z

Hi,

I am using BERTopic to process a large dataset (> 10M docs).
Currently, processing 100K docs takes around 12 minutes. Assuming roughly linear scaling, 10M docs would take about 1,200 minutes (20 hours), which is too slow.
Could you suggest any ways to speed this up? Thanks!

Here is my code:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['GOTO_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

from sklearn.cluster import KMeans
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import torch
import time
from transformers import BitsAndBytesConfig
from sklearn.feature_extraction.text import CountVectorizer
import openai
import tiktoken

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech

from sklearn.datasets import fetch_20newsgroups
# from umap import UMAP
# from hdbscan import HDBSCAN
# GPU-accelerated UMAP and HDBSCAN from cuML, replacing the CPU versions commented out above
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import json

# When you have millions of documents or error issues, I would advise increasing the value of min_df as long as the topic representations still make sense:
vectorizer_model = CountVectorizer(stop_words="english", min_df=15, ngram_range=(1, 2))

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=2, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Tokenizer
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Create your representation model
client = openai.OpenAI(api_key="sk-")
# openai_model = OpenAI(client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

openai_model = OpenAI(
    client,
    model="gpt-3.5-turbo", 
    delay_in_seconds=2, 
    chat=True,
    nr_docs=6,
    doc_length=100,
    tokenizer=tokenizer,
    exponential_backoff=True,
    prompt=prompt  # NOTE: `prompt` is never defined in this snippet; define it first or omit this argument
)

# All representation models
representation_model = {
    # "KeyBERT": keybert_model,
    # "OpenAI": openai_model,  # Uncomment if you will use OpenAI
    "MMR": mmr_model,
    "POS": pos_model
}


def read_jsonl(folder_path):
    final_folder_path = os.path.join(folder_path, "result_data/final_output")
    results = []
    for file in os.listdir(final_folder_path):
        with open(os.path.join(final_folder_path, file), 'r') as f:
            for line in f:
                data = json.loads(line)
                text = data['text']
                results.append(text)
    return results
    
NUM_DOCS = 100000
# NUM_CLUSTERS = 100
BATCH_SIZE = 1024
MODEL_NAME = "dunzhang/stella_en_400M_v5"
# DATA_NAME = "robbiegwaldd/dclm-micro"
DATA_NAME = "dclm-pool-400m-1x-100subset-filtered-language-gopherrep-gopherqul-c4-fineweb"
base_model_name = MODEL_NAME.split("/")[-1]
base_data_name = DATA_NAME.split("/")[-1]

OUTPUT_PATH = "output"
output_dir = f"{OUTPUT_PATH}/{base_data_name}_docs_{NUM_DOCS}_model_{base_model_name}"
os.makedirs(output_dir, exist_ok=True)

# ds = load_dataset(DATA_NAME)
# corpus = ds["train"]["text"][:NUM_DOCS]
corpus = read_jsonl(DATA_NAME)[:NUM_DOCS]
print(f"Number of documents: {len(corpus)}")

# load embedding
embeddings = torch.load(f"{output_dir}/corpus_embeddings.pt")

# Create topic model

topic_model = BERTopic(
    # embedding_model=sentence_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    verbose=True,
    calculate_probabilities=False
)

topics, probabilities = topic_model.fit_transform(corpus, embeddings)

freq = topic_model.get_topic_info()
print("Number of topics: {}".format(len(freq)))
a_topic = freq.iloc[1]["Topic"] # Row 1 of get_topic_info(); row 0 is typically the outlier topic (-1)
a_topic_words = topic_model.get_topic(a_topic) # Show the words and their c-TF-IDF scores
print(f"Topic {a_topic}: {a_topic_words}")

This is my log:

2024-09-24 13:38:23,513 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-09-24 13:38:32,403 - BERTopic - Dimensionality - Completed ✓
2024-09-24 13:38:32,404 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [13:38:30.244098] Transform can only be run with brute force. Using brute force.
2024-09-24 13:38:37,905 - BERTopic - Cluster - Completed ✓
2024-09-24 13:38:37,924 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-09-24 13:51:05,230 - BERTopic - Representation - Completed ✓
Number of topics: 858
Topic 0: [('dating', 0.01620230866298107), ('sex', 0.013907349647686902), ('porn', 0.01286772630783898), ('cock', 0.009247801669594189), ('pussy', 0.009216147510464423), ('sexy', 0.007549696844642859), ('fuck', 0.006888647673350913), ('girls', 0.0067619389398692374), ('escorts', 0.0067471337287701135), ('nude', 0.005837785170142008)]
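
The timestamps in the log above already show where the time goes. As a rough sketch (the `stage_durations` helper is hypothetical and not part of BERTopic; it only assumes the `timestamp - BERTopic - Stage - message` line layout seen in this log), the per-stage durations can be computed like this:

```python
from datetime import datetime

# The BERTopic lines from the log above, copied verbatim.
LOG = """\
2024-09-24 13:38:23,513 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-09-24 13:38:32,403 - BERTopic - Dimensionality - Completed ✓
2024-09-24 13:38:32,404 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-09-24 13:38:37,905 - BERTopic - Cluster - Completed ✓
2024-09-24 13:38:37,924 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-09-24 13:51:05,230 - BERTopic - Representation - Completed ✓
"""


def stage_durations(log_text):
    """Return {stage: seconds} from start/"Completed" line pairs (hypothetical helper)."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        parts = line.split(" - ")
        if len(parts) < 4 or parts[1] != "BERTopic":
            continue  # skip lines from other loggers, e.g. cuML info messages
        timestamp = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S,%f")
        stage, message = parts[2], parts[3]
        if message.startswith("Completed"):
            durations[stage] = (timestamp - started[stage]).total_seconds()
        else:
            started[stage] = timestamp
    return durations


for stage, seconds in stage_durations(LOG).items():
    print(f"{stage}: {seconds:.1f}s")
```

For this log, dimensionality reduction and clustering each take under 10 seconds, while the representation step takes about 747 seconds (roughly 12.5 minutes), so almost all of the 12 minutes is spent in the representation models.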

MaartenGr (Maintainer) · 2024-09-25T09:07:11Z

Thank you for sharing the issue. Have you tried going through the example notebook on the README? It also shows some tricks for running c-TF-IDF on large datasets.

It might also be worthwhile to check which representation model is slow for you. You technically have three: MMR, PoS, and c-TF-IDF. I believe you commented out the OpenAI one. Check which one is slow and it would help figure out where to optimize.
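
One way to check which representation model is slow is to time each one in isolation. The sketch below is a generic timing harness, not BERTopic functionality: `time_each` and the stand-in workloads are hypothetical names chosen for illustration, so the block runs without a fitted model.

```python
import time


def time_each(steps):
    """Run each named zero-argument callable once and return {name: seconds}."""
    timings = {}
    for name, fn in steps.items():
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings


# Stand-in workloads so the sketch runs on its own. With a fitted model, each
# callable could instead wrap one representation model, for example:
#   lambda: topic_model.update_topics(corpus, representation_model=mmr_model)
steps = {
    "fast_step": lambda: sum(range(1_000)),
    "slow_step": lambda: time.sleep(0.05),
}

timings = time_each(steps)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds:.3f}s")
```

With the code from the original post, the entries would be the `mmr_model` and `pos_model` objects, each passed to `update_topics` on the already fitted `topic_model`. As far as I know, `update_topics` also recomputes c-TF-IDF on every call, so a run without any representation model gives a useful baseline to subtract.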

6 replies

shizhediao Sep 25, 2024
Author

I haven't gone through the example notebooks. For the large dataset, do you mean this one? https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing

Just to ensure I don't miss other useful materials.

Thanks!

MaartenGr Sep 25, 2024
Maintainer

That is indeed the correct resource. Other than that, the official documentation contains tips and tricks at various places depending on your exact implementation.

shizhediao Oct 14, 2024
Author

Hi @MaartenGr
Thank you so much. I followed the great notebooks, where I mainly use the GPU to accelerate computation.
However, I have 10 million documents, and dimensionality reduction has now become the bottleneck: it takes more than 3 hours.
Do you have any ideas for accelerating the dimensionality reduction (the UMAP step)?

MaartenGr Oct 15, 2024
Maintainer

If you are already using cuML with UMAP and you can see that the GPU is working, then there isn't much you can do to optimize it further. It might also be worthwhile to use a smaller dataset considering UMAP and HDBSCAN generally do not need so many documents to fit well. You can train on a smaller subset, like 2 million documents, and predict the rest. Unless of course, you are also looking into micro clusters.
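
A minimal sketch of the "train on a subset, predict the rest" idea follows. The helper names, the default batch size, and the seed are assumptions made for illustration; the only model methods relied on are `fit(documents, embeddings)`, `transform(documents, embeddings)`, and the `topics_` attribute, and `embeddings` is assumed to support indexing with a list of row indices (as NumPy arrays do).

```python
import random


def split_indices(n_docs, sample_size, seed=42):
    """Return (train_idx, rest_idx): a random sample to fit on, and the remainder."""
    sample_size = min(sample_size, n_docs)
    train_idx = sorted(random.Random(seed).sample(range(n_docs), sample_size))
    train_set = set(train_idx)
    rest_idx = [i for i in range(n_docs) if i not in train_set]
    return train_idx, rest_idx


def fit_on_sample_predict_rest(topic_model, docs, embeddings, sample_size,
                               batch_size=100_000, seed=42):
    """Fit on a random subset, then assign topics to the remaining docs in batches."""
    train_idx, rest_idx = split_indices(len(docs), sample_size, seed)

    # Fit UMAP, HDBSCAN, and the topic representations on the sample only.
    topic_model.fit([docs[i] for i in train_idx], embeddings[train_idx])
    topic_of = dict(zip(train_idx, topic_model.topics_))

    # Predict the remaining documents batch by batch to bound memory use.
    for start in range(0, len(rest_idx), batch_size):
        batch = rest_idx[start:start + batch_size]
        batch_topics, _ = topic_model.transform(
            [docs[i] for i in batch], embeddings[batch]
        )
        topic_of.update(zip(batch, batch_topics))

    # One topic per document, in the original document order.
    return [topic_of[i] for i in range(len(docs))]
```

With the thread's numbers, this would be called with `sample_size=2_000_000` on the 10M-document corpus and its precomputed embeddings. A uniform random sample is only one choice; if rare micro clusters matter, they may be missing from the sample and their documents will end up in other topics or among the outliers.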

shizhediao Oct 16, 2024
Author

Thank you so much for your reply! I just realized that it is a common practice to do sampling when we are dealing with huge data. I will try this way.

Thank you!
