This repository contains datasets and scripts for analyzing OpenAI Developer Forum posts and GitHub issues from three major vendors (Gemini, Llama, and OpenAI). The goal is to explore community interaction patterns, identify challenges, and analyze trends.
- This folder contains all the figures used in the associated research paper, including charts, diagrams, and visualizations.
This folder contains GitHub issues data collected from three vendors:
- Gemini/: GitHub issues data related to the Gemini project.
- Llama/: GitHub issues data related to the Llama project.
- OpenAI/:
  - github.xlsx: Complete GitHub issues data for OpenAI.
  - github_issues.xlsx: Filtered GitHub issues dataset.
  - sampled_github_issues.xlsx: A sampled subset of the GitHub issues dataset, used for constructing the taxonomy.
This folder contains data collected from the OpenAI Developer Forum:
- annotated_dataset.xlsx: Annotated forum posts dataset, used for constructing the taxonomy.
- posts.csv: Complete dataset of forum posts, including metadata (e.g., titles, links, times).
- users.xlsx: User profile data, including user activities and registration details.
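The column layout of posts.csv is not documented here, so inspecting the header is a sensible first step before any analysis. The following is a minimal sketch using only the Python standard library; the helper names `read_rows` and `summarize_rows` are illustrative and not part of this repository, and the file path in the comment is an assumption about where the file sits.

```python
import csv
from typing import Dict, List, TextIO


def read_rows(handle: TextIO) -> List[Dict[str, str]]:
    """Read every row of an open CSV file into a list of dicts keyed by header."""
    return list(csv.DictReader(handle))


def summarize_rows(rows: List[Dict[str, str]]) -> dict:
    """Report the row count, the column names, and empty cells per column."""
    columns = list(rows[0].keys()) if rows else []
    missing = {
        col: sum(1 for row in rows if not (row.get(col) or "").strip())
        for col in columns
    }
    return {"rows": len(rows), "columns": columns, "missing": missing}


# Hypothetical usage (path assumed relative to the forum data folder):
#   with open("posts.csv", newline="", encoding="utf-8") as f:
#       print(summarize_rows(read_rows(f)))
```

The summary makes it easy to confirm which column holds titles, links, and times before relying on any particular name.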
- Crawling Scripts: Python scripts for extracting posts (chatgpt.py, other.py) and user information (code.py, desc+data.py).
- Data Organization: Includes supporting files (page.txt, url_to_title.py) for organizing and parsing data.
- RQ Analysis and Visualization: Original chart data and analysis scripts for RQ1 and RQ2 are stored in corresponding subfolders, with Excel files (data.xlsx) supporting the analysis.
- Prepare the Environment

  Ensure Python 3.8+ is installed along with the necessary dependencies:

      pip install -r requirements.txt

- Run Crawling Scripts

  Navigate to the Scripts/ folder and run the scripts to collect data:

      python chatgpt.py
      python other.py

- Analyze Data

  For popularity and difficulty analysis, refer to the scripts and data located in the RQ1 and RQ2 folders. To analyze the challenge taxonomy, use the datasets annotated_dataset.xlsx and sampled_github_issues.xlsx. These files provide the data needed to explore and refine the categorization of challenges.
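As a starting point for exploring the taxonomy, the distribution of labels in an annotated dataset can be tallied as sketched below. This is a rough illustration, not the analysis used in the paper: the function name `category_distribution` and the column name `category` are assumptions, so substitute whatever label column the annotated files actually contain.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def category_distribution(
    rows: Iterable[Dict[str, object]], column: str = "category"
) -> List[Tuple[str, int, float]]:
    """Return (label, count, share) tuples, most frequent label first.

    `column` is a hypothetical name for the annotation column.
    """
    labels = []
    for row in rows:
        value = row.get(column)
        # Skip missing cells, including the float NaN that pandas uses for blanks.
        if value is None or (isinstance(value, float) and value != value):
            continue
        label = str(value).strip()
        if label:
            labels.append(label)
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]
```

The rows can come from any loader that yields one dict per record. With pandas and openpyxl installed, `pandas.read_excel("annotated_dataset.xlsx").to_dict("records")` is one way to produce them.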
- Vendor Comparison: Analyze GitHub issues across Gemini, Llama, and OpenAI to identify common challenges and trends.
- Popularity Analysis: Explore trends in the popularity of forum posts over time.
- Difficulty Analysis: Assess the difficulty levels of forum post content.
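The popularity trend can be approximated by counting forum posts per calendar month. The sketch below is one possible approach and not necessarily what the RQ1 scripts do; it assumes a column named `time` holding ISO 8601 timestamps, which should be checked against the real header of posts.csv.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def posts_per_month(
    rows: Iterable[Dict[str, str]], time_column: str = "time"
) -> Tuple[Dict[str, int], int]:
    """Count posts per "YYYY-MM" month and report how many rows were skipped.

    `time_column` is an assumed column name; values are expected in ISO 8601.
    """
    counts: Counter = Counter()
    skipped = 0
    for row in rows:
        raw = (row.get(time_column) or "").strip()
        # datetime.fromisoformat on Python 3.8 rejects a trailing "Z".
        if raw.endswith("Z"):
            raw = raw[:-1] + "+00:00"
        try:
            stamp = datetime.fromisoformat(raw)
        except ValueError:
            skipped += 1
            continue
        counts[stamp.strftime("%Y-%m")] += 1
    # Sort by month so the result reads chronologically.
    return dict(sorted(counts.items())), skipped
```

Returning the number of skipped rows alongside the counts makes it visible when the assumed timestamp format does not match the data.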
Contributions and feedback are welcome!