A comprehensive machine learning project for classifying celestial objects (Stars, Galaxies, and Quasars) using data from the Sloan Digital Sky Survey (SDSS). This project implements and compares four different machine learning models to achieve optimal classification accuracy.
- Project Overview
- Dataset Description
- Models Implemented
- Project Structure
- Installation & Setup
- Usage Guide
- Model Performance
- Key Features
- Technologies Used
- Contributing
- License
This project aims to classify celestial objects into three categories:
- Galaxy (0): Extended astronomical objects with billions of stars
- Quasar (1): Quasi-stellar radio sources, extremely distant and luminous objects
- Star (2): Individual self-luminous spheres of plasma held together by their own gravity
The dataset contains photometric data from the Sloan Digital Sky Survey (SDSS), which includes magnitude measurements across multiple filter bands (u, g, r, i, z) and spatial coordinates. Multiple machine learning algorithms are employed and compared to determine the most effective approach for this multi-class classification task.
Automatically classify celestial objects from observational data without manual inspection, enabling scalable processing of large astronomical datasets.
- Total Samples: Thousands of labeled celestial observations
- Classes: 3 (Galaxy, Quasar, Star)
- Features: ~16 astronomical attributes including:
- Positional Data: Right Ascension (ra), Declination (dec)
- Photometric Data: Magnitudes in u, g, r, i, z bands
- Observational Metadata: run, rerun, camcol, field, plate, fiberid, specobjid
- Physical Data: Redshift
Sloan Digital Sky Survey (SDSS) - A comprehensive astronomical survey that has mapped millions of celestial objects.
| Feature | Type | Description |
|---|---|---|
| ra | Continuous | Right Ascension (angular position) |
| dec | Continuous | Declination (angular position) |
| u, g, r, i, z | Continuous | Magnitudes in different filter bands |
| redshift | Continuous | Cosmological redshift value |
| run | Categorical | Observing run identifier |
| rerun | Categorical | Processing run identifier |
| camcol | Categorical | Camera column (1-6) |
| field | Categorical | Field identifier |
| plate | Categorical | Spectroscopic plate identifier |
| fiberid | Categorical | Fiber identifier |
| specobjid | Categorical | Spectroscopic object identifier |
| objid | Categorical | Unique object identifier |
| class | Target | Class label (GALAXY, QSO, STAR) |
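Before training, it can be worth loading the raw table and confirming this schema. A minimal sketch (the `sdss_data.csv` filename is a placeholder for wherever your SDSS export lives):

```python
import pandas as pd

# Placeholder path: substitute the location of your SDSS CSV export.
df = pd.read_csv("sdss_data.csv")

print(df.shape)              # (n_samples, n_columns)
print(df.dtypes)             # which columns are numeric vs. identifiers
print(df["class"].unique())  # expected: ['GALAXY' 'QSO' 'STAR']
```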
- Missing Values: None (clean dataset)
- Class Distribution: Imbalanced - Galaxies are the majority class
- Data Preprocessing: Feature scaling and selection performed per model requirements
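Since the classes are imbalanced, one option (illustrative, not necessarily what the notebooks do) is to quantify the skew and derive balanced class weights that both scikit-learn and Keras can consume:

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

df = pd.read_csv("sdss_data.csv")  # placeholder path, as above

# Quantify the imbalance: galaxies dominate the label distribution.
print(df["class"].value_counts(normalize=True))

# Balanced per-class weights, usable as class_weight=... in scikit-learn
# estimators or (keyed by integer class) in Keras model.fit().
classes = np.unique(df["class"])
weights = compute_class_weight("balanced", classes=classes, y=df["class"])
print(dict(zip(classes, weights)))
```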
File: notebooks/decision_tree.ipynb
- Feature Selection: RFE (Recursive Feature Elimination) selected top 10 features
- Purpose: Interpretable tree-based classification with clear decision rules
- Preprocessing:
- Removed ID columns (objid, specobjid)
- One-hot encoding of categorical features
- RFE feature selection (10 features)
- Output Files:
  - `data/DT_X_rfe_selected.csv` - Selected features
  - `data/DT_y.csv` - Target labels
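The preprocessing pipeline described above might look roughly like this sketch (the raw-file path is a placeholder, and the estimator settings may differ from the notebook):

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("sdss_data.csv")  # placeholder path to the raw table
y = df["class"]

# Drop ID columns and one-hot encode the remaining categorical features,
# as described above.
X = pd.get_dummies(df.drop(columns=["class", "objid", "specobjid"]))

# Recursively drop the weakest feature until the 10 strongest remain.
selector = RFE(DecisionTreeClassifier(random_state=42), n_features_to_select=10)
selector.fit(X, y)
print(list(X.columns[selector.support_]))
```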
Strengths:
- Highly interpretable (decision rules can be visualized)
- Fast training and prediction
- Handles non-linear relationships
- No feature scaling required
Weaknesses:
- Prone to overfitting without proper pruning
- Sensitive to small data variations
File: notebooks/knn.ipynb
- Feature Set: UGRIZ magnitudes (u, g, r, i, z bands)
- Approach: Instance-based learning using distance metrics
- Preprocessing:
- Feature scaling (essential for distance-based algorithms)
- Focus on photometric features only
- Output Files:
  - `data/KNN_X_ugriz.csv` - UGRIZ magnitude features
  - `data/KNN_y.csv` - Target labels
Hyperparameters:
- k = optimal value (determined via cross-validation)
- Distance metric = Euclidean
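A minimal sketch combining the scaling and the cross-validated choice of k (it assumes `data/KNN_y.csv` holds a single label column):

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("data/KNN_X_ugriz.csv")      # u, g, r, i, z magnitudes
y = pd.read_csv("data/KNN_y.csv").squeeze()  # assumes a single label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is fit only on each training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(metric="euclidean"))
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}, cv=5)
grid.fit(X_train, y_train)

print("best k:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```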
Strengths:
- Simple and effective for this dataset
- Non-parametric approach (no training phase)
- Naturally handles multi-class classification
Weaknesses:
- Computationally expensive for large datasets
- Sensitive to feature scaling and dimensionality
- Memory-intensive (stores all training data)
File: notebooks/random_fotrst.ipynb
- Ensemble Method: Multiple decision trees with voting mechanism
- Feature Set: All available features
- Preprocessing:
- One-hot encoding of categorical features
- Feature importance analysis
- Output Files:
  - `data/RF_X_features.csv` - All selected features
  - `data/RF_y.csv` - Target labels
Model Architecture:
- Number of trees = configurable (e.g., 100)
- Bootstrap sampling for diversity
- Majority voting for final predictions
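A minimal sketch of this configuration, including the feature-importance ranking (it assumes `data/RF_y.csv` holds a single label column; hyperparameters are illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = pd.read_csv("data/RF_X_features.csv")
y = pd.read_csv("data/RF_y.csv").squeeze()  # assumes a single label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 100 bootstrap-sampled trees; the class is chosen by majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"test accuracy: {rf.score(X_test, y_test):.3f}")

# Impurity-based importance, averaged across all trees.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```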
Strengths:
- Reduces overfitting compared to single decision tree
- Feature importance ranking
- Handles non-linear relationships well
- Robust to outliers
- Handles class imbalance well when combined with class weighting or balanced sampling
Weaknesses:
- Less interpretable than single decision tree
- Higher computational cost
- Longer training time
File: notebooks/neural_network.ipynb
- Framework: TensorFlow/Keras
- Architecture: Dense neural network with multiple layers
- Feature Set: Comprehensive feature set (SDSS photometry)
- Preprocessing:
- Standardization (mean=0, std=1)
- One-hot encoding of target classes
- Output Files:
  - `data/NN_X_solana.csv` - Preprocessed features
  - `data/NN_y.csv` - Target labels
  - `models/final_neural_network_solana.keras` - Trained model
Network Architecture:
```
Input Layer (n_features)
        ↓
Dense Layer (128 units, ReLU activation)
        ↓
Dropout (0.3)
        ↓
Dense Layer (64 units, ReLU activation)
        ↓
Dropout (0.3)
        ↓
Dense Layer (32 units, ReLU activation)
        ↓
Output Layer (3 units, Softmax activation)
```
Hyperparameters:
- Loss Function: Categorical Crossentropy
- Optimizer: Adam
- Batch Size: 32
- Epochs: 100+ (with early stopping)
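A minimal Keras sketch of the architecture and hyperparameters listed above (it assumes `data/NN_y.csv` already contains one-hot encoded targets; the early-stopping patience is illustrative):

```python
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

X = pd.read_csv("data/NN_X_solana.csv")
y = pd.read_csv("data/NN_y.csv")  # assumed here to hold one-hot encoded targets

model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Early stopping halts training once validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, batch_size=32, callbacks=[early_stop])
model.save("models/final_neural_network_solana.keras")
```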
Strengths:
- Learns complex non-linear patterns
- Excellent for large datasets
- Can achieve high accuracy with proper tuning
- Flexible architecture
Weaknesses:
- Requires more data for optimal performance
- Longer training time
- Less interpretable ("black box")
- Hyperparameter tuning complexity
```
Stars-Galaxy-Quasars-Classification/
│
├── README.md                          # Project documentation (this file)
│
├── load_and_run_all_models.ipynb      # Master notebook to run all models
│
├── notebooks/                         # Individual model notebooks
│   ├── decision_tree.ipynb            # Decision Tree implementation
│   ├── knn.ipynb                      # K-Nearest Neighbors implementation
│   ├── neural_network.ipynb           # Neural Network implementation
│   └── random_fotrst.ipynb            # Random Forest implementation
│
├── data/                              # Processed datasets
│   ├── DT_X_rfe_selected.csv          # Decision Tree features (RFE selected)
│   ├── DT_y.csv                       # Decision Tree labels
│   ├── KNN_X_ugriz.csv                # KNN features (UGRIZ magnitudes)
│   ├── KNN_y.csv                      # KNN labels
│   ├── NN_X_solana.csv                # Neural Network features
│   ├── NN_y.csv                       # Neural Network labels
│   ├── RF_X_features.csv              # Random Forest features
│   └── RF_y.csv                       # Random Forest labels
│
└── models/                            # Saved trained models
    └── final_neural_network_solana.keras  # Pre-trained neural network
```
- Python 3.7 or higher
- Jupyter Notebook or JupyterLab
- pip package manager
```bash
git clone https://github.com/krishparmar22242/Stars-Galaxy-Quasars-Classification.git
cd Stars-Galaxy-Quasars-Classification
```

```bash
# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

Or manually install key dependencies:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn tensorflow keras joblib
```

| Package | Version | Purpose |
|---|---|---|
| pandas | >=1.0.0 | Data manipulation and analysis |
| numpy | >=1.18.0 | Numerical computations |
| scikit-learn | >=0.24.0 | ML algorithms (Decision Tree, KNN, Random Forest) |
| tensorflow | >=2.4.0 | Neural Network framework |
| keras | >=2.4.0 | Neural Network API (part of TensorFlow) |
| matplotlib | >=3.3.0 | Data visualization |
| seaborn | >=0.11.0 | Statistical data visualization |
| joblib | >=1.0.0 | Model serialization |
```bash
jupyter notebook
```

Open and execute `load_and_run_all_models.ipynb` in Jupyter Notebook. This master notebook:
- Loads all preprocessed datasets
- Runs inference on random samples from each model
- Displays predictions and actual labels for comparison
```python
# The notebook will:
# 1. Load Decision Tree model and make predictions
# 2. Load KNN model and make predictions
# 3. Load Random Forest model and make predictions
# 4. Load Neural Network model and make predictions
```

Each model has its own dedicated notebook:
```python
# Open notebooks/decision_tree.ipynb
# Run cells to:
# 1. Load and explore SDSS dataset
# 2. Preprocess and encode features
# 3. Apply RFE for feature selection
# 4. Train Decision Tree model
# 5. Evaluate with confusion matrix and classification report
```

```python
# Open notebooks/knn.ipynb
# Run cells to:
# 1. Load SDSS dataset
# 2. Select UGRIZ magnitude features
# 3. Apply feature scaling
# 4. Train KNN with optimal k value
# 5. Visualize decision boundaries (2D projections)
```

```python
# Open notebooks/random_fotrst.ipynb
# Run cells to:
# 1. Load and preprocess data
# 2. Train Random Forest ensemble
# 3. Analyze feature importance
# 4. Generate confusion matrix and metrics
```

```python
# Open notebooks/neural_network.ipynb
# Run cells to:
# 1. Load SDSS dataset
# 2. Standardize features
# 3. Encode target classes (one-hot)
# 4. Build and train neural network
# 5. Plot training history and confusion matrix
# 6. Generate classification report
```

```python
import pandas as pd
import numpy as np
from tensorflow.keras.models import load_model
import joblib
# Load a specific model
dt_model = joblib.load('models/final_decision_tree_model.pkl')
knn_model = joblib.load('models/final_knn_solana.pkl')
rf_model = joblib.load('models/final_random_forest_model.pkl')
nn_model = load_model('models/final_neural_network_solana.keras')
# Prepare your data
X_new = pd.read_csv('your_data.csv')
# Make predictions
dt_predictions = dt_model.predict(X_new)
knn_predictions = knn_model.predict(X_new)
rf_predictions = rf_model.predict(X_new)
nn_predictions = np.argmax(nn_model.predict(X_new), axis=1)
# Class mapping
class_mapping = {0: 'GALAXY', 1: 'QSO', 2: 'STAR'}
```

The models are evaluated using:
- Accuracy: Overall correctness of predictions
- Precision: True positives among predicted positives
- Recall: True positives among actual positives
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Breakdown of correct/incorrect predictions per class
- ROC-AUC: Area under the receiver operating characteristic curve (extended to this multi-class task via one-vs-rest, where applicable)
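Most of these come straight from scikit-learn's reporting utilities; the toy labels below stand in for a model's test-set output:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels standing in for real model output (0 = GALAXY, 1 = QSO, 2 = STAR).
y_true = np.array([0, 0, 1, 2, 0, 2, 1, 0])
y_pred = np.array([0, 0, 1, 2, 1, 2, 1, 0])

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["GALAXY", "QSO", "STAR"]))
```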
Each model generates:
- Classification Report - Precision, Recall, F1-Score per class
- Confusion Matrix Visualization - Shows misclassification patterns
- Feature Importance (for tree-based models) - Which features matter most
- Training History (for Neural Network) - Accuracy and loss curves
| Model | Interpretability | Training Speed | Prediction Speed | Accuracy | Scalability |
|---|---|---|---|---|---|
| Decision Tree | Excellent | Fast | Very Fast | Moderate | Good |
| KNN | Good | N/A (no training phase) | Slow | Moderate-High | Poor |
| Random Forest | Good | Moderate | Moderate | High | Excellent |
| Neural Network | Poor | Slow | Fast | Very High | Excellent |
- Compare 4 different algorithms on the same dataset
- Understand trade-offs between interpretability and accuracy
- Feature scaling and normalization
- One-hot encoding for categorical variables
- Feature selection using RFE
- Handling of imbalanced classes
- Confusion matrices with visualizations
- Classification reports with per-class metrics
- Feature importance analysis
- Training history plots for neural network
- Central hub to run all models and compare results
- Consistent output format across models
- Easy-to-understand prediction examples
- Pre-trained models included for quick inference
- Ready for deployment without retraining
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing
- Scikit-learn: Machine learning algorithms
- TensorFlow/Keras: Deep learning framework
- Matplotlib & Seaborn: Data visualization
- Jupyter Notebook: Interactive development environment
- Git: Version control
- Python 3.7+: Programming language
- Handles datasets with thousands of samples
- Efficient feature engineering pipelines
- Optimized model training and inference
Sloan Digital Sky Survey (SDSS)
- URL: https://www.sdss.org/
- Free public astronomical database
- Contains data from millions of celestial objects
- Includes photometric and spectroscopic observations
Features Explained:
- Magnitudes (u, g, r, i, z): Brightness measurements in different wavelengths
- Redshift: Indicates distance and recession velocity
- Coordinates (ra, dec): Position in the sky
- Spectroscopic Data: Detailed light spectrum analysis
- Galaxy - Distant collections of billions of stars
- Quasar - Extremely luminous active galactic nuclei
- Star - Individual luminous spheres of plasma in our galaxy
- Class Distribution: Imbalanced with Galaxies being the majority class
- Feature Importance: Color indices (u-g, g-r, etc.) are highly discriminative (see the sketch after this list)
- Model Strengths:
- Random Forest: Best overall balanced performance
- Neural Network: Highest accuracy with proper tuning
- Decision Tree: Most interpretable results
- KNN: Good accuracy with simpler training
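The color-index finding is easy to reproduce from the raw magnitudes. A sketch (the file path is a placeholder):

```python
import pandas as pd

df = pd.read_csv("sdss_data.csv")  # placeholder path to the raw table

# Color indices are differences between adjacent filter-band magnitudes;
# they trace an object's spectral shape, which separates the classes well.
for blue, red in [("u", "g"), ("g", "r"), ("r", "i"), ("i", "z")]:
    df[f"{blue}-{red}"] = df[blue] - df[red]

print(df.groupby("class")[["u-g", "g-r", "r-i", "i-z"]].mean())
```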
Each notebook generates:
- Density plots showing class separation
- Confusion matrices for prediction analysis
- Feature importance rankings
- Training history curves
- ROC/AUC curves (where applicable)
Contributions are welcome! Here's how to help:
1. Fork the repository
2. Create a new branch (`git checkout -b feature/improvement`)
3. Make your changes
4. Commit with descriptive messages (`git commit -am 'Add feature'`)
5. Push to the branch (`git push origin feature/improvement`)
6. Open a Pull Request
- Hyperparameter optimization using GridSearchCV (see the sketch after this list)
- Cross-validation for more robust evaluation
- Additional models (SVM, Gradient Boosting, XGBoost)
- Web deployment with Flask/FastAPI
- Real-time prediction API
- Extended feature engineering
- Handling of edge cases and outliers
- Performance benchmarking
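As a starting point for the first item, a sketch of GridSearchCV over Random Forest hyperparameters (synthetic data stands in for the real feature matrix; the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; swap in the real feature matrix and labels.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=42)

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# Exhaustive 5-fold cross-validated search; refits the best combination.
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```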
This project is open source and available under the MIT License.
Krish Parmar
- GitHub: @krishparmar22242
- Project: Stars-Galaxy-Quasars-Classification
- Scikit-learn Documentation
- TensorFlow/Keras Guide
- Decision Tree Learning
- K-Nearest Neighbors
- Random Forest
This project demonstrates:
- End-to-end machine learning pipeline development
- Multiple algorithm implementation and comparison
- Data preprocessing and feature engineering
- Model evaluation and interpretation
- Deep learning with TensorFlow/Keras
- Classification on multi-class problems
- Handling imbalanced datasets
- Feature importance analysis
- Production-ready model saving and loading
Issue: `ModuleNotFoundError` for tensorflow
Solution: `pip install tensorflow --upgrade`

Issue: Memory error with large datasets
Solution: Use data batching or reduce dataset size for initial testing

Issue: Slow KNN prediction
Solution: This is normal; use Random Forest or Neural Network for faster predictions

Issue: Neural Network not converging
Solution:
- Normalize features
- Adjust learning rate
- Increase epochs
- Check for data quality issues

- Sloan Digital Sky Survey for providing the dataset
- Scikit-learn community for excellent ML tools
- TensorFlow/Keras teams for deep learning framework
- Open source community for continuous support
Last Updated: December 2024
Project Status: Active & Maintained
For questions or issues, please open a GitHub Issue in the repository.