Saad259/Sales-Forecast-ML-Model

Sales Forecast ML Model

Overview

This project builds an end-to-end sales forecasting machine learning pipeline using the Walmart Store Sales Forecasting dataset. The goal is to predict weekly sales at a store–department level by leveraging historical sales data, store metadata, economic indicators, and engineered time-series features.

The project is implemented entirely in Python, following a clean, modular structure similar to real-world production ML workflows.


Dataset

Source: Walmart Recruiting – Store Sales Forecasting (Kaggle)

Raw files used:

  • train.csv – historical weekly sales (target variable)
  • features.csv – economic indicators and markdown data
  • stores.csv – store metadata (type and size)
  • test.csv – future periods for prediction (kept for reference; the current pipeline does not yet generate predictions for it)

The datasets are merged into a single training table using:

  • Store, Date, IsHoliday (train + features)
  • Store (stores metadata)
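
As a rough illustration of the merge described above, the sketch below joins three small in-memory tables on the same keys. The function name `merge_raw_tables` is hypothetical, the left joins are an assumption, and the sample values are made up; the actual logic in `src/load_data.py` may differ.

```python
import pandas as pd


def merge_raw_tables(train: pd.DataFrame,
                     features: pd.DataFrame,
                     stores: pd.DataFrame) -> pd.DataFrame:
    """Join the three raw tables into one training table (hypothetical helper)."""
    # train and features share Store, Date and IsHoliday
    merged = train.merge(features, on=["Store", "Date", "IsHoliday"], how="left")
    # store metadata (Type, Size) is keyed by Store only
    merged = merged.merge(stores, on="Store", how="left")
    return merged


# Tiny stand-ins for the CSV files (illustrative values, not real data)
train = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [100.0, 120.0],
    "IsHoliday": [False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Temperature": [42.3, 38.5],
    "IsHoliday": [False, True],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [151315]})

merged = merge_raw_tables(train, features, stores)
print(merged.columns.tolist())
```

Using `how="left"` keeps every row of `train.csv` even if a matching feature or store row is missing, which makes gaps visible as NaN instead of silently dropping sales records.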

Project Structure

project-root/
│
├── data/
│   └── raw/
│       ├── train.csv
│       ├── features.csv
│       ├── stores.csv
│       └── test.csv
│
├── src/
│   ├── load_data.py            # Load and merge raw datasets
│   ├── preprocess.py           # Data cleaning and preprocessing
│   ├── feature_engineering.py  # Lag and rolling features
│   └── model_baseline.py       # Baseline + ML model training
│
├── main.py                     # Project entry point
├── .gitignore
└── README.md

Data Preprocessing

Key preprocessing steps:

  • Filled missing MarkDown values with 0

  • Extracted time-based features:

    • Year
    • Month
    • Week (ISO calendar)
  • Encoded categorical store type:

    • A → 0, B → 1, C → 2
  • Sorted data by Store, Dept, and Date to preserve time order
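
The preprocessing steps listed above could look roughly like the following. This is a minimal sketch, not the contents of `src/preprocess.py`: the function name `preprocess` is hypothetical, and it assumes the MarkDown columns all start with the prefix `MarkDown` and that `Date` can be parsed by pandas.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the list above (hypothetical helper)."""
    df = df.copy()
    df["Date"] = pd.to_datetime(df["Date"])

    # Fill missing MarkDown values with 0
    for col in [c for c in df.columns if c.startswith("MarkDown")]:
        df[col] = df[col].fillna(0)

    # Time-based features; Week uses the ISO calendar
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    df["Week"] = df["Date"].dt.isocalendar().week.astype(int)

    # Encode store type: A -> 0, B -> 1, C -> 2
    df["Type"] = df["Type"].map({"A": 0, "B": 1, "C": 2})

    # Sort so that each Store/Dept series is in time order
    return df.sort_values(["Store", "Dept", "Date"]).reset_index(drop=True)


# Illustrative rows, deliberately out of order and with a missing MarkDown
raw = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": ["2010-02-12", "2010-02-05"],
    "MarkDown1": [None, 5.0],
    "Type": ["A", "A"],
})
clean = preprocess(raw)
print(clean)
```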


Feature Engineering

To capture temporal patterns in sales data, the following features were added:

  • Lag features

    • lag_1 (previous week sales)
  • Rolling statistics

    • roll_mean_4 (4-week rolling average)

Rows with insufficient history after lag creation (the earliest weeks of each Store–Dept series) were dropped, so the model is never trained on undefined lag or rolling values.
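
One way to build these two features is sketched below. The function name `add_lag_features` is hypothetical. The sketch also assumes the rolling mean is computed over the four weeks before the current one (a shift before the rolling window), so the current week's sales never feed its own feature. The README does not say whether `src/feature_engineering.py` does this, so treat it as an assumption.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, target: str = "Weekly_Sales") -> pd.DataFrame:
    """Add lag_1 and roll_mean_4 per Store/Dept series (hypothetical helper)."""
    df = df.sort_values(["Store", "Dept", "Date"]).copy()
    grouped = df.groupby(["Store", "Dept"])[target]

    # Previous week's sales within the same Store/Dept
    df["lag_1"] = grouped.shift(1)

    # Mean of the four weeks before the current one (assumption: past-only window)
    df["roll_mean_4"] = grouped.transform(lambda s: s.shift(1).rolling(4).mean())

    # Drop the early rows of each series that lack enough history
    return df.dropna(subset=["lag_1", "roll_mean_4"]).reset_index(drop=True)


# Six illustrative weeks for one Store/Dept pair
sales = pd.DataFrame({
    "Store": [1] * 6,
    "Dept": [1] * 6,
    "Date": pd.date_range("2010-02-05", periods=6, freq="7D"),
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
featured = add_lag_features(sales)
print(featured[["Date", "Weekly_Sales", "lag_1", "roll_mean_4"]])
```

Grouping by Store and Dept before shifting matters: a plain `shift(1)` on the whole table would leak the last week of one series into the first week of the next.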


Models

Baseline Model

  • Naive baseline using historical averages
  • Baseline RMSE: ~21,000
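
The README does not spell out how the historical average is computed. One plausible reading, sketched here under that assumption, is to predict each Store/Dept pair's mean sales from the training period and fall back to the global mean for pairs not seen in training. The function name is hypothetical and the numbers are illustrative only.

```python
import pandas as pd


def historical_average_baseline(train: pd.DataFrame,
                                valid: pd.DataFrame,
                                target: str = "Weekly_Sales"):
    """Predict the training-period mean of each Store/Dept (hypothetical helper)."""
    keys = ["Store", "Dept"]
    means = train.groupby(keys)[target].mean().rename("prediction").reset_index()

    # A left merge keeps the validation rows in their original order
    preds = valid[keys].merge(means, on=keys, how="left")["prediction"]

    # Unseen Store/Dept pairs fall back to the global training mean
    return preds.fillna(train[target].mean()).to_numpy()


train_df = pd.DataFrame({
    "Store": [1, 1, 1],
    "Dept": [1, 1, 2],
    "Weekly_Sales": [10.0, 30.0, 100.0],
})
valid_df = pd.DataFrame({"Store": [1, 1, 2], "Dept": [1, 2, 1]})

baseline_preds = historical_average_baseline(train_df, valid_df)
print(baseline_preds)
```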

Machine Learning Model

  • LightGBM Regressor
  • Automatically handles non-linear relationships and feature interactions
  • Trained using time-aware splits (no random shuffling)
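
A time-aware split can be as simple as cutting the data at a date, as in the sketch below. The helper names, the cutoff date, and the LightGBM hyperparameters are all placeholders chosen for illustration, not the values used in `src/model_baseline.py`.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, cutoff: str, date_col: str = "Date"):
    """Train on rows before the cutoff, validate on the rest (no shuffling)."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff_ts]
    valid = df[df[date_col] >= cutoff_ts]
    return train, valid


def fit_lightgbm(train: pd.DataFrame, valid: pd.DataFrame,
                 feature_cols: list, target: str = "Weekly_Sales"):
    """Fit an LGBMRegressor on the training period (hypothetical helper)."""
    # Imported here so the split helper can be used without LightGBM installed
    import lightgbm as lgb

    # Placeholder hyperparameters, not tuned values
    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
    model.fit(
        train[feature_cols], train[target],
        eval_set=[(valid[feature_cols], valid[target])],
        eval_metric="rmse",
    )
    return model


# Six illustrative weeks; the last two become the validation period
weekly = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=6, freq="7D"),
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
train_part, valid_part = time_based_split(weekly, "2010-03-05")
print(len(train_part), len(valid_part))
```

Splitting by date means every validation week lies strictly after every training week, which mirrors how the model would be used to forecast unseen future weeks.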

Final RMSE: ~3,160

This is roughly an 85% reduction in RMSE relative to the baseline (~21,000 to ~3,160).


Evaluation Metric

  • Root Mean Squared Error (RMSE)

RMSE was chosen because:

  • It heavily penalizes large prediction errors
  • It is commonly used in regression and forecasting problems
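
To see why RMSE penalizes large errors heavily, the small example below compares two sets of predictions with the same total absolute error (40): one spreads it across four weeks, the other concentrates it in a single week. The numbers are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = [100.0, 100.0, 100.0, 100.0]
spread_errors = [110.0, 90.0, 110.0, 90.0]     # four errors of 10
one_big_error = [100.0, 100.0, 100.0, 140.0]   # a single error of 40

rmse_spread = rmse(y_true, spread_errors)   # sqrt(400 / 4)  = 10
rmse_big = rmse(y_true, one_big_error)      # sqrt(1600 / 4) = 20
print(rmse_spread, rmse_big)
```

The single large miss doubles the RMSE even though the total absolute error is unchanged, because each error is squared before averaging.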

How to Run

  1. Clone the repository
  2. Place raw data files in data/raw/
  3. Install dependencies:
     pip install pandas numpy matplotlib scikit-learn lightgbm
  4. Run the pipeline:
     python main.py

Key Learnings

  • Time-series forecasting requires time-aware validation, not random splits
  • Lag and rolling features provide strong predictive power for sales data
  • Gradient boosting models (LightGBM) perform well on tabular datasets
  • Clean project structure improves debuggability and scalability

Future Improvements

  • Add cross-validation using rolling windows
  • Tune LightGBM hyperparameters
  • Train separate models per store or department
  • Generate predictions for test.csv and visualize forecasts
  • Log experiments and metrics

Technologies Used

  • Python
  • pandas
  • NumPy
  • Matplotlib
  • scikit-learn
  • LightGBM
