Sales Forecast ML Model

Overview

This project builds an end-to-end sales forecasting machine learning pipeline using the Walmart Store Sales Forecasting dataset. The goal is to predict weekly sales at a store–department level by leveraging historical sales data, store metadata, economic indicators, and engineered time-series features.

The project is implemented entirely in Python, following a clean, modular structure similar to real-world production ML workflows.

Dataset

Source: Walmart Recruiting – Store Sales Forecasting (Kaggle)

Raw files used:

train.csv – historical weekly sales (target variable)
features.csv – economic indicators and markdown data
stores.csv – store metadata (type and size)
test.csv – future periods for prediction (not yet scored by the pipeline; see Future Improvements)

The datasets are merged into a single training table on the following keys:

Store, Date, IsHoliday (train.csv + features.csv)
Store (stores.csv metadata)
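The two joins described above can be sketched with pandas as follows. This is a minimal illustration, not the contents of load_data.py: the function name merge_raw_tables and the use of left joins are assumptions, and the tiny frames are made-up stand-ins for the Kaggle CSVs.

```python
import pandas as pd


def merge_raw_tables(train: pd.DataFrame,
                     features: pd.DataFrame,
                     stores: pd.DataFrame) -> pd.DataFrame:
    """Join the three raw tables into one training table (hypothetical helper)."""
    # train and features share Store, Date and IsHoliday
    merged = train.merge(features, on=["Store", "Date", "IsHoliday"], how="left")
    # store metadata (Type, Size) is keyed by Store only
    merged = merged.merge(stores, on="Store", how="left")
    return merged


# Made-up rows standing in for train.csv, features.csv and stores.csv
train = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [100.0, 120.0],
    "IsHoliday": [False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Temperature": [42.3, 38.5],
    "IsHoliday": [False, True],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [151315]})

table = merge_raw_tables(train, features, stores)
print(table.columns.tolist())
```

A left join keeps every row of train.csv even if a feature row is missing, which makes gaps visible as NaN instead of silently dropping sales records.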

Project Structure

project-root/
│
├── data/
│   └── raw/
│       ├── train.csv
│       ├── features.csv
│       ├── stores.csv
│       └── test.csv
│
├── src/
│   ├── load_data.py           # Load and merge raw datasets
│   ├── preprocess.py          # Data cleaning and preprocessing
│   ├── feature_engineering.py # Lag and rolling features
│   └── model_baseline.py      # Baseline + ML model training
│
├── main.py                   # Project entry point
├── .gitignore
└── README.md

Data Preprocessing

Key preprocessing steps:

Filled missing MarkDown values with 0
Extracted time-based features:
- Year
- Month
- Week (ISO calendar)
Encoded categorical store type:
- A → 0, B → 1, C → 2
Sorted data by Store, Dept, and Date to preserve time order
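The preprocessing steps listed above could look roughly like this in pandas. It is a sketch under assumptions: the function name preprocess is illustrative, markdown columns are assumed to be those whose names start with "MarkDown", and the sample rows are invented.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from this section (hypothetical helper)."""
    df = df.copy()

    # Missing markdown values are treated as "no markdown"
    markdown_cols = [c for c in df.columns if c.startswith("MarkDown")]
    df[markdown_cols] = df[markdown_cols].fillna(0)

    # Calendar features derived from the Date column
    df["Date"] = pd.to_datetime(df["Date"])
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    df["Week"] = df["Date"].dt.isocalendar().week.astype(int)

    # Encode store type as an integer: A -> 0, B -> 1, C -> 2
    df["Type"] = df["Type"].map({"A": 0, "B": 1, "C": 2})

    # Keep each Store/Dept series in chronological order
    return df.sort_values(["Store", "Dept", "Date"]).reset_index(drop=True)


# Made-up rows, deliberately out of date order
sample = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": ["2010-02-12", "2010-02-05"],
    "MarkDown1": [None, 250.0],
    "Type": ["A", "A"],
})
clean = preprocess(sample)
print(clean)
```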

Feature Engineering

To capture temporal patterns in sales data, the following features were added:

Lag features
- lag_1 (previous week sales)
Rolling statistics
- roll_mean_4 (4-week rolling average)

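Rows with insufficient history after lag creation (the first weeks of each series, where the lag or rolling value is missing) were dropped so the model trains only on rows with complete features.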
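A possible way to build these two features is sketched below. The function name add_lag_features is hypothetical, and two details are assumptions not stated in this README: features are computed per Store/Dept group, and the rolling mean is taken over the four weeks before the current one (via shift(1)) so the current week's target never feeds its own feature.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag_1 and roll_mean_4 per Store/Dept series (hypothetical helper)."""
    df = df.sort_values(["Store", "Dept", "Date"]).copy()
    sales = df.groupby(["Store", "Dept"])["Weekly_Sales"]

    # Previous week's sales within the same store and department
    df["lag_1"] = sales.shift(1)

    # Mean of the four weeks *before* the current week (assumption: shifted
    # by one so the current target is excluded from the window)
    df["roll_mean_4"] = sales.transform(lambda s: s.shift(1).rolling(4).mean())

    # Early rows of each series lack history and are dropped
    return df.dropna(subset=["lag_1", "roll_mean_4"]).reset_index(drop=True)


# Made-up single series with six weeks of sales
demo = pd.DataFrame({
    "Store": [1] * 6,
    "Dept": [1] * 6,
    "Date": pd.date_range("2010-02-05", periods=6, freq="7D"),
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
with_lags = add_lag_features(demo)
print(with_lags[["Date", "Weekly_Sales", "lag_1", "roll_mean_4"]])
```

With this shifted window, the first four rows of each series are dropped; an unshifted 4-week window would keep more rows but include the current week's sales in the feature.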

Models

Baseline Model

Naive baseline using historical averages
Baseline RMSE: ~21,000
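One plausible reading of "historical averages" is to predict each store-department pair's mean weekly sales from the training period. The sketch below follows that reading; the function name, the per-pair grouping, and the global-mean fallback are assumptions, and the numbers are invented rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def historical_average_baseline(train: pd.DataFrame, valid: pd.DataFrame) -> float:
    """RMSE of predicting each Store/Dept's historical mean (hypothetical helper)."""
    keys = ["Store", "Dept"]
    means = train.groupby(keys)["Weekly_Sales"].mean().rename("pred").reset_index()
    scored = valid.merge(means, on=keys, how="left")

    # Pairs unseen in training fall back to the overall mean (assumption)
    scored["pred"] = scored["pred"].fillna(train["Weekly_Sales"].mean())

    errors = scored["Weekly_Sales"] - scored["pred"]
    return float(np.sqrt((errors ** 2).mean()))


# Made-up history and hold-out rows
history = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "Dept": [1, 1, 2, 2],
    "Weekly_Sales": [10.0, 20.0, 100.0, 100.0],
})
holdout = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 2],
    "Weekly_Sales": [18.0, 104.0],
})
baseline_rmse = historical_average_baseline(history, holdout)
print(round(baseline_rmse, 4))
```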

Machine Learning Model

LightGBM Regressor
Automatically handles non-linear relationships and feature interactions
Trained using time-aware splits (no random shuffling)

Final RMSE: ~3,160

This is roughly an 85% reduction in RMSE relative to the baseline's ~21,000.
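A time-aware split followed by a LightGBM fit might be sketched as below. The helper names, the 20% hold-out fraction, and the hyperparameters are illustrative assumptions, not the settings used in model_baseline.py. LightGBM is imported inside fit_lightgbm so the split helper can be tried without it installed.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, date_col: str = "Date",
                     valid_fraction: float = 0.2):
    """Hold out the most recent dates for validation; no shuffling."""
    dates = sorted(df[date_col].unique())
    n_valid = max(1, int(len(dates) * valid_fraction))
    cutoff = dates[-n_valid]
    train = df[df[date_col] < cutoff]
    valid = df[df[date_col] >= cutoff]
    return train, valid


def fit_lightgbm(train: pd.DataFrame, valid: pd.DataFrame,
                 feature_cols, target: str = "Weekly_Sales"):
    """Fit an LGBMRegressor and report validation RMSE (placeholder settings)."""
    import lightgbm as lgb  # local import: only needed when training

    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[feature_cols], train[target])
    preds = model.predict(valid[feature_cols])
    rmse = float(((valid[target] - preds) ** 2).mean() ** 0.5)
    return model, rmse


# Made-up panel: two stores, ten weekly dates each
weeks = pd.date_range("2010-02-05", periods=10, freq="7D")
demo = pd.DataFrame({
    "Store": [1] * 10 + [2] * 10,
    "Date": list(weeks) * 2,
    "Weekly_Sales": [float(i) for i in range(20)],
})
train_df, valid_df = time_based_split(demo)
print(len(train_df), len(valid_df))
```

Splitting on unique dates rather than on row positions keeps every store's rows for a given week on the same side of the cutoff. Calling fit_lightgbm(train_df, valid_df, feature_cols=[...]) with the engineered feature columns would then return the fitted model and its validation RMSE.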

Evaluation Metric

Root Mean Squared Error (RMSE)

RMSE was chosen because:

It heavily penalizes large prediction errors
It is commonly used in regression and forecasting problems
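The first point can be seen in a small numeric example. The two made-up prediction sets below have the same mean absolute error, but the one with a single large miss has a higher RMSE. The function names are illustrative.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean absolute error, shown only for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


actual = [100, 100, 100, 100]
even_pred = [110, 110, 110, 110]   # four errors of 10
spiky_pred = [100, 100, 100, 140]  # one error of 40

print(mae(actual, even_pred), rmse(actual, even_pred))
print(mae(actual, spiky_pred), rmse(actual, spiky_pred))
```

Both prediction sets have an MAE of 10, while RMSE is 10 for the evenly spread errors and 20 for the single large miss.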

How to Run

Clone the repository
Place raw data files in data/raw/
Install dependencies:

pip install pandas numpy matplotlib scikit-learn lightgbm

Run the pipeline:

python main.py

Key Learnings

Time-series forecasting requires time-aware validation, not random splits
Lag and rolling features provide strong predictive power for sales data
Gradient boosting models (LightGBM) perform well on tabular datasets
Clean project structure improves debuggability and scalability

Future Improvements

Add cross-validation using rolling windows
Tune LightGBM hyperparameters
Train separate models per store or department
Generate predictions for test.csv and visualize forecasts
Log experiments and metrics
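As a starting point for the first item, rolling-origin folds could be generated over unique dates as sketched here. This is one possible interpretation of "rolling windows" (an expanding training window with a fixed-size validation block); the function name and parameters are hypothetical.

```python
import pandas as pd


def rolling_window_splits(dates, n_splits: int = 3, valid_size: int = 4):
    """Yield (train_dates, valid_dates) pairs, oldest fold first.

    Each fold validates on the next `valid_size` dates after its training
    window, and the training window grows from fold to fold.
    """
    unique_dates = sorted(set(dates))
    for i in range(n_splits, 0, -1):
        valid_end = len(unique_dates) - (i - 1) * valid_size
        valid_start = valid_end - valid_size
        if valid_start <= 0:
            continue  # not enough history for this fold
        yield unique_dates[:valid_start], unique_dates[valid_start:valid_end]


# Made-up calendar of twelve weekly dates
weeks = pd.date_range("2010-02-05", periods=12, freq="7D")
folds = list(rolling_window_splits(weeks, n_splits=2, valid_size=3))
for train_dates, valid_dates in folds:
    print(len(train_dates), "train weeks ->", len(valid_dates), "validation weeks")
```

Each fold's date lists can then be used to filter the training table (for example with df["Date"].isin(valid_dates)), so every validation block lies strictly after its training data.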

Technologies Used

Python
pandas
NumPy
Matplotlib
scikit-learn
LightGBM
