This project builds an end-to-end sales forecasting machine learning pipeline using the Walmart Store Sales Forecasting dataset. The goal is to predict weekly sales at a store–department level by leveraging historical sales data, store metadata, economic indicators, and engineered time-series features.
The project is implemented entirely in Python, following a clean, modular structure similar to real-world production ML workflows.
Source: Walmart Recruiting – Store Sales Forecasting (Kaggle)
- train.csv – historical weekly sales (target variable)
- features.csv – economic indicators and markdown data
- stores.csv – store metadata (type and size)
- test.csv – future periods for prediction (used conceptually)
The datasets are merged into a single training table using:
- Store, Date, IsHoliday (train + features)
- Store (stores metadata)
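The two joins above can be sketched as follows. This is a minimal illustration, not the project's actual load_data.py: the function name `merge_raw` is hypothetical, the `Weekly_Sales` and `Temperature` column names are assumed from the Kaggle files, and the tiny frames hold made-up values in place of the real CSVs.

```python
import pandas as pd

# Made-up stand-ins for train.csv, features.csv and stores.csv.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "Weekly_Sales": [100.0, 120.0, 90.0],
    "IsHoliday": [False, True, False],
})
features = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "Temperature": [40.0, 38.0, 41.0],
    "IsHoliday": [False, True, False],
})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [1000, 2000]})


def merge_raw(train, features, stores):
    """Join train with features, then attach store metadata (hypothetical helper)."""
    merged = train.merge(features, on=["Store", "Date", "IsHoliday"], how="left")
    return merged.merge(stores, on="Store", how="left")


df = merge_raw(train, features, stores)
print(df)
```

Left joins keep every training row even when a feature or store record is missing, which makes gaps visible as NaN instead of silently dropping sales history.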
project-root/
│
├── data/
│   └── raw/
│       ├── train.csv
│       ├── features.csv
│       ├── stores.csv
│       └── test.csv
│
├── src/
│   ├── load_data.py            # Load and merge raw datasets
│   ├── preprocess.py           # Data cleaning and preprocessing
│   ├── feature_engineering.py  # Lag and rolling features
│   └── model_baseline.py       # Baseline + ML model training
│
├── main.py                     # Project entry point
├── .gitignore
└── README.md
Key preprocessing steps:
- Filled missing MarkDown values with 0
- Extracted time-based features: Year, Month, Week (ISO calendar)
- Encoded categorical store type: A → 0, B → 1, C → 2
- Sorted data by Store, Dept, and Date to preserve time order
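A rough sketch of these four steps in pandas is shown below. The function name `preprocess` and the `MarkDown1` column in the sample are assumptions for illustration; the real preprocess.py may be organized differently.

```python
import pandas as pd


def preprocess(df):
    """Apply the cleaning steps listed above to a merged table (illustrative)."""
    df = df.copy()

    # Fill missing markdown values with 0 (columns assumed to start with "MarkDown").
    markdown_cols = [c for c in df.columns if c.startswith("MarkDown")]
    df[markdown_cols] = df[markdown_cols].fillna(0)

    # Extract calendar features; Week uses the ISO calendar.
    df["Date"] = pd.to_datetime(df["Date"])
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    df["Week"] = df["Date"].dt.isocalendar().week.astype(int)

    # Encode store type as an integer.
    df["Type"] = df["Type"].map({"A": 0, "B": 1, "C": 2})

    # Sort so that each Store-Dept series is in time order.
    return df.sort_values(["Store", "Dept", "Date"]).reset_index(drop=True)


# Made-up sample rows, deliberately out of order.
sample = pd.DataFrame({
    "Store": [2, 1, 1],
    "Dept": [1, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "MarkDown1": [None, 5.0, None],
    "Type": ["B", "A", "A"],
})
clean = preprocess(sample)
print(clean)
```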
To capture temporal patterns in sales data, the following features were added:
- Lag features: lag_1 (previous week sales)
- Rolling statistics: roll_mean_4 (4-week rolling average)
Rows without enough history to compute the lag and rolling features were dropped, so the model never trains on missing values for these features.
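One way to build these features per store-department series is sketched below. The function name `add_lag_features` and the `Weekly_Sales` column name are assumptions. The sketch also assumes the rolling mean covers the four weeks before the current one (shifted by one week), so the current week's sales never enter its own feature; the source does not say whether the project shifts the window this way.

```python
import pandas as pd


def add_lag_features(df, target="Weekly_Sales"):
    """Add lag_1 and roll_mean_4 within each Store-Dept series (illustrative)."""
    df = df.sort_values(["Store", "Dept", "Date"]).copy()
    grouped = df.groupby(["Store", "Dept"])[target]

    # Previous week's sales within the same series.
    df["lag_1"] = grouped.shift(1)

    # Assumption: mean of the 4 weeks before the current week.
    df["roll_mean_4"] = grouped.transform(lambda s: s.shift(1).rolling(4).mean())

    # Drop rows that lack enough history for either feature.
    return df.dropna(subset=["lag_1", "roll_mean_4"]).reset_index(drop=True)


# Made-up series: one store-department pair, six consecutive weeks.
history = pd.DataFrame({
    "Store": [1] * 6,
    "Dept": [1] * 6,
    "Date": pd.date_range("2010-02-05", periods=6, freq="7D"),
    "Weekly_Sales": [10, 20, 30, 40, 50, 60],
})
out = add_lag_features(history)
print(out[["Date", "Weekly_Sales", "lag_1", "roll_mean_4"]])
```

With six weeks of history only the last two rows survive, because the shifted 4-week window needs four earlier observations.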
- Naive baseline using historical averages
  - Baseline RMSE: ~21,000
- LightGBM Regressor
  - Automatically handles non-linear relationships and feature interactions
  - Trained using time-aware splits (no random shuffling)

Final RMSE: ~3,160
This is roughly an 85% reduction in RMSE relative to the baseline (~21,000 down to ~3,160).
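The time-aware split and the naive baseline can be sketched as below. The helper names (`time_split`, `naive_baseline`, `rmse`), the per-pair averaging, and the cutoff date are all assumptions; the source only says the baseline uses historical averages. The data is made up and will not reproduce the RMSE figures reported here.

```python
import numpy as np
import pandas as pd


def time_split(df, cutoff):
    """Train on rows before the cutoff date, validate on rows from it onward."""
    return df[df["Date"] < cutoff], df[df["Date"] >= cutoff]


def naive_baseline(train, valid, target="Weekly_Sales"):
    """Predict each Store-Dept pair's mean sales over the training period."""
    means = train.groupby(["Store", "Dept"])[target].mean().rename("pred").reset_index()
    scored = valid.merge(means, on=["Store", "Dept"], how="left")
    # Pairs unseen in training fall back to the overall training mean.
    scored["pred"] = scored["pred"].fillna(train[target].mean())
    return scored["pred"].to_numpy()


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Made-up series: one store-department pair, four weeks.
data = pd.DataFrame({
    "Store": [1] * 4,
    "Dept": [1] * 4,
    "Date": pd.date_range("2010-02-05", periods=4, freq="7D"),
    "Weekly_Sales": [100.0, 200.0, 300.0, 400.0],
})
train_df, valid_df = time_split(data, pd.Timestamp("2010-02-19"))
score = rmse(valid_df["Weekly_Sales"], naive_baseline(train_df, valid_df))
print(round(score, 2))
```

Any model, LightGBM included, can be fit on `train_df` and scored on `valid_df` with the same `rmse` function, which keeps the comparison against the baseline like for like.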
- Root Mean Squared Error (RMSE)
RMSE was chosen because:
- It heavily penalizes large prediction errors
- It is commonly used in regression and forecasting problems
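For reference, the standard definition of the metric, with $y_i$ the actual weekly sales, $\hat{y}_i$ the prediction and $n$ the number of validation rows, followed by a small worked comparison (the error values are made up) showing why large errors dominate:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}

% Two sets of absolute errors with the same mean absolute error (1):
% errors (1, 1, 1, 1):  \sqrt{(1 + 1 + 1 + 1)/4}  = 1
% errors (0, 0, 0, 4):  \sqrt{(0 + 0 + 0 + 16)/4} = 2
```

The second set concentrates the same total error into one large miss, and its RMSE doubles.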
- Clone the repository
- Place raw data files in data/raw/
- Install dependencies:
  pip install pandas numpy matplotlib scikit-learn lightgbm
- Run the pipeline:
  python main.py
- Time-series forecasting requires time-aware validation, not random splits
- Lag and rolling features provide strong predictive power for sales data
- Gradient boosting models (LightGBM) perform well on tabular datasets
- Clean project structure improves debuggability and scalability
- Add cross-validation using rolling windows
- Tune LightGBM hyperparameters
- Train separate models per store or department
- Generate predictions for test.csv and visualize forecasts
- Log experiments and metrics
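As a starting point for the rolling-window cross-validation item, the sketch below yields folds whose training window grows over time while each validation window covers the weeks that follow it. The function name `rolling_window_splits` and its parameters are hypothetical; this is one possible scheme, not something the project implements yet.

```python
from datetime import date, timedelta


def rolling_window_splits(dates, n_splits=3, valid_weeks=2):
    """Yield (train_dates, valid_dates) pairs with an expanding training window."""
    unique_dates = sorted(set(dates))
    for k in range(n_splits, 0, -1):
        valid_end = len(unique_dates) - (k - 1) * valid_weeks
        valid_start = valid_end - valid_weeks
        if valid_start <= 0:
            continue  # not enough history for this fold
        yield unique_dates[:valid_start], unique_dates[valid_start:valid_end]


# Eight consecutive weekly dates (made up).
weeks = [date(2010, 2, 5) + timedelta(weeks=i) for i in range(8)]
folds = list(rolling_window_splits(weeks))
for train_dates, valid_dates in folds:
    print(len(train_dates), "train weeks ->", valid_dates[0], "to", valid_dates[-1])
```

Each fold's date lists can then select rows from the feature table, for example with `df[df["Date"].isin(train_dates)]`, so that validation weeks always come after the training weeks.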
- Python
- pandas
- NumPy
- Matplotlib
- scikit-learn
- LightGBM