This project was created by: Lidor Shachar, Andreas Moen, and Lars Andreas Strand.
It implements a comprehensive Business Intelligence solution for Brazilian e-commerce data (Olist dataset from Kaggle). It demonstrates end-to-end data engineering, analytics, and visualization capabilities, transforming raw data into actionable business insights.
Olist is a Brazilian marketplace connecting small businesses with customers. This project analyzes their operations to optimize delivery routes, predict customer satisfaction, and provide descriptive analytics for strategic decision-making.
- Star Schema: 7 dimensions (Date, Geography, Customer, Product, Seller, Payment Type, Order Status) and 3 fact tables (Sales, Delivery Performance, Customer Reviews)
- SCD Type 2: Implemented for Customer dimension to track historical changes
- Grain: Sales fact at order-item level, Delivery at order level, Reviews at review level (see the query sketch below)
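To make the grain concrete, a typical star-schema query joins the sales fact to a date dimension and rolls the order-item rows up to months. A minimal sketch with pandas and SQLAlchemy; the connection string, table, and column names are illustrative assumptions (the real ones are defined in `Database/schema_creation.sql`):

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection string, table, and column names are assumptions for illustration.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/dw")

# Roll the order-item grain of the sales fact up to monthly revenue.
monthly_revenue = pd.read_sql(
    """
    SELECT d.year, d.month,
           SUM(f.price + f.freight_value) AS revenue
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.order_date_key
    GROUP BY d.year, d.month
    ORDER BY d.year, d.month
    """,
    engine,
)
print(monthly_revenue.head())
```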
- Pentaho Data Integration: Automated ETL jobs loading data from CSV sources into PostgreSQL
- Data Quality: Validation, error handling, and data cleansing steps
- Incremental Loading: Designed for periodic updates
- Descriptive Analytics: Statistical summaries, correlation analysis, and trend identification
- Predictive Analytics: Customer satisfaction prediction using Decision Tree Classifier (71% accuracy)
- Prescriptive Analytics: Delivery route optimization using Linear Programming
- Database: PostgreSQL
- ETL: Pentaho Kettle (PDI)
- Analytics: Python (pandas, scikit-learn, PuLP, matplotlib, seaborn)
- Visualization: PowerBI
- Version Control: Git
- Documentation: Markdown, ERD diagrams
- Dataset: Brazilian E-Commerce Public Dataset by Olist (Kaggle)
- Contributors: Lidor Shachar, Andreas Moen, Lars Andreas Strand
- Professor: Peyman Teymoori
Key answers to business questions derived from our descriptive, predictive, and prescriptive analytics:
- Revenue Trends and Seasonal Patterns: Total revenue peaked in November 2017 ($1.18M), with Q4 showing the highest seasonal performance. Average order value ranges from $130-170, with consistent monthly growth patterns.
- Customer Segmentation and Profitability: Analysis shows varying order volumes across months, with peak periods in Q4 due to holiday shopping. Geographic analysis reveals regional differences in order volumes and delivery performance.
- Geographic Performance Variations: Delivery times vary significantly by region, with cross-region shipments taking longer. Regional analysis shows satisfaction rates ranging from 64.7% in North-Northeast combinations to 86% in Center-West-North combinations.
- Delivery Performance as Satisfaction Driver: Delivery delays are the primary driver of customer satisfaction, with delay days accounting for 59.7% of model importance. Item count (24.3%) and actual delivery days (11.5%) are also significant factors.
- Model Performance: The Decision Tree model achieves 71.2% accuracy on test data, with 85.7% precision and 75.6% recall. Cross-validation shows stable performance at 72.1% ± 0.93% accuracy (a training sketch follows this list).
- Feature Importance Breakdown: Delivery delay days (59.7%), item count (24.3%), actual delivery days (11.5%), estimated delivery days (1.4%), shipping distance (0.8%), freight value (0.6%), and other factors.
- Delivery Route Optimization: Linear programming optimization suggests reassigning shipments to reduce cross-region transport, achieving 7.3% cost savings and a 20% reduction in average delivery time (from 12.4 to 9.9 days); an optimization sketch follows this list.
- Warehouse-to-Customer Assignments: Optimization focuses on regional efficiency, with same-region shipments comprising 86.7% of total volume. This reduces operational complexity and improves delivery reliability.
- Scalability Insights: For peak seasons, the model recommends prioritizing same-region logistics, potentially handling 20% more volume with improved satisfaction rates (from 75.9% to 83.5%).
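A minimal sketch of how the satisfaction model above could be trained with scikit-learn. The connection string, table, and column names are illustrative assumptions; the actual feature engineering lives in the predictive notebook, and the reported metrics (71.2% accuracy, etc.) come from that notebook, not from this sketch:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/dw")

# Table and column names are assumptions, not the notebook's actual schema.
df = pd.read_sql("SELECT * FROM fact_delivery_performance", engine)
features = ["delay_days", "item_count", "actual_delivery_days",
            "estimated_delivery_days", "shipping_distance", "freight_value"]
X, y = df[features], df["is_satisfied"]  # e.g. 1 if review score >= 4

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42)  # depth is a guess
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold CV: {scores.mean():.1%} ± {scores.std():.2%}")
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.1%}")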
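And a toy version of the prescriptive step: a PuLP transportation model that assigns order volume from seller regions to customer regions while penalizing cross-region shipping. All volumes and costs below are made-up illustrations; the notebook derives its inputs from the warehouse:

```python
import pulp

regions = ["South", "Southeast", "Center-West", "North", "Northeast"]
# Illustrative volumes and per-order costs (cross-region shipping costs more).
supply = {"South": 300, "Southeast": 900, "Center-West": 150,
          "North": 80, "Northeast": 250}
demand = {"South": 250, "Southeast": 800, "Center-West": 120,
          "North": 100, "Northeast": 300}
cost = {(i, j): 10 if i == j else 18 for i in regions for j in regions}

prob = pulp.LpProblem("delivery_assignment", pulp.LpMinimize)
ship = pulp.LpVariable.dicts("ship", (regions, regions), lowBound=0)

# Objective: total shipping cost across all region pairs.
prob += pulp.lpSum(cost[i, j] * ship[i][j] for i in regions for j in regions)
for i in regions:  # each seller region ships at most its available volume
    prob += pulp.lpSum(ship[i][j] for j in regions) <= supply[i]
for j in regions:  # each customer region's demand must be met
    prob += pulp.lpSum(ship[i][j] for i in regions) >= demand[j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("total cost:", pulp.value(prob.objective))
for i in regions:
    for j in regions:
        if ship[i][j].value():
            print(f"{i} -> {j}: {ship[i][j].value():.0f} orders")
```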
- PostgreSQL 13+
- Python 3.8+
- Pentaho Data Integration 9+
- PowerBI Desktop
- Brazilian E-Commerce Public Dataset by Olist from Kaggle:
  https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
- Download it as a zip and extract the CSVs for later use
- Create a PostgreSQL database: `dw`
- Run the schema creation script: `Database/schema_creation.sql`
- Configure the connection in the `.env` file (see Analytics Setup step 2)
- Open Pentaho Spoon
- Import the jobs from the `ETL/` folder
- Update the CSV input steps to point to the files you downloaded from Kaggle
- Connect the transformations to your data warehouse
- Execute the `extract_transform_load.kjb` job, or run each transformation individually
NOTE: If you run the transformations independently, load all dimensions before the fact tables
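After the job finishes, you can optionally confirm that all tables were populated with a quick row-count script. A minimal sketch; the connection string and table names are assumptions, so match them to `Database/schema_creation.sql`:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/dw")

# Hypothetical table names -- adjust to the actual schema.
for table in ["dim_date", "dim_customer", "dim_product", "dim_seller",
              "fact_sales", "fact_delivery_performance", "fact_customer_reviews"]:
    count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", engine)["n"][0]
    print(f"{table}: {count} rows")
```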
- Install Python dependencies:

```
pip install pandas numpy matplotlib seaborn scikit-learn pulp sqlalchemy python-dotenv
```
- Create a `.env` file and configure the database connection in `Analytics/.env`:

```
# Database Configuration
DB_URL=postgresql+psycopg2://your_username:your_password@localhost:5432/your_database

# Optional: Additional configurations
DB_HOST=localhost
DB_PORT=5432
DB_NAME=your_database
DB_USER=your_username
DB_PASSWORD=your_password
```

- Run the Jupyter notebooks in order:
- Descriptive
- Predictive
- Prescriptive
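If a notebook fails to connect, the `.env` configuration can be sanity-checked with a short script before debugging further. A minimal sketch using python-dotenv and SQLAlchemy (both in the dependency list above):

```python
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine, text

load_dotenv("Analytics/.env")  # exposes DB_URL etc. as environment variables
engine = create_engine(os.environ["DB_URL"])

with engine.connect() as conn:
    print("connected:", conn.execute(text("SELECT 1")).scalar())
```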
- Open `Dashboard/BI_Dashboard.pbix`
```
├── Database/        # SQL schema and analytical queries
├── ETL/             # Pentaho transformations and jobs
├── Analytics/       # Jupyter notebooks and Python scripts
├── Dashboard/       # PowerBI files and CSV exports
├── Documentation/   # ERD, data dictionary, user guides
├── Screenshots/     # Dashboard visualizations
├── LICENSE          # MIT License
└── README.md        # This file
```
- End-to-end BI solution implementation
- Data warehouse design principles (star schema, SCD)
- ETL best practices and error handling
- Advanced analytics techniques (predictive modeling, optimization)
- Business intelligence visualization
- Professional project documentation and GitHub portfolio management
This is a portfolio project. For suggestions or improvements, please open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Professor Peyman Teymoori for guidance and feedback
- Team members for collaboration
- Olist for providing the dataset through Kaggle
Built as part of a university Business Intelligence course, showcasing professional data engineering and analytics skills.