This is the 3rd-prize-winning solution by team JYL (Jeong-Yoon Lee and Abhishek Thakur) for the Countable Care competition at DrivenData.org.
The code assumes the raw data is available under the `data` folder and saves outputs to the `build` folder.
To produce the final submission, run:

```
./run.sh
```

The final submission file will then be available at `build/tst/final_sub.csv`.
Install the Python packages listed in `requirements.txt`: scipy, numpy, scikit-learn, statsmodels, pandas, and Kaggler.
Install the latest XGBoost from source, then copy the `xgboost` binary and `wrapper/libxgboostwrapper.so` into the system bin and lib folders, respectively:

```
git clone git@github.com:dmlc/xgboost.git
cd xgboost
bash build.sh
(sudo) cp xgboost /usr/local/bin
(sudo) cp wrapper/libxgboostwrapper.so /usr/local/lib
```
To install the latest Kaggler package from source:

```
git clone git@github.com:jeongyoonlee/Kaggler.git
cd Kaggler
python setup.py build_ext --inplace
(sudo) python setup.py install
```
8 feature sets are used as follows:

- `feature1`: impute 0 for missing values in numeric and ordinal features; create dummy variables for categorical feature values appearing 10+ times in the training data.
- `feature2`: same as `feature1` except taking the `log(1 + x)` transformation of ordinal features.
- `feature3`: same as `feature2` except creating dummy variables for values appearing 3+ times in the training data.
- `feature4`: same as `feature3` except treating ordinal features as categorical features.
- `feature5`: same as `feature4` except taking the `log2(1 + x)` transformation of ordinal features before treating them as categorical features.
- `feature8`: same as `feature4` except normalizing numeric features.
- `feature9`: impute -1 for missing values, and label-encode categorical features.
- `feature10`: impute 0 for missing values in numeric features, and label-encode categorical features.
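As an illustration, the `feature1` recipe (0-imputation plus dummy variables for frequent categorical values) can be sketched in pandas roughly as follows. This is not the team's actual code; the function name, column lists, and `min_count` parameter are hypothetical:

```python
import pandas as pd

def feature1_style(trn, tst, numeric_cols, categorical_cols, min_count=10):
    """Sketch of a feature1-style transform: impute 0 for missing
    numeric/ordinal values, and one-hot encode only the categorical
    values seen at least min_count times in the training data."""
    trn, tst = trn.copy(), tst.copy()
    for col in numeric_cols:
        trn[col] = trn[col].fillna(0)
        tst[col] = tst[col].fillna(0)
    for col in categorical_cols:
        counts = trn[col].value_counts()
        frequent = counts[counts >= min_count].index
        # rare or unseen values are masked out, so they get no dummy column
        trn[col] = trn[col].where(trn[col].isin(frequent))
        tst[col] = tst[col].where(tst[col].isin(frequent))
        dummies_trn = pd.get_dummies(trn[col], prefix=col)
        # align test dummies to the columns derived from the training data
        dummies_tst = pd.get_dummies(tst[col], prefix=col)
        dummies_tst = dummies_tst.reindex(columns=dummies_trn.columns, fill_value=0)
        trn = pd.concat([trn.drop(columns=col), dummies_trn], axis=1)
        tst = pd.concat([tst.drop(columns=col), dummies_tst], axis=1)
    return trn, tst
```

Deriving the dummy columns from the training data only, then reindexing the test side, keeps the two matrices column-aligned even when the test set contains unseen categories.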
You can generate feature files manually using the relevant Makefiles. For example, to generate the `feature1` files for class 00 out of the 14 classes:

```
make -f Makefile.feature.feature1 build/feature/feature1.trn00.sps
```
Alternatively, you can run an algorithm Makefile that uses `feature1`; the feature files will then be generated automatically before training:

```
make -f Makefile.xg_100_8_0.05_feature1
```
7 different algorithm implementations are used as follows:

- `fm`: Factorization Machine implementation from Kaggler
- `nn`: Neural Network implementation from Kaggler
- `lr`: Logistic Regression implementation from Scikit-Learn
- `gbm`: Gradient Boosting Machine implementation from Scikit-Learn
- `rf`: Random Forest implementation from Scikit-Learn
- `libfm`: Factorization Machine implementation from libFM
- `xg`: Gradient Boosting Machine implementation from XGBoost
From these algorithm implementations and 8 feature sets (see Features), 19 individual models are built as follows:

- `fm_200_4_0.001_feature2`
- `fm_200_8_0.001_feature3`
- `gbm_bagging_40_7_0.1_feature10`
- `libfm_200_4_0.005_feature2`
- `libfm_200_4_0.005_feature4`
- `lr_0.1_feature2`
- `lr_0.1_feature4`
- `nn_20_64_0.005_feature8`
- `nn_20_8_0.01_feature2`
- `nn_20_8_0.01_feature3`
- `rf_400_40_feature2`
- `rf_400_40_feature5`
- `rf_400_40_feature9`
- `rf_400_40_feature10`
- `xg_100_8_0.05_feature1`
- `xg_100_8_0.05_feature8`
- `xg_100_8_0.05_feature9`
- `xg_100_8_0.05_feature10`
- `xg_bagging_120_7_0.1_feature9`
Each model has its own Makefile for training and prediction. For example, to generate predictions for `fm_200_4_0.001_feature2`, run:

```
make -f Makefile.fm_200_4_0.001_feature2
```
Predictions for the training data (via 5-fold CV) and for the test data will be saved in the `build/val` and `build/tst` folders, respectively.
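The 5-fold CV predictions on the training data are what make stacking possible: each training example is predicted by a model that never saw it. A minimal sketch of how such out-of-fold predictions can be produced, using a scikit-learn model as a stand-in for any of the implementations above (function and variable names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def out_of_fold_predictions(model, X, y, X_test, n_splits=5, seed=42):
    """Train on 4/5 of the data, predict the held-out fold, and
    average the test-set predictions over the fold models."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    p_val = np.zeros(len(y))          # out-of-fold predictions -> build/val
    p_tst = np.zeros(len(X_test))     # averaged test predictions -> build/tst
    for i_trn, i_val in cv.split(X, y):
        model.fit(X[i_trn], y[i_trn])
        p_val[i_val] = model.predict_proba(X[i_val])[:, 1]
        p_tst += model.predict_proba(X_test)[:, 1] / n_splits
    return p_val, p_tst
```

The `p_val` vector then serves as one input feature for the ensemble, while `p_tst` is the matching test-time feature.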
Using the predictions of the 19 individual models (see Individual Models) as inputs, a Gradient Boosting Machine ensemble model, `esb_xg_grid_colsub`, is trained.
Parameters for the ensemble model are selected for each class by using grid search.
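Per-class parameter selection can be sketched as below; this is an illustration only, using scikit-learn's GBM and `GridSearchCV` in place of the XGBoost-based ensemble, with a hypothetical parameter grid:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def fit_per_class(X, Y, param_grid):
    """Fit one binary GBM per output class, tuning the parameters
    for each class independently with cross-validated grid search."""
    models = []
    for k in range(Y.shape[1]):
        search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                              param_grid, scoring='neg_log_loss', cv=3)
        search.fit(X, Y[:, k])          # binary target for class k
        models.append(search.best_estimator_)
    return models
```

Tuning each class separately allows, e.g., a deeper tree depth for classes with more complex decision boundaries, at the cost of one grid search per class.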
After generating the individual model predictions, run the ensemble Makefile:

```
make -f Makefile.esb.xg_grid_colsub
```
The prediction and submission files will be available in the `build/tst` folder.
| Model Name | Public Leaderboard | 5-fold CV | Comment |
|------------|--------------------|-----------|---------|
| esb_xg_grid_esb19_xgb120 | 0.2497 | - | 0.7 * esb_xg_grid_esb19 + 0.3 * sub_xgb120 |
| esb_xg_grid_esb19 | 0.2503 | 0.2488 | |
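The top entry is a weighted blend of two submission files; only the 0.7/0.3 weights come from the table above, and the file paths and column layout below are hypothetical:

```python
import pandas as pd

def blend(path_a, path_b, weight_a=0.7, out_path='final_sub.csv'):
    """Weighted average of two submission files with an identical
    layout: an id column followed by per-class probability columns."""
    a = pd.read_csv(path_a, index_col=0)
    b = pd.read_csv(path_b, index_col=0)
    blended = weight_a * a + (1 - weight_a) * b   # aligned on id and columns
    blended.to_csv(out_path)
    return blended
```

Because pandas aligns on the index, both files must cover the same ids; any mismatch would surface as NaN probabilities in the output.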
| Model Name | Leaderboard | 5-fold CV | Comment |
|------------|-------------|-----------|---------|
| xg_bagging_120_7_0.1_feature9 | - | 0.2564 | |
| xg_100_8_0.05_feature9 | - | 0.2568 | |
| xg_100_8_0.05_feature1 | - | 0.2575 | |
| xg_100_8_0.05_feature8 | - | 0.2575 | |
| xg_100_8_0.05_feature10 | - | 0.2618 | |
| nn_20_8_0.01_feature3 | - | 0.2660 | |
| nn_20_8_0.01_feature2 | - | 0.2669 | |
| nn_20_64_0.005_feature8 | - | 0.2675 | |
| gbm_bagging_40_7_0.1_feature10 | - | 0.2678 | |
| libfm_200_4_0.005_feature4 | - | 0.2694 | |
| fm_200_8_0.001_feature3 | - | 0.2717 | |
| fm_200_4_0.001_feature2 | - | 0.2720 | |
| libfm_200_4_0.005_feature2 | - | 0.2723 | |
| rf_400_40_feature9 | - | 0.2755 | |
| rf_400_40_feature2 | - | 0.2769 | |
| rf_400_40_feature5 | - | 0.2776 | |
| rf_400_40_feature10 | - | 0.2881 | |
| lr_0.1_feature4 | - | 0.3699 | |
| lr_0.1_feature2 | - | 0.3755 | |