
[R-package] Improvements, readability, and bug fixes #378

Merged
guolinke merged 2 commits into microsoft:master from Laurae2:patch-13 on Apr 3, 2017

Conversation

@Laurae2 (Contributor) commented on Apr 2, 2017

This PR contains large-sized changes for the R package. There are also known issues which will not be fixed in this PR, as they already exist on the master branch.

R improvements and bug fixes:

  • All changes are intended to be free of code regressions (code was only added on top of, or extended from, what already exists).
  • Allow the use of the regression_l1, regression_l2, huber, fair, and poisson objectives, which were previously unusable due to hard-coded rules.
  • Auto-define default evaluation metrics for regression_l1 (MAE), regression_l2 (MSE), huber (MAE), fair (MAE), and poisson (Poisson loss):

(screenshot of the auto-defined default metrics omitted)

  • Changed the way zero-valued vectors are pre-allocated (slightly faster).
  • Users who load the xgboost library after lightgbm can now use the lightgbm examples again (fixes a global-environment name clash between getinfo, setinfo, and slice).
  • Added lgb.unloader, which wipes the LightGBM environment so R does not need to be restarted when an object gets stuck in memory for no apparent reason (for instance, when repeatedly training different models using the same variable names).
  • lgb.unloader can fully wipe LightGBM objects (lgb.Booster, lgb.Dataset) from the specified environment, and does not raise an error when lightgbm has already been detached from the R session.
  • Added a new example for lgb.unloader.
  • Removed the "free booster handle" message on Predictor types.
  • Fixed the dontrun tag in the examples of the dim.lgb.Dataset and dimnames.lgb.Dataset functions.
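The idea behind lgb.unloader, as described above, is to remove lgb.Booster / lgb.Dataset objects from a given environment. Below is a minimal base-R sketch of that idea; it is an illustration only, not the actual lgb.unloader implementation, and wipe_lgb_objects is a hypothetical name:

```r
# Hypothetical sketch: remove objects of class lgb.Booster / lgb.Dataset
# from an environment, leaving everything else untouched.
wipe_lgb_objects <- function(envir = globalenv()) {
  objs <- ls(envir = envir)
  is_lgb <- vapply(
    objs,
    function(nm) inherits(get(nm, envir = envir), c("lgb.Booster", "lgb.Dataset")),
    logical(1)
  )
  rm(list = objs[is_lgb], envir = envir)
  invisible(objs[is_lgb])
}

# Usage sketch: a scratch environment with a fake "stuck" booster object
e <- new.env()
assign("booster", structure(list(), class = "lgb.Booster"), envir = e)
assign("keep_me", 42, envir = e)
wipe_lgb_objects(e)
ls(envir = e)  # only "keep_me" remains
```

The real lgb.unloader additionally handles detaching the lightgbm package itself, which this sketch does not attempt.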

R readability/stylistic changes:

  • Commented the whole R code base from scratch (everything is now commented, making editing easier for newcomers).
  • Changed all single quotes to double quotes when defining character strings (consistency fix).
  • Fixed all spacing issues (the code is much easier to edit, as we no longer have to mash the backspace key to align code correctly).
  • Improved code readability by splitting functions into "paragraph" chunks (groups of related actions), to ease editing and make critical chunks easier to spot.
  • Made the demonstration code easier to read.

R issues to solve in future PRs, not now:

  • Reverse the following, or use Travis as an alternative: examples are currently not run during devtools::check on any function (we might reverse this later, but it would require a Travis setup).
  • [Windows only (?)] Do not lock the DLL in R when building the library; the lock prevents using devtools::check() to inspect the whole code base, which is essential for checking code correctness without running everything by hand. Possibly a regression from PR #340 (Support build self-contained R package), which addressed issue #339 (Relative paths in R-package prevent source package build); this is not certain, as it is a strange lock bug. The package will not pass CRAN checks in its current state, since the lock prevents the CRAN tests from running (the library needs to be unloaded properly). As there is no CRAN release yet, this is not a major issue currently.
  • Get rid of library(lightgbm) in examples (not allowed outside \dontrun{}).
  • Stop relying on the global environment for some allocations (<<-); this is not easy to fix at all.
  • Provide users with demo code to convert the list of evaluation metrics into a matrix or data.table (data.table::rbindlist).
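As a sketch of the rbindlist demo proposed in the last bullet: the mock `record` below only mirrors the general shape of a per-iteration evaluation log (the nesting and names are assumptions for illustration, not guaranteed by the package), with the two values taken from the training log shown later in this PR.

```r
# Mock evaluation log: one metric ("poisson") on one validation set ("test"),
# with one value per boosting iteration.
record <- list(
  test = list(
    poisson = list(eval = list(0.523205, 0.482608))
  )
)

# Convert the nested list into one row per iteration, then bind into a data.table.
rows <- lapply(seq_along(record$test$poisson$eval), function(i) {
  list(iter = i, poisson = record$test$poisson$eval[[i]])
})
dt <- data.table::rbindlist(rows)
dt
```

The same pattern extends to several metrics or validation sets by looping over the outer list levels as well.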

Tests performed:

  • Run all R function examples
  • Run extra examples, see below for Poisson loss example

Example for testing a new loss (e.g., Poisson loss):

# Load the example data shipped with lightgbm
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)

# Do not specify a metric, only the objective (e.g. `regression_l1`,
# `regression_l2`, `huber`, `fair`, `poisson`): the default metric is auto-defined
params <- list(objective = "poisson")
valids <- list(test = dtest)
model <- lgb.train(params,
                   dtrain,
                   100,
                   valids,
                   min_data = 1,
                   learning_rate = 1,
                   early_stopping_rounds = 10)

Beginning of the training log (it does not converge, which is expected: this is obviously not the right objective for this dataset, and the many zero labels make Poisson regression nearly impossible here):

[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=14 and max_depth=6
[1]:	test's poisson:0.523205 
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=24 and max_depth=8
[2]:	test's poisson:0.482608 
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=24 and max_depth=7


msftclas commented Apr 2, 2017

@Laurae2,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

@guolinke guolinke merged commit b6c973a into microsoft:master Apr 3, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020