remove unnecessary omp single that cause deadlock (fixes #6273)#6394
remove unnecessary omp single that cause deadlock (fixes #6273)#6394jameslamb merged 3 commits intomicrosoft:masterfrom
Conversation
jameslamb
left a comment
There was a problem hiding this comment.
Thank you very much for the EXCELLENT investigation and write-up! This fix makes sense to me.
To ensure we don't introduce a deadlock like this on this codepath again, could you please add the example Python code as a test here?
Right after this test:
Something like this:
def test_dataset_construction_with_high_cardinality_categorical_succeeds():
pd = pytest.importorskip("pandas")
X = pd.DataFrame({"x1": np.random.randint(0, 5_000, 10_000)})
y = np.random.rand(10_000)
dtrain = lgb.Dataset(X, y, categorical_feature=["x1"])
dtrain.construct()
assert ds.num_data() == 10_000
assert ds.num_feature() == 1|
@microsoft-github-policy-service agree |
75ba6eb to
112f757
Compare
112f757 to
0749516
Compare
|
Thanks for the update! We'll review them soon. In the future when you contribute to LightGBM (and we hope you will!), don't force-push here. It's not necessary, as we squash all commits into 1 when merging to |
jameslamb
left a comment
There was a problem hiding this comment.
I tested this tonight using an approach similar to #6226, with both the R and Python packages. Confirmed that the processing time scales as expected with higher values of environment variable OMP_NUM_THREADS. Also confirmed that the unit test you've added here gets deadlocked on master but runs quickly and successfully with the changes here 🎉
Thank you SO MUCH for taking the time to investigate and fix this, and for the great explanation in the PR description.
|
This pull request has been automatically locked since there has not been any recent activity since it was closed. |
fixes #6273
omp single clause has implicit barrier, which waits for all threads, including threads that did not execute omp single block.
Therefore, if omp single clause is used conditionally inside omp parallel, it will cause a deadlock (explanation below).
I think it is safe to remove omp single clause from
omp_get_max_threads()(introuduced in #6226).omp_get_max_threads()is thread-safe https://www.openmp.org/spec-html/5.1/openmpse6.html#x27-260001.6omp_get_max_threads()gets its value.background
I digged into #6273.
I suspect a deadlock near
OMP_NUM_THREADS()from the backtrace.In this case,
LightGBM::DatasetLoader::ConstructFromSampleDatahas omp parallel and OMP_NUM_THREADS() (FindBin => ArgMax => ArgMaxMT => OMP_NUM_THREADS()) has omp single.deadlock explanation code