Conversation
zhreshold
left a comment
So it will support only 3D to 5D? What's the limitation here?
looks good
There is a bug when the input is 2-dimensional.
There seems to be a bug in SyncBatchNorm when spatial_shape is 1x1, 1xn, or nx1. I am checking it.
It seems that the bug has been addressed, although I do not know the specific reason yet. I will add a test for multi-output.
@zhreshold @szha Hi! I have updated the PR and added adequate unit tests.
If @szha has no complaints, I can merge it in 24 hours.
Thanks @wkcn, this is merged!
* support SyncBatchNorm5D
* fix
* update testcase and reformat code
* retrigger CI
* update test case
* test
* Retrigger CI
* disable cudnn for batchnorm
* fix BatchNorm(cudnn)
* fix build
* Remove a testcase
* Update sync_batch_norm-inl.h
* update unittest
* update unittest
* update test
* fix test
* change atol and rtol
* BN(cudnn) 5d
* update test
* test
* Testing
* Update batch_norm.cu
* test cudnnoff
* Update test_operator.py
* update BN! : )
input2grad.asnumpy(), atol=atol, rtol=rtol)

cfgs = [(1, False)]
num_gpus = mx.context.num_gpus()
This line requires a GPU when CUDA is installed; otherwise it throws this error:
======================================================================
ERROR: test_gluon.test_sync_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/travis/build/dmlc/mxnet-distro/mxnet-build/tests/python/unittest/common.py", line 177, in test_new
orig_test(*args, **kwargs)
File "/home/travis/build/dmlc/mxnet-distro/mxnet-build/tests/python/unittest/test_gluon.py", line 693, in test_sync_batchnorm
num_gpus = mx.context.num_gpus()
File "/home/travis/build/dmlc/mxnet-distro/mxnet/context.py", line 258, in num_gpus
check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
File "/home/travis/build/dmlc/mxnet-distro/mxnet/base.py", line 254, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [11:47:54] include/mxnet/base.h:427: Check failed: e == cudaSuccess (30 vs. 0) : CUDA: unknown error
Stack trace:
[bt] (0) /home/travis/build/dmlc/mxnet-distro/mxnet/libmxnet.so(+0x4b60fb) [0x7f8d608830fb]
[bt] (1) /home/travis/build/dmlc/mxnet-distro/mxnet/libmxnet.so(+0x2440eec) [0x7f8d6280deec]
[bt] (2) /home/travis/build/dmlc/mxnet-distro/mxnet/libmxnet.so(MXGetGPUCount+0x19) [0x7f8d6280df79]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8d9a2e1c7c]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f8d9a2e15ac]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7f8d9a4f85fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7f8d9a4f9f9e]
[bt] (7) /usr/bin/python(PyEval_EvalFrameEx+0x965) [0x4c84a5]
[bt] (8) /usr/bin/python(PyEval_EvalCodeEx+0x2ac) [0x4cfedc]
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1179889124 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Can you please move this test to tests/python/gpu/test_gluon_contrib_gpu.py? @wkcn @zhreshold
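Besides moving the test to the GPU suite, the failure above could also be avoided by guarding the GPU-count query itself. The following is a hypothetical sketch (not code from this PR): a wrapper that falls back to 0 GPUs when the query raises, so a CPU-only machine with CUDA installed skips the GPU cases instead of erroring out. The `safe_num_gpus` helper and `failing_query` stand-in are both illustrative names; with MXNet the real call would be `mx.context.num_gpus`.

```python
def safe_num_gpus(query):
    """Return query(), or 0 if the query raises (e.g. 'CUDA: unknown error')."""
    try:
        return query()
    except Exception:
        return 0

# Simulate the CI machine where the CUDA runtime errors out instead of
# reporting zero devices; with MXNet this would be
# num_gpus = safe_num_gpus(mx.context.num_gpus).
def failing_query():
    raise RuntimeError("CUDA: unknown error")

num_gpus = safe_num_gpus(failing_query)
print(num_gpus)  # 0, so GPU-only test configurations can be skipped
```

With this guard the test degrades to its CPU-only configurations rather than failing the whole suite, though keeping GPU tests under tests/python/gpu/ is still the cleaner separation.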
I don't know why an unknown CUDA error was raised.
https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/base.h#L424
I was testing it on a platform without a GPU but with CUDA installed. In any case, the test seems misplaced.
Description
Hi there!
Currently, SyncBatchNorm doesn't support 5D or higher-dimensional input.
This PR fixes it.
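The core idea behind supporting higher-rank input can be sketched without MXNet. The NumPy snippet below is an assumption about the approach, not the PR's actual code: batch norm keeps one mean/variance pair per channel, so generalizing from 4D (N, C, H, W) to 5D (N, C, D, H, W) amounts to reducing over every axis except the channel axis. The function name `batchnorm_nd` is illustrative.

```python
import numpy as np

def batchnorm_nd(x, gamma, beta, eps=1e-5):
    """Normalize over every axis except axis 1 (channels); works for 2D-5D input."""
    axes = tuple(i for i in range(x.ndim) if i != 1)
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Broadcast the per-channel scale and shift against any input rank.
    shape = [1] * x.ndim
    shape[1] = x.shape[1]
    return gamma.reshape(shape) * x_hat + beta.reshape(shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 5, 6))        # 5D input: (N, C, D, H, W)
out = batchnorm_nd(x, np.ones(3), np.zeros(3))  # per-channel mean ~0, var ~1
```

Because the reduction axes are computed from `x.ndim`, the same code path covers 2D through 5D input, which is the shape of the fix this PR aims for.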
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
- Support 5D input for SyncBatchNorm
- Move test_sync_batchnorm to tests/python/gpu/test_gluon_gpu.py

Comments