[MXNET-1413] Adding Large Tensor support for sort operators #15170
apeforest merged 1 commit into apache:master
Conversation
@mxnet-label-bot Add [pr-awaiting-review]

@apeforest please review
mxnet_op::Kernel<range_fwd, xpu>::Launch(s, batch_size * element_num, 1, 0, 1,
                                         kWriteTo, indices.dptr_);
mxnet_op::Kernel<range_fwd, xpu>::Launch(s, batch_size * element_num, 1, static_cast<index_t>(0),
                                         static_cast<index_t>(1), kWriteTo, indices.dptr_);
Maybe just use a data initializer:
static_cast<index_t>(1), kWriteTo, indices.dptr_);
index_t{0}, index_t{1}
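Applied to the call quoted above, the suggestion would look roughly like this (just a sketch; it assumes the start and step parameters of range_fwd deduce to index_t so the braced literals suffice):

// sketch: brace-initialized index_t literals instead of static_cast
mxnet_op::Kernel<range_fwd, xpu>::Launch(s, batch_size * element_num, 1,
                                         index_t{0}, index_t{1},
                                         kWriteTo, indices.dptr_);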
nd_ret_topk = mx.nd.topk(a_nd, axis=1, ret_typ="indices", k=3, is_ascend=True).asnumpy()
assert nd_ret_topk.dtype == np.float32  # Test the default dtype
# Test the default dtype
if is_large_tensor_enabled:
Changed the default dtype from float to int32/int64 depending on whether large tensor support is enabled.
nd_ret_topk_ind = nd_ret_topk_ind.asnumpy()
assert nd_ret_topk_val.dtype == dtype
assert nd_ret_topk_ind.dtype == np.float32
if is_large_tensor_enabled:
@apeforest please review
Tensor<xpu, 1, index_t> workspace =
    ctx.requested[0].get_space_typed<xpu, 1, index_t>(Shape1(batch_size * k + batch_size), s);
Tensor<xpu, 1, index_t> sel_indices =
    Tensor<xpu, 1, index_t>((workspace.dptr_), Shape1(batch_size * k), s);
nit: the parentheses around workspace.dptr_ seem redundant
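i.e. the same construction without the extra parentheses:

Tensor<xpu, 1, index_t> sel_indices =
    Tensor<xpu, 1, index_t>(workspace.dptr_, Shape1(batch_size * k), s);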
CHECK(type_assign(&(*out_attrs)[1], mshadow::kInt32))
    << "Failed to set the type of ret_indices to int32.";
#endif
    << "Failed to set the type of ret_indices to int32.";
I think this message should be different when MSHADOW_USE_INT64_TENSOR_SIZE is on.
I think it's better if we just log "Failed to set the type of ret_indices".
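For illustration, a sketch of how the two branches could share that generic message (this assumes the guard is MSHADOW_USE_INT64_TENSOR_SIZE and that the int64 branch, not visible in the hunk, assigns mshadow::kInt64):

// sketch: select the dtype per build flag, share one generic failure message
#if MSHADOW_USE_INT64_TENSOR_SIZE
  CHECK(type_assign(&(*out_attrs)[1], mshadow::kInt64))
#else
  CHECK(type_assign(&(*out_attrs)[1], mshadow::kInt32))
#endif
      << "Failed to set the type of ret_indices.";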
k = nd.topk(b, k=10, axis=0, dtype=np.int64)
assert np.sum(k.asnumpy() == (LARGE_X - 1)) == SMALL_Y
b = create_2d_tensor(rows=SMALL_Y, columns=LARGE_X)
l = nd.topk(b, k=1, axis=-1, dtype=np.int64, ret_typ="value")
Please also test when ret_typ is "both" and "indices".
Good point. The default is "indices". I can check for "both" as well.
apeforest left a comment
Thanks for making this change and for iterating on it to keep the API backward compatible. LGTM overall except for a few minor changes.
larroy left a comment
LGTM % the comment about uninitialized variables.
Stream<xpu> *s = ctx.run_ctx.get_stream<xpu>();
CHECK(param.ret_typ == topk_enum::kReturnValue || param.ret_typ == topk_enum::kReturnBoth);
int batch_size, element_num;  // number of batches + the size of each batch
size_t batch_size;
Can we get into the good practice of not leaving variables uninitialized? This is verboten in many serious contexts (automotive, MISRA, aerospace...). I know we are parsing the arguments later, but still.
Same thing for the one above.
Sounds good. I can do this for primitive types, which come with garbage default values. I will refrain from initializing class objects, since they are null by default unless explicitly assigned memory.
@larroy does that sound good?
@apeforest what are your thoughts on this?
Yes, I was referring to primitive types. Which class objects are you referring to? When initializing a class, the fields are not null; it actually depends on what's in the ctor. If there's no ctor and the type is POD, it's garbage.
Yes, I agree with @larroy that leaving no uninitialized variables is good SE practice. A class object should have its own default constructor; if not, then the design of the class needs improvement.
@larroy I was talking about objects of the Tensor class.
Only PODs need to be initialized in any circumstance, including in a class ctor (otherwise they hold garbage), so answering your original question: yes. And it seems we all agree here. Objects don't need explicit initialization since their ctor takes care of it, as @apeforest pointed out.
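Concretely, for the snippet quoted earlier the agreement amounts to something like the following sketch (the exact types are an assumption based on this PR's move to size_t/index_t):

// sketch: value-initialize the primitive locals at declaration
size_t batch_size = 0;     // number of batches, filled in when the arguments are parsed below
index_t element_num = 0;   // size of each batch, filled in when the arguments are parsed below
// class objects such as Tensor rely on their own ctor, per the discussion above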
tests/nightly/test_large_array.py
b = create_2d_tensor(rows=SMALL_Y, columns=LARGE_X)
l = nd.topk(b, k=1, axis=-1, dtype=np.int64, ret_typ="value")
assert l.sum() == np.sum(np.arange(0, SMALL_Y))
b = create_2d_tensor(rows=LARGE_X, columns=SMALL_Y)
Why do we create b multiple times? Can we reuse one to save computation?
I can reuse the 1st one, but the 2nd still needs to be created.
Fine. No need to do it in this PR.
@mxnet-label-bot Add [pr-awaiting-merge]

@access2rohit Can you check the CI status? If it has gone stale, please retrigger it.

Unfortunately this change has broken NightlyTestsForBinaries #15374
Description
ops supported: sort, topk, argmax
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.