Fix for rnnt_loss.py #1177

Merged
pkufool merged 6 commits into k2-fsa:master from yfyeung:yfyeung-patch-1
Apr 26, 2023

Conversation

@yfyeung
Contributor

@yfyeung yfyeung commented Apr 25, 2023

No description provided.

@pkufool pkufool added the ready Ready for review and trigger GitHub actions to run label Apr 25, 2023
@pkufool pkufool added ready Ready for review and trigger GitHub actions to run and removed ready Ready for review and trigger GitHub actions to run labels Apr 25, 2023
@pkufool pkufool merged commit a23383c into k2-fsa:master Apr 26, 2023
@yfyeung yfyeung deleted the yfyeung-patch-1 branch April 26, 2023 07:33
@danpovey
Collaborator

This PR fixes an issue where excessive memory is used in the backward pass of the simple RNNT loss.
The issue can cause random-seeming failures in which, after some time, training ends with a message like this:

    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 5.37 GiB (GPU 0; 31.75 GiB total capacity; 17.47 GiB already allocated; 12.64 GiB free; 17.83 GiB reserved in total by PyTorch)

(Note: it remains a mystery why, often, the allocation being requested seems to be much smaller than the amount the device reports as free; that 12.64 GiB figure comes from the device_free output of cudaMemGetInfo(&device_free, &device_total). Possibly this has to do with other processes using the machine. Regardless, far more memory was being used in the backward pass than actually needs to be used.)
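For reference, the free/total numbers mentioned above can be inspected from Python via torch.cuda.mem_get_info, which wraps the same cudaMemGetInfo call. This is just a minimal sketch, assuming a PyTorch installation; it falls back to a message when no CUDA device is present:

```python
import torch

# Query free/total device memory, mirroring cudaMemGetInfo(&device_free, &device_total).
# Falls back gracefully when running on a machine without a CUDA device.
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"free: {free_bytes / 2**30:.2f} GiB; total: {total_bytes / 2**30:.2f} GiB")
else:
    print("no CUDA device available")
```

Comparing these numbers before and after loss.backward() is a quick way to see how much memory the backward pass actually consumes.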
