
Added batch_size_predict kwarg to PyTorch LearnedKernelDrift #715

Merged
ascillitoe merged 7 commits into SeldonIO:master from ascillitoe:feature/learnedkernel_batch_size_predict
Feb 1, 2023
Conversation

@ascillitoe
Contributor

@ascillitoe ascillitoe commented Jan 13, 2023

batch_size_predict was included as a kwarg to the KeOps LearnedKernelDrift backend in #602 (batch_size_predict controls the batch size used for predictions with the already trained kernel). This adds the same option to the PyTorch LearnedKernelDrift backend for consistency.

Resolves #612

TODOs

  • Reword some of the cd_mmd_keops.ipynb example to account for the new PyTorch kwarg.
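For context, batch_size_predict only changes how the already-trained kernel is evaluated over the data, not the values it produces. A minimal numpy sketch of the idea (not alibi-detect's implementation; the RBF kernel and function names here are purely illustrative):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    # illustrative stand-in for an already-trained kernel
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_matrix_batched(x, y, kernel, batch_size_predict):
    # evaluate kernel(x, y) in row blocks to bound peak memory
    blocks = [kernel(x[i:i + batch_size_predict], y)
              for i in range(0, len(x), batch_size_predict)]
    return np.concatenate(blocks, axis=0)

x, y = np.random.randn(100, 5), np.random.randn(80, 5)
# identical result for any batch size; only memory/runtime differ
assert np.allclose(kernel_matrix_batched(x, y, rbf, 16), rbf(x, y))
```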


@ascillitoe ascillitoe added the WIP PR is a Work in Progress label Jan 13, 2023
@ascillitoe ascillitoe changed the title from "Feature/learnedkernel batch size predict" to "Added batch_size_predict kwarg to PyTorch LearnedKernelDrift" Jan 13, 2023
@codecov-commenter

codecov-commenter commented Jan 13, 2023

Codecov Report

Merging #715 (4c441a2) into master (acc200b) will decrease coverage by 0.16%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #715      +/-   ##
==========================================
- Coverage   80.32%   80.17%   -0.16%     
==========================================
  Files         137      137              
  Lines        9302     9261      -41     
==========================================
- Hits         7472     7425      -47     
- Misses       1830     1836       +6     
Flag                 Coverage Δ
macos-latest-3.9     ?
ubuntu-latest-3.10   ?
ubuntu-latest-3.7    80.11% <100.00%> (ø)
ubuntu-latest-3.8    80.16% <100.00%> (ø)
ubuntu-latest-3.9    ?
windows-latest-3.9   ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
alibi_detect/cd/pytorch/learned_kernel.py 94.04% <ø> (ø)
alibi_detect/cd/tensorflow/learned_kernel.py 93.42% <ø> (ø)
alibi_detect/cd/learned_kernel.py 100.00% <100.00%> (ø)
alibi_detect/saving/validators.py 76.92% <0.00%> (-13.47%) ⬇️
alibi_detect/utils/frameworks.py 88.88% <0.00%> (-3.71%) ⬇️
alibi_detect/utils/missing_optional_dependency.py 94.28% <0.00%> (-0.16%) ⬇️
alibi_detect/datasets.py 68.55% <0.00%> (-0.14%) ⬇️
alibi_detect/saving/schemas.py 98.69% <0.00%> (-0.09%) ⬇️
alibi_detect/utils/fetching/fetching.py 0.00% <0.00%> (ø)
alibi_detect/cd/base.py 91.19% <0.00%> (+0.21%) ⬆️
... and 1 more

@ascillitoe
Contributor Author

Preliminary experiments on the cd_mmd_keops.ipynb notebook suggest that the maximum batch_size_predict (constrained by available memory) is larger than the maximum batch_size (~20,000 compared to ~1000 for batch_size, when n_features=256). However, increasing batch_size_predict does not significantly decrease runtime (runtime = 238 seconds for batch_size_predict=20,000 and 240 seconds for batch_size_predict=1000), therefore the benefits are unclear.

@arnaudvl would be good to get your thoughts on the above. Wondering if I'm misinterpreting what you had in mind with #612 .

@ascillitoe ascillitoe requested a review from arnaudvl January 19, 2023 11:19
@arnaudvl
Contributor

arnaudvl commented Jan 23, 2023

> Preliminary experiments on the cd_mmd_keops.ipynb notebook suggest that the maximum batch_size_predict (constrained by available memory) is larger than the maximum batch_size (~20,000 compared to ~1000 for batch_size, when n_features=256). However, increasing batch_size_predict does not significantly decrease runtime (runtime = 238 seconds for batch_size_predict=20,000 and 240 seconds for batch_size_predict=1000), therefore the benefits are unclear.
>
> @arnaudvl would be good to get your thoughts on the above. Wondering if I'm misinterpreting what you had in mind with #612.

Have you tried runtimes for e.g. batch_size_predict = [1000, 5000, 10000, 15000, 20000]? And similarly for data with a different range of n_features (e.g. from 5 to 200)?

"* `dataloader`: Dataloader object used during training of the kernel. Defaults to `torch.utils.data.DataLoader`. The dataloader is not initialized yet, this is done during init of the detector using the `batch_size`. Custom dataloaders can be passed as well, e.g. for graph data we can use `torch_geometric.data.DataLoader`.\n",
"\n",
"Additional KeOps keyword arguments:\n",
"* `batch_size_predict`: Batch size used for the trained drift detector predictions. Defaults to 1,000,000 for KeOps and 1,000 for PyTorch.\n",
Contributor

This is not quite true in practice, since the top-level detector is instantiated by default with 1,000,000 and this default propagates to PyTorch.

Contributor Author

Ah good point. For the default I guess we'll have to decide on a compromise for the optimal value between the pytorch and keops implementations?

Contributor

As PyTorch is likely the more common use case, might be fine to prioritise that one.

@ascillitoe
Contributor Author

ascillitoe commented Jan 26, 2023

@arnaudvl below are some more detailed benchmarks. Note that the mem_mean column is GPU RAM in GB, with the counter reset prior to the predict stage, such that the number represents the RAM used by the predict (score) call exclusively.

KeOps

============================================================================================
   backend   n_ref  n_test  batch_size  batch_size_predict  n_features  time_mean (s)  mem_mean (GB)
============================================================================================
0    keops  100000  100000        2000                2000           5       3.78      0.15
1    keops  100000  100000        2000                2000          50       9.73      1.06
2    keops  100000  100000        2000                5000           5       3.71      0.15
3    keops  100000  100000        2000                5000          50       9.98      1.06
4    keops  100000  100000        2000               10000           5       4.08      0.15
5    keops  100000  100000        2000               10000          50      10.05      1.06
6    keops  100000  100000        2000               20000           5       4.21      0.15
7    keops  100000  100000        2000               20000          50      10.92      1.06
8    keops  100000  100000        2000               40000           5       4.19      0.15
9    keops  100000  100000        2000               40000          50      10.17      1.06
10   keops  100000  100000        2000              100000           5       4.43      0.15
11   keops  100000  100000        2000              100000          50      10.84      1.06
12   keops  100000  100000       20000                2000           5       3.28      0.15
13   keops  100000  100000       20000                2000          50      10.82      1.06
14   keops  100000  100000       20000                5000           5       3.39      0.15
15   keops  100000  100000       20000                5000          50       9.89      1.06
16   keops  100000  100000       20000               10000           5       3.17      0.15
17   keops  100000  100000       20000               10000          50      10.51      1.06
18   keops  100000  100000       20000               20000           5       3.06      0.15
19   keops  100000  100000       20000               20000          50      10.01      1.06
20   keops  100000  100000       20000               40000           5       3.18      0.15
21   keops  100000  100000       20000               40000          50      10.02      1.06
22   keops  100000  100000       20000              100000           5       3.21      0.15
23   keops  100000  100000       20000              100000          50       9.95      1.06

PyTorch

============================================================================================
   backend   n_ref  n_test  batch_size  batch_size_predict  n_features  time_mean (s) mem_mean (GB)
============================================================================================
0   pytorch  100000  100000        2000                2000           5    230.76     0.35
1   pytorch  100000  100000        2000                2000          50    228.84     0.35
2   pytorch  100000  100000        2000                5000           5    232.15      0.5
3   pytorch  100000  100000        2000                5000          50    232.67     0.51
4   pytorch  100000  100000        2000               10000           5    231.95      2.0
5   pytorch  100000  100000        2000               10000          50    233.72     2.01
6   pytorch  100000  100000        2000               20000           5    232.89      8.0
7   pytorch  100000  100000        2000               20000          50    232.99     8.01
8   pytorch  100000  100000        2000               40000           5       OOM      OOM
9   pytorch  100000  100000        2000               40000          50       OOM      OOM
10  pytorch  100000  100000        2000              100000           5       OOM      OOM
11  pytorch  100000  100000        2000              100000          50       OOM      OOM
12  pytorch  100000  100000       20000                2000           5       OOM      OOM
13  pytorch  100000  100000       20000                2000          50       OOM      OOM
14  pytorch  100000  100000       20000                5000           5       OOM      OOM
15  pytorch  100000  100000       20000                5000          50       OOM      OOM
16  pytorch  100000  100000       20000               10000           5       OOM      OOM
17  pytorch  100000  100000       20000               10000          50       OOM      OOM
18  pytorch  100000  100000       20000               20000           5       OOM      OOM
19  pytorch  100000  100000       20000               20000          50       OOM      OOM
20  pytorch  100000  100000       20000               40000           5       OOM      OOM
21  pytorch  100000  100000       20000               40000          50       OOM      OOM
22  pytorch  100000  100000       20000              100000           5       OOM      OOM
23  pytorch  100000  100000       20000              100000          50       OOM      OOM

I'm taking two conclusions from this, but let me know if you disagree or would like more/different runs (or plots etc). For PyTorch:

  1. The maximum possible batch_size_predict is around an order of magnitude greater than the maximum batch_size (i.e. 20000 vs 2000).
  2. Increasing batch_size_predict increases memory consumption (from 0.35GB for batch_size_predict=2000 to 8.01GB for batch_size_predict=20000). However, the impact on runtime is negligible.
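Point 2 above is consistent with a back-of-envelope model in which the predict stage keeps a handful of float32 blocks of shape (batch_size_predict, batch_size_predict) live at once (a sketch assuming ~5 such intermediates; the exact count inside the detector may differ):

```python
def block_mem_gb(batch_size_predict, n_tensors=5, bytes_per_el=4):
    # peak memory if n_tensors float32 blocks of shape (b, b) are live at once
    b = batch_size_predict
    return n_tensors * b * b * bytes_per_el / 1e9

# roughly reproduces the mem_mean column above
for b in (5_000, 10_000, 20_000):
    print(b, block_mem_gb(b))  # → 0.5, 2.0, 8.0 GB
```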

@ascillitoe
Contributor Author

ascillitoe commented Jan 27, 2023

Some more PyTorch results going down to smaller batch_size_predict's and also trying n_features=256.

Conclusion is that going down to very small batch_size_predict's impacts runtime substantially, but as long as it isn't very small (i.e. >=500), runtimes are relatively constant.

Note: I think the constant memory usage for most of the results (2.12GB) is because we're measuring peak memory consumption, and for batch_size_predict<20000 the memory peaks before we get to the kernel_mat_fn call.

============================================================================================
    backend   n_ref  n_test  batch_size  batch_size_predict  n_features time_mean mem_mean
============================================================================================
0   pytorch  100000  100000        5000                  50          50    773.41     2.12
1   pytorch  100000  100000        5000                  50         256    788.67     2.12
2   pytorch  100000  100000        5000                 500          50    229.08     2.12
3   pytorch  100000  100000        5000                 500         256    233.27     2.12
4   pytorch  100000  100000        5000                2000          50    226.91     2.12
5   pytorch  100000  100000        5000                2000         256    228.24     2.12
6   pytorch  100000  100000        5000                5000          50    227.46     2.12
7   pytorch  100000  100000        5000                5000         256    227.74     2.12
8   pytorch  100000  100000        5000               10000          50    224.14     2.12
9   pytorch  100000  100000        5000               10000         256    230.19     2.12
10  pytorch  100000  100000        5000               20000          50    232.24     8.01
11  pytorch  100000  100000        5000               20000         256    248.45     8.04
12  pytorch  100000  100000        5000               40000          50       OOM      OOM
13  pytorch  100000  100000        5000               40000         256       OOM      OOM
14  pytorch  100000  100000        5000              100000          50       OOM      OOM
15  pytorch  100000  100000        5000              100000         256       OOM      OOM

@arnaudvl any thoughts on this? I'm not seeing significant benefits to having batch_size_predict != batch_size in this case, but wondering if it's worth leaving in as we might uncover benefits later on?

@ascillitoe ascillitoe removed the WIP PR is a Work in Progress label Jan 31, 2023
@arnaudvl
Contributor

> Some more PyTorch results going down to smaller batch_size_predict's and also trying n_features=256. [...] I'm not seeing significant benefits to having batch_size_predict != batch_size in this case, but wondering if it's worth leaving in as we might uncover benefits later on?

Mainly wanted to check low batch_size values as these are used for training and are more likely to be e.g. 32, 64 etc rather than 5000. Is it possible to run the experiments with low batch_size values?

@ascillitoe
Contributor Author

> Mainly wanted to check low batch_size values as these are used for training and are more likely to be e.g. 32, 64 etc rather than 5000. Is it possible to run the experiments with low batch_size values?

============================================================================================
    backend  n_ref  n_test  batch_size  batch_size_predict  n_features  time_mean  mem_mean
============================================================================================
0   pytorch  20000   20000          32                  32          50      71.40      0.00
1   pytorch  20000   20000          32                  64          50      30.24      0.00
2   pytorch  20000   20000          32                 256          50      17.10      0.00
3   pytorch  20000   20000          32               10000          50      16.84      2.01
4   pytorch  20000   20000          32               20000          50      17.13      2.01
5   pytorch  20000   20000          64                  32          50      67.99      0.00
6   pytorch  20000   20000          64                  64          50      27.59      0.00
7   pytorch  20000   20000          64                 256          50      14.93      0.00
8   pytorch  20000   20000          64               10000          50      14.08      2.01
9   pytorch  20000   20000          64               20000          50      13.83      2.01
10  pytorch  20000   20000         256                  32          50      65.25      0.01
11  pytorch  20000   20000         256                  64          50      25.28      0.01
12  pytorch  20000   20000         256                 256          50      12.80      0.01
13  pytorch  20000   20000         256               10000          50      11.85      2.01
14  pytorch  20000   20000         256               20000          50      11.86      2.01
15  pytorch  20000   20000       10000                  32          50      65.22      8.42
16  pytorch  20000   20000       10000                  64          50      25.19      8.42
17  pytorch  20000   20000       10000                 256          50      12.81      8.42
18  pytorch  20000   20000       10000               10000          50      11.78      8.42
19  pytorch  20000   20000       10000               20000          50      11.78      8.42
20  pytorch  20000   20000       20000                  32          50      64.34      0.00
21  pytorch  20000   20000       20000                  64          50      24.70      0.00
22  pytorch  20000   20000       20000                 256          50      12.03      0.00
23  pytorch  20000   20000       20000               10000          50      10.99      2.01
24  pytorch  20000   20000       20000               20000          50      10.99      2.01

@arnaudvl here are the updated results with low batch_size's. Increasing batch_size_predict does decrease runtime, but so does increasing batch_size. I'm not sure this fully highlights the motivation for having separate kwargs, since batch_size = batch_size_predict = 10000 looks like a good setting (compared to batch_size = batch_size_predict = 32). Although we can achieve a larger batch_size_predict than batch_size for a given GPU memory limit, the runtime decrease tails off well before we run into memory issues, so there is little benefit to setting batch_size_predict >> batch_size here.

However, I think if we had considerably less memory, so that batch_size was limited to 64 or so, then there would be big gains in being able to set batch_size_predict a little bigger, say 256. The following demonstrates this:

============================================================================================
    backend  n_ref  n_test  batch_size  batch_size_predict  n_features  time_mean  mem_mean
============================================================================================
6   pytorch  20000   20000          64                  64          50      27.59      0.00
7   pytorch  20000   20000          64                 256          50      14.93      0.00

i.e. a big gain from increasing batch_size_predict relative to batch_size, assuming we are unable to increase batch_size above 64 due to severe memory constraints. For comparison, if batch_size could also be raised:

============================================================================================
    backend  n_ref  n_test  batch_size  batch_size_predict  n_features  time_mean  mem_mean
============================================================================================
12  pytorch  20000   20000         256                 256          50      12.80      0.01

However, if we are already able to run a batch_size that is large relative to n_ref + n_test, then there is no point increasing batch_size_predict by itself:

============================================================================================
    backend  n_ref  n_test  batch_size  batch_size_predict  n_features  time_mean  mem_mean
============================================================================================
18  pytorch  20000   20000       10000               10000          50      11.78      8.42
19  pytorch  20000   20000       10000               20000          50      11.78      8.42

So to summarise, I can see the two separate kwargs being useful when n_ref + n_test is much larger than the maximum achievable batch_size.

p.s. I haven't run v. large n_ref + n_test's here since it would take forever with small batch_sizes...
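The tail-off in runtime with increasing batch size is consistent with a simple fixed-cost-per-batch model: total time ≈ per-item compute + n_batches × per-batch overhead, so once the overhead is amortised, bigger batches stop helping. A toy sketch with made-up constants (not measured from alibi-detect):

```python
import math

def predict_time(n, batch_size, t_overhead=0.05, t_per_item=2.5e-4):
    # toy model: fixed launch overhead per batch plus per-item compute
    n_batches = math.ceil(n / batch_size)
    return n_batches * t_overhead + n * t_per_item

times = {b: predict_time(40_000, b) for b in (32, 256, 10_000)}
# small batches pay heavily for per-batch overhead; the gain then saturates
assert times[32] > times[256] > times[10_000]
```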

@arnaudvl
Contributor

> @arnaudvl here are the updated results with low batch_size's. [...] So to summarise, I can see the two separate kwargs being useful when n_ref + n_test is much larger than the maximum achievable batch_size.

The argument for batch_size + batch_size_predict is not just for memory, but also detector performance. It is possible that the best batch_size for training is quite small (e.g. 32) but we want to use a much bigger batch size for predictions since we want to be as fast as possible. It looks like these latest sets of experiments support this.

@ascillitoe
Contributor Author

ascillitoe commented Jan 31, 2023

> The argument for batch_size + batch_size_predict is not just for memory, but also detector performance. It is possible that the best batch_size for training is quite small (e.g. 32) but we want to use a much bigger batch size for predictions since we want to be as fast as possible. It looks like these latest sets of experiments support this.

Good point! We are good to go then (after review!) 🙂

@ascillitoe ascillitoe merged commit f8dd11e into SeldonIO:master Feb 1, 2023
@ascillitoe ascillitoe deleted the feature/learnedkernel_batch_size_predict branch February 1, 2023 10:18
Development

Successfully merging this pull request may close these issues.

Add batch_size_predict as kwarg to PyTorch backend for learned detectors.