Adding dtype to LDAModel to speed it up #1656
Conversation
Good idea with the asserts! I don't think save/load need any special handling at all. Maybe the only tricky part is how to handle backward compatibility: should loading models saved before this change still work? I'd say yes. We need to test this explicitly: save an "old" model, then load it using your new code with dtypes and asserts, and make sure everything continues to work as expected. The other compatibility direction (loading a new model in old code) is not necessary.
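For illustration, a minimal sketch of such a backward-compatibility test, assuming a fixture file saved by an older gensim (the fixture filename is hypothetical):

```python
import numpy as np
from gensim.models import LdaModel

# Hypothetical fixture: an LdaModel saved by a gensim version
# that predates the `dtype` attribute.
old_model = LdaModel.load('test_data/lda_pre_dtype.model')

# The loader should fall back to float64 for old models...
assert old_model.dtype == np.float64

# ...and the loaded model should still be usable end to end.
old_model.show_topics()
```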
Do I need to do anything to make the code save my new `dtype` attribute? And yes, there is a compatibility problem, as the tests showed me. To handle it, I'll need to set dtype to float64 if it's not present in the saved model. I'll need some time to wrap my head around the code in the load/save methods.
Nice @xelez, my suggestions about it:
P.S. Similar task: #1319
Implemented setting of dtype when loading old models. Two things left to do:
Looked up failing tests: I'll also need to modify AuthorTopicModel.
* Use assert_allclose instead of == in LdaModel tests. Use np.issubdtype when checking if something is float.
Fixed tests and quick-fixed AuthorTopicModel.
* replace assert with docstring comment
* add test to check that it really saves dtype for different inputs
gensim/models/ldamodel.py
Outdated
```diff
 if prior == 'symmetric':
-    logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)
     init_prior = np.asarray([1.0 / self.num_topics for i in xrange(prior_shape)])
+    logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)  # TODO: prior_shape?
```
I have a feeling that it should be
"using symmetric %s at %s", name, 1.0 / self.num_topics
am I right? @menshikh-iv
Yes @xelez, you are correct!
gensim/matutils.py
Outdated
| """ | ||
| For a vector `theta~Dir(alpha)`, compute `E[log(theta)]`. | ||
| Saves dtype of the argument. |
What does this comment mean? Looks out of place.
I wanted to add a note that this function returns an np.array with the same dtype as the input alpha. Well, probably it's not really needed.
If that's the intent, it's not really apparent from the text "Saves dtype of the argument."
* Set dtype to float64 explicitly to retain backward compatibility in models using LdaModel
@piskvorky, @menshikh-iv I think I've finished.
gensim/models/ldamodel.py
Outdated
```diff
 if not hasattr(result, 'dtype'):
     result.dtype = np.float64  # float64 was used before as default in numpy
+    logging.warning("dtype was not set, so using np.float64")
```
A more concrete message please. When reading this warning, users will be left scratching their heads: set where? Why? What does this mean to me?
How about `"dtype not set in saved %s file %s, assuming np.float64" % (result.__class__.__name__, fname)`?
And only log at INFO or even DEBUG level, since it's an expected state when loading an old model, nothing out of the ordinary.
Question: isn't it better to infer the dtype from the loaded object? Can it ever happen that it's something else, not np.float64?
Fixed the message; decided the INFO level suits better.
About inferring: it's not clear how to do it. Infer from LdaState.eta and LdaState.sstats? But then we have a test that their sum is np.float64, so it's safe to assume that we don't lose precision when setting dtype to np.float64, and that np.float32 is not enough.
Anyway, let's imagine a situation where some of the ndarrays are somehow of different dtypes, say some np.float32 and some np.float64. The right dtype is still np.float64.
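As an aside, numpy's own promotion rules agree with this reasoning; a quick check:

```python
import numpy as np

# Mixing float32 and float64 always promotes to float64,
# so float64 is the safe choice when dtypes disagree.
print(np.promote_types(np.float32, np.float64))                 # float64
print((np.ones(2, np.float32) + np.ones(2, np.float64)).dtype)  # float64
```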
gensim/models/hdpmodel.py
Outdated
```diff
 alpha, beta = self.hdp_to_lda()
-ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word, random_state=self.random_state)
+ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word,
+                         random_state=self.random_state, dtype=np.float64)
```
Code style: no vertical indent.
fixed
gensim/models/ldamodel.py
Outdated
```diff
-        self.sstats = np.zeros(shape)
+    def __init__(self, eta, shape, dtype=np.float32):
+        self.eta = eta.astype(dtype, copy=False)
+        self.sstats = np.zeros(shape, dtype)
```
Using positional arguments can lead to subtle bugs with numpy. Better use explicit names for keyword parameters: `dtype=dtype`.
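A small illustration of the suggested style (standalone, outside the class for brevity):

```python
import numpy as np

shape, dtype = (2, 3), np.float32

# np.zeros(shape, dtype) happens to work positionally, but an extra or
# reordered positional argument can silently bind to the wrong parameter.
# The keyword form makes the intent explicit:
sstats = np.zeros(shape, dtype=dtype)
print(sstats.dtype)  # float32
```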
fixed
gensim/models/ldaseqmodel.py
Outdated
```diff
-lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word)
+lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word,
+                        dtype=np.float64)
```
Code style: no vertical indent.
fixed
gensim/models/ldaseqmodel.py
Outdated
| """ | ||
| lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word) | ||
| lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word, | ||
| dtype=np.float64) |
Code style: no vertical indent.
fixed
gensim/models/ldamodel.py
Outdated
```python
# Check if `dtype` is set after main pickle load
# if not, then it's an old model and we should set it to default `np.float64`
if not hasattr(result, 'dtype'):
    result.dtype = np.float64  # float64 was used before as default in numpy
```
Old LDA used float64, really?
Pretty much everything is using float64, because it's the default dtype when creating numpy arrays.
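For reference, a quick demonstration of that default:

```python
import numpy as np

# numpy creates floating-point arrays as float64 unless told otherwise,
# which is why models saved before this change are effectively float64.
print(np.zeros(3).dtype)        # float64
print(np.asarray([0.5]).dtype)  # float64
```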
```diff
 corpus, id2word=self.id2word, num_topics=self.num_topics,
-passes=passes, alpha=self.alphas, random_state=random_state
+passes=passes, alpha=self.alphas, random_state=random_state,
+dtype=np.float64
```
Maybe it would be a good idea to change the default behaviour (to float32)?
CC @piskvorky @xelez
Not now, LdaSeqModel will require modifications similar to those I made in LdaModel to handle dtype properly.
```diff
 self.assertTrue(isinstance(topic, np.ndarray))
-self.assertEqual(topic.dtype, np.float64)
+# Note: started moving to np.float32 as default
+# self.assertEqual(topic.dtype, np.float64)
```
need to enable + switch to float32
This will break the other topic models then.
gensim/test/test_matutils.py
Outdated
```python
class TestMatUtils(unittest.TestCase):
    def test_dirichlet_expectation_keeps_precision(self):
```
I don't think making a new file is a good idea. Please move these tests to the LDA tests class.
Also, please add tests for loading old models with the new code, for all models that you changed.
Test not just loading the old models, but also using them.
The asserts that we newly sprinkled into the code may trigger errors in various places, if something is wrong.
@menshikh-iv That test doesn't involve LDA at all, so that's the wrong place for it. I haven't found any file that tests matutils functions, so I think a separate file is not that bad.
@piskvorky, @menshikh-iv see the latest commit for backwards compatibility tests. By the way, I think it's a good idea to remove my asserts before merging. They were used mostly during testing, to ensure that I hadn't missed any place to add dtype. That way we definitely won't break old code or models.
By the way, why are .npy files ignored in .gitignore?
menshikh-iv left a comment:
Great @xelez, please fix the last changes. LGTM from me, wdyt @piskvorky?
```python
# add a little random jitter, to randomize results around the same alpha
sort_alpha = self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))
# random_state.rand returns float64, but converting back to dtype won't speed up anything
```
Maybe `.astype` (for consistency only)?
Consistency vs one additional array copy. I'm not sure :)
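For context on that trade-off, `.astype` only pays for a copy when the dtype actually changes:

```python
import numpy as np

a = np.ones(3, dtype=np.float64)
print(a.astype(np.float32).dtype)     # float32: conversion allocates a copy

b = a.astype(np.float64, copy=False)  # dtype already matches:
print(b is a)                         # True, no copy is made
```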
```python
# dtype could be absent in old models
if not hasattr(result, 'dtype'):
    result.dtype = np.float64  # float64 was implicitly used before (cause it's default in numpy)
    logging.info("dtype was not set in saved %s file %s, assuming np.float64", result.__class__.__name__, fname)
```
Maybe warn?
See #1656 (comment) for discussion. Although it's an expected state when loading an old model, maybe a warning is still a good thing.
gensim/matutils.py
Outdated
```diff
 else:
     result = psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]
-return result.astype(alpha.dtype)  # keep the same precision as input
+return result
```
Please return the astype version, because:
np.float32 -> np.float32
np.float64 -> np.float64
but
np.float16 -> np.float32
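A quick check of the promotion being described (scipy's `psi` only has float32 and float64 loops, so float16 input is upcast):

```python
import numpy as np
from scipy.special import psi

alpha = np.ones(4, dtype=np.float16)
result = psi(alpha) - psi(np.sum(alpha))
print(result.dtype)                      # float32: psi has no float16 loop
print(result.astype(alpha.dtype).dtype)  # float16: astype restores input precision
```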
Oh, my bad, you're right!
Then the tests that I added in the separate file aren't needed.
```diff
 for k, v in topic:
     self.assertTrue(isinstance(k, int))
-    self.assertTrue(isinstance(v, float))
+    self.assertTrue(np.issubdtype(v, float))
```
simple isinstance is better here (and everywhere).
A simple isinstance fails, because np.float32 is not an instance of float.
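A minimal demonstration of why:

```python
import numpy as np

v32, v64 = np.float32(0.5), np.float64(0.5)
print(isinstance(v64, float))                 # True: np.float64 subclasses float
print(isinstance(v32, float))                 # False: np.float32 does not
print(np.issubdtype(type(v32), np.floating))  # True for all numpy float types
```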
gensim/test/test_matutils.py
Outdated
```python
class TestMatUtils(unittest.TestCase):
    def test_dirichlet_expectation_keeps_precision(self):
        for dtype in (np.float32, np.float64, np.complex64, np.complex128):
```
Add np.float16 and you'll see a problem.
Thanks a lot @xelez, nice work 👍
Great feature! @xelez how would you summarize it, for laymen who just want "the gist"? The title says "speed it up" -- by how much?
@piskvorky with LdaMulticore (80k words, 100 topics) it's ~20% faster; I checked it yesterday.
That's neat :) Let's make sure this information makes it into the release notes / tweets etc.
There are a couple of places in ldamodel.py where 1e-100 gets added to guard against division by zero. In float32, that epsilon underflows to zero, so the guard does nothing.
Good point. I'd hope this would be caught by the unit tests though -- @menshikh-iv ?
@piskvorky this isn't caught by the unit tests, because the operation `x + 1e-100` is a silent no-op in float32: nothing changes, so no assertion fires.
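A minimal reproduction of the no-op:

```python
import numpy as np

x = np.zeros(3, dtype=np.float32)
print(x + 1e-100)                 # still all zeros: the epsilon vanished
print(np.float32(1e-100))         # 0.0: 1e-100 underflows in float32
print(np.finfo(np.float32).tiny)  # ~1.18e-38, the smallest normal float32
```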
The unit tests should catch the operation of "add epsilon" not working, which (I presume) leads to some issues. In other words, if the unit tests pass, what is the problem?
For float32, adding 1e-100 has no effect at all. For me, it looks like a medium bug.
That's not my question. My question is: do the unit tests catch it? If not, is it an issue with the unit tests (=> update the unit tests), or with the algorithm (=> update the gensim code)? If yes, how come we didn't discover the bug earlier?
@piskvorky the unit tests don't catch it.
Then the unit tests should be improved, as part of the solution here -- so that we catch similar bugs automatically in the future.
The problem here is not in the tests at all; it's generally impossible to catch this bug in this code with unit tests. Perhaps we'd need to change the LDA code itself, but I don't think that's a good idea.
I don't understand -- if there's no way to catch a bug, then there is no bug.
It's definitely test-for-able in principle: I noticed the bug because I started getting division by zero errors in a processing pipeline that used to work. I don't know how to create a minimal corpus that triggers it, though.
Summarizing: this is a bug, we need to fix it.
* Add dtype to LdaModel, assert about it everywhere
* Implement loading of old models without dtype
* Use assert_allclose instead of == in LdaModel tests. Use np.issubdtype when checking if something is float.
* Fix AuthorTopicModel
* Fix matutils.dirichlet_expectation
* replace assert with docstring comment
* add test to check that it really saves dtype for different inputs
* Change default to np.float32 and cleanup
* Fix wrong logging message
* Remove out-of-place comment
* Cleanup PEP8
* Add dtype to sklearn LdaTransformer
* Set precision explicitly in lda model converters
* Add dtype to LdaMulticore
* Set dtype to float64 explicitly to retain backward compatibility in models using LdaModel
* Cleanup asserts and fix another place calculating in float64
* Fix import
* Fix remarks by piskvorky
* Add backward compatibility tests
* Add missing .npy files
* Fix dirichlet_expectation not working with np.float16
* Fix path to saved model
Started implementing #1576
Current state:
And I need to somehow rewrite test asserts like this:
`self.assertTrue(all(model.alpha == np.array([0.3, 0.3])))`
because `model.alpha` is now converted to float32 (or whatever dtype) and np.array's default dtype is float64. Use np.allclose, maybe?
And I'm not sure where to discuss things, here or in the issue.
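For illustration, the tolerant comparison the commit list above settles on (`assert_allclose` instead of `==`):

```python
import numpy as np

alpha32 = np.array([0.3, 0.3], dtype=np.float32)
expected = np.array([0.3, 0.3])  # float64 by default

print(all(alpha32 == expected))  # False: 0.3 rounds differently per dtype
np.testing.assert_allclose(alpha32, expected, rtol=1e-5)  # passes
```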