
@saparina
Contributor

@saparina saparina commented Mar 7, 2017

…alue
Possibly solves issue #911.

@tmylk
Contributor

tmylk commented Mar 7, 2017

Could you please set up the distributed workers on your box and check whether it actually solves #911? Have you been able to reproduce #911?

@saparina
Contributor Author

saparina commented Mar 8, 2017

@tmylk Yes, I reproduced #911 the way it is described in the issue:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import models, corpora
corpus = corpora.MmCorpus('deerwester.mm') # load a corpus of nine documents, from the Tutorials
id2word = corpora.Dictionary.load('deerwester.dict')

lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=100, distributed=True)

I got the same error:

2017-03-08 10:40:51,078 : INFO : loaded corpus index from deerwester.mm.index
2017-03-08 10:40:51,078 : INFO : initializing corpus reader from deerwester.mm
2017-03-08 10:40:51,078 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries
2017-03-08 10:40:51,078 : INFO : loading Dictionary object from deerwester.dict
2017-03-08 10:40:51,078 : INFO : loaded deerwester.dict
2017-03-08 10:40:51,079 : INFO : using symmetric alpha at 0.01
2017-03-08 10:40:51,079 : INFO : using symmetric eta at 0.08333333333333333
2017-03-08 10:40:51,147 : INFO : using distributed version with 2 workers
2017-03-08 10:40:51,163 : INFO : running online LDA training, 100 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
2017-03-08 10:40:51,163 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2017-03-08 10:40:51,163 : INFO : initializing 2 workers
Traceback (most recent call last):
  File "LDA+issue.py", line 9, in <module>
    lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=100, distributed=True)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 334, in __init__
    self.update(corpus, chunks_as_numpy=use_numpy)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 635, in update
    self.log_perplexity(chunk, total_docs=lencorpus)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 526, in log_perplexity
    perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 727, in bound
    gammad, _ = self.inference([doc])
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 428, in inference
    if doc and not isinstance(doc[0][0], six.integer_types):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In distributed mode, chunks are kept as NumPy arrays, and a truth-value test such as `if doc` is invalid when doc is a NumPy array with more than one element.
I checked the fix on one machine with two workers, and it now works in distributed mode.
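
For illustration, here is a minimal sketch of why the check on `doc` fails for array chunks. It does not reproduce the actual change in this PR: the helper name `is_nonempty` and the sample documents are hypothetical, and a length check is just one possible way to make the test work for both lists and NumPy arrays.

```python
import numpy as np

def is_nonempty(doc):
    """Hypothetical helper: True if doc has at least one entry.

    len() works for plain lists and for NumPy arrays, so it avoids
    the ambiguous truth-value test on a multi-element array.
    """
    return len(doc) > 0

# A bag-of-words document as a list of (term_id, count) pairs ...
doc_list = [(0, 1.0), (3, 2.0)]
# ... and the same document as a NumPy array, as in distributed mode.
doc_array = np.array(doc_list)

# `if doc_array:` asks NumPy for a single truth value of a 2x2 array,
# which raises ValueError.
try:
    if doc_array:
        pass
    raised = False
except ValueError:
    raised = True

print("truth test on array raised ValueError:", raised)
print("list non-empty:", is_nonempty(doc_list))
print("array non-empty:", is_nonempty(doc_array))
```

The plain list passes through `if doc` without trouble, which is why the error only shows up when chunks arrive as arrays.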

tmylk merged commit ed757df into piskvorky:develop Mar 8, 2017
@tmylk
Contributor

tmylk commented Mar 8, 2017

Thanks for the PR!
