Migrating from Gensim 3.x to 4
Gensim 4.0 is compatible with older releases (3.8.3 and prior) for the most part. Your existing stored models and code will continue to work in 4.0, except:
- Gensim 4.0+ is Python 3 only. See the Gensim & Compatibility policy page for supported Python 3 versions.
- The *2Vec-related classes (Word2Vec, FastText, & Doc2Vec) have undergone significant internal refactoring for clarity, consistency, efficiency & maintainability. They now train much faster and consume less RAM (see the 4.0 benchmarks), but some parameter and attribute names have changed:
1. The size constructor parameter is now consistently vector_size everywhere:

model = Word2Vec(size=100, …) # 🚫
model = FastText(size=100, …) # 🚫
model = Doc2Vec(size=100, …) # 🚫

model = Word2Vec(vector_size=100, …) # 👍
model = FastText(vector_size=100, …) # 👍
model = Doc2Vec(vector_size=100, …) # 👍

2. The iter constructor parameter is now consistently epochs everywhere:

model = Word2Vec(iter=5, …) # 🚫
model = FastText(iter=5, …) # 🚫
model = Doc2Vec(iter=5, …) # 🚫

model = Word2Vec(epochs=5, …) # 👍
model = FastText(epochs=5, …) # 👍
model = Doc2Vec(epochs=5, …) # 👍

Before, the iter name was used to match the original word2vec implementation. But epochs is more standard and descriptive, and iter also clashes with Python's built-in iter().
3. index2word is now index_to_key:

random_word = random.choice(model.wv.index2word) # 🚫

random_word = random.choice(model.wv.index_to_key) # 👍

This unifies the terminology: these models map keys to vectors (not just words or entities to vectors).
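As a quick sanity check, here's a minimal sketch (assuming any trained model bound to `model`) showing that index_to_key and key_to_index are inverse mappings:

```python
import random

# index_to_key is a plain list (position -> key); key_to_index is a dict (key -> position).
word = model.wv.index_to_key[0]
assert model.wv.key_to_index[word] == 0

# Both cover the same keys, so either can drive iteration or sampling:
assert len(model.wv.index_to_key) == len(model.wv.key_to_index)
print(random.choice(model.wv.index_to_key))
```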
4. The vocab dict became key_to_index for looking up a key's integer index, or get_vecattr() and set_vecattr() for other per-key attributes:

rock_idx = model.wv.vocab["rock"].index # 🚫
rock_cnt = model.wv.vocab["rock"].count # 🚫
vocab_len = len(model.wv.vocab) # 🚫

rock_idx = model.wv.key_to_index["rock"] # 👍
rock_cnt = model.wv.get_vecattr("rock", "count") # 👍
vocab_len = len(model.wv) # 👍
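Since set_vecattr() is mentioned but not shown above, here is a minimal sketch (assuming "rock" is in the vocabulary; "my_tag" is an arbitrary illustrative attribute name, not a built-in one):

```python
# Per-key attributes now live directly in the KeyedVectors object.
model.wv.set_vecattr("rock", "my_tag", 1.0)    # store a custom per-key attribute
print(model.wv.get_vecattr("rock", "my_tag"))  # 1.0
print(model.wv.get_vecattr("rock", "count"))   # training-corpus frequency of "rock"
```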
5. L2-normalized vectors are now computed dynamically, on request. The full numpy array of "normalized vectors" is no longer stored in memory:

all_normed_vectors = model.wv.get_normed_vectors() # still works, but now creates a new array on each call!

normed_vector = model.wv.vectors_norm[model.wv.vocab["rock"].index] # 🚫

normed_vector = model.wv.get_vector("rock", norm=True) # 👍

This allows Gensim 4.0.0 to be much more memory efficient than Gensim <4.0.
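To make the new behaviour concrete, a minimal sketch (assuming a trained model with "rock" in its vocabulary): the vector returned with norm=True is simply the raw vector scaled to unit length, computed on the fly:

```python
import numpy as np

raw_vector = model.wv.get_vector("rock")              # raw, unnormalized vector
unit_vector = model.wv.get_vector("rock", norm=True)  # L2-normalized, computed on request
assert np.allclose(unit_vector, raw_vector / np.linalg.norm(raw_vector))
```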
6. There are no more vocabulary and trainables attributes; properties previously there have been moved back to the model:

out_weights = model.trainables.syn1neg # 🚫
min_count = model.vocabulary.min_count # 🚫

out_weights = model.syn1neg # 👍
min_count = model.min_count # 👍

7. Methods like most_similar(), wmdistance(), doesnt_match(), similarity(), & others moved to KeyedVectors. These methods moved from the full model (Word2Vec, Doc2Vec, FastText) object to its .wv subcomponent (of type KeyedVectors) many releases ago:
w2v_model.most_similar(word) # 🚫
w2v_model.most_similar_cosmul(word) # 🚫
w2v_model.wmdistance(wordlistA, wordlistB) # 🚫
w2v_model.similar_by_word(word) # 🚫
w2v_model.similar_by_vector(word) # 🚫
w2v_model.doesnt_match(wordlist) # 🚫
w2v_model.similarity(wordA, wordB) # 🚫
w2v_model.n_similarity(wordlistA, wordlistB) # 🚫
w2v_model.evaluate_word_pairs(wordpairs) # 🚫
w2v_model.accuracy(questions) # 🚫
w2v_model.log_accuracy(section) # 🚫

w2v_model.wv.most_similar(word) # 👍
w2v_model.wv.most_similar_cosmul(word) # 👍
w2v_model.wv.wmdistance(wordlistA, wordlistB) # 👍
w2v_model.wv.similar_by_word(word) # 👍
w2v_model.wv.similar_by_vector(word) # 👍
w2v_model.wv.doesnt_match(wordlist) # 👍
w2v_model.wv.similarity(wordA, wordB) # 👍
w2v_model.wv.n_similarity(wordlistA, wordlistB) # 👍
w2v_model.wv.evaluate_word_pairs(wordpairs) # 👍
w2v_model.wv.evaluate_word_analogies(questions) # 👍
w2v_model.wv.log_accuracy(section) # 👍

(Note that the old accuracy() method was also renamed to evaluate_word_analogies().)

Most generally, if any call on a full model (Word2Vec, Doc2Vec, FastText) object only needs the word vectors to calculate its response, and you encounter a "has no attribute" error in Gensim 4.0.0+, make the call on the contained KeyedVectors object instead.
In addition, wmdistance now normalizes vectors to unit length by default:
# 🚫 BEFORE:
model.init_sims(replace=True) # 🚫 First normalize all embedding vectors.
distance = model.wmdistance(wordlistA, wordlistB) # 🚫 Then compute WMD distance.

# 👍 Now, in 4.0+:
distance = model.wv.wmdistance(wordlistA, wordlistB) # 👍 WMD distance over normalized embedding vectors.
distance = model.wv.wmdistance(wordlistA, wordlistB, norm=False) # 👍 WMD distance over non-normalized vectors.

The on_batch_begin and on_batch_end training callbacks have been removed: these two callbacks had muddled semantics, confused users, and introduced race conditions. Use on_epoch_begin and on_epoch_end instead.
Gensim 4.0 now ignores these two functions entirely, even if implementations for them are present.
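One way to port such a callback, as a minimal sketch using the per-epoch hooks (the EpochLogger name is just illustrative):

```python
from gensim.models.callbacks import CallbackAny2Vec

class EpochLogger(CallbackAny2Vec):
    """Log a message at the start and end of every training epoch."""

    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        print(f"Epoch #{self.epoch} start")

    def on_epoch_end(self, model):
        print(f"Epoch #{self.epoch} end")
        self.epoch += 1

# Pass the callback at training time, e.g.:
# model = Word2Vec(corpus, vector_size=100, callbacks=[EpochLogger()])
```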
The Doc2Vec.docvecs attribute is now Doc2Vec.dv, and it's now a standard KeyedVectors object, so it has all the standard attributes and methods of KeyedVectors (but no specialized properties like vectors_docs):

random_doc_id = np.random.randint(doc2vec_model.docvecs.count) # 🚫
document_vector = doc2vec_model.docvecs["some_document_tag"] # 🚫
all_docvecs = doc2vec_model.docvecs.vectors_docs # 🚫

random_doc_id = np.random.randint(len(doc2vec_model.dv)) # 👍
document_vector = doc2vec_model.dv["some_document_tag"] # 👍
all_docvecs = doc2vec_model.dv.vectors # 👍

Because the vectors for document tags are now in a standard KeyedVectors, the old Doc2Vec-specific accessors like doctag_syn0, vectors_docs, or index_to_doctag are no longer supported; use the analogous generic accessors instead:
all_docvecs = doc2vec_model.docvecs.doctag_syn0 # 🚫
all_docvecs = doc2vec_model.docvecs.vectors_docs # 🚫
doctag = doc2vec_model.docvecs.index_to_doctag[n] # 🚫

all_docvecs = doc2vec_model.dv.vectors # 👍
doctag = doc2vec_model.dv.index_to_key[n] # 👍

Checking whether a word is present in a model's vocabulary has also changed:

"night" in model.wv.vocab # 🚫

"night" in model.wv.key_to_index # 👍

Of course, even OOV words have vectors in FastText (assembled from vectors of their character ngrams), so the following is not a good way to test for the presence of a vector:

"no_such_word" in model.wv # 🚫 always returns True for FastText!
model.wv["no_such_word"] # returns a vector even for OOV words
The following notes are for advanced users, who were using or extending the Gensim internals more deeply, perhaps relying on protected / private attributes.

- A key change is the creation of a unified KeyedVectors class for working with sets of vectors, reused for both word-vectors and doc-vectors, both when these are a subcomponent of the full algorithm models (for training) and when they are separate vector-sets (for lighter-weight re-use). This unified class shares the same (& often improved) convenience methods & implementations.
- One notable internal implementation change means that performing the usual similarity operations no longer requires the creation of a 2nd full cache of unit-normalized vectors, via the .init_sims() method & stored in the .vectors_norm property. That used to involve a noticeable delay on first use, much higher memory use, and extra complications when attempting to deploy/share vectors among multiple processes.
- A number of errors and inefficiencies in the FastText implementation have been corrected. Model size in memory and when saved to disk will be much smaller, and using FastText as if it were Word2Vec, by disabling character n-grams (with max_n=0), should be as fast & performant as vanilla Word2Vec.
- When supplying a Python iterable corpus to instance-initialization, build_vocab(), or train(), the parameter name is now corpus_iterable, to reflect the central expectation (that it is an iterable) and for correspondence with the corpus_file alternative. The prior model-specific names for this parameter, like sentences or documents, were overly specific, and sometimes led users to the mistaken belief that such input must be precisely natural-language sentences. (See the sketch after this list.)
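A minimal sketch of the new parameter name in the explicit two-step build/train flow (toy corpus and parameter values chosen only for illustration):

```python
from gensim.models import Word2Vec

corpus = [["machine", "learning", "with", "gensim"], ["hello", "machine"]]

model = Word2Vec(vector_size=50, min_count=1)
model.build_vocab(corpus_iterable=corpus)
model.train(
    corpus_iterable=corpus,
    total_examples=model.corpus_count,
    epochs=model.epochs,
)
```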
If you're unsure or getting unexpected results, let us know at the Gensim mailing list.
The Phraser class has been renamed to FrozenPhrases, to be more explicit in its intent, and easier to tell apart from its chunkier parent Phrases:

phrases = Phrases(corpus)
phraser = Phraser(phrases) # 🚫

phrases = Phrases(corpus)
frozen_phrases = phrases.freeze() # 👍

Note that phrases (collocation detection, multi-word expressions) have been pretty much rewritten from scratch for Gensim 4.0, and are more efficient and flexible now overall.
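For example, a minimal sketch of the new flow on a toy corpus (min_count and threshold lowered here only so the tiny corpus produces a bigram):

```python
from gensim.models.phrases import Phrases

# Repetitive toy corpus in which only "new york" recurs often enough to be joined.
corpus = [
    ["new", "york", "is", "large"],
    ["new", "york", "never", "sleeps"],
    ["boston", "is", "old"],
    ["chicago", "never", "stops"],
] * 5

phrases = Phrases(corpus, min_count=5, threshold=0.1)
frozen_phrases = phrases.freeze()  # lightweight, read-only phrase detector

print(frozen_phrases[["new", "york", "is", "large"]])  # ['new_york', 'is', 'large']
```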
The gensim.summarization module has been removed. Despite its general-sounding name, the module did not satisfy the majority of production use cases and was likely to waste people's time. See this Github ticket for more of the motivation behind this.
A rarely used contributed module, with poor quality in both code and documentation, was removed as well.
The similarities.index module has been renamed to similarities.annoy. The original name was too broad; the new name makes it clear this module employs the Annoy kNN library, while there's also similarities.nmslib etc.
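As a usage sketch of the renamed module (assuming a trained model bound to `model` with "night" in its vocabulary, and that the optional annoy package is installed):

```python
from gensim.similarities.annoy import AnnoyIndexer

indexer = AnnoyIndexer(model, num_trees=100)  # build an Annoy index over the word vectors
approx_neighbours = model.wv.most_similar("night", topn=5, indexer=indexer)
print(approx_neighbours)
```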
The wrappers of 3rd-party libraries have been removed: they required too much effort, and there were no volunteers to maintain and support them properly in Gensim. If your work depends on any of the modules below, feel free to copy it out of Gensim 3.8.3 (the last release where they appear) and extend & maintain it yourself.
The removed submodules are:
- gensim.models.wrappers.dtmmodel
- gensim.models.wrappers.ldamallet
- gensim.models.wrappers.ldavowpalwabbit
- gensim.models.wrappers.varembed
- gensim.models.wrappers.wordrank
- gensim.sklearn_api.atmodel
- gensim.sklearn_api.d2vmodel
- gensim.sklearn_api.ftmodel
- gensim.sklearn_api.hdp
- gensim.sklearn_api.ldamodel
- gensim.sklearn_api.ldaseqmodel
- gensim.sklearn_api.lsimodel
- gensim.sklearn_api.phrases
- gensim.sklearn_api.rpmodel
- gensim.sklearn_api.text2bow
- gensim.sklearn_api.tfidf
- gensim.sklearn_api.w2vmodel
- gensim.viz