The Python Global Interpreter Lock ("GIL") and other inter-thread bottlenecks prevent gensim's code from saturating all CPU cores when using the classic Word2Vec/Doc2Vec/etc flexible corpus iterators – where you can supply any re-iterable sequence of texts. You can improve the throughput a bit with steps like:
Gensim Doc2Vec doesn't yet have official support for expanding the vocabulary (via build_vocab(..., update=True)), so the model's behavior here is undefined and unlikely to be useful. In fact, I think any existing doc-tags will be completely discarded and replaced with those in the latest corpus. (Additionally, there are outstanding unresolved reports of memory-fault process crashes when trying to use build_vocab(..., update=True) with Doc2Vec, such as this issue.) Even if that worked, there are a number of murky balancing issues to consider when continuing to call train() on a model with texts different from the initial training set. In particular, each such training session will nudge the model to be better on the new examples, but lose some of the value of the original training, possibly making the model worse for some cases or overall.
It is a good idea to train on all 10 million documents: that will help you capture the general sense of the words, not just their meaning within the context of the authors you are interested in. It will also help if the set of authors you are interested in changes tomorrow. If you think Doc2Vec takes too long, you could instead use FastText to learn word embeddings, then use a simple average or a TF-IDF-weighted average of the word vectors to construct your document vector. You could also leverage hierarchical softmax (as the loss function) in FastText, which can reduce your training time by 1,000-fold or more.
Why use TaggedBrownCorpus when training gensim doc2vec
You shouldn't use TaggedBrownCorpus. It's just a demo class for reading a particular tiny demo dataset that's included with gensim for unit tests and intro tutorials. It does things in a reasonable way for that data format on disk, but any other efficient way of getting your data into a re-iterable sequence of TaggedDocument-like objects is just as good.
Gensim doc2vec file stream training worse performance
Most users should not be calling train() more than once in their own loop, trying to manage the alpha and iterations themselves. It is too easy to get wrong. Specifically, your code that calls train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, stop consulting it, as it's misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)
gensim - Doc2Vec: MemoryError when training on english Wikipedia
The required model size in addressable memory is largely a function of the number of weights required, which is determined by the number of unique words and unique doc-tags. With 145,000,000 unique doc-tags, no matter how few words you limit yourself to, the raw doc-vectors in training alone will require:
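As a back-of-the-envelope check (assuming float32 weights and an illustrative vector_size of 300, neither stated above), the arithmetic is doc_tags × vector_size × 4 bytes:

```python
# Hedged sketch of the memory arithmetic; 300 dims is an assumed example size.
doc_tags = 145_000_000
vector_size = 300
bytes_per_float = 4  # float32, gensim's in-training weight type

doc_vector_bytes = doc_tags * vector_size * bytes_per_float
print(doc_vector_bytes)          # 174_000_000_000 bytes
print(doc_vector_bytes / 2**30)  # roughly 162 GiB, for the doc-vectors alone
```

That is before counting word vectors, hidden-layer weights, or the vocabulary dictionary, so a smaller vector_size or fewer doc-tags is usually the only way to fit such a corpus in RAM.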