Word embedding training
I have one corpus, and I trained word embeddings on it. However, whenever I train the word embeddings, the results are quite different (judging by k-nearest neighbors (kNN)). For example, in the first training, the nearest-neighbor words of 'computer' were 'laptops', 'computerized', 'hardware'. But in the second training, the kNN words were 'software', 'machine', ... ('laptops' was ranked low!). Each training was performed independently for 20 epochs, and the hyper-parameters were the same.
I want the trained word embeddings to be similar across runs (e.g., with 'laptops' ranked high each time). What should I do? Should I adjust the hyper-parameters (learning rate, initialization, etc.)?
You didn't say which word2vec software you're using, which might change the relevant factors.
The word2vec algorithm inherently uses randomness, in both initialization and several aspects of training (like the selection of negative examples, if using negative sampling, or the random downsampling of very frequent words). Additionally, if you're doing multithreaded training, the essentially random jitter in OS thread scheduling changes the order of the training examples, introducing another source of randomness. So you shouldn't expect subsequent runs, even with the exact same parameters and corpus, to give identical results.
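For instance, if you happen to be using gensim (an assumption, since the software isn't stated), run-to-run variation can be reduced by fixing the random seed and training single-threaded, at a large speed cost. A minimal sketch:

```python
from gensim.models import Word2Vec

# Note: fully deterministic runs also need string-hash randomization
# disabled *before* the interpreter starts, e.g.:
#   PYTHONHASHSEED=42 python train.py

sentences = [
    ['the', 'computer', 'runs', 'software'],
    ['laptops', 'are', 'portable', 'computers'],
    # ... your real corpus here
]

model = Word2Vec(
    sentences,
    vector_size=100,  # called 'size' in gensim < 4.0
    epochs=20,        # called 'iter' in gensim < 4.0
    min_count=1,      # only so this toy corpus keeps all words
    seed=42,          # fixes the RNG used for init & negative sampling
    workers=1,        # a single thread removes scheduling jitter (slower)
)
print(model.wv.most_similar('computer', topn=5))
```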
Still, with enough data, suitable parameters, and a proper training loop, the relative-neighbors results should be fairly similar from run to run. If they're not, more data or more iterations might help.
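To check this quantitatively, you can measure how much two independently trained models agree on a word's top-N neighbors. A sketch, assuming gensim models as above (the helper and the two model variables are hypothetical, not library calls):

```python
def neighbor_overlap(model_a, model_b, word, topn=10):
    """Fraction of the top-N nearest neighbors shared by two models."""
    set_a = {w for w, _ in model_a.wv.most_similar(word, topn=topn)}
    set_b = {w for w, _ in model_b.wv.most_similar(word, topn=topn)}
    return len(set_a & set_b) / topn

# model_run1 / model_run2: two models trained separately on the same corpus.
# e.g. 0.8 would mean 8 of the 10 nearest neighbors agree across runs.
print(neighbor_overlap(model_run1, model_run2, 'computer'))
```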
You'd expect wildly different results if the model is overlarge (too many dimensions/words) for your corpus, making it prone to overfitting. That is, it finds a great configuration for the data through essentially memorizing its idiosyncrasies, without achieving any generalization power. And if such overfitting is possible, there are typically many equally good such memorizations, so results can be very different run-to-run. Meanwhile, a right-sized model with lots of data will instead be capturing true generalities, and those will be more consistent from run to run, despite the randomization.
Getting more data, using smaller vectors, using more training passes, or upping the minimum count of word occurrences needed to retain/train a word may all help. (Very infrequent words don't get high-quality vectors, but wind up interfering with the quality of other words, and then randomly intrude into most-similar lists.)
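In gensim terms (again, assuming that library), those remedies correspond to a few constructor parameters; the values below are illustrative only:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_sentences,   # hypothetical iterable of tokenized sentences
    vector_size=50,     # smaller vectors: less capacity to overfit
    epochs=40,          # more training passes over the same data
    min_count=10,       # drop words seen fewer than 10 times
)
```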
To know what else might be awry, you should clarify things in your question like:
- the software used
- the modes/metaparameters used
- the corpus size, in number of examples, average example size in words, and unique-words count (both in the raw corpus, and after any minimum-count is applied)
- the methods of preprocessing
- any code you're using for training (if you're managing multiple training passes yourself)