Word embedding training
I have one corpus, and I trained word embeddings on it. However, whenever I train the word embeddings, the results are quite different (judging by k-nearest neighbors (kNN)). For example, in the first training, the nearest-neighbor words of 'computer' were 'laptops', 'computerized', 'hardware'. But in the second training, the kNN words were 'software', 'machine', ... ('laptops' was ranked low!). Each training was performed independently for 20 epochs, and the hyper-parameters were the same.
I want the trained word embeddings to be similar across runs (e.g., with 'laptops' ranked high each time). What should I do? Should I adjust the hyper-parameters (learning rate, initialization, etc.)?
You didn't say which word2vec software you're using, which might change the relevant factors.
The word2vec algorithm inherently uses randomness, in both initialization and several aspects of training (like the selection of negative examples, if using negative sampling, or the random downsampling of very frequent words). Additionally, if you're doing multithreaded training, the essentially random jitter in OS thread scheduling changes the order of the training examples, introducing another source of randomness. So you shouldn't expect subsequent runs, even with the exact same parameters and corpus, to give identical results.
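For instance, if you happen to be using gensim (an assumption, since the software isn't stated), run-to-run variation can be reduced by fixing the random seed and training single-threaded, at a large speed cost. A minimal sketch:

```python
from gensim.models import Word2Vec

# Note: fully deterministic runs also need string-hash randomization
# disabled *before* the interpreter starts, e.g.:
#   PYTHONHASHSEED=42 python train.py

sentences = [
    ['the', 'computer', 'runs', 'software'],
    ['laptops', 'are', 'portable', 'computers'],
    # ... your real corpus here
]

model = Word2Vec(
    sentences,
    vector_size=100,  # called 'size' in gensim < 4.0
    epochs=20,        # called 'iter' in gensim < 4.0
    min_count=1,      # only so this toy corpus keeps all words
    seed=42,          # fixes the RNG used for init & negative sampling
    workers=1,        # a single thread removes scheduling jitter (slower)
)
print(model.wv.most_similar('computer', topn=5))
```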
Still, with enough data, suitable parameters, and a proper training loop, the relative-neighbors results should be fairly similar from run to run. If they're not, more data or more iterations might help.
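To check this quantitatively, you can measure how much two independently trained models agree on a word's top-N neighbors. A sketch, assuming gensim models as above (the helper and the two model variables are hypothetical, not library calls):

```python
def neighbor_overlap(model_a, model_b, word, topn=10):
    """Fraction of the top-N nearest neighbors shared by two models."""
    set_a = {w for w, _ in model_a.wv.most_similar(word, topn=topn)}
    set_b = {w for w, _ in model_b.wv.most_similar(word, topn=topn)}
    return len(set_a & set_b) / topn

# model_run1 / model_run2: two models trained separately on the same corpus.
# e.g. 0.8 would mean 8 of the 10 nearest neighbors agree across runs.
print(neighbor_overlap(model_run1, model_run2, 'computer'))
```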
You'd expect wildly different results if the model is overlarge (too many dimensions/words) for your corpus, making it prone to overfitting. That is, it finds a great configuration for the data through essentially memorizing its idiosyncrasies, without achieving any generalization power. And if such overfitting is possible, there are typically many equally good such memorizations, so results can be very different run-to-run. Meanwhile, a right-sized model with lots of data will instead be capturing true generalities, and those will be more consistent from run to run, despite the randomization.
Getting more data, using smaller vectors, using more training passes, or upping the minimum count of word occurrences needed to retain/train a word may all help. (Very infrequent words don't get high-quality vectors, but wind up interfering with the quality of other words, and then randomly intrude into most-similar lists.)
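In gensim terms (again, assuming that library), those remedies correspond to a few constructor parameters; the values below are illustrative only:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_sentences,   # hypothetical iterable of tokenized sentences
    vector_size=50,     # smaller vectors: less capacity to overfit
    epochs=40,          # more training passes over the same data
    min_count=10,       # drop words seen fewer than 10 times
)
```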
To know what else might be awry, you should clarify things in your question like:
- the software used
- the modes/metaparameters used
- the corpus size, in number of examples, average example size in words, and unique-words count (both in the raw corpus, and after any minimum-count is applied)
- the methods of preprocessing
- any code you're using for training (if you're managing multiple training passes yourself)