python - Training ResNetv1 From Scratch using Tensorflow Slim
Although it is stated for the slim models that train_image_classifier.py can be used to train models from scratch, I found it hard in practice. In my case, I am trying to train ResNet from scratch on a local machine with 6xK80s. I used this:
dataset_dir=/nv/hmart1/ashaban6/scratch/data/imagenet_rf_record
train_dir=/nv/hmart1/ashaban6/scratch/train_dir
depth=50
num_clones=8

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8" python train_image_classifier.py \
  --train_dir=${train_dir} \
  --dataset_name=imagenet \
  --model_name=resnet_v1_${depth} \
  --max_number_of_steps=100000000 \
  --batch_size=32 \
  --learning_rate=0.1 \
  --learning_rate_decay_type=exponential \
  --dataset_split_name=train \
  --dataset_dir=${dataset_dir} \
  --optimizer=momentum \
  --momentum=0.9 \
  --learning_rate_decay_factor=0.1 \
  --num_epochs_per_decay=30 \
  --weight_decay=0.0001 \
  --num_readers=12 \
  --num_clones=$num_clones
I followed the same settings suggested in the paper. I am using 8 GPUs on a local machine with a batch_size of 32, so the effective batch size is 32x8=256. The learning rate is initially set to 0.1 and is decayed by a factor of 10 every 30 epochs. After 70K steps (70000x256/1.2e6 ~ 15 epochs), the top-1 performance on the validation set is as low as ~14%, while it should be around 50% after that many iterations. I used the following command to get the top-1 performance:
dataset_dir=/nv/hmart1/ashaban6/scratch/data/imagenet_rf_record
checkpoint_file=/nv/hmart1/ashaban6/scratch/train_dir/
depth=50

CUDA_VISIBLE_DEVICES="10" python eval_image_classifier.py \
  --alsologtostderr \
  --checkpoint_path=${checkpoint_file} \
  --dataset_dir=${dataset_dir} \
  --dataset_name=imagenet \
  --dataset_split_name=validation \
  --model_name=resnet_v1_${depth}
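To double-check the step and epoch arithmetic quoted above, here is a minimal sketch in plain Python. The helper names are hypothetical and not part of TF-Slim, the 1.2e6 image count is the rounded figure used above, and the staircase behaviour of the decay is my assumption about the intended schedule rather than something read out of the training script.

```python
# Sanity check of the numbers quoted in the question.
# All function names are hypothetical helpers, not TF-Slim APIs.

def effective_batch_size(per_clone_batch, num_clones):
    """Total images consumed per training step across all clones."""
    return per_clone_batch * num_clones


def epochs_seen(steps, batch_size, num_train_images):
    """Number of passes over the training set after `steps` steps."""
    return steps * batch_size / num_train_images


def staircase_lr(epoch, base_lr=0.1, decay_factor=0.1, epochs_per_decay=30):
    """Assumed schedule: multiply the rate by decay_factor once every
    epochs_per_decay full epochs (discrete jumps, not continuous decay)."""
    return base_lr * decay_factor ** (int(epoch) // epochs_per_decay)


batch = effective_batch_size(32, 8)        # 32 per clone, 8 clones
epochs = epochs_seen(70000, batch, 1.2e6)  # 1.2e6 is the rounded train-set size

print(batch)                 # 256
print(round(epochs, 1))      # 14.9
print(staircase_lr(epochs))  # 0.1
```

Under these assumptions, the run is at roughly 15 epochs and still at the initial learning rate of 0.1, so the first decay would not have been applied yet.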
With the lack of working examples, it is hard to say whether there is a bug in the slim training code or a problem in my script. Is there anything wrong in my script? Has anyone trained ResNet from scratch?