Google Cloud Dataflow: Different behavior for DirectRunner versus DataflowRunner when using argparse
I am building a Google Cloud Dataflow pipeline to process videos. I am having a hard time debugging the pipeline because the environment seems to behave differently on DirectRunner versus DataflowRunner.
My video processing tool (called DeepMeerkat below) takes its arguments from argparse. I call the pipeline like this:
python run_clouddataflow.py \
    --runner DataflowRunner \
    --project $project \
    --staging_location $bucket/staging \
    --temp_location $bucket/temp \
    --job_name $project-deepmeerkat \
    --setup_file ./setup.py \
    --maxnumworkers 3 \
    --tensorflow \
    --training
where the last two arguments, --tensorflow and --training, are both for my pipeline, and the rest are needed by Cloud Dataflow.
I parse the args and pass the remaining argv to the pipeline:
beam.Pipeline(argv=pipeline_args)
and within DeepMeerkat's argparse, I parse the known args:
args, _ = parser.parse_known_args()
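To make the split concrete, here is a minimal sketch of how parse_known_args can separate the tool's own flags from everything meant for the runner. The parser below is a hypothetical stand-in for DeepMeerkat's; the flag semantics (--tensorflow switches TensorFlow off, --training switches training on) are assumed from the behavior described in this post, and split_args is an illustrative name.

```python
import argparse


def split_args(argv):
    """Return (tool_args, pipeline_args) for an explicit list of tokens."""
    parser = argparse.ArgumentParser()
    # Assumed semantics: passing the flag flips the default.
    parser.add_argument("--tensorflow", action="store_false")  # default True
    parser.add_argument("--training", action="store_true")     # default False
    # With an explicit list, argparse does not look at sys.argv at all.
    # Anything it does not recognize is returned untouched as the second value.
    return parser.parse_known_args(argv)


argv = [
    "--runner", "DataflowRunner",
    "--setup_file", "./setup.py",
    "--tensorflow",
    "--training",
]
args, pipeline_args = split_args(argv)
print(args)
print(pipeline_args)
```

Note that calling parse_known_args() with no argument, as in the line above, reads sys.argv[1:] of whichever process happens to be running that code.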
This works locally: tensorflow is turned off (default on) and training is turned on (default off). Printing the args confirms the behavior. But parsing fails on Cloud Dataflow: tensorflow stays on and training stays off.
DirectRunner:
DeepMeerkat args: Namespace(tensorflow=False, training=True)
From the logging of DataflowRunner:
DeepMeerkat args: Namespace(tensorflow=True, training=False)
Any ideas of what's going on here? Identical commands, identical code, only changing DirectRunner to DataflowRunner.
I'd rather not go down the road of passing custom arguments as pipeline options, since I would then need to assign them somehow downstream. If one already has a tool that parses its own arguments, this seems like the more straightforward solution, provided there isn't something special about the Dataflow worker.
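I had the wrong conceptual model for this. Locally, each "worker" still has access to sys.argv, so it is not that the runner behavior is different, but rather that the "worker" was circumventing the cloud pipeline and grabbing fresh args to parse. On a Dataflow worker the original command line is not there, so the parser falls back to its defaults. The way to do this in DataflowRunner is to explicitly pass the pipeline args to the DoFn using
__init__(self, args)
and to parse the args internally within the Beam pipeline if they came in as a string.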
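A minimal sketch of that pattern follows. It is an illustration under assumptions, not DeepMeerkat's actual code: ProcessVideo and build_parser are hypothetical names, the flag semantics are inferred from the output shown in the question, and the import guard is only there so the sketch can be exercised without Beam installed.

```python
import argparse
import shlex

try:
    import apache_beam as beam
    _DoFnBase = beam.DoFn
except ImportError:
    # Fallback so the sketch can be run without Beam; a real pipeline needs beam.DoFn.
    _DoFnBase = object


def build_parser():
    """Hypothetical stand-in for the tool's argparse parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tensorflow", action="store_false")  # default True
    parser.add_argument("--training", action="store_true")     # default False
    return parser


class ProcessVideo(_DoFnBase):
    """Hypothetical DoFn that carries the launch-time arguments with it."""

    def __init__(self, argv):
        super().__init__()
        # Accept either a list of tokens or one command-line string.
        self.argv = shlex.split(argv) if isinstance(argv, str) else list(argv)

    def process(self, element):
        # Parse the stored arguments, never the worker's own sys.argv.
        args, _ = build_parser().parse_known_args(self.argv)
        yield element, args.tensorflow, args.training
```

In the launcher this would be used as something like beam.ParDo(ProcessVideo(sys.argv[1:])), evaluated on the machine that submits the job. The DoFn instance is serialized and shipped to the workers, so the stored argument list travels with it, whereas sys.argv on the worker does not contain the original command line. On Dataflow, the DoFn and parser would normally live in a module installed through the --setup_file package so the workers can import them.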