Concurrent inference with TensorFlow
Looking at the CUDA backend in TensorFlow, it appears that computation is synchronized on a single CUDA stream (which I think makes sense).
This suggests that one could run inference on two different TensorFlow models concurrently, sharing the same GPU: each model would have its own CUDA stream and execute independently.
Is this in fact a supported use case? In particular, I'm curious (1) whether it's expected to work and (2) whether there are performance concerns.
Obviously latency may be a little higher if two models are sharing the same compute resources, but I'm curious whether there are assumptions in TensorFlow that would cause more significant performance penalties.
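For concreteness, here's a minimal sketch of the kind of setup I have in mind, assuming TF 1.x sessions and frozen GraphDefs; the model files and tensor names (model_a.pb, model_b.pb, input:0, output:0) are just placeholders:

```python
import threading
import numpy as np
import tensorflow as tf

def load_model(graph_def_path):
    """Load a frozen GraphDef into its own Graph and Session."""
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(graph_def_path, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")
    # Let the two sessions share the GPU instead of each grabbing
    # all of its memory up front.
    config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
    return tf.Session(graph=graph, config=config)

def run_inference(sess, input_name, output_name, batch):
    return sess.run(output_name, feed_dict={input_name: batch})

# Hypothetical model files and tensor names.
sess_a = load_model("model_a.pb")
sess_b = load_model("model_b.pb")

batch = np.random.rand(1, 224, 224, 3).astype(np.float32)

# Drive each model from its own Python thread; sess.run releases the GIL,
# so both models can be in flight on the GPU at the same time.
t_a = threading.Thread(target=run_inference, args=(sess_a, "input:0", "output:0", batch))
t_b = threading.Thread(target=run_inference, args=(sess_b, "input:0", "output:0", batch))
t_a.start(); t_b.start()
t_a.join(); t_b.join()
```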
Thanks!