Distributed TensorFlow in Kubernetes

Introduction

Distributed TensorFlow (clustering) can speed up your training. Running distributed TensorFlow in Kubernetes makes it easy to:

This topic describes how to set up distributed TensorFlow training.

TFJob is a Kubernetes CRD (Custom Resource Definition) that is created by Kubeflow.
TFJob can help you to set

You must know the basic concepts of distributed TensorFlow; see: Distributed TensorFlow
You must know how to write a distributed TensorFlow training program, e.g. with train_and_evaluate
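
As a reminder of the cluster concept mentioned above: train_and_evaluate reads the cluster layout from the TF_CONFIG environment variable, and the TFJob operator sets this variable in every pod it creates. The sketch below shows what such a value might look like for one chief, one worker, and one parameter server. The host names and ports are placeholders chosen for illustration and are not taken from this guide.

```shell
# Illustrative TF_CONFIG as the chief task might see it.
# Host names and ports are made-up placeholders.
TF_CONFIG='{
  "cluster": {
    "chief":  ["tf-dist-chief-0:2222"],
    "worker": ["tf-dist-worker-0:2222"],
    "ps":     ["tf-dist-ps-0:2222"]
  },
  "task": {"type": "chief", "index": 0}
}'
export TF_CONFIG

# Print the value so it can be inspected.
printf '%s\n' "$TF_CONFIG"
```

Each replica gets the same "cluster" section but its own "task" entry, which is how one training script can act as chief, worker, or parameter server.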

1. Create (or download) the source and Dockerfile: File:Iris train and eval.zip, and unzip everything into the same folder.

2. Build the training image, where "ecgwc" is the Docker Hub username and "tf-iris:dist" is the image name and tag

$ docker build -t ecgwc/tf-iris:dist .

3. Check that the training image runs.

$ docker run --rm ecgwc/tf-iris:dist

4. Push the image to Docker Hub

$ docker push ecgwc/tf-iris:dist

5. Create (or download) the YAML file for distributed TensorFlow: File:Tf-dist-iris.zip
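
The contents of the attached YAML file are not reproduced on this page. If you are writing the file yourself, the sketch below shows roughly what a TFJob manifest for this image might look like. It is an assumption rather than the attached file: the apiVersion depends on the installed Kubeflow version, the job name "tf-dist" is inferred from the pod name used in step 7, and the replica counts and restart policies are arbitrary. It is written to tf-dist-iris-sketch.yaml so that it does not overwrite the real manifest.

```shell
# Write a hypothetical TFJob manifest for illustration only.
cat > tf-dist-iris-sketch.yaml <<'EOF'
apiVersion: kubeflow.org/v1        # assumption: depends on the Kubeflow version
kind: TFJob
metadata:
  name: tf-dist                    # pods are named tf-dist-<role>-<index>
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # container name the TFJob operator looks for
              image: ecgwc/tf-iris:dist
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: ecgwc/tf-iris:dist
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: ecgwc/tf-iris:dist
EOF

# Show the replica roles defined in the sketch.
grep -E '^    (Chief|Worker|PS):' tf-dist-iris-sketch.yaml
```

All three roles run the same image; the operator tells each pod which role it plays through the environment it injects.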

6. Deploy the YAML file to Kubernetes

$ kubectl create -f tf-dist-iris.yaml

7. Check the training status

$ kubectl -n kubeflow logs tf-dist-chief-0
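
Beyond reading the chief log once, it can help to look at the job and its pods as a whole. The helper below is a minimal sketch of such a check. It assumes the job is named "tf-dist" and runs in the "kubeflow" namespace, both inferred from the pod name tf-dist-chief-0 above; the function name check_training is made up for this example.

```shell
# Sketch of follow-up checks. The namespace and job name are assumptions
# inferred from the pod name tf-dist-chief-0.
check_training() {
  ns="kubeflow"
  job="tf-dist"

  # List the TFJob resources in the namespace.
  kubectl -n "$ns" get tfjobs

  # List the pods so the chief, worker, and ps replicas can be seen.
  kubectl -n "$ns" get pods

  # Follow the chief log until interrupted.
  kubectl -n "$ns" logs -f "${job}-chief-0"
}
```

Run check_training from a shell that has access to the cluster, and press Ctrl-C to stop following the log.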