Distributed Tensorflow in Kubernetes
== Introduction ==
Distributed TensorFlow (clustering) can speed up your training.
Distributed TensorFlow in Kubernetes makes it easy to:
# Add k8s nodes to extend computing capability
# Simplify the work needed to set up distributed TensorFlow training
This topic describes how to run a distributed TensorFlow training job on Kubernetes.
== TFJob ==
* TFJob is a CRD (Custom Resource Definition) for k8s that is created by Kubeflow (a quick way to verify this is shown below).
* TFJob helps you set up the training cluster: you declare the replica types and counts (e.g. chief, worker, PS) in a manifest, and the TFJob operator creates the corresponding pods and generates the TF_CONFIG environment variable for each of them.
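Before creating a TFJob, you can confirm that the CRD is registered and that the operator is running. A minimal check, assuming Kubeflow was installed into the kubeflow namespace (exact resource names can differ between Kubeflow versions):
<syntaxhighlight lang="bash">
# The TFJob CRD registered by Kubeflow (usually tfjobs.kubeflow.org)
$ kubectl get crd | grep -i tfjob
# The TFJob operator pod that watches for TFJob resources
$ kubectl -n kubeflow get pods | grep -i tf-job
</syntaxhighlight>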
== Prerequisites ==
# You must know the basic concepts of distributed TensorFlow: [https://www.tensorflow.org/deploy/distributed Distributed TensorFlow]
# You must know how to write a distributed TensorFlow training program, e.g. with [https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate train_and_evaluate]; an illustration of the TF_CONFIG variable it relies on is shown below.
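What ties these prerequisites to Kubernetes is the TF_CONFIG environment variable: tf.estimator.train_and_evaluate reads the cluster layout and the current task from TF_CONFIG, and the TFJob operator generates this variable for every pod, so the training code does not have to hard-code any host names. A purely illustrative example of the shape of TF_CONFIG for the first worker (host names, port, and replica counts here are assumptions):
<syntaxhighlight lang="bash">
# Illustrative only: the TFJob operator sets a value of this shape for each replica.
$ export TF_CONFIG='{
  "cluster": {
    "chief":  ["tf-dist-chief-0:2222"],
    "worker": ["tf-dist-worker-0:2222", "tf-dist-worker-1:2222"],
    "ps":     ["tf-dist-ps-0:2222"]
  },
  "task": {"type": "worker", "index": 0}
}'
</syntaxhighlight>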
== Steps ==
1. Create (download) the source & Dockerfile [[File:Iris train and eval.zip]] and unzip them into the same folder.
2. Build the training image, where "ecgwc" is the Docker Hub username and "tf-iris:dist" is the image name and tag.
<syntaxhighlight lang="bash">
$ docker build -t ecgwc/tf-iris:dist .
</syntaxhighlight>
3. Check that the training image works.
<syntaxhighlight lang="bash">
$ docker run --rm ecgwc/tf-iris:dist
</syntaxhighlight>
[[File:Dist tf k8s-1.png]]
4. Push the image to Docker Hub.
<syntaxhighlight lang="bash">
$ docker push ecgwc/tf-iris:dist
</syntaxhighlight>
5. Create (download) the YAML file for the distributed TensorFlow job: [[File:Tf-dist-iris.zip]]
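For orientation, here is a rough sketch of what a TFJob manifest of this kind usually looks like. It is not the content of Tf-dist-iris.zip: the apiVersion, replica counts, and the job name tf-dist (guessed from the pod name used in step 7) are assumptions, so adjust them to your Kubeflow version and to the downloaded file.
<syntaxhighlight lang="bash">
# Sketch only -- the real manifest is inside Tf-dist-iris.zip.
# apiVersion may be kubeflow.org/v1beta1 or kubeflow.org/v1 depending on the Kubeflow version.
$ cat <<'EOF' > tf-dist-iris-example.yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tf-dist
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow          # TFJob expects the container to be named "tensorflow"
            image: ecgwc/tf-iris:dist
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: ecgwc/tf-iris:dist
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: ecgwc/tf-iris:dist
EOF
</syntaxhighlight>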
6. Deploy the YAML to k8s.
<syntaxhighlight lang="bash">
$ kubectl create -f tf-dist-iris.yaml
</syntaxhighlight>
7. Check the training status.
* Check the pods
[[File:Dist tf k8s-2.png]]
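The pod list can also be checked directly with kubectl (assuming the manifest uses the kubeflow namespace):
<syntaxhighlight lang="bash">
$ kubectl -n kubeflow get pods
</syntaxhighlight>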
* Check the TFJob
[[File:Dist tf k8s-3.png]]
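The TFJob resource itself can be inspected as well; the job name tf-dist is an assumption inferred from the pod name tf-dist-chief-0 used below:
<syntaxhighlight lang="bash">
$ kubectl -n kubeflow get tfjobs
$ kubectl -n kubeflow describe tfjob tf-dist
</syntaxhighlight>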
* Check the training log
<syntaxhighlight lang="bash">
$ kubectl -n kubeflow logs tf-dist-chief-0
</syntaxhighlight>
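When training has finished, or to re-run the job from scratch, the TFJob and its pods can be removed with the same manifest:
<syntaxhighlight lang="bash">
$ kubectl delete -f tf-dist-iris.yaml
</syntaxhighlight>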
== Reference ==
https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow