Multi-GPU in Keras

How can you program in the Keras library (or TensorFlow) to partition training across multiple GPUs? Suppose you are on an Amazon EC2 instance that has 8 GPUs and you would like to use all of them to train faster, but your code is written for a single CPU or GPU only.

— Hector Blandin

Have you checked the TensorFlow documentation?

— n1tk

@sb0709: I started reading it this morning, but I was wondering how to do it in Keras.

— Hector Blandin

I don't know about Keras, but for TensorFlow: tf uses the GPU by default for computation, even for ops that could run on the CPU (if a supported GPU is present). So you can simply write a for loop: "for d in ['/gpu:1', '/gpu:2', '/gpu:3' ... '/gpu:8',]:" and inside it "tf.device(d)", which should cover all the GPU resources of your instance. That way tf.device() will actually be used.

— n1tk

Like this?? for d in ['/gpu:1', '/gpu:2', '/gpu:3' ... '/gpu:8',]: tf.device(d), and that's it? I will try it that way :)

— Hector Blandin

As far as I know, yes, you can run any task on different devices.

— n1tk

Answers:

From the Keras FAQ:

https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus

Below is code copied from there to enable 'data parallelism', that is, having each GPU process a different subset of the data independently.

from keras.utils import multi_gpu_model

# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)

Note that this appears to work only with the TensorFlow backend at the time of writing.
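The batch arithmetic mentioned in the code comment (a batch of 256 over 8 GPUs gives 32 samples each) can be sketched in plain Python. This is only an illustration of contiguous, equal-sized splitting; the function name `split_batch` is made up here, and giving the remainder to the last replica is an assumption, not a quote of Keras internals.

```python
def split_batch(batch_size, n_gpus):
    """Return (start, size) for each GPU's contiguous slice of one batch."""
    step = batch_size // n_gpus
    slices = []
    for i in range(n_gpus):
        start = i * step
        # Assumption: the last replica also takes any leftover samples.
        size = batch_size - start if i == n_gpus - 1 else step
        slices.append((start, size))
    return slices


slices = split_batch(256, 8)
print(slices[0], slices[-1])             # first and last slice
print(sum(size for _, size in slices))   # total number of samples covered
```

With 256 samples and 8 GPUs every slice has 32 samples, and the sizes add up to the full batch.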

Update (Feb 2018):

Keras now accepts automatic GPU selection with multi_gpu_model, so you no longer have to hardcode the number of GPUs. Details are in this pull request. In other words, this enables code that looks like this:

try:
    model = multi_gpu_model(model)
except:
    pass

But to be more explicit, you can use the following:

parallel_model = multi_gpu_model(model, gpus=None)
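An alternative to wrapping the call in a bare try/except is to check the GPU count first and only wrap the model when there are at least two. The sketch below is hypothetical: `maybe_parallelize` and `fake_wrap` are names invented for illustration, and in real use the `wrap` argument would be `multi_gpu_model`. A stand-in is used here so the snippet runs without Keras installed.

```python
def maybe_parallelize(model, n_gpus, wrap):
    """Wrap the model for multi-GPU training only when it can help."""
    if n_gpus >= 2:
        return wrap(model, gpus=n_gpus)
    # With zero or one GPU, keep the plain single-device model.
    return model


def fake_wrap(model, gpus):
    """Stand-in for multi_gpu_model, for demonstration only."""
    return "parallel(%s, gpus=%d)" % (model, gpus)


print(maybe_parallelize("model", 1, fake_wrap))
print(maybe_parallelize("model", 8, fake_wrap))
```

Because the single-GPU case returns the original model object, code that inspects layers keeps working unchanged on a one-GPU machine.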

Bonus:

To check that you are really using all of your GPUs, specifically NVIDIA ones, you can monitor their usage in the terminal with:

watch -n0.5 nvidia-smi
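If you would rather log utilization from inside a training script, a rough sketch is to call nvidia-smi in its CSV query mode and parse the result. The helper names `parse_gpu_stats` and `query_gpu_stats` are invented for this example, and the sample text is made-up input for the parser, not real measurements.

```python
import subprocess

# nvidia-smi CSV query: one line per GPU, values without units.
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(text):
    """Parse 'index, util, mem' lines into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        stats.append({"index": int(index),
                      "util_pct": int(util),
                      "mem_mib": int(mem)})
    return stats


def query_gpu_stats():
    """Run nvidia-smi (must be on PATH) and return the parsed stats."""
    out = subprocess.check_output(QUERY).decode("utf-8")
    return parse_gpu_stats(out)


# Made-up sample in the same format, so the parser can be tried without a GPU.
sample = "0, 97, 10240\n1, 95, 10180\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

On a machine with NVIDIA drivers, calling `query_gpu_stats()` during training should show non-zero utilization on every GPU that is actually in use.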


— weiji14

Does multi_gpu_model(model, gpus=None) work in the case where there is only 1 GPU? It would be cool if it adapted automatically to the number of GPUs available.

— CMCDragonkai

Yes, I think it works with 1 GPU, see github.com/keras-team/keras/pull/9226#issuecomment-361692460, but you may need to make sure your code is adapted to run on a multi_gpu_model instead of a simple model. For most cases it probably won't matter, but if you are going to do something like take the output of an intermediate layer, you will need to code accordingly.

— weiji14

Do you have any references on how multi-GPU models differ?

— CMCDragonkai

You mean something like github.com/rossumai/keras-multi-gpu/blob/master/blog/docs/… ?

— weiji14

That reference is great, @weiji14. However, I am also interested in how it works for inference. Does Keras split batches equally, or round-robin, across the available model replicas?

— CMCDragonkai

For TensorFlow:

TensorFlow: Using GPUs

Here is sample code showing how it is used, so that each task is assigned explicitly to a device or devices:

# TensorFlow 1.x API (tf.Session, tf.ConfigProto).
import tensorflow as tf

# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
  sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))

tf uses the GPU by default for computation, even for ops that could run on the CPU (if a supported GPU is present). So you can simply write a for loop: "for d in ['/gpu:1', '/gpu:2', '/gpu:3' ... '/gpu:8',]:" and inside it "tf.device(d)", which should cover all the GPU resources of your instance. That way tf.device() will actually be used.
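One detail worth keeping in mind when writing such a loop: TensorFlow numbers GPUs from 0, so on an 8-GPU machine the device strings run from '/gpu:0' to '/gpu:7'. A small helper (the name `gpu_device_names` is made up for this sketch) avoids typing the list by hand:

```python
def gpu_device_names(n_gpus):
    """Zero-based TensorFlow device strings for n_gpus GPUs."""
    return ["/gpu:%d" % i for i in range(n_gpus)]


print(gpu_device_names(8))
```

Each returned name can then be passed to tf.device(d) in a loop like the one in the sample code above.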

Scaling Keras Model Training to Multiple GPUs

Keras

For Keras with the MXNet backend, use args.num_gpus, where num_gpus is the number of GPUs you want to use:

import keras

# Compile with a multi-GPU context on the MXNet backend; otherwise compile normally.
def backend_agnostic_compile(model, loss, optimizer, metrics, args):
  if keras.backend._backend == 'mxnet':
      gpu_list = ["gpu(%d)" % i for i in range(args.num_gpus)]
      model.compile(loss=loss,
          optimizer=optimizer,
          metrics=metrics, 
          context = gpu_list)
  else:
      if args.num_gpus > 1:
          print("Warning: num_gpus > 1 but not using MxNet backend")
      model.compile(loss=loss,
          optimizer=optimizer,
          metrics=metrics)
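To make it concrete what the `context` argument receives, here is a small sketch that builds `args` with argparse and prints the resulting list. The `--num-gpus` flag name is an assumption made for this example; the original answer does not show how `args` is created.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; argparse exposes it as args.num_gpus.
parser.add_argument("--num-gpus", type=int, default=1)
args = parser.parse_args(["--num-gpus", "4"])

# Same list comprehension as in backend_agnostic_compile above.
gpu_list = ["gpu(%d)" % i for i in range(args.num_gpus)]
print(gpu_list)
```

With `--num-gpus 4` the list holds four entries, 'gpu(0)' through 'gpu(3)'.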

horovod.tensorflow

On top of all that, Uber open-sourced Horovod recently, and I think it is great:

Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model…
loss = …
opt = tf.train.AdagradOptimizer(0.01)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                      config=config,
                                      hooks=hooks) as mon_sess:
 while not mon_sess.should_stop():
   # Perform synchronous training.
   mon_sess.run(train_op)
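Conceptually, the distributed optimizer in the code above makes every process apply the same averaged gradient, which is what keeps the replicas in sync. The plain-Python simulation below only illustrates that averaging step; the name `allreduce_average` is invented here, and this is not how Horovod implements its allreduce internally.

```python
def allreduce_average(per_worker_grads):
    """Average gradients element-wise across workers.

    per_worker_grads: one list of gradient values per worker,
    all lists having the same length.
    """
    n_workers = len(per_worker_grads)
    return [sum(values) / n_workers for values in zip(*per_worker_grads)]


# Two simulated workers, each with a gradient for two parameters.
grads = [[1.0, 2.0], [3.0, 4.0]]
print(allreduce_average(grads))
```

For the two workers above, the averaged gradient is [2.0, 3.0], and every worker would apply that same update.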

— n1tk

Basically, you can follow the example below. All you need to do is specify the CPU and GPU usage values after importing keras.

import keras
import tensorflow as tf

config = tf.ConfigProto( device_count = {'GPU': 1 , 'CPU': 56} )
sess = tf.Session(config=config) 
keras.backend.set_session(sess)

After that, you fit the model.

model.fit(x_train, y_train, epochs=epochs, validation_data=(x_test, y_test))

Finally, you can lower these usage values so that the job does not run at the upper limits.

— johncasey