TensorFlow Core Tutorials - Distributed Training with Keras

2019. 12. 29. 02:28

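Distributed Training with Keras

Go to the tutorial

Overview

tf.distribute.Strategy: an abstraction for distributing training across multiple processing units
- Lets you run distributed training with only small changes to an existing model or training code
- This tutorial uses tf.distribute.MirroredStrategy: it performs in-graph replication with synchronous training on multiple GPUs in a single machine
  - Copies all of the model's variables to each processor
  - Combines the gradients from every processor using all-reduce
  - Applies the combined value to each processor's copy of the model
- See also: the guide to other distribution strategies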
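The three steps of mirrored, synchronous training listed above can be sketched without TensorFlow. The following is a toy illustration only: the function names (`all_reduce_mean`, `gradient`, `train_step`), the one-parameter linear model, and the choice to average the gradients are assumptions made for this sketch, not how MirroredStrategy is implemented internally.

```python
# Toy illustration of synchronous data-parallel training with mirrored variables.
# Pure Python; every name here is hypothetical and chosen for this sketch.

def all_reduce_mean(per_replica_values):
    """Average one value across replicas and give every replica the same result."""
    mean = sum(per_replica_values) / len(per_replica_values)
    return [mean] * len(per_replica_values)

def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train_step(replica_weights, replica_batches, lr=0.1):
    # 1. Each replica computes a gradient on its own shard of the global batch.
    grads = [gradient(w, b) for w, b in zip(replica_weights, replica_batches)]
    # 2. All-reduce combines the gradients so every replica sees the same value.
    reduced = all_reduce_mean(grads)
    # 3. Each replica applies the identical update, so the copies stay in sync.
    return [w - lr * g for w, g in zip(replica_weights, reduced)]

# Two replicas start from the same weight (the variable is mirrored).
weights = [0.0, 0.0]
# Data generated from y = 3x, split between the two replicas.
batches = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]

for _ in range(50):
    weights = train_step(weights, batches)

print(weights)  # both copies hold the same value, close to 3.0
```

Because every replica applies the same reduced gradient, the copies never diverge, which is what lets the rest of the training code treat them as a single model.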

Import the required packages

try:
  %tensorflow_version 2.x
except Exception:
  pass

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

import os

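Download the dataset

Download the MNIST dataset from TensorFlow Datasets
- The dataset is returned in tf.data format
- Setting with_info=True also loads the metadata for the whole dataset (such as the number of training and test examples)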

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']

# Print the metadata
info

tfds.core.DatasetInfo(
    name='mnist',
    version=1.0.0,
    description='The MNIST database of handwritten digits.',
    homepage='http://yann.lecun.com/exdb/mnist/',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann. lecun. com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
    redistribution_info=,
)

Define the distribution strategy

MirroredStrategy object: handles the distribution-related work and provides a context manager
- tf.distribute.MirroredStrategy.scope

strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

Number of devices: 1

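Set up the input pipeline

When training a model on multiple GPUs, increase the batch size so that the extra compute is used effectively
- As a rule: use the largest batch size that fits in GPU memory, and tune the learning rate accordingly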

# Number of examples in the dataset
print(info.splits.total_num_examples)

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

print(num_train_examples, num_test_examples)

BUFFER_SIZE = 10000

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

print('BATCH_SIZE: {}'.format(BATCH_SIZE))

70000
60000 10000
BATCH_SIZE: 64
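How the global batch size grows with the number of replicas, and what that does to the number of steps per epoch, can be checked with plain arithmetic. The helper names below (`global_batch_size`, `steps_per_epoch`) are hypothetical; the dataset sizes (60000 and 10000) and the per-replica batch size (64) come from the output above, and the sketch assumes the last partial batch is kept.

```python
import math

def global_batch_size(per_replica_batch_size, num_replicas):
    """Each replica processes its own batch, so the global batch is the product."""
    return per_replica_batch_size * num_replicas

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)

for num_replicas in (1, 2, 4, 8):
    batch_size = global_batch_size(64, num_replicas)
    train_steps = steps_per_epoch(60000, batch_size)
    test_steps = steps_per_epoch(10000, batch_size)
    print(num_replicas, batch_size, train_steps, test_steps)
```

With a single replica this gives 938 training steps and 157 evaluation steps per pass over the data.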

Pixel values range from 0 to 255 and need to be normalized to the 0-1 range

def scale(image, label):
  image = tf.cast(image, tf.float32)
  image /= 255

  return image, label

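Shuffle the training data and group it into batches for training

# Shuffle with BUFFER_SIZE (defined above), not BATCH_SIZE, so that examples are mixed across a wide window
train_dataset = mnist_train.map(scale).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)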
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
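The argument passed to shuffle is the size of the buffer that examples are drawn from, so it controls how thoroughly the data is mixed. The following pure-Python generator is a rough sketch of that idea, not the tf.data implementation; `buffered_shuffle` and its `seed` parameter are hypothetical names for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a partially random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

small = list(buffered_shuffle(range(20), buffer_size=1))
large = list(buffered_shuffle(range(20), buffer_size=20))
print(small)  # a buffer of 1 leaves the order unchanged
print(large)  # a buffer as large as the dataset gives a full shuffle
```

An element can only be emitted once it has entered the buffer, so a small buffer keeps examples close to their original positions. This is why a buffer much smaller than the dataset gives only a local shuffle.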

Create the model

Create and compile the Keras model inside the strategy.scope context

with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPool2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])

  model.compile(loss='sparse_categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
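To get a feel for what MirroredStrategy has to copy to every replica, the number of trainable parameters in this model can be worked out by hand. The sketch below assumes the Keras defaults for the layers above ('valid' padding and stride 1 for Conv2D, a 2x2 pool with stride 2 for MaxPool2D); the helper names are hypothetical.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square convolution kernel plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

height, width, channels = 28, 28, 1

# Conv2D(32, 3): a 3x3 kernel with 'valid' padding shrinks each side by 2.
conv = conv2d_params(3, channels, 32)
height, width, channels = height - 2, width - 2, 32  # 26 x 26 x 32

# MaxPool2D(): a 2x2 pool with stride 2 halves each spatial dimension.
height, width = height // 2, width // 2  # 13 x 13 x 32

# Flatten(): no parameters, just reshapes to a vector.
flat = height * width * channels

dense1 = dense_params(flat, 64)
dense2 = dense_params(64, 10)
total = conv + dense1 + dense2

print(conv, flat, dense1, dense2, total)
```

Almost all of the parameters sit in the first Dense layer, so that layer dominates both the per-replica memory and the volume of gradients exchanged in each all-reduce.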

Define the callbacks

- TensorBoard: writes logs for TensorBoard so that it can draw the graphs
- Model checkpoint: saves the model at the end of every epoch
- Learning rate scheduler: changes the learning rate after each epoch or batch

# Directory in which to store the checkpoints
checkpoint_dir = './training_checkpoints'

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')

# Function for gradually decaying the learning rate
# You can define whatever decay function you need
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif epoch >= 3 and epoch < 7:
    return 1e-4
  else:
    return 1e-5

# Callback that prints the learning rate at the end of each epoch
class PrintLR(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    print('\nLearning rate for epoch {} is {}'.format(epoch+1, model.optimizer.lr.numpy()))

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]

Training and evaluation

model.fit(train_dataset, epochs=12, callbacks=callbacks)

Epoch 1/12
Learning rate for epoch 1 is 0.0010000000474974513
938/938 [==============================] - 18s 19ms/step - loss: 0.1962 - accuracy: 0.9429
Epoch 2/12
Learning rate for epoch 2 is 0.0010000000474974513
938/938 [==============================] - 12s 12ms/step - loss: 0.0629 - accuracy: 0.9814
Epoch 3/12
Learning rate for epoch 3 is 0.0010000000474974513
938/938 [==============================] - 12s 13ms/step - loss: 0.0430 - accuracy: 0.9874
Epoch 4/12
Learning rate for epoch 4 is 9.999999747378752e-05
938/938 [==============================] - 12s 13ms/step - loss: 0.0240 - accuracy: 0.9934
Epoch 5/12
Learning rate for epoch 5 is 9.999999747378752e-05
938/938 [==============================] - 12s 13ms/step - loss: 0.0206 - accuracy: 0.9948
Epoch 6/12
Learning rate for epoch 6 is 9.999999747378752e-05
938/938 [==============================] - 13s 13ms/step - loss: 0.0187 - accuracy: 0.9955
Epoch 7/12
Learning rate for epoch 7 is 9.999999747378752e-05
938/938 [==============================] - 12s 13ms/step - loss: 0.0172 - accuracy: 0.9961
Epoch 8/12
Learning rate for epoch 8 is 9.999999747378752e-06
938/938 [==============================] - 12s 13ms/step - loss: 0.0148 - accuracy: 0.9967
Epoch 9/12
Learning rate for epoch 9 is 9.999999747378752e-06
938/938 [==============================] - 12s 12ms/step - loss: 0.0146 - accuracy: 0.9968
Epoch 10/12
Learning rate for epoch 10 is 9.999999747378752e-06
938/938 [==============================] - 12s 12ms/step - loss: 0.0144 - accuracy: 0.9968
Epoch 11/12
Learning rate for epoch 11 is 9.999999747378752e-06
938/938 [==============================] - 12s 12ms/step - loss: 0.0142 - accuracy: 0.9969
Epoch 12/12
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 12s 13ms/step - loss: 0.0140 - accuracy: 0.9969

<tensorflow.python.keras.callbacks.History at 0x7f1d34000c50>
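The learning rates printed in the training log follow directly from the decay function. The sketch below repeats that function so it can run on its own and tabulates the rate for each of the 12 epochs. The scheduler passes a 0-based epoch index while the log prints epoch+1, hence the shift; the `schedule` dictionary is a name introduced only for this illustration.

```python
def decay(epoch):
    """Step schedule from the tutorial: 1e-3, then 1e-4, then 1e-5."""
    if epoch < 3:
        return 1e-3
    elif epoch < 7:
        return 1e-4
    return 1e-5

# Key: epoch number as shown in the log (1-based). Value: learning rate used.
schedule = {epoch + 1: decay(epoch) for epoch in range(12)}

for shown_epoch, lr in schedule.items():
    print(shown_epoch, lr)
```

The log shows values such as 0.0010000000474974513 rather than exactly 0.001 because the optimizer stores the learning rate as a 32-bit float.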

# Check the checkpoint directory
!ls {checkpoint_dir}

checkpoint             ckpt_4.data-00000-of-00002
ckpt_10.data-00000-of-00002  ckpt_4.data-00001-of-00002
ckpt_10.data-00001-of-00002  ckpt_4.index
ckpt_10.index             ckpt_5.data-00000-of-00002
ckpt_11.data-00000-of-00002  ckpt_5.data-00001-of-00002
ckpt_11.data-00001-of-00002  ckpt_5.index
ckpt_11.index             ckpt_6.data-00000-of-00002
ckpt_12.data-00000-of-00002  ckpt_6.data-00001-of-00002
ckpt_12.data-00001-of-00002  ckpt_6.index
ckpt_12.index             ckpt_7.data-00000-of-00002
ckpt_1.data-00000-of-00002   ckpt_7.data-00001-of-00002
ckpt_1.data-00001-of-00002   ckpt_7.index
ckpt_1.index             ckpt_8.data-00000-of-00002
ckpt_2.data-00000-of-00002   ckpt_8.data-00001-of-00002
ckpt_2.data-00001-of-00002   ckpt_8.index
ckpt_2.index             ckpt_9.data-00000-of-00002
ckpt_3.data-00000-of-00002   ckpt_9.data-00001-of-00002
ckpt_3.data-00001-of-00002   ckpt_9.index
ckpt_3.index
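The file names in this listing come from the `ckpt_{epoch}` placeholder in checkpoint_prefix. A minimal sketch of how those prefixes expand is shown below, assuming the callback fills the placeholder with the 1-based epoch number (which is what the listing suggests); `prefixes` and `names` are illustrative variables.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')

# Assumption: the placeholder receives the 1-based epoch number, one prefix per epoch.
prefixes = [checkpoint_prefix.format(epoch=epoch + 1) for epoch in range(12)]
names = [os.path.basename(p) for p in prefixes]

print(names[0], names[-1])

# Sorting by name does not follow training order: 'ckpt_9' sorts after 'ckpt_12'.
print(sorted(names)[-1])
```

Since name order and epoch order differ, the most recent checkpoint should not be picked by sorting file names. The `checkpoint` file in the same directory records which prefix was saved last, and tf.train.latest_checkpoint reads it.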

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
eval_loss, eval_acc = model.evaluate(eval_dataset)

print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))

    157/Unknown - 3s 20ms/step - loss: 0.0378 - accuracy: 0.9871
Eval loss: 0.037776678224067516, Eval accuracy: 0.9871000051498413

# Load the TensorBoard extension
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard

%tensorboard --logdir logs

Output hidden; open in https://colab.research.google.com to view.

!ls -sh ./logs

total 4.0K
4.0K train

Export to SavedModel (deprecated)

