Saving and Restoring Models

  • Go to the tutorial
  • A model can be saved during or after training
    • Training can resume from the point where it stopped
    • Saving a model makes it shareable and the work reproducible
  • When publishing a model and its techniques, provide the code that builds the model as well as its trained weights (parameters)
    • This helps others understand how the model works and try it on new data
  • How a model is saved depends on the API in use; this document uses the tf.keras high-level API

Setup

Installation and imports

  • Install the required libraries (if they are missing, install them with pip install -q h5py pyyaml)
try:
  # Colab only
  %tensorflow_version 2.x
except Exception:
  pass
!pip show h5py pyyaml
Name: h5py
Version: 2.10.0
Summary: Read and write HDF5 files from Python
Home-page: http://www.h5py.org
Author: Andrew Collette
Author-email: andrew.collette@gmail.com
License: BSD
Location: /tensorflow-2.1.0/python3.6
Requires: numpy, six
Required-by: Keras-Applications, textgenrnn, tensor2tensor, pymc3, Keras, keras-vis
---
Name: PyYAML
Version: 3.13
Summary: YAML parser and emitter for Python
Home-page: http://pyyaml.org/wiki/PyYAML
Author: Kirill Simonov
Author-email: xi@resolvent.net
License: MIT
Location: /usr/local/lib/python3.6/dist-packages
Requires: 
Required-by: PyDrive, Keras, featuretools, fastai, distributed, coveralls, bokeh

Getting an example dataset

  • Train a model on the MNIST dataset and save its weights
    • Only the first 10,000 examples are used so the model runs quickly
from __future__ import absolute_import, division, print_function, unicode_literals

import os

import tensorflow as tf
from tensorflow import keras

tf.__version__
'2.1.0-rc1'
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

#print(train_images.shape)
#print(train_labels.shape)
#print(test_images.shape)
#print(test_labels.shape)

train_labels = train_labels[:10000]
test_labels = test_labels[:10000]

train_images = train_images[:10000].reshape(-1, 28*28)/255.0
test_images = test_images[:10000].reshape(-1, 28*28)/255.0

#print(test_images[0])

Defining the model

# Returns a simple Sequential model
def create_model():
  model = tf.keras.models.Sequential([
                                     keras.layers.Dense(512, activation='relu', input_shape=(784,)),
                                     keras.layers.Dropout(0.2),
                                     keras.layers.Dense(10, activation='softmax')
  ])

  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  return model

# Create a model instance
model = create_model()
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 512)               401920    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                5130      
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________

Saving checkpoints during training

  • Automatic checkpoint saving
    • A common practice is to save checkpoints during and at the end of training
    • A model can then be reused without retraining, or training can pick up where it left off if it was interrupted
    • Use the tf.keras.callbacks.ModelCheckpoint callback

Using the checkpoint callback

checkpoint_path = 'training_1/cp.ckpt'
checkpoint_dir = os.path.dirname(checkpoint_path)

cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

model = create_model()
model.fit(train_images, train_labels, epochs=10,
          validation_data = (test_images, test_labels),
          callbacks = [cp_callback])  # pass the callback to training
Train on 10000 samples, validate on 10000 samples
Epoch 1/10
 9760/10000 [============================>.] - ETA: 0s - loss: 0.4361 - accuracy: 0.8741
Epoch 00001: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 2s 195us/sample - loss: 0.4322 - accuracy: 0.8752 - val_loss: 0.2542 - val_accuracy: 0.9248
Epoch 2/10
 9696/10000 [============================>.] - ETA: 0s - loss: 0.2004 - accuracy: 0.9407
Epoch 00002: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 138us/sample - loss: 0.1990 - accuracy: 0.9411 - val_loss: 0.1924 - val_accuracy: 0.9395
Epoch 3/10
 9824/10000 [============================>.] - ETA: 0s - loss: 0.1330 - accuracy: 0.9606
Epoch 00003: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 136us/sample - loss: 0.1331 - accuracy: 0.9604 - val_loss: 0.1704 - val_accuracy: 0.9467
Epoch 4/10
 9696/10000 [============================>.] - ETA: 0s - loss: 0.0953 - accuracy: 0.9715
Epoch 00004: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 135us/sample - loss: 0.0946 - accuracy: 0.9718 - val_loss: 0.1618 - val_accuracy: 0.9520
Epoch 5/10
 9408/10000 [===========================>..] - ETA: 0s - loss: 0.0697 - accuracy: 0.9799
Epoch 00005: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 132us/sample - loss: 0.0712 - accuracy: 0.9792 - val_loss: 0.1408 - val_accuracy: 0.9543
Epoch 6/10
 9984/10000 [============================>.] - ETA: 0s - loss: 0.0499 - accuracy: 0.9857
Epoch 00006: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 133us/sample - loss: 0.0502 - accuracy: 0.9856 - val_loss: 0.1366 - val_accuracy: 0.9589
Epoch 7/10
 9696/10000 [============================>.] - ETA: 0s - loss: 0.0433 - accuracy: 0.9872
Epoch 00007: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 134us/sample - loss: 0.0435 - accuracy: 0.9874 - val_loss: 0.1427 - val_accuracy: 0.9575
Epoch 8/10
 9632/10000 [===========================>..] - ETA: 0s - loss: 0.0280 - accuracy: 0.9929
Epoch 00008: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 137us/sample - loss: 0.0282 - accuracy: 0.9930 - val_loss: 0.1417 - val_accuracy: 0.9589
Epoch 9/10
 9440/10000 [===========================>..] - ETA: 0s - loss: 0.0294 - accuracy: 0.9917
Epoch 00009: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 134us/sample - loss: 0.0296 - accuracy: 0.9918 - val_loss: 0.1364 - val_accuracy: 0.9603
Epoch 10/10
 9472/10000 [===========================>..] - ETA: 0s - loss: 0.0165 - accuracy: 0.9963
Epoch 00010: saving model to training_1/cp.ckpt
10000/10000 [==============================] - 1s 136us/sample - loss: 0.0166 - accuracy: 0.9962 - val_loss: 0.1365 - val_accuracy: 0.9620

<tensorflow.python.keras.callbacks.History at 0x7fe2e04573c8>
!ls {checkpoint_dir}
checkpoint             cp.ckpt.data-00001-of-00002
cp.ckpt.data-00000-of-00002  cp.ckpt.index
  • Create a new, untrained model and evaluate it on the test set
    • An untrained model performs at about chance level (~10% accuracy)
model = create_model()
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("훈련되지 않은 모델의 정확도: {:5.2f}%".format(100*acc))
10000/10000 - 1s - loss: 2.2941 - accuracy: 0.1117
훈련되지 않은 모델의 정확도: 11.17%
  • Load the weights from the checkpoint and evaluate again
    • Note that when restoring weights only, the model must have the same architecture as the original
model.load_weights(checkpoint_path)
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("복원된 모델의 정확도: {:5.2f}%".format(100*acc))
10000/10000 - 1s - loss: 0.1365 - accuracy: 0.9620
복원된 모델의 정확도: 96.20%

Checkpoint callback options

  • Options the callback provides
    • Give each checkpoint a unique name
    • Adjust how often checkpoints are saved
checkpoint_path = 'training_2/cp--{epoch:04d}.ckpt'
checkpoint_dir = os.path.dirname(checkpoint_path)

cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1,
                                                 period=5)  # save weights every fifth epoch
model = create_model()
# Save the weights manually
model.save_weights(checkpoint_path.format(epoch=0))
model.fit(train_images, train_labels,
          epochs=50, callbacks=[cp_callback],
          validation_data = (test_images, test_labels),
          verbose=0)
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of samples seen.

Epoch 00005: saving model to training_2/cp--0005.ckpt

Epoch 00010: saving model to training_2/cp--0010.ckpt

Epoch 00015: saving model to training_2/cp--0015.ckpt

Epoch 00020: saving model to training_2/cp--0020.ckpt

Epoch 00025: saving model to training_2/cp--0025.ckpt

Epoch 00030: saving model to training_2/cp--0030.ckpt

Epoch 00035: saving model to training_2/cp--0035.ckpt

Epoch 00040: saving model to training_2/cp--0040.ckpt

Epoch 00045: saving model to training_2/cp--0045.ckpt

Epoch 00050: saving model to training_2/cp--0050.ckpt

<tensorflow.python.keras.callbacks.History at 0x7fe2e021e390>
!ls {checkpoint_dir}
checkpoint               cp--0025.ckpt.data-00001-of-00002
cp--0000.ckpt.data-00000-of-00002  cp--0025.ckpt.index
cp--0000.ckpt.data-00001-of-00002  cp--0030.ckpt.data-00000-of-00002
cp--0000.ckpt.index           cp--0030.ckpt.data-00001-of-00002
cp--0005.ckpt.data-00000-of-00002  cp--0030.ckpt.index
cp--0005.ckpt.data-00001-of-00002  cp--0035.ckpt.data-00000-of-00002
cp--0005.ckpt.index           cp--0035.ckpt.data-00001-of-00002
cp--0010.ckpt.data-00000-of-00002  cp--0035.ckpt.index
cp--0010.ckpt.data-00001-of-00002  cp--0040.ckpt.data-00000-of-00002
cp--0010.ckpt.index           cp--0040.ckpt.data-00001-of-00002
cp--0015.ckpt.data-00000-of-00002  cp--0040.ckpt.index
cp--0015.ckpt.data-00001-of-00002  cp--0045.ckpt.data-00000-of-00002
cp--0015.ckpt.index           cp--0045.ckpt.data-00001-of-00002
cp--0020.ckpt.data-00000-of-00002  cp--0045.ckpt.index
cp--0020.ckpt.data-00001-of-00002  cp--0050.ckpt.data-00000-of-00002
cp--0020.ckpt.index           cp--0050.ckpt.data-00001-of-00002
cp--0025.ckpt.data-00000-of-00002  cp--0050.ckpt.index
# Check the most recently created checkpoint
latest = tf.train.latest_checkpoint(checkpoint_dir)
latest
'training_2/cp--0050.ckpt'
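  • To evaluate from the newest checkpoint, rebuild the model and load that path. A sketch following the same restore pattern as before (this step was not run in the original notebook, so no output is shown):
# Sketch: rebuild the architecture and load the latest checkpoint
model = create_model()
model.load_weights(latest)
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))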

Checkpoint files

  • What a checkpoint stores
    • One or more shards that contain the model's weights
    • An index file that records which shard each weight lives in
  • Training on a single machine produces a single shard with the extension .data-00000-of-00001
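  • To see what a checkpoint actually stores, the saved variables can be listed. A minimal sketch using tf.train.list_variables, which accepts a checkpoint prefix or a directory (not part of the original notebook):
# Sketch: list the variable names and shapes saved in the newest
# checkpoint found in the directory
for name, shape in tf.train.list_variables(checkpoint_dir):
    print(name, shape)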

Saving weights manually

# Save the weights
model.save_weights('./checkpoints/my_checkpoint')

# Restore the weights
model = create_model()
model.load_weights('./checkpoints/my_checkpoint')

loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("복원된 모델의 정확도: {:5.2f}%".format(100*acc))
10000/10000 - 1s - loss: 0.2244 - accuracy: 0.9663
복원된 모델의 정확도: 96.63%

Saving the entire model

  • The entire model can be saved to a single file
    • Training can later restart from exactly the same state without the original code
  • What gets saved
    • The weights
    • The model configuration (architecture)
    • The optimizer configuration
      • Currently, TensorFlow optimizers cannot be saved; the model must be compiled again after loading, and the optimizer state is not preserved (see the sketch after the HDF5 example below)
    • The model's checkpoints

Saving to an HDF5 file

  • Keras provides a basic save format based on the HDF5 standard
    • The saved model can be treated as a single binary blob
model = create_model()

model.fit(train_images, train_labels, epochs=5)

# Save the entire model to an HDF5 file
model.save('my_model.h5')
Train on 10000 samples
Epoch 1/5
10000/10000 [==============================] - 1s 91us/sample - loss: 0.4399 - accuracy: 0.8770
Epoch 2/5
10000/10000 [==============================] - 1s 69us/sample - loss: 0.2002 - accuracy: 0.9436
Epoch 3/5
10000/10000 [==============================] - 1s 71us/sample - loss: 0.1314 - accuracy: 0.9608
Epoch 4/5
10000/10000 [==============================] - 1s 73us/sample - loss: 0.0992 - accuracy: 0.9711
Epoch 5/5
10000/10000 [==============================] - 1s 72us/sample - loss: 0.0694 - accuracy: 0.9802
# Recreate the exact same model, including its weights and optimizer
new_model = keras.models.load_model('my_model.h5')
new_model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_10 (Dense)             (None, 512)               401920    
_________________________________________________________________
dropout_5 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 10)                5130      
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
loss, acc = new_model.evaluate(test_images, test_labels, verbose=2)
print("복원된 모델의 정확도: {:5.2f}%".format(100*acc))
10000/10000 - 1s - loss: 0.1535 - accuracy: 0.9535
복원된 모델의 정확도: 95.35%
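  • If the optimizer cannot be restored (the caveat noted above), the model can be loaded without compiling and then compiled again by hand. A minimal sketch, assuming the same compile settings as create_model; compile=False is a standard tf.keras.models.load_model argument:
# Sketch: load the architecture and weights only, then compile again;
# the optimizer state starts fresh
model = tf.keras.models.load_model('my_model.h5', compile=False)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])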

Using saved_model (deprecated)

  • An experimental method that may change in future versions
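  • As a stable alternative in TF 2 (a sketch, not part of the original notebook): pass save_format='tf' to model.save, or use a path without an .h5 extension, to write the SavedModel format, then reload it with tf.keras.models.load_model.
# Sketch: SavedModel format instead of the deprecated experimental
# saved_model helpers
model.save('my_saved_model', save_format='tf')
restored_model = tf.keras.models.load_model('my_saved_model')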