Lab10-1. ReLU: Better non-linearity
eunguru
2018. 5. 18. 18:05
ReLU: Better non-linearity
1. XOR problem - 9 hidden layers: poor result
1) Source code
1 input layer, 9 hidden layers, and 1 output layer
sigmoid used as the activation function in every layer
(...)
with tf.name_scope('Layer1') as scope:
    W1 = tf.Variable(tf.random_uniform([2, 5], -1.0, 1.0), name='weight1')
    b1 = tf.Variable(tf.zeros([5]), name='bias1')
    L1 = tf.sigmoid(tf.matmul(X, W1) + b1)

with tf.name_scope('Layer2') as scope:
    W2 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight2')
    b2 = tf.Variable(tf.zeros([5]), name='bias2')
    L2 = tf.sigmoid(tf.matmul(L1, W2) + b2)

with tf.name_scope('Layer3') as scope:
    W3 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight3')
    b3 = tf.Variable(tf.zeros([5]), name='bias3')
    L3 = tf.sigmoid(tf.matmul(L2, W3) + b3)

with tf.name_scope('Layer4') as scope:
    W4 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight4')
    b4 = tf.Variable(tf.zeros([5]), name='bias4')
    L4 = tf.sigmoid(tf.matmul(L3, W4) + b4)

with tf.name_scope('Layer5') as scope:
    W5 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight5')
    b5 = tf.Variable(tf.zeros([5]), name='bias5')
    L5 = tf.sigmoid(tf.matmul(L4, W5) + b5)

with tf.name_scope('Layer6') as scope:
    W6 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight6')
    b6 = tf.Variable(tf.zeros([5]), name='bias6')
    L6 = tf.sigmoid(tf.matmul(L5, W6) + b6)

with tf.name_scope('Layer7') as scope:
    W7 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight7')
    b7 = tf.Variable(tf.zeros([5]), name='bias7')
    L7 = tf.sigmoid(tf.matmul(L6, W7) + b7)

with tf.name_scope('Layer8') as scope:
    W8 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight8')
    b8 = tf.Variable(tf.zeros([5]), name='bias8')
    L8 = tf.sigmoid(tf.matmul(L7, W8) + b8)

with tf.name_scope('Layer9') as scope:
    W9 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight9')
    b9 = tf.Variable(tf.zeros([5]), name='bias9')
    L9 = tf.sigmoid(tf.matmul(L8, W9) + b9)

with tf.name_scope('Layer10') as scope:
    W10 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight10')
    b10 = tf.Variable(tf.zeros([5]), name='bias10')
    L10 = tf.sigmoid(tf.matmul(L9, W10) + b10)

with tf.name_scope('Hypothesis') as scope:
    W11 = tf.Variable(tf.random_uniform([5, 1], -1.0, 1.0), name='weight11')
    b11 = tf.Variable(tf.zeros([1]), name='bias11')
    hypothesis = tf.sigmoid(tf.matmul(L10, W11) + b11)
(...)
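The (...) markers elide the rest of the script. For reference, here is a minimal sketch of the surrounding setup commonly used for this XOR lab (data, placeholders, cost, optimizer, and accuracy); the variable names, learning rate, and step count are assumptions, not the original elided code:

import numpy as np
import tensorflow as tf

# XOR training data (assumed)
x_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y_data = np.array([[0], [1], [1], [0]], dtype=np.float32)

X = tf.placeholder(tf.float32, [None, 2], name='x-input')
Y = tf.placeholder(tf.float32, [None, 1], name='y-input')

# ... Layer1 ~ Layer10 and hypothesis as defined above ...

with tf.name_scope('Cost') as scope:
    # binary cross-entropy cost
    cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) * tf.log(1 - hypothesis))
with tf.name_scope('Train') as scope:
    train = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(cost)

# cast the hypothesis to 0/1 predictions and measure accuracy against the labels
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(200000):
        sess.run(train, feed_dict={X: x_data, Y: y_data})
        if step % 2000 == 0:
            # the lab output below also prints some weight arrays at each logging step (omitted here)
            print("Step:", step, "Cost:", sess.run(cost, feed_dict={X: x_data, Y: y_data}))
    h, p, a = sess.run([hypothesis, predicted, accuracy], feed_dict={X: x_data, Y: y_data})
    print("\nHypothesis:", h, "\nCorrect:", p, "\nAccuracy:", a)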
Graph
2) Results
The cost does not decrease, and the accuracy (0.5) comes out worse than with the 2-layer NN.
(...)
Step: 196000  Cost: 0.6931472
[array([[ 0.15111303, -0.7233987 , -0.08999562,  0.4218123 , -0.38752127], (...)
Step: 198000  Cost: 0.6931472
[array([[ 0.15111303, -0.7233987 , -0.08999562,  0.4218123 , -0.38752127], (...)

Hypothesis:
[[0.49999994]
 [0.49999994]
 [0.50000006]
 [0.50000006]]
Correct:
[[0.]
 [0.]
 [1.]
 [1.]]
Accuracy: 0.5
Cost / Accuracy graph
3) Cause of the problem: backpropagation
Because sigmoid was used as the activation function, the vanishing gradient problem occurs during backpropagation.
The sigmoid output is a value between 0 and 1; for inputs far below 0 the output is close to 0.
Backpropagation computes gradients with the chain rule, so when values close to 0 are multiplied together
layer after layer, the derivative keeps getting smaller and smaller (see the numerical sketch after the list below).
What "vanishing gradient" means
- the gradient (slope) vanishes
- the network becomes hard to train
- the input ends up having almost no influence on the output
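To see the effect numerically, here is a minimal NumPy sketch (illustration only, not part of the lab code): the derivative of sigmoid, sigma(x)*(1 - sigma(x)), is at most 0.25, so the chain-rule product through 9 sigmoid layers shrinks roughly exponentially with depth (the weight factors in the chain rule are ignored here for simplicity).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # maximum value is 0.25, reached at x = 0

# Best case: every pre-activation is exactly 0, one factor of 0.25 per layer
print(0.25 ** 9)                    # ~3.8e-06

# More typical case: pre-activations around 2.0 give an even smaller factor
print(sigmoid_grad(2.0) ** 9)       # ~1.5e-09

Even in the best case the gradient reaching the early layers is on the order of 10^-6, which matches the cost staying stuck at 0.6931472 in the run above.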
2. Solution: use the ReLU activation function
1) The ReLU function: max(0, x)
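For comparison, a minimal NumPy sketch of ReLU and its derivative (illustration only): for positive inputs the derivative is 1, so the chain-rule factors no longer shrink toward 0 the way repeated sigmoid derivatives do.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)           # ReLU: max(0, x)

def relu_grad(x):
    return (x > 0).astype(np.float32)   # derivative: 1 for x > 0, 0 otherwise

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]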
2) Source code
The ReLU activation function is used for the hidden layers (the last layer must output a value between 0 and 1, so it still uses the sigmoid function).
(...)
with tf.name_scope('Layer1') as scope:
    W1 = tf.Variable(tf.random_uniform([2, 5], -1.0, 1.0), name='weight1')
    b1 = tf.Variable(tf.zeros([5]), name='bias1')
    L1 = tf.nn.relu(tf.matmul(X, W1) + b1)
(...)
with tf.name_scope('Layer10') as scope:
    W10 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight10')
    b10 = tf.Variable(tf.zeros([5]), name='bias10')
    L10 = tf.nn.relu(tf.matmul(L9, W10) + b10)

with tf.name_scope('Hypothesis') as scope:
    W11 = tf.Variable(tf.random_uniform([5, 1], -1.0, 1.0), name='weight11')
    b11 = tf.Variable(tf.zeros([1]), name='bias11')
    hypothesis = tf.sigmoid(tf.matmul(L10, W11) + b11)
(...)
Graph
3) Results
The cost comes out low and the accuracy is high (1.0).
(...)
Step: 196000  Cost: 0.00051748654
[array([[-0.36031628, -0.7233987 , -0.08999562,  0.7240177 , -0.73926085], (...)
Step: 198000  Cost: 0.0005120867
[array([[-0.3604096 , -0.7233987 , -0.08999562,  0.7241369 , -0.73939353], (...)

Hypothesis:
[[0.00101205]
 [0.9999987 ]
 [0.9999994 ]
 [0.00101205]]
Correct:
[[0.]
 [1.]
 [1.]
 [0.]]
Accuracy: 1.0
4) Comparison of the cost/accuracy graphs when using sigmoid vs. ReLU
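The cost/accuracy curves come from TensorBoard. Here is a minimal sketch of the kind of scalar-summary logging that produces them, building on the training setup sketched in section 1 (the summary names and log directory are assumptions, not values from the post):

# Scalar summaries for cost and accuracy (assumed names)
cost_summ = tf.summary.scalar('cost', cost)
acc_summ = tf.summary.scalar('accuracy', accuracy)
merged = tf.summary.merge_all()

writer = tf.summary.FileWriter('./logs/xor_relu')   # assumed log directory
writer.add_graph(sess.graph)

# inside the training loop, once per logging step:
summary = sess.run(merged, feed_dict={X: x_data, Y: y_data})
writer.add_summary(summary, global_step=step)

Writing the sigmoid run and the ReLU run to separate log directories and starting tensorboard --logdir=./logs lets the two cost/accuracy curves be compared side by side.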
3. Non-linear activation functions
- sigmoid
- tanh
- ReLU
- Leaky ReLU
- Maxout
- ELU
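For reference, minimal NumPy definitions of these functions (a sketch for illustration; the Maxout form shown takes the maximum over two affine pieces, and the alpha values are common defaults rather than values from the lecture):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                        # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                # max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope instead of 0 for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # element-wise max of two linear units: max(xW1 + b1, xW2 + b2)
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

In TensorFlow 1.x most of these are available directly, e.g. tf.sigmoid, tf.tanh, tf.nn.relu, tf.nn.elu, and (in later 1.x releases) tf.nn.leaky_relu.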