[Hands-On ML] 13장. 텐서플로를 사용한 데이터 적재와 전처리 - 2

TFRecord 포맷

Tensorflow는 대용량 데이터를 저장하고 효율적으로 읽기 위해 TFRecord 포맷을 선호한다. TFRecord 포맷은 크기가 다른 연속된 이진 레코드를 저장하는 단순한 이진 포맷이다.

tf.io.TFRecordWritier를 이용해 TFRecord를 만들고, tf.data.TFRecordDataset을 이용해 읽을 수 있다.

# Make TFRecord
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# Read TFRecord
filepaths = ['my_data.tfrecord']
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

TFRecord를 입축하거나, 압축된 TFRecord를 읽어야 할 때는 다음과 같이 한다.

# Make Compressed
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"This is the first record")

# Read Compressed
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"], compression_type="GZIP")
for item in dataset:
    print(item)

프로토콜 버퍼 개요

일반적으로 TFRecord는 직렬화된 프로토콜 버퍼(protocol buffer)를 담고 있다. 프로토콜 버퍼를 정의하는 방법은 다음과 같다.

syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}

위 코드는 다음과 같은 의미를 가진다.

프로토콜 버퍼 포맷의 버전 3을 사용한다.
각 Person 객체는 다음과 같은 필드를 가진다.
- name: string 타입
- id: int32 타입
- email: string 타입 (반복 필드)
1, 2, 3은 필드 식별자로, 레코드의 이진 표현에 사용된다.

객체를 생성하고 필드를 수정하는 방법은 다음과 같다.

from person_pb2 import Person

# make an object
person = Person(name='Al', id=123, email=['a@b.com'])
print(person)

# read a field
person.name

# modify a filed
person.name = 'Alice'

# repeated field can be refered as an array
person.email[0]

# add a new email address
person.email.append('c@d.com')

# Serialize object to byte string
serialized = person.SerializeToString()

Example 프로토콜 버퍼

일반적으로는 사전 정의된 프로토콜 버퍼를 많이 사용한다.

Example 프로토콜 버퍼는 dataset에 있는 하나의 샘플을 표현하는 버퍼로, 전형적인 주요 프로토콜 버퍼이다. 정의는 다음과 같다.

syntax = "proto3";

message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };

정의를 자세히 살펴보면 다음과 같다.

BytesList, FloatList, Int64List가 정의된다.
[packed=true]는 효율적인 인코딩을 위해 반복적인 수치 필드에 사용된다.
Feature는 BytesList, FloatList, Int64List 중 하나를 담고 있다.
Features는 특성 이름과 특성 값을 매핑한 딕셔너리를 가진다.
Example은 하나의 Features 객체를 가진다.

앞서 만든 Person 객제와 동일한 tf.train.Example 객체를 다음과 같이 만들 수 있다.

from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(
    features = Features(
        feature={
            'name': Feature(bytes_list=BytesList(value=[b'Alice'])),
            'id': Feature(int64_list=Int64List(value=[123])),
            'emails': Feature(bytes_list=BytesList(value=[
                b'a@b.com', b'c@d.com'
            ]))
        }
    )
)

보통 하나 이상의 Example 객체를 만들게 될 것이다.

Example 프로토콜 버퍼를 파싱하는 방법은 다음과 같다.

# define feature dictionary
feature_description = {
    'name': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'id': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'emails': tf.io.VarLenFeature(tf.string),
}

# parse one-by-one
def parse(serialized_example):
    return tf.io.parse_single_example(serialized_example, feature_description)
dataset = tf.data.TFRecordDataset(['my_contacts.tfrecord']).map(parse)
for parsed_example in dataset:
    print(parsed_example)

# parse by batch
def parse(serialized_examples):
    return tf.io.parse_example(serialized_examples, feature_description)
dataset = tf.data.TFRecordDataset(['my_contacts.tfrecord']).batch(2).map(parse)
for parsed_example in dataset:
    print(parsed_example)

가변 길이 특성은 희소 텐서로 파싱된다. tf.sparse_to_dense()로 밀집 텐서로 변환할 수도 있고, 희소 텐서의 값을 바로 참조할 수도 있다. 후자가 더 간단하다.

# 1: tf.sparse.to_dense
tf.sparse.to_dense(parsed_example['emails'], default_value=b'')

# 2: directly refer
parsed_example['emails'].values

리스트의 리스트를 다룰 때는 SequenceExample 프로토콜 버퍼를 사용한다. 파싱할 때는 tf.io.parse_single_sequence_example()이나 tf.io.parse_sequence_example()을 사용한다.

Normalization 층

케라스는 전처리를 위한 Normalization 층을 제공한다. 층을 만들 때 각 특성의 평균과 분산을 전달할 수도 있고, 모델을 훈련하기 전에 adapt() 메서드를 활용해 미리 특성의 평균과 분산을 계산할 수도 있다.

norm_layer = tf.keras.layers.Normalization()
model = tf.keras.models.Sequential(
    [norm_layer,
    tf.keras.layers.Dense(1)]
)
model.compile(loss='mse', optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3))
norm_layer.adapt(X_train)
model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=5)

위와 같이 모델 자체에 전처리 층을 포함하면 훈련에 사용하는 신경망과 제품에 사용하는 신경망이 완전히 일치하여 서로 다른 전처리 층을 가지는 것을 방지할 수 있다. 하지만 Normalization 층의 경우 훈련 속도를 느리게 만들기 때문에 모델에 포함시켜 버리는 것은 별로 좋지 않다.

이를 방지하기 위해서는 다음과 같은 방법을 사용한다.

Normalization 층을 독립적으로 한 번 전처리한다.
전처리 층이 없는 모델로 훈련을 진행한다.
제품에 배포할 때는 전처리 층과 모델을 합쳐서 배포한다.

# 1: independent norm layer
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_train)
X_train_scaled = norm_layer(X_train)
X_valid_scaled = norm_layer(X_valid)

# 2: Train without Preprocessing layer
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
model.compile(loss='mse', optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3))
model.fit(X_train_scaled, y_train, validation_data=(X_valid_scaled, y_valid), epochs=5)

# 3: final model with norm_layer
final_model = tf.keras.models.Sequential([
    norm_layer,
    tf.keras.layers.Dense(1)
])
X_new = X_test[:3]
y_pred = final_model(X_new)

keras 전처리 층은 tf.data API와 함께 사용할 수 있다. 다음 코드는 adapt() 메서드를 호출한 Normalization 층을 dataset에 있는 각 배치의 입력 특성에 적용하는 예시이다.

dataset = dataset.map(lambda X, y: (norm_layer(X), y))

사용자 정의 전처리 층을 만들 수도 있다. 다음은 Normalization 층을 직접 구현한 것이다.

import numpy as np

class MyNormalization(tf.keras.layers.Layer):
    def adapt(self, X):
        self.mean_ = np.mean(X, axis=0, keepdims=True)
        self.std_ = np.std(X, axis=0, keepdims=True)

    def call(self, inputs):
        eps = tf.keras.backend.epsilon()
        return (inputs - self.mean_) / (self.std_ + eps)

Discretization 층

값 범위를 범주(구간, bin)로 매핑할 때는 Discretization 층을 사용한다. 다음 코드는 수치 특성 age를 ‘18미만’, ‘18~50`, ‘50이상’의 세 범주로 매핑하는 코드이다.

age = tf.constant([[10.], [93.], [57.], [18.], [37.], [5.]])
discretize_layer = tf.keras.layers.Discretization(bin_boundaries=[18., 50.])
age_categories = discretize_layer(age)

직접 기준을 정하는 대신 원하는 구간 수를 설정할 수도 있다.

discretize_layer = tf.keras.layers.Discretization(num_bins=3)
discretize_layer.adapt(age)

이렇게 범주형으로 바꾸었다면 다음 단계는 원-핫 인코딩이다.

CategoryEncoding 층

CategoryEncoding 층은 원-핫 인코딩을 지원한다.

>>> onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3)
>>> onehot_layer(age_categories)
<tf.Tensor: shape=(6, 3), dtype=float32, numpy=
array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]], dtype=float32)>

동일한 범주를 사용할 때, 동시에 한 개 이상의 범주형 특성을 인코딩하면 멀티-핫 인코딩을 수행한다. 즉, 입력 특성에 있는 범주에 해당하는 위치마다 값이 1이 된다.

>>> two_age_categories = np.array([[1, 0], [2, 2], [2, 0]])
>>> onehot_layer(two_age_categories)
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 1., 0.],
       [0., 0., 1.],
       [1., 0., 1.]], dtype=float32)>

결과의 각 열이 0, 1, 2에 해당한다. 따라서 [1, 0]은 0과 1, [2, 2]는 2, [2, 0]는 0과 2의 위치에 1이 표현된 것을 볼 수 있다.

바로 앞 예시에서 볼 수 있듯이, 두 개의 값이 나타나도 1로 매핑될 수 있다. 이를 방지하려면 output_mode='count'를 추가한다.

위와 같이 멀티-핫 인코딩을 사용하지 않고 각 특성마다 별도로 인코딩하고 싶다면 별도로 원-핫 인코딩을 한 다음 합쳐야 한다.

>>> onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3 + 3)
>>> onehot_layer(two_age_categories + [0, 3])       # add 3 to the second feature
<tf.Tensor: shape=(3, 6), dtype=float32, numpy=
array([[0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1.],
       [0., 0., 1., 1., 0., 0.]], dtype=float32)>

코드 보러가기

별도의 출처 표시가 있는 이미지를 제외한 모든 이미지는 강의자료에서 발췌하였음을 밝힙니다.

Twitter Facebook LinkedIn

[Hands-On ML] 13장. 텐서플로를 사용한 데이터 적재와 전처리 - 2

HJ

TFRecord 포맷

프로토콜 버퍼 개요

Example 프로토콜 버퍼

Normalization 층

Discretization 층

CategoryEncoding 층

공유하기

댓글남기기

참고

[Articles] Highly accurate protein structure prediction with AlphaFold

[Articles] A universal SNP and small-indel variant caller using deep neural networks

[Articles] Learning Transferable Visual Models From Natural Language Supervision

[Articles] Mastering the game of Go with deep neural networks and tree search