NLP Practice - (2) Training a BERT Model

2023. 2. 2. 19:17ㆍMachine Learning

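The previous post covered how I collected the data for a newspaper (press) classification model, along with preprocessing steps such as tokenization and stopword removal. This post walks through example code for classifying the press with BERT, a pretrained model, and briefly touches on how to check its performance.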

1. Data preprocessing

Before training, the target variable needs to be label-encoded and the data needs to be split into training and evaluation sets.

1.1. Label encoding

I used sklearn's preprocessing module for label encoding.

import pandas as pd
from sklearn import preprocessing

## Load data (already tokenized, stopwords removed)
df = pd.read_csv('train_news.csv')

## Label encoding (target)
label = preprocessing.LabelEncoder()
df['press_num'] = label.fit_transform(df['press'])

This encodes the press name column (press) into a press number column (press_num), a form the model can be trained on.
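To make the mapping concrete, here is a minimal plain-Python sketch of what label encoding does. The press names are hypothetical placeholders, not values from my dataset, and this imitates the behavior rather than reproducing sklearn's implementation: LabelEncoder assigns integer ids in sorted order of the unique values.

```python
# Hypothetical press names, used only for illustration
press = ["Hankyoreh", "Chosun", "Donga", "Chosun", "Hankyoreh"]

# Ids follow the sorted order of the unique values, as LabelEncoder does
classes = sorted(set(press))
to_id = {name: i for i, name in enumerate(classes)}

# Encode names to ids, then decode ids back to names
press_num = [to_id[name] for name in press]
decoded = [classes[i] for i in press_num]

print(classes)    # ['Chosun', 'Donga', 'Hankyoreh']
print(press_num)  # [2, 0, 1, 0, 2]
```

With the fitted encoder from the code above, label.classes_ holds the same sorted names and label.inverse_transform converts predicted numbers back to press names.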

1.2. Splitting the dataset

I split the input variable (X) and the target variable (y) into train and test sets. The stratify option keeps the label proportions similar in both sets.

from sklearn.model_selection import train_test_split

## Split dataset
X = df["morphs_2"].values
y = df["press_num"].values

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)
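The effect of stratify can be illustrated with a small plain-Python sketch. The helper names stratified_split and label_ratios are my own, and the toy labels are made up. It splits each label group separately so that both sets keep roughly the same label shares; it is not how sklearn implements the option internally.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each label group separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    train_idx, test_idx = [], []
    for lab in sorted(by_label):
        idx = by_label[lab]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def label_ratios(labels):
    """Share of each label in a list of labels."""
    counts = Counter(labels)
    return {lab: counts[lab] / len(labels) for lab in sorted(counts)}

# Toy, imbalanced labels: 50% / 30% / 20%
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(labels)
train_ratios = label_ratios([labels[i] for i in train_idx])
test_ratios = label_ratios([labels[i] for i in test_idx])
print(train_ratios)
print(test_ratios)
```

The same check can be run on y_train and y_test after train_test_split to confirm that the two sets have similar label shares.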

2. Model training

NLP work usually starts from a pretrained model that has already been trained on a very large corpus. I used DistilBERT, a variant of BERT. According to [1], DistilBERT is 40% smaller than the original BERT while retaining 97% of its performance and running 60% faster.

2.1. Installing the package and configuring the model

First, install ktrain, a library that makes it easy to run BERT and DistilBERT models [2].

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. Inspired by ML framework extensions like fastai and ludwig.
ktrain is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. With only a few lines of code, ktrain allows you to easily and quickly.

!pip install -q ktrain

Configure the model to train as shown below.
Because the data is Korean, I chose the multilingual DistilBERT model, and since the average text length was about 450, I set maxlen to 512.

import ktrain
from ktrain import text

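## Multilingual DistilBERT checkpoint, so Korean text can be tokenized
MODEL_NAME = "distilbert-base-multilingual-cased"
## class_names takes the original press names in label-encoded order
t = text.Transformer(MODEL_NAME, maxlen=512, class_names=list(label.classes_))

## Tokenize the train and test texts for the transformer
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
## Build the classifier and wrap it in a learner for training
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)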
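The maxlen choice above depends on how long the texts are, so it helps to look at the length distribution first. Below is a rough sketch with a hypothetical helper, length_summary, and made-up sample texts. It counts whitespace-separated tokens, which is only an approximation: the model's subword tokenizer usually produces more tokens than a whitespace split, so the truncated share reported here is a lower bound.

```python
from statistics import mean

def length_summary(texts, maxlen=512):
    """Summarize whitespace token counts and the share of texts over maxlen."""
    lengths = [len(t.split()) for t in texts]
    over = sum(1 for n in lengths if n > maxlen)
    return {
        "mean": mean(lengths),
        "max": max(lengths),
        "share_truncated": over / len(lengths),
    }

# Made-up texts: two short ones and one with 600 tokens
sample_texts = ["a b c", "a " * 600, "a b"]
summary = length_summary(sample_texts)
print(summary)
```

Calling length_summary(x_train) on the real data would show how many articles get cut off at 512 tokens, which is also the longest input the DistilBERT checkpoint accepts.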

2.2. Training and evaluating the model

Before training in earnest, find a suitable learning rate.

learner.lr_find(show_plot=True, max_epochs=5)
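The plot shows loss against learning rate on a log scale, and a common rule of thumb is to pick a rate from the region where the loss is falling fastest, before it starts to blow up. The sketch below applies that heuristic to made-up numbers. The function steepest_descent_lr and the sample values are my own illustration, not output from this model and not ktrain's internal logic.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the geometric midpoint of the interval where loss falls fastest."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        # Slope of loss per unit of log10(learning rate)
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:
            best_lr, best_slope = math.sqrt(lrs[i] * lrs[i - 1]), slope
    return best_lr

# Made-up curve: loss drops, reaches a minimum, then diverges
lrs = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
losses = [2.30, 2.28, 2.10, 1.40, 1.90, 5.00]
suggested = steepest_descent_lr(lrs, losses)
print(suggested)
```

In practice I would still read the value off the plot and treat a number like this as a starting point only.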

Using the learning rate found above (1e-4 here) and a maximum of 10 epochs, train the model.

learner.fit_onecycle(1e-4, 10)
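fit_onecycle trains with a one-cycle style policy, where the learning rate ramps up to the maximum and then comes back down. The following is a simplified triangular sketch of that idea, not ktrain's exact schedule. The function one_cycle_lr and the div_factor of 10 are assumptions chosen for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr=1e-4, div_factor=10.0):
    """Triangular schedule: rise linearly to max_lr, then fall back linearly."""
    min_lr = max_lr / div_factor
    half = total_steps / 2
    if step <= half:
        frac = step / half                   # warm-up phase
    else:
        frac = (total_steps - step) / half   # cool-down phase
    return min_lr + frac * (max_lr - min_lr)

total_steps = 100
schedule = [one_cycle_lr(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], schedule[50], schedule[100])
```

The low rate at the start keeps the early updates to the pretrained weights small, and the peak in the middle is the value passed as the first argument to fit_onecycle.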

Now check the model's performance on the evaluation set. In ktrain, the command below prints the model's results on the validation set configured earlier, in the same format as sklearn's classification_report.

learner.validate(class_names=t.get_classes())
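For readers unfamiliar with that report, the per-class numbers are precision, recall, F1 and support. Here is a small plain-Python sketch of how they are computed. The helper per_class_metrics and the toy labels are made up for illustration and are not results from this model.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for each class."""
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": tp + fn}
    return metrics

# Toy labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_metrics(y_true, y_pred)
for c, m in report.items():
    print(c, m)
```

A class with high precision but low recall, like class 2 in this toy data, is one the model rarely predicts wrongly but often misses.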

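This post covered some simple preprocessing for model training, then showed how to train and evaluate DistilBERT in a few lines of code with ktrain. If you have questions or spot something that needs correcting, please leave a comment :)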

3. References

[1] https://cpm0722.github.io/paper-review/distilbert-a-distilled-version-of-bert-smaller-faster-cheaper-and-lighter

[2] https://github.com/amaiya/ktrain
