[NLP] Building a Vocabulary Set

Building a Vocabulary Set (Vocab)

Step 1: Set up the practice environment

Install the package with pip.
!pip install ratsnlp

 

Step 2: Mount Google Drive

from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Step 3: Download and preprocess the corpus

  • Use the Korpora library to download and preprocess the corpus on which BPE will be run.
  • The practice corpus is the Naver Sentiment Movie Corpus (NSMC), released by Eunjeong Park.
  • Download the data and load it into a variable named `nsmc`.
from Korpora import Korpora
nsmc = Korpora.load("nsmc", force_download=True)

 

Save the movie reviews (plain text) contained in NSMC to the specified paths.
import os
def write_lines(path, lines):
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(f'{line}\n')

write_lines("/content/train.txt", nsmc.train.get_all_texts())
write_lines("/content/test.txt", nsmc.test.get_all_texts())

 

Check the first lines of `train.txt` and `test.txt`.
!head train.txt
!head test.txt

 

Step 4: Build the GPT tokenizer

  • GPT-family models use a Byte-level Byte Pair Encoding (BBPE) tokenizer.
  • Create the directory that will hold the vocabulary files at `My Drive/nlpbook/bbpe` in your Google Drive account.
import os
os.makedirs("/gdrive/My Drive/nlpbook/bbpe", exist_ok=True)
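To make "byte-level" concrete, here is a small sketch in plain Python. It does not use the tokenizer library, and the sample string is only an illustration. The point is that text is first viewed as UTF-8 bytes, so the base alphabet never needs more than 256 symbols, no matter how many distinct Hangul syllables the corpus contains.

```python
# Illustration only: view a short Korean string as UTF-8 bytes.
text = "안녕"  # hypothetical sample made of 2 Hangul syllables

# Each Hangul syllable is encoded as 3 bytes in UTF-8.
byte_ids = list(text.encode("utf-8"))
n_chars = len(text)
n_bytes = len(byte_ids)

print(n_chars, n_bytes)  # 2 6
print(byte_ids)

# Every id fits in a single byte, so a byte-level vocabulary
# can start from at most 256 base symbols.
assert all(0 <= b < 256 for b in byte_ids)

# The byte view is lossless: decoding gives back the original text.
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)  # True
```

The actual `ByteLevelBPETokenizer` additionally maps each byte to a printable character, which is why entries for Korean text in `vocab.json` tend to look unreadable.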
Build the BBPE vocabulary from the `nsmc` data (this takes a little while).
from tokenizers import ByteLevelBPETokenizer
bytebpe_tokenizer = ByteLevelBPETokenizer()
bytebpe_tokenizer.train(
    files=["/content/train.txt", "/content/test.txt"],
    vocab_size=10000,
    special_tokens=["[PAD]"]
)
bytebpe_tokenizer.save_model("/gdrive/My Drive/nlpbook/bbpe")
  • When the code finishes, `vocab.json` and `merges.txt` are created in your Google Drive path (`/gdrive/My Drive/nlpbook/bbpe`).
  • The former is the byte-level BPE vocabulary; the latter lists the merge priorities of bigram pairs.
  • The following command prints the contents of `vocab.json`.
!cat /gdrive/My\ Drive/nlpbook/bbpe/vocab.json
  • The following command prints the first lines of `merges.txt`.
!head /gdrive/My\ Drive/nlpbook/bbpe/merges.txt
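The role of the merge priorities can be sketched with a toy example. The merge list below is made up for illustration and is not taken from the trained `merges.txt`; the function name `apply_merges` is likewise hypothetical. Earlier entries have higher priority, and at each step the highest-priority pair present in the sequence is merged.

```python
def apply_merges(symbols, merges):
    """Apply BPE merges to a symbol sequence; earlier merges win."""
    # Rank each pair by its position in the merge list (0 = highest priority).
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is listed in the merge table
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table (an assumption for this example only).
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
tokens = apply_merges("lower", toy_merges)
print(tokens)  # ['low', 'er']
```

The real tokenizer works on byte symbols rather than letters, but the priority logic is the same idea.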

 

Step 5: Build the BERT tokenizer

 BERT uses a WordPiece tokenizer.
  • First, create the directory that will hold the vocabulary file at `My Drive/nlpbook/wordpiece` in your Google Drive account.
import os
os.makedirs("/gdrive/My Drive/nlpbook/wordpiece", exist_ok=True)
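A minimal sketch of the training step, assuming the `BertWordPieceTokenizer` class from the same `tokenizers` package imported in step 4. The wrapper function `train_wordpiece`, the `lowercase=False` setting, and the vocabulary size of 10000 (mirroring step 4) are assumptions for illustration, not settings confirmed by this post.

```python
import os


def train_wordpiece(files, out_dir, vocab_size=10000):
    """Train a WordPiece vocabulary and write vocab.txt into out_dir."""
    # Imported inside the function so it can be defined before the
    # `tokenizers` package is available in the session.
    from tokenizers import BertWordPieceTokenizer

    os.makedirs(out_dir, exist_ok=True)
    # Assumption: keep the original casing of any Latin-script text.
    tokenizer = BertWordPieceTokenizer(lowercase=False)
    tokenizer.train(files=files, vocab_size=vocab_size)
    # save_model writes vocab.txt and returns the list of written paths.
    return tokenizer.save_model(out_dir)


# Example call with the paths used in the steps above:
# train_wordpiece(["/content/train.txt", "/content/test.txt"],
#                 "/gdrive/My Drive/nlpbook/wordpiece")
```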
  • Building the WordPiece vocabulary used by BERT models takes some time to run.
  • When the code finishes, `vocab.txt` is created in your Google Drive path (`/gdrive/My Drive/nlpbook/wordpiece`).
  • The following command prints the first lines of `vocab.txt`.
!head /gdrive/My\ Drive/nlpbook/wordpiece/vocab.txt
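As a rough illustration of how a WordPiece vocabulary is typically applied, the sketch below does greedy longest-match-first splitting of a single word, with `##` marking pieces that continue a word. The toy vocabulary and the function name `wordpiece_tokenize` are invented for this example; the real tokenizer also normalizes and pre-tokenizes text before this step.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (an assumption, not the trained vocab.txt).
toy_vocab = {"un", "##aff", "##able", "[UNK]"}
pieces = wordpiece_tokenize("unaffable", toy_vocab)
print(pieces)  # ['un', '##aff', '##able']
```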