[NLP] Tokenization


Step 1: Initialize the Colab Notebook

Install the required package.
!pip install ratsnlp

 

๊ตฌ๊ธ€ ๋“œ๋ผ์ด๋ธŒ ์—ฐ๋™ํ•˜๊ธฐ

  • Mount the Google Drive where you saved the vocabulary built in the tutorial (a quick sanity check is sketched after the mount code).
from google.colab import drive
drive.mount('/gdrive', force_remount=True)
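Before moving on, it may help to confirm that the vocabulary files built in the earlier tutorials are actually on the mounted Drive. The check below is not part of the original tutorial; the file names and paths simply follow the directory layout assumed later in this post.

import os

# Sanity check (illustrative only): verify the vocabulary files are on the mounted Drive.
for path in [
    "/gdrive/My Drive/nlpbook/bbpe/vocab.json",      # GPT BBPE vocabulary
    "/gdrive/My Drive/nlpbook/bbpe/merges.txt",      # GPT BBPE merge rules
    "/gdrive/My Drive/nlpbook/wordpiece/vocab.txt",  # BERT WordPiece vocabulary
]:
    print(path, "exists" if os.path.exists(path) else "MISSING")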

 

Step 2: Building GPT Inputs

  • GPT ๋ชจ๋ธ ์ž…๋ ฅ๊ฐ’์„ ๋งŒ๋“ค๋ ค๋ฉด Byte-level Byte Pair Encoding ์–ดํœ˜์ง‘ํ•ฉ ๊ตฌ์ถ• ๊ฒฐ๊ณผ(`vocab.json`, `merges.txt`)๊ฐ€ ์ž์‹ ์˜ ๊ตฌ๊ธ€ ๋“œ๋ผ์ด๋ธŒ ๊ฒฝ๋กœ(`/gdrive/My Drive/nlpbook/wordpiece`)์— ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ์•„๋ž˜ ์ฝ”๋“œ๋ฅผ ์ˆ˜ํ–‰ํ•ด ์ด๋ฏธ ๋งŒ๋“ค์–ด ๋†“์€ BBPE ์–ดํœ˜์ง‘ํ•ฉ์„ ํฌํ•จํ•œ GPT ํ† ํฌ๋‚˜์ด์ €๋ฅผ `tokenizer_gpt`๋ผ๋Š” ๋ณ€์ˆ˜๋กœ ์„ ์–ธํ•ฉ๋‹ˆ๋‹ค.
from transformers import GPT2Tokenizer
tokenizer_gpt = GPT2Tokenizer.from_pretrained("/gdrive/My Drive/nlpbook/bbpe")
tokenizer_gpt.pad_token = "[PAD]"  # use [PAD] from the vocabulary as the padding token
Let's tokenize three example sentences.
sentences = [
    "์•„ ๋”๋น™.. ์ง„์งœ ์งœ์ฆ๋‚˜๋„ค์š” ๋ชฉ์†Œ๋ฆฌ",
    "ํ ...ํฌ์Šคํ„ฐ๋ณด๊ณ  ์ดˆ๋”ฉ์˜ํ™”์ค„....์˜ค๋ฒ„์—ฐ๊ธฐ์กฐ์ฐจ ๊ฐ€๋ณ์ง€ ์•Š๊ตฌ๋‚˜",
    "๋ณ„๋ฃจ ์˜€๋‹ค..",
]
tokenized_sentences = [tokenizer_gpt.tokenize(sentence) for sentence in sentences]

 

Run the code below to check the tokenization result.
tokenized_sentences

 

์ด๋ฒˆ์—๋Š” Batch_size๊ฐ€ 3์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜๊ณ  ์ด๋ฒˆ ๋ฐฐ์น˜์˜ ์ž…๋ ฅ๊ฐ’์„ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
batch_inputs = tokenizer_gpt(
    sentences,
    padding="max_length", # pad every sentence to the maximum length
    max_length=12, # maximum length in tokens
    truncation=True, # allow sentences to be truncated
)
  • Running this code produces two kinds of inputs.
  • One of them is input_ids; running batch_inputs['input_ids'] in Colab prints the result shown below.
batch_input์˜ ๋‚ด์šฉ์„ ํ•œ๋ฒˆ ํ™•์ธํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
batch_inputs.keys()
dict_keys(['input_ids', 'attention_mask'])
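Before looking at each key in turn, one side note: if you plan to feed the batch straight into a model, Hugging Face tokenizers also accept a return_tensors argument. A minimal sketch, assuming PyTorch is available (it is preinstalled in Colab):

batch_tensors = tokenizer_gpt(
    sentences,
    padding="max_length",
    max_length=12,
    truncation=True,
    return_tensors="pt",  # return PyTorch tensors instead of Python lists
)
print(batch_tensors["input_ids"].shape)       # expected: torch.Size([3, 12])
print(batch_tensors["attention_mask"].shape)  # expected: torch.Size([3, 12])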

 

  • input_ids is the tokenization result with each token replaced by its index.
  • If you open the vocabulary (vocab.json), you can see the tokens listed in order; that position is the index.
  • Converting each token to its index like this is called indexing (a short verification sketch follows the output below).
batch_inputs['input_ids']
[[334, 2338, 263, 581, 4055, 464, 3808, 0, 0, 0, 0, 0], [3693, 336, 2876, 758, 2883, 356, 806, 422, 9875, 875, 2960, 7292], [4957, 451, 3653, 263, 0, 0, 0, 0, 0, 0, 0, 0]]
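To see that input_ids really is just the tokenization result looked up in the vocabulary, you can redo the indexing yourself with the tokenizer's standard conversion helpers. A minimal sketch:

# Redo the indexing for the first sentence by hand and compare with the first
# row of batch_inputs['input_ids'] (ignoring the trailing [PAD] indices).
manual_ids = tokenizer_gpt.convert_tokens_to_ids(tokenized_sentences[0])
print(manual_ids)
print(batch_inputs["input_ids"][0][:len(manual_ids)])

# The reverse direction works too: indices back to tokens.
print(tokenizer_gpt.convert_ids_to_tokens(manual_ids))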

 

  • attention_mask tells the model which positions hold real tokens (1) and which hold padding tokens (0); a sketch recomputing it from input_ids follows the output below.
batch_inputs['attention_mask']
[[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]
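Conceptually, the mask only records whether each position holds a real token or the [PAD] token. The sketch below recomputes it from input_ids and the pad token id; it assumes, as in the tutorial's BBPE vocabulary, that "[PAD]" is part of the vocabulary.

# Rebuild the attention mask from input_ids: 1 for real tokens, 0 for padding.
pad_id = tokenizer_gpt.pad_token_id
recomputed_mask = [
    [0 if token_id == pad_id else 1 for token_id in sequence]
    for sequence in batch_inputs["input_ids"]
]
print(recomputed_mask == batch_inputs["attention_mask"])  # expected: True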

 

Step 3: Building BERT Inputs

  • This time, let's build inputs for the BERT model. Running the BERT tokenizer declaration below initializes the tokenizer that the BERT model uses.
  • Before that, the BERT WordPiece vocabulary (vocab.txt) must be in your Google Drive (/gdrive/My Drive/nlpbook/wordpiece). If you have not created vocab.txt yet, be sure to create it first!
from transformers import BertTokenizer
tokenizer_bert = BertTokenizer.from_pretrained("/gdrive/My Drive/nlpbook/wordpiece", do_lower_case=False)
Now let's tokenize the same three example sentences.
sentences = [
    "์•„ ๋”๋น™.. ์ง„์งœ ์งœ์ฆ๋‚˜๋„ค์š” ๋ชฉ์†Œ๋ฆฌ",
    "ํ ...ํฌ์Šคํ„ฐ๋ณด๊ณ  ์ดˆ๋”ฉ์˜ํ™”์ค„....์˜ค๋ฒ„์—ฐ๊ธฐ์กฐ์ฐจ ๊ฐ€๋ณ์ง€ ์•Š๊ตฌ๋‚˜",
    "๋ณ„๋ฃจ ์˜€๋‹ค..",
]
tokenized_sentences = [tokenizer_bert.tokenize(sentence) for sentence in sentences]
  • If you run the code and look at the result, some of the tokens start with '##'.
  • This prefix marks a token that does not begin an eojeol (a whitespace-delimited word).
  • For example, '##๋„ค์š”' means this token sits in the same eojeol as the preceding token '์งœ์ฆ๋‚˜' and continues it (a sketch that reassembles eojeols from these tokens follows this list).
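Because '##' only marks "continues the previous token within the same eojeol", the tokenization can be undone by stripping the marker. Below is an illustrative helper (join_wordpieces is not a library function, just a sketch); the tokenizer's own tokenizer_bert.convert_tokens_to_string does much the same job.

def join_wordpieces(tokens):
    """Rebuild whitespace-delimited eojeols from WordPiece tokens."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # continuation: glue onto the previous piece
        else:
            words.append(token)      # start of a new eojeol
    return " ".join(words)

for tokens in tokenized_sentences:
    print(join_wordpieces(tokens))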
The code below builds the actual inputs for the BERT model.
batch_inputs = tokenizer_bert(
    sentences,
    padding="max_length",
    max_length=12,
    truncation=True,
)
Running this code produces three kinds of inputs.
batch_inputs.keys()
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
  • The first is input_ids, which, just as with GPT, is the sequence of token indices.
Enter batch_inputs['input_ids'] and print it.
[[2, 621, 2631, 16, 16, 1993, 3678, 1990, 3323, 3, 0, 0], [2, 997, 16, 16, 16, 2609, 2045, 2796, 1981, 1224, 16, 3], [2, 3274, 9508, 16, 16, 3, 0, 0, 0, 0, 0, 0]]
  • Notice that every sentence starts with 2 and ends with 3 (before padding).
  • These are the indices of the special tokens [CLS] and [SEP], respectively.
  • BERT attaches these two tokens to the start and end of every sentence (a sketch verifying this follows this list).
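You can check which tokens the indices 2 and 3 stand for by converting the ids back to tokens; the tokenizer also exposes its special tokens and their ids directly. A short sketch:

# Map the first sentence's ids back to tokens to make the special tokens visible.
print(tokenizer_bert.convert_ids_to_tokens(batch_inputs["input_ids"][0]))

# The special tokens and their indices can also be read off the tokenizer.
print(tokenizer_bert.cls_token, tokenizer_bert.cls_token_id)
print(tokenizer_bert.sep_token, tokenizer_bert.sep_token_id)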

 

An attention_mask is also produced.
  • As with GPT, BERT's attention_mask distinguishes positions holding real tokens (1) from positions holding padding tokens (0).
  • Run the code to check it.
batch_inputs['attention_mask']
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]]

 

Finally, an input called token_type_ids is also produced. It corresponds to the segment.
  • In this example the segment value is 0.
  • Providing segment information is a characteristic of the BERT model.
  • BERT by default takes two documents (or sentences) as input and distinguishes them using token_type_ids.
  • token_type_ids is 0 for the first segment (document or sentence) and 1 for the second.
batch_inputs['token_type_ids']
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
  • Since we fed the sentences in one at a time here, all token_type_ids are 0 (a sentence-pair sketch follows below).
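To see token_type_ids actually take the value 1, you can pass a pair of texts to the tokenizer; the second text becomes the second segment. A minimal sketch reusing two of the example sentences (max_length=16 is an arbitrary choice, just large enough for the pair):

# With a text pair, positions belonging to the second segment get token_type_ids of 1.
pair_inputs = tokenizer_bert(
    sentences[0],   # first segment
    sentences[2],   # second segment
    padding="max_length",
    max_length=16,
    truncation=True,
)
print(pair_inputs["token_type_ids"])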