[Words] Word Tokenization - Morphemes

Word-based tokenization - tokens carry the meaning of the words people use

  • Requires a large dictionary. A word that is missing from the dictionary cannot be processed → fixing this requires an enormous dictionary.
  • Unseen or rare words are handled poorly.
  • Solution → subword tokenization

Subword tokenization

  • A vocabulary is usually the set of words that appear frequently in the corpus, so low-frequency words may be missing from it.
  • Splits text more finely than words. The units are neither whole words nor single characters; the cut falls somewhere in between.
  • Motivated by the wish to split low-frequency words as far as possible
    • Words never seen before, uncommon words
  • Traditional NLP works with a fixed vocabulary → every token outside it is collapsed to UNK (unknown)
Example)
strawberries
- OOV in sentence: The basket was filled with strawberries
- The / basket / was / filled / with / UNK (in place of "strawberries")
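
The UNK behavior above can be sketched in a few lines. This is only an illustration: the toy vocabulary and the helper name `word_tokenize` are assumptions made for this example, not part of any library.

```python
UNK = "UNK"

def word_tokenize(sentence, vocab):
    """Split on whitespace and replace out-of-vocabulary words with UNK."""
    return [word if word.lower() in vocab else UNK for word in sentence.split()]

# Hypothetical toy vocabulary; "strawberries" is deliberately left out.
vocab = {"the", "basket", "was", "filled", "with"}

tokens = word_tokenize("The basket was filled with strawberries", vocab)
print(" / ".join(tokens))  # The / basket / was / filled / with / UNK
```

Every word outside the fixed vocabulary collapses to the same UNK token, so whatever the word meant is lost.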

Example)
"subword" -> "sub" & "word" (each piece gets its own vector)

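A minimal sketch of the "sub" & "word" split, assuming a greedy longest-match rule. That rule is a simplification chosen for this example; the notes do not say which rule is used. The vocabulary, the helper name `greedy_subword_split`, and the vector values are all made up.

```python
def greedy_subword_split(word, vocab):
    """Repeatedly take the longest vocabulary entry that prefixes the rest of the word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known subword (or nothing is left).
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # not even a single character matched
            return ["UNK"]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary with placeholder 2-dimensional vectors.
subword_vocab = {"sub", "word", "straw", "berries"}
embeddings = {piece: [float(i), float(i) + 0.5]
              for i, piece in enumerate(sorted(subword_vocab))}

pieces = greedy_subword_split("subword", subword_vocab)
vectors = [embeddings[p] for p in pieces]  # one vector per piece
print(pieces)  # ['sub', 'word']
```

With this toy vocabulary, "strawberries" also splits into "straw" and "berries" instead of collapsing to UNK.
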
Character-based tokenization - a radical, extreme approach (split everything into characters)

  • Give up on tokenization entirely and split everything into individual characters
    • in English: every uppercase letter, lowercase letter, digit, and some punctuation
Example)
T / h / e / _ / b / a / s / k / e / t / _ / w / a / s / ...
  • Almost no dictionary is needed → but this creates drawbacks of its own (from a linguistic point of view)
    • Splitting into individual characters makes sequences longer and increases computation time
  • May be acceptable for Chinese (individual characters carry meaning), but works poorly for English and Korean
  • Concatenated words (compounds) can be handled as characters.
Example)
- Individual tokens carry no meaning on their own: "d" and "o" -> "dog" & "dollar"
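
To make the sequence-length cost concrete, here is a small sketch that tokenizes the same sentence by characters and by whitespace-separated words. The helper name `char_tokenize` and the use of "_" to display spaces are conventions chosen for this example.

```python
def char_tokenize(text):
    """Split text into single characters, showing spaces as '_'."""
    return ["_" if ch == " " else ch for ch in text]

sentence = "The basket was filled with strawberries"
char_tokens = char_tokenize(sentence)
word_tokens = sentence.split()

print(" / ".join(char_tokens[:14]))  # T / h / e / _ / b / a / s / k / e / t / _ / w / a / s
print(len(word_tokens), len(char_tokens))  # 6 39
```

The character sequence is several times longer than the word sequence for the same sentence, which is where the extra computation time comes from.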

Common algorithms for subword tokenization

  • Byte-Pair Encoding (BPE)
  • Unigram Language Modeling Tokenization
  • WordPiece Model
  • SentencePiece Model
  • Each consists of two parts.
    • A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens)
    • A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary
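
The learner/segmenter split can be sketched with a toy version of BPE. This is a simplified illustration under stated assumptions, not the implementation of any particular library: the function names `learn_bpe`, `merge_pair`, and `segment`, the "_" end-of-word marker, and the toy corpus are all choices made for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(corpus, num_merges):
    """Token learner: induce an ordered list of merges from a raw corpus."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    words = Counter(tuple(w) + ("_",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges

def segment(sentence, merges):
    """Token segmenter: apply the learned merges, in order, to a raw sentence."""
    tokens = []
    for word in sentence.split():
        symbols = tuple(word) + ("_",)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens

corpus = ("low low low low low lower lower newest newest newest "
          "newest newest newest widest widest widest")
merges = learn_bpe(corpus, num_merges=8)
print(merges)
print(segment("lowest", merges))  # "lowest" never appears in the corpus
```

The learner only ever sees the training corpus, and the segmenter only replays the learned merges, so a word absent from training is still split into known pieces rather than mapped to UNK.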