[NLP] Word Embedding

1. What is Word Embedding?

What is word embedding? It is a method for converting text data into numerical vectors.
  • Put differently, it converts the words in a text into a vector form that a computer can understand; that is, each word is mapped from a high-dimensional space (such as a one-hot encoding over the vocabulary) to a low-dimensional dense vector.
  • A vector produced by word embedding can numerically express a word's meaning, context, similarity to other words, and so on.
  • Word embedding is broadly carried out by one of two kinds of methods.

2. Word Embedding Methods

  • Word Embedding์˜ ๋ฐฉ๋ฒ•์€ ํฌ๊ฒŒ ๋ณด๋ฉด 2๊ฐ€์ง€์˜ ๋ฐฉ๋ฒ•์œผ๋กœ ์ด๋ฃจ์–ด ์ง„๋‹ค๊ณ  ํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ํ•˜๋‚˜๋Š” Count๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•, ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ์˜ˆ์ธก ๊ธฐ๋ฐ˜์˜ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์šฐ์„  ์นด์šดํŠธ ๊ธฐ๋ฐ˜์˜ ๋ฐฉ๋ฒ•๋ถ€ํ„ฐ ์„ค๋ช…ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

2-1. Count-Based Methods

Count๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•์€ ๋‹จ์–ด(Word)์˜ ๋ฌธ๋งฅ(Context) ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์—ฌ, ๋‹จ์–ด๋ฅผ Vector๋กœ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  • ํŠน์ • ๋‹จ์–ด์˜ ์ฃผ๋ณ€ ๋‹จ์–ด๋“ค์˜ ๋นˆ๋„๋ฅผ ์นด์šดํŠธํ•ด์„œ Vector๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.
  • ๋Œ€ํ‘œ์ ์œผ๋กœ TF-IDF, Co-occurence Matrix๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical method for evaluating the importance of a word in text data (that is, for computing a weight for it). It is usually used to express how important a particular word is within a document.
  • TF-IDF is computed from two parts: TF and IDF.
  • 'TF' (term frequency) is the number of times a particular word appears within a document.
  • TF can use the raw occurrence count as-is, or normalize it into a relative frequency.
  • Computing TF(t, d): divide the number of times the word t appears in document d by the total number of words in d.

TF(t, d) = (number of occurrences of t in d) / (total number of words in d)

  • 'IDF' (inverse document frequency) is a value based on the proportion of documents in the whole collection that contain a particular word.
  • Put simply, it expresses how rare the word is across the entire document collection.
  • A word that appears commonly gets a small weight, while a word that appears rarely gets a large weight.
  • Computing IDF(t, D): on a log scale, divide the total number of documents by the number of documents that contain the word t.

IDF(t, D) = log(total number of documents in D / number of documents containing t)

  • Computing TF-IDF(t, d, D): multiply TF and IDF.
  • TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
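As a sanity check on the three formulas above, here is a minimal from-scratch sketch (the toy English corpus is an assumption for illustration; note that, unlike sklearn's TfidfVectorizer, this uses the plain log IDF with no smoothing):

```python
import math

def tf(term, doc):
    # TF(t, d): occurrences of t in d divided by the total number of words in d
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF(t, D): log of (number of documents / number of documents containing t)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    # TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
    return tf(term, doc) * idf(term, docs)

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
print(tfidf("the", docs[0], docs))            # 0.0 - appears in every document
print(round(tfidf("dog", docs[1], docs), 4))  # 0.3662 - rare, so weighted up
```

Because "the" appears in every document, its IDF is log(3/3) = 0 and its TF-IDF vanishes, while the rarer "dog" receives a positive weight.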

TF-IDF์˜ ๊ตฌ์„ฑ

  • TF-IDF์—์„œ ์ค‘์š”ํ•˜๊ฒŒ ๋ด์•ผ ํ•˜๋Š”๊ฑด TF-IDF๊ฐ’์ด ํด์ˆ˜๋ก ์–ด๋–ค ๋‹จ์–ด๊ฐ€ ํ•ด๋‹น ๋ฌธ์„œ์— ๋” ์ค‘์š”ํ•˜๋‹ค๋Š”๊ฒƒ์„ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜์„œ ํŠน์„ฑ ๋ฌธ์„œ์—์„œ ๋‚˜์ฃผ ๋‚˜ํƒ€๋‚˜๋Š” ๋‹จ์–ด์—๋Š” ๋†’์€ Weight(๊ฐ€์ค‘์น˜)๊ฐ€ ๋ถ€์—ฌ๋˜์ง€๋งŒ, ์ „์ฒด ๋ฌธ์„œ ์ง‘ํ•ฉ์—์„œ ์ž์ฃผ ๋‚˜ํƒ€๋‚˜๋Š” ๋‹จ์–ด๋Š” ๋‚ฎ์€ ๊ฐ€์ค‘์น˜๋ฅผ ๊ฐ–๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  TF-IDF๋Š” ์ •๋ณด ๊ฒ€์ƒ‰์„ ํ•˜๋ฉด์„œ ๊ทธ ๋ฌธ์„œ์˜ ๊ด€๋ จ์„ฑ์„ ํŒ๋‹จํ•˜๋Š” ๊ฒฝ์šฐ์— ์‚ฌ์šฉ๋˜๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

Example Code - TF-IDF

  • Python์˜ sklearn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•ด์„œ TF-IDF๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ํ•œ๋ฒˆ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ์ด ์ฝ”๋„๋Š” ๋ฌธ์„œ list๋ฅผ input์œผ๋กœ ๋ฐ›์•„์„œ ๊ฐ ๋ฌธ์„œ์˜ TF-IDF Vector๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
  • ์ด ์ฝ”๋“œ๋Š” ํ•œ๊ตญ์–ด Text์— ๋Œ€ํ•œ TF-IDF ๊ณ„์‚ฐ์„ ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ๋ถˆ์šฉ์–ด ์ œ๊ฑฐ, ํ† ํฐํ™”, ์–ด๊ฐ„ ์ถ”์ถœ๋“ฑ์˜ ์ „์ฒ˜๋ฆฌ ๊ณผ์ • ๋ฐ ํ•œ๊ตญ์–ด Text ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๊ธฐ ์œ„ํ•ด์„œ KoNLPy ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ Okt ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ๋ฅผ ํฌํ•จํ•˜์˜€์Šต๋‹ˆ๋‹ค.
from sklearn.feature_extraction.text import TfidfVectorizer
from konlpy.tag import Okt
import re

# Example document list
documents = [
    '์ด๊ฒƒ์€ ์ฒซ ๋ฒˆ์งธ ๋ฌธ์„œ์ž…๋‹ˆ๋‹ค.', # put your documents here
    '์ด๊ฒƒ์€ ๋‘ ๋ฒˆ์งธ ๋ฌธ์„œ์ž…๋‹ˆ๋‹ค.', # put your documents here
    '์ด๊ฒƒ์€ ์„ธ ๋ฒˆ์งธ ๋ฌธ์„œ์ž…๋‹ˆ๋‹ค.', # put your documents here
]

# Korean stopword list - only an example; add your own entries as needed
stopwords = ['์ด๊ฒƒ', '์ž…๋‹ˆ๋‹ค', '๋ฌธ์„œ', '๋ฒˆ์งธ']

# Create an Okt morphological analyzer instance
okt = Okt()

# Document preprocessing function
def preprocessing(document):
    # Remove special characters
    document = re.sub('[^๊ฐ€-ํžฃใ„ฑ-ใ…Žใ…-ใ…ฃa-zA-Z]', ' ', document)
    # Tokenize and stem with the morphological analyzer
    tokens = okt.morphs(document, stem=True)
    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords]
    return tokens

# Create the TF-IDF vectorizer (token_pattern is unused when a tokenizer is given)
vectorizer = TfidfVectorizer(tokenizer=preprocessing, token_pattern=None)

# Fit the vectorizer on the documents and build the TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(documents)

# Print each word's IDF value
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print('IDF per word: ', dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))

# Print the TF-IDF vectors
print('TF-IDF vectors: ', tfidf_matrix.toarray())

๋‹จ์–ด ๋™์‹œ ์ถœํ˜„ ํ–‰๋ ฌ (Co-occurence Matrix)

๋‹จ์–ด ๋™์‹œ ์ถœํ˜„ ํ–‰๋ ฌ (Co-occurence Matrix)์€ ๋‹จ์–ด๊ฐ„์— ๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋˜๋Š” ํ‘œํ˜„ ๋ฐฉ๋ฒ•์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.
  • ์ด ํ–‰๋ ฌ์€ ์ฃผ์–ด์ง„ ๋ฌธ์„œ or ๋ง๋ญ‰์น˜(Corpus)์—์„œ ๋‹จ์–ด ์Œ์ด ํ•จ๊ป˜ ๋“ฑ์žฅํ•œ ํšŸ์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ์ž…๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ "๋™์‹œ์ถœํ˜„" ์ด๋ผ๋Š” ๋ง์€, ๋‘ ๋‹จ์–ด๊ฐ€ ์ฃผ์–ด์ง„ ๋งฅ๋žต(Context)์—์„œ ํ•จ๊ป˜ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
  • ํ•œ๋ฒˆ ์˜ˆ๋ฅผ ๋“ค์–ด์„œ ๋ณด๋ฉด, "๋ฌด๊ธฐ" ์ด๋ž‘ "์ „์Ÿ"์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ๋งŽ์ด ๋‚˜ํƒ€๋‚œ๋‹ค๋ฉด, ์ด ๋‘ ๋‹จ์–ด๊ฐ„์—๋Š” ์˜๋ฏธ์ ์ธ ๊ด€๊ณ„๊ฐ€ ์žˆ์„์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ํ•œ๋ฒˆ ๋‹จ์–ด ๋™์‹œ ์ถœํ˜„ ํ–‰๋ ฌ (Co-occurence Matrix)๋ฅผ ๋งŒ๋“ค๋ ค๋ฉด ์ด๋Ÿฌํ•œ ๋‹จ๊ณ„๋กœ ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค.
  1. Corpus(๋ง๋ญ‰์น˜, ๋‹จ์–ด์ง‘ํ•ฉ) ๊ตฌ์ถ•: ๋ถ„์„ํ•˜๊ณ ์ž ํ•˜๋Š” ๋Œ€์ƒ ๋ฌธ์„œ๋“ค๋กœ ์ด๋ฃจ์–ด์ง„ Corpus๋ฅผ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค.
  2. Window ๊ตฌ์ถ•: ๋ฌธ์„œ๋ฅผ Tokenํ™”ํ›„, ๋‹จ์–ด์˜ Window๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. Window๋Š” ๋‹จ์–ด๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ •์˜ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋˜๋ฉฐ, ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ๋ช‡๊ฐœ๊นŒ์ง€ ํฌํ•จ์‹œํ‚ฌ์ง€๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ Word2Vec์— ๊ด€๋ จํ•ด์„œ ์“ด ๊ธ€์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.
  3. ๋‹จ์–ด ๋™์‹œ ์ถœํ˜„ ํ–‰๋ ฌ (Co-occurence Matrix): ๊ฐ Window์—์„œ ๋“ฑ์žฅํ•œ ๋‹จ์–ด์Œ์˜ ๋นˆ๋„๋ฅผ ํ–‰๋ ฌ์— ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค. ํ–‰๋ ฌ์˜ ํ–‰ & ์—ด์€ ๋‹จ์–ด๋ฅผ ์˜๋ฏธํ•˜๋ฉฐ, ๊ฐ ์…€์€ ํ•ด๋‹น ๋‹จ์–ด๋“ค์˜ ๋™์‹œ ์ถœํ˜„ ํšŸ์ˆ˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

Co-occurence Matrix (๋™์‹œ ๋ฐœ์ƒ ํ–‰๋ ฌ) ์˜ˆ์‹œ.

  • "๋ฌด๊ธฐ" ๋ž‘ "์ „์Ÿ"์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ์ฃผ์–ด์ง„ Window์—์„œ ๋“ฑ์žฅํ•˜๋ฉด, ํ•ด๋‹น ํ–‰๋ ฌ์˜ "๋ฌด๊ธฐ" ๋ž‘ "์ „์Ÿ"์—ด์— ํ•ด๋‹นํ•˜๋Š” ์…€์˜ ๊ฐ’์„ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ๋ฐฉ์‹์œผ๋กœ ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค.
  • ๋‹จ์–ด ๋™์‹œ ์ถœํ˜„ ํ–‰๋ ฌ (Co-occurence Matrix)๋Š” ๋Œ€์นญ์„ฑ์„ ๊ฐ€์ง€๋ฉฐ, ํ–‰๋ ฌ์˜ ๊ฐ ์›์†Œ๋Š” ํ•ด๋‹น ๋‹จ์–ด ์Œ์˜ ๋™์‹œ ์ถœํ˜„ ๋นˆ๋„๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
    • ๊ทธ๋ฆฌ๊ณ , ์ด ํ–‰๋ ฌ์€ ์ฃผ๋กœ ์ž ์žฌ ์˜๋ฏธ ๋ถ„์„(Latent Semantic Analysis, LSA) ์— ํ™œ์šฉ๋˜์–ด์„œ ๋‹จ์–ด๊ฐ„์˜ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ํŒŒ์•…ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

2-2. Prediction-Based Methods

์˜ˆ์ธก ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•์€ ํŠน์ • ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋‹จ์–ด๋ฅผ Vector๋ฅผ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  • ์ฃผ์–ด์ง„ Context(๋ฌธ๋งฅ) ์—์„œ ํŠน์ • ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ฑฐ๋‚˜, ํŠน์ • ๋‹จ์–ด๋ฅผ ๊ฐ€์ง€๊ณ  ์ฃผ๋ณ€ Context(๋ฌธ๋งฅ)์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.
  • ๋Œ€ํ‘œ์ ์œผ๋กœ Word2Vec, Glove, FastText๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. Word2Vec์— ๋Œ€ํ•ด์„œ๋Š” ์ž์„ธํ•˜๊ฒŒ ์„ค๋ช…ํ•œ ๊ธ€์ด ์žˆ์œผ๋‹ˆ ์ด๋ฒˆ์—๋Š” ํŒจ์Šคํ•˜๊ณ  Glove, FastText ๋‘ ๋ฐฉ๋ฒ•์— ๋ฐํ•˜์—ฌ ์„ค๋ช…ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.
 



GloVe (Global Vectors for Word Representation)

GloVe is one of the algorithms used to learn word embeddings.
  • GloVe is an algorithm that uses both the count-based and the prediction-based approach.
  • It was designed to capture global semantic relationships between words, and it works differently from Word2Vec and other embedding techniques.
  • GloVe represents words as vectors based on statistics over the entire corpus.
  • Its main feature is that it models the relationships between words using the co-occurrence matrix.
    • This matrix records how often words appear together across the whole corpus.
  • Modeling non-linear relationships: GloVe captures linear as well as non-linear relationships between word vectors, which helps it reflect semantic similarity between words well.
  • Objective function for embedding training (loss function): GloVe defines an objective (loss) function for learning the embeddings.
    • The objective drives the dot product of two word vectors toward the log of the words' co-occurrence probability. That is, the dot product of two word vectors is trained to be proportional to the probability of the two words appearing together.
    • Before looking at the function itself, let's go over the notation.

Source: https://wikidocs.net/22885

Notation: X is the co-occurrence matrix; X_ik is the number of times word k appears inside the window of word i; X_i = ฮฃ_j X_ij; P_ik = P(k | i) = X_ik / X_i is the probability that word k appears in the context of word i; w_i is the embedding vector of center word i; w~_k is the embedding vector of context word k.

GloVe: Concept and Equations

  • GloVe trains the dot product of the embedded center word vector and context word vector to match the (log of the) co-occurrence probability over the whole corpus.
  • Expressed as a formula: w_i^T w~_k ≈ log P(k | i)
  • This relation cannot be used as an objective (loss) function as-is, so the formula must be reworked into one that reflects the desired properties of the embedded vectors.
  • First of all, it must express relationships between words well: we need an expression for the probability that some other word appears, given that a particular word has appeared.

P_ik / P_jk: the ratio of the probability that word k appears in the context of word i to the probability that it appears in the context of word j.

  • GloVe starts the derivation from an initial equation: applying some function F to the vectors w_i, w_j, and w~_k should yield P_ik / P_jk.

F(w_i, w_j, w~_k) = P_ik / P_jk

  • ์ผ๋‹จ F๋ผ๋Š” ํ•จ์ˆ˜๊ฐ€ ์–ด๋– ํ•œ ์‹์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”์ง€ ์•Œ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์ผ๋‹จ F์•ˆ์— ์ง‘์–ด๋„ฃ์„ wi, wj, wk์˜ ๊ด€๊ณ„๋ฅผ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•ด์„œ wi, wj๋ฅผ ๋บ€ ๋ฒกํ„ฐ๋ฅผ wk๋ฅผ ๋‚ด์ ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด์œ ๋Š” ํ•จ์ˆ˜ F๋Š” ๋‘ ๋‹จ์–ด ์‚ฌ์ด์˜ ๋™์‹œ ๋“ฑ์žฅ ํ™•๋ฅ ์˜ ํฌ๊ธฐ ๊ด€๊ณ„์˜ ratio(๋น„) ์ •๋ณด๋ฅผ Vector ๊ณต๊ฐ„์— Encoding ํ•˜๋Š”๊ฒƒ์ด Glove ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ชฉ์ ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜์„œ wi, wj ๋‘ Vector์˜ ์ฐจ์ด๋ฅผ ํ•จ์ˆ˜ F์˜ input์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ทผ๋ฐ ์šฐ๋ณ€์€ ์Šค์นผ๋ผ ๊ฐ’์ด๊ณ  ์ขŒ๋ณ€์€ ๋ฒกํ„ฐ ๊ฐ’์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ์„ฑ๋ฆฝํ•˜๊ฒŒ ํ•ด์ฃผ๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ•จ์ˆ˜ F์˜ ๋‘ ์ž…๋ ฅ์˜ ๋‚ด์ (Dot Product)๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

F((w_i − w_j)^T w~_k) = P_ik / P_jk

  • ๊ทผ๋ฐ, ์ด๋•Œ ํ•จ์ˆ˜ F๊ฐ€ ๋งŒ์กฑํ•ด์•ผ ํ•  ์กฐ๊ฑด์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ค‘์‹ฌ๋‹จ์–ด w, ์ฃผ๋ฒˆ๋‹จ์–ด ~w์˜ ์„ ํƒ ๊ธฐ์ค€์€ ๋ฌด์ž‘์œ„ ์„ ํƒ์ด๋ฏ€๋กœ, ์ด ๋‘˜์˜ ๊ด€๊ณ„๋Š” ์ž์œ ๋กญ๊ฒŒ ๊ตํ™˜์ด ๋˜๋„๋ก ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๊ฒŒ ์„ฑ๋ฆฝ์ด ๋˜๊ฒŒ ํ•˜๋ ค๋ฉด ํ•จ์ˆ˜ F๊ฐ€ ์‹ค์ˆ˜์˜ ๋ง์…ˆ & ์–‘์ˆ˜์˜ ๊ณฑ์…ˆ์— ๋Œ€ํ•ด์„œ ์ค€๋™ํ˜•(Homomorphism)์„ ๋งŒ์กฑํ•˜๋„๋ก ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.

The Condition the Function F Must Satisfy

 

 ์ค€๋™ํ˜•(Homomorphism)์— ๋ฐํ•˜์—ฌ ๊ฐ„๋‹จํžˆ ์„ค๋ช…ํ•ด๋ณด๋ฉด a์™€ b์— ๋Œ€ํ•ด์„œ ํ•จ์ˆ˜ F๊ฐ€ F(a+b)๊ฐ€ F(a)F(b)๊ฐ€ ๊ฐ™๋„๋ก ๋งŒ์กฑ์‹œ์ผœ์•ผ ํ•œ๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค. ์‹์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด -> F(a+b) = F(a)F(b) ์ž…๋‹ˆ๋‹ค.
  • Bringing in the corresponding equation, we get the following.

F(a + b) = F(a)F(b)    (with subtraction: F(a − b) = F(a) / F(b))

  • The current equation, a homomorphism over addition, is thus changed into one over subtraction, which turns the multiplication on the right-hand side into division.
  • Since the right-hand side of this homomorphism equation is P_ik / P_jk, it arranges into the equation below.

F(w_i^T w~_k − w_j^T w~_k) = F(w_i^T w~_k) / F(w_j^T w~_k) = P_ik / P_jk

  • ์ด ์ค€๋™ํ˜•(Homomorphism)์‹ ์›๋ž˜ ์‹์„ ์ขŒ๋ณ€์œผ๋กœ ํ’€์–ด์„œ ์“ฐ๋ฉด ์›๋ž˜์˜ ์‹์œผ๋กœ ์ •๋ฆฌ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

F((w_i − w_j)^T w~_k) = F(w_i^T w~_k) / F(w_j^T w~_k)

  • This arranged equation exactly matches the form of the homomorphism over subtraction.
  • Now we must find a function F that satisfies it. A function that does is the exponential function. Substituting exp for F in the equation gives the following.

exp((w_i − w_j)^T w~_k) = exp(w_i^T w~_k) / exp(w_j^T w~_k),    exp(w_i^T w~_k) = P_ik = X_ik / X_i

  • ์œ„์˜ ๋‘๋ฒˆ์งธ ์‹์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‹์„ ์–ป์„์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

w_i^T w~_k = log P_ik = log (X_ik / X_i) = log X_ik − log X_i

  • ๊ทผ๋ฐ, ์šฐ๋ฆฌ๊ฐ€ ๋ด์•ผํ•˜๋Š” ์ค‘์š”ํ•œ ์‚ฌ์‹ค์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ค‘์‹ฌ๋‹จ์–ด w, ์ฃผ๋ฒˆ๋‹จ์–ด ~w๋Š” ๋‘๊ฐ’์˜ ์œ„์น˜๋ฅผ ๋ด๊พธ์–ด๋„ ์‹์ด ์„ฑ๋ฆฝํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 
  • ์ด๋ง์€ ๋‹จ์–ด๊ฐ„์˜ ๊ต์ฒด๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ๋ง์ธ๋ฐ, ๊ทธ๋Ÿด๋ ค๋ฉด ์œ„์˜ ์‹์—์„œ log Xi ํ•ญ์ด ๊ฑธ๋ฆผ๋Œ ์ž…๋‹ˆ๋‹ค.
  • ์ด๋ถ€๋ถ„๋งŒ ์—†๋‹ค๋ฉด ์ด ์ˆ˜์‹์„ ์„ฑ๋ฆฝ ์‹œํ‚ฌ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ์˜ ํ•ด๊ฒฐ์ฑ…์€ log Xiํ•ญ์„ wi์— ๋Œ€ํ•œ ํŽธํ–ฅ (bi,bk) ์ƒ์ˆ˜ํ•ญ์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฐ™์€ ์ด์œ ๋กœ ์ฃผ๋ณ€๋‹จ์–ด ~w์— ๋Œ€ํ•œ ํŽธํ–ฅ ~b๋„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ํŽธํ–ฅ(bias)์ฒ˜๋Ÿผ ์ƒ์ˆ˜ํ•ญ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๋‘˜์„ ๋™์ผํ•œ ์ˆ˜์‹์œผ๋กœ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Word(๋‹จ์–ด)๊ฐ€ ๋‹ฌ๋ผ์ง€๊ฒŒ ๋˜๋ฉด Bias(ํŽธํ–ฅ)๋„ i, k์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๋Š” ์ƒ์ˆ˜ํ•ญ์ด๋ผ๊ณ  ์•Œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.

w_i^T w~_k + b_i + b~_k = log X_ik

  • ์œ„์˜ 2๋ฒˆ์งธ ์‹์ด ๋ชฉ์  ํ•จ์ˆ˜(์†์‹ค ํ•จ์ˆ˜)์˜ ํ•ต์‹ฌ์ด ๋˜๋Š” ์‹์ž…๋‹ˆ๋‹ค.
  • ์‹์—์„œ ํ•™์Šต๋˜์–ด์•ผ ํ•˜๋Š” Embedding๋œ ๋‹จ์–ด๋“ค์ด ์ขŒ๋ณ€์ชฝ ์œผ๋กœ ๋ชฐ๋ ค์žˆ๊ณ , ์šฐ๋ณ€์—๋Š” log(Xik)๋ฅผ ํ†ตํ•ด Window ์‚ฌ์ด์ฆˆ๋ฅผ ๋‘๊ณ  Corpus(๋ง๋ญ‰์น˜) ์ „์ฒด์—์„œ ๋‹จ์–ด๋ณ„ ๋“ฑ์žฅ ๋นˆ๋„๋ฅผ ๊ตฌํ•œ ๋™์‹œ ์ถœํ˜„ ํ–‰๋ ฌ (Co-occurence Matrix)์—์„œ ๋กœ๊ทธ๋ฅผ ์ทจํ•ด์ค€ ํ–‰๋ ฌ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ขŒ๋ณ€์˜ 4๊ฐœ ํ•ญ์€ Training์„ ํ†ตํ•ด์„œ ๊ฐ’์ด ๋ด๋€Œ๋Š” ๋ณ€์ˆ˜๊ฐ€ ์žˆ๊ณ , ์šฐ๋ณ€์˜ ๊ฐ’์€ ์ขŒ๋ณ€์˜ ๊ฐ’๊ณผ์˜ ์ฐจ์ด๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค.
  • ์ด ์‹์„ ๊ตฌํ•˜๋ ค๊ณ  ํ•˜๋Š” ๋ชฉ์ ํ•จ์ˆ˜ J๋กœ ๋‚˜ํƒ€๋‚˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค, ๊ทธ๋ฆฌ๊ณ  ๋ชฉ์ ํ•จ์ˆ˜J ์—์„œ V๋Š” ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ํฌ๊ธฐ๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

J = ฮฃ_{i,j=1..V} (w_i^T w~_j + b_i + b~_j − log X_ij)^2

  • ๋ณ€ํ™˜์„ ํ•ด์ฃผ๋Š” ์ด์œ ๋Š” ๋†’์€ ๋‹จ์–ด ์Œ์ด ๋“ฑ์žฅํ•ด์„œ Embedding ๊ฒฐ๊ณผ๊ณผ ์™ธ๊ณก๋˜์ง€ ์•Š๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•œ ๋ชฉ์ ์ž…๋‹ˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด์„œ "I", "is" ๋ผ๋Š” ๋‹จ์–ด๋“ค์€ ์˜๋ฏธ๊ฐ€ ํฌ์ง€๋Š” ์•Š์ง€๋งŒ ์ผ๋ฐ˜์ ์œผ๋กœ ์˜๋ฌธ ๋ฌธ์žฅ์—์„œ ๋งŽ์ด ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด ์ด๋ฏ€๋กœ, ์ด ๋‹จ์–ด๋“ค์˜ ๋“ฑ์žฅ ๋นˆ๋„์— ๋”ฐ๋ผ์„œ Embedding ๊ฒฐ๊ณผ๊ณผ ์™ธ๊ณก๋ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— Weight(๊ฐ€์ค‘์น˜)ํ•จ์ˆ˜ f(x)๋ฅผ ๋ชฉ์  ํ•จ์ˆ˜(์†์‹ค ํ•จ์ˆ˜)์— ์ˆ˜์‹์„ ์ถ”๊ฐ€ํ•ด์„œ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

(Graph of the weighting function f(x): the weight increases with the co-occurrence count x and is capped at 1.)

  • ๋ชฉ์  ํ•จ์ˆ˜(์†์‹ค ํ•จ์ˆ˜)์•  ์‚ฌ์šฉํ•˜๋Š” Weight(๊ฐ€์ค‘์น˜)ํ•จ์ˆ˜๋Š” ๋™์‹œ ์ถœํ˜„ ๋นˆ๋„๊ฐ€ ๋†’์€ ๋‹จ์–ด ์Œ์— ๋‚ฎ์€ Weight(๊ฐ€์ค‘์น˜)๋ฅผ ๋ถ€์—ฌํ•˜๊ณ , ๋ฐ˜๋Œ€๋กœ ์ถœํ˜„๋นˆ๋„๊ฐ€ ๋†’์€ ๋‹จ์–ด ์Œ์—๋Š” ๋†’์€ Weight(๊ฐ€์ค‘์น˜)๋ฅผ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค.
  • ์ด์œ ๋Š” ๊ณ ๋ฐ€๋„ ๋‹จ์–ด์Œ์ด Model์„ ์ง€๋ฐฐํ•˜๋Š”๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด ๋ถˆ์šฉ์–ด('the', 'is', 'are')๋Š” ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚˜์ง€๋งŒ, ๋ถˆ์šฉ์–ด๋“ค๋ผ๋ฆฌ ์„œ๋กœ ์˜๋ฏธ์ ์œผ๋กœ ๋ณด๋ฉด ๊ฐ€๊น๋‹ค๋Š”๊ฑธ ์˜๋ฏธํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜์„œ ๋ถˆ์šฉ์–ด์™€ ๊ฐ™์€ ๋‹จ์–ด์Œ์— ๋Œ€ํ•œ Weight(๊ฐ€์ค‘์น˜)๋ฅผ ์ค„์ด๋Š”๊ฒƒ์€ Model์˜ ํ•™์Šต์— ๋„์›€์ด ๋  ๋ฟ๋”๋Ÿฌ ํฌ๊ท€ & ํšŒ์†Œ์„ฑ์ด ์žˆ๋Š” ๋‹จ์–ด ์Œ์˜ ์ •๋ณด๋ฅผ ๋ณด์กดํ•˜๋Š”๋ฐ ๋„์›€์„ ์ค๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  Weight(๊ฐ€์ค‘์น˜)ํ•จ์ˆ˜๋Š” Model์ด ๋‹จ์ˆœํžˆ ๋™์‹œ ์ถœํ˜„ ๋นˆ๋„๊ฐ€ ๋†’์€ ๋‹จ์–ด์Œ๋งŒ์„ ๊ณ ๋ คํ•˜๋Š”๊ฒƒ์ด ์•„๋‹Œ, ๋‹ค์–‘ํ•œ ๋‹จ์–ด์Œ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜์—ฌ ์ •ํ™•ํ•˜๊ณ  ๋‹ค์–‘ํ•œ ๋‹จ์–ด ์˜๋ฏธ๋ฅผ ํฌ์ฐฉํ•˜๋Š”๋ฐ ๋„์›€์„ ์ค๋‹ˆ๋‹ค.

 

  • The function f(x) never returns a value greater than 1; the weight always lies between 0 and 1.
  • By limiting the weight of word pairs whose co-occurrence count exceeds the threshold x_max (the variable shown in the graph above), this scheme prevents pairs with very high co-occurrence counts from having an outsized influence on training.
  • f(x) is defined by the first equation below, and the final loss function by the second.

f(x) = (x / x_max)^(3/4)   if x < x_max;   f(x) = 1 otherwise

J = ฮฃ_{i,j=1..V} f(X_ij) (w_i^T w~_j + b_i + b~_j − log X_ij)^2
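The two definitions can be written as a short sketch (x_max = 100 and α = 3/4 are the values suggested in the GloVe paper; the dot product and biases passed to loss_term are stand-in scalars, not trained values):

```python
import math

X_MAX, ALPHA = 100, 0.75  # values suggested in the GloVe paper

def f(x):
    # Weight rises as (x / x_max)^alpha, then is capped at 1
    return (x / X_MAX) ** ALPHA if x < X_MAX else 1.0

def loss_term(dot_ik, b_i, b_k, x_ik):
    # One summand of J: f(X_ik) * (w_i . w~_k + b_i + b~_k - log X_ik)^2
    return f(x_ik) * (dot_ik + b_i + b_k - math.log(x_ik)) ** 2

print(round(f(10), 4))  # infrequent pair: weight below 1
print(f(500))           # very frequent pair: capped at 1.0
```

Note that when the dot product plus biases exactly equals log X_ik, the loss term is zero, which is precisely the fit the objective drives toward.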


Example Code - GloVe

  • To use GloVe, a GloVe package has to be installed.
  • Originally this would be the 'glove-python' library, but its installation currently appears to be broken, so the 'gensim' library is used instead.
pip install gensim
  • ์ด ์ฝ”๋“œ๋Š” Glove์˜ ์‚ฌ์ „ ํ•™์Šต์ด๋œ ๋ชจ๋ธ ํŒŒ์ผ 'glove.6B.100d.txt'๋ฅผ ๋ถˆ๋Ÿฌ์™€์„œ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ ํŒŒ์ผ์ด ์žˆ๋Š” Github๋ฅผ ์•„๋ž˜ ๋งํฌ ๋‹ฌ์•„๋†“์„ํ…Œ๋‹ˆ๊นŒ ๊ผญ ๋‹ค์šด๋กœ๋“œ ํ•˜๊ณ , ๊ฒฝ๋กœ ์ง€์ • ํ•ด์ฃผ์…”์„œ ์ฝ”๋“œ ๋Œ๋ ค๋ณด์…”์•ผ ํ•ด์š”!
  • ์ด Github ReadMe Page์˜ 'Download Pre-Trained Word Vector' ์„น์…˜์—์„œ ๋‹ค์šด๋กœ๋“œ ํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
 

GitHub - stanfordnlp/GloVe: Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings (github.com)

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the GloVe file directly. gensim >= 4.0 reads the headerless GloVe
# format with no_header=True; older versions need gensim's glove2word2vec first.
glove_input_file = 'glove.6B.100d.txt'  # set the path to your GloVe file
model = KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True)

# Read documents from a text file
with open('your_text_file.txt', 'r', encoding='utf-8') as file:
    documents = [line.strip().split() for line in file]

# Collect the words appearing in the documents together with their vectors,
# skipping words missing from the GloVe vocabulary; doc_words stays aligned
# with the rows of word_vectors so indices can be mapped back to words
doc_words = [word for document in documents for word in document if word in model]
word_vectors = np.array([model[word] for word in doc_words])

# Return the list of document words most similar to the input word
def most_similar_words(input_word, top_n=5):
    if input_word in model:
        input_vector = model[input_word]
        similarity_scores = cosine_similarity([input_vector], word_vectors)[0]
        most_similar_indices = np.argsort(similarity_scores)[::-1][:top_n]
        return [doc_words[index] for index in most_similar_indices]
    else:
        return []

# Input a word and print the result
input_word = 'hamburger'
similar_words = most_similar_words(input_word)
print(f"Words similar to {input_word}: {similar_words}")
  • Replace 'your_text_file.txt' with the training-data file you want to use; then pass any word as input_word, and the most_similar_words function returns a list of the words most similar to it.
  • The top_n parameter sets how many similar words are returned.

FastText

  • FastText is an open-source library developed by Facebook (now Meta) for building word embeddings and performing efficient text classification.
  • Its mechanism is essentially an extension of Word2Vec, but with one key difference: unlike Word2Vec, FastText treats a single word as containing several smaller units. That is, it takes subwords into account during training.

N-gram

  • FastText์—์„œ ๊ฐ ๋‹จ์–ด๋Š” ๊ธ€์ž ๋‹จ์œ„ N-gram ๊ตฌ์„ฑ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. 
  • ์—ฌ๊ธฐ์„œ N-gram์€ ์–ธ์–ดํ•™, ํ†ต๊ณ„ํ•™์  ๊ฐœ๋…์—์„œ ๊ฐ€์ ธ์˜จ๊ฒƒ์œผ๋กœ, ์—ฐ์†๋œ n๊ฐœ์˜ ํ•ญ๋ชฉ(์—ฌ๊ธฐ์„œ๋Š” ๋‹จ์–ด์ž…๋‹ˆ๋‹ค)๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ "n"์€ ์—ฐ์†๋œ ํ•ญ๋ชฉ์˜ ๊ฐœ์ˆ˜๋กœ ๋‚˜ํƒ€๋‚ด๋Š”๋ฐ, 1-gram์€ ์œ ๋‹ˆ๊ทธ๋žจ(Unigram), 2-gram์€ ๋ฐ”์ด๊ทธ๋žจ(Bigram), 3-gram์€ ํŠธ๋ผ์ด๊ทธ๋žจ(Trigram)์œผ๋กœ ๋ถ€๋ฆ…๋‹ˆ๋‹ค.
  • N-gram์€ Text์—์„œ ์–ด๋–ค ํŒจํ„ด์ด๋‚˜ ๋ฌธ๋งฅ์„ ํŒŒ์•…ํ•˜๋Š” ๋ฐ์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ฃผ๋กœ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ(NLP)์—์„œ Text๋ฅผ ํŠน์ • ํฌ๊ธฐ์˜ n-gram์œผ๋กœ ๋‚˜๋ˆ„์–ด ์‚ฌ์šฉํ•˜๋ฉด, ๋ฌธ์žฅ์ด๋‚˜ ๋ฌธ์„œ์˜ ๊ตฌ์กฐ, ์˜๋ฏธ, ๋ฌธ๋งฅ ๋“ฑ์„ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.

 

  • For example, vectorizing the word "house" with 3-grams (trigrams) produces the following 5 subword tokens, each turned into a vector.
# when n = 3
<ho, hou, ous, use, se>
  • ๊ทธ๋ฆฌ๊ณ  ์ถ”๊ฐ€์ ์œผ๋กœ ๊ธฐ์กด ๋‹จ์–ด์™ธ ์— <, ์™€ >๋ฅผ ๋ถ™์ธ ํ† ํฐ์„ ํ•˜๋‚˜๋” ๋ฒกํ„ฐํ™” ํ•ด์ค๋‹ˆ๋‹ค.
# ์ถ”๊ฐ€ ํ† ํฐ
<house>
  • In practice you can set a range with a minimum and maximum value of n; by default the minimum is 3 and the maximum is 6.
  • FastText vectorizes all of the resulting subwords.
# when n = 3 to 6
<ho, hou, ous, use, se>, <hou, hous, ouse, use>, ..., <house>
  • "Vectorizing the subwords" here means running Word2Vec over those subword tokens.
  • Once the subword vectors have been obtained, the vector for the word house is the sum of all of them.
house = <ho + hou + ous + use + se>, <hou + hous + ouse + use>, ..., + <house>
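The subword extraction described above can be sketched as follows (a simplified illustration, not FastText's actual implementation):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # Wrap the word in '<' and '>' markers, then collect every character
    # n-gram of length n_min..n_max, plus the wrapped word itself
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)
    return grams

print(sorted(char_ngrams("house", 3, 3)))
# ['<ho', '<house>', 'hou', 'ous', 'se>', 'use']
```

With n fixed at 3, this reproduces exactly the 5 trigrams plus the `<house>` token shown above.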

 

Out-of-Vocabulary Words

  • FastText์˜ ์ธ๊ณต ์‹ ๊ฒฝ๋ง์„ ํ•™์Šตํ•œ ํ›„์—๋Š” ๋ฐ์ดํ„ฐ์…‹์˜ ๋ชจ๋“  ๊ฐ n-gram์— ๋Œ€ํ•ด์„œ Word Embedding์ด ๋ฉ๋‹ˆ๋‹ค.
  • ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ๋ฐ์ดํ„ฐ์…‹๋งŒ ์ถฉ๋ถ„ํ•˜๋‹ค๋ฉด ์œ„์™€ ๊ฐ™์€ ๋‚ด๋ถ€ ๋‹จ์–ด(subword)๋ฅผ ํ†ตํ•ด ๋ชจ๋ฅด๋Š” ๋‹จ์–ด(Out of Vocabulary, OOV)์— ๋Œ€ํ•ด์„œ๋„ ๋‹ค๋ฅธ ๋‹จ์–ด์™€์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด์„œ FastText์—์„œ "dancestudio"๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ํ•™์Šต์ด ์•ˆ๋˜์–ด ์žˆ์ง€๋งŒ, ๋‹ค๋ฅธ ๋‹จ์–ด์—์„œ "dance"์™€ "studio" ๋‚ด๋ถ€ ๋‹จ์–ด๊ฐ€ ์žˆ์œผ๋ฉด FastText๋Š” "dancestudio"์˜ Vector๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Word2Vec, Glove๋Š” ๋ชจ๋ฅด๋Š” ๋‹จ์–ด์— ๋ฐํ•˜์—ฌ ๋Œ€์ฒ˜ํ• ์ˆ˜ ์—†๋Š”๊ฒƒ ๊ณผ๋Š” ๋‹ค๋ฅธ์  ์ž…๋‹ˆ๋‹ค.

 ๋นˆ๋„์ˆ˜๊ฐ€ ์ ์€ ๋‹จ์–ด

  • ๋“ฑ์žฅ ๋นˆ๋„์ˆ˜๊ฐ€ ์ ์€(rate word)์— ๋Œ€ํ•ด์„œ Word2Vec์€ Embedding์˜ ์ •ํ™•๋„๊ฐ€ ๋†’์ง€ ์•Š์•˜๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ฐธ๊ณ ํ• ์ˆ˜ ์žˆ๋Š” ์ˆ˜๊ฐ€ ์ ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  • ๊ทผ๋ฐ, FastText๋Š” ๋‹จ์–ด๊ฐ€ ์ ์€(rate word) ํšŒ๊ท€ ๋‹จ์–ด๋ผ๋„, ๊ทธ ๋‹จ์–ด์˜ N-gram์ด ๋‹ค๋ฅธ ๋‹จ์–ด์˜ N-gram์ด ๊ฒน์น˜๋Š” ๊ฒฝ์šฐ์—๋Š” ๋†’์€ Embedding Vector๊ฐ’์„ ์–ป์Šต๋‹ˆ๋‹ค.
  • FastText๊ฐ€ Noise๊ฐ€ ๋งŽ์€ ์ฝ”ํผ์Šค์—์„œ ๊ฐ•์ ์„ ๊ฐ€์ง€๋Š” ๊ฒƒ์˜ ์ด์œ ์ž…๋‹ˆ๋‹ค.
  • ๋ชจ๋“  ํ›ˆ๋ จ ์ฝ”ํผ์Šค์— ์˜คํƒ€, ๋งž์ถค๋ฒ•์ด ํ‹€๋ฆฐ ๋‹จ์–ด๊ฐ€ ์—†์œผ๋ฉด ์ข‹๊ฒ ์ง€๋งŒ, ์‹ค์ œ ๋งŽ์€ ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ์—๋Š” ์˜คํƒ€๊ฐ€ ์„ž์—ฌ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ์˜คํƒ€๊ฐ€ ์„ž์ธ ๋‹จ์–ด๋Š” ๋‹น์—ฐํžˆ ๋“ฑ์žฅ ๋นˆ๋„์ˆ˜๊ฐ€ ๋งค์šฐ ์ ์œผ๋ฏ€๋กœ ์ผ์ข…์˜ ํฌ๊ท€ ๋‹จ์–ด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
  • Word2Vec์—์„œ๋Š” ์˜คํƒ€๊ฐ€ ์„ž์ธ ๋‹จ์–ด๋Š” Embedding์ด ์ œ๋Œ€๋กœ ์•ˆ๋˜์ง€๋งŒ FastText๋Š” ๊ทธ๋ž˜๋„ ์ผ์ • ์ˆ˜์ค€์˜ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด ๋‹จ์–ด apple๊ณผ ์˜คํƒ€๋กœ p๋ฅผ ํ•œ ๋ฒˆ ๋” ์ž…๋ ฅํ•œ appple์˜ ๊ฒฝ์šฐ์—๋Š” ์‹ค์ œ๋กœ ๋งŽ์€ ๊ฐœ์ˆ˜์˜ ๋™์ผํ•œ n-gram์„ ๊ฐ€์งˆ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
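The apple/appple example can be checked with the same kind of trigram extraction (an illustrative sketch, not FastText's internal code):

```python
def trigrams(word):
    # Character trigrams of the word wrapped in '<' and '>' markers
    w = f"<{word}>"
    return {w[i:i + 3] for i in range(len(w) - 2)}

shared = trigrams("apple") & trigrams("appple")
print(sorted(shared))  # the typo shares most of its trigrams with 'apple'
```

All 5 trigrams of apple except none but one new one (`ppp`) appear in appple as well, which is why their subword-based vectors end up close.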

Example Code - FastText

  • FastText is available through the 'gensim' library, so let's install it.
pip install gensim
from gensim.models import FastText

# Read documents from a text file
with open('your_text_file.txt', 'r', encoding='utf-8') as file:
    documents = [line.strip().split() for line in file]

# Train the FastText model
fasttext_model = FastText(documents, vector_size=100, window=5, min_count=1, workers=4)

# Return the list of words most similar to the input word
def most_similar_words(input_word, top_n=5):
    if input_word in fasttext_model.wv:
        most_similar = fasttext_model.wv.most_similar(input_word, topn=top_n)
        return [word for word, _ in most_similar]
    else:
        return []

# Input a word and print the result
input_word = 'input_word'
similar_words = most_similar_words(input_word)
print(f"Words similar to {input_word}: {similar_words}")
  • Replace 'your_text_file.txt' with the training-data file you want to use; then pass any word as input_word, and the most_similar_words function returns a list of the words most similar to it.
  • The top_n parameter sets how many similar words are returned.