[NLP] Improving Statistics-Based Methods
์•ž์— ๊ธ€, Thesaurus(์‹œ์†Œ๋Ÿฌ์Šค), Co-occurence Matrix(๋™์‹œ๋ฐœ์ƒ ํ–‰๋ ฌ)๋ถ€๋ถ„์—์„œ ํ†ต๊ณ„ ๊ธฐ๋ฐ˜ ๊ธฐ๋ฒ•์— ๋ฐํ•˜์—ฌ ์„ค๋ช…ํ–ˆ์Šต๋‹ˆ๋‹ค.
  • Thesaurus(์‹œ์†Œ๋Ÿฌ์Šค), Co-occurence Matrix(๋™์‹œ๋ฐœ์ƒ ํ–‰๋ ฌ) ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€๊ธˆ ๋‚ด์šฉ๊ณผ ์—ฐ๊ฒฐ๋˜๋Š” ๊ธ€์ด๋‹ˆ๊นŒ ํ•œ๋ฒˆ ์ฝ์–ด๋ณด์„ธ์š”.
 

[NLP] Thesaurus, Co-occurrence Matrix (daehyun-bigbread.tistory.com)

Pointwise Mutual Information (PMI)

  • Each element of the co-occurrence matrix counts how many times two words appear together.
  • Raw 'occurrence' counts, however, are not a very good feature. Think of high-frequency words and the problem becomes clear.
  • If relatedness were judged by raw counts alone, frequent words would appear strongly related to almost everything, even to words they actually have little to do with.
  • To fix this, we use a measure called Pointwise Mutual Information (PMI).
  • For random variables x and y, PMI is defined by the following formula.

Pointwise Mutual Information(PMI)์˜ ์ˆ˜์‹์ž…๋‹ˆ๋‹ค.

  • ๋Š” ๐‘ฅ๊ฐ€ ์ผ์–ด๋‚  ํ™•๋ฅ , ๋Š” ๐‘ฆ๊ฐ€ ์ผ์–ด๋‚  ํ™•๋ฅ , ๐‘ƒ(๐‘ฅ,๐‘ฆ)๋Š” ๐‘ฅ์™€ ๐‘ฆ๊ฐ€ ๋™์‹œ์— ์ผ์–ด๋‚  ํ™•๋ฅ ์„ ๋œปํ•ฉ๋‹ˆ๋‹ค.
  • ์ด PMI ๊ฐ’์ด ๋†’์„์ˆ˜๋ก ๊ด€๋ จ์„ฑ์ด ๋†’๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค.
  • ์ด ์‹์„ ์ ์šฉํ•˜๋ฉด ๐‘ƒ(๐‘ฅ)๋Š” ๋‹จ์–ด ๐‘ฅ๊ฐ€ Corpus(๋ง๋ญ‰์น˜)์— ๋“ฑ์žฅํ•  ํ™•๋ฅ ์„ ๊ฐ€๋ฆฌํ‚ต๋‹ˆ๋‹ค.
  • ์˜ˆ์ปจ๋Œ€ 10,000๊ฐœ์˜ ๋‹จ์–ด๋กœ ์ด๋ฃจ์–ด์ง„ ๋ง๋ญ‰์น˜์—์„œ "the"๊ฐ€ 100๋ฒˆ ๋“ฑ์žฅํ•œ๋‹ค๋ฉด?
    • ๐‘ƒ("๐‘กโ„Ž๐‘’") = 100/10000 = 0.01์ด ๋ฉ๋‹ˆ๋‹ค.
  •  ๋˜ํ•œ ๐‘ƒ(๐‘ฅ,๐‘ฆ)๋Š” ๋‹จ์–ด ๐‘ฅ์™€ ๐‘ฆ๊ฐ€ ๋™์‹œ๋ฐœ์ƒํ•  ํ™•๋ฅ ์ด๋ฏ€๋กœ, ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ "the"์™€ "car"๊ฐ€ 10๋ฒˆ ๋™์‹œ๋ฐœ์ƒ ํ–ˆ๋‹ค๋ฉด?
    • ๐‘ƒ("๐‘กโ„Ž๐‘’","๐‘๐‘Ž๐‘Ÿ")= 10/10000 =0.001 ์ด ๋˜๋Š” ๊ฒƒ์ด์ฃ .
  • Now let's rewrite the formula using the co-occurrence matrix (whose elements are co-occurrence counts).
  • Let C be the co-occurrence matrix, C(x, y) the number of times words x and y co-occur, and C(x) and C(y) the occurrence counts of words x and y.
  • With N the number of words in the corpus, the formula becomes:

PMI(x, y) = log2( P(x, y) / ( P(x) P(y) ) ) = log2( C(x, y) · N / ( C(x) C(y) ) )
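To see why PMI beats raw counts, here is a small worked example. The counts are hypothetical, chosen purely for illustration: "the" co-occurs with "car" more often than "drive" does, yet PMI ranks "drive" higher because "the" is frequent everywhere.

```python
import numpy as np

def pmi(c_xy, c_x, c_y, n):
    # PMI from raw counts: log2( C(x,y) * N / (C(x) * C(y)) )
    return np.log2(c_xy * n / (c_x * c_y))

# Hypothetical corpus of N = 10,000 words:
# 'the' appears 1,000 times, 'car' 20 times, 'drive' 10 times;
# C('the','car') = 10, C('car','drive') = 5.
print(round(pmi(10, 1000, 20, 10000), 3))  # PMI('the', 'car')  -> 2.322
print(round(pmi(5, 20, 10, 10000), 3))     # PMI('car', 'drive') -> 7.966
```

Even though "the" and "car" co-occur twice as often, PMI judges "car" and "drive" far more related, because PMI discounts the background frequency of each word.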

  • We now have the PMI measure.
  • One problem remains: when the co-occurrence count is 0, log2 0 = −∞.
  • To avoid this, practical implementations use Positive PMI (PPMI) instead.

PPMI์˜ ์ˆ˜์‹์ž…๋‹ˆ๋‹ค.

  • ์ด ์‹์— ๋”ฐ๋ผ PMI๊ฐ€ ์Œ์ˆ˜์ผ ๋•Œ๋Š” 0์œผ๋กœ ์ทจ๊ธ‰ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด์ œ ๋‹จ์–ด ์‚ฌ์ด์˜ ๊ด€๋ จ์„ฑ์„ 0 ์ด์ƒ์˜ ์‹ค์ˆ˜๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด Co-occurence Matrix(๋™์‹œ๋ฐœ์ƒ ํ–‰๋ ฌ)์„ PPMI ํ–‰๋ ฌ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๊ตฌํ˜„ํ•ด๋ด…์‹œ๋‹ค.
import numpy as np

def ppmi(C, verbose=False, eps=1e-8):
    '''Build a PPMI (positive pointwise mutual information) matrix.

    :param C: co-occurrence matrix
    :param verbose: whether to print progress
    :return: PPMI matrix
    '''
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j]*S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total//100 + 1) == 0:
                    print('%.1f%% done' % (100*cnt/total))
    return M
    • ์—ฌ๊ธฐ์—์„œ ์ธ์ˆ˜ C๋Š” Co-occurence Matrix(๋™์‹œ๋ฐœ์ƒ ํ–‰๋ ฌ), verbose๋Š” ์ง„ํ–‰์ƒํ™ฉ ์ถœ๋ ฅ์„ ์—ฌ๋ถ€๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ํ”Œ๋ž˜๊ทธ์ž…๋‹ˆ๋‹ค.
    • ํฐ Corpus๋ฅผ ๋‹ค๋ฃฐ ๋•Œ verbose=True๋กœ ์„ค์ •ํ•˜๋ฉด ์ค‘๊ฐ„์ค‘๊ฐ„ ์ง„ํ–‰ ์ƒํ™ฉ์„ ์•Œ๋ ค์ฃผ์ฃ .
    • ์ฐธ๊ณ ๋กœ, ์ด ์ฝ”๋“œ๋Š” Co-occurence Matrix(๋™์‹œ๋ฐœ์ƒ ํ–‰๋ ฌ)์— ๋Œ€ํ•ด์„œ๋งŒ PPMI ํ–‰๋ ฌ์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๊ณ ์ž ๋‹จ์ˆœํ•˜๊ฒŒ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค.
    • ๊ตฌ์ฒด์ ์œผ๋กœ ๋งํ•˜๋ฉด, ๋‹จ์–ด ๐‘ฅ์™€ ๐‘ฆ๊ฐ€ ๋™์‹œ์— ๋ฐœ์ƒํ•˜๋Š” ํšŸ์ˆ˜๋ฅผ ๐ถ(๐‘ฅ,๐‘ฆ)๋ผ ํ–ˆ์„ ๋•Œ?

  • ์œ„์˜ ์ˆ˜์‹์ฒ˜๋Ÿผ ๋˜๋„๋ก. ์ฆ‰, ๊ทผ์‚ฌ๊ฐ’์„ ๊ตฌํ•˜๋„๋ก ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค.
  • Now let's actually convert a co-occurrence matrix into a PPMI matrix.
import sys
sys.path.append('..')
import numpy as np
from common.util import preprocess, create_co_matrix, cos_similarity, ppmi

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)
W = ppmi(C)

np.set_printoptions(precision=3)  # show three significant digits
print('Co-occurrence Matrix')
print(C)
print('-'*50)
print('PPMI')
print(W)

 

  • ์ฝ”๋“œ์˜ ์‹คํ–‰ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.
๋™์‹œ๋ฐœ์ƒ ํ–‰๋ ฌ
[[0 1 0 0 0 0 0]
 [1 0 1 0 1 1 0]
 [0 1 0 1 0 0 1]
 [0 0 1 0 1 0 0]
 [0 1 0 1 0 1 0]
 [0 1 0 0 1 0 1]
 [0 0 0 0 0 1 0]]

PPMI
[[0.    1.807 0.    0.    0.    0.    0.   ]
 [1.807 0.    0.807 0.    0.807 0.807 0.   ]
 [0.    0.807 0.    1.807 0.    0.    0.   ]
 [0.    0.    1.807 0.    1.807 0.    0.   ]
 [0.    0.807 0.    1.807 0.    0.    0.   ]
 [0.    0.807 0.    0.    0.    0.    2.807]
 [0.    0.    0.    0.    0.    2.807 0.   ]]
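As an aside, the double loop in ppmi can be replaced by NumPy broadcasting. This is a sketch under the assumption that every word occurs at least once (so the marginal sums S are strictly positive):

```python
import numpy as np

def ppmi_vectorized(C, eps=1e-8):
    # Same math as the loop version: log2(C[i,j] * N / (S[i] * S[j]) + eps),
    # clamped at 0, but computed for the whole matrix at once.
    C = np.asarray(C, dtype=np.float64)
    N = C.sum()
    S = C.sum(axis=0)
    pmi = np.log2(C * N / np.outer(S, S) + eps)  # outer product gives S[i]*S[j]
    return np.maximum(0, pmi).astype(np.float32)

C = np.array([[0, 2],
              [2, 0]])
print(ppmi_vectorized(C))  # off-diagonal: log2(2*4/(2*2)) = 1.0
```

The vectorized form trades a little memory (the full np.outer(S, S) matrix) for much faster execution on large vocabularies.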

 

  • That is how a co-occurrence matrix is converted into a PPMI matrix.
  • Each element of the PPMI matrix is a real number greater than or equal to 0. We now have a matrix built from a better measure, i.e., better word vectors.
  • However, the PPMI matrix still has a big problem!
    • As the corpus vocabulary grows, the dimensionality of each word vector grows with it.
    • For example, a vocabulary of 100,000 words means each vector has 100,000 dimensions.
    • Working with 100,000-dimensional vectors is not very practical.
  • Moreover, looking inside the matrix, most of its elements are 0.
  • Most of each vector's elements carry no information; put differently, each individual element has low 'importance'.
  • Such vectors are also fragile: they are vulnerable to noise and not robust.
  • A common way to deal with this problem is dimensionality reduction.
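The scale problem is easy to quantify; storing such a matrix densely in float32 would take:

```python
vocab = 100_000
bytes_per_float = 4  # float32

# Dense vocab x vocab matrix, in GiB.
gib = vocab * vocab * bytes_per_float / 2**30
print(f'{gib:.1f} GiB')  # about 37.3 GiB for a dense 100,000 x 100,000 matrix
```

Tens of gigabytes for a matrix that is mostly zeros is a strong hint that a more compact representation is needed.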

Dimensionality Reduction

์ฐจ์› ๊ฐ์†Œ(dimensionality reduction)๋Š” ๋ฌธ์ž ๊ทธ๋Œ€๋กœ ๋ฒกํ„ฐ์˜ ์ฐจ์›์„ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ๋งํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋‚˜ ๋‹จ์ˆœํžˆ ์ค„์ด๊ธฐ๋งŒ ํ•˜๋Š” ๊ฒŒ ์•„๋‹ˆ๋ผ, '์ค‘์š”ํ•œ ์ •๋ณด'๋Š” ์ตœ๋Œ€ํ•œ ์œ ์ง€ํ•˜๋ฉด์„œ ์ค„์ด๋Š” ๊ฒŒ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค.
  • ์ง๊ด€์ ์ธ ์˜ˆ๋กœ ์•„๋ž˜์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด, ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๋ฅผ ๊ณ ๋ คํ•ด์„œ ์ค‘์š”ํ•œ '์ถ•'์„ ์ฐพ๋Š” ์ผ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

To represent 2D data in 1D, we find the important axis (the axis along which the data spreads most widely).

  • ์™ผ์ชฝ์€ ๋ฐ์ดํ„ฐ์ ๋“ค์„ 2์ฐจ์› ์ขŒํ‘œ๊ณ„์— ํ‘œ์‹œํ•œ ๋ชจ์Šต์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ์˜ค๋ฅธ์ชฝ์€ ์ƒˆ๋กœ์šด ์ถ•์„ ๋„์ž…ํ•˜์—ฌ ๋˜‘๊ฐ™์€ ๋ฐ์ดํ„ฐ๋ฅผ ์ขŒํ‘œ์ถ• ํ•˜๋‚˜๋งŒ์œผ๋กœ ํ‘œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค(์ƒˆ๋กœ์šด ์ถ•์„ ์ฐพ์„ ๋•Œ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ๋„“๊ฒŒ ๋ถ„ํฌ๋˜๋„๋ก ๊ณ ๋ คํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค).
  • ์ด๋•Œ ๊ฐ ๋ฐ์ดํ„ฐ์ ์˜ ๊ฐ’์€ ์ƒˆ๋กœ์šด ์ถ•์œผ๋กœ ์‚ฌ์˜๋œ ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ ์ค‘์š”ํ•œ ๊ฒƒ์€ ๊ฐ€์žฅ ์ ํ•ฉํ•œ ์ถ•์„ ์ฐพ์•„๋‚ด๋Š” ์ผ๋กœ, 1์ฐจ์› ๊ฐ’๋งŒ์œผ๋กœ๋„ ๋ฐ์ดํ„ฐ์˜ ๋ณธ์งˆ์ ์ธ ์ฐจ์ด๋ฅผ ๊ตฌ๋ณ„ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด์™€ ๊ฐ™์€ ์ž‘์—…์€ ๋‹ค์ฐจ์› ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ฐจ์›์„ ๊ฐ์†Œ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์€ ์—ฌ๋Ÿฌ ๊ฐ€์ง€์ž…๋‹ˆ๋‹ค๋งŒ, ์ด๋ฒˆ์—๋Š” ํŠน์ž‡๊ฐ’๋ถ„ํ•ด(Singular Value Decomposition, SVD)๋ฅผ ์ด์šฉํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ํŠน์ž‡๊ฐ’๋ถ„ํ•ด(Singular Value Decomposition, SVD)๋Š” ์ž„์˜์˜ ํ–‰๋ ฌ์„ ์„ธ ํ–‰๋ ฌ์˜ ๊ณฑ์œผ๋กœ ๋ถ„ํ•ดํ•˜๋ฉฐ, ์ˆ˜์‹์œผ๋กœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

  • ๊ฐ™์ด SVD๋Š” ์ž„์˜์˜ ํ–‰๋ ฌ ๐‘‹๋ฅผ ๐‘ˆ,๐‘†,๐‘‰ ๋ผ๋Š” ์„ธ ํ–‰๋ ฌ์˜ ๊ณฑ์œผ๋กœ ๋ถ„ํ•ดํ•ฉ๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ ๐‘ˆ์™€ ๐‘‰๐‘‡๋Š” ์ง๊ตํ–‰๋ ฌ(orthogonal matrix)์ด๊ณ , ๊ทธ ์—ด๋ฒกํ„ฐ๋Š” ์„œ๋กœ ์ง๊ตํ•ฉ๋‹ˆ๋‹ค.
  • ๋˜ํ•œ ๐‘†๋Š” ๋Œ€๊ฐํ–‰๋ ฌ(diagonal matrix)(๋Œ€๊ฐ์„ฑ๋ถ„ ์™ธ์—๋Š” ๋ชจ๋‘ 0์ธ ํ–‰๋ ฌ)์ž…๋‹ˆ๋‹ค.
  • ์ด ์ˆ˜์‹์„ ์‹œ๊ฐ์ ์œผ๋กœ ํ‘œํ˜„ํ•œ ๊ฒƒ์ด ์•„๋ž˜์˜ ๊ทธ๋ฆผ์ž…๋‹ˆ๋‹ค.

SVD์— ์˜ํ•œ ํ–‰๋ ฌ์˜ ๋ณ€ํ™˜ (ํ–‰๋ ฌ์˜ '๋นˆ ๋ถ€๋ถ„'์€ ์›์†Œ๊ฐ€ 0์ž„์„ ๋œปํ•จ)

  • U is an orthogonal matrix, and an orthogonal matrix forms the axes (a basis) of some space.
  • For our corpus, we can treat this U matrix as the 'word space'.
  • S is a diagonal matrix whose diagonal holds the singular values, arranged in decreasing order.
  • A singular value can be loosely interpreted as the 'importance' of the corresponding axis.
  • So, as the figure below shows, we can discard the low-importance elements (the small singular values).

SVD์— ์˜ํ•œ Dimension Reduction(์ฐจ์› ๊ฐ์†Œ)

  • ํ–‰๋ ฌ ๐‘†์—์„œ ํŠน์ž‡๊ฐ’์ด ์ž‘๋‹ค๋ฉด ์ค‘์š”๋„๊ฐ€ ๋‚ฎ๋‹ค๋Š” ๋œป์ด๋ฏ€๋กœ, ํ–‰๋ ฌ ๐‘ˆ์—์„œ๋Š” ์—ฌ๋ถ„์˜ ์—ดVector๋ฅผ ๊นŽ์•„๋‚ด์–ด ์›๋ž˜์˜ ํ–‰๋ ฌ์„ ๊ทผ์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด๋ฅผ ์šฐ๋ฆฌ ๋ฌธ์ œ๋กœ ๊ฐ€์ ธ์™€์„œ '๋‹จ์–ด์˜ PPMI ํ–‰๋ ฌ'์— ์ ์šฉํ•ด๋ณผ๊นŒ์š”?
  • ๊ทธ๋Ÿฌ๋ฉด ํ–‰๋ ฌ ๐‘‹์˜ ๊ฐ ํ–‰์—๋Š” ํ•ด๋‹น ๋‹จ์–ด ID์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๊ฐ€ ์ €์žฅ๋˜์–ด ์žˆ์œผ๋ฉฐ,
  • ๊ทธ ๋‹จ์–ด ๋ฒกํ„ฐ๊ฐ€ ํ–‰๋ ฌ ๐‘ˆ′๋ผ๋Š” ์ฐจ์› ๊ฐ์†Œ๋œ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
๋‹จ์–ด์˜ ๋™์‹œ๋ฐœ์ƒ ํ–‰๋ ฌ์„ ์ •๋ฐฉํ–‰๋ ฌ์ด์ง€๋งŒ, ์ดํ•ด ํ•˜๊ธฐ ์‰ฝ๊ฒŒ ์ง์‚ฌ๊ฐํ˜•์œผ๋กœ ๊ทธ๋ ธ์Šต๋‹ˆ๋‹ค.
๋˜ํ•œ ์—ฌ๊ธฐ์—์„œ๋Š” SVD๋ฅผ ์ง๊ด€์ ์ด๊ณ  ๊ฐ„๋žตํ•˜๊ฒŒ๋งŒ ์„ค๋ช…ํ–ˆ์Šต๋‹ˆ๋‹ค.

Dimensionality Reduction by SVD (Singular Value Decomposition)

Now let's look at SVD in Python code.
  • SVD can be run with the svd function provided by NumPy's linalg module.
  • For reference, "linalg" is short for linear algebra.
  • So: build the co-occurrence matrix, convert it to a PPMI matrix, then apply SVD.
import sys
sys.path.append('..')
import numpy as np
import matplotlib.pyplot as plt
from common.util import preprocess, create_co_matrix, ppmi

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
vocab_size = len(id_to_word)
C = create_co_matrix(corpus, vocab_size, window_size=1)
W = ppmi(C)

# SVD
U, S, V = np.linalg.svd(W)

print(C[0])  # co-occurrence matrix
# [0 1 0 0 0 0 0]

print(W[0])  # PPMI matrix
# [0.    1.807 0.    0.    0.    0.    0.   ]

print(U[0])  # SVD
# [ 3.409e-01 -1.110e-16 -1.205e-16 -4.441e-16  0.000e+00 -9.323e-01
#   2.226e-16]

print(U[0, :2])
# [ 3.409e-01 -1.110e-16]
  • ์ด ๊ฒฐ๊ณผ์—์„œ ๋ณด๋“ฏ ์›๋ž˜๋Š” ํฌ์†Œ Vector์ธ ๐‘Š[0]๊ฐ€ SVD์— ์˜ํ•ด์„œ ๋ฐ€์ง‘ Vector ๐‘ˆ[0]๋กœ ๋ณ€ํ™˜๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ์ด ๋ฐ€์ง‘ Vector์˜ Dimension(์ฐจ์›)์„ Reduction(๊ฐ์†Œ)์‹œํ‚ค๋ ค๋ฉด, ์˜ˆ์ปจ๋Œ€ 2-Dimension Vector(2์ฐจ์› ๋ฒกํ„ฐ) ์ค„์ด๋ ค๋ฉด ๋‹จ์ˆœํžˆ ์ฒ˜์Œ์˜ ๋‘ ์›์†Œ๋ฅผ ๊บผ๋‚ด๋ฉด ๋ฉ๋‹ˆ๋‹ค.
print(U[0, :2])
# [ 3.409e-01 -1.110e-16]
  • Now let's represent each word as a 2D vector and plot it.
for word, word_id in word_to_id.items():
    plt.annotate(word, (U[word_id, 0], U[word_id, 1]))

plt.scatter(U[:, 0], U[:, 1], alpha=0.5)
plt.show()

A plot of each word as a 2-dimensional vector, after applying SVD to the co-occurrence matrix.

  • We can see that "goodbye" and "hello", and "you" and "i", sit fairly close together.
  • That roughly matches intuition. But the corpus used here is tiny, so these results shouldn't be taken at face value.
  • So next we'll do the same thing with a larger corpus, the PTB dataset.

PTB Dataset

์ด๋ฒˆ์—๋Š” ์ ๋‹นํžˆ ํฐ? Corpus(๋ง๋ญ‰์น˜)์ธ ํŽœ ํŠธ๋ฆฌ๋ฑ…ํฌ(Penn Treebank, PTB) Dataset์„ ํ™œ์šฉํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
PTB ๋ง๋ญ‰์น˜๋Š” ์ฃผ์–ด์ง„ ๊ธฐ๋ฒ•์˜ ํ’ˆ์งˆ์„ ์ธก์ •ํ•˜๋Š” ๋ฒค์น˜๋งˆํฌ๋กœ ์ž์ฃผ ์ด์šฉ๋ฉ๋‹ˆ๋‹ค.
  • ์ด PTB ๋ง๋ญ‰์น˜๋Š” ํ…์ŠคํŠธ ํŒŒ์ผ๋กœ ์ œ๊ณต๋˜๋ฉฐ, ์›๋ž˜์˜ PTB ๋ฌธ์žฅ์— ๋ช‡ ๊ฐ€์ง€ ์ „์ฒ˜๋ฆฌ๋ฅผ ํ•ด๋‘์—ˆ์Šต๋‹ˆ๋‹ค.
  • ์˜ˆ์ปจ๋Œ€ ํฌ์†Œํ•œ ๋‹จ์–ด๋ฅผ <unk>๋ผ๋Š” ํŠน์ˆ˜ ๋ฌธ์ž๋กœ ์น˜ํ™˜ํ•œ๋‹ค๊ฑฐ๋‚˜ <unk>๋Š” "unknown"์˜ ์•ฝ์–ด, ๊ตฌ์ฒด์ ์ธ ์ˆซ์ž๋ฅผ "N"์œผ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋“ฑ์˜ ์ž‘์—…์ด ์ ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
consumers may want to move their telephones a little closer to the tv set
<unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk>
two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues
and the new syndicated reality show hard copy records viewers ' opinions for possible airing on the next day 's show
interactive telephone technology has taken a new leap in <unk> and television programmers are racing to exploit the possibilities
eventually viewers may grow <unk> with the technology and <unk> the cost
  • ์œ„์—์„œ ๋ณด๋“ฏ PTB ๋ง๋ญ‰์น˜์—์„œ๋Š” ํ•œ ๋ฌธ์žฅ์ด ํ•˜๋‚˜์˜ ์ค„๋กœ ์ €์žฅ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด ์ฑ…์—์„œ๋Š” ๊ฐ ๋ฌธ์žฅ์„ ์—ฐ๊ฒฐํ•œ 'ํ•˜๋‚˜์˜ ํฐ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ'๋กœ ์ทจ๊ธ‰ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋•Œ ๊ฐ ๋ฌธ์žฅ ๋์— <eos>๋ผ๋Š” ํŠน์ˆ˜ ๋ฌธ์ž๋ฅผ ์‚ฝ์ž…ํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค์Œ์€ ptb.py๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์˜ˆ์ž…๋‹ˆ๋‹ค.
import sys
sys.path.append('..')
from dataset import ptb

corpus, word_to_id, id_to_word = ptb.load_data('train')

print('๋ง๋ญ‰์น˜ ํฌ๊ธฐ:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()

print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()

print("word_to_id['car']:", word_to_id['car'])
print("word_to_id['happy']:", word_to_id['happy'])
print("word_to_id['lexus']:", word_to_id['lexus'])
  • The output is shown below.
corpus size: 929589
corpus[:30]: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]

id_to_word[0]: aer
id_to_word[1]: banknote
id_to_word[2]: berlitz

word_to_id['car']: 3856
word_to_id['happy']: 4428
word_to_id['lexus']: 7426
  • ๋ง๋ญ‰์น˜๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฐฉ๋ฒ•์€ ์ง€๊ธˆ๊นŒ์ง€์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  • corpus์—๋Š” ๋‹จ์–ด ID ๋ชฉ๋ก์ด ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.
  • id_to_word๋Š” ๋‹จ์–ด ID์—์„œ ๋‹จ์–ด๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋”•์…”๋„ˆ๋ฆฌ์ด๊ณ , word_to_id๋Š” ๋‹จ์–ด์—์„œ ๋‹จ์–ด ID๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋”•์…”๋„ˆ๋ฆฌ์ž…๋‹ˆ๋‹ค.
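To make these structures concrete, here is a toy reconstruction of the same contract on hypothetical tokens (not the real PTB IDs):

```python
# Toy version of the corpus / word_to_id / id_to_word contract.
words = ['you', 'say', 'goodbye', '<eos>', 'i', 'say', 'hello', '<eos>']

word_to_id = {}
for w in words:
    if w not in word_to_id:          # assign IDs in order of first appearance
        word_to_id[w] = len(word_to_id)
id_to_word = {i: w for w, i in word_to_id.items()}
corpus = [word_to_id[w] for w in words]

print(corpus)         # [0, 1, 2, 3, 4, 1, 5, 3]
print(id_to_word[3])  # <eos>
```

Note how the repeated word 'say' maps to the same ID both times, and the <eos> markers appear in the ID sequence like any other word.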

 

 

PTB Dataset ํ‰๊ฐ€

PTB ๋ฐ์ดํ„ฐ์…‹์— ํ†ต๊ณ„ ๊ธฐ๋ฐ˜ ๊ธฐ๋ฒ•์„ ์ ์šฉํ•ด๋ด…์‹œ๋‹ค.
  • ์ด๋ฒˆ์—๋Š” ํฐ ํ–‰๋ ฌ์— SVD๋ฅผ ์ ์šฉํ•ด์•ผ ํ•˜๋ฏ€๋กœ ๊ณ ์† SVD๋ฅผ ์ด์šฉํ•  ๊ฒƒ์„ ์ถ”์ฒœํ•ฉ๋‹ˆ๋‹ค.
import sys
sys.path.append('..')
import numpy as np
from common.util import most_similar, create_co_matrix, ppmi
from dataset import ptb

window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('๋™์‹œ๋ฐœ์ƒ ์ˆ˜ ๊ณ„์‚ฐ ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('PPMI ๊ณ„์‚ฐ ...')
W = ppmi(C, verbose=True)

print('SVD ๊ณ„์‚ฐ ...')
try:
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=None)
except ImportError:
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
  • Here we used sklearn's randomized_svd() to perform the SVD.
  • That function uses random sampling for a Truncated SVD, computing only the largest singular values, so it is much faster than a full SVD.
  • The rest is almost the same as the code used on the small corpus.
  • The results are shown below.
[query] you
i: 0.702039909619
we: 0.699448543998
ve: 0.554828709147
do: 0.534370693098
else: 0.512044146526

[query] year
month: 0.731561990308
quarter: 0.658233992457
last: 0.622425716735
earlier: 0.607752074689
next: 0.601592506413

[query] car
luxury: 0.620933665528
auto: 0.615559874277
cars: 0.569818364381
vehicle: 0.498166879744
corsica: 0.472616831915

[query] toyota
motor: 0.738666107068
nissan: 0.677577542584
motors: 0.647163210589
honda: 0.628862379043
lexus: 0.604740429865
  • Looking at the results, for the query 'you', the personal pronouns 'i' and 'we' rank highest.
  • They are words that idiomatically appear together in English sentences.
  • Also, 'year' is associated with 'month' and 'quarter', and 'car' with 'auto' and 'vehicle'.
  • And for 'toyota', related words such as 'nissan', 'honda', and 'lexus', i.e., car manufacturers and brands, come out on top.
  • In this way, words that are similar in meaning or in grammatical usage end up as nearby vectors.
  • The results are close to what intuition would suggest.

Summary

By converting the co-occurrence matrix into a PPMI matrix and then reducing its dimensionality, we can turn huge 'sparse vectors' into small 'dense vectors'.
In the resulting word vector space, words that are close in meaning are expected to be close in distance as well.