๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] Word Embedding

1. What is Word Embedding? Word embedding is a method of converting text data into numerical vectors — in other words, of turning the words in a text into a vector form that a computer can work with. Concretely, it maps each word from a high-dimensional space (such as a one-hot encoding) to a low-dimensional dense vector. A vector produced by word embedding can numerically capture a word's meaning, context, similarity to other words, and so on. 2. Word Embedding Methods. Word embedding is broadly done in one of two ways: one is the count-based approach, the other is the prediction-ba..
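The count-based approach the excerpt mentions can be sketched with a tiny co-occurrence matrix: each word's row of neighbor counts is a (sparse, high-dimensional) vector representation. This is an illustrative toy, not the post's own code; the corpus and window size are made up.

```python
from collections import defaultdict

# Toy corpus; window = 1 co-occurrence counts (a count-based embedding sketch).
corpus = [["i", "like", "apples"], ["i", "like", "nlp"], ["you", "like", "nlp"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Each row of `cooc` is a count vector for one word.
cooc = [[0] * len(vocab) for _ in vocab]
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):          # neighbors within the window
            if 0 <= j < len(sent):
                cooc[idx[w]][idx[sent[j]]] += 1

print(vocab)
print(cooc[idx["like"]])  # how often "like" co-occurs with each word
```

Prediction-based methods (e.g., Word2Vec, covered in the next post) instead learn low-dimensional vectors by training a model to predict words from context.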

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] Word2Vec, CBOW, Skip-Gram - Concepts & Models

1. What is Word2Vec? Word2Vec is a popular algorithm for converting words into vectors; here, a "word" usually means a token. The algorithm is designed as unsupervised learning: it learns representations that capture the semantic relationships between words (tokens) in vector space. It works either by predicting each word from its surrounding words (its context), or conversely by predicting the surrounding words from a given word. By analogy with how images are learned, it treats each word as a vector and learns from that. In this way Word2Vec captures the semantic relationships between words. To train the model on the sentence in the figure above, each word (token..
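The two prediction directions described above correspond to the two Word2Vec training setups: CBOW predicts the center word from its context, Skip-Gram predicts context words from the center word. A minimal sketch of how the training pairs are generated (toy sentence and window size are assumptions, not from the post):

```python
# Generate training pairs for a toy sentence with a context window of 2.
sentence = "the quick brown fox jumps".split()
window = 2

skipgram_pairs = []   # (center word, one context word) — Skip-Gram
cbow_pairs = []       # (list of context words, center word) — CBOW
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))
    for c in context:
        skipgram_pairs.append((center, c))

print(cbow_pairs[2])        # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:4])
```

A real implementation (e.g., gensim's `Word2Vec`) then trains a shallow network on millions of such pairs.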

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] GRU Model - A Lighter Version of the LSTM Model

1. What is the GRU Model? The GRU (Gated Recurrent Unit) is a kind of recurrent neural network (RNN) that can be seen as a simplified form of the LSTM (Long Short-Term Memory) model described earlier. A GRU works in a similar way to an LSTM but has a simpler structure: it keeps the advantages of the LSTM while simplifying its gate structure. Both GRU and LSTM were created to address the long-term dependency problem. As explained in the LSTM post, the LSTM maintains a "cell state" and a "hidden..
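The simplified gate structure can be sketched as a single GRU step in NumPy — update gate, reset gate, candidate state, then interpolation. Weights are random and the dimensions are made up; this is illustrative only, not the post's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Wz = rng.normal(size=(d_h, d_in + d_h))   # update-gate weights
Wr = rng.normal(size=(d_h, d_in + d_h))   # reset-gate weights
Wh = rng.normal(size=(d_h, d_in + d_h))   # candidate-state weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h):
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                 # update gate
    r = sigmoid(Wr @ xh)                                 # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))   # candidate state
    return (1.0 - z) * h + z * h_tilde                   # blend old and new

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run over a short random "sequence"
    h = gru_step(x, h)
print(h.shape)  # (4,)
```

Compared with the LSTM's three gates and separate cell state, the GRU uses only two gates and a single state vector, which is why it is lighter.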

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] LSTM - Long Short-Term Memory Model

1. What is the LSTM Model? LSTM stands for Long Short-Term Memory. It was proposed to solve the long-term dependency problem of the RNN (Recurrent Neural Network). Conventional RNN models are useful for learning and predicting temporal and spatial patterns, so they are strong at processing sequential data. However, the long-term dependency problem makes long sequences hard for them to handle. For an explanation of long-term dependency, see the post below. [NLP] Vanilla RNN Model, Lo..
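The cell-state/hidden-state mechanism the post describes can be sketched as one LSTM step: the cell state c carries long-term information, gated by forget, input, and output gates. Random weights, made-up dimensions — a sketch, not the post's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
Wf, Wi, Wo, Wc = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(4))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c):
    xh = np.concatenate([x, h])
    f = sigmoid(Wf @ xh)                    # forget gate: what to drop from c
    i = sigmoid(Wi @ xh)                    # input gate: what new info to write
    o = sigmoid(Wo @ xh)                    # output gate: what to expose as h
    c_new = f * c + i * np.tanh(Wc @ xh)    # cell state update
    return o * np.tanh(c_new), c_new

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```

Because the cell state is updated additively (f * c + ...), gradients can flow over many steps without vanishing as quickly as in a vanilla RNN.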

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] Vanilla RNN Model and the Long-Term Dependency Problem

1. ๊ธฐ๋ณธ RNN ๋ชจ๋ธ (Vanilla RNN Model)์˜ ํ•œ๊ณ„RNN๋ถ€๋ถ„์„ ์„ค๋ช…ํ•œ ๊ธ€์—์„œ ๊ธฐ๋ณธ RNN Model์„ ์•Œ์•„๋ณด๊ณ  ๊ตฌํ˜„ํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค.๋ณดํ†ต RNN Model์„ ๊ฐ€์žฅ ๋‹จ์ˆœํ•œ ํ˜•ํƒœ์˜ RNN ์ด๋ผ๊ณ  ํ•˜๋ฉฐ ๋ฐ”๋‹๋ผ RNN (Vanilla RNN)์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.๊ทผ๋ฐ, Vanilla RNN ๋ชจ๋ธ์— ๋‹จ์ ์œผ๋กœ ์ธํ•˜์—ฌ, ๊ทธ ๋‹จ์ ๋“ค์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•œ ๋‹ค์–‘ํ•œ RNN ๋ณ€ํ˜• Model์ด ๋‚˜์™”์Šต๋‹ˆ๋‹ค.๋Œ€ํ‘œ์ ์œผ๋กœ LSTM, GRU ๋ชจ๋ธ์ด ์žˆ๋Š”๋ฐ, ์ผ๋‹จ ์ด๋ฒˆ๊ธ€์—์„œ๋Š” LSTM Model์— ๋Œ€ํ•œ ์„ค๋ช…์„ ํ•˜๊ณ , ๋‹ค์Œ ๊ธ€์—์„œ๋Š” GRU Model์— ๋Œ€ํ•˜์—ฌ ์„ค๋ช…์„ ํ•˜๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.Vanilla RNN์€ ์ด์ „์˜ ๊ณ„์‚ฐ ๊ฒฐ๊ณผ์— ์˜์กดํ•˜์—ฌ ์ถœ๋ ฅ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด ๋ƒ…๋‹ˆ๋‹ค.์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ Vanilla RNN์€ ์งง์€ Sequence์—๋Š” ํšจ๊ณผ๊ฐ€ ์žˆ์ง€๋งŒ, ๊ธด..

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] RNN (Recurrent Neural Network)

1. RNN ์ด๋ž€?RNN์€ Sequence data๋ฅผ ์ฒ˜๋ฆฌ ํ•˜๊ธฐ ์œ„ํ•œ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ ์ž…๋‹ˆ๋‹ค.์ฃผ๋กœ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ(NLP)๋ฅผ ํฌํ•จํ•œ ์—ฌ๋Ÿฌ Sequence Modeling ์ž‘์—…์—์„œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.ํŠน์ง•์œผ๋กœ๋Š” ์‹œ๊ฐ„์ , ๊ณต๊ฐ„์  ์ˆœ์„œ ๊ด€๊ณ„์— ์˜ํ•˜์—ฌ Context๋ฅผ ๊ฐ€์ง€๋Š” ํŠน์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.๐Ÿ’ก exampleI want to have an apple์ด 'apple'์— ํ•œ๋ฒˆ ์ฃผ๋ชฉํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.์ด apple์ด๋ผ๋Š” ๋‹จ์–ด๋Š” ๋ฌธ๋งฅ์ด ํ˜•์„ฑํ•˜๋Š” ์ฃผ๋ณ€์˜ ๋‹จ์–ด๋“ค์„ ํ•จ๊ป˜ ์‚ดํŽด๋ด์•ผ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.2. RNN์— ๋Œ€ํ•˜์—ฌRNN์˜ ํŠน์ง•์€ ์–ด๋–ค๊ฒƒ์ด ์žˆ์„๊นŒ์š”?RNN์€ ์€๋‹‰์ธต(hidden layer)์˜ node์—์„œ ํ™œ์„ฑํ™” ํ•จ์ˆ˜(activation function)์„ ํ†ตํ•ด ๋‚˜์˜จ ๊ฒฐ๊ณผ๊ฐ’์„ ์ถœ๋ ฅ์ธต ๋ฐฉํ–ฅ์œผ๋กœ ๋ณด๋‚ด๋ฉด์„œ, hidden layer node์˜ ๋‹ค์Œ ๊ณ„์‚ฐ..

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] Seq2Seq, Encoder & Decoder

1. sequence-to-sequence 💡 The Transformer model is a model for carrying out sequence-to-sequence tasks such as machine translation. sequence: an ordered series of items, such as words. Sequence-to-sequence, then, is the task of converting a sequence with certain properties into a sequence with different properties. Sequence-to-sequence uses the many-to-many RNN setting; RNNs will be explained later. 💡 example — machine translation: converting a word sequence in one language (the source language) into another language (the target la..
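The encoder-decoder structure behind Seq2Seq can be sketched minimally: an encoder RNN compresses the source sequence into a context vector (its last hidden state), and a decoder RNN unrolls target tokens from that context. Weights are random and untrained, so the output ids are meaningless — this shows only the data flow, under made-up dimensions.

```python
import numpy as np

rng = np.random.default_rng(3)
d_emb, d_h, vocab_size = 4, 5, 7
Enc_x = rng.normal(size=(d_h, d_emb))
Enc_h = rng.normal(size=(d_h, d_h))
Dec_h = rng.normal(size=(d_h, d_h))
Dec_out = rng.normal(size=(vocab_size, d_h))

def encode(src_embs):
    h = np.zeros(d_h)
    for x in src_embs:
        h = np.tanh(Enc_x @ x + Enc_h @ h)
    return h                                     # the context vector

def decode(context, steps):
    h, out = context, []
    for _ in range(steps):
        h = np.tanh(Dec_h @ h)
        out.append(int(np.argmax(Dec_out @ h)))  # greedy token choice
    return out

src = rng.normal(size=(6, d_emb))   # a source sequence of 6 "embeddings"
out = decode(encode(src), steps=4)  # 4 target-token ids
print(out)
```

Note how the source sequence of length 6 becomes a target sequence of length 4: the two lengths are independent, which is the defining property of the many-to-many Seq2Seq setting.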

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] Pre-Trained Language Model

Pre-Trained Language Model 💡 A language model assigns a probability to a word sequence (it takes a word sequence as input and outputs the probability of how plausible that sequence is). If we write the n-th word in a sentence as $w_n$, the language model gives the probability of that word appearing (Equation 1). e.g., what is the probability that the word "driving" appears after the word "reckless"? → this is a conditional probability. In conditional-probability notation, the outcome event (driving) is written first and the conditioning event (reckless) after it. The conditioning event forms part of the numerator and the whole denominator on the right-hand side — which expresses the idea that the outcome event (driving) is affected by the conditioning event (reckless)..
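The conditional probability P(driving | reckless) can be estimated directly from bigram counts: count how often "reckless" occurs and what fraction of those occurrences are followed by "driving". The toy corpus below is an assumption for illustration.

```python
from collections import Counter

corpus = ("reckless driving is dangerous . "
          "reckless driving causes accidents . "
          "reckless behavior").split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def cond_prob(w, prev):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(cond_prob("driving", "reckless"))  # 2 of 3 "reckless" → "driving"
```

A pre-trained neural language model replaces these raw counts with a learned network, but the quantity it outputs is the same conditional probability.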

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] Tokenization

Tokenization. Step 1: Initialize the Colab notebook. Install the package: !pip install ratsnlp. Mount Google Drive — connect the Google Drive where the vocabulary set built in the tutorial was saved: from google.colab import drive drive.mount('/gdrive', force_remount=True). Step 2: Build the GPT input. To create the GPT model's input, the Byte-level Byte Pair Encoding vocabulary files (`vocab.json`, `merges.txt`) must exist in your Google Drive path (`/gdrive/My Drive/nlpbook/wordpiece`). Run the code below to load the already-built BBPE vocabulary in..
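What the GPT tokenizer does with `merges.txt` can be sketched in miniature: start from characters (or bytes) and apply the learned merge rules in order. The merge list below is hypothetical; a real tokenizer (e.g., via the `transformers` library) loads thousands of learned merges from `merges.txt`.

```python
# Hypothetical learned merges, in learned order.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

def bpe_tokenize(word):
    tokens = list(word)                      # start from single characters
    for a, b in merges:                      # apply merges in learned order
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]    # merge the adjacent pair
            else:
                i += 1
    return tokens

print(bpe_tokenize("lower"))   # ['low', 'er']
```

Frequent subwords end up as single tokens while rare words fall back to smaller pieces, so the vocabulary stays small without any out-of-vocabulary words.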

๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)/๐Ÿ“• Natural Language Processing

[NLP] Building a Vocabulary Set

์–ดํœ˜ ์ง‘ํ•ฉ ๊ตฌ์ถ•ํ•˜๊ธฐ (Vocab) 1๋‹จ๊ณ„: ์‹ค์Šต ํ™˜๊ฒฝ ๋งŒ๋“ค๊ธฐ pip ๋ช…๋ น์–ด๋กœ ํŒจํ‚ค์ง€๋ฅผ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค. !pip install ratsnlp 2๋‹จ๊ณ„: ๊ตฌ๊ธ€ ๋“œ๋ผ์ด๋ธŒ ์—ฐ๋™ํ•˜๊ธฐ from google.colab import drive drive.mount('/gdrive', force_remount=True) 3๋‹จ๊ณ„: ๋ง๋ญ‰์น˜ ๋‹ค์šด๋กœ๋“œ ๋ฐ ์ „์ฒ˜๋ฆฌ ์ฝ”ํฌ๋ผ(Korpora)๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ฅผ ํ™œ์šฉํ•ด BPE ์ˆ˜ํ–‰ ๋Œ€์ƒ ๋ง๋ญ‰์น˜๋ฅผ ๋‚ด๋ ค๋ฐ›๊ณ  ์ „์ฒ˜๋ฆฌ. ์‹ค์Šต์šฉ ๋ง๋ญ‰์น˜๋Š” ๋ฐ•์€์ • ๋‹˜์ด ๊ณต๊ฐœํ•˜์‹  Naver Sentiment Movie Corpus(NSMC)์„ ์‚ฌ์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚ด๋ ค๋ฐ›์•„ `nsmc`๋ผ๋Š” ๋ณ€์ˆ˜๋กœ ์ฝ์–ด๋“ค์ž…๋‹ˆ๋‹ค. from Korpora import Korpora nsmc = Korpora.load("nsmc", force_download..

Bigbread1129
'๐Ÿ“ NLP (์ž์—ฐ์–ด์ฒ˜๋ฆฌ)' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๊ธ€ ๋ชฉ๋ก (2 Page)