[ML] Reinforcement Learning - Q-Learning

What is Reinforcement Learning?

๊ฐ•ํ™” ํ•™์Šต์€ ์—์ด์ „ํŠธ๊ฐ€ ํ™˜๊ฒฝ๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•˜๋ฉด์„œ ๋ณด์ƒ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ํ–‰๋™ ์ •์ฑ…์„ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

The agent's goal is to learn the optimal actions in the given environment so as to maximize its cumulative reward over the long term.

https://www.kdnuggets.com/2022/05/reinforcement-learning-newbies.html

๊ฐ•ํ™” ํ•™์Šต์˜ ๋ชฉ์ 

์ตœ์ ์˜ ํ–‰๋™ ์ •์ฑ… ํ•™์Šต: ์—์ด์ „ํŠธ๊ฐ€ ์ฃผ์–ด์ง„ ํ™˜๊ฒฝ์—์„œ ์ตœ์ ์˜ ํ–‰๋™์„ ์„ ํƒํ•˜์—ฌ ๋ˆ„์  ๋ณด์ƒ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ์ •์ฑ…์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๋ชฉ์ ์ž…๋‹ˆ๋‹ค.


Q-learning

๊ฐ•ํ™”ํ•™์Šต์—์„œ, Q-learning์ด๋ผ๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ๋ฒˆ ์ž์„ธํžˆ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

Q-learning์€ ์ƒํƒœ-ํ–‰๋™ ๊ฐ€์น˜ ํ•จ์ˆ˜(Q-ํ•จ์ˆ˜)๋ฅผ ํ•™์Šตํ•˜์—ฌ ์ตœ์ ์˜ ์ •์ฑ…์„ ์ฐพ๋Š” ๊ฐ•ํ™” ํ•™์Šต ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.

์ด ๋ฐฉ๋ฒ•์€ ์ฃผ์–ด์ง„ ์ƒํƒœ์—์„œ ์–ด๋–ค ํ–‰๋™์„ ์ทจํ•ด์•ผ ํ•˜๋Š”์ง€๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

Q-learning์˜ ์›๋ฆฌ

  1. Initialization
    • Initialize the Q-values for all state-action pairs. Typically the Q-values are initialized to 0.
  2. Agent-environment interaction
    • The agent selects an action in the current state and observes the environment's response (i.e., the reward and the next state).
  3. Q-function update
    • The Q-function is updated with the following rule (a single-step worked example appears after this list):

      Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c

 

  • Where:
    • s is the current state
    • a is the selected action
    • r is the reward received
    • s′ is the next state
    • α is the learning rate
    • γ is the discount factor.
  4. Policy update
    • After the Q-function is updated, the agent selects its next action according to the new Q-function.
  5. Repeat
    • This process is repeated over many episodes or time steps so that the Q-function is optimized and converges.

 

Q-learning์˜ ์ฃผ์š” ๊ตฌ์„ฑ ์š”์†Œ

  1. Learning rate (α)
    • Determines how strongly each update changes the Q-value. The higher the learning rate, the faster the Q-values are updated. It lies in the range 0 < α ≤ 1.
  2. Discount factor (γ)
    • Determines the present value of future rewards. The larger the discount factor, the more weight is given to future rewards. It lies in the range 0 ≤ γ ≤ 1.
  3. Exploration vs. exploitation
    • Exploration: trying new actions to gather more information.
    • Exploitation: selecting the best action known so far.
    • ε-greedy policy: with probability ε (epsilon) the agent explores, and with probability 1 − ε it selects the best-known action (a minimal sketch follows this list).

Q-learning์˜ ์žฅ, ๋‹จ์ 

https://www.researchgate.net/figure/Q-Learning-vs-Deep-Q-Learning_fig1_351884746

Q-learning์˜ ์žฅ์ 

  1. Simplicity: the algorithm is simple and easy to implement.
  2. Model-free, off-policy learning: it does not require a model of the environment, and because it is off-policy it can learn from experience generated by a different behavior policy rather than only from its own current policy.
  3. Generality: it can be applied to a wide variety of reinforcement learning problems.

Q-learning์˜ ๋‹จ์ 

  1. ํฐ ์ƒํƒœ ๊ณต๊ฐ„: ์ƒํƒœ ๊ณต๊ฐ„์ด ํด ๊ฒฝ์šฐ, Q-ํ…Œ์ด๋ธ”์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์ ธ์„œ ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ„์‚ฐ ๋น„์šฉ์ด ํฌ๊ฒŒ ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
  2. ์—ฐ์†์ ์ธ ์ƒํƒœ ๋ฐ ํ–‰๋™ ๊ณต๊ฐ„: Q-learning์€ ์ด์‚ฐ์ ์ธ ์ƒํƒœ ๋ฐ ํ–‰๋™ ๊ณต๊ฐ„์— ์ ํ•ฉํ•˜๋ฉฐ, ์—ฐ์†์ ์ธ ์ƒํƒœ ๋ฐ ํ–‰๋™ ๊ณต๊ฐ„์—์„œ๋Š” ํšจ์œจ์ ์ด์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
  3. ํƒํ—˜-ํ™œ์šฉ ๊ท ํ˜•: ์ ์ ˆํ•œ ฯต\epsilon ๊ฐ’์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ๋„ˆ๋ฌด ๋‚ฎ๊ฑฐ๋‚˜ ๋†’์œผ๋ฉด ํ•™์Šต์ด ๋น„ํšจ์œจ์ ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Q-learning Example Code

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Define the grid world environment
class GridWorld:
    def __init__(self, size):
        # ๊ทธ๋ฆฌ๋“œ์˜ ํฌ๊ธฐ๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
        self.size = size
        # ์ดˆ๊ธฐ ์ƒํƒœ๋ฅผ (0, 0)์œผ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
        self.state = (0, 0)
        # ๋ชฉํ‘œ ์ƒํƒœ๋ฅผ ๊ทธ๋ฆฌ๋“œ์˜ ์˜ค๋ฅธ์ชฝ ์•„๋ž˜ ๋ชจ์„œ๋ฆฌ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
        self.goal = (size-1, size-1)

    def reset(self):
        # ์ƒํƒœ๋ฅผ ์ดˆ๊ธฐ ์ƒํƒœ๋กœ ๋ฆฌ์…‹ํ•ฉ๋‹ˆ๋‹ค.
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # ํ˜„์žฌ ์ƒํƒœ์˜ x, y ์ขŒํ‘œ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.
        x, y = self.state
        # ํ–‰๋™์— ๋”ฐ๋ผ ์ƒˆ๋กœ์šด ์ƒํƒœ๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
        if action == 0:
            x = max(0, x - 1)  # ์œ„๋กœ ์ด๋™
        elif action == 1:
            x = min(self.size - 1, x + 1)  # ์•„๋ž˜๋กœ ์ด๋™
        elif action == 2:
            y = max(0, y - 1)  # ์™ผ์ชฝ์œผ๋กœ ์ด๋™
        elif action == 3:
            y = min(self.size - 1, y + 1)  # ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™

        # ์ƒˆ๋กœ์šด ์ƒํƒœ๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
        self.state = (x, y)
        # ์ƒˆ๋กœ์šด ์ƒํƒœ๊ฐ€ ๋ชฉํ‘œ ์ƒํƒœ์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.
        reward = 1 if self.state == self.goal else -0.1
        done = self.state == self.goal
        # ์ƒˆ๋กœ์šด ์ƒํƒœ, ๋ณด์ƒ, ์™„๋ฃŒ ์—ฌ๋ถ€๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
        return self.state, reward, done
# Q-learning ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •
size = 5  # ๊ทธ๋ฆฌ๋“œ์˜ ํฌ๊ธฐ
env = GridWorld(size)  # ๊ทธ๋ฆฌ๋“œ์›”๋“œ ํ™˜๊ฒฝ ์ƒ์„ฑ
q_table = np.zeros((size, size, 4))  # Q-ํ…Œ์ด๋ธ” ์ดˆ๊ธฐํ™” (์ƒํƒœ-ํ–‰๋™ ๊ฐ€์น˜ ํ•จ์ˆ˜)
alpha = 0.1  # ํ•™์Šต๋ฅ 
gamma = 0.9  # ํ• ์ธ ์ธ์ž
epsilon = 0.1  # ํƒํ—˜ ํ™•๋ฅ 
episodes = 1000  # ํ•™์Šต ์—ํ”ผ์†Œ๋“œ ์ˆ˜
# Q-learning ์•Œ๊ณ ๋ฆฌ์ฆ˜
for episode in range(episodes):
    state = env.reset()  # ์—ํ”ผ์†Œ๋“œ ์‹œ์ž‘ ์‹œ ์ƒํƒœ๋ฅผ ์ดˆ๊ธฐํ™”
    done = False  # ์—ํ”ผ์†Œ๋“œ๊ฐ€ ๋๋‚ฌ๋Š”์ง€ ์—ฌ๋ถ€

    while not done:
        if np.random.rand() < epsilon:
            action = np.random.choice(4)  # explore: choose a random action
        else:
            action = np.argmax(q_table[state[0], state[1]])  # exploit: choose the action with the highest Q-value

        next_state, reward, done = env.step(action)  # take the action in the environment
        q_value = q_table[state[0], state[1], action]  # Q-value of the current state-action pair
        best_next_q_value = np.max(q_table[next_state[0], next_state[1]])  # maximum Q-value in the next state

        # Update the Q-table
        q_table[state[0], state[1], action] = q_value + alpha * (reward + gamma * best_next_q_value - q_value)

        state = next_state  # move on to the next state
# Visualize the Q-table
# The Q-table stores the state-action value function used by the Q-learning algorithm.
# Each entry represents the expected return of taking a particular action in a particular state.
plt.figure(figsize=(10, 7))
sns.heatmap(np.max(q_table, axis=2), annot=True, cmap='viridis')
plt.title('Q-Table')
plt.xlabel('State (y)')
plt.ylabel('State (x)')
plt.show()
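
Once training has finished, the learned Q-table can also be read off as a greedy policy by taking the action with the highest Q-value in each state. The short sketch below assumes the q_table and size variables from the code above and uses arrow characters only as an illustrative rendering of the four actions.

# Derive the greedy policy from the learned Q-table:
# in each state, pick the action with the highest Q-value (0: up, 1: down, 2: left, 3: right).
arrows = ['↑', '↓', '←', '→']
policy = np.argmax(q_table, axis=2)
for x in range(size):
    print(' '.join(arrows[policy[x, y]] for y in range(size)))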