[Data Mining] Gradient Descent (๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•)

Gradient Descent (๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•)

๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•(Gradient Descent)์€ ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜๋กœ, ์ฃผ์–ด์ง„ ํ•จ์ˆ˜์˜ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋จธ์‹  ๋Ÿฌ๋‹๊ณผ ๋”ฅ ๋Ÿฌ๋‹ ๋ชจ๋ธ์˜ ํ•™์Šต ๊ณผ์ •์—์„œ ์ฃผ๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์˜ ์•„์ด๋””์–ด (The Idea Behind Gradient Descent)

  • ์šฐ๋ฆฌ๋Š” ์ข…์ข… ํ•จ์ˆ˜ ๐‘“๋ฅผ ์ตœ๋Œ€ํ™”(๋˜๋Š” ์ตœ์†Œํ™”)ํ•ด์•ผ ํ•  ํ•„์š”๊ฐ€ ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ฆ‰, ์šฐ๋ฆฌ๋Š” ๊ฐ€๋Šฅํ•œ ๊ฐ€์žฅ ์ž‘์€(๋˜๋Š” ๊ฐ€์žฅ ํฐ) ๊ฐ’์„ ์ƒ์„ฑํ•˜๋Š” ์ž…๋ ฅ v๋ฅผ ์ฐพ์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ์ด๋•Œ, ํ•จ์ˆ˜ ๐‘“๋ฅผ ์ตœ๋Œ€ํ™”(๋˜๋Š” ์ตœ์†Œํ™”)ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๊ฐ€๋Šฅํ•œ ๊ฐ€์žฅ ์ž‘์€(๋˜๋Š” ๊ฐ€์žฅ ํฐ) ๊ฐ’์„ ๋งŒ๋“œ๋Š” ์ž…๋ ฅ ๐‘ฃ๋ฅผ ์ฐพ์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๊ฒƒ์€ ๋งŽ์€ ๋ฌธ์ œ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์ผ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋น„์šฉ ํ•จ์ˆ˜(cost function)๋ฅผ ์ตœ์†Œํ™”ํ•˜์—ฌ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ตœ์ ํ™”ํ•˜๊ฑฐ๋‚˜,
  • ์ด์ต ํ•จ์ˆ˜(profit function)๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜์—ฌ ๋น„์ฆˆ๋‹ˆ์Šค์˜ ์ˆ˜์ต์„ ๊ทน๋Œ€ํ™”ํ•˜๋Š” ๋“ฑ์˜ ๊ฒฝ์šฐ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค.
์ˆ˜์‹์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜๋””์–ธํŠธ ∇๐‘“(ํŽธ๋ฏธ๋ถ„์˜ ๋ฒกํ„ฐ)๋Š” ํ•จ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋น ๋ฅด๊ฒŒ ๊ฐ์†Œํ•˜๊ฑฐ๋‚˜ ์ฆ๊ฐ€ํ•˜๋Š” ์ž…๋ ฅ ๋ฐฉํ–ฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

 

  • ๊ทธ๋ž˜๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ) ๋ฒกํ„ฐ์ธ ∇๐‘“๋Š” ํ•จ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋น ๋ฅด๊ฒŒ ๊ฐ์†Œํ•˜๊ฑฐ๋‚˜ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฐฉํ–ฅ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๊ฐ ์š”์†Œ๋Š” ํ•ด๋‹น ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ํŽธ๋ฏธ๋ถ„ ๊ฐ’์ž…๋‹ˆ๋‹ค.
  • ํ•จ์ˆ˜๊ฐ€ ํ˜„์žฌ ์œ„์น˜์—์„œ ๊ฐ€์žฅ ๊ธ‰๊ฒฉํ•˜๊ฒŒ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฐฉํ–ฅ์„ ๋”ฐ๋ฅด๋ฉด, ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฒกํ„ฐ๋Š” ์–‘์ˆ˜ ๊ฐ’์ž…๋‹ˆ๋‹ค.
  • ๋ฐ˜๋Œ€๋กœ ํ•จ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๊ธ‰๊ฒฉํ•˜๊ฒŒ ๊ฐ์†Œํ•˜๋Š” ๋ฐฉํ–ฅ์„ ๋”ฐ๋ฅด๋ฉด, ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฒกํ„ฐ๋Š” ์Œ์ˆ˜ ๊ฐ’์ž…๋‹ˆ๋‹ค.
๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์—์„œ๋Š” ํ˜„์žฌ ์œ„์น˜์—์„œ ๊ทธ๋ž˜๋””์–ธํŠธ์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ ์ด๋™ํ•˜์—ฌ ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™” ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜์‹

  • ๐‘“์˜ ๊ตฌ๋ฐฐ๋Š” ์ตœ์†Œ ๋˜๋Š” ์ตœ๋Œ€ 0 : ∇๐‘“(๐ฏ)=0
  • ํ•จ์ˆ˜ ๐‘“์˜ ๊ทน์†Œ์  ๋˜๋Š” ๊ทน๋Œ€์ ์—์„œ ๊ทธ๋ž˜๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ)๊ฐ€ 0์ž„์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
  • ์ฆ‰, ํ•จ์ˆ˜๊ฐ€ ๊ทน์†Œ์  ๋˜๋Š” ๊ทน๋Œ€์ ์— ๋„๋‹ฌํ•˜๋ฉด ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฒกํ„ฐ์˜ ๋ชจ๋“  ์š”์†Œ๊ฐ€ 0์ด ๋ฉ๋‹ˆ๋‹ค.
  • ์ด๋Š” ํ•จ์ˆ˜๊ฐ€ ๊ทน์†Œ์ (์ตœ์†Œ๊ฐ’)์ด๋‚˜ ๊ทน๋Œ€์ (์ตœ๋Œ€๊ฐ’)์— ๋„๋‹ฌํ–ˆ์„ ๋•Œ, ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๊ฐ€ ๋” ์ด์ƒ ์ฆ๊ฐ€ํ•˜์ง€ ์•Š๊ฑฐ๋‚˜ ๊ฐ์†Œํ•˜์ง€ ์•Š์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ๋”ฐ๋ผ์„œ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ์ด๋Ÿฌํ•œ ์ ์€ ์ตœ์ ํ™”๋œ ์ ์ด๋ผ๊ณ  ๊ฐ„์ฃผ๋ฉ๋‹ˆ๋‹ค.


Minimize or Maximize (์ตœ์†Œํ™” ๋˜๋Š” ์ตœ๋Œ€ํ™”)

ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  1. ์ž„์˜์˜ ์‹œ์ž‘์ ์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.
  2. ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  3. ๊ทธ๋ž˜๋””์–ธํŠธ์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ ์ž‘์€ ๋‹จ๊ณ„๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.
  4. ์ƒˆ๋กœ์šด ์‹œ์ž‘์ ์œผ๋กœ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜๋””์–ธํŠธ๋Š” ํ•จ์ˆ˜๋ฅผ ๊ฐ€์žฅ ๋งŽ์ด ๊ฐ์†Œ์‹œํ‚ค๋Š” ๋ฐฉํ–ฅ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ๋น„์Šทํ•˜๊ฒŒ, ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๊ทธ๋ž˜๋””์–ธํŠธ ๋ฐฉํ–ฅ์œผ๋กœ ์ž‘์€ ๋‹จ๊ณ„๋ฅผ ์ทจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Using mathematical notation (์ˆ˜ํ•™์  ํ‘œ๊ธฐ๋ฒ• ์‚ฌ์šฉํ•˜๊ธฐ)

  • 적절한 학습률을 사용하면 아래 반복식은 함수 값이 단조 감소(monotonic decreasing)하는 수열을 만듭니다.
X๐ง+1 = X๐ง − ๐›พ๐ง ∇ ๐Ÿ(๐ฑ๐ง)
  • x(n)์€ ๐‘›๋ฒˆ์งธ ๋ฐ˜๋ณต์—์„œ์˜ ํ˜„์žฌ ์œ„์น˜, γn ์€ ํ•™์Šต๋ฅ (learning rate),
  • ∇ ๐Ÿ(๐ฑ(๐ง))์€ ํ˜„์žฌ ์œ„์น˜์—์„œ์˜ ํ•จ์ˆ˜ ๐‘“ ์˜ ๊ทธ๋ž˜๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ)์ž…๋‹ˆ๋‹ค.
  • ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ ์œ„์น˜์ธ ๐‘ฅ(๐‘›+1)์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
์ ์  ์ž‘์•„์ง€๋Š” Sequence → Minimun์— ๋„๋‹ฌ.. (์–ธ์  ๊ฐ„..?)
f(x0) ≥ f(x1)≥… ์™€ ๊ฐ™์ด ํ˜„์žฌ ์œ„์น˜์—์„œ์˜ ํ•จ์ˆ˜ ๊ฐ’์ด ๊ฐ ๋ฐ˜๋ณต์—์„œ ์ ์ฐจ ๊ฐ์†Œํ•ฉ๋‹ˆ๋‹ค.
์ฆ‰ ์ด๋Š” ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์ด ํ•จ์ˆ˜์˜ ์ตœ์†Œ๊ฐ’์— ์ ์ง„์ ์œผ๋กœ ์ˆ˜๋ ดํ•จ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.
  • ๋˜ํ•œ ์šฐ๋ฆฌ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฒฝ์‚ฌ ํ•˜๊ฐ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ์„ ์ฐพ์Šต๋‹ˆ๋‹ค
    • ์†์‹ค, ๋น„์šฉ ๋˜๋Š” ์˜ค๋ฅ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฐ€์ค‘์น˜(ํšŒ๊ท€, ์‹ ๊ฒฝ๋ง) ๋˜๋Š”
    • ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” Parameter(MLE: ์ตœ๋Œ€ ์šฐ๋„ ์ถ”์ •)

  • ๋งŒ์•ฝ ์ž…๋ ฅ์ด 2์ฐจ์›์ธ ๊ฒฝ์šฐ (When input is two dimemsional)

  • Gradient Descent(๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•)์˜ "Zig-Zagging" ํŠน์„ฑ

from collections import Counter
from linear_algebra import distance, vector_subtract, scalar_multiply
from functools import reduce
import math, random

 

  • sum_of_squares ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
def sum_of_squares(v):
    """v ๋‚ด ์š”์†Œ๋“ค์˜ ์ œ๊ณฑํ•ฉ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค."""
    return sum(v_i ** 2 for v_i in v)
  • sum_of_squares๋Š” ํŒŒ์ด์ฌ ๋ฆฌ์ŠคํŠธ๋‚˜ ๋ฐฐ์—ด v ์•ˆ์˜ ๊ฐ ์š”์†Œ๋“ค์˜ ์ œ๊ณฑ์„ ๋ชจ๋‘ ๋”ํ•˜๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด, v๊ฐ€ [1, 2, 3]์ด๋ผ๋ฉด, ์ด ํ•จ์ˆ˜๋Š” (1^2 + 2^2 + 3^2 = 14)๋ฅผ ๋ฐ˜ํ™˜ํ•  ๊ฒƒ.
import numpy as np

def sum_of_squares_np(v):
    """v 내 요소들의 제곱합을 계산합니다. - Numpy Version."""
    return np.sum(v * v)
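
아래는 두 함수가 같은 값을 돌려주는지 확인해 보는 간단한 사용 예시입니다 (위 정의들이 이미 실행되었다고 가정합니다).

print(sum_of_squares([1, 2, 3]))               # 14
print(sum_of_squares_np(np.array([1, 2, 3])))  # 14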

๊ทธ๋ž˜๋””์–ธํŠธ ์ถ”์ • (Estimating the Gradient)

f๊ฐ€ ํ•œ ๋ณ€์ˆ˜์˜ ํ•จ์ˆ˜๋ผ๋ฉด, ์  x์—์„œ์˜ ๋„ํ•จ์ˆ˜๋Š” ์šฐ๋ฆฌ๊ฐ€ x๋กœ ์•„์ฃผ ์ž‘์€ ๋ณ€ํ™”๋ฅผ ๋งŒ๋“ค ๋•Œ f(x)๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค.

def difference_quotient(f, x, h):
    # ํ•จ์ˆ˜ f์˜ x์—์„œ์˜ ์ฐจ๋ถ„ ๊ทผ์‚ฌ๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    return (f(x + h) - f(x)) / h
  • ์ฃผ์–ด์ง„ ํ•จ์ˆ˜ f์— ๋Œ€ํ•ด, ํŠน์ • ์  x์—์„œ์˜ ๋„ํ•จ์ˆ˜(๋ฏธ๋ถ„๊ฐ’)๋ฅผ ๊ทผ์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  • ํ•จ์ˆ˜ f, ์  x, ๊ทธ๋ฆฌ๊ณ  ์•„์ฃผ ์ž‘์€ ๊ฐ’ h๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„, f์˜ x์—์„œ์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ทผ์‚ฌํ™”
  • ํ•จ์ˆ˜์˜ ์ •์˜์— ๋”ฐ๋ผ, (f(x + h) - f(x)) / h ๊ณ„์‚ฐ์€ x์—์„œ h๋งŒํผ ๋ณ€ํ™”ํ–ˆ์„ ๋•Œ, ํ•จ์ˆ˜ f์˜ ์ถœ๋ ฅ๊ฐ’์ด ์–ผ๋งˆ๋‚˜ ๋ณ€ํ•˜๋Š”์ง€๋ฅผ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋Š” x์—์„œ ํ•จ์ˆ˜ f์˜ ์ฆ‰์‹œ ๋ณ€ํ™”์œจ์ด๋‚˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ทผ์‚ฌํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผํ•ฉ๋‹ˆ๋‹ค. h๊ฐ€ 0์— ๊ฐ€๊นŒ์›Œ์งˆ์ˆ˜๋ก, ์ด ๊ทผ์‚ฌ๊ฐ’์€ f์˜ x์—์„œ์˜ ์‹ค์ œ ๋„ํ•จ์ˆ˜ ๊ฐ’์— ๊ฐ€๊นŒ์›Œ์ง‘๋‹ˆ๋‹ค.
  • ๋„ํ•จ์ˆ˜๋Š” ํ•จ์ˆ˜์˜ ๋ณ€ํ™”์œจ์„ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ํ•จ์ˆ˜ ๊ทธ๋ž˜ํ”„ ์ƒ์˜ ํŠน์ • ์ ์—์„œ์˜ ์ ‘์„ ์˜ ๊ธฐ์šธ๊ธฐ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค.

 

def plot_estimated_derivative():
    # ์‹ค์ œ ํ•จ์ˆ˜ ์ •์˜
    def square(x):
        """x์˜ ์ œ๊ณฑ์„ ๋ฐ˜ํ™˜"""
        return x * x

    # ์‹ค์ œ ๋„ํ•จ์ˆ˜(๋ฏธ๋ถ„๊ฐ’) ์ •์˜
    def derivative(x):
        """square ํ•จ์ˆ˜์˜ ๋„ํ•จ์ˆ˜์ธ 2x๋ฅผ ๋ฐ˜ํ™˜"""
        return 2 * x

    # ๊ทผ์‚ฌ ๋„ํ•จ์ˆ˜ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋žŒ๋‹ค ํ•จ์ˆ˜
    derivative_estimate = lambda x: difference_quotient(square, x, h=10)

    # ์‹ค์ œ ๋„ํ•จ์ˆ˜์™€ ๊ทผ์‚ฌ ๋„ํ•จ์ˆ˜๋ฅผ ๊ทธ๋ž˜ํ”„๋กœ ๋น„๊ต
    import matplotlib.pyplot as plt
    x = range(-10, 10)
    
    # ์‹ค์ œ ๋„ํ•จ์ˆ˜ ๊ทธ๋ž˜ํ”„ (๋นจ๊ฐ„์ƒ‰ x๋กœ ํ‘œ์‹œ)
    plt.plot(x, list(map(derivative, x)), 'rx', label='Actual')           
    # ๊ทผ์‚ฌ ๋„ํ•จ์ˆ˜ ๊ทธ๋ž˜ํ”„ (ํŒŒ๋ž€์ƒ‰ +๋กœ ํ‘œ์‹œ)
    plt.plot(x, list(map(derivative_estimate, x)), 'b+', label='Estimate')  
    # ๋ฒ”๋ก€ ์œ„์น˜ ์„ค์ • (9๋Š” ์ƒ๋‹จ ์ค‘์•™์„ ์˜๋ฏธ)
    plt.legend(loc=9)
    
    plt.title('Actual vs Estimate')
    plt.show()                                      

  • square ํ•จ์ˆ˜๋Š” ์ž…๋ ฅ๊ฐ’์˜ ์ œ๊ณฑ์„ ๋ฐ˜ํ™˜ํ•˜๋ฉฐ, ์ด ํ•จ์ˆ˜์˜ ์‹ค์ œ ๋„ํ•จ์ˆ˜๋Š” 2x์ž…๋‹ˆ๋‹ค.
  • ๊ทผ์‚ฌ ๋„ํ•จ์ˆ˜๋Š” difference_quotient ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ„์‚ฐ๋˜๋ฉฐ, ์—ฌ๊ธฐ์„œ h๋Š” ๊ทผ์‚ฌ ๊ณ„์‚ฐ์— ์‚ฌ์šฉ๋˜๋Š” ๋งค์šฐ ์ž‘์€ ๊ฐ’์ž…๋‹ˆ๋‹ค.

 

%matplotlib inline

plot_estimated_derivative()

  • ์ž๋™ ๋ฏธ๋ถ„์€ ๊ทธ๋ผ๋ฐ์ด์…˜์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. (automatic differentiation computes the gradient correctly)
!pip install autograd
# autograd ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ž๋™ ๋ฏธ๋ถ„์„ ์ˆ˜ํ–‰ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.
import autograd.numpy as np
from autograd import grad

# f๋ผ๋Š” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์ž…๋ ฅ x์— ๋Œ€ํ•ด x^2์˜ ๊ฐ’์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
def f(x):
    return x * x

# f ํ•จ์ˆ˜์˜ ๋„ํ•จ์ˆ˜(๋ฏธ๋ถ„ ํ•จ์ˆ˜)๋ฅผ ์ž๋™์œผ๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
df_dx = grad(f)

# ์ƒ์„ฑ๋œ ๋„ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ x=5.0์ผ ๋•Œ์˜ ๋„ํ•จ์ˆ˜ ๊ฐ’(๋ฏธ๋ถ„ ๊ฐ’)์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
# f(x) = x^2์˜ ๋„ํ•จ์ˆ˜๋Š” 2x์ด๋ฏ€๋กœ, x=5์ผ ๋•Œ ๋„ํ•จ์ˆ˜์˜ ๊ฐ’์€ 10์ž…๋‹ˆ๋‹ค.
print(df_dx(5.0))

# ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ
# ๋จธ์‹ ๋Ÿฌ๋‹๊ณผ ๋”ฅ๋Ÿฌ๋‹์—์„œ ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋กœ ๋งŽ์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
def sigmoid(x):
    return 1./(1. + np.exp(-x))

# ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜์˜ ๋„ํ•จ์ˆ˜๋ฅผ ์ž๋™์œผ๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
dsigmoid_dx = grad(sigmoid)

# ์ƒ์„ฑ๋œ ๋„ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ x=0์ผ ๋•Œ์˜ ๋„ํ•จ์ˆ˜ ๊ฐ’(๋ฏธ๋ถ„ ๊ฐ’)์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
# ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„ ๊ฐ’์€ sigmoid(x) * (1 - sigmoid(x))์ด๊ณ ,
# x=0์ผ ๋•Œ, ์ด ๊ฐ’์€ 0.25์ž…๋‹ˆ๋‹ค.
print(dsigmoid_dx(0.))

# ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜์˜ x=0์ผ ๋•Œ์˜ ํ•จ์ˆ˜ ๊ฐ’ ์ž์ฒด๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
# ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜๋Š” x=0์—์„œ 0.5์˜ ๊ฐ’์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค.
print(sigmoid(0))
10.0
0.25
0.5
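
autograd가 계산한 미분값을 해석적 도함수 sigmoid(x) * (1 - sigmoid(x))와 비교해 보는 간단한 확인 스케치입니다 (위에서 정의한 sigmoid, dsigmoid_dx가 이미 실행되었다고 가정합니다).

t = 2.0
print(dsigmoid_dx(t))                 # autograd가 계산한 미분값
print(sigmoid(t) * (1 - sigmoid(t)))  # 해석적 도함수 -> 두 값이 (수치 오차 범위에서) 같아야 함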

 

import autograd.numpy as np
from autograd import grad

# ํ•จ์ˆ˜ ์ •์˜: h(x, y) = x^2 + y
def h(x, y):
    return x**2 + y

# x์— ๋Œ€ํ•œ h์˜ ํŽธ๋ฏธ๋ถ„ ํ•จ์ˆ˜ ์ƒ์„ฑ
dh_dx = grad(h, argnum=0)
# y์— ๋Œ€ํ•œ h์˜ ํŽธ๋ฏธ๋ถ„ ํ•จ์ˆ˜ ์ƒ์„ฑ
dh_dy = grad(h, argnum=1)

# x=2, y=2์—์„œ์˜ ํŽธ๋ฏธ๋ถ„ ๊ฐ’ ๊ณ„์‚ฐ ๋ฐ ๋ฐฐ์—ด๋กœ ๋ฐ˜ํ™˜
np.array([dh_dx(2., 2.), dh_dy(2., 2.)])
array([4., 1.])

 

  • ์ž๋™ ๋ฏธ๋ถ„์„ ์‚ฌ์šฉํ•˜์—ฌ x=0.5์—์„œ f(x) = exp(2x) / (1 + sin(x^2))
ํ•จ์ˆ˜๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ผ๋ จ์˜ ๊ธฐ๋ณธ ์—ฐ์‚ฐ์œผ๋กœ ๋ถ„ํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Step 1: Let z1 = 2x
Step 2: Let z2 = exp(z1)
Step 3: Let z3 = x^2
Step 4: Let z4 = sin(z3)
Step 5: Let z5 = 1 + z4
Step 6: Let z6 = z2 / z5

๋‹ค์‹œ ๊ฐ ๋‹จ๊ณ„์—์„œ x์— ๋Œ€ํ•œ ํ•จ์ˆ˜์™€ ๊ทธ ๋„ํ•จ์ˆ˜์˜ ๊ฐ’์„ ์ถ”์ ํ•ฉ๋‹ˆ๋‹ค.
๋จผ์ € ๊ทธ ์ž์ฒด์— ๋Œ€ํ•œ ์ถœ๋ ฅ์˜ ๋„ํ•จ์ˆ˜๋ฅผ 1๋กœ ์„ค์ •ํ•œ ๋‹ค์Œ ์—ฐ์‡„๋ฒ•์น™์„ ์‚ฌ์šฉํ•˜์—ฌ
x์— ๋Œ€ํ•œ ๊ฐ ์ค‘๊ฐ„๋ณ€์ˆ˜์˜ ๋„ํ•จ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

dz1/dx = 2
dz2/dx = dz1/dx * exp(z1)
dz3/dx = 2x
dz4/dx = cos(z3) * dz3/dx
dz5/dx = dz4/dx  
dz6/dx = (dz2/dx * z5 - z2 * dz5/dx) / z5^2
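
위의 단계별 분해를 x = 0.5에서 직접 계산해 보고 autograd 결과와 비교하는 스케치입니다 (함수 이름 f_ad는 설명을 위해 가정한 이름입니다).

import autograd.numpy as np
from autograd import grad

x = 0.5
z1 = 2 * x;          dz1 = 2.0
z2 = np.exp(z1);     dz2 = dz1 * np.exp(z1)
z3 = x ** 2;         dz3 = 2 * x
z4 = np.sin(z3);     dz4 = np.cos(z3) * dz3
z5 = 1 + z4;         dz5 = dz4
z6 = z2 / z5;        dz6 = (dz2 * z5 - z2 * dz5) / z5 ** 2   # 몫의 미분법

f_ad = lambda t: np.exp(2 * t) / (1 + np.sin(t ** 2))
print(dz6)            # 수작업 연쇄법칙 결과
print(grad(f_ad)(x))  # autograd 결과 -> 두 값이 같아야 함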

Numerical gradient

  • f๊ฐ€ ๋งŽ์€ ๋ณ€์ˆ˜์˜ ํ•จ์ˆ˜์ผ ๋•Œ, f๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํŽธ๋ฏธ๋ถ„์„ ๊ฐ€์ง€๋ฉฐ, ๊ฐ๊ฐ์€ ์ž…๋ ฅ ๋ณ€์ˆ˜ ์ค‘ ํ•˜๋‚˜์—์„œ ์ž‘์€ ๋ณ€ํ™”๋ฅผ ๋งŒ๋“ค ๋•Œ f๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ์šฐ๋ฆฌ๋Š” ๋‹ค๋ฅธ ๋ณ€์ˆ˜๋“ค์„ ๊ณ ์ •ํ•œ ์ฑ„๋กœ i๋ฒˆ์งธ ๋ณ€์ˆ˜๋งŒ์˜ ํ•จ์ˆ˜๋กœ ์ทจ๊ธ‰ํ•˜์—ฌ i๋ฒˆ์งธ ํŽธ๋ฏธ๋ถ„์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
def partial_difference_quotient(f, v, i, h):
    # ์ฃผ์–ด์ง„ ํ•จ์ˆ˜ f์˜ i๋ฒˆ์งธ ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ์ค‘์•™ ์ฐจ๋ถ„๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ํŽธ๋ฏธ๋ถ„์„ ๊ทผ์‚ฌํ•˜๋Š” ํ•จ์ˆ˜
    
    # v์—์„œ i๋ฒˆ์งธ ๋ณ€์ˆ˜์—๋งŒ h๋ฅผ ๋”ํ•ฉ๋‹ˆ๋‹ค.
    w = [v_j + (h if j == i else 0)
         for j, v_j in enumerate(v)]
    
    # w์™€ v์—์„œ์˜ ํ•จ์ˆ˜ ๊ฐ’์˜ ์ฐจ์ด๋ฅผ h๋กœ ๋‚˜๋ˆ„์–ด ํŽธ๋ฏธ๋ถ„์„ ๊ทผ์‚ฌํ•ฉ๋‹ˆ๋‹ค.
    return (f(w) - f(v)) / h
  • ํ•จ์ˆ˜ f์˜ i๋ฒˆ์งธ ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ํŽธ๋ฏธ๋ถ„์„ ๊ทผ์‚ฌํ•˜๋Š” ์ค‘์•™ ์ฐจ๋ถ„๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ๐‘ฃ์—์„œ์˜ i๋ฒˆ์งธ ๋ณ€์ˆ˜์—๋งŒ ์ž‘์€ ๋ณ€ํ™” h๋ฅผ ์ถ”๊ฐ€ํ•œ w๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
  • w์™€ v์—์„œ์˜ ํ•จ์ˆ˜ ๊ฐ’์˜ ์ฐจ์ด๋ฅผ h๋กœ ๋‚˜๋ˆ„์–ด ํŽธ๋ฏธ๋ถ„์„ ๊ทผ์‚ฌํ•ฉ๋‹ˆ๋‹ค.
def estimate_gradient(f, v, h=0.00001):
    # ์ฃผ์–ด์ง„ ํ•จ์ˆ˜ f์˜ ๊ทธ๋ž˜๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ)๋ฅผ ์ค‘์•™ ์ฐจ๋ถ„๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ทผ์‚ฌํ•˜๋Š” ํ•จ์ˆ˜
    
    # ๊ฐ ๋ณ€์ˆ˜์— ๋Œ€ํ•ด partial_difference_quotient ํ•จ์ˆ˜(์ค‘์•™ ์ฐจ๋ถ„๋ฒ•)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ณ„์‚ฐ.
    return [partial_difference_quotient(f, v, i, h)
            for i, _ in enumerate(v)]
estimate_gradient(sum_of_squares, [1,1,1])
[2.00001000001393, 2.00001000001393, 2.00001000001393]

 

import numpy as np

def estimate_gradient_np(f, v, h=0.00001):
    # f: ๋ฏธ๋ถ„ํ•˜๊ณ ์ž ํ•˜๋Š” ๋‹ค๋ณ€์ˆ˜ ํ•จ์ˆ˜
    # v: ํ•จ์ˆ˜ f์— ๋Œ€ํ•œ ์ž…๋ ฅ ๋ฒกํ„ฐ
    # h: ๋ฏธ๋ถ„ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ์•„์ฃผ ์ž‘์€ ๊ฐ’, ๊ธฐ๋ณธ๊ฐ’์€ 0.00001

    # np.eye(v.shape[0])๋Š” v์˜ ์ฐจ์›์— ๋งž๋Š” ๋‹จ์œ„ํ–‰๋ ฌ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    # v + h * np.eye(v.shape[0])๋Š” ๊ฐ ๋ณ€์ˆ˜์— ๋Œ€ํ•ด h๋งŒํผ ์ฆ๊ฐ€์‹œํ‚จ ์ƒˆ๋กœ์šด ๋ฒกํ„ฐ๋“ค์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    # np.apply_along_axis๋Š” ์ƒ์„ฑ๋œ ๊ฐ ๋ฒกํ„ฐ์— ๋Œ€ํ•ด ํ•จ์ˆ˜ f๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.
    gradients = np.apply_along_axis(f, 1, v + h * np.eye(v.shape[0])) - f(v)

    # ๊ณ„์‚ฐ๋œ ์ฐจ์ด๋ฅผ h๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ํ•จ์ˆ˜ f์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค.
    return gradients / h
v + h * np.eye(v.shape[0]) → NumPy 브로드캐스팅으로, i번째 행이 v의 i번째 요소에만 h를 더한 벡터인 행렬을 만듭니다.
estimate_gradient_np(lambda v: np.sum(v * v), np.array([1.,1.,1]))

# Result: array([2.00001, 2.00001, 2.00001])
  • ๋‹จ์œ„ํ–‰๋ ฌ์˜ ์‚ฌ์šฉ: np.eye(v.shape[0])์„ ํ†ตํ•ด ์ƒ์„ฑ๋œ ๋‹จ์œ„ํ–‰๋ ฌ์€, ๊ฐ ๋ณ€์ˆ˜์— ๋…๋ฆฝ์ ์œผ๋กœ h๋งŒํผ์˜ ๋ณ€ํ™”๋ฅผ ์ฃผ๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•จ์œผ๋กœ์จ, ๊ฐ ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ํ•จ์ˆ˜์˜ ๋ณ€ํ™”์œจ(๋ฏธ๋ถ„๊ณ„์ˆ˜)์„ ๋…๋ฆฝ์ ์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ํ•จ์ˆ˜ ์ ์šฉ๊ณผ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ: np.apply_along_axis ํ•จ์ˆ˜๋Š” ๋ณ€ํ˜•๋œ ๋ฒกํ„ฐ ๊ฐ๊ฐ์— ๋Œ€ํ•ด ์›๋ž˜ ํ•จ์ˆ˜ f๋ฅผ ์ ์šฉํ•˜๊ณ , ์ด๋ฅผ ํ†ตํ•ด ์–ป์–ด์ง„ ํ•จ์ˆ˜ ๊ฐ’์˜ ๋ณ€ํ™”๋ฅผ h๋กœ ๋‚˜๋ˆ”์œผ๋กœ์จ, ๊ฐ ๋ณ€์ˆ˜๋ณ„๋กœ ํ•จ์ˆ˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค.
  • V * V = V์˜ Dot Product → (V1^2 + V2^2 + V3^2 / 2V1 * 2V2 * 2V3 ~ th)

Using the Gradient

  • sum_of_squares ํ•จ์ˆ˜๋Š” ์ž…๋ ฅ v๊ฐ€ 0์˜ ๋ฒกํ„ฐ์ผ ๋•Œ ๊ฐ€์žฅ ์ž‘์Šต๋‹ˆ๋‹ค.
  • ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์‚ฌ์‹ค์„ ํ™•์ธํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.
  • (๊ธฐ์šธ๊ธฐ ํ•˜๊ฐ•๋ฒ•) ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋“  3์ฐจ์› ๋ฒกํ„ฐ ์ค‘ ์ตœ์†Œ๊ฐ’์„ ์ฐพ์•„๋ด…์‹œ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ž„์˜์˜ ์‹œ์ž‘์ ์„ ์„ ํƒํ•œ ๋‹ค์Œ ๊ทธ๋ž˜๋””์–ธํŠธ๊ฐ€ ๋งค์šฐ ์ž‘์€ ์ง€์ ์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ๊ทธ๋ž˜๋””์–ธํŠธ์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ ์•„์ฃผ ์ž‘์€ ๋‹จ๊ณ„๋ฅผ ๋ฐŸ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค
def step(v, direction, step_size):
    """move step_size in the direction from v"""
    return [v_i + step_size * direction_i
            for v_i, direction_i in zip(v, direction)]
  • step ํ•จ์ˆ˜๋Š” ์ฃผ์–ด์ง„ ๋ฐฉํ–ฅ direction์œผ๋กœ step_size๋งŒํผ ์ด๋™ํ•˜๋Š” ๊ธฐ๋Šฅ
  • v: ํ˜„์žฌ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ.
  • direction: ์ด๋™ํ•˜๊ณ ์ž ํ•˜๋Š” ๋ฐฉํ–ฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ. ์ด ๋ฐฉํ–ฅ์€ ๋ณดํ†ต ๋ชฉํ‘œ ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ์— ์˜ํ•ด ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค.
  • step_size: ํ•œ ๋ฒˆ์— ์ด๋™ํ•  ๊ฑฐ๋ฆฌ. ์ฆ‰, ์ด๋™์˜ ํฌ๊ธฐ๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ˜ํ™˜๊ฐ’: ์ƒˆ๋กœ์šด ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ.
def sum_of_squares_gradient(v):
    return [2 * v_i for v_i in v]
  • ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ v์— ๋Œ€ํ•ด sum_of_squares ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐ
  • ๋งค๊ฐœ๋ณ€์ˆ˜:v: ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ ์ž ํ•˜๋Š” ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ.๋ฐ˜ํ™˜๊ฐ’:
  • sum_of_squares ํ•จ์ˆ˜์˜ v์—์„œ์˜ ๊ธฐ์šธ๊ธฐ ๋ฒกํ„ฐ. ๊ฐ ์š”์†Œ๋Š” 2 * v_i๋กœ, sum_of_squares ํ•จ์ˆ˜๋Š” ๊ฐ ๋ณ€์ˆ˜์˜ ์ œ๊ณฑ์˜ ํ•ฉ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๊ทธ ๋ฏธ๋ถ„๊ฐ’์€ 2 * v_i๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
  • sum_of_squares_gradient(v) → [2*v1, 2*v2, …, 2*vn] : 그래디언트 함수입니다.
  • ๋น„๊ณ : step_size๋Š” learning_rate๋ผ๊ณ ๋„ ํ•ฉ๋‹ˆ๋‹ค.
def step(v, direction, step_size):
    """v์—์„œ direction ๋ฐฉํ–ฅ์œผ๋กœ step_size๋งŒํผ ์ด๋™"""
    return [v_i + step_size * direction_i
            for v_i, direction_i in zip(v, direction)]

def sum_of_squares_gradient(v):
    """v์˜ ์ œ๊ณฑํ•ฉ ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ(gradient) ๊ณ„์‚ฐ"""
    return [2 * v_i for v_i in v]

 

step ํ•จ์ˆ˜

  • ํ˜„์žฌ ์œ„์น˜ v์—์„œ ์ฃผ์–ด์ง„ direction ๋ฐฉํ–ฅ์œผ๋กœ step_size๋งŒํผ ์ด๋™ํ•œ ์ƒˆ๋กœ์šด ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • v: ํ˜„์žฌ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค.
  • direction: ์ด๋™ํ•  ๋ฐฉํ–ฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค. ๋ณดํ†ต ์ตœ์ ํ™”ํ•˜๋ ค๋Š” ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ(gradient) ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์ด ๋ฉ๋‹ˆ๋‹ค.
  • step_size: ํ•œ ๋ฒˆ์— ์ด๋™ํ•  ๊ฑฐ๋ฆฌ๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด ๊ฐ’์ด ๋„ˆ๋ฌด ํฌ๋ฉด ์ตœ์ ์ ์„ ๋„˜์–ด์„œ๊ฒŒ ๋˜๊ณ , ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด ์ตœ์ ํ™” ๊ณผ์ •์ด ๋งค์šฐ ๋Š๋ ค์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.๋ฐ˜ํ™˜๊ฐ’: ์ƒˆ๋กœ์šด ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค. ์ด๋Š” v์˜ ๊ฐ ์š”์†Œ์— direction์˜ ํ•ด๋‹น ์š”์†Œ์™€ step_size๋ฅผ ๊ณฑํ•œ ๊ฐ’์„ ๋”ํ•ด์„œ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.

sum_of_squares_gradient ํ•จ์ˆ˜

  • ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ v์— ๋Œ€ํ•ด ์ œ๊ณฑํ•ฉ ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ œ๊ณฑํ•ฉ ํ•จ์ˆ˜๋Š” ๋ชจ๋“  ์š”์†Œ์˜ ์ œ๊ณฑ์„ ๋”ํ•œ ๊ฒƒ์ด๋ฉฐ, ์ด ํ•จ์ˆ˜๋Š” ์ตœ์ ํ™”์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” ๊ฐ„๋‹จํ•œ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.
  • ํŒŒ๋ผ๋ฏธํ„ฐ:v: ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•  ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค.๋ฐ˜ํ™˜๊ฐ’: ์ œ๊ณฑํ•ฉ ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค. ์ œ๊ณฑํ•ฉ ํ•จ์ˆ˜์˜ ๊ฐ ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ํŽธ๋ฏธ๋ถ„์€ 2 * v_i์ด๋ฏ€๋กœ, ๊ฒฐ๊ณผ ๋ฒกํ„ฐ๋Š” ์ž…๋ ฅ ๋ฒกํ„ฐ v์˜ ๊ฐ ์š”์†Œ์— 2๋ฅผ ๊ณฑํ•œ ๊ฐ’์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.
  • ์ด ๋‘ ํ•จ์ˆ˜๋Š” ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•(gradient descent) ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ณธ ๊ตฌ์„ฑ ์š”์†Œ์ž…๋‹ˆ๋‹ค. ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์€ ํ•จ์ˆ˜์˜ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด ๊ธฐ์šธ๊ธฐ(๋˜๋Š” ๊ทธ๋ž˜๋””์–ธํŠธ) ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ˜๋ณต์ ์œผ๋กœ ํ˜„์žฌ ์œ„์น˜๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • sum_of_squares_gradient ํ•จ์ˆ˜๋Š” ์ตœ์ ํ™”ํ•˜๋ ค๋Š” ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , step ํ•จ์ˆ˜๋Š” ์ด ๊ธฐ์šธ๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ ์œ„์น˜๋กœ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค.

 

print("using the gradient")
# try range(n) n = 1,2,3,4,5,...
v = [random.randint(-10,10) for i in range(2)]  # ์ž„์˜์˜ ์ดˆ๊ธฐ ๋ฒกํ„ฐ v ์ƒ์„ฑ

tolerance = 0.0000001  # 수렴 기준 값 (10**-7)

while True:
    #print(v, sum_of_squares(v))
    gradient = sum_of_squares_gradient(v)   # v์—์„œ์˜ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ
    next_v = step(v, gradient, -0.01)       # ์Œ์˜ ๊ธฐ์šธ๊ธฐ ๋ฐฉํ–ฅ์œผ๋กœ ํ•œ ๊ฑธ์Œ ์ด๋™
    if distance(next_v, v) < tolerance:     # ์ด์ „ v์™€ ํ˜„์žฌ v์˜ ๊ฑฐ๋ฆฌ๊ฐ€ tolerance๋ณด๋‹ค ์ž‘์œผ๋ฉด ์ค‘๋‹จ
        break
    v = next_v                              # ์•„๋‹ˆ๋ฉด ๊ณ„์†

print("minimum v", v)                       # ์ตœ์†Œํ™”๋œ v ์ถœ๋ ฅ
print("minimum value", sum_of_squares(v))   # ์ตœ์†Œํ™”๋œ ๊ฐ’ ์ถœ๋ ฅ
  • v: ์ตœ์ ํ™”๋ฅผ ์‹œ์ž‘ํ•  ์ž„์˜์˜ ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” 2์ฐจ์› ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • tolerance: ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ˆ˜๋ ดํ–ˆ๋‹ค๊ณ  ํŒ๋‹จํ•˜๋Š” ๊ธฐ์ค€์ž…๋‹ˆ๋‹ค. v์˜ ์—ฐ์†๋œ ๋‘ ๊ฐ’์˜ ์ฐจ์ด๊ฐ€ ์ด ๊ฐ’๋ณด๋‹ค ์ž‘์œผ๋ฉด ๋ฐ˜๋ณต์„ ๋ฉˆ์ถฅ๋‹ˆ๋‹ค.
  • while True ๋ฃจํ”„: ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ˆ˜๋ ด ์กฐ๊ฑด์„ ๋งŒ์กฑํ•  ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.
  • gradient = sum_of_squares_gradient(v): 현재 위치 v에서 목적 함수의 기울기를 계산합니다. 이 기울기를 이용해 step size만큼 이동한 다음 위치를 구합니다.
  • next_v = step(v, gradient, -0.01): ๊ณ„์‚ฐ๋œ ๊ธฐ์šธ๊ธฐ์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ(์ตœ์†Œํ™” ๋ฐฉํ–ฅ)์œผ๋กœ ์ž‘์€ ๊ฑธ์Œ(-0.01)์„ ์ด๋™ํ•˜์—ฌ ์ƒˆ๋กœ์šด ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • if distance(next_v, v) < tolerance: ์ƒˆ๋กœ์šด ์œ„์น˜์™€ ์ด์ „ ์œ„์น˜ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๊ฐ€ tolerance๋ณด๋‹ค ์ž‘์œผ๋ฉด, ์ฆ‰ ๋ณ€ํ™”๊ฐ€ ์ถฉ๋ถ„ํžˆ ์ž‘์•„ ์ˆ˜๋ ดํ–ˆ๋‹ค๊ณ  ํŒ๋‹จํ•˜๋ฉด ๋ฐ˜๋ณต์„ ๋ฉˆ์ถฅ๋‹ˆ๋‹ค.
  • ์ตœ์ข…์ ์œผ๋กœ, ์ˆ˜๋ ดํ•œ ์œ„์น˜ v์™€ ๊ทธ ๋•Œ์˜ ๋ชฉ์  ํ•จ์ˆ˜ ๊ฐ’ sum_of_squares(v)๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
using the gradient
minimum v [-4.157730989669425e-06, -2.7718206597796163e-06]
minimum value 2.4969716752438604e-11

Choosing the Right Step Size (or Learning rate)

์˜ฌ๋ฐ”๋ฅธ ์Šคํ… ํฌ๊ธฐ(๋˜๋Š” ํ•™์Šต ์†๋„) ์„ ํƒ
  • gradient์— ์—ญํ–‰ํ•˜์—ฌ ์›€์ง์ผ ์ˆ˜ ์žˆ๋Š” ๊ทผ๊ฑฐ๋Š” ๋ถ„๋ช…ํ•˜์ง€๋งŒ, ์–ผ๋งˆ๋‚˜ ๋ฉ€๋ฆฌ ์›€์ง์ผ ๊ฒƒ์ธ์ง€๋Š” ๋ช…ํ™•ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ ์˜ฌ๋ฐ”๋ฅธ ์Šคํ… ํฌ๊ธฐ๋ฅผ ์„ ํƒํ•˜๋Š” ๊ฒƒ์€ ๊ณผํ•™์ด๋ผ๊ธฐ๋ณด๋‹ค๋Š” ์˜ˆ์ˆ ์— ๊ฐ€๊น์Šต๋‹ˆ๋‹ค. ์ธ๊ธฐ ์žˆ๋Š” ์˜ต์…˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:
    • ๊ณ ์ • ์Šคํ… ํฌ๊ธฐ ์‚ฌ์šฉ (Using a fixed step size)
    • ์‹œ๊ฐ„์ด ์ง€๋‚จ์— ๋”ฐ๋ผ ๋‹จ๊ณ„ ํฌ๊ธฐ๊ฐ€ ์ ์ฐจ ์ถ•์†Œ๋จ (Gradually shrinking the step size over time)
    • ๊ฐ ๋‹จ๊ณ„์—์„œ ๋ชฉ์  ํ•จ์ˆ˜์˜ ๊ฐ’์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๋‹จ๊ณ„ ํฌ๊ธฐ ์„ ํƒ (At each step, choosing the step size that minimizes the value of the objective function)
  • ๋งˆ์ง€๋ง‰์€ ์ตœ์ ์ธ ๊ฒƒ์ฒ˜๋Ÿผ ๋“ค๋ฆฌ์ง€๋งŒ ์‹ค์ œ๋กœ๋Š” ๋น„์šฉ์ด ๋งŽ์ด ๋“œ๋Š” ๊ณ„์‚ฐ์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๋‹ค์–‘ํ•œ ๋‹จ๊ณ„ ํฌ๊ธฐ๋ฅผ ์‹œ๋„ํ•˜๊ณ  ๋ชฉ์  ํ•จ์ˆ˜์˜ ๊ฐ’์ด ๊ฐ€์žฅ ์ž‘์€ ๋‹จ๊ณ„๋ฅผ ์„ ํƒํ•จ์œผ๋กœ์จ ๊ทธ ๊ทผ์‚ฌ์น˜๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
  • ํ•™์Šต๋ฅ ์ด ๋„ˆ๋ฌด ํฌ๋ฉด ํŽธ์ฐจ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.
  • ํ•™์Šต ์†๋„๊ฐ€ ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด ์ˆ˜๋ ด ์†๋„๊ฐ€ ๋„ˆ๋ฌด ๋Š๋ฆฌ๊ฑฐ๋‚˜ ๋กœ์ปฌ ์ตœ์†Œ๊ฐ’์œผ๋กœ ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์ง€์—ญ์ ์ธ Global Local Minimum


Experiment with various learning rates (๋‹ค์–‘ํ•œ Learning Rate๋กœ ์‹คํ—˜)

# in class, changes lr = 10, 1.1, 1, 0.1, 0.01
# 10 : diverge (발산)
# 1.1: diverge (발산)
# 1: oscillating (진동)
# 0.1: good pace (적절한 속도)
# 0.01 : too slow (너무 느림)
import numpy as np
import matplotlib.pyplot as plt

def sum_of_squares_gradient_np(v):
    """์ œ๊ณฑํ•ฉ ํ•จ์ˆ˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค."""
    return 2 * v

def gradient_descent(gradient_f, init_x, lr=0.01, step_num=10000, tolerance=0.0000001):
    """๊ทธ๋ž˜๋””์–ธํŠธ ํ•˜๊ฐ•๋ฒ•์„ ์‚ฌ์šฉํ•ด ์ตœ์†Œ๊ฐ’์„ ์ฐพ์Šต๋‹ˆ๋‹ค."""
    x = init_x
    x_history = []
    for i in range(step_num):
        x_history.append(x.copy())
        x_prev = x.copy()
        x -= lr * gradient_f(x)  # ๊ทธ๋ž˜๋””์–ธํŠธ ์Šคํ…
        if np.linalg.norm(x - x_prev) < tolerance:  # ์ˆ˜๋ ด ์กฐ๊ฑด
            break
    return x, x_history

init_x = np.array([-1.0, 1.0])
lr = 1.1  # ํ•™์Šต๋ฅ , # try with 10, 1.1, 1, 0.1, 0.01
step_num = 100
x, x_history = gradient_descent(sum_of_squares_gradient_np, init_x, lr=lr, step_num=step_num)

# ์‹œ๊ฐํ™”
plt.plot( [-5, 5], [0,0], '--b')
plt.plot( [0,0], [-5, 5], '--b')
x_history = np.array(x_history)
plt.plot(x_history[:,0], x_history[:,1], 'o')

plt.xlim(-3.5, 3.5)
plt.ylim(-4.5, 4.5)
plt.xlabel("X0")
plt.ylabel("X1")
plt.axis('equal')
plt.show()

  • ์ด ์ฝ”๋“œ๋Š” ์ดˆ๊ธฐ ์ (init_x)์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ์ œ๊ณฑํ•ฉ ํ•จ์ˆ˜์˜ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•œ ๊ฒฝ๋กœ๋ฅผ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ• ํ•จ์ˆ˜(gradient_descent)๋Š” ์ฃผ์–ด์ง„ ๊ทธ๋ž˜๋””์–ธํŠธ ํ•จ์ˆ˜(gradient_f), ์ดˆ๊ธฐ ์ (init_x), ํ•™์Šต๋ฅ (lr), ์ตœ๋Œ€ ์Šคํ… ์ˆ˜(step_num), ๊ทธ๋ฆฌ๊ณ  ์ˆ˜๋ ด ๊ธฐ์ค€(tolerance)์„ ์ธ์ž๋กœ ๋ฐ›์•„ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๋Š” ๊ณผ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ํ•™์Šต๋ฅ (lr)์€ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์—์„œ ๋งค์šฐ ์ค‘์š”ํ•œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์ž…๋‹ˆ๋‹ค. ๋„ˆ๋ฌด ํฌ๋ฉด ๋ฐœ์‚ฐํ•  ์ˆ˜ ์žˆ๊ณ , ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด ์ˆ˜๋ ด์ด ๋งค์šฐ ๋Š๋ ค์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด ์˜ˆ์ œ์—์„œ lr ๊ฐ’์„ ๋‹ค์–‘ํ•˜๊ฒŒ ๋ณ€๊ฒฝํ•ด๋ณด๋ฉฐ ๊ทธ ์˜ํ–ฅ์„ ์‹œ๊ฐ์ ์œผ๋กœ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • x -= lr * gradient_f(x) → x에서의 그래디언트를 계산해 그 반대 방향으로 이동합니다.

  • (Out of function domain) 특정 스텝 크기에서는 함수의 정의역을 벗어난 잘못된 입력이 만들어질 수 있습니다.
  • ๋”ฐ๋ผ์„œ ์ž˜๋ชป๋œ ์ž…๋ ฅ์— ๋Œ€ํ•ด ๋ฌดํ•œ๋Œ€๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” "์•ˆ์ „ ์ ์šฉ" ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ค์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
def safe(f):
    """f ํ•จ์ˆ˜๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์‹คํ–‰ํ•˜๋Š” ์ƒˆ๋กœ์šด ํ•จ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค."""
    def safe_f(*args, **kwargs):
        try:
            return f(*args, **kwargs)  # f๋ฅผ ์‹คํ–‰ํ•ด๋ด„ -> ๋ชจ๋“  ์ธ์ž, keyword & argument
        except:
            return float('inf')        # ์˜ˆ์™ธ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ๋ฌดํ•œ๋Œ€๋ฅผ ๋ฐ˜ํ™˜
    return safe_f
ํ•จ์ˆ˜ f ๋ฅผ ๋ฐ›์•„, ํ•ด๋‹น ํ•จ์ˆ˜๋ฅผ ์‹คํ–‰ํ•  ๋•Œ ์˜ˆ์™ธ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ํ”„๋กœ๊ทธ๋žจ์ด ์ค‘๋‹จ๋˜์ง€ ์•Š๊ณ  ๋Œ€์‹  ๋ฌดํ•œ๋Œ€(float('inf'))๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ์ƒˆ๋กœ์šด ํ•จ์ˆ˜ safe_f ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

Putting It All Together

  • ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ์—๋Š” ์ตœ์†Œํ™”ํ•˜๋ ค๋Š” target_fn์ด ์žˆ๊ณ , gradient_fn๋„ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด target_fn์€ ๋ชจ๋ธ์˜ ์˜ค๋ฅ˜๋ฅผ ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ํ•จ์ˆ˜๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์˜ค๋ฅ˜๋ฅผ ๊ฐ€๋Šฅํ•œ ์ž‘๊ฒŒ ๋งŒ๋“œ๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฐพ๊ณ ์ž ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    """๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชฉํ‘œ ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” theta๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค."""
    # tolerance -> f๊ฐ’์ด ์ด๊ฒƒ๋ณด๋‹ค ๋‚ฎ์•„์ง€๋ฉด ๋.
    
    # ๊ฐ ๋‹จ๊ณ„ ํฌ๊ธฐ๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
    step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]

    # theta๋ฅผ ์ดˆ๊ธฐ ๊ฐ’์œผ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
    theta = theta_0                           
    
    # target_fn์˜ ์•ˆ์ „ ๋ฒ„์ „์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
    target_fn = safe(target_fn)               
    
    # ์ตœ์†Œํ™”ํ•  ๊ฐ’์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
    value = target_fn(theta)                  

    while True:
        # ํ˜„์žฌ theta์—์„œ์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
        gradient = gradient_fn(theta)
        
        # ๋‹ค์Œ ๋‹จ๊ณ„ ํ›„๋ณด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
        next_thetas = [step(theta, gradient, -step_size)
                       for step_size in step_sizes]

        # ์˜ค์ฐจ ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.
        next_theta = min(next_thetas, key=target_fn) # target function ์ตœ์†Œ๊ฐ’ ๊ตฌํ•จ
        next_value = target_fn(next_theta) # 다음 theta에서의 목적 함수 값

        # ์ˆ˜๋ ดํ•˜๋Š” ๊ฒฝ์šฐ ๋ฉˆ์ถฅ๋‹ˆ๋‹ค.
        if abs(value - next_value) < tolerance:
            return theta
        else:
            theta, value = next_theta, next_value

 

Example : Minimizing sum_of_squares

minimize_batch(sum_of_squares, sum_of_squares_gradient, [10,20,4,5])
[0.0006805647338418772,
 0.0013611294676837543,
 0.00027222589353675085,
 0.0003402823669209386]
minimize_batch(sum_of_squares, sum_of_squares_gradient, [10,20,4,5,0,1])
[0.0006805647338418772,
 0.0013611294676837543,
 0.00027222589353675085,
 0.0003402823669209386,
 0.0,
 6.805647338418771e-05]

 

Example : Centering a certain point

def myf(v):
    # ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ v์˜ ๊ฐ ์š”์†Œ์™€ 3, 2 ๊ฐ๊ฐ์„ ๋บ€ ํ›„ ์ œ๊ณฑํ•˜์—ฌ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
    return (v[0]-3)**2 + (v[1]-2)**2

def myf_gradient(v):
    # ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ v์˜ ๊ฐ ์š”์†Œ์— 2๋ฅผ ๊ณฑํ•˜๊ณ , ๊ฐ๊ฐ์— 6 ๋˜๋Š” 4๋ฅผ ๋บ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฆฌ์ŠคํŠธ๋กœ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
    return [2.0*v[0]-6, 2.0*v[1]-4]

# minimize_batch ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ myf ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.
minimize_batch(myf, myf_gradient, [5000., 50.])
[3.0016059738814325, 2.000015426605225]
  • myf_gradient ํ•จ์ˆ˜๋ฅผ ๊ทธ๋ž˜๋””์–ธํŠธ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ์ดˆ๊ธฐ theta ๊ฐ’์€ [5000., 50.]์œผ๋กœ ์„ค์ •๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด ๊ฐ’์€ ์‚ฌ์šฉ์ž๊ฐ€ ๋ฌธ์ œ์— ๋”ฐ๋ผ ์ ์ ˆํ•˜๊ฒŒ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

 

from functools import partial

def f1(x, c):
    # ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ x์™€ ์ƒ์ˆ˜ ๋ฒกํ„ฐ c์˜ ์ฐจ์ด๋ฅผ ์ œ๊ณฑํ•˜์—ฌ ํ•ฉ์‚ฐํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
    x = np.array(x)
    c = np.array(c)
    return np.sum((x - c)**2) # x, c ์‚ฌ์ด ๊ฑฐ๋ฆฌ

def f1_gradient(x, c):
    # ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ x์™€ ์ƒ์ˆ˜ ๋ฒกํ„ฐ c์— ๊ฐ๊ฐ 2๋ฅผ ๊ณฑํ•˜๊ณ , ๊ฐ๊ฐ์— ์ƒ์ˆ˜ ๋ฒกํ„ฐ c๋ฅผ ๋บ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
    x = np.array(x)
    c = np.array(c)
    return 2*x - 2*c

def numerical_gradient(v, f, h=0.00001):
    # ์ค‘์•™ ์ฐจ๋ถ„๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ v์—์„œ ํ•จ์ˆ˜ f์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    return (np.apply_along_axis(f, 1, v + h * np.eye(len(v))) - f(v)) / h

c = np.array([7,70,7,4])

# f1 ํ•จ์ˆ˜๋ฅผ ์ƒ์ˆ˜ c๋ฅผ ๊ณ ์ •์‹œ์ผœ ๋ถ€๋ถ„ ํ•จ์ˆ˜๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค -> f function์— ์ธ์ž ๋จผ์ €
f = partial(f1, c=c)

# f1_gradient ํ•จ์ˆ˜๋ฅผ ์ƒ์ˆ˜ c๋ฅผ ๊ณ ์ •์‹œ์ผœ ๋ถ€๋ถ„ ํ•จ์ˆ˜๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
gradient_f = partial(f1_gradient, c=c)

# minimize_batch ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ f๋ฅผ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.
# gradient_f๋ฅผ ๊ทธ๋ž˜๋””์–ธํŠธ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
# ์ดˆ๊ธฐ theta ๊ฐ’์€ [0,0,0,0]์œผ๋กœ ์„ค์ •๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
minimize_batch(f, gradient_f, [0,0,0,0])
[6.999843894783611, 69.99843894783609, 6.999843894783611, 3.9999107970192056]
  • f1 ํ•จ์ˆ˜์™€ f1_gradient ํ•จ์ˆ˜์—์„œ๋Š” ์ฃผ์–ด์ง„ ๋ฒกํ„ฐ x์™€ ์ƒ์ˆ˜ ๋ฒกํ„ฐ c์— ๋Œ€ํ•ด ๊ฐ๊ฐ ํ•จ์ˆ˜ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
  • functools ๋ชจ๋“ˆ์˜ partial ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ c๋ฅผ ๊ณ ์ •์‹œํ‚จ ๋ถ€๋ถ„ ํ•จ์ˆ˜๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ ‡๊ฒŒ ์ƒ์„ฑ๋œ ๋ถ€๋ถ„ ํ•จ์ˆ˜๋“ค์„ minimize_batch ํ•จ์ˆ˜์˜ ์ธ์ˆ˜๋กœ ์ „๋‹ฌํ•˜์—ฌ ์ตœ์ ํ™”๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

 

  • ๋•Œ๋กœ๋Š” ํ•จ์ˆ˜์˜ ์Œ์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ์ตœ์†Œํ™”ํ•จ ์œผ๋กœ์จ ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. (ํ•ด๋‹น ์Œ์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ฐ€์ง)
def negate(f):
    """์ฃผ์–ด์ง„ ํ•จ์ˆ˜ f์— ๋Œ€ํ•ด -f(x)๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค."""
    return lambda *args, **kwargs: -f(*args, **kwargs)

def negate_all(f):
    """๋ฆฌ์ŠคํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜ f์˜ ๊ฐ ๊ฒฐ๊ณผ์— ๋Œ€ํ•ด ์Œ์ˆ˜๋ฅผ ์ทจํ•œ ๋ฆฌ์ŠคํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค."""
    return lambda *args, **kwargs: [-y for y in f(*args, **kwargs)] # -y: ๋ฐ˜๋Œ€ - functionํ™”

def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    # minimize_batch ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ๋ชฉํ‘œ ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” theta ๊ฐ’์„ ์ฐพ์Šต๋‹ˆ๋‹ค.
    # negate ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชฉํ‘œ ํ•จ์ˆ˜๋ฅผ ์Œ์ˆ˜๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , gradient_fn์˜ ๊ฒฐ๊ณผ๋ฅผ ์Œ์ˆ˜๋กœ ๋ณ€ํ™˜ํ•˜๋Š” negate_all ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    return minimize_batch(negate(target_fn),
                          negate_all(gradient_fn),
                          theta_0,
                          tolerance)
  • minimize_batch ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ์ฃผ์–ด์ง„ ๋ชฉํ‘œ ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” theta ๊ฐ’์„ ์ฐพ์Šต๋‹ˆ๋‹ค.
  • maximize_batch ํ•จ์ˆ˜๋Š” ์ฃผ์–ด์ง„ ๋ชฉํ‘œ ํ•จ์ˆ˜์™€ ๊ทธ๋ž˜๋””์–ธํŠธ ํ•จ์ˆ˜๋ฅผ negate ํ•จ์ˆ˜์™€ negate_all ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์Œ์ˆ˜๋กœ ๋ณ€ํ™˜ํ•œ ํ›„ minimize_batch ํ•จ์ˆ˜์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค.

Maximizing batch Example

์ •๊ทœ pdf๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ณ€์ˆ˜๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค.

  • normal pdf์˜ ๋„ํ•จ์ˆ˜๋Š” ... (deriv ๊ฐ€๋Šฅ)
from functools import partial

def normal_pdf(npx, mu, sigma):
    # ์ •๊ทœ ๋ถ„ํฌ์˜ ํ™•๋ฅ  ๋ฐ€๋„ ํ•จ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    x = npx[0]
    return ((1/(np.sqrt(2*np.pi)*sigma)*np.exp(-(x-mu)**2/(2*sigma**2))))

def numerical_gradient(v, f, h=0.00001):
    # ์ฃผ์–ด์ง„ ํ•จ์ˆ˜ f์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์ค‘์•™ ์ฐจ๋ถ„๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ทผ์‚ฌํ•˜๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.
    return (np.apply_along_axis(f, 1, v + h * np.eye(len(v))) - f(v)) / h

# normal_pdf ํ•จ์ˆ˜์— ๋Œ€ํ•œ ๋ถ€๋ถ„ ํ•จ์ˆ˜๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. mu=1, sigma=1๋กœ ๊ณ ์ •๋ฉ๋‹ˆ๋‹ค.
f = partial(normal_pdf, mu=1, sigma=1)

# numerical_gradient ํ•จ์ˆ˜์— ๋Œ€ํ•œ ๋ถ€๋ถ„ ํ•จ์ˆ˜๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
gradient_f = partial(numerical_gradient, f=f)

# ์ดˆ๊ธฐ๊ฐ’์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
init_x = np.array([1.])

# maximize_batch ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ normal_pdf ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•ฉ๋‹ˆ๋‹ค.
# gradient_f๋ฅผ ๊ทธ๋ž˜๋””์–ธํŠธ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
maximize_batch(f, gradient_f, init_x)
array([1.])
  • maximize_batch ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ฃผ์–ด์ง„ ํ•จ์ˆ˜ normal_pdf๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.
  • ์ฃผ์–ด์ง„ ํ•จ์ˆ˜ normal_pdf๋Š” ์ •๊ทœ ๋ถ„ํฌ์˜ ํ™•๋ฅ  ๋ฐ€๋„ ํ•จ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • ์ด ํ•จ์ˆ˜์— ๋Œ€ํ•œ ๊ทธ๋ž˜๋””์–ธํŠธ๋Š” numerical_gradient ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทผ์‚ฌํ•ฉ๋‹ˆ๋‹ค.
  • ์ดˆ๊ธฐ๊ฐ’์€ init_x๋กœ ์„ค์ •๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ตœ์ข…์ ์œผ๋กœ maximize_batch ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ์ตœ๋Œ€ํ™”๋œ ๊ฐ’์„ ์ฐพ์Šต๋‹ˆ๋‹ค.

ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ• (Stochastic Gradient Descent)

  • ๋ฐฐ์น˜ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•
    • ๋ฐฐ์น˜ ์ ‘๊ทผ ๋ฐฉ์‹์—์„œ ๊ฐ ๊ทธ๋ผ๋ฐ์ด์…˜ ๋‹จ๊ณ„๋Š” ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ๊ทธ๋ผ๋ฐ์ด์…˜์„ ๊ณ„์‚ฐํ•ด์•ผ ํ•˜๋ฏ€๋กœ ๊ฐ ๋‹จ๊ณ„์— ์˜ค๋žœ ์‹œ๊ฐ„์ด ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค.
  • ์ผ๋ฐ˜์ ์œผ๋กœ ์˜ค์ฐจ ํ•จ์ˆ˜๋Š” ๊ฐ€์‚ฐ์ ์ด๋ฉฐ, ์ด๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ์˜ˆ์ธก ์˜ค์ฐจ๊ฐ€ ๋‹จ์ˆœํžˆ ๊ฐ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ๋Œ€ํ•œ ์˜ˆ์ธก ์˜ค์ฐจ์˜ ํ•ฉ์ด๋ผ๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
  • ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•
    • ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์€ ํ•œ ๋ฒˆ์— ํ•œ ์ ์— ๋Œ€ํ•ด์„œ๋งŒ ๊ฒฝ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค(๊ทธ๋ฆฌ๊ณ  ๋‹จ๊ณ„๋ฅผ ๋ฐŸ์Šต๋‹ˆ๋‹ค).
    • ์ •์ง€ ์ง€์ ์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ˆœํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋งค ์ฃผ๊ธฐ ๋™์•ˆ ์šฐ๋ฆฌ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌด์ž‘์œ„ ์ˆœ์„œ๋กœ ๋ฐ˜๋ณตํ•˜๊ธฐ๋ฅผ ์›ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

Batch vs SGD vs Mini-batch

์ „์ฒด Error = ๋ถ€๋ถ„ ์—๋Ÿฌ์˜ ํ•ฉ → 1๊ฐœ ์”ฉ๋งŒ ๋ณธ๋‹ค.
๋ฐฉ๋ฒ• ์„ค๋ช… ์žฅ์  ๋‹จ์ 
Batch ์ „์ฒด ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•จ ์ •ํ™•ํ•จ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๊ฐ€ ๋งค์šฐ ํฌ๋ฉด 1) ๋Š๋ฆผ, 2) ๋ฉ”๋ชจ๋ฆฌ์— ๋งž์ถ”๊ธฐ ์–ด๋ ค์›€
SGD ํ•˜๋‚˜์˜ ์ƒ˜ํ”Œ์—์„œ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์—…๋ฐ์ดํŠธ ๋น ๋ฆ„ ๊ณผ๋„ํ•œ ์˜ค๋ฒ„์ŠˆํŒ… ๋ฐ ์ˆ˜๋ ด์ด ์–ด๋ ค์›€ (ํ•™์Šต๋ฅ  ๊ฐ์†Œ๋กœ ํ•ด๊ฒฐ ๊ฐ€๋Šฅ)
Mini-batch n๊ฐœ์˜ ์ƒ˜ํ”Œ์˜ ๋ฏธ๋‹ˆ๋ฐฐ์น˜์—์„œ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์—…๋ฐ์ดํŠธ ๋ถ„์‚ฐ ๊ฐ์†Œ, ์•ˆ์ •์ ์ธ ์—…๋ฐ์ดํŠธ ๋น ๋ฅธ ๊ณ„์‚ฐ, (ํ•˜๋“œ์›จ์–ด/์†Œํ”„ํŠธ์›จ์–ด์˜ ๊ฐ•์ ์„ ํ™œ์šฉ). ์ด๊ฒƒ์€ ์‹ ๊ฒฝ๋ง์„ ํ›ˆ๋ จํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค.


NeuralNet Terminology: Epoch

ํ•œ epoch๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ๊ฐ€ ํ›ˆ๋ จ์„ ์œ„ํ•ด ์†Œ๋น„๋˜๋Š” ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค.
import random

def in_random_order(data):
    # ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌด์ž‘์œ„ ์ˆœ์„œ๋กœ ๋ฐ˜ํ™˜ํ•˜๋Š” ์ œ๋„ˆ๋ ˆ์ดํ„ฐ์ž…๋‹ˆ๋‹ค.
    # Args: data: ๋ฌด์ž‘์œ„๋กœ ๋ฐ˜ํ™˜ํ•  ๋ฐ์ดํ„ฐ, Returns: ๋ฌด์ž‘์œ„๋กœ ์„ž์ธ ๋ฐ์ดํ„ฐ
    
    indexes = [i for i, _ in enumerate(data)]  # ๋ฐ์ดํ„ฐ์˜ ์ธ๋ฑ์Šค ๋ชฉ๋ก์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    random.shuffle(indexes)                    # ์ธ๋ฑ์Šค๋ฅผ ์„ž์Šต๋‹ˆ๋‹ค.
    for i in indexes:                          # ํ•ด๋‹น ์ˆœ์„œ๋Œ€๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
        yield data[i]
  • in_random_order ์ œ๋„ˆ๋ ˆ์ดํ„ฐ ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌด์ž‘์œ„ ์ˆœ์„œ๋กœ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
  • enumerate ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ์˜ ์ธ๋ฑ์Šค ๋ชฉ๋ก์„ ์ƒ์„ฑํ•˜๊ณ , random.shuffle ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ธ๋ฑ์Šค๋ฅผ ์„ž์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฐ ๋‹ค์Œ ์„ž์ธ ์ธ๋ฑ์Šค์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

Understanding SGD Code

x๋Š” Training ๋ฐ์ดํ„ฐ ์„ธํŠธ์ž…๋‹ˆ๋‹ค.
y๋Š” Lavel(๋˜๋Š” ํด๋ž˜์Šค) ์ง‘ํ•ฉ์ž…๋‹ˆ๋‹ค.
def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):

    data = list(zip(x, y))
    theta = theta_0                             # ์ดˆ๊ธฐ ์ถ”์ •๊ฐ’
    alpha = alpha_0                             # ์ดˆ๊ธฐ ์Šคํ… ํฌ๊ธฐ
    min_theta, min_value = None, float("inf")   # ํ˜„์žฌ๊นŒ์ง€์˜ ์ตœ์†Ÿ๊ฐ’
    iterations_with_no_improvement = 0

    # 100ํšŒ ๋ฐ˜๋ณตํ•˜์—ฌ ๊ฐœ์„  ์—†์œผ๋ฉด ์ข…๋ฃŒ
    while iterations_with_no_improvement < 100:
        value = sum(target_fn(x_i, y_i, theta) for x_i, y_i in data)   # 전체 데이터에 대한 총 오차(Total Error)

        if value < min_value:
            # ์ƒˆ๋กœ์šด ์ตœ์†Ÿ๊ฐ’์„ ์ฐพ์•˜์œผ๋ฉด ๊ธฐ์–ตํ•˜๊ณ  ์ดˆ๊ธฐ ์Šคํ… ํฌ๊ธฐ๋กœ ๋Œ์•„๊ฐ
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # ๊ฐœ์„ ์ด ์—†๋‹ค๋ฉด ์Šคํ… ํฌ๊ธฐ๋ฅผ ์ค„์—ฌ๋ด„
            iterations_with_no_improvement += 1
            alpha *= 0.9

        # ๊ฐ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ๋Œ€ํ•ด ๊ทธ๋ž˜๋””์–ธํŠธ ์Šคํ…์„ ์ทจํ•จ
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))

    return min_theta
  • target_fn: ๋ชฉํ‘œ ํ•จ์ˆ˜
  • gradient_fn: ๋ชฉํ‘œ ํ•จ์ˆ˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ)
  • x: ์ž…๋ ฅ ๋ฐ์ดํ„ฐ
  • y: ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ
  • theta_0: ์ดˆ๊ธฐ theta ๊ฐ’
  • alpha_0: ์ดˆ๊ธฐ ํ•™์Šต๋ฅ  (๊ธฐ๋ณธ๊ฐ’: 0.01)
  • Returns: ์ตœ๋Œ€๊ฐ’์„ ๊ฐ€์ง€๋Š” theta ๊ฐ’
def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    """
    ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ์ƒ์Šน๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชฉํ‘œ ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.
    """
    return minimize_stochastic(negate(target_fn),
                               negate_all(gradient_fn),
                               x, y, theta_0, alpha_0)
  • target_fn์€ ๋ชฉํ‘œ ํ•จ์ˆ˜์ด๊ณ , gradient_fn์€ ๋ชฉํ‘œ ํ•จ์ˆ˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ) ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.
  • x๋Š” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ, y๋Š” ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ, theta_0์€ ์ดˆ๊ธฐ ์ถ”์ •๊ฐ’์ž…๋‹ˆ๋‹ค.
  • alpha_0์€ ์ดˆ๊ธฐ ํ•™์Šต๋ฅ ๋กœ, ๊ธฐ๋ณธ๊ฐ’์€ 0.01์ž…๋‹ˆ๋‹ค. ์ตœ์ ํ™”๋œ theta ๊ฐ’์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.