What is GELU activation?

18

I was reading the BERT paper, which uses GELU (Gaussian Error Linear Unit). The paper states the equation as

$\text{GELU}(x) = xP(X \le x) = x\Phi(x),$

which is in turn approximated by

$0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^3\right)\right]\right).$

Could you simplify the equation and explain how it has been approximated?

activation-function bert mathematics

— thanatoz

19

GELU function

We can expand the cumulative distribution function of $\mathcal{N}(0, 1)$, i.e. $\Phi(x)$, as follows:

$\text{GELU}(x):=x{\Bbb P}(X \le x)=x\Phi(x)=0.5x\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$

Note that this is a definition, not an equation (or a relation). The authors give some justification for the proposal, e.g. a stochastic analogy; mathematically, however, it is just a definition.
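The definition above and the tanh form quoted in the question can be compared directly. The following is a minimal sketch using only the standard library; the function names `gelu_exact` and `gelu_tanh` are illustrative and do not come from the paper.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh form quoted in the question (cubic coefficient 0.044715)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Print both versions side by side at a few sample points.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

For large positive $x$ both versions approach $x$, and for large negative $x$ both approach $0$, so any visible difference is confined to moderate values of $x$.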

[Figure: plot of GELU]

Tanh approximation

For this type of numerical approximation, the key idea is to find a similar function (chosen mostly from experience), parameterize it, and then fit it to a set of points from the original function.

Knowing that $\text{erf}(x)$ is very close to $\text{tanh}(x)$, and that the first derivative of $\text{erf}(\frac{x}{\sqrt{2}})$ coincides with that of $\text{tanh}(\sqrt{\frac{2}{\pi}}x)$ at $x=0$, where both equal $\sqrt{\frac{2}{\pi}}$, we proceed to fit

$\text{tanh}\left(\sqrt{\frac{2}{\pi}}(x+ax^2+bx^3+cx^4+dx^5)\right)$

(or a form with more terms) to a set of points

$\left(x_i, \text{erf}\left(\frac{x_i}{\sqrt{2}}\right)\right)$.
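The claim that both first derivatives equal $\sqrt{2/\pi}$ at $x=0$ can be checked numerically. This is a minimal sketch using a central finite difference; the helper `central_diff` and the step size `h` are assumptions made for illustration, not part of the original answer.

```python
import math


def central_diff(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x) (hypothetical helper)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


k = math.sqrt(2.0 / math.pi)  # the slope both functions should share at 0

d_erf = central_diff(lambda x: math.erf(x / math.sqrt(2.0)), 0.0)
d_tanh = central_diff(lambda x: math.tanh(k * x), 0.0)

print(f"sqrt(2/pi)           = {k:.8f}")
print(f"d/dx erf(x/sqrt(2))  = {d_erf:.8f}")
print(f"d/dx tanh(k*x)       = {d_tanh:.8f}")
```

Fixing the linear coefficient to $\sqrt{2/\pi}$ this way leaves only the higher-order terms to be determined by the fit.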

I fitted this function to 20 samples from $(-1.5, 1.5)$ (using an online curve-fitting site). Setting $a=c=d=0$, $b$ was estimated as $0.04495641$. With more samples from a wider range (that site allowed only 20), the coefficient $b$ gets closer to the paper's $0.044715$. Finally, we get

$\text{GELU}(x)=x\Phi(x)=0.5x\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\simeq 0.5x\left(1+\text{tanh}\left(\sqrt{\frac{2}{\pi}}(x+0.044715x^3)\right)\right)$

with a mean squared error of $\sim 10^{-8}$ for $x \in [-10, 10]$.

Note that if we had not used the relationship between the first derivatives, the factor $\sqrt{\frac{2}{\pi}}$ would have been absorbed into the fitted parameters, giving

$0.5x\left(1+\text{tanh}\left(0.797885x+0.035677x^3\right)\right)$

which is less elegant (less analytical, more numerical)!
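The two numeric coefficients are just $\sqrt{2/\pi}$ and $\sqrt{2/\pi}\cdot 0.044715$ multiplied out. A short sketch of that arithmetic follows; the variable names are illustrative only.

```python
import math

k = math.sqrt(2.0 / math.pi)  # factor pulled out in the "analytical" form
b = 0.044715                  # cubic coefficient quoted from the paper

linear_coeff = k      # coefficient of x after folding k into the polynomial
cubic_coeff = k * b   # coefficient of x^3 after folding k into the polynomial

print(f"{linear_coeff:.6f} {cubic_coeff:.6f}")  # prints 0.797885 0.035677

# Both forms describe the same function; compare them at one sample point.
x = 1.3
factored = math.tanh(k * (x + b * x**3))
expanded = math.tanh(linear_coeff * x + cubic_coeff * x**3)
print(abs(factored - expanded))
```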

Utilizing the parity

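As suggested by @BookYourLuck, we can use the parity of the functions to restrict the space of polynomials we search in. That is, since $\text{erf}$ is an odd function, i.e. $f(-x)=-f(x)$, and $\text{tanh}$ is also an odd function, the polynomial $\text{pol}(x)$ inside $\text{tanh}$ should also be odd (it should contain only odd powers of $x$) in order to have

$\text{erf}(-x)\simeq\text{tanh}(\text{pol}(-x))=\text{tanh}(-\text{pol}(x))=-\text{tanh}(\text{pol}(x))\simeq-\text{erf}(x)$

Previously, we were fortunate to end up with (almost) zero coefficients for the even powers $x^2$ and $x^4$. In general, however, an unconstrained fit can produce a lower-quality approximation, for example one containing a term like $0.23x^2$ that has to be cancelled out by extra terms, instead of simply $0x^2$.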
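The parity argument can be illustrated numerically: an odd polynomial inside $\text{tanh}$ keeps the whole approximation odd, while an even term breaks the symmetry. This is a minimal sketch; `pol_odd`, `pol_mixed`, `oddness_gap`, and the even coefficient $0.23$ are hypothetical and chosen only for illustration.

```python
import math

K = math.sqrt(2.0 / math.pi)


def pol_odd(x, b=0.044715):
    """Odd polynomial: only odd powers of x."""
    return K * (x + b * x**3)


def pol_mixed(x, a=0.23, b=0.044715):
    """Same polynomial with a hypothetical even term a*x^2 added."""
    return K * (x + a * x**2 + b * x**3)


def oddness_gap(f, xs):
    """Largest |f(-x) + f(x)| over xs; zero for an odd function."""
    return max(abs(f(-x) + f(x)) for x in xs)


xs = [0.1 * i for i in range(1, 31)]
print("odd polynomial  :", oddness_gap(lambda x: math.tanh(pol_odd(x)), xs))
print("with even term  :", oddness_gap(lambda x: math.tanh(pol_mixed(x)), xs))
```

Restricting the fit to odd powers removes this failure mode by construction and halves the number of parameters to estimate.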

Sigmoid approximation

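A similar relationship holds between $\text{erf}(x)$ and $2\left(\sigma(x)-\frac{1}{2}\right)$ (sigmoid), which the paper proposes as another approximation, with a mean squared error of $\sim 10^{-4}$ for $x \in [-10, 10]$.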

Here is Python code for generating the data points, fitting the functions, and calculating the mean squared errors:

```python
import math
import numpy as np
import scipy.optimize as optimize


def tanh_model(xs, a):
    # tanh(sqrt(2/pi) * (x + a*x^3)): odd cubic form fitted to erf(x/sqrt(2))
    return [math.tanh(math.sqrt(2 / math.pi) * (x + a * x**3)) for x in xs]


def sigmoid(xs, a):
    # 2*(sigma(a*x) - 1/2): rescaled sigmoid fitted to erf(x/sqrt(2))
    return [2 * (1 / (1 + math.exp(-a * x)) - 0.5) for x in xs]


print_points = False  # set to True to dump the sampled (x, erf) points
np.random.seed(123)
# Alternative sample sets tried earlier:
# xs = [-2, -1, -.9, -.7, 0.6, -.5, -.4, -.3, -0.2, -.1, 0,
#       .1, 0.2, .3, .4, .5, 0.6, .7, .9, 2]
# xs = np.concatenate((np.arange(-1, 1, 0.2), np.arange(-4, 4, 0.8)))
# xs = np.concatenate((np.arange(-2, 2, 0.5), np.arange(-8, 8, 1.6)))
xs = np.arange(-10, 10, 0.001)
# Targets: erf(x/sqrt(2)) for the fits, exact GELU for the error measurements
erfs = np.array([math.erf(x/math.sqrt(2)) for x in xs])
ys = np.array([0.5 * x * (1 + math.erf(x/math.sqrt(2))) for x in xs])

# Fit tanh and sigmoid curves to erf points
tanh_popt, _ = optimize.curve_fit(tanh_model, xs, erfs)
print('Tanh fit: a=%5.5f' % tuple(tanh_popt))

sig_popt, _ = optimize.curve_fit(sigmoid, xs, erfs)
print('Sigmoid fit: a=%5.5f' % tuple(sig_popt))

# curves used in https://mycurvefit.com:
# 1. sinh(sqrt(2/3.141593)*(x+a*x^2+b*x^3+c*x^4+d*x^5))/cosh(sqrt(2/3.141593)*(x+a*x^2+b*x^3+c*x^4+d*x^5))
# 2. sinh(sqrt(2/3.141593)*(x+b*x^3))/cosh(sqrt(2/3.141593)*(x+b*x^3))
# MSE of the tanh-based GELU: paper coefficient vs. the coefficient fitted above
y_paper_tanh = np.array([0.5 * x * (1 + math.tanh(math.sqrt(2/math.pi)*(x + 0.044715 * x**3))) for x in xs])
tanh_error_paper = (np.square(ys - y_paper_tanh)).mean()
y_alt_tanh = np.array([0.5 * x * (1 + math.tanh(math.sqrt(2/math.pi)*(x + tanh_popt[0] * x**3))) for x in xs])
tanh_error_alt = (np.square(ys - y_alt_tanh)).mean()

# curve used in https://mycurvefit.com:
# 1. 2*(1/(1+2.718281828459^(-(a*x))) - 0.5)
# MSE of the sigmoid-based GELU: paper coefficient vs. the coefficient fitted above
y_paper_sigmoid = np.array([x * (1 / (1 + math.exp(-1.702 * x))) for x in xs])
sigmoid_error_paper = (np.square(ys - y_paper_sigmoid)).mean()
y_alt_sigmoid = np.array([x * (1 / (1 + math.exp(-sig_popt[0] * x))) for x in xs])
sigmoid_error_alt = (np.square(ys - y_alt_sigmoid)).mean()

print('Paper tanh error:', tanh_error_paper)
print('Alternative tanh error:', tanh_error_alt)
print('Paper sigmoid error:', sigmoid_error_paper)
print('Alternative sigmoid error:', sigmoid_error_alt)

if print_points:
    print(len(xs))
    for x, erf_value in zip(xs, erfs):
        print(x, erf_value)
```

Output:

```
Tanh fit: a=0.04485
Sigmoid fit: a=1.70099
Paper tanh error: 2.4329173471294176e-08
Alternative tanh error: 2.698034519269613e-08
Paper sigmoid error: 5.6479106346814546e-05
Alternative sigmoid error: 5.704246564663601e-05
```

— Esmailian

2

Why is the approximation needed? Couldn't they just use erf function?

— SebiSebi

8

First note that

$\Phi(x) = \frac12 \mathrm{erfc}\left(-\frac{x}{\sqrt{2}}\right) = \frac12 \left(1 + \mathrm{erf}\left(\frac{x}{\sqrt2}\right)\right)$

by parity of $\mathrm{erf}$. We need to show that

$\mathrm{erf}\left(\frac x {\sqrt2}\right) \approx \tanh\left(\sqrt{\frac2\pi} \left(x + a x^3\right)\right)$

for $a \approx 0.044715$.

For large values of $x$, both functions are bounded in $[-1, 1]$. For small $x$, the respective Taylor series read

$\tanh(x) = x - \frac{x^3}{3} + o(x^3)$

and

$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \left(x - \frac{x^3}{3}\right) + o(x^3).$

Substituting, we get that

$\tanh\left(\sqrt{\frac2\pi} \left(x + a x^3\right)\right) = \sqrt\frac{2}{\pi} \left(x + \left(a-\frac{2}{3\pi}\right)x^3\right) + o(x^3)$

and

$\mathrm{erf}\left(\frac x {\sqrt2}\right) = \sqrt\frac2\pi \left(x - \frac{x^3}{6}\right) + o(x^3).$

Equating the coefficients of $x^3$ gives $a - \frac{2}{3\pi} = -\frac16$, so we find

$a = \frac{2}{3\pi} - \frac16 \approx 0.04553992412,$

close to the paper's $0.044715$.
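Equating the $x^3$ coefficients of the two expansions above amounts to $a - \frac{2}{3\pi} = -\frac{1}{6}$. The arithmetic can be reproduced with a few lines of Python; the variable names `a_taylor` and `a_paper` are illustrative only.

```python
import math

# Coefficient obtained by matching the x^3 terms of the two Taylor series.
a_taylor = 2.0 / (3.0 * math.pi) - 1.0 / 6.0

# Coefficient quoted from the paper.
a_paper = 0.044715

print(f"Taylor-matched a    = {a_taylor:.11f}")
print(f"paper a             = {a_paper:.6f}")
print(f"relative difference = {abs(a_taylor - a_paper) / a_paper:.2%}")
```

The Taylor match only enforces agreement near $x=0$, whereas a fitted coefficient trades some accuracy at the origin for a smaller error over a wider range, which is one plausible reason the two values differ slightly.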

— BookYourLuck