How to initialize weights in PyTorch?

Question 1

How do I initialize the weights and biases (for example, with He or Xavier initialization) of a network in PyTorch?

Question 2

Single layer

To initialize the weights of a single layer, use a function from torch.nn.init. For example:

conv1 = torch.nn.Conv2d(...)
torch.nn.init.xavier_uniform_(conv1.weight)

Alternatively, you can modify the parameters by writing to conv1.weight.data (which is a torch.Tensor). Example:

conv1.weight.data.fill_(0.01)

The same applies to biases:

conv1.bias.data.fill_(0.01)
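As a self-contained sketch of the single-layer case, the snippet below initializes one convolution with torch.nn.init and checks the result. The layer sizes are arbitrary and chosen only for illustration; the bound uses the Xavier uniform formula a = sqrt(6 / (fan_in + fan_out)) with gain 1.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Layer sizes are arbitrary, chosen only for the demonstration.
conv1 = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# Xavier/Glorot uniform init for the weights, constant init for the bias.
nn.init.xavier_uniform_(conv1.weight)
nn.init.constant_(conv1.bias, 0.01)

# Xavier uniform samples from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
fan_in = 3 * 3 * 3    # in_channels * kernel_height * kernel_width
fan_out = 8 * 3 * 3   # out_channels * kernel_height * kernel_width
bound = math.sqrt(6.0 / (fan_in + fan_out))

max_abs_weight = conv1.weight.abs().max().item()
print(max_abs_weight, bound)   # the largest weight should not exceed the bound
print(conv1.bias)
```

The torch.nn.init functions modify the tensor in place and do not record the operation in autograd, so they can be used in place of writing to .data.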

`nn.Sequential` or custom `nn.Module`

Pass an initialization function to torch.nn.Module.apply. It will initialize the weights in the entire nn.Module recursively.

apply(fn): Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also torch-nn-init).

Example:

import torch
import torch.nn as nn

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)
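To illustrate the "recursively" part, here is a small sketch in which the Linear layers sit inside a nested container. The network layout is made up for the example; the point is only that apply() still reaches the inner layers.

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Only touch Linear layers; other module types are left unchanged.
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.constant_(m.bias, 0.01)

# A nested container (hypothetical layout): apply() visits every submodule.
net = nn.Sequential(
    nn.Linear(2, 4),
    nn.ReLU(),
    nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1)),
)
net.apply(init_weights)

# Collect every Linear layer, including the nested ones, and inspect the biases.
linear_layers = [m for m in net.modules() if isinstance(m, nn.Linear)]
for layer in linear_layers:
    print(layer, layer.bias.data)
```

Using isinstance instead of type(m) == nn.Linear also matches subclasses of nn.Linear, which may or may not be what you want.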

Question 3

We compare different modes of weight initialization using the same neural network (NN) architecture.

All zeros or ones

If you follow the principle of Occam's razor, you might think that setting all the weights to 0 or 1 would be the best solution. This is not the case.

When every weight is the same, all the neurons in a layer produce the same output, which makes it hard to decide which weights to adjust.

    # initialize two NN's with 0 and 1 constant weights
    model_0 = Net(constant_weight=0)
    model_1 = Net(constant_weight=1)

After 2 epochs:

Validation Accuracy
9.625% -- All Zeros
10.050% -- All Ones
Training Loss
2.304  -- All Zeros
1552.281  -- All Ones
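The symmetry problem can be seen directly on a toy network. The sketch below uses a small stand-in model (an assumption; the Net class from this answer is not shown) with every parameter set to 1, and shows that all hidden units compute the same value and receive the same gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in network (hypothetical; the original Net class is not shown).
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
for p in model.parameters():
    nn.init.constant_(p, 1.0)

x = torch.randn(5, 4)
hidden = model[1](model[0](x)).detach()   # activations of the hidden layer
loss = model(x).sum()
loss.backward()
grad = model[0].weight.grad

# Every hidden unit computes the same value for a given input ...
print(hidden)
# ... and every row of the first-layer gradient is identical,
# so a gradient step keeps the units identical to each other.
print(grad)
```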

Uniform initialization

A uniform distribution has an equal probability of picking any number from a set of numbers.

Let's see how well the neural network trains using a uniform weight initialization, where low=0.0 and high=1.0.

Below, we'll see another way (besides in the Net class code) to initialize the weights of a network. To define weights outside of the model definition, we can:

Define a function that assigns weights by the type of network layer, then

Apply those weights to an initialized model using model.apply(fn), which applies a function to each model layer.

    # takes in a module and applies the specified weight initialization
    def weights_init_uniform(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # apply a uniform distribution to the weights and a bias=0
            m.weight.data.uniform_(0.0, 1.0)
            m.bias.data.fill_(0)

    model_uniform = Net()
    model_uniform.apply(weights_init_uniform)

After 2 epochs:

Validation Accuracy
36.667% -- Uniform Weights
Training Loss
3.208  -- Uniform Weights

General rule for setting weights

The general rule for setting the weights in a neural network is to set them close to zero without being too small.

Good practice is to start your weights in the range [-y, y] where y=1/sqrt(n)
(n is the number of inputs to a given neuron).

    import numpy as np

    # takes in a module and applies the specified weight initialization
    def weights_init_uniform_rule(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # get the number of the inputs
            n = m.in_features
            y = 1.0/np.sqrt(n)
            m.weight.data.uniform_(-y, y)
            m.bias.data.fill_(0)

    # create a new model with these weights
    model_rule = Net()
    model_rule.apply(weights_init_uniform_rule)

Below we compare the performance of an NN whose weights are initialized with the uniform distribution [-0.5, 0.5) against one whose weights are initialized using the general rule.

After 2 epochs:

Validation Accuracy
75.817% -- Centered Weights [-0.5, 0.5)
85.208% -- General Rule [-y, y)
Training Loss
0.705  -- Centered Weights [-0.5, 0.5)
0.469  -- General Rule [-y, y)
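One way to see why the general rule helps is to measure how large a layer's pre-activations become under each range. The sketch below is not the experiment above: it uses synthetic unit-variance inputs and assumed layer sizes (n = 784 inputs, 256 units). With weights from U(-y, y), each weight has variance y^2/3 = 1/(3n), so a sum over n inputs has variance of roughly 1/3 regardless of n; the wider ranges grow with n.

```python
import math
import torch

torch.manual_seed(0)

n = 784        # number of inputs to the layer (assumed, e.g. a flattened 28x28 image)
units = 256    # number of neurons in the layer (assumed)
x = torch.randn(1000, n)   # synthetic inputs with zero mean and unit variance

def preactivation_std(low, high):
    # Draw weights from U(low, high) and measure the spread of x @ W^T.
    w = torch.empty(units, n).uniform_(low, high)
    return (x @ w.t()).std().item()

y = 1.0 / math.sqrt(n)
std_01 = preactivation_std(0.0, 1.0)
std_centered = preactivation_std(-0.5, 0.5)
std_rule = preactivation_std(-y, y)

print(f"U(0, 1)      : {std_01:.3f}")
print(f"U(-0.5, 0.5) : {std_centered:.3f}")
print(f"U(-y, y)     : {std_rule:.3f}  (theory: {1 / math.sqrt(3):.3f})")
```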

Normal distribution to initialize the weights

The normal distribution should have a mean of 0 and a standard deviation of y=1/sqrt(n), where n is the number of inputs to the layer (m.in_features in the code).

    # takes in a module and applies the specified weight initialization
    def weights_init_normal(m):
        '''Takes in a module and initializes all linear layers with weight
           values taken from a normal distribution.'''

        classname = m.__class__.__name__
        # for every Linear layer in a model
        if classname.find('Linear') != -1:
            # number of inputs to the layer
            n = m.in_features
            # m.weight.data should be taken from a normal distribution
            m.weight.data.normal_(0.0, 1 / np.sqrt(n))
            # m.bias.data should be 0
            m.bias.data.fill_(0)
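The same rule can also be written with the torch.nn.init functions instead of touching .data. This is only a sketch: the Sequential model is a hypothetical stand-in for the Net class, and the check at the end simply compares the empirical standard deviation of the first layer with 1/sqrt(n).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def weights_init_normal(m):
    # Normal(0, 1/sqrt(n)) weights and zero biases for every Linear layer,
    # written with torch.nn.init instead of .data access.
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=1.0 / math.sqrt(m.in_features))
        nn.init.zeros_(m.bias)

# Hypothetical stand-in for the Net class used in this answer.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(weights_init_normal)

first = model[0]
empirical_std = first.weight.std().item()
target_std = 1.0 / math.sqrt(first.in_features)
print(empirical_std, target_std)
```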

Below we show the performance of two NNs, one initialized using the uniform distribution and the other using the normal distribution.

After 2 epochs:

Validation Accuracy
85.775% -- Uniform Rule [-y, y)
84.717% -- Normal Distribution
Training Loss
0.329  -- Uniform Rule [-y, y)
0.443  -- Normal Distribution

Question 4

To initialize layers, you typically don't need to do anything.

PyTorch will do it for you. If you think about it, this makes a lot of sense. Why should we initialize layers ourselves when PyTorch can do that following the latest trends?

Check, for instance, the Linear layer.

Its __init__ method calls reset_parameters, which uses Kaiming He initialization:

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

The same is true for other layer types. For Conv2d, for example, see its reset_parameters method in the PyTorch source.

NOTE: The gain from a proper initialization is faster training. If your problem deserves a special initialization, you can apply it afterwards.
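A quick way to look at the default is to create a layer and inspect its parameters. The sketch below assumes a recent PyTorch release, where the default nn.Linear initialization appears to keep both weights and biases within 1/sqrt(fan_in) of zero; the layer sizes are arbitrary.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=16, out_features=32)  # sizes chosen arbitrarily

# Compare the largest default weight and bias with 1 / sqrt(fan_in).
bound = 1.0 / math.sqrt(layer.in_features)
max_w = layer.weight.abs().max().item()
max_b = layer.bias.abs().max().item()
print(max_w, max_b, bound)

# reset_parameters() re-draws the default initialization in place.
old_weight = layer.weight.detach().clone()
layer.reset_parameters()
changed = not torch.equal(old_weight, layer.weight)
print(changed)
```

Calling reset_parameters() is also a convenient way to return a layer to its default state after experimenting with a custom initialization.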

Question 5

    import torch.nn as nn

    # placeholder sizes (assumed for illustration; not given in the original answer)
    in_features, h_size = 4, 8

    # a simple network
    rand_net = nn.Sequential(nn.Linear(in_features, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, 1),
                             nn.ReLU())

    # initialization function, first checks the module type,
    # then applies the desired changes to the weights
    # (despite the name, this draws from a uniform distribution on [0, 1])
    def init_normal(m):
        if type(m) == nn.Linear:
            nn.init.uniform_(m.weight)

    # use the module's apply function to recursively apply the initialization
    rand_net.apply(init_normal)

Question 6

Sorry for being so late, I hope my answer will help.

To initialize weights with a normal distribution, use:

torch.nn.init.normal_(tensor, mean=0, std=1)

Or, to fill the tensor with a constant value, write:

torch.nn.init.constant_(tensor, value)

Or, to use a uniform distribution:

torch.nn.init.uniform_(tensor, a=0, b=1) # a: lower_bound, b: upper_bound

You can check other methods for initializing tensors in the torch.nn.init documentation.
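As a small sketch of the three calls above applied to an actual tensor (the shape and the constant 0.3 are arbitrary choices for the example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

w = torch.empty(3, 5)   # uninitialized tensor; the shape is arbitrary

# Each function fills the tensor in place and also returns it.
nn.init.normal_(w, mean=0.0, std=1.0)
normal_copy = w.clone()

nn.init.constant_(w, 0.3)
constant_copy = w.clone()

nn.init.uniform_(w, a=0.0, b=1.0)
uniform_copy = w.clone()

print(normal_copy)
print(constant_copy)
print(uniform_copy)
```

For a layer, pass its parameter instead of a bare tensor, for example layer.weight or layer.bias.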

Question 7

If you want some extra flexibility, you can also set the weights manually.

Say you have an input of all ones:

import torch
import torch.nn as nn

input = torch.ones((8, 8))
print(input)

tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

And you want to make a dense layer with no bias (so we can visualize the result):

d = nn.Linear(8, 8, bias=False)

Set all the weights to 0.5 (or anything else):

d.weight.data = torch.full((8, 8), 0.5)
print(d.weight.data)

The weights:

Out[14]: 
tensor([[0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

All your weights are now 0.5. Pass the data through:

d(input)

Out[13]: 
tensor([[4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.]], grad_fn=<MmBackward>)

Remember that each neuron receives 8 inputs, all of which have weight 0.5 and value 1 (and no bias), so each output sums to 8 * 0.5 * 1 = 4.

Question 8

Iterate over parameters

If you cannot use apply, for instance if the model does not implement Sequential directly:

Same for all

# see UNet at https://github.com/milesial/Pytorch-UNet/tree/master/unet


def init_all(model, init_func, *params, **kwargs):
    for p in model.parameters():
        init_func(p, *params, **kwargs)

model = UNet(3, 10)
init_all(model, torch.nn.init.normal_, mean=0., std=1) 
# or
init_all(model, torch.nn.init.constant_, 1.)

Depending on shape

def init_all(model, init_funcs):
    for p in model.parameters():
        init_func = init_funcs.get(len(p.shape), init_funcs["default"])
        init_func(p)

model = UNet(3, 10)
init_funcs = {
    1: lambda x: torch.nn.init.normal_(x, mean=0., std=1.), # can be bias
    2: lambda x: torch.nn.init.xavier_normal_(x, gain=1.), # can be weight
    3: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv1D filter
    4: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv2D filter
    "default": lambda x: torch.nn.init.constant_(x, 1.), # everything else
}

init_all(model, init_funcs)

You can try torch.nn.init.constant_(x, len(x.shape)) to check that the parameters are initialized as expected:

init_funcs = {
    "default": lambda x: torch.nn.init.constant_(x, len(x.shape))
}
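To make that check concrete without depending on the UNet repository, here is a sketch with a small stand-in model (hypothetical; the Linear input size assumes 8x8 images, but no forward pass is run). Each parameter is filled with its own number of dimensions, so a 4-D conv filter ends up as all 4s, a 2-D weight matrix as all 2s, and biases as all 1s.

```python
import torch
import torch.nn as nn

# Small stand-in model (hypothetical; used here instead of UNet).
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3),   # weight: 4-D, bias: 1-D
    nn.Flatten(),
    nn.Linear(4 * 6 * 6, 2),          # weight: 2-D, bias: 1-D
)

def init_all(model, init_funcs):
    # Pick the init function by the number of dimensions of each parameter.
    for p in model.parameters():
        init_func = init_funcs.get(len(p.shape), init_funcs["default"])
        init_func(p)

# Fill every parameter with its own dimensionality.
init_all(model, {"default": lambda x: nn.init.constant_(x, len(x.shape))})

# Report the first element of each parameter, keyed by parameter name.
values = {name: p.flatten()[0].item() for name, p in model.named_parameters()}
print(values)
```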

Question 9

If you see a deprecation warning (@Fábio Perez), use the in-place functions whose names end in an underscore:

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

Question 10

Because I don't have enough reputation yet, I can't add a comment under

the answer posted by prosti on Jun 26 '19 at 13:16.

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

But I want to point out that we actually know some assumptions in Kaiming He's paper, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, are not appropriate, even though the deliberately designed initialization method seems to be a hit in practice.

For example, in the subsection Backward Propagation Case, they assume that $w_l$ and $\delta y_l$ are independent of each other. But as we know, if we take the score map $\delta y^L_i$ as an instance, it is often $y_i - \mathrm{softmax}(y^L_i) = y_i - \mathrm{softmax}(w^L_i x^L_i)$ when we use a typical cross-entropy loss objective, which depends on the weights.
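For reference, the dependence can be made explicit with the standard softmax cross-entropy gradient. This is a sketch that assumes one-hot labels $y$ and writes $W^L$ for the stacked last-layer weights $w^L_i$; note that the expression $y - \mathrm{softmax}(\cdot)$ used above is the negative of this gradient, which does not affect the argument.

```latex
\begin{aligned}
p_k &= \operatorname{softmax}(y^L)_k = \frac{e^{y^L_k}}{\sum_j e^{y^L_j}}, \qquad y^L = W^L x^L, \\
\ell &= -\sum_k y_k \log p_k, \\
\frac{\partial \ell}{\partial y^L_k} &= p_k - y_k
  = \operatorname{softmax}(W^L x^L)_k - y_k
  \qquad \Bigl(\text{using } \sum_k y_k = 1\Bigr).
\end{aligned}
```

The last line is a function of $W^L$, so the gradient signal at the top layer is not independent of that layer's weights.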

So I think the true underlying reason why He's initialization works well remains to be unraveled, even though everyone has witnessed its power in boosting deep learning training.