std::vector performance regression when enabling C++11

235

I have found an interesting performance regression in a small C++ snippet when I enable C++11:

#include <cstddef>
#include <vector>

struct Item
{
  int a;
  int b;
};

int main()
{
  const std::size_t num_items = 10000000;
  std::vector<Item> container;
  container.reserve(num_items);
  for (std::size_t i = 0; i < num_items; ++i) {
    container.push_back(Item());
  }
  return 0;
}

With g++ (GCC) 4.8.2 20131219 (prerelease) and C++03 I get:

milian:/tmp$ g++ -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        35.206824 task-clock                #    0.988 CPUs utilized            ( +-  1.23% )
                4 context-switches          #    0.116 K/sec                    ( +-  4.38% )
                0 cpu-migrations            #    0.006 K/sec                    ( +- 66.67% )
              849 page-faults               #    0.024 M/sec                    ( +-  6.02% )
       95,693,808 cycles                    #    2.718 GHz                      ( +-  1.14% ) [49.72%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       95,282,359 instructions              #    1.00  insns per cycle          ( +-  0.65% ) [75.27%]
       30,104,021 branches                  #  855.062 M/sec                    ( +-  0.87% ) [77.46%]
            6,038 branch-misses             #    0.02% of all branches          ( +- 25.73% ) [75.53%]

      0.035648729 seconds time elapsed                                          ( +-  1.22% )

With C++11 enabled, the performance degrades significantly:

milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        86.485313 task-clock                #    0.994 CPUs utilized            ( +-  0.50% )
                9 context-switches          #    0.104 K/sec                    ( +-  1.66% )
                2 cpu-migrations            #    0.017 K/sec                    ( +- 26.76% )
              798 page-faults               #    0.009 M/sec                    ( +-  8.54% )
      237,982,690 cycles                    #    2.752 GHz                      ( +-  0.41% ) [51.32%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
      135,730,319 instructions              #    0.57  insns per cycle          ( +-  0.32% ) [75.77%]
       30,880,156 branches                  #  357.057 M/sec                    ( +-  0.25% ) [75.76%]
            4,188 branch-misses             #    0.01% of all branches          ( +-  7.59% ) [74.08%]

    0.087016724 seconds time elapsed                                          ( +-  0.50% )

Can someone explain this? So far my experience has been that the STL gets faster with C++11 enabled, especially thanks to move semantics.

EDIT: As suggested, using container.emplace_back(); instead brings the performance on par with the C++03 version. How can the C++03 version achieve the same with push_back?

milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        36.229348 task-clock                #    0.988 CPUs utilized            ( +-  0.81% )
                4 context-switches          #    0.116 K/sec                    ( +-  3.17% )
                1 cpu-migrations            #    0.017 K/sec                    ( +- 36.85% )
              798 page-faults               #    0.022 M/sec                    ( +-  8.54% )
       94,488,818 cycles                    #    2.608 GHz                      ( +-  1.11% ) [50.44%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       94,851,411 instructions              #    1.00  insns per cycle          ( +-  0.98% ) [75.22%]
       30,468,562 branches                  #  840.991 M/sec                    ( +-  1.07% ) [76.71%]
            2,723 branch-misses             #    0.01% of all branches          ( +-  9.84% ) [74.81%]

   0.036678068 seconds time elapsed                                          ( +-  0.80% )
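
For reference, the two variants compared in this edit can be sketched side by side. The function names fill_with_push_back and fill_with_emplace_back are made up for illustration and are not part of the benchmarked program; both are assumed to produce the same container contents, differing only in how each element is appended.

```cpp
#include <cstddef>
#include <vector>

struct Item {
  int a;
  int b;
};

// Hypothetical helper: appends n Items by creating a temporary and
// passing it to push_back (in C++11 this selects push_back(Item&&)).
std::vector<Item> fill_with_push_back(std::size_t n) {
  std::vector<Item> container;
  container.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    container.push_back(Item());
  }
  return container;
}

// Hypothetical helper: appends n Items that are value-initialized
// directly inside the vector's storage, with no temporary involved.
std::vector<Item> fill_with_emplace_back(std::size_t n) {
  std::vector<Item> container;
  container.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    container.emplace_back();
  }
  return container;
}
```

Both helpers need -std=c++11 or later, since emplace_back does not exist in C++03.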

— milianw

If you compile to assembly, you can see what is going on under the hood. See also stackoverflow.com/questions/8021874/…

— Cogwheel

What happens if you change push_back(Item()) to emplace_back() in the C++11 version?

— Cogwheel

See above, that "fixes" the regression. I still wonder why push_back regresses in performance between C++03 and C++11, though.

— milianw

@milianw It turns out I was compiling the wrong program. Ignore my comments.

With clang 3.4 the C++11 version is faster: 0.047s vs. 0.058s for the C++98 version.

— กองกำลัง

247

I can reproduce your results on my machine with the options you give in your post.

However, if I also enable link-time optimization (I pass the -flto flag to gcc 4.7.2 as well), the results are identical:

(I am compiling your original code, with container.push_back(Item());)

$ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.426793 task-clock                #    0.986 CPUs utilized            ( +-  1.75% )
                 4 context-switches          #    0.116 K/sec                    ( +-  5.69% )
                 0 CPU-migrations            #    0.006 K/sec                    ( +- 66.67% )
            19,801 page-faults               #    0.559 M/sec                  
        99,028,466 cycles                    #    2.795 GHz                      ( +-  1.89% ) [77.53%]
        50,721,061 stalled-cycles-frontend   #   51.22% frontend cycles idle     ( +-  3.74% ) [79.47%]
        25,585,331 stalled-cycles-backend    #   25.84% backend  cycles idle     ( +-  4.90% ) [73.07%]
       141,947,224 instructions              #    1.43  insns per cycle        
                                             #    0.36  stalled cycles per insn  ( +-  0.52% ) [88.72%]
        37,697,368 branches                  # 1064.092 M/sec                    ( +-  0.52% ) [88.75%]
            26,700 branch-misses             #    0.07% of all branches          ( +-  3.91% ) [83.64%]

       0.035943226 seconds time elapsed                                          ( +-  1.79% )



$ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.510495 task-clock                #    0.988 CPUs utilized            ( +-  2.54% )
                 4 context-switches          #    0.101 K/sec                    ( +-  7.41% )
                 0 CPU-migrations            #    0.003 K/sec                    ( +-100.00% )
            19,801 page-faults               #    0.558 M/sec                    ( +-  0.00% )
        98,463,570 cycles                    #    2.773 GHz                      ( +-  1.09% ) [77.71%]
        50,079,978 stalled-cycles-frontend   #   50.86% frontend cycles idle     ( +-  2.20% ) [79.41%]
        26,270,699 stalled-cycles-backend    #   26.68% backend  cycles idle     ( +-  8.91% ) [74.43%]
       141,427,211 instructions              #    1.44  insns per cycle        
                                             #    0.35  stalled cycles per insn  ( +-  0.23% ) [87.66%]
        37,366,375 branches                  # 1052.263 M/sec                    ( +-  0.48% ) [88.61%]
            26,621 branch-misses             #    0.07% of all branches          ( +-  5.28% ) [83.26%]

       0.035953916 seconds time elapsed

As for the reason, one has to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function
void std::vector<Item,std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)
fails in C++11 mode with the default inline-limit.

This failed inline has a domino effect. Not because the function is being called (it is never even called!), but because the code has to be prepared for the call: if it were called, the function arguments (Item.a and Item.b) would already have to be in the right place. This leads to pretty messy code.

Here is the relevant part of the generated code for the case where inlining succeeds:

.L42:
    testq   %rbx, %rbx  # container$D13376$_M_impl$_M_finish
    je  .L3 #,
    movl    $0, (%rbx)  #, container$D13376$_M_impl$_M_finish_136->a
    movl    $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
.L3:
    addq    $8, %rbx    #, container$D13376$_M_impl$_M_finish
    subq    $1, %rbp    #, ivtmp.106
    je  .L41    #,
.L14:
    cmpq    %rbx, %rdx  # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
    jne .L42    #,

This is a nice and compact for loop. Now let's compare it with the case where inlining fails:

.L49:
    testq   %rax, %rax  # D.15772
    je  .L26    #,
    movq    16(%rsp), %rdx  # D.13379, D.13379
    movq    %rdx, (%rax)    # D.13379, *D.15772_60
.L26:
    addq    $8, %rax    #, tmp75
    subq    $1, %rbx    #, ivtmp.117
    movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
    je  .L48    #,
.L28:
    movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
    cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
    movl    $0, 16(%rsp)    #, D.13379.a
    movl    $0, 20(%rsp)    #, D.13379.b
    jne .L49    #,
    leaq    16(%rsp), %rsi  #,
    leaq    32(%rsp), %rdi  #,
    call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

This code is cluttered, and there is a lot more going on in the loop than in the previous case. Before the function call (the last line shown), the arguments must be placed appropriately:

leaq    16(%rsp), %rsi  #,
leaq    32(%rsp), %rdi  #,
call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

Even though this call is never actually executed, the loop sets things up for it beforehand:

movl    $0, 16(%rsp)    #, D.13379.a
movl    $0, 20(%rsp)    #, D.13379.b

This is what makes the code messy. If there is no function call because inlining succeeds, we have only 2 move instructions in the loop and nothing touches %rsp (the stack pointer). If inlining fails, however, we get 6 moves and a lot of traffic through %rsp.
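
The shape of the problem can be sketched in source form. The class below is a hypothetical miniature vector written only for illustration (TinyVec and grow_and_append are invented names, and this is not libstdc++'s implementation); it assumes GCC or Clang for the noinline attribute, which is used here to mimic a failed inline of the slow path.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

struct Item {
  int a;
  int b;
};

// Hypothetical miniature vector, for illustration only. It handles
// only trivially copyable Items and is NOT how libstdc++ is written.
class TinyVec {
 public:
  TinyVec() : start_(nullptr), finish_(nullptr), end_of_storage_(nullptr) {}
  ~TinyVec() { std::free(start_); }
  TinyVec(const TinyVec&) = delete;
  TinyVec& operator=(const TinyVec&) = delete;

  std::size_t size() const {
    return static_cast<std::size_t>(finish_ - start_);
  }
  const Item& operator[](std::size_t i) const { return start_[i]; }

  void push_back(const Item& value) {
    if (finish_ != end_of_storage_) {
      // Fast path: there is room, so store the element in place.
      ::new (static_cast<void*>(finish_)) Item(value);
      ++finish_;
    } else {
      // Slow path: rarely taken, but since it is a real call, the
      // caller must be able to pass the address of `value`.
      grow_and_append(value);
    }
  }

 private:
  // GCC/Clang-specific attribute; keeps the slow path out of line.
  __attribute__((noinline)) void grow_and_append(const Item& value) {
    const Item copy = value;  // `value` may alias storage that moves
    const std::size_t old_size = size();
    const std::size_t new_cap = old_size ? 2 * old_size : 1;
    void* p = std::realloc(start_, new_cap * sizeof(Item));
    if (!p) throw std::bad_alloc();
    start_ = static_cast<Item*>(p);
    finish_ = start_ + old_size;
    end_of_storage_ = start_ + new_cap;
    ::new (static_cast<void*>(finish_)) Item(copy);
    ++finish_;
  }

  Item* start_;
  Item* finish_;
  Item* end_of_storage_;
};
```

When the slow path stays an out-of-line call, the element has to exist in memory before the branch so that its address can be handed over, which is the kind of extra stack traffic visible in the second listing. Whether a given compiler version actually produces that code for this sketch is not something I have measured.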

Just to substantiate my theory (note the -finline-limit), both runs in C++11 mode:

 $ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         84.739057 task-clock                #    0.993 CPUs utilized            ( +-  1.34% )
                 8 context-switches          #    0.096 K/sec                    ( +-  2.22% )
                 1 CPU-migrations            #    0.009 K/sec                    ( +- 64.01% )
            19,801 page-faults               #    0.234 M/sec                  
       266,809,312 cycles                    #    3.149 GHz                      ( +-  0.58% ) [81.20%]
       206,804,948 stalled-cycles-frontend   #   77.51% frontend cycles idle     ( +-  0.91% ) [81.25%]
       129,078,683 stalled-cycles-backend    #   48.38% backend  cycles idle     ( +-  1.37% ) [69.49%]
       183,130,306 instructions              #    0.69  insns per cycle        
                                             #    1.13  stalled cycles per insn  ( +-  0.85% ) [85.35%]
        38,759,720 branches                  #  457.401 M/sec                    ( +-  0.29% ) [85.43%]
            24,527 branch-misses             #    0.06% of all branches          ( +-  2.66% ) [83.52%]

       0.085359326 seconds time elapsed                                          ( +-  1.31% )

 $ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         37.790325 task-clock                #    0.990 CPUs utilized            ( +-  2.06% )
                 4 context-switches          #    0.098 K/sec                    ( +-  5.77% )
                 0 CPU-migrations            #    0.011 K/sec                    ( +- 55.28% )
            19,801 page-faults               #    0.524 M/sec                  
       104,699,973 cycles                    #    2.771 GHz                      ( +-  2.04% ) [78.91%]
        58,023,151 stalled-cycles-frontend   #   55.42% frontend cycles idle     ( +-  4.03% ) [78.88%]
        30,572,036 stalled-cycles-backend    #   29.20% backend  cycles idle     ( +-  5.31% ) [71.40%]
       140,669,773 instructions              #    1.34  insns per cycle        
                                             #    0.41  stalled cycles per insn  ( +-  1.40% ) [88.14%]
        38,117,067 branches                  # 1008.646 M/sec                    ( +-  0.65% ) [89.38%]
            27,519 branch-misses             #    0.07% of all branches          ( +-  4.01% ) [86.16%]

       0.038187580 seconds time elapsed                                          ( +-  2.05% )

Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.

So what is the takeaway from this story? Failed inlines can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link-time optimization. It gave a significant performance boost to my programs (up to 2.5x), and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)

However, I do not recommend littering your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as white space anyway.)
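
To make concrete what the keyword does and does not guarantee, here is a small sketch. All names (Point, twice, thrice) are hypothetical, and the attribute on the last function is a GCC/Clang extension rather than standard C++; it is shown only to contrast a linkage-level inline with an explicit inlining request, not as a recommendation.

```cpp
// Imagine these definitions living in a header that several .cpp
// files include.

struct Point {
  int x;
  int y;
  // Defined inside the class body, so it is implicitly inline.
  int sum() const { return x + y; }
};

// A non-member function defined in a header needs `inline`; without
// it, including the header from two translation units leads to
// multiple-definition errors at link time. The keyword does not
// force the optimizer to inline any call.
inline int twice(int v) { return 2 * v; }

// GCC/Clang-specific: explicitly asks for inlining at call sites.
__attribute__((always_inline)) inline int thrice(int v) { return 3 * v; }
```

In this sketch only the last declaration expresses an actual inlining request; the first two are about where a definition may legally appear.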

Great question, +1!

— Ali

NB: inline has nothing to do with function inlining; it means “defined inline”, not “please inline this”. If you really want to ask for inlining, use __attribute__((always_inline)) or similar.

— Jon Purdy

@JonPurdy Not quite; for example, class member functions are implicitly inline. inline is also a request to the compiler that you would like the function to be inlined, and the Intel C++ Compiler, for example, used to give performance warnings if it did not fulfill your request. (I haven't checked recently whether ICC still does.) Unfortunately, I have seen people littering their code with inline and waiting for a miracle to happen. I wouldn't use __attribute__((always_inline)); chances are that the compiler developers know better what to inline and what not to. (Despite the counterexample here.)

— Ali

@JonPurdy On the other hand, if you define a function inline that is not a member function of a class, you of course have no choice but to mark it inline; otherwise you will get multiple-definition errors from the linker. If that's what you meant, then OK.

— Ali

Yes, that's what I meant. The standard does say “The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism.” (§7.1.2.2) However, implementations are not required to perform that optimisation, because it is largely a coincidence that inline functions often happen to be good candidates for inlining. So it is better to be explicit and use a compiler pragma.

— Jon Purdy

@JonPurdy As for the first half: yes, that's what I meant by saying "The optimizer is allowed to treat the inline keyword as white space anyway." As for the compiler pragma, I wouldn't use that; I would leave it to link-time optimization to decide whether to inline or not. It does a pretty good job. It also automatically fixes the issue discussed in this answer.

— Ali