Regex: Match an egalitarian series

Introduction

I don't see many regex challenges on here, so I would like to offer up this deceptively simple one, which can be done in a number of ways using a number of regex flavours. I hope it provides regex enthusiasts with a bit of fun golfing time.

Challenge

The challenge is to match what I've very loosely dubbed an "egalitarian" series: a series of equal numbers of different characters. This is best described with examples.

Match:

aaabbbccc
xyz 
iillppddff
ggggggoooooollllllffffff
abc
banana

Don't match:

aabc
xxxyyzzz
iilllpppddff
ggggggoooooollllllfff
aaaaaabbbccc
aaabbbc
abbaa
aabbbc

To generalize, we want to match a subject of the form (c₁)ⁿ(c₂)ⁿ(c₃)ⁿ...(c_k)ⁿ for any list of characters c₁ to c_k, where c_i != c_i+1 for all i, k > 1, and n > 0.

Clarifications:

Input will not be empty.
A character may repeat itself later in the string (e.g. "banana").
k > 1, so there will always be at least 2 different characters in the string.
You can assume that only ASCII characters will be passed as input and that no character will be a line terminator.
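The definition above can be cross-checked outside of regex. What follows is a minimal Python sketch, not part of the challenge itself; the function name is_egalitarian and the two test lists are introduced here for illustration only. It splits the string into maximal runs and requires more than one run, all of the same length.

```python
from itertools import groupby

def is_egalitarian(s: str) -> bool:
    # groupby yields maximal runs of identical characters, so adjacent
    # runs always differ in character (c_i != c_i+1 holds automatically).
    run_lengths = [len(list(group)) for _, group in groupby(s)]
    # k > 1 runs, and every run has the same length n.
    return len(run_lengths) > 1 and len(set(run_lengths)) == 1

# Examples copied from the challenge text.
MATCH = ["aaabbbccc", "xyz", "iillppddff", "ggggggoooooollllllffffff", "abc", "banana"]
NO_MATCH = ["aabc", "xxxyyzzz", "iilllpppddff", "ggggggoooooollllllfff",
            "aaaaaabbbccc", "aaabbbc", "abbaa", "aabbbc"]

print(all(is_egalitarian(s) for s in MATCH))     # True
print(any(is_egalitarian(s) for s in NO_MATCH))  # False
```

A checker like this is handy as an oracle when testing a candidate regex against randomly generated strings.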

Rules

(Thank you to Martin Ender for this excellently stated block of rules.)

Your answer should consist of a single regex, without any additional code (except, optionally, a list of regex modifiers required to make your solution work). You must not use features of your language's regex flavour that allow you to invoke code in the hosting language (e.g. Perl's e modifier).

You can use any regex flavour that existed before this challenge, but please specify the flavour.

Do not assume that the regex is anchored implicitly, e.g. if you're using Python, assume that your regex is used with re.search and not with re.match. Your regex must match the entire string for valid egalitarian strings and yield no matches for invalid strings. You may use as many capturing groups as you wish.
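To illustrate the anchoring rule, here is a small Python sketch. The pattern a+b+ and the helper whole_string_match are hypothetical, chosen only to show the difference between re.match and re.search; they are not a solution to the challenge.

```python
import re

unanchored = r"a+b+"    # hypothetical pattern without anchors
anchored = r"^a+b+$"    # the same pattern with explicit anchors

# re.match only tries position 0, which acts like an implicit start anchor.
print(re.match(unanchored, "xxaabb"))            # None
# re.search scans the whole string; this is the behaviour the rules assume.
print(re.search(unanchored, "xxaabb").group())   # aabb
# With explicit anchors, re.search accepts only whole-string matches.
print(re.search(anchored, "xxaabb"))             # None
print(re.search(anchored, "aabb").group())       # aabb

def whole_string_match(pattern: str, s: str) -> bool:
    # True only if re.search finds a match that covers the entire string.
    m = re.search(pattern, s)
    return m is not None and m.group() == s
```

A submission therefore has to carry its own anchors (or an equivalent construct) to satisfy the "match the entire string" requirement.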

You may assume that the input will always be a string of two or more ASCII characters not containing any line terminators.

This is regex golf, so the shortest regex in bytes wins. If your language requires delimiters (usually /.../) to denote regular expressions, don't count the delimiters themselves. If your solution requires modifiers, add one byte per modifier.
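The scoring rule can be written down as a one-line helper. This is a sketch under the assumption that modifiers are passed as a string of single-letter flags; the name regex_score and the sample pattern are made up for illustration.

```python
def regex_score(pattern: str, modifiers: str = "") -> int:
    # Bytes of the bare pattern (delimiters excluded) plus one byte per modifier.
    return len(pattern.encode("utf-8")) + len(modifiers)

# Hypothetical submission written as /^a+b+$/i:
# six pattern bytes plus one modifier.
print(regex_score(r"^a+b+$", "i"))  # 7
```

Passing the pattern as a raw string keeps backslashes from being interpreted by Python, so the count reflects the regex as submitted.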

Criteria

This is good old-fashioned golf, so forget efficiency and just try to get your regex as small as possible.

Please mention which regex flavour you have used and, if possible, include a link showing an online demo of your expression in action.

code-golf string regular-expression

— jaytea

Is this regex golf? You should probably clarify that, along with the rules for it. Most challenges on this site are golfs across assorted programming languages.

— LyricLy

@LyricLy Thanks for the advice! Yes, I'd like it to be purely regex, i.e. a single regular expression in a regex flavour of the submitter's choice. Are there any other rules I should be mindful of including?

— jaytea

I don't understand your definition of "egalitarian" such that banana is egalitarian.

— msh210

@msh210 When I came up with the term "egalitarian" to describe the series, I didn't consider that I would be allowing characters to be repeated later in the series (such as in "banana" or "aaabbbcccaaa", etc.). I just wanted a term to represent the idea that every chunk of repeated characters is the same size. Since "banana" has no repeated characters, this definition holds true for it.

— jaytea

Answers:

.NET flavour, 48 bytes

^(.)\1*((?<=(\5())*(.))(.)(?<-4>\6)*(?!\4|\6))+$

Try it online! (Using Retina)

It turns out that not negating the logic is simpler after all. I'm making this a separate answer, because the two approaches are completely different.

Explanation

^            # Anchor the match to the beginning of the string.
(.)\1*       # Match the first run of identical characters. In principle, 
             # it's possible that this matches only half, a quarter, an 
             # eighth etc. of the first run, but that won't affect the 
             # result of the match (in other words, if the match fails with 
             # matching this as the entire first run, then backtracking into
             # only matching half of it won't cause the rest of the regex to
             # match either).
(            # Match this part one or more times. Each instance matches one
             # run of identical letters.
  (?<=       #   We start with a lookbehind to record the length
             #   of the preceding run. Remember that the lookbehind
             #   should be read from the bottom up (and so should
             #   my comments).
    (\5())*  #     And then we match all of its adjacent copies, pushing an
             #     empty capture onto stack 4 each time. That means at the
             #     end of the lookbehind, we will have n-1 captures on stack 4, 
             #     where n is the length of the preceding run. Due to the 
             #     atomic nature of lookbehinds, we don't have to worry 
             #     about backtracking matching less than n-1 copies here.
    (.)      #     We capture the character that makes up the preceding
             #     run in group 5.
  )
  (.)        #   Capture the character that makes up the next run in group 6.
  (?<-4>\6)* #   Match copies of that character while depleting stack 4.
             #   If the runs are the same length that means we need to be
             #   able to get to the end of the run at the same time we
             #   empty stack 4 completely.
  (?!\4|\6)  #   This lookahead ensures that. If stack 4 is not empty yet,
             #   \4 will match, because the captures are all empty, so the
             #   backreference can't fail. If the stack is empty though,
             #   then the backreference will always fail. Similarly, if we
             #   are not at the end of the run yet, then \6 will match 
             #   another copy of the run. So we ensure that neither \4 nor
             #   \6 are possible at this position to assert that this run
             #   has the same length as the previous one.
)+
$            # Finally, we make sure that we can cover the entire string
             # by going through runs of identical lengths like this.
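As an aside that is not part of the original answer, the run-by-run logic of the commented regex can be mirrored procedurally. The Python sketch below uses an integer counter in place of capture stack 4 and follows only the greedy path (no backtracking), so it is an analogue rather than a full emulation, and it assumes input that satisfies the challenge's guarantee of at least two distinct characters. The name matches_positive is made up for this illustration.

```python
def matches_positive(s: str) -> bool:
    if not s:
        return False
    # ^(.)\1*  : consume the first run greedily.
    i = 0
    while i < len(s) and s[i] == s[0]:
        i += 1
    runs_matched = 0
    while i < len(s):
        # Lookbehind: walk left over the preceding run and count its
        # n-1 extra copies (the empty captures pushed onto stack 4).
        prev = s[i - 1]
        stack = 0
        j = i - 1
        while j - 1 >= 0 and s[j - 1] == prev:
            stack += 1
            j -= 1
        # (.)  : character of the next run (group 6).
        cur = s[i]
        i += 1
        # (?<-4>\6)*  : consume copies while popping from the stack.
        while stack > 0 and i < len(s) and s[i] == cur:
            stack -= 1
            i += 1
        # (?!\4|\6)  : fail if the stack is not empty or the run continues.
        if stack > 0 or (i < len(s) and s[i] == cur):
            return False
        runs_matched += 1
    # ( ... )+ needs at least one iteration; reaching here means $ holds.
    return runs_matched >= 1

print(matches_positive("aaabbbccc"))  # True
print(matches_positive("aaabbbc"))    # False
```

The two failure conditions in the final check correspond to the two alternatives inside the negative lookahead: leftover captures mean the current run is too short, and a further copy of the character means it is too long.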

— Martin Ender

I love that you see-sawed between the two approaches! I also thought the negative approach ought to be shorter, until I tried it and found it much more awkward (even though it feels like it should be simpler). I have 48b in PCRE and 49b in Perl with a completely different method, and with your third method in .NET at around the same size, I'd say this is a pretty cool regex challenge :D

— jaytea

@jaytea I'd love to see those. If no one comes up with anything for a week or so, I hope you'll post them yourself. :) And yes, agreed, it's nice that the approaches are so close in byte count.

— Martin Ender

I just might! Also, the Perl one has been golfed down to 46b ;)

— jaytea

So I figured you might want to see these now! Here's 48b in PCRE: ((^.|\2(?=.*\4\3)|\4(?!\3))(?=\2*+((.)\3?)))+\3$ I was experimenting with \3* in place of (?!\3) to make it 45b, but that fails on "aabbbc" :( The Perl version is easier to understand, and it's down to 45b now: ^((?=(.)\2*(.))(?=(\2(?4)?\3)(?!\3))\2+)+\3+$ The reason I call it Perl, even though it seems to be valid PCRE, is that PCRE thinks (\2(?4)?\3) could recurse indefinitely, whereas Perl is a little smarter/more forgiving!

— jaytea

@jaytea Ah, those are really neat solutions. You should post them in a separate answer. :)

— Martin Ender

.NET flavour, 54 bytes

^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+

Try it online! (Using Retina)

I'm pretty sure this is suboptimal, but it's the best I can come up with for balancing groups right now. I've got one alternative at the same byte count, which is mostly the same:

^(?!.*(?<=(\3())*(.))(?!\3)(?>(.)(?<-2>\4)*)(\2|\4)).+

Explanation

The main idea is to invert the problem: match non-egalitarian strings and put the whole thing in a negative lookahead to negate the result. The benefit is that we don't have to keep track of n throughout the entire string to check that all runs are of equal length (due to the nature of balancing groups, you usually consume n when checking it). Instead, we just look for a single pair of adjacent runs that don't have the same length. That way, I only need to use n once.
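The inversion described above can be sketched in a few lines of Python. This is an illustration of the idea only, not a translation of the regex; the function names are made up here. Like the regex, it relies on the challenge's guarantee that the input contains at least two distinct characters, since a single run has no adjacent pair to compare and would be accepted.

```python
from itertools import groupby

def has_unequal_adjacent_runs(s: str) -> bool:
    # Lengths of the maximal runs of identical characters, in order.
    lengths = [len(list(group)) for _, group in groupby(s)]
    # One pair of neighbouring runs with different lengths is enough
    # to prove that the string is not egalitarian.
    return any(a != b for a, b in zip(lengths, lengths[1:]))

def is_egalitarian_inverted(s: str) -> bool:
    # Negating the search plays the role of the negative lookahead.
    return not has_unequal_adjacent_runs(s)

print(is_egalitarian_inverted("banana"))    # True
print(is_egalitarian_inverted("xxxyyzzz"))  # False
```

Comparing only neighbouring runs is sufficient because equality of lengths is transitive: if no adjacent pair differs, all runs share the same length.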

Here is a breakdown of the regex.

^(?!.*         # This negative lookahead means that we will match
               # all strings where the pattern inside the lookahead
               # would fail if it were used as a regex on its own.
               # Due to the .* that inner regex can match from any
               # position inside the string. The particular position
               # we're looking for is between two runs (and this
               # will be ensured later).

  (?<=         #   We start with a lookbehind to record the length
               #   of the preceding run. Remember that the lookbehind
               #   should be read from the bottom up (and so should
               #   my comments).
    (\2)*      #     And then we match all of its adjacent copies, capturing
               #     them separately in group 1. That means at the
               #     end of the lookbehind, we will have n-1 captures
               #     on stack 1, where n is the length of the preceding
               #     run. Due to the atomic nature of lookbehinds, we
               #     don't have to worry about backtracking matching
               #     less than n-1 copies here.
    (.)        #     We capture the character that makes up the preceding
               #     run in group 2.
  )
  (?!\2)       #   Make sure the next character isn't the same as the one
               #   we used for the preceding run. This ensures we're at a
               #   boundary between runs.
  (?>          #   Match the next stuff with an atomic group to avoid
               #   backtracking.
    (.)        #     Capture the character that makes up the next run
               #     in group 3.
    (?<-1>\3)* #     Match as many of these characters as possible while
               #     depleting the captures on stack 1.
  )
               #   Due to the atomic group, there are two possible
               #   situations that cause the previous quantifier to stop
               #   matching. 
               #   Either the run has ended, or stack 1 has been depleted.
               #   If both of those are true, the runs are the same length,
               #   and we don't actually want a match here. But if the runs
               #   are of different lengths then either the run ended but
               #   the stack isn't empty yet, or the stack was depleted but
               #   the run hasn't ended yet.
  (?(1)|\3)    #   This conditional matches these last two cases. If there's
               #   still a capture on stack 1, we don't match anything,
               #   because we know this run was shorter than the previous
               #   one. But if stack 1 is empty, we want to match another copy of 
               #   the character in this run to ensure that this run is 
               #   longer than the previous one.
)
.+             # Finally we just match the entire string to comply with the
               # challenge spec.

— Martin Ender

I tried to make it fail on: banana, aba, bbbaaannnaaannnaaa, bbbaaannnaaannnaaaaaa, The Nineteenth Byte, 11, 110, ^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+, bababa. It was me who failed. :( +1

— Erik the Outgolfer

That moment when you finish your explanation and then figure out that you can save 1 byte by using the exact opposite approach... I guess I'll make another answer in a bit... :|

— Martin Ender

@MartinEnder ... And then you realize you can golf this one down by 2 bytes, haha :P

— Mr. Xcoder

@Mr.Xcoder It would have to be 7 bytes now, so I hope I'm safe. ;)

— Martin Ender