In Python, how do I split a string and keep the separators?

226

Here's the simplest way to explain this. This is what I'm using:

re.split('\W', 'foo/bar spam\neggs')
-> ['foo', 'bar', 'spam', 'eggs']

Here's what I want:

someMethod('\W', 'foo/bar spam\neggs')
-> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

The reason is that I want to split a string into tokens, manipulate them, and then put the string back together again.

python regex

— Ken Kinder

3

What does \W stand for? My Googling failed.

— Ooker

8

A non-word character; see here for details.

— Russell

For splitting a raw byte string rather than a str, see Stackoverflow.com/questions/62591863/

— Lorenz

295

>>> re.split('(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

— Commodore Jaeger

22

That's cool. I didn't know re.split did that with capture groups.

— Laurence Gonsalves

16

@Laurence: It's documented: docs.python.org/library/re.html#re.split : "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."

— Vinay Sajip
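To make the documented behaviour concrete, here is a small sketch that compares a pattern without and with capturing parentheses and then performs the round trip the question asks for. The upper-casing step is only a placeholder for whatever token processing is actually needed.

```python
import re

text = 'foo/bar spam\neggs'

# Without a capturing group the separators are dropped.
plain = re.split(r'\W', text)

# With a capturing group each separator comes back as its own list item.
tokens = re.split(r'(\W)', text)

# Tokens sit at even indexes and separators at odd indexes, so the tokens
# can be changed (upper-cased here, purely as an illustration) and the
# pieces joined back into one string.
processed = [t.upper() if i % 2 == 0 else t for i, t in enumerate(tokens)]
rebuilt = ''.join(processed)

print(plain)
print(tokens)
print(repr(rebuilt))
```

Because nothing is discarded, ''.join(tokens) reproduces the original string exactly.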

40

It's seriously underdocumented. I've been using Python for 14 years and only just found this out.

— smci

19

Is there an option to have the output of the group match attached to whatever is on the left of the split? For example, can this be easily modified so that the output is ['foo', '/bar', ' spam', '\neggs']?

— ely
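One possible way to get that shape, assuming Python 3.7 or later (where re.split accepts patterns that match the empty string), is to split on a zero-width lookaround instead of on the separator itself. This is a sketch, not something proposed in the thread.

```python
import re

text = 'foo/bar spam\neggs'

# Zero-width lookahead: split just before each non-word character, so the
# separator stays attached to the token that follows it.
attached_right = re.split(r'(?=\W)', text)

# Zero-width lookbehind: split just after each non-word character, so the
# separator stays attached to the token that precedes it.
attached_left = re.split(r'(?<=\W)', text)

print(attached_right)  # ['foo', '/bar', ' spam', '\neggs']
print(attached_left)   # ['foo/', 'bar ', 'spam\n', 'eggs']
```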

3

@Mr.F You might be able to do something with re.sub. I wanted to split on a trailing percent sign, so I substituted in a doubled character and then split. It's hacky, but it worked for my case: re.split('% ', re.sub('% ', '%% ', '5.000% Additional Whatnot')) -> ['5.000%', 'Additional Whatnot']

— Kyle James Walker

29

If you are splitting on newlines, use splitlines(True).

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(Not a general solution, but adding this here in case someone comes here not realizing this method exists.)

— Mark Lodato
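As a small illustration of how this differs from split('\n'), the sketch below feeds splitlines(True) a string with mixed line endings; the sample text is made up for the example.

```python
# Sample input with three different line endings and no final newline.
text = 'line 1\r\nline 2\rline 3\nlast line'

# keepends=True leaves each line terminator attached to its line.
lines = text.splitlines(True)
print(lines)  # ['line 1\r\n', 'line 2\r', 'line 3\n', 'last line']

# Unlike split('\n'), a trailing newline does not produce an empty last item.
with_split = 'a\n'.split('\n')
with_splitlines = 'a\n'.splitlines(True)
print(with_split, with_splitlines)
```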

12

Another no-regex solution that works well on Python 3

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
   if not s: return [''] # consistent with string.split()

   # Find replacement character that is not used in string
   # i.e. just use the highest available character plus one
   # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
   p=chr(ord(max(s))+1) 

   return s.replace(sep, sep+p).split(p)

for s in test_strings:
   print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
print(split_and_keep(ridiculous_string, '<'))

— ootwch

10

If you have only one separator, you can use list comprehensions:

text = 'foo,bar,baz,qux'  
sep = ','

Appending/prepending the separator:

result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of trailing
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']

result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of leading
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']

Separator as its own element:

result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
result = result[:-1]   # to get rid of trailing

— Granitosaurus

1

You can also add if x to ensure that each chunk produced by split has some content, i.e. result = [x + sep for x in text.split(sep) if x]

— i alarmed alien

For me, strip removed too much and I had to use this: result = [sep+x for x in data.split(sep)]; result[0] = result[0][len(sep):]

— scottlittle
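The difference matters because str.strip treats its argument as a set of characters to remove from both ends, not as a prefix. A minimal sketch with made-up input where the first chunk itself starts with a separator character:

```python
text = '-5--3'
sep = '--'

parts = [sep + x for x in text.split(sep)]
print(parts)  # ['---5', '--3']

# strip() removes every leading and trailing '-' and loses the minus sign.
stripped = parts[0].strip(sep)

# Slicing removes exactly the one separator that was prepended.
sliced = parts[0][len(sep):]

print(repr(stripped), repr(sliced))  # '5' '-5'
```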

9

Another example: split on non-alphanumeric characters and keep the separators

import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])',a)

Output:

['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']

Explanation

re.split('([^a-zA-Z0-9])',a)

() <- keep the separators
[] <- match everything in between
^a-zA-Z0-9 <-except alphabets, upper/lower and numbers.

— anurag

Although, going by the docs, this amounts to the accepted answer, I like the readability of this version, even though \W is a more compact way to express it.

— ephsmith
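The two patterns are close but not identical. In Python 3 with str patterns, \w also covers the underscore and non-ASCII letters, so \W and [^a-zA-Z0-9] disagree on input like the made-up sample below.

```python
import re

# Sample text containing an underscore and a non-ASCII letter (é).
text = 'snake_case caf\u00e9'

by_w = re.split(r'(\W)', text)
by_class = re.split(r'([^a-zA-Z0-9])', text)

print(by_w)      # ['snake_case', ' ', 'café']
print(by_class)  # ['snake', '_', 'case', ' ', 'caf', 'é', '']
```

The trailing empty string in the second result appears because the text ends with a character that the explicit class treats as a separator.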

3

You can also split a string with an array of strings instead of a regular expression, like this:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the string.
    #sort separators by ascending length: the loop below keeps the LAST match,
    #so the longest matching separator wins (e.g. '+=' over '+')
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn


print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))

— Anderson Green
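For comparison, the same list-of-separators idea can be sketched on top of re.split by escaping each separator and trying the longest ones first. The function name tokenize_with_regex is hypothetical, and the sketch assumes a non-empty list of non-empty separators.

```python
import re

def tokenize_with_regex(text, separators):
    # Try longer separators first so that '+=' wins over '+'.
    ordered = sorted(separators, key=len, reverse=True)
    # re.escape makes characters such as '+', '*' and '(' match literally.
    pattern = '(' + '|'.join(re.escape(sep) for sep in ordered) + ')'
    # re.split leaves empty strings between adjacent separators; drop them.
    return [piece for piece in re.split(pattern, text) if piece]

tokens = tokenize_with_regex('hello + world += (1*2+3/5)',
                             ['+=', '+', '/', '*', ' ', '(', ')'])
print(tokens)
```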

3

# This keeps all separators  in result 
##########################################################################
import re
st="%%(c+dd+e+f-1523)%%7"
sh=re.compile('[\+\-//\*\<\>\%\(\)]')

def splitStringFull(sh, st):
   ls=sh.split(st)
   lo=[]
   start=0
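   for l in ls:
     if not l : continue
     k=st.find(l, start)   # search from the current position, not from the start of st
     llen=len(l)
     if k> start:
       tmp= st[start:k]    # the run of separators before this token
       lo.append(tmp)
     lo.append(l)
     start = k + llen
   if start < len(st):     # keep any separators at the very end
     lo.append(st[start:])
   return lo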
  #############################

li= splitStringFull(sh , st)
['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']

— Moisey Oysgelt

3

One lazy and simple solution

Assume your regex pattern is split_pattern = r'(!|\?)'

First, you add the same marker after every separator to act as the new separator, such as '[cut]':

new_string = re.sub(split_pattern, '\\1[cut]', your_string)

Then you split on the new separator: new_string.split('[cut]')

— Yilei Wang

This approach is clever, but it will fail when the original string already contains [cut] somewhere.

— Matthijs Kooijman
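One way to sidestep the marker collision, sketched here with a made-up sample sentence, is to split with the capture group directly and glue each separator back onto the text before it. The variable names follow the answer above; everything else is illustrative.

```python
import re

split_pattern = r'(!|\?)'
your_string = 'Hello! How are you? Fine'

# With one capture group the result alternates text, separator, text, ...
parts = re.split(split_pattern, your_string)
texts = parts[0::2]
seps = parts[1::2] + ['']  # pad so the final text has a partner
chunks = [t + s for t, s in zip(texts, seps)]

# Same result as the marker approach, without reserving a marker string.
via_marker = re.sub(split_pattern, '\\1[cut]', your_string).split('[cut]')

print(chunks)  # ['Hello!', ' How are you?', ' Fine']
```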

It could be faster for large-scale problems because it ultimately uses string.split(), in case re.split() costs more than re.sub() followed by string.split() (which I do not know).

— Lorenz

1

To split a string while keeping the separators, using a regex without a capture group:

import re

def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        match_start = match.start()
        if (prev_end != 0 or match_start > 0) and match_start != prev_end:
            matches.append(s[prev_end:match.start()])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches

regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)

If one assumes that the regex is wrapped in a capture group:

def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches

regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)

Both ways also remove empty groups, which are useless and annoying in most cases.

— Dmitriy Sintsov

1

Here is a simple .split solution that works without regex.

This is an answer to "Python split() without removing the delimiter", so it is not exactly what the original post asks, but the other question was closed as a duplicate of this one.

def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]

Randomized tests:

import random

CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for idx in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s

— orestisf

Regex should be avoided on large-scale problems for speed reasons, which is why this is a good hint.

— Lorenz

0

I had a similar issue trying to split a file path and struggled to find a simple answer. This worked for me and didn't involve substituting the delimiters back into the split text:

my_path = 'folder1/folder2/folder3/file1'

import re

re.findall('[^/]+/|[^/]+', my_path)

Returns:

['folder1/', 'folder2/', 'folder3/', 'file1']

— Conor

This can be simplified slightly by using re.findall('[^/]+/?', my_path) (i.e. making the trailing slash optional with ? rather than providing two alternatives with |).

— Matthijs Kooijman
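Both patterns give the same result for the path in the answer, but both skip a slash that has no name in front of it (a leading slash or a doubled one). The sketch below shows this and a hypothetical variation, not taken from the thread, that lets a slash stand on its own.

```python
import re

simple = r'[^/]+/?'        # pattern from the comment above
variant = r'[^/]*/|[^/]+'  # hypothetical variation: a bare '/' is a piece too

my_path = 'folder1/folder2/folder3/file1'
same_for_plain_path = re.findall(simple, my_path) == re.findall(variant, my_path)

# Made-up path with a leading slash and a doubled slash.
odd_path = '/usr//bin'
lossy = re.findall(simple, odd_path)
lossless = re.findall(variant, odd_path)

print(lossy)     # ['usr/', 'bin']
print(lossless)  # ['/', 'usr/', '/', 'bin']
```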

0

I found this generator-based approach more satisfying:

def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    start = 0
    while True:
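        # The "+ 1" assumes sep is a single character; find() returns -1
        # (so end == 0) when sep is not found.
        end = string.find(sep, start) + 1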
        if end == 0:
            break
        yield string[start:end]
        start = end
    yield string[start:]

It avoids the need to figure out the correct regex, and in theory it should be fairly cheap. It doesn't create new string objects and delegates most of the iteration work to the efficient find method.

... and in Python 3.8 it can be as short as:

def split_keep(string, sep):
    start = 0
    while (end := string.find(sep, start) + 1) > 0:
        yield string[start:end]
        start = end
    yield string[start:]

— Chen Levy
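The generator above advances by one character past each match, which fits single-character separators. A possible variation for longer separators is sketched below; the name split_keep_multi and the empty-separator check are assumptions added for this example.

```python
def split_keep_multi(string, sep):
    """Yield pieces of string, each ending with sep except possibly the last."""
    if not sep:
        # Mirror str.split, which also rejects an empty separator.
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = string.find(sep, start)
        if idx == -1:
            break
        end = idx + len(sep)  # step past the whole separator
        yield string[start:end]
        start = end
    yield string[start:]

pieces = list(split_keep_multi("a::b::c", "::"))
print(pieces)  # ['a::', 'b::', 'c']
```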

0

Replace every seperator: (\W) with seperator + new_seperator: (\W;)
Split by the new_seperator: (;)

import re

def split_and_keep(seperator, s):
  return re.split(';', re.sub(seperator, lambda match: match.group() + ';', s))

print(split_and_keep(r'\W', 'foo/bar spam\neggs'))

— kobako