Split a string by spaces - preserving quoted substrings

269

I have a string which looks like this:

this is "a test"

I'm trying to write something in Python to split it on spaces while ignoring spaces within quotes. The result I'm looking for is:

['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.

python regex

— Adam Pierce

1

Thanks for asking this question. It is exactly what I needed for fixing the pypar build module.

— Martlark

392

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

— Jerub

13

Use "posix=False" to preserve the quotation marks. shlex.split('this is "a test"', posix=False) returns ['this', 'is', '"a test"']

— Boon
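A minimal sketch of the difference the `posix` flag makes, using only the standard `shlex.split` call quoted in the comment above. The variable names are illustrative.

```python
import shlex

text = 'this is "a test"'

# Default POSIX mode: the quotes group the words and are then removed.
stripped = shlex.split(text)

# Non-POSIX mode: the quotes still group the words but stay in the token.
kept = shlex.split(text, posix=False)

print(stripped)  # ['this', 'is', 'a test']
print(kept)      # ['this', 'is', '"a test"']
```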

@MatthewG The "fix" in Python 2.7.3 means that passing a unicode string to shlex.split() will trigger a UnicodeEncodeError exception.

— Rockallite

57

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

— Allen

40

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.

1

I was thinking much the same, but would suggest instead [t.strip('"') for t in re.findall(r'[^\s"]+|"[^"]*"', 'this is "a test"')]

— Darius Bacon

2

+1 I'm using this because it is a lot faster than shlex.

— hanleyp

3

Why the triple backslash? Won't a single backslash do the same?

— Doppelganger

1

Actually, one thing I don't like about this is that anything before or after the quotes is not split properly. If I have a string like 'PARAMS val1="Thing" val2="Thing2"', I expect it to split into three pieces, but it splits into 5. It's been a while since I've done regex, so I don't feel like trying to fix it using your solution right now.

— leetNightshade
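One possible way to get three pieces for the string in this comment is to match whole tokens with `re.findall` instead of splitting, so that a quoted run stays attached to the text around it. This is a sketch rather than part of the answer above, and `split_tokens` is a hypothetical helper name.

```python
import re

def split_tokens(text):
    # A token is one or more of: a complete "..." run, or a single
    # non-whitespace character. Whitespace inside quotes is therefore kept.
    return re.findall(r'(?:"[^"]*"|\S)+', text)

print(split_tokens('PARAMS val1="Thing" val2="Thing2"'))
# ['PARAMS', 'val1="Thing"', 'val2="Thing2"']
print(split_tokens('PARAMS val1="Some Thing"'))
# ['PARAMS', 'val1="Some Thing"']
```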

1

You should use raw strings when using regular expressions.

— asmeurer
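As a sketch of what the raw-string advice looks like for the pattern in this answer: with a raw, triple-quoted literal neither quote character needs a backslash, which also sidesteps the triple-backslash question raised above. The name `pattern` is only illustrative.

```python
import re

test = 'this is "a test"'

# In a raw string the quotes need no backslashes at all; triple quotes
# let both quote characters appear unescaped in the pattern.
pattern = r"""( |".*?"|'.*?')"""

pieces = [p for p in re.split(pattern, test) if p.strip()]
print(pieces)  # ['this', 'is', '"a test"']
```

Note that this split keeps the surrounding quotes in the quoted piece.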

29

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']

— Ryan Ginstrom

2

Useful when shlex strips some needed characters.

— scraplesh

1

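CSV uses two double quotes in a row (as in side by side, "") to represent one double quote, so csv will turn two double quotes into a single one: 'this is "a string""' and 'this is "a string"""' both map to ['this', 'is', 'a string"']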

— Boris
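The doubled-quote behaviour described in this comment can be checked with a short sketch. It assumes the default (non-strict) csv dialect, and `csv_split` is a hypothetical helper name.

```python
import csv

def csv_split(line):
    # csv.reader expects an iterable of lines, so wrap the single line.
    return next(csv.reader([line], delimiter=" "))

two = csv_split('this is "a string""')
three = csv_split('this is "a string"""')

print(two)    # ['this', 'is', 'a string"']
print(three)  # ['this', 'is', 'a string"']
```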

15

I used shlex.split to process 70,000,000 lines of squid log, and it was very slow. So I switched to re.

Please try this if you have performance problems with shlex.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)

— Daniel Dai

8

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split on spaces, then replace the \x00 back with spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))

— elifiner

You should use re.Scanner instead. It's more reliable (and I have in fact implemented a shlex-like splitter using re.Scanner).

— Devin Jeanpierre
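A rough sketch of what an `re.Scanner`-based splitter might look like. `re.Scanner` is an undocumented class in the standard `re` module, and the three rules below are assumptions chosen for the question's simple case, not the commenter's actual implementation.

```python
import re

# Each rule is (pattern, action); an action of None drops the match.
scanner = re.Scanner([
    (r'"[^"]*"', lambda scanner, token: token[1:-1]),  # quoted: drop the quotes
    (r'[^\s"]+', lambda scanner, token: token),        # bare word
    (r"\s+", None),                                    # whitespace: skip
])

tokens, remainder = scanner.scan('this is "a test"')
print(tokens)           # ['this', 'is', 'a test']
print(repr(remainder))  # '' (unmatched trailing input would end up here)
```

With these rules, adjacent pieces such as abc"def" come out as two separate tokens, and an unterminated quote is returned in `remainder` instead of raising.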

+1 Hm, this is a pretty smart idea: breaking the problem down into multiple steps so the answer isn't terribly complex. shlex didn't do exactly what I needed, even when I tried to tweak it. And the single-pass regex solutions were getting really weird and complicated.

— leetNightshade

6

It seems that, for performance reasons, re is faster. Here is my solution using a non-greedy operator that preserves the outer quotes:

re.findall("(?:\".*?\"|\S)+", s)

Result:

['this', 'is', '"a test"']

It leaves constructs like aaa"bla blub"bbb together, as these tokens are not separated by spaces. If the string contains escaped characters, you can match like this:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.

— hochl

1

Another important advantage of this solution is its versatility with respect to the delimiting character (e.g. , via '(?:".*?"|[^,])+'). The same applies to the quoting (enclosing) characters.

— a_guest
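To illustrate the point about the delimiter, here is a small sketch that parameterizes the pattern from this comment. The function name `split_outside_quotes` is hypothetical, and it assumes a single-character delimiter and double quotes only.

```python
import re

def split_outside_quotes(text, delimiter=","):
    # Build '(?:".*?"|[^X])+' where X is the (escaped) delimiter character.
    pattern = r'(?:".*?"|[^%s])+' % re.escape(delimiter)
    return re.findall(pattern, text)

print(split_outside_quotes('a,"b,c",d'))
# ['a', '"b,c"', 'd']
print(split_outside_quotes('this is "a test"', " "))
# ['this', 'is', '"a test"']
```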

4

The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and it gives slightly unexpected results in some corner cases.

I have the following use case: I need a split function that splits input strings such that single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output:

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

I ended up with the following function, which produces the expected output for all of these input strings:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

So performance is much better than shlex, and can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.

— Ton van den Heuvel
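The precompilation mentioned at the end of this answer could look roughly like the sketch below. The names `_TOKEN_RE` and `quoted_split_compiled` are hypothetical, the pattern is the one from the answer with `[^\s]` written as `\S`, and no timing claim is made for this variant.

```python
import re

# Compiled once at import time instead of being looked up on every call.
_TOKEN_RE = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|\S+')

def strip_quotes(token):
    # Remove one pair of matching outer quotes, if present.
    if token and token[0] in "\"'" and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_compiled(text):
    # Strip the outer quotes, then unescape \" and \' inside each token.
    return [
        strip_quotes(token).replace('\\"', '"').replace("\\'", "'")
        for token in _TOKEN_RE.findall(text)
    ]

print(quoted_split_compiled('"abc def" ghi'))  # ['abc def', 'ghi']
print(quoted_split_compiled('abc"def" ghi'))   # ['abc"def"', 'ghi']
```

Since the `re` module already caches recently used patterns, the gain from compiling up front is mostly the saved cache lookup on each call.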

Not sure what you're talking about: ``` >>> shlex.split('this is "a test"') ['this', 'is', 'a test'] >>> shlex.split('this is \\"a test\\"') ['this', 'is', '"a', 'test"'] >>> shlex.split('this is "a \\"test\\""') ['this', 'is', 'a "test"'] ```

— morsik

@morsik What is your point? Maybe your use case does not match mine. If you look at the test cases, you'll see all the cases where shlex does not behave as expected for my use case.

— Ton van den Heuvel

3

To preserve the quotes, use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args

— THE_MAD_KING

With larger strings, your function is very slow.

— Faran2007

3

Speed test of the different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

— har777
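The `%timeit` lines above only work inside IPython. A comparable measurement can be sketched with the standard `timeit` module; the dictionary names and the call count of 2000 are arbitrary choices, and the numbers will differ from machine to machine.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'
calls = 2000

# One zero-argument callable per approach being compared.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex": lambda: shlex.split(line),
}

# timeit.timeit returns the total seconds taken by `calls` invocations.
timings = {name: timeit.timeit(func, number=calls) for name, func in candidates.items()}

for name, seconds in timings.items():
    print("%-10s %.2f microseconds per call" % (name, seconds / calls * 1e6))
```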

1

Hmm, I can't seem to find the "Reply" button... Anyway, this answer is based on the approach by Kate, but it correctly splits strings with substrings containing escaped quotes, and it also removes the start and end quotes of the substrings:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the escape characters in the strings of the returned list are not wanted, you can use this slightly altered version of the function:

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

1

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

— moschlar

For Python 2.7.5 this should be: split = lambda a: [b.decode('utf-8') for b in _split(a)] otherwise you get: UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...: ordinal not in range(128)

— Peter Varo
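For comparison, on Python 3 a `str` is already Unicode, so the encode/decode round trip discussed here should not be needed. A minimal sketch with a made-up sample string:

```python
import shlex

# Python 3 only: non-ASCII text is passed to shlex.split directly.
parts = shlex.split('naïve "café au lait"')
print(parts)  # ['naïve', 'café au lait']
```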

1

As an option, try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter=' ')
Out[2]: ['this', 'is', 'a test']

— Mikhail Zakharov

0

I suggest:

Test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

To also capture "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

To ignore the empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

— hussic

This can also be written as re.findall("(?:\".*?\"|'.*?'|[^\s'\"]+)", s).

— hochl

-3

If you don't care about quoted substrings, a simple split() is easier:

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or the string module:

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: the string module seems to perform better than the string method.

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use the RE engine:

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance:

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings you should not load the entire string into memory; instead, either split it line by line or use an iterative loop.

— Gregory
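The memory advice at the end of this answer can be combined with quote handling by yielding tokens lazily, one line at a time. This is only a sketch: `iter_tokens` is a hypothetical name, `io.StringIO` stands in for a large file object, and the pattern handles double quotes only.

```python
import io
import re

TOKEN_RE = re.compile(r'"[^"]*"|\S+')

def iter_tokens(lines):
    # Yield tokens one at a time so only the current line is held in memory.
    for line in lines:
        for match in TOKEN_RE.finditer(line):
            yield match.group(0).strip('"')

# io.StringIO stands in for a large file opened for reading.
fake_file = io.StringIO('this is "a test"\nand "another one" here\n')
tokens = list(iter_tokens(fake_file))
print(tokens)
# ['this', 'is', 'a test', 'and', 'another one', 'here']
```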

11

You seem to have missed the whole point of the question. There are quoted sections in the string that must not be split.

— rjmunro

-3

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

— pjz

Please provide examples of the sort of strings you think will fail.

— pjz

Think? adamsplit("This is 'a test'") → ['This', 'is', "'a", "test'"]

— Matthew Schinckel

The OP only says "within quotes" and only gives an example with double quotes.

— pjz
