Recursive subfolder search and return files in a list (Python)

119

I am working on a script to recursively go through the subfolders of a main folder and build a list of files of a certain type. I am having an issue with the script. It is currently set up as follows:

for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))

The problem is that the subFolder variable holds a list of subfolders rather than the folder that the ITEM file is located in. I was thinking of running a for loop over the subfolders first and joining the first part of the path, but I figured I'd double check to see if anyone has any suggestions before that. Thanks for your help!

— user2709514
Source

157

You should be using the dirpath, which you call root. The dirnames are supplied so you can prune them if there are folders that you don't want os.walk to recurse into.

import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
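The pruning mentioned above can be sketched as follows. This is a minimal illustration, not part of the original answer: the function name find_txt_files and the default skip_dirs values are assumptions chosen for the example. It relies on the documented behavior that, in top-down mode, os.walk consults the dirnames list after yielding it, so editing that list in place controls where the walk descends.

```python
import os


def find_txt_files(top, skip_dirs=(".git", "__pycache__")):
    """Return paths of .txt files under top, skipping directories named in skip_dirs.

    find_txt_files and skip_dirs are hypothetical names used for illustration.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(top):
        # Slice assignment edits the list in place; os.walk (top-down, the
        # default) reads this same list to decide which folders to enter.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if os.path.splitext(name)[1] == ".txt":
                matches.append(os.path.join(dirpath, name))
    return matches
```

Rebinding the name (dirnames = [...]) would not prune anything, because os.walk would still hold the original list.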

Edit:

After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.

import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Also a generator version:

from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))

Edit 2, for Python 3.4+:

from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
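If the character-class pattern feels hard to read, one alternative is to match every entry and compare the lowercased suffix instead. The sketch below is an assumption-laden illustration rather than the answer's method: find_by_suffix is a hypothetical helper name, and it trades some speed for readability because it visits every entry under the top folder.

```python
from pathlib import Path


def find_by_suffix(top, suffix=".txt"):
    """Return files under top whose extension equals suffix, ignoring case.

    find_by_suffix is a hypothetical helper, not a pathlib API.
    """
    wanted = suffix.lower()
    return [p for p in Path(top).rglob("*")
            if p.is_file() and p.suffix.lower() == wanted]
```

Because the comparison happens in Python rather than in the glob pattern, the result should not depend on how the platform treats case in patterns.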

— John La Rooy

1

The glob pattern '*.[Tt][Xx][Tt]' will make the search case-insensitive.

— SergiyKolesnikov

@SergiyKolesnikov, thanks, I've used that in the edit at the bottom. Note that rglob is case-insensitive on Windows platforms, but it is not portably case-insensitive.

— John La Rooy

1

@JohnLaRooy It works with glob too (Python 3.6 here): glob.iglob(os.path.join(real_source_path, '**', '*.[xX][mM][lL]'))

— SergiyKolesnikov

@Sergiy: Your iglob does not work for files in sub-subfolders or below. You need to add recursive=True.

— user136036

1

@user136036 "Better" does not always mean fastest. Sometimes readability and maintainability are also important.

— John La Rooy

114

Changed in Python 3.5: support for recursive globs using "**".

glob.glob() got a new recursive parameter.

If you want to get every .txt file under my_path (recursively, including subdirs):

import glob

files = glob.glob(my_path + '/**/*.txt', recursive=True)

# my_path/     the dir
# **/       every file and dir under my_path
# *.txt     every file that ends with '.txt'
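To see what the recursive flag changes, the following sketch runs the same pattern both ways. compare_glob_modes is a hypothetical helper name introduced only for this illustration. With recursive=True, "**" matches zero or more directory levels, so files directly inside my_path are included; without the flag, "**" behaves like a single "*" and matches exactly one directory level.

```python
import glob
import os


def compare_glob_modes(my_path, ext=".txt"):
    """Return (recursive_hits, flat_hits) for the same '**' pattern.

    compare_glob_modes is a hypothetical helper used for illustration.
    """
    pattern = os.path.join(my_path, "**", "*" + ext)
    # '**' spans zero or more directories only when recursive=True.
    recursive_hits = glob.glob(pattern, recursive=True)
    # Without the flag, '**' is treated like '*': one directory level only.
    flat_hits = glob.glob(pattern)
    return recursive_hits, flat_hits
```

For a tree containing top.txt, sub/mid.txt and sub/deeper/low.txt, the first list should hold all three files and the second only sub/mid.txt.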

If you need an iterator, you can use iglob as an alternative:

for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    print(file)  # handle each matching path as it is produced

— Rotareti

1

TypeError: glob() got an unexpected keyword argument 'recursive'

— CyberJacob

1

It should be working. Make sure you use a version >= 3.5. I added a link to the documentation in my answer for more detail.

— Rotareti

That would be why: I'm on 2.7.

— CyberJacob

1

Why the list comprehension and not just files = glob.glob(PATH + '/*/**/*.txt', recursive=True)?

— tobltobs

Whoops! :) It's totally redundant. No idea what made me write it like that. Thanks for mentioning it! I'll fix it.

— Rotareti

20

I will translate John La Rooy's list comprehension into nested for loops, in case anyone else has trouble understanding it.

result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

This should be equivalent to:

import glob
import os

result = []

for x in os.walk(PATH):
    for y in glob.glob(os.path.join(x[0], '*.txt')):
        result.append(y)

Here is the documentation for list comprehensions and for the functions os.walk and glob.glob.

— Jefferson Lima

1

This answer worked for me in Python 3.7.3. glob.glob(..., recursive=True) and list(Path(dir).glob(...)) did not.

— miguelmorin

11

This seems to be the fastest solution I could come up with. It is faster than os.walk and a lot faster than any glob solution.

It will also give you a list of all nested subfolders at basically no cost.
You can search for several different extensions.
You can also choose to return either full paths or just the file names by changing f.path to f.name (do not change it for subfolders!).

Args: dir: str, ext: list.
The function returns two lists: subfolders, files.

See below for a detailed speed analysis.

import os


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


subfolders, files = run_fast_scandir(folder, [".jpg"])
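For very deep trees, the same scan can be written without recursion by keeping an explicit stack of folders still to visit. This is a sketch under stated assumptions, not the answer's benchmarked code: scandir_iterative is a hypothetical name, it has not been timed, and unlike the function above it does not descend into symlinked directories (follow_symlinks=False), which avoids loops at the cost of skipping such folders.

```python
import os


def scandir_iterative(top, ext):
    """Collect subfolders and matching files under top using an explicit stack.

    ext is a collection of lowercase extensions such as [".jpg"].
    scandir_iterative is a hypothetical name used for illustration.
    """
    subfolders, files = [], []
    stack = [top]
    while stack:
        current = stack.pop()
        # The context manager closes the directory handle promptly.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subfolders.append(entry.path)
                    stack.append(entry.path)
                elif entry.is_file() and os.path.splitext(entry.name)[1].lower() in ext:
                    files.append(entry.path)
    return subfolders, files
```

The order of results differs from the recursive version, since folders are visited in stack order.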

Speed analysis

For various methods of getting all files with a specific file extension inside all subfolders and the main folder.

tl;dr:
- fast_scandir clearly wins and is about twice as fast as all other solutions, except os.walk.
- os.walk comes second and is slightly slower.
- Using glob slows the process down considerably.
- None of the results use natural sorting. This means results will be sorted like this: 1, 10, 2. To get natural sorting (1, 2, 10), please have a look at https://stackoverflow.com/a/48030307/2441026
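As a rough illustration of the natural-sorting point, a sort key can split each name into digit and non-digit runs and compare the digit runs as integers. This is one common approach and not necessarily what the linked answer does; natural_key is a hypothetical name.

```python
import re


def natural_key(text):
    """Build a sort key in which embedded numbers compare numerically.

    natural_key is a hypothetical helper used for illustration.
    """
    # re.split with a capturing group alternates non-digit and digit runs,
    # so odd positions always hold the digit runs.
    parts = re.split(r"(\d+)", text)
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]


names = ["img10.jpg", "img2.jpg", "img1.jpg"]
print(sorted(names))                   # ['img1.jpg', 'img10.jpg', 'img2.jpg']
print(sorted(names, key=natural_key))  # ['img1.jpg', 'img2.jpg', 'img10.jpg']
```

The key can be passed to sorted() or list.sort() on the list of paths returned by any of the methods compared here.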

Results:

fast_scandir    took  499 ms. Found files: 16596. Found subfolders: 439
os.walk         took  589 ms. Found files: 16596
find_files      took  919 ms. Found files: 16596
glob.iglob      took  998 ms. Found files: 16596
glob.glob       took 1002 ms. Found files: 16596
pathlib.rglob   took 1041 ms. Found files: 16596
os.walk-glob    took 1043 ms. Found files: 16596

Tests were done on W7x64 with Python 3.8.1, 20 runs each, on 16596 files in 439 (partially nested) subfolders.
find_files is from https://stackoverflow.com/a/45646357/2441026 and lets you search for several extensions.
fast_scandir was written by myself and also returns a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry against a simple if ... == ".jpg" and there was no significant difference).

# -*- coding: utf-8 -*-
# Python 3


import time
import os
from glob import glob, iglob
from pathlib import Path


directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
                  os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://stackoverflow.com/a/45646357/2441026

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://stackoverflow.com/a/59803793/2441026

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files



if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()


    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")


    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")

— user136036

10

The new pathlib library simplifies this to one line:

from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))

You can also use the generator version:

from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
    pass

This returns Path objects, which you can use for pretty much anything, or you can get the file name as a string with file.name.
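A few of the Path attributes that tend to be useful here are sketched below. The describe function is a hypothetical helper written for this illustration; the attributes it reads (name, stem, suffix, parent) are standard pathlib properties.

```python
from pathlib import Path


def describe(path):
    """Return a few commonly used views of a path as plain strings.

    describe is a hypothetical helper used for illustration.
    """
    p = Path(path)
    return {
        "name": p.name,           # final component, e.g. "notes.txt"
        "stem": p.stem,           # name without the last suffix
        "suffix": p.suffix,       # last extension, including the dot
        "parent": str(p.parent),  # containing folder
        "as_str": str(p),         # plain string for APIs that need one
    }
```

Calling describe on each item yielded by Path(PATH).glob('**/*.txt') gives string values that can be stored or printed directly.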

— เอ็ม

6

It's not the greatest answer, but I'll put it here for fun because it's a neat lesson in recursion.

import os


def find_files(files, dirs=[], extensions=[]):
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1] in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return

On my machine I have two folders, root and root2:

mender@multivax ]ls -R root root2
root:
temp1 temp2

root/temp1:
temp1.1 temp1.2

root/temp1/temp1.1:
f1.mid

root/temp1/temp1.2:
f.mi  f.mid

root/temp2:
tmp.mid

root2:
dummie.txt temp3

root2/temp3:
song.mid

Let's say I want to find all .txt and all .mid files in either of these directories. Then I can just do:

files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)

#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']

— dermen

4

Recursive is new in Python 3.5, so it won't work on Python 2.7. Here is an example that uses r strings, so you just need to provide the path as it is on either Win, Lin, ...

import glob

mypath=r"C:\Users\dj\Desktop\nba"

files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
    print(f) # nice looking single line per file

Note: it will list all files, no matter how deep they are.
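The pattern above hard-codes backslashes, which act as path separators only on Windows. A separator-neutral variant can be sketched by building the pattern with os.path.join; find_py_files is a hypothetical name used for this illustration.

```python
import glob
import os


def find_py_files(top):
    """Recursively list .py files under top using a separator-neutral pattern.

    find_py_files is a hypothetical helper used for illustration.
    """
    # os.path.join inserts the separator of the current platform.
    pattern = os.path.join(top, "**", "*.py")
    return glob.glob(pattern, recursive=True)
```

The top folder can still be given as a raw string, for example find_py_files(r"C:\Users\dj\Desktop\nba") on Windows.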

— prosti

3

You can do it this way to return a list of files with their absolute paths.

def list_files_recursive(path):
    """
    Function that receives as a parameter a directory path
    :return list_: File List and Its Absolute Paths
    """

    import os

    files = []

    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            files.append(os.path.join(r, file))

    lst = [file for file in files]
    return lst


if __name__ == '__main__':

    result = list_files_recursive('/tmp')
    print(result)
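Note that os.walk builds each root from the path it was given, so the function above returns absolute paths only when it is called with an absolute path such as '/tmp'. A minimal sketch that normalizes the input first is shown below; list_files_absolute is a hypothetical name used for illustration.

```python
import os


def list_files_absolute(path):
    """Return absolute paths of all files under path, even if path is relative.

    list_files_absolute is a hypothetical helper used for illustration.
    """
    top = os.path.abspath(path)  # resolve relative input against the current directory
    return [os.path.join(dirpath, name)
            for dirpath, _dirnames, filenames in os.walk(top)
            for name in filenames]
```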

— WilliamCanin

3

If you don't mind installing an additional lightweight library, you can do this:

pip install plazy

Usage:

import plazy

txt_filter = lambda x : True if x.endswith('.txt') else False
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)

The result should look something like this:

['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']

It works on both Python 2.7 and Python 3.

Github: https://github.com/kyzas/plazy#list-files

Disclaimer: I'm the author of plazy.

— Minh Nguyen

1

This function recursively puts only files into a list. Hope this helps you.

import os


def ls_files(dir):
    files = list()
    for item in os.listdir(dir):
        abspath = os.path.join(dir, item)
        try:
            if os.path.isdir(abspath):
                files = files + ls_files(abspath)
            else:
                files.append(abspath)
        except FileNotFoundError as err:
            print('invalid directory\n', 'Error: ', err)
    return files

— Yossarian42

0

Your original solution was very nearly correct, but the variable "root" is dynamically updated as the walk recurses through the tree. os.walk() is a recursive generator. Each tuple of (root, subFolder, files) is for one specific root, the way you have it set up.

That is:

root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]

root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]

...

I made a slight tweak to your code to print the full list.

import os
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,item))
            print(fileNamePath)
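To watch those tuples change from one iteration to the next, the values yielded by os.walk can be recorded for a small tree. The sketch below is illustrative only; walk_snapshot is a hypothetical name, and the roots are shown relative to the starting folder to keep the output short.

```python
import os


def walk_snapshot(top):
    """Record (root relative to top, subfolders, files) for every folder os.walk visits.

    walk_snapshot is a hypothetical helper used for illustration.
    """
    rows = []
    for root, sub_folders, files in os.walk(top):
        # Each iteration describes exactly one folder: root is that folder,
        # sub_folders and files are the names found directly inside it.
        rows.append((os.path.relpath(root, top), sorted(sub_folders), sorted(files)))
    return rows
```

For a folder holding a.txt and a subfolder sub with b.txt, this should give one row for "." with ["sub"] and ["a.txt"], followed by one row for "sub" with no subfolders and ["b.txt"].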

Hope this helps!

— LastTigerEyes