How can I detect if a file is binary (non-text) in Python?

Question 1

How can I tell if a file is binary (non-text) in Python?

I am searching through a large set of files in Python, and I keep getting matches in binary files. This makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than grep allows for.

In the past, I would have just searched for characters greater than 0x7f, but utf8 and the like make that impossible on modern systems. Ideally, the solution would be fast, but any solution will do.

Question 2

You can also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)

Compiling a list of binary MIME types is fairly easy. For example, Apache distributes a mime.types file that you could parse into a pair of lists, binary and text, and then check whether the MIME type is in your text list or your binary list.
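As a rough sketch of that idea, the following classifies a path by its guessed MIME type. The helper name `looks_like_text_by_name` and the `EXTRA_TEXT_TYPES` set are made up for illustration. Because `mimetypes.guess_type` only looks at the file name, the result is a hint and not a check of the contents.

```python
import mimetypes

# Hypothetical allow-list of MIME types outside "text/" that are still textual.
EXTRA_TEXT_TYPES = {"application/json", "application/xml"}


def looks_like_text_by_name(path):
    """Guess from the file name alone; return None when the type is unknown."""
    mime, encoding = mimetypes.guess_type(path)
    if encoding is not None:
        # e.g. "notes.txt.gz": the payload is compressed, so treat it as binary.
        return False
    if mime is None:
        return None
    return mime.startswith("text/") or mime in EXTRA_TEXT_TYPES
```

Returning None for unknown extensions lets the caller fall back to a content-based check instead of guessing.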

Question 3

Yet another method, based on the behavior of file(1):

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

Question 4

If you are using Python 3 with UTF-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses Unicode when handling files in text mode (and bytes in binary mode). If your encoding cannot decode arbitrary files, it is quite likely that you will get a UnicodeDecodeError.

Example:

try:
    # Decode explicitly as UTF-8 rather than relying on the locale default.
    with open(filename, "r", encoding="utf-8") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass  # Found non-text data
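The same idea can be wrapped in a predicate when you only need a yes/no answer before processing. This is a minimal sketch; the name `is_utf8_text` and the chunk size are assumptions, not part of the answer above.

```python
def is_utf8_text(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as UTF-8 text."""
    try:
        # newline="" disables newline translation; only decoding matters here.
        with open(path, "r", encoding="utf-8", newline="") as f:
            while f.read(chunk_size):
                pass
    except UnicodeDecodeError:
        return False
    return True
```

Note that NUL bytes are valid UTF-8, so a file of ASCII mixed with NULs still passes; combine this with a NUL-byte check if that matters.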

Question 5

If it helps, many binary types begin with a magic number. Here is a list of file signatures.
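A signature check could look roughly like this. The table holds only a handful of widely documented signatures, and the names `SIGNATURES` and `sniff_signature` are hypothetical; a real list would be much longer.

```python
# A few well-known leading byte sequences (illustrative, far from complete).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\x7fELF": "ELF binary",
    b"\x1f\x8b": "gzip data",
    b"\xff\xd8\xff": "JPEG image",
}


def sniff_signature(data):
    """Return a label if data starts with a known signature, else None."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return None
```

A None result only means the signature is not in the table, so it cannot prove that a file is text.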

Question 6

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while True:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found null byte (bytes literal, since the file is read in binary mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
    # Note: try/finally does not need an "except:" clause.
    finally:
        fin.close()

    return False

Question 7

Here is a suggestion that uses the Unix file command:

import re
import subprocess

def istext(path):
    # `file` writes bytes to stdout, so match with a bytes pattern.
    result = subprocess.run(["file", "-L", path], stdout=subprocess.PIPE)
    return re.search(rb':.* text', result.stdout) is not None

Example usage:

>>> istext('/etc/motd')
True
>>> istext('/vmlinuz')
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese')  # works on UTF-8
True

It has the downsides of not being portable to Windows (unless you have something like the file command there), and of having to spawn an external process for each file, which might not be palatable.

Question 8

Use the binaryornot library (GitHub).

It is very simple and is based on the code found in this stackoverflow question.

You could actually write this in 2 lines of code, but this package saves you from having to write and thoroughly test those 2 lines with all sorts of weird file types, cross-platform.

Question 9

Usually you have to guess.

You can look at the extension as one clue, if the file has one.

You can also recognise known binary formats, and ignore those.

Otherwise, look at what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.
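The last two hints can be combined into one guess. This is only a sketch under stated assumptions: the function names, the 30% threshold, and the choice of "printable" bytes are all illustrative, and the input is assumed to be a sample of bytes already read from the file.

```python
# Bytes counted as "text": printable ASCII plus common whitespace controls.
PRINTABLE = bytes(range(0x20, 0x7F)) + b"\t\n\r\f\b"


def non_text_ratio(data):
    """Fraction of bytes in data that are not printable ASCII."""
    if not data:
        return 0.0
    leftover = data.translate(None, PRINTABLE)
    return len(leftover) / len(data)


def guess_is_text(data, threshold=0.30):
    """Hypothetical helper: guess whether a sample of bytes is text."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
        return True  # valid UTF-8 without NUL bytes
    except UnicodeDecodeError:
        pass
    # Not UTF-8 (or a sample cut mid-character): fall back to the ratio.
    return non_text_ratio(data) <= threshold
```

Trying UTF-8 first matters for non-English text, which is almost entirely non-ASCII bytes and would fail a pure ratio test.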

Question 10

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """ 
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False
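The UTF-16 caveat in the docstring is easy to see directly: ASCII characters encoded as UTF-16 carry a zero byte each, so a NUL-byte test flags them. A short illustration (the variable names are arbitrary):

```python
import codecs

# The "utf-16" codec prepends a byte order mark (BOM) in native byte order.
sample = "plain text".encode("utf-16")

has_nul = b"\0" in sample  # every ASCII character contributes a zero byte
has_bom = sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

Both flags are true here, which is why checking for a leading BOM before the NUL test avoids this false positive for files that include one.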

Question 11

We can use Python itself to check whether a file is binary, because reading fails if we try to open a binary file in text mode.

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try open file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True

Question 12

If you are not on Windows, you can use Python Magic to determine the file type. Then you can check whether it is a text/ MIME type.

Question 13

Here is a function that first checks whether the file starts with a BOM and, if not, looks for a zero byte within the initial 8192 bytes:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

Technically, the check for the UTF-8 BOM is unnecessary, because UTF-8 text should not contain zero bytes for all practical purposes. But since it is a very common encoding, it is quicker to check for the BOM at the beginning than to scan all 8192 bytes for a zero.

Question 14

Try the currently maintained python-magic, which is not the same module as the one in @Kami Kisiel's answer. It supports all platforms, including Windows, but you will need the libmagic binary files. This is explained in the README.

Unlike the mimetypes module, it does not use the file's extension and instead inspects the contents of the file.

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'

Question 15

I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the nix file command looks to be the best choice (I am only developing for linux boxen). Some others posted solutions using file, but they are unnecessarily complicated in my opinion, so here is what I came up with:

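import subprocess

def test_file_isbinary(filename):
    # Pass the arguments as a list so unusual file names need no shell quoting.
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output returns bytes, so compare against bytes prefixes.
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True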

It should go without saying, but the code that calls this function should make sure the file is readable before testing it; otherwise the file will be mistakenly detected as binary.

Question 16

I guess that the best solution is to use the guess_type function. It holds a list of several MIME types, and you can also include your own types. Here is the script that I used to solve my problem:

from mimetypes import add_type, guess_type
from os import listdir


class TextFileFinder:  # hypothetical name; the original snippet omitted the class statement
    def __init__(self):
        self.__addMimeTypes()

    def __addMimeTypes(self):
        add_type("text/plain", ".properties")

    def __listDir(self, path):
        try:
            return listdir(path)
        except OSError:
            print("The directory {0} could not be accessed".format(path))
            return []

    def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            mime = guess_type(files)[0]  # None when the extension is unknown
            if mime is not None and mime.split("/")[0] == "text":
                asciiFiles.append(files)
        if not asciiFiles:
            print("No text files in directory: {0}".format(path))
        return asciiFiles

It is inside a class, as you can see from the structure of the code, but you can change whatever you need in order to use it in your application. It is quite simple to use. The getTextFiles method returns a list object with all the text files that reside in the directory you pass in the path variable.

Question 17

On *NIX:

If you have access to the `file` shell command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import quote, split

filepath = realpath('rel/or/abs/path/to/file')
# Lower-case the decoded output of `file`, not the command line itself.
assert 'ascii' in check_output(split('file {}'.format(quote(filepath)))).decode().lower()

Or you can stick that in a for-loop to get output for all files in the current dir:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert 'ascii' in check_output(split('file {}'.format(quote(afile)))).decode().lower()

Or for all subdirs:

for curdir, dirnames, filelist in os.walk('.'):
    for afile in filelist:
        path = os.path.join(curdir, afile)  # os.walk yields bare file names
        assert 'ascii' in check_output(split('file {}'.format(quote(path)))).decode().lower()
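The directory walk can also be done without spawning a process per file. The following is a minimal sketch that assumes a NUL-byte test on the first chunk is good enough; `contains_nul` and `iter_text_files` are hypothetical names, and any other check could be swapped in.

```python
import os


def contains_nul(path, chunk_size=8192):
    """Stand-in check: True if the first chunk of the file has a NUL byte."""
    with open(path, "rb") as f:
        return b"\0" in f.read(chunk_size)


def iter_text_files(root):
    """Yield paths under root that do not look binary."""
    for curdir, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(curdir, name)
            try:
                if not contains_nul(path):
                    yield path
            except OSError:
                # Unreadable files are skipped rather than reported as binary.
                continue
```

Because it is a generator, the caller can start processing matches before the whole tree has been scanned.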

Question 18

Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NULL character.

Here is perl's version of pp_fttext() (pp_sys.c) implemented in Python:

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note that this code was written to run on both Python 2 and Python 3 without changes.

Source: Perl's "guess if file is text or binary" implemented in Python

Question 19

Are you on Unix? If so, then try:

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell return values are inverted (0 means success, so if "text" is found it returns 0, which is a false value in Python).

Question 20

A simpler way is to check whether the file contains a NULL character (\x00) by using the in operator, for instance:

b'\x00' in open("foo.bar", 'rb').read()

See the complete example below:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')