TypeError: can't use a string pattern on a bytes-like object in re.findall()

108

I am trying to learn how to automatically fetch URLs from a page. In the following code I am trying to get the title of the web page:

import urllib.request
import re

url = "http://www.google.com"
regex = r'<title>(.+?)</title>'
pattern  = re.compile(regex)

with urllib.request.urlopen(url) as response:
   html = response.read()

title = re.findall(pattern, html)
print(title)

And I get this unexpected error:

Traceback (most recent call last):
  File "path\to\file\Crawler.py", line 11, in <module>
    title = re.findall(pattern, html)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

What am I doing wrong?

python python-3.x web-crawler

— Inspired_Blue

1

Possible duplicate of Convert bytes to a Python String

— gnat

161

You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').

See Convert bytes to a Python String
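As a minimal sketch of the fix described above, the snippet below first reproduces the error and then decodes before matching. It uses a made-up bytes literal in place of response.read() so that it runs without a network connection; the names html_bytes and html_text and the sample markup are assumptions for illustration, and UTF-8 is assumed to be the right encoding.

```python
import re

# Stand-in for response.read(); this markup is invented for the example.
html_bytes = b"<html><head><title>Example Page</title></head></html>"

pattern = re.compile(r"<title>(.+?)</title>")  # a str pattern

# Matching a str pattern against bytes raises the TypeError from the question.
try:
    pattern.findall(html_bytes)
except TypeError as exc:
    print("TypeError:", exc)

# Decode first (assuming the page is UTF-8), then match str against str.
html_text = html_bytes.decode("utf-8")
titles = pattern.findall(html_text)
print(titles)
```

With a real page, html_bytes would come from response.read(), and the encoding may need to be something other than UTF-8.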

— rocky

This fixed the error TypeError: cannot use a string pattern on a bytes-like object, but then I got errors such as UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 1: invalid start byte. I fixed it by using .decode("utf-8", "ignore"): stackoverflow.com/questions/62170614/…

— baptx

"ignore" ignores. If that is what you want, fine. Sometimes, however, this kind of error points to a deeper problem: the data you are trying to decode may not really be decodable, or was never meant to be, such as compressed or encrypted text. Or it may need a different encoding, such as utf-16. Caveat emptor.

— rocky
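To make the trade-off in the two comments above concrete, here is a small sketch comparing the error handlers of bytes.decode. The sample bytes are an assumption chosen for illustration: text encoded as ISO-8859-1 that is not valid UTF-8.

```python
# "café au lait" encoded as ISO-8859-1; the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9 au lait"

# Strict decoding (the default) raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # silently drops the offending byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD so the loss is visible
correct = raw.decode("iso-8859-1")         # the matching codec recovers the text

print(strict_failed, ignored, replaced, correct, sep="\n")
```

Both "ignore" and "replace" lose information. If the data is really in another encoding, decoding with that encoding is the only option that keeps the original text.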

28

The problem is that your regex is a string, but html is bytes:

>>> type(html)
<class 'bytes'>

Since Python doesn't know how those bytes are encoded, it raises an exception when you try to use a string regex on them.

You can either decode the bytes to a string:

html = html.decode('ISO-8859-1')  # encoding may vary!
title = re.findall(pattern, html)  # no more error

or use a bytes regex:

regex = rb'<title>(.+?)</title>'  # '.' matches any character; ',' would only match commas
#        ^
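A short sketch of the bytes-regex route, assuming a made-up bytes literal in place of the downloaded page (html_bytes and the sample markup are illustrative, not from the question). Note that the matches come back as bytes, so they still need decoding if you want text:

```python
import re

# Stand-in for response.read(); this markup is invented for the example.
html_bytes = b"<html><head><title>Example Page</title></head></html>"

# A bytes pattern can be matched against bytes directly.
bytes_pattern = re.compile(rb"<title>(.+?)</title>")
matches = bytes_pattern.findall(html_bytes)  # a list of bytes objects

# Decode only the pieces you need (UTF-8 is an assumption here).
titles = [m.decode("utf-8") for m in matches]
print(matches, titles)
```

This avoids decoding the whole document, at the cost of working with bytes in the pattern and the results.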

In this particular context, you can get the encoding from the response headers:

with urllib.request.urlopen(url) as response:
    encoding = response.info().get_param('charset', 'utf8')
    html = response.read().decode(encoding)

See the urlopen documentation for more details.
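The header lookup above could be wrapped in a small helper that falls back to a default when the charset is missing or wrong. This is only a sketch: decode_body is a hypothetical name, the fallback policy is an assumption, and the headers are built by hand with email.message.Message so the example runs offline. The object returned by response.info() is a subclass of that class, so get_param works the same way on it.

```python
from email.message import Message


def decode_body(body: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode body with the charset from Content-Type, else fall back to default."""
    encoding = headers.get_param("charset", default)
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError, TypeError):
        # Unknown codec name, bytes that do not fit it, or an unusual
        # RFC 2231 parameter value: decode leniently with the default.
        return body.decode(default, errors="replace")


# Hand-built headers standing in for response.info().
latin1_headers = Message()
latin1_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
no_headers = Message()  # no Content-Type at all

from_header = decode_body(b"caf\xe9", latin1_headers)
from_default = decode_body(b"plain ascii", no_headers)
print(from_header, from_default)
```

With a live response, the call would be decode_body(response.read(), response.info()) inside the with block.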

— Aran-Fey