Byte order mark screws up file reading in Java

107

I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all. When present, the byte order mark gets read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?

Thanks!

java utf-8 byte-order-mark

— Tom

Maybe: rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

— Chris

114

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in SUN's bug database. Incorporate it in your code and you're fine.

/* ____________________________________________________________________________
 * 
 * File:    UnicodeBOMInputStream.java
 * Author:  Gregory Pakosz.
 * Date:    02 - November - 2005    
 * ____________________________________________________________________________
 */
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * The <code>UnicodeBOMInputStream</code> class wraps any
 * <code>InputStream</code> and detects the presence of any Unicode BOM
 * (Byte Order Mark) at its beginning, as defined by
 * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
 * 
 * <p>The
 * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
 * defines 5 types of BOMs:<ul>
 * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
 * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
 * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
 * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
 * <li><pre>EF BB BF     = UTF-8</pre></li>
 * </ul></p>
 * 
 * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
 * or not.
 * </p>
 * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
 * wrapped <code>InputStream</code> object.</p>
 */
public class UnicodeBOMInputStream extends InputStream
{
  /**
   * Type safe enumeration class that describes the different types of Unicode
   * BOMs.
   */
  public static final class BOM
  {
    /**
     * NONE.
     */
    public static final BOM NONE = new BOM(new byte[]{},"NONE");

    /**
     * UTF-8 BOM (EF BB BF).
     */
    public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                       (byte)0xBB,
                                                       (byte)0xBF},
                                            "UTF-8");

    /**
     * UTF-16, little-endian (FF FE).
     */
    public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE},
                                                "UTF-16 little-endian");

    /**
     * UTF-16, big-endian (FE FF).
     */
    public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-16 big-endian");

    /**
     * UTF-32, little-endian (FF FE 00 00).
     */
    public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE,
                                                            (byte)0x00,
                                                            (byte)0x00},
                                                "UTF-32 little-endian");

    /**
     * UTF-32, big-endian (00 00 FE FF).
     */
    public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                            (byte)0x00,
                                                            (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-32 big-endian");

    /**
     * Returns a <code>String</code> representation of this <code>BOM</code>
     * value.
     */
    public final String toString()
    {
      return description;
    }

    /**
     * Returns the bytes corresponding to this <code>BOM</code> value.
     */
    public final byte[] getBytes()
    {
      final int     length = bytes.length;
      final byte[]  result = new byte[length];

      // Make a defensive copy
      System.arraycopy(bytes,0,result,0,length);

      return result;
    }

    private BOM(final byte bom[], final String description)
    {
      assert(bom != null)               : "invalid BOM: null is not allowed";
      assert(description != null)       : "invalid description: null is not allowed";
      assert(description.length() != 0) : "invalid description: empty string is not allowed";

      this.bytes          = bom;
      this.description  = description;
    }

            final byte    bytes[];
    private final String  description;

  } // BOM

  /**
   * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
   * specified <code>InputStream</code>.
   * 
   * @param inputStream an <code>InputStream</code>.
   * 
   * @throws NullPointerException when <code>inputStream</code> is
   * <code>null</code>.
   * @throws IOException on reading from the specified <code>InputStream</code>
   * when trying to detect the Unicode BOM.
   */
  public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                      IOException

  {
    if (inputStream == null)
      throw new NullPointerException("invalid input stream: null is not allowed");

    in = new PushbackInputStream(inputStream,4);

    final byte  bom[] = new byte[4];
    final int   read  = in.read(bom);

    switch(read)
    {
      case 4:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE) &&
            (bom[2] == (byte)0x00) &&
            (bom[3] == (byte)0x00))
        {
          this.bom = BOM.UTF_32_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0x00) &&
            (bom[1] == (byte)0x00) &&
            (bom[2] == (byte)0xFE) &&
            (bom[3] == (byte)0xFF))
        {
          this.bom = BOM.UTF_32_BE;
          break;
        }

      case 3:
        if ((bom[0] == (byte)0xEF) &&
            (bom[1] == (byte)0xBB) &&
            (bom[2] == (byte)0xBF))
        {
          this.bom = BOM.UTF_8;
          break;
        }

      case 2:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE))
        {
          this.bom = BOM.UTF_16_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0xFE) &&
            (bom[1] == (byte)0xFF))
        {
          this.bom = BOM.UTF_16_BE;
          break;
        }

      default:
        this.bom = BOM.NONE;
        break;
    }

    if (read > 0)
      in.unread(bom,0,read);
  }

  /**
   * Returns the <code>BOM</code> that was detected in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return a <code>BOM</code> value.
   */
  public final BOM getBOM()
  {
    // BOM type is immutable.
    return bom;
  }

  /**
   * Skips the <code>BOM</code> that was found in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return this <code>UnicodeBOMInputStream</code>.
   * 
   * @throws IOException when trying to skip the BOM from the wrapped
   * <code>InputStream</code> object.
   */
  public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
  {
    if (!skipped)
    {
      in.skip(bom.bytes.length);
      skipped = true;
    }
    return this;
  }

  /**
   * {@inheritDoc}
   */
  public int read() throws IOException
  {
    return in.read();
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[]) throws  IOException,
                                          NullPointerException
  {
    return in.read(b,0,b.length);
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[],
                  final int off,
                  final int len) throws IOException,
                                        NullPointerException
  {
    return in.read(b,off,len);
  }

  /**
   * {@inheritDoc}
   */
  public long skip(final long n) throws IOException
  {
    return in.skip(n);
  }

  /**
   * {@inheritDoc}
   */
  public int available() throws IOException
  {
    return in.available();
  }

  /**
   * {@inheritDoc}
   */
  public void close() throws IOException
  {
    in.close();
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void mark(final int readlimit)
  {
    in.mark(readlimit);
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void reset() throws IOException
  {
    in.reset();
  }

  /**
   * {@inheritDoc}
   */
  public boolean markSupported() 
  {
    return in.markSupported();
  }

  private final PushbackInputStream in;
  private final BOM                 bom;
  private       boolean             skipped = false;

} // UnicodeBOMInputStream

And you use it this way:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
  public static void main(final String[] args) throws Exception
  {
    FileInputStream fis = new FileInputStream("test/offending_bom.txt");
    UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

    System.out.println("detected BOM: " + ubis.getBOM());

    System.out.print("Reading the content of the file without skipping the BOM: ");
    InputStreamReader isr = new InputStreamReader(ubis);
    BufferedReader br = new BufferedReader(isr);

    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();

    fis = new FileInputStream("test/offending_bom.txt");
    ubis = new UnicodeBOMInputStream(fis);
    isr = new InputStreamReader(ubis);
    br = new BufferedReader(isr);

    ubis.skipBOM();

    System.out.print("Reading the content of the file after skipping the BOM: ");
    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();
  }

} // UnicodeBOMInputStreamUsage

— Gregory Pakosz

2

Sorry for the long scrolling area; too bad there is no attachment feature.

— Gregory Pakosz

Thanks Gregory, that's just what I'm looking for.

— Tom

3

This should be in the core Java API.

— Denis Kniazhev

7

10 years have passed and I'm still receiving karma for this :D I'm looking at you, Java!

— Gregory Pakosz

1

Upvoted because the answer gives some history on why the file input stream does not offer an option to discard the BOM by default.

— MxLDevs

95

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (Javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}

If you also need to detect different encodings, it can distinguish among various byte order marks, e.g. UTF-8 vs. UTF-16 big and little endian; see the doc link above for details. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality; maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect what encoding some bytes are in, but if the stream starts with a BOM, this can be helpful.

Edit: If you need to detect the BOM in UTF-16, UTF-32, etc., then the constructor should be:

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

— rescdsk

Just skipping the BOM should be the perfect solution for 99% of the use cases.

— atamanroman

7

I used this answer successfully. However, I would respectfully add the boolean argument to specify whether to include or exclude the BOM. Example: BOMInputStream bomIn = new BOMInputStream(in, false); // don't include the BOM

— Kevin Meredith

19

I would also add that this only detects the UTF-8 BOM. If you want to detect all UTF-x BOMs, you need to pass them in to the BOMInputStream constructor:

BOMInputStream bomIn = new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);

— Martin Charlesworth

Regarding @KevinMeredith's comment, I want to point out that the constructor with the boolean is clearer, but the default constructor already gets rid of the UTF-8 BOM, as the JavaDoc suggests: BOMInputStream(InputStream delegate) Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— WesternGun

Skipping solves most of my problems. If my file starts with a UTF_16BE BOM, can I create an InputReader by skipping the BOM and reading the file as UTF_8? So far it works; I would like to understand whether there are any edge cases. Thanks in advance.

— Bhaskar
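A quick experiment can make the edge case concrete. The sketch below is not from the thread: the class name `Utf16AsUtf8Demo` and its method are hypothetical, and it only shows what happens to plain ASCII text when UTF-16BE bytes are decoded as UTF-8 after the BOM is gone.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class Utf16AsUtf8Demo {

    /**
     * Encodes the text as UTF-16BE (this charset writes no BOM),
     * then decodes those bytes as if they were UTF-8.
     */
    static String utf16beReadAsUtf8(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = utf16beReadAsUtf8("id");
        // Each ASCII letter is preceded by a NUL character, so the length doubles.
        System.out.println(decoded.length());
    }
}
```

For ASCII content every character comes back with a NUL in front of it, so string comparisons such as `"id".equals(...)` fail even though the text may look correct when printed. Skipping the BOM removes the marker but does not change how the rest of the bytes are encoded.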

31

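A simpler solution:

import java.io.IOException;
import java.io.Reader;

public class BOMSkipper
{
    // The reader must support mark/reset (a BufferedReader does).
    public static void skip(Reader reader) throws IOException
    {
        reader.mark(1);
        char[] possibleBOM = new char[1];
        reader.read(possibleBOM);

        // U+FEFF is the BOM as a decoded character; anything else is content, so rewind.
        if (possibleBOM[0] != '\ufeff')
        {
            reader.reset();
        }
    }
}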

Usage example:

BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
BOMSkipper.skip(input);
//Now UTF prefix not present:
input.readLine();
...

It works with all 5 UTF encodings!

1

Very nice, Andrei. But could you explain why it works? How does the pattern 0xFEFF match UTF-8 files, which seem to have a different pattern and 3 bytes instead of 2? And how can that pattern match both endiannesses of UTF-16 and UTF-32?

— Vahid Pazirandeh

1

As you can see, I don't use a byte stream but a character stream opened with the expected charset. So if the first character from this stream is a BOM, I skip it. The BOM can have a different byte representation in each encoding, but it is a single character. Please read this article, it helped me: joelonsoftware.com/articles/Unicode.html
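The point about "different bytes, one character" can be checked with the JDK alone. In this minimal sketch the class `BomBytesDemo` and its two helpers are hypothetical names chosen for illustration; it encodes U+FEFF in three charsets and decodes it again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class BomBytesDemo {

    /** Hex dump of U+FEFF encoded in the given charset. */
    static String bomBytes(Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : "\uFEFF".getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes BOM + "a", decodes it again, and checks that the first char is U+FEFF. */
    static boolean firstCharIsBom(Charset charset) {
        byte[] encoded = "\uFEFFa".getBytes(charset);
        String decoded = new String(encoded, charset);
        return decoded.length() == 2 && decoded.charAt(0) == '\uFEFF';
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": " + bomBytes(cs) + ", one char: " + firstCharIsBom(cs));
        }
    }
}
```

The byte sequences differ (three bytes for UTF-8, two for each UTF-16 variant), yet after decoding with the matching charset the reader sees the same single `char`, which is why a character-level check is enough.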

Nice solution. Just make sure the file is not empty, to avoid an IOException in the skip method before reading. You can do that by calling if (reader.ready()) { reader.read(possibleBOM) ... }

— Snow
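One way to probe the empty-input concern is to run the same mark/read/reset logic against in-memory readers. This is only a sketch: the class name `BomSkipCheck`, its helpers, and the `StringReader` inputs are assumptions made for illustration. It checks the return value of `read()` so that an empty stream is simply rewound.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical test harness, for illustration only.
final class BomSkipCheck {

    /** Consumes a leading U+FEFF if present; otherwise leaves the reader where it was. */
    static void skipBom(Reader reader) throws IOException {
        reader.mark(1);
        int first = reader.read();   // -1 when the stream is empty
        if (first != 0xFEFF) {
            reader.reset();          // not a BOM, or end of stream: rewind
        }
    }

    /** Returns everything left in the content after the optional BOM. */
    static String readAfterBom(String content) {
        try (Reader reader = new StringReader(content)) {
            skipBom(reader);
            StringBuilder sb = new StringBuilder();
            for (int c = reader.read(); c != -1; c = reader.read()) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterBom("\uFEFFid,name"));
        System.out.println(readAfterBom("id,name"));
        System.out.println(readAfterBom("").isEmpty());
    }
}
```

With a `StringReader`, the empty case behaves like the no-BOM case. Other `Reader` implementations may behave differently, so it is worth trying this against the actual reader type in use.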

I see you have covered 0xFE 0xFF, which is the byte order mark for UTF-16BE. But what if the first 3 bytes are 0xEF 0xBB 0xBF (the byte order mark for UTF-8)? You claim that this works for all UTF-8 formats. That could be true (I haven't tested your code), but then how does it work?

— bvdb

1

See my answer to Vahid: I'm not opening a byte stream but a character stream, and I read one character from it. It doesn't matter which UTF encoding is used for the file. The BOM prefix can be represented by a different number of bytes, but in terms of characters it is just one character.

24

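The Google Data API has a UnicodeReader which automagically detects the encoding.

You can use it instead of InputStreamReader. Here's a slightly compacted extract of its source, which is pretty straightforward:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {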
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

— BalusC

It seems the link says the Google Data API is deprecated? Where should one look for the Google Data API now?

— SOUser

1

@XichenLi: The GData API has been deprecated for its intended purpose. I didn't intend to suggest using the GData API directly (the OP isn't using any GData service); I intended to take over the source code as an example for your own implementation. That's also why I included it in my answer, ready for copy-paste.

— BalusC

There is a bug in this. The UTF-32LE case is unreachable. In order for (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00) to be true, the UTF-16LE case ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) would already have matched.

— Joshua Taylor

Since this code is from the Google Data API, I posted issue 471 about it.

— Joshua Taylor
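The ordering problem is easy to reproduce in isolation. The sketch below is not the GData code: `BomOrderDemo` and both method names are hypothetical, and the second method shows just one possible fix, namely testing the 4-byte marks before the 2-byte ones.

```java
// Hypothetical demo class, for illustration only.
final class BomOrderDemo {

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Same test order as the extract above: UTF-16LE is checked before UTF-32LE. */
    static String detectOriginalOrder(byte[] bom) {
        if (startsWith(bom, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(bom, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(bom, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(bom, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(bom, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE"; // never reached
        return null;
    }

    /** One possible fix: check the longer UTF-32 marks first. */
    static String detectLongestFirst(byte[] bom) {
        if (startsWith(bom, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(bom, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(bom, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(bom, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(bom, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    public static void main(String[] args) {
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        System.out.println(detectOriginalOrder(utf32le));
        System.out.println(detectLongestFirst(utf32le));
    }
}
```

Note that FF FE 00 00 is inherently ambiguous: it could also be a UTF-16LE BOM followed by a NUL character, so "longest match first" is a heuristic rather than a guarantee.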

13

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see it mention how to get an InputStream without the BOM.

Here's how I did it in Scala.

import java.io._
import org.apache.commons.io.input.BOMInputStream

val file = new File(path_to_xml_file_with_BOM)
val fileInpStream = new FileInputStream(file)
// false means don't include the BOM in the stream
val bomIn = new BOMInputStream(fileInpStream, false)

— Kevin Meredith

The single-arg constructor does it: public BOMInputStream(InputStream delegate) { this(delegate, false, ByteOrderMark.UTF_8); }. It excludes the UTF-8 BOM by default.

— Vladimir Vagaytsev

Good point, Vladimir. I see that in its docs - commons.apache.org/proper/commons-io/javadocs/api-2.2/org/… : Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— Kevin Meredith

4

To remove the BOM characters from your file, I recommend using Apache Commons IO:

public BOMInputStream(InputStream delegate,
              boolean include)
Constructs a new BOM InputStream that detects a ByteOrderMark.UTF_8 and optionally includes it.
Parameters:
delegate - the InputStream to delegate to
include - true to include the UTF-8 BOM or false to exclude it

Set include to false and your BOM characters will be excluded.

— Andreas Baaserud

2

Unfortunately not. You'll have to identify and skip it yourself. This page details what you have to watch for. Also see this SO question for more details.

— Brian Agnew

1

I had the same problem, and because I wasn't reading in a bunch of files I went for a simpler solution. I think my encoding was UTF-8, because when I printed out the offending character with the help of this page: Get unicode value of a character, I found that it was \ufeff. I used the code System.out.println( "\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1) ); to print out the offending Unicode value.

Once I had the offending Unicode value, I replaced it in the first line of my file before I went on reading. The business logic of that section:

String str = reader.readLine().trim();
str = str.replace("\ufeff", "");

This fixed my problem. Then I was able to go on processing the file with no issue. I added trim() just in case of leading or trailing whitespace; you can do that or not, depending on your specific needs.

— Amy B. Higgins

1

That didn't work for me, but I used .replaceFirst("\u00EF\u00BB\u00BF", "") which did.

— StackUMan
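One plausible explanation for why the two replacements differ (an assumption, since neither poster shows how the reader was opened) is the charset used to decode the file. The sketch below uses a hypothetical class `BomMisdecodeDemo` to decode the same UTF-8 BOM bytes in two ways and to strip either form from a first line.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class BomMisdecodeDemo {

    // A UTF-8 BOM (EF BB BF) followed by the ASCII text "id".
    private static final byte[] UTF8_BOM_THEN_TEXT =
            {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};

    /** Decoded with the right charset: the BOM becomes the single char U+FEFF. */
    static String decodedAsUtf8() {
        return new String(UTF8_BOM_THEN_TEXT, StandardCharsets.UTF_8);
    }

    /** Decoded as ISO-8859-1: each BOM byte becomes its own char (U+00EF, U+00BB, U+00BF). */
    static String decodedAsLatin1() {
        return new String(UTF8_BOM_THEN_TEXT, StandardCharsets.ISO_8859_1);
    }

    /** Strips whichever form of the BOM the first line starts with, if any. */
    static String stripEither(String firstLine) {
        if (firstLine.startsWith("\uFEFF")) {
            return firstLine.substring(1);
        }
        if (firstLine.startsWith("\u00EF\u00BB\u00BF")) {
            return firstLine.substring(3);
        }
        return firstLine;
    }

    public static void main(String[] args) {
        System.out.println(stripEither(decodedAsUtf8()));
        System.out.println(stripEither(decodedAsLatin1()));
    }
}
```

If the three-character form shows up, the file is probably being read with a single-byte charset, and any non-ASCII content further down will be garbled too. In that case opening the reader with UTF-8 is likely a better fix than patching the first line.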