How to convert a color PDF to black and white

18

I'd like to convert a PDF containing colored text and images into another PDF that is black and white only, in order to reduce its size. I also want to keep the text as text, without turning the page elements into images. I tried the following command:

convert -density 150 -threshold 50% input.pdf output.pdf

which I found in another question, but it doesn't do what I want: the text in the output is converted into a poor-quality image and is no longer selectable. I then tried Ghostscript:

gs      -sOutputFile=output.pdf \
        -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite \
        -dCompatibilityLevel=1.3 \
        -dPDFSETTINGS=/screen \
        -dEmbedAllFonts=true \
        -dSubsetFonts=true \
        -sColorConversionStrategy=/Mono \
        -sColorConversionStrategyForImages=/Mono \
        -sProcessColorModel=/DeviceGray \
        $1

But it gives me the following error message:

./script.sh: 19: ./script.sh: output.pdf: not found

Is there any other way to create the file?

— BowPark

This looks very promising for that: superuser.com/questions/200378/…

— slackmart

1

Related: unix.stackexchange.com/questions/84709/…

— slm

A word of caution about the Super User methods: some of them convert the PDF to a rasterized version, so it no longer contains vector graphics.

— slm

1

Is that the entire script you are running? If not, could you post the whole script?

— terdon

23

gs example

The gs command you're running above ends with a trailing $1, which is typically how command-line arguments are passed into a script. So I'm not sure what you actually tried, but I'm guessing you put that command into a script, script.sh:

#!/bin/bash

gs      -sOutputFile=output.pdf \
        -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite \
        -dCompatibilityLevel=1.3 \
        -dPDFSETTINGS=/screen \
        -dEmbedAllFonts=true \
        -dSubsetFonts=true \
        -sColorConversionStrategy=/Mono \
        -sColorConversionStrategyForImages=/Mono \
        -sProcessColorModel=/DeviceGray \
        $1

And ran it like this:

$ ./script.sh: 19: ./script.sh: output.pdf: not found

I'm not sure how you set this script up, but it needs to be executable:

$ chmod +x script.sh

There does seem to be something wrong with that script, though. When I tried it, I got this error:

Unrecoverable error: rangecheck in .putdeviceprops

An alternative

Instead of that script, I'd use this one from the SU question:

#!/bin/bash

gs \
 -sOutputFile=output.pdf \
 -sDEVICE=pdfwrite \
 -sColorConversionStrategy=Gray \
 -dProcessColorModel=/DeviceGray \
 -dCompatibilityLevel=1.4 \
 -dNOPAUSE \
 -dBATCH \
 $1

Then run it like this:

$ ./script.bash LeaseContract.pdf 
GPL Ghostscript 8.71 (2010-02-10)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
Page 2
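
One possible refinement of the script above, offered as a sketch rather than as part of the original answer: quoting the arguments keeps file names containing spaces intact, and an explicit output argument avoids always writing to output.pdf. The function name to_gray and its usage message are made up for illustration; the gs options are the ones used above.

```shell
#!/bin/bash
# Sketch only: to_gray is a hypothetical wrapper, not part of Ghostscript.
# The gs options are the same ones used in the script above.
to_gray() {
  if [ "$#" -ne 2 ]; then
    echo "usage: to_gray INPUT.pdf OUTPUT.pdf" >&2
    return 2
  fi
  if [ ! -f "$1" ]; then
    echo "to_gray: no such file: $1" >&2
    return 1
  fi
  gs \
    "-sOutputFile=$2" \
    -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray \
    -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 \
    -dNOPAUSE \
    -dBATCH \
    "$1"
}

# Convert only when arguments were given on the command line.
if [ "$#" -gt 0 ]; then
  to_gray "$@"
fi
```

Saved as, say, to_gray.sh, it would be called as ./to_gray.sh LeaseContract.pdf LeaseContract-gray.pdf.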

— slm

You're right, there is something wrong with the script: the "something" in this case is most likely sProcessColorModel, which should be dProcessColorModel instead.

— Sora

8

I found a script here that can do this. It requires gs, which you seem to have, but also pdftk. You didn't mention your distribution, but on Debian-based systems you should be able to install it with

sudo apt-get install pdftk

You can find RPMs for it here.

Once you have installed pdftk, save the script as graypdf.sh and run it like this:

./graypdf.sh input.pdf

It will create a file called input-gray.pdf. I'm including the whole script here to avoid link rot.

#!/bin/bash
# convert pdf to grayscale, preserving metadata
# "AFAIK graphicx has no feature for manipulating colorspaces. " http://groups.google.com/group/latexusersgroup/browse_thread/thread/5ebbc3ff9978af05
# "> Is there an easy (or just standard) way with pdflatex to do a > conversion from color to grayscale when a PDF file is generated? No." ... "If you want to convert a multipage document then you better have pdftops from the xpdf suite installed because Ghostscript's pdf to ps doesn't produce nice Postscript." http://osdir.com/ml/tex.pdftex/2008-05/msg00006.html
# "Converting a color EPS to grayscale" - http://en.wikibooks.org/wiki/LaTeX/Importing_Graphics
# "\usepackage[monochrome]{color} .. I don't know of a neat automatic conversion to monochrome (there might be such a thing) although there was something in Tugboat a while back about mapping colors on the fly. I would probably make monochrome versions of the pictures, and name them consistently. Then conditionally load each one" http://newsgroups.derkeiler.com/Archive/Comp/comp.text.tex/2005-08/msg01864.html
# "Here comes optional.sty. By adding \usepackage{optional} ... \opt{color}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds_color}} \opt{grayscale}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds}} " - http://chem-bla-ics.blogspot.com/2008/01/my-phd-thesis-in-color-and-grayscale.html
# with gs:
# http://handyfloss.net/2008.09/making-a-pdf-grayscale-with-ghostscript/
# note - this strips metadata! so:
# http://etutorials.org/Linux+systems/pdf+hacks/Chapter+5.+Manipulating+PDF+Files/Hack+64+Get+and+Set+PDF+Metadata/
COLORFILENAME=$1
OVERWRITE=$2
FNAME=${COLORFILENAME%.pdf}
# NOTE: pdftk does not work with logical page numbers / pagination;
# gs kills it as well;
# so check for existence of 'pdfmarks' file in calling dir;
# if there, use it to correct gs logical pagination
# for example, see
# http://askubuntu.com/questions/32048/renumber-pages-of-a-pdf/65894#65894
PDFMARKS=
if [ -e pdfmarks ] ; then
  PDFMARKS="pdfmarks"
  echo "$PDFMARKS exists, using..."
  # convert to gray pdf - this strips metadata!
  gs "-sOutputFile=$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME" "$PDFMARKS"
else # not really needed ?!
  gs "-sOutputFile=$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME"
fi
# dump metadata from original color pdf
## pdftk $COLORFILENAME dump_data output $FNAME.data.txt
# also: pdfinfo -meta $COLORFILENAME
# grep to avoid BookmarkTitle/Level/PageNumber:
pdftk "$COLORFILENAME" dump_data output | grep 'Info\|Pdf' > "$FNAME.data.txt"
# "pdftk can take a plain-text file of these same key/value pairs and update a PDF's Info dictionary to match. Currently, it does not update the PDF's XMP stream."
pdftk "$FNAME-gs-gray.pdf" update_info "$FNAME.data.txt" output "$FNAME-gray.pdf"
# (http://wiki.creativecommons.org/XMP_Implementations : Exempi ... allows reading/writing XMP metadata for various file formats, including PDF ... )
# clean up
rm "$FNAME-gs-gray.pdf"
rm "$FNAME.data.txt"
if [ "$OVERWRITE" == "y" ] ; then
  echo "Overwriting $COLORFILENAME..."
  mv "$FNAME-gray.pdf" "$COLORFILENAME"
fi
# BUT NOTE:
# Mixing TEX & PostScript : The GEX Model - http://www.tug.org/TUGboat/Articles/tb21-3/tb68kost.pdf
# VTEX is a (commercial) extended version of TEX, sold by MicroPress, Inc. Free versions of VTEX have recently been made available, that work under OS/2 and Linux. This paper describes GEX, a fast fully-integrated PostScript interpreter which functions as part of the VTEX code-generator. Unless specified otherwise, this article describes the functionality in the free- ware version of the VTEX compiler, as available on CTAN sites in systems/vtex.
# GEX is a graphics counterpart to TEX. .. Since GEX may exercise subtle influence on TEX (load fonts, or change TEX registers), GEX is op- tional in VTEX implementations: the default oper- ation of the program is with GEX off; it is enabled by a command-line switch.
# \includegraphics[width=1.3in, colorspace=grayscale 256]{macaw.jpg}
# http://mail.tug.org/texlive/Contents/live/texmf-dist/doc/generic/FAQ-en/html/FAQ-TeXsystems.html
# A free version of the commercial VTeX extended TeX system is available for use under Linux, which among other things specialises in direct production of PDF from (La)TeX input. Sadly, it's no longer supported, and the ready-built images are made for use with a rather ancient Linux kernel.
# NOTE: another way to capture metadata; if converting via ghostscript:
# http://compgroups.net/comp.text.pdf/How-to-specify-metadata-using-Ghostscript
# first:
# grep -a 'Keywo' orig.pdf
# /Author(xxx)/Title(ttt)/Subject()/Creator(LaTeX)/Producer(pdfTeX-1.40.12)/Keywords(kkkk)
# then - copy this data in a file prologue.ini:
#/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
#[/Author(xxx)
#/Title(ttt)
#/Subject()
#/Creator(LaTeX with hyperref package + gs w/ prologue)
#/Producer(pdfTeX-1.40.12)
#/Keywords(kkkk)
#/DOCINFO pdfmark
#
# finally, call gs on the orig file,
# asking to process pdfmarks in prologue.ini:
# gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
# -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -dDOPDFMARKS \
# -sOutputFile=out.pdf in.pdf prologue.ini
# then the metadata will be in output too (which is stripped otherwise;
# note bookmarks are preserved, however).
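
The grep step in the script is what restricts the restored metadata to the document information. Here is a minimal sketch of just that filter, run on made-up sample data in the shape of pdftk's dump_data output. The function name keep_info and the sample values are assumptions for illustration, and grep -E is used as an equivalent of the 'Info\|Pdf' pattern.

```shell
#!/bin/bash
# keep_info: hypothetical helper; keeps Info*/Pdf* lines and drops
# bookmark and page-count lines, like grep 'Info\|Pdf' in the script.
keep_info() {
  grep -E 'Info|Pdf'
}

# Made-up sample in the style of `pdftk in.pdf dump_data`.
sample='InfoKey: Title
InfoValue: Lease Contract
PdfID0: 0123abcd
NumberOfPages: 2
BookmarkTitle: Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1'

# Prints only the InfoKey, InfoValue and PdfID0 lines.
printf '%s\n' "$sample" | keep_info
```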

— terdon

3

I also had some scanned color PDFs and grayscale PDFs that I wanted to convert to bw. I tried using gs with the code listed here, and the image quality is good, with the PDF text still intact. However, that gs code only converts to grayscale (as asked in the question) and the file size is still large. convert yields very poor results when used directly.

I wanted bw PDFs with good image quality and a small file size. I would have tried terdon's solution, but I could not get pdftk on CentOS 7 using yum (at the time of writing).

My solution uses gs to extract grayscale bmp files from the PDF, convert to threshold those bmps to bw and save them as TIFF files, and then img2pdf to compress the TIFF images and merge them into one PDF.

I tried going directly from the PDF to tiff, but the quality was not the same, so I save each page as a bmp. For a one-page PDF, convert does a great job going from bmp to PDF. Example:

gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -r300x300 \
   -sOutputFile=./pdf_image.bmp ./input.pdf

convert ./pdf_image.bmp -threshold 40% -compress zip ./bw_out.pdf

For multiple pages, gs can merge multiple PDF files into one, but img2pdf yields a smaller file size than gs. The TIFF files must be uncompressed as input to img2pdf. Keep in mind that for large numbers of pages, the intermediate bmp and tiff files tend to be large. pdftk or joinpdf would be better if they can merge the compressed PDF files produced by convert.

I imagine there is a more elegant solution. However, my method produces results with very good image quality and a much smaller file size. To get the text back in the bw PDF, run OCR again.

My shell script uses gs, convert, and img2pdf. Change the parameters listed at the beginning (number of pages, scan dpi, threshold %, etc.) as needed and run chmod +x ./pdf2bw.sh. Here is the full script (pdf2bw.sh):

#!/bin/bash

num_pages=12
dpi_res=300
input_pdf_name=color_or_grayscale.pdf
bw_threshold=40%
output_pdf_name=out_bw.pdf
#-------------------------------------------------------------------------
gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -q -r$dpi_res \
   -sOutputFile=./%d.bmp ./$input_pdf_name
#-------------------------------------------------------------------------
for file_num in `seq 1 $num_pages`
do
  convert ./$file_num.bmp -threshold $bw_threshold \
          ./$file_num.tif
done
#-------------------------------------------------------------------------
input_files=""

for file_num in `seq 1 $num_pages`
do
  input_files+="./$file_num.tif "
done

img2pdf -o ./$output_pdf_name --dpi $dpi_res $input_files
#-------------------------------------------------------------------------
# clean up bmp and tif files used in conversion

for file_num in `seq 1 $num_pages`
do
  rm ./$file_num.bmp
  rm ./$file_num.tif
done
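
A possible variation on the script above, sketched under assumptions rather than tested against the original setup: letting gs zero-pad the page number (%03d) means the intermediate files sort correctly by name, so the page count no longer has to be set by hand. The function names page_tif and pdf2bw and the temporary-directory layout are made up for illustration; the gs, convert, and img2pdf options are taken from the script above.

```shell
#!/bin/bash
# Sketch only: page_tif and pdf2bw are hypothetical helper names.

# Map an intermediate bmp path to the matching tif path.
page_tif() {
  printf '%s\n' "${1%.bmp}.tif"
}

# pdf2bw INPUT.pdf OUTPUT.pdf [DPI] [THRESHOLD]
pdf2bw() {
  local input=$1 output=$2 dpi=${3:-300} threshold=${4:-"40%"}
  local workdir bmp rc=0
  workdir=$(mktemp -d) || return 1

  # One grayscale bmp per page; %03d zero-pads the page number so a
  # glob lists the pages in order (up to 999 pages).
  gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -q -r"$dpi" \
     "-sOutputFile=$workdir/page_%03d.bmp" "$input" || rc=$?

  # Threshold each page to black and white.
  if [ "$rc" -eq 0 ]; then
    for bmp in "$workdir"/page_*.bmp; do
      convert "$bmp" -threshold "$threshold" "$(page_tif "$bmp")" \
        || { rc=$?; break; }
    done
  fi

  # Merge the tif pages into one PDF.
  if [ "$rc" -eq 0 ]; then
    img2pdf -o "$output" --dpi "$dpi" "$workdir"/page_*.tif || rc=$?
  fi

  # Remove the intermediates whether or not the conversion succeeded.
  rm -rf "$workdir"
  return "$rc"
}

# Convert only when input and output names were given.
if [ "$#" -ge 2 ]; then
  pdf2bw "$@"
fi
```

Working in a temporary directory also keeps the numbered bmp and tif files from colliding with anything already in the current directory.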

— OccamsRazor

1

RHEL6 and RHEL5, which both base Ghostscript on 8.70, could not use the forms of the command given above. Assuming a script or function that expects the PDF file as the first argument "$1", the following should be more portable:

gs \
    -sOutputFile="grey_$1" \
    -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Mono \
    -sColorConversionStrategyForImages=/Mono \
    -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.3 \
    -dNOPAUSE -dBATCH \
    "$1"

where the output file name will be prefixed with "grey_".

RHEL6 and 5 can use CompatibilityLevel=1.4, which is much faster, but I was aiming for portability.
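
One caveat with the "grey_$1" pattern, noted here as an aside rather than as part of the original answer: when "$1" contains a directory (for example docs/in.pdf), the prefix ends up in front of the directory name. A small sketch of one way around that, using a hypothetical helper named grey_name:

```shell
#!/bin/bash
# grey_name: hypothetical helper; puts the "grey_" prefix on the file
# name only, leaving the directory part untouched.
grey_name() {
  printf '%s/grey_%s\n' "$(dirname -- "$1")" "$(basename -- "$1")"
}

grey_name docs/in.pdf   # prints docs/grey_in.pdf
grey_name in.pdf        # prints ./grey_in.pdf
```

The result could then be passed to gs as -sOutputFile="$(grey_name "$1")".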

— Rich

The developers say (1, 2, 3, 4) that there is no sColorConversionStrategyForImages switch.

— Igor

Thanks @Igor - I have no idea where I got that from! I do know for a fact that I tested it and it worked at the time. (And that is why you should always give references for your code.)

— Rich

1

That "fake parameter" seems to be incredibly popular on the web. GS ignores unknown switches (which is sad), so the command still works.

— Igor

1

I get reliably decent results cleaning up scanned PDFs to good contrast with this script:

#!/bin/bash
# 
# $ sudo apt install poppler-utils img2pdf pdftk imagemagick
#
# Output is still greyscale, but lots of scanner light tone fuzz removed.
#

pdfimages "$1" pages

ls ./pages*.ppm | xargs -L1 -I {} convert {}  -quality 100 -density 400 \
  -fill white -fuzz 80% -auto-level -depth 4 +opaque "#000000" {}.jpg

ls -1 ./pages*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf

pdftk pages*.pdf cat output "${1/.pdf/}_bw.pdf"

rm pages*
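
A small note on the output name in the script above, as a hedged aside: ${1/.pdf/} removes the first occurrence of .pdf anywhere in the name, whereas ${1%.pdf} removes it only from the end. For typical file names the two agree; the made-up name below shows where they differ.

```shell
#!/bin/bash
# Made-up file name containing ".pdf" twice.
name="scan.pdf.old.pdf"

echo "${name/.pdf/}_bw.pdf"   # first ".pdf" removed: scan.old.pdf_bw.pdf
echo "${name%.pdf}_bw.pdf"    # trailing ".pdf" removed: scan.pdf.old_bw.pdf
```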

— Bijou Smith
