software:unpaper_test

Differences

This shows you the differences between two versions of the page.

software:unpaper_test [2009/03/10 01:47] admin
software:unpaper_test [2015/04/22 21:51] (current) – [findblack] admin
Line 1: Line 1:
-====== Unpaper test =====+====== Unpaper test ======
 To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.\\ I scanned a lot of sheet music and deskewed the scans manually with another application. All processing was done in 8-bit gray at a resolution of 300 dpi.\\ These tests should show which unpaper settings give a decent automatic conversion to black and white with despeckling and noise removal.
  
Line 6: Line 6:
   * imagemagick
   * unpaper 0.3 (installed by downloading the archive, extracting it manually and running make)
-  * pdftk+  * ghostscript 
 +  * <del>pdftk</del> ((I get the following warning when I do an identify file.pdf: Warning:  File has an invalid xref entry:  2.  Rebuilding xref table.))
   * [[http://www.auditeon.com/xyz/projects/unpaper/out.pdf|testimage 8-bit gray at a resolution of 300 dpi (.pdf)]], converted to .pgm format. (Scanner: Epson GT-10000+.) Contrast was set manually to a value where music from the other side of the page ('see through') is minimal and the image background is white instead of gray.
  
Line 22: Line 23:
 <code bash>
 #!/bin/sh
-donot="--no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"+ 
 +donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite" 
 + 
 +pdflst="" 
 +ptmpf="tempfile_unpaptst_$$" 
 nmb=0
 while [ "$nmb" -le 9 ]
 do
- str1="-w 0.$nmb $donot -t pbm out.pgm outc.pbm" + str1="0.$nmb"
- unpaper $str1 + 
-# convert inputfile option annotation outputfile + unpaper -b $str1 $donot -t pbm $ptmpf-1.pgm $ptmpf.pbm 
- convert outc.pbm -gravity NorthWest -annotate 0 "unpaper $str1" outca.pbm + 
- convert -density 300 -units PixelsPerInch outca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "out-$nmb.pdf" + # convert inputfile option annotation outputfile 
- rm outc.pbm + convert $ptmpf.pbm -gravity center -background lightblue -font Open-Sans -pointsize 60 caption:"$str1 \n $donot" -composite ${ptmpf}_ca.pbm
- rm outca.pbm + convert -monochrome -density 300 -units PixelsPerInch ${ptmpf}_ca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
 + rm $ptmpf.pbm ${ptmpf}_ca.pbm
 + pdflst="$pdflst $ptmpf-$nmb.pdf"
  nmb=$(($nmb+1))
 done
-pdftk out-*.pdf output result.pdf +# Merge all pdf documents with ghostscript. Although pdftk would give a shorter command, the identify command
-rm out-*.pdf+# gives an error when it analyses the resulting pdf file.
 +gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=result.pdf $pdflst 
 +rm $pdflst
  
 exit 0
Line 67: Line 77:
 Black borders at the edges of your scan can be removed automatically with this option. See [[http://unpaper.berlios.de/unpaper.html|unpaper user documentation]] for details.
  
-=== 3. +=== 3. blurfilter === 
 +See pdf examples above. 
 +=== 4. grayfilter === 
 +See pdf examples above. 
 + 
 +===== Scripting with unpaper ===== 
 +==== gray2black ==== 
 +Automatic conversion from pdf files, containing gray images, to black and white images with unpaper and its noise reduction and cleaning up features can be done with the following script: 
 +<code bash> 
 +#!/bin/sh 
 +# filename: gray2black 
 +# This script processes a pdf file containing gray images with 
 +# unpaper, imagemagick and ghostscript to a black and white pdf file. 
 +#  
 +# Usage: gray2black input-file [option] output-file 
 +
 +# option: -b [value] specify black threshold value as being used in 
 +#                     unpaper with -b. If omitted, default value will 
 +#                     be used: -b 0.12 
 +# input-file: a pdf file containing sheetmusic with one or more gray 
 +#             images. 
 +# output-file: the resulting multipage black and white pdf file 
 +#              will be created.
 +
 +# Example: 
 +# Process gray images in mozart.pdf: gray2black mozart.pdf -b 0.23 test.pdf 
 +#  
 +# This script uses imagemagick to convert and center an image on an 
 +# a4 page (2480x3508 pixels at a resolution of 300 dpi) and 
 +# ghostscript to create a small pdf file. Piping (|) between commands, 
 +# with a dash (-) for standard input and standard output, keeps the 
 +# number of temporary files low.
 +
 +# NB. This script is time consuming. So have a lot of patience. 
 +# Each page will take about 15 seconds to execute on a single core 
 +# AMD64 3500+ cpu. 
 +
 +# User text  
 +_fbhelp="Usage: gray2black input-file [OPTION...] output-file\n\n  -b [value], specify black threshold value as being used in\n              unpaper with -b. If omitted, default value will\n              be used: -b 0.12.\n\nExample:\nProcess pdf file with unpaper black threshold 0.23:\n  gray2black mozart.pdf -b 0.23 result.pdf\nProcess pdf file with -b 0.12 settings:\n  gray2black mozart.pdf result.pdf" 
 +_fbspdf="please supply a pdf file" 
 +_fbcvrt="Check your pdf file. Does it contain gray images?" 
 +# Unpaper settings, start value for black-threshold is 0.12 
 +b_threshold="0.12" 
 +donot="--no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite" 
 +filter="-ni 10 -ls 8 -lp 4 -li 0.20" 
 +# Other settings 
 +ptmpf="tempfile_gray2black_$$" 
 + 
 +# Page dimensions 
 +psize="a4" 
 +horpix=2480 
 +verpix=3508 
 +pdim="$horpix"x"$verpix" 
 +pres="300" 
 + 
 +# Only resize if absolutely necessary. Values for resizing are for left, right, top and bottom border.
 +hborder=33 
 +vborder=33 
 +lhorpix=$horpix 
 +lverpix=$verpix 
 + 
 +# Store input-file name temporarily, because it's deleted by shift command. 
 +arg1=$1 
 + 
 +# Find number of parameters passed 
 +if [ "$#" -le 1 ]; then 
 + echo "$_fbhelp" 
 + exit 1 
 +fi 
 + 
 +# Check OPTIONS 
 +if [ "$#" -eq 3 ]; then 
 + case "$2" in 
 + * ) 
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 + elif [ "$#" -eq 4 ]; then 
 + case "$2" in 
 + -b )  
 + shift 
 + b_threshold=$2 
 + shift 
 + ;; 
 + * )  
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 +fi 
 + 
 +# only process further if there are two arguments left. 
 +if [ "$#" -eq 2 ]; then 
 + if [ ! "$(identify "$arg1"|grep PDF)" ]; then 
 + echo "$_fbspdf" 
 + exit 1 
 + else 
 + # convert all pages to pgm files with filenames $ptmpf-*.pgm
 + pdftoppm -gray -r 300 "$arg1" $ptmpf 
 + 
 + # Check whether the document needs resizing. Iterating through all pages makes sure all pages get uniform resizing later.
 + 
 + resize=0 
 + # Iterate through files. We need to sort files in a numerical order because pdftoppm generates files without a leading zero.  
 + # However before the numbers, a '-' delimiter character had been added by pdftoppm. Now cut the part before that. 
 + for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do 
 + dimension="$(identify -format "%[fx:w]x%[fx:h]" $ptmpf'-'$filenum)" 
 + #  width: "${dimension%%x*}" -> From the end removes the longest part of dimension that matches x* and returns the rest. 
 + #  height: "${dimension##*x}" -> From the beginning removes the longest part of dimension that matches *x and returns the rest. 
 + if [ ${dimension%%x*} -gt $lhorpix ]; then 
 + lhorpix=${dimension%%x*} 
 + resize=1 
 + fi 
 + if [ ${dimension##*x} -gt $lverpix ]; then 
 + lverpix=${dimension##*x} 
 + resize=1 
 + fi 
 + done 
 + # Resize document if necessary
 + if [ $resize -eq 1 ]; then 
 + # image fits within horizontal boundary, so adaptation must be vertical 
 + if [ $horpix -eq $lhorpix ]; then 
 + vscale=$(( ($verpix-30)*100/$lverpix )) 
 + else 
 + vscale=100 
 + fi 
 + # image fits within vertical boundary, so adaptation must be horizontal 
 + if [ $verpix -eq $lverpix ]; then 
 + hscale=$(( ($horpix-30)*100/$lhorpix )) 
 + else 
 + hscale=100 
 + fi 
 + # use the smaller of the horizontal and vertical scale factors
 + if [ $hscale -lt $vscale ]; then scale=$hscale; else scale=$vscale; fi 
 + 
 + echo Some images exceed the maximum size. Now scaling to "$scale"%. 
 + # iterate through all files and resize all with the same percentage, use 8 bits per pixel 
 + for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do 
 + mogrify -depth 8 -resize "$scale"% "$ptmpf-$filenum" 
 + echo resizing file: "$ptmpf-$filenum" 
 + done 
 + fi 
 + # apply unpaper onto each page 
 + for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do 
 + # Parameter expansion: ${param%word} From the end removes the smallest part of param that matches word and returns the rest. (p.72)
 + unpaper -b $b_threshold $filter $donot -t pbm "$ptmpf-$filenum" "$ptmpf-"${filenum%.pgm}.pbm 
 + rm "$ptmpf-$filenum" 
 + done 
 + 
 + # center pbm page on an a4 canvas, convert to pdf 
 + pdflst="" 
 + for filenum in $(ls $ptmpf-*.pbm | cut -d'-' -f2 | sort -n); do 
 + convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center "$ptmpf-$filenum" - miff:- | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-"${filenum%.pbm}.pdf 
 + rm "$ptmpf-$filenum" 
 + pdflst="$pdflst $ptmpf-"${filenum%.pbm}.pdf 
 + done 
 + 
 + # merge all pages into one pdf file
 + gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst 
 + rm $pdflst 
 + fi 
 +fi 
 +exit 0 
 +</code> 
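The resize step in gray2black chooses one percentage for all pages. As a minimal sketch of that calculation, the hypothetical helper ''fit_scale'' below (not part of the script) takes the canvas size and the largest page size found and returns the percentage. It assumes the same 30 pixel margin the script uses, always computes both the horizontal and the vertical factor, and picks the smaller one. It is meant to be called only when a page exceeds the canvas.

```shell
#!/bin/sh
# fit_scale MAXW MAXH W H [MARGIN]
# Hypothetical helper: prints the integer percentage by which a W x H pixel
# page must be scaled to fit a MAXW x MAXH canvas, leaving MARGIN pixels
# (default 30, the value gray2black subtracts).
fit_scale() {
    fs_margin=${5:-30}
    fs_h=$(( ($1 - fs_margin) * 100 / $3 ))   # horizontal factor
    fs_v=$(( ($2 - fs_margin) * 100 / $4 ))   # vertical factor
    # The smaller factor guarantees that both dimensions fit.
    if [ "$fs_h" -lt "$fs_v" ]; then fs_scale=$fs_h; else fs_scale=$fs_v; fi
    # Never enlarge a page.
    if [ "$fs_scale" -gt 100 ]; then fs_scale=100; fi
    echo "$fs_scale"
}

# A page that is 120 pixels too wide for a4 at 300 dpi:
fit_scale 2480 3508 2600 3508   # prints 94
```

The result could then be passed to ''mogrify -resize "$scale"%'' in the same way the script does.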
 + 
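gray2black lists the page files with ''ls'', ''cut -d'-' -f2'' and ''sort -n'', which relies on the temporary prefix containing no '-' character. A possible alternative, sketched here with the hypothetical helper ''list_pages'', strips the prefix with parameter expansion instead, so a prefix containing dashes or a directory path also works. The file names in the demo are made up.

```shell
#!/bin/sh
# list_pages PREFIX EXT
# Hypothetical helper: prints the part after "PREFIX-" of every file
# PREFIX-*.EXT, one per line, sorted numerically (1, 2, 10 instead of 1, 10, 2).
list_pages() {
    lp_prefix=$1
    for lp_file in "$lp_prefix"-*."$2"; do
        [ -e "$lp_file" ] || continue          # glob matched nothing
        lp_name=${lp_file#"$lp_prefix"-}       # remove "PREFIX-" from the front
        echo "$lp_name"
    done | sort -n
}

# Demo with empty files in a temporary directory
demo_dir=$(mktemp -d)
touch "$demo_dir/my-scan-10.pgm" "$demo_dir/my-scan-2.pgm" "$demo_dir/my-scan-1.pgm"
list_pages "$demo_dir/my-scan" pgm   # prints 1.pgm, 2.pgm, 10.pgm
rm -r "$demo_dir"
```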
 +==== findblack ==== 
 +Find a value for the black threshold with the following script:
 +<code bash> 
 +#!/bin/sh 
 +# filename: findblack 
 +# This script helps to find a suitable unpaper black threshold value (-b option) by 
 +# creating a series of pdf pages, each with a different black threshold setting. Comparing 
 +# the pages afterwards should make it easier to select a proper black threshold value. Values are 
 +# annotated on top of each page.
 +
 +# usage: findblack input-file [option] output-file 
 +
 +# option: -p [number] which page should be used as source for unpaper. If not specified, first page will be taken. 
 +# input-file: a pdf file containing one or more gray images. 
 +# output-file: a multipage pdf file with a series of black-threshold settings will be created. 
 +
 +# Example: 
 +# Process page 4 from mozart.pdf: ./findblack mozart.pdf -p 4 test.pdf 
 +#  
 +# This script uses imagemagick to convert and center an image on an a4 page (2480x3508 pixels at a resolution of 300 dpi) 
 +# and ghostscript to create a small pdf file. Piping (|) between commands, with a dash (-) for 
 +# standard input and standard output, keeps the number of temporary files low.
 +
 +# NB. This script is time consuming. So have a lot of patience. 
 +# Each page will take about 15 seconds to execute on a single core AMD64 3500+ cpu. 
 + 
 +# User text  
 +_fbhelp="Usage: findblack input-file [OPTION...] output-file\n\n  -p foo, processes page number foo\n          If this option is omitted page 1 will be processed by default\n\nExample:\nProcess page 5:\n  findblack mozart.pdf -p 5 result.pdf\nProcess the first page:\n  findblack mozart.pdf result.pdf" 
 +_fbspdf="please supply a pdf file" 
 +_fbcvrt="Can't find converted first page of pdf file. Check your pdf file" 
 + 
 +# Unpaper settings, start value for black-threshold is 0.05, end value for black-threshold is 0.20, step = 0.05
 +a=5 
 +b=20 
 +s=5 
 +donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite" 
 + 
 +# Other settings; $$ references the current PID 
 +ptmpf="tempfile_findblack_$$" 
 +psize="a4" 
 +pdim="2480x3508" 
 +pres="300" 
 +ptxt="black threshold:" 
 +pstr="-font Times-Roman -pointsize 36 -gravity NorthWest -annotate +5+5" 
 +# Default page number 
 +defpage=1 
 + 
 +# Find number of parameters passed 
 +if [ "$#" -le 1 ]; then 
 + echo "$_fbhelp" 
 + exit 1 
 +fi 
 + 
 +# Store input-file name temporarily, because it's deleted by shift command. 
 +arg1=$1 
 + 
 +# Check OPTIONS 
 +if [ "$#" -eq 3 ]; then 
 + case "$2" in 
 + * ) 
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 + elif [ "$#" -eq 4 ]; then 
 + case "$2" in 
 + -p )  
 + shift 
 + defpage=$2 
 + shift 
 + ;; 
 + * )  
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 +fi 
 + 
 +# only process further if there are two arguments left. 
 +if [ "$#" -eq 2 ]; then 
 + if [ ! "$(identify "$arg1" | grep PDF)" ]; then 
 + echo "$_fbspdf" 
 + exit 1 
 + else 
 + # find last page, start counting from 0, necessary for proper deleting pdftoppm result file which may have extra 0 
 + trail="" 
 + lastPage="$(identify "$arg1" | tail -n 1 | sed 's/\(.*\)\[\(.*\)\].*/\2/')" 
 + if [ "$lastPage" -ge 10000 ]; then
 + trail="0000"
 + elif [ "$lastPage" -ge 1000 ]; then
 + trail="000"
 + elif [ "$lastPage" -ge 100 ]; then
 + trail="00"
 + elif [ "$lastPage" -ge 10 ]; then
 + trail="0"
 + fi
 + # convert the selected page to a pgm file with filename $ptmpf-<page>.pgm
 + pdftoppm -gray -r 300 -f $defpage -l $defpage "$arg1" $ptmpf 
 + # check for successful conversion to .pgm
 + if [ ! -f "$ptmpf-$trail$defpage.pgm" ]; then 
 + echo "$_fbcvrt" 
 + rm -f "$ptmpf"-*.pgm
 + exit 1 
 + else 
 + pdflst="" 
 +  
 + nmb=$a 
 + while [ "$nmb" -le $b ] 
 + do 
 + # Cope with digits 
 + if [ "$nmb" -le 9 ]; then 
 + str1="0.0$nmb" 
 + else 
 + str1="0.$nmb" 
 + fi 
 +  
 + # process unpaper 
 + unpaper -b $str1 $donot -t pbm $ptmpf-$trail$defpage.pgm $ptmpf.pbm 
 +  
 + # Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf 
 + # New to shell scripting?
 + # A dash (-) as filename makes a command read from standard input or write to standard output.
 + # A pipe (|) sends the result to the next command.
 + # With this technique we don't need to create temporary files.
 + convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center $ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf" 
 +  
 + pdflst="$pdflst $ptmpf-$nmb.pdf" 
 + nmb=$(($nmb+$s)) 
 + rm $ptmpf.pbm 
 + done 
 + rm $ptmpf-$trail$defpage.pgm 
 + # Merge all pdf files in $pdflst into one single file with filename $2 
 + gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
 + rm $pdflst 
 + fi 
 + fi 
 +fi 
 +exit 0 
 +</code> 
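findblack builds its threshold values from the integers ''a'', ''b'' and ''s'' and adds the leading zero by hand. The same series can be produced with ''printf'', as in this small sketch. ''threshold_series'' is a hypothetical helper name, and it assumes values in hundredths below 100 (thresholds under 1.00).

```shell
#!/bin/sh
# threshold_series START END STEP
# Hypothetical helper: prints black threshold values 0.NN, one per line,
# from START to END (inclusive) in hundredths, e.g. 5 20 5 -> 0.05 ... 0.20.
threshold_series() {
    ts_n=$1
    while [ "$ts_n" -le "$2" ]; do
        printf '0.%02d\n' "$ts_n"     # %02d pads single digits: 5 -> 05
        ts_n=$((ts_n + $3))
    done
}

threshold_series 5 20 5
# prints:
# 0.05
# 0.10
# 0.15
# 0.20
```

Each printed value could be passed to ''unpaper -b'' in a loop, for example ''for t in $(threshold_series 5 20 5); do ... done''.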
 + 
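findblack has to guess how pdftoppm zero-pads the page number in its output filename. The sketch below assumes that the page number is padded to the number of digits of the document's page count. This is an assumption and differs between pdftoppm versions, so compare it with the filenames your installation produces. ''pad_page'' is a hypothetical helper, not part of the script.

```shell
#!/bin/sh
# pad_page PAGE TOTAL
# Hypothetical helper: prints PAGE zero-padded to the number of digits in
# TOTAL (the page count of the document). PAGE must not have leading zeros.
pad_page() {
    pp_width=${#2}                      # number of digits in TOTAL
    printf "%0${pp_width}d\n" "$1"
}

pad_page 4 120    # prints 004
pad_page 12 50    # prints 12
```

Under that assumption the expected filename would be ''"$ptmpf-$(pad_page "$defpage" "$total").pgm"'', where ''total'' is the page count.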
 +===== Bugs ===== 
 +[[software:unpaper_test:bugs|See the following page]]
software/unpaper_test.1236646029.txt.gz · Last modified: 2009/03/10 01:47 by admin