software:unpaper_test

Differences

This shows you the differences between two versions of the page.

software:unpaper_test [2009/03/11 05:16] adminsoftware:unpaper_test [2015/04/22 21:51] (current) – [findblack] admin
Line 1: Line 1:
-====== Unpaper test =====+====== Unpaper test ======
 To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.\\ I scanned a lot of sheet music and deskewed the scans manually with another application. Everything was done in 8-bit gray at a resolution of 300 dpi.\\ These tests should show which unpaper settings give a decent automatic conversion to black and white with despeckling and noise removal. To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.\\ I scanned a lot of sheet music and deskewed the scans manually with another application. Everything was done in 8-bit gray at a resolution of 300 dpi.\\ These tests should show which unpaper settings give a decent automatic conversion to black and white with despeckling and noise removal.
  
Line 34: Line 34:
  str1="0.$nmb"  str1="0.$nmb"
  
- unpaper -b $str1 $donot -t pbm /tmp/$ptmpf-1.pgm /tmp/$ptmpf.pbm+ unpaper -b $str1 $donot -t pbm $ptmpf-1.pgm $ptmpf.pbm
  
  # convert inputfile option annotation outputfile  # convert inputfile option annotation outputfile
- convert /tmp/$ptmpf.pbm -gravity NorthWest -annotate 0 "unpaper -$str1 $donot" /tmp/$ptmpf_ca.pbm + convert $ptmpf.pbm -gravity center -background lightblue -font Open-Sans -pointsize 60 caption:"$str1 \n $donot" -composite ${ptmpf}_ca.pbm 
- convert -monochrome -density 300 -units PixelsPerInch /tmp/$ptmpf_ca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "/tmp/$ptmpf-$nmb.pdf" + convert -monochrome -density 300 -units PixelsPerInch ${ptmpf}_ca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf" 
- rm /tmp/$ptmpf.pbm /tmp/$ptmpf_ca.pbm + rm $ptmpf.pbm ${ptmpf}_ca.pbm 
- pdflst="$pdflst /tmp/$ptmpf-$nmb.pdf"+ pdflst="$pdflst $ptmpf-$nmb.pdf"
  nmb=$(($nmb+1))  nmb=$(($nmb+1))
 done done
Line 83: Line 83:
  
 ===== Scripting with unpaper ===== ===== Scripting with unpaper =====
 +==== gray2black ====
 The following script automatically converts pdf files containing gray images to black and white, using unpaper's noise reduction and clean-up features: The following script automatically converts pdf files containing gray images to black and white, using unpaper's noise reduction and clean-up features:
 <code bash> <code bash>
 #!/bin/sh #!/bin/sh
-# filename: findblack.sh +# filename: gray2black 
-# This script helps to find the value for unpaper's black threshold value (-b option)+# This script processes a pdf file containing gray images with 
 +# unpaper, imagemagick and ghostscript to a black and white pdf file. 
 +#  
 +# Usage: gray2black input-file [option] output-file
 # #
-# usage: findblack.sh <inputfile.pdf> <outputfile.pdf>+# option: -b [value] specify black threshold value as being used in 
 +#                     unpaper with -b. If omitted, default value will 
 +#                     be used: -b 0.12 
 +# input-file: a pdf file containing sheetmusic with one or more gray 
 +#             images. 
 +# output-file: a multipage pdf file with the black and white result 
 +#              will be created.
 # #
-# <inputfile.pdf> a pdf file containing one or more gray images+# Example: 
-# <outputfile.pdf> a multipage pdf file with a series of black-threshold settings will be created.+# Process gray images in mozart.pdf: gray2black mozart.pdf -b 0.23 test.pdf 
 + 
 +# This script uses imagemagick to convert and center an image on an 
 +# a4 page(With 2480x3508 pixels at resolution of 300 dpi) and 
 +# ghostscript to create a small pdf file.  By using efficiently the 
 +# piping method (|) in conjunction with reading and writing to the 
 +# standard input (-) we get less tempfiles.
 # #
-# Observe the outputfile.pdf and find the page which has the best value for black-threshold. (Text is embedded)+# NB. This script is time consuming. So have a lot of patience. 
 +# Each page will take about 15 seconds to execute on a single core 
 +# AMD64 3500+ cpu.
 # #
-#  +# User text  
-# Use imagemagick to convert and center an image on an a4 page(With 2480x3508 pixels at a resolution of 300 dpi). +_fbhelp="Usage: gray2black input-file [OPTION...] output-file\n\n  -b [value], specify black threshold value as being used in\n              unpaper with -b. If omitted, default value will\n              be used: -b 0.12.\n\nExample:\nProcess pdf file with unpaper black threshold 0.23:\n  gray2black mozart.pdf -b 0.23 result.pdf\nProcess pdf file with -b 0.12 settings:\n  gray2black mozart.pdf result.pdf" 
-# Use ghostscript to create a small pdf file.  By using efficiently the piping method (|) in conjunction with +_fbspdf="please supply a pdf file" 
-# reading and writing to the standard input (-) we get:+_fbcvrt="Check your pdf file. Does it contain gray images?" 
 +# Unpaper settings, start value for black-threshold is 0.12 
 +b_threshold="0.12" 
 +donot="--no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite" 
 +filter="-ni 10 -ls 8 -lp 4 -li 0.20" 
 +# Other settings 
 +ptmpf="tempfile_gray2black_$$" 
 + 
 +# Page dimensions
 psize="a4" psize="a4"
-pdim="2480x3508"+horpix=2480 
 +verpix=3508 
 +pdim="$horpix"x"$verpix"
 pres="300" pres="300"
-convert -size $pdim xc:white pbm:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center inputfile - ps:- | ps2pdf13 -sPAPERSIZE=$psize output.pdf + 
-# This will take about 5 seconds for a single page to execute on a single core AMD64 3500+ cpu.+# Only resize if absolutely necessary. Values for resizing are for left, right, top and bottom border. 
 +hborder=33 
 +vborder=33 
 +lhorpix=$horpix 
 +lverpix=$verpix 
 + 
 +# Store input-file name temporarily, because it's deleted by shift command. 
 +arg1=$1 
 + 
 +# Find number of parameters passed 
 +if [ "$#" -le 1 ]; then 
 + echo "$_fbhelp" 
 + exit 1 
 +fi 
 + 
 +# Check OPTIONS 
 +if [ "$#" -eq 3 ]; then 
 + case "$2" in 
 + * ) 
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 + elif [ "$#" -eq 4 ]; then 
 + case "$2" in 
 + -b )  
 + shift 
 + b_threshold=$2 
 + shift 
 + ;; 
 + * )  
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 +fi 
 + 
 +# only process further if there are two arguments left. 
 +if [ "$#" -eq 2 ]; then 
 + if [ ! "$(identify "$arg1"|grep PDF)" ]; then 
 + echo "$_fbspdf" 
 + exit 1 
 + else 
 + # convert all pages to pgm files with filenames $ptmpf-*.pgm 
 + pdftoppm -gray -r 300 "$arg1" $ptmpf 
 + 
 + # Check whether the document needs resizing. Iterating through all pages makes sure all pages get uniform resizing later. 
 + 
 + resize=0 
 + # Iterate through files. We need to sort files in a numerical order because pdftoppm generates files without a leading zero.  
 + # However before the numbers, a '-' delimiter character had been added by pdftoppm. Now cut the part before that. 
 + for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do 
 + dimension="$(identify -format "%[fx:w]x%[fx:h]" $ptmpf'-'$filenum)" 
 + #  width: "${dimension%%x*}" -> From the end removes the longest part of dimension that matches x* and returns the rest. 
 + #  height: "${dimension##*x}" -> From the beginning removes the longest part of dimension that matches *x and returns the rest. 
 + if [ ${dimension%%x*} -gt $lhorpix ]; then 
 + lhorpix=${dimension%%x*} 
 + resize=1 
 + fi 
 + if [ ${dimension##*x} -gt $lverpix ]; then 
 + lverpix=${dimension##*x} 
 + resize=1 
 + fi 
 + done 
 + # Resize document if necessary 
 + if [ $resize -eq 1 ]; then 
 + # image fits within horizontal boundary, so adaptation must be vertical 
 + if [ $horpix -eq $lhorpix ]; then 
 + vscale=$(( ($verpix-30)*100/$lverpix )) 
 + else 
 + vscale=100 
 + fi 
 + # image fits within vertical boundary, so adaptation must be horizontal 
 + if [ $verpix -eq $lverpix ]; then 
 + hscale=$(( ($horpix-30)*100/$lhorpix )) 
 + else 
 + hscale=100 
 + fi 
 + # use the smaller of the two factors so the page fits both horizontally and vertically 
 + if [ $hscale -lt $vscale ]; then scale=$hscale; else scale=$vscale; fi 
 + 
 + echo Some images exceed the maximum size. Now scaling to "$scale"%. 
 + # iterate through all files and resize all with the same percentage, use 8 bits per pixel 
 + for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do 
 + mogrify -depth 8 -resize "$scale"% "$ptmpf-$filenum" 
 + echo resizing file: "$ptmpf-$filenum" 
 + done 
 + fi 
 + # apply unpaper onto each page 
 + for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do 
 + # Parameter expansion: ${param%word} From the end removes the smallest part of param that matches word and returns the rest. (p.72) 
 + unpaper -b $b_threshold $filter $donot -t pbm "$ptmpf-$filenum" "$ptmpf-"${filenum%.pgm}.pbm 
 + rm "$ptmpf-$filenum" 
 + done 
 + 
 + # center pbm page on an a4 canvas, convert to pdf 
 + pdflst="" 
 + for filenum in $(ls $ptmpf-*.pbm | cut -d'-' -f2 | sort -n); do 
 + convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center "$ptmpf-$filenum" - miff:- | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-"${filenum%.pbm}.pdf 
 + rm "$ptmpf-$filenum" 
 + pdflst="$pdflst $ptmpf-"${filenum%.pbm}.pdf 
 + done 
 + 
 + # merge all pages into one pdf file 
 + gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst 
 + rm $pdflst 
 + fi 
 +fi 
 +exit 0
 </code> </code>
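The resize step in gray2black picks one percentage for all pages. As far as I can tell, the branch logic in the script leaves the scale at 100 when the largest image exceeds the page both horizontally and vertically. The following is a minimal sketch of an alternative, not the script's own method: it always computes both factors and takes the smaller one. The function name `fit_scale` and its argument order are hypothetical; the 30 pixel margin is taken from the script.

```shell
#!/bin/sh
# fit_scale PAGE_W PAGE_H IMG_W IMG_H [MARGIN]
# Hypothetical helper: prints the integer percentage (at most 100) by which
# an IMG_W x IMG_H image must be scaled to fit a PAGE_W x PAGE_H canvas
# while leaving MARGIN pixels free (default 30, as in gray2black).
fit_scale() {
    page_w=$1
    page_h=$2
    img_w=$3
    img_h=$4
    margin=${5:-30}

    # Integer percentages, rounded down, for each direction.
    hscale=$(( (page_w - margin) * 100 / img_w ))
    vscale=$(( (page_h - margin) * 100 / img_h ))

    # The smaller factor guarantees a fit in both directions.
    if [ "$hscale" -lt "$vscale" ]; then
        scale=$hscale
    else
        scale=$vscale
    fi

    # Never enlarge an image that already fits.
    if [ "$scale" -gt 100 ]; then
        scale=100
    fi
    echo "$scale"
}
```

With the a4 canvas of 2480x3508 pixels, `fit_scale 2480 3508 3000 4000` prints 81, which could then be passed to `mogrify -resize "$scale"%` as in the script.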
  
-Find the value for black threshold with following script:+==== findblack ==== 
 +Find the value for the black threshold with the following script:
 <code bash> <code bash>
 #!/bin/sh #!/bin/sh
-# filename: findblack.sh+# filename: findblack
 # This script helps to find an unpaper's black threshold value (-b option) by # This script helps to find an unpaper's black threshold value (-b option) by
 # creating a series of pdf files each with a different black threshold setting. By observing # creating a series of pdf files each with a different black threshold setting. By observing
Line 116: Line 253:
 # annotated on top of each page. # annotated on top of each page.
 # #
-# usage: findblack.sh [optional: pagenumber] <inputfile.pdf> <outputfile.pdf>+# usage: findblack input-file [option] output-file
 # #
-# pagenumber: which page should be used as source for unpaper. If not specified, first page will be taken. +# option: -p [number] which page should be used as source for unpaper. If not specified, first page will be taken. 
-# <inputfile.pdf> a pdf file containing one or more gray images. +# input-file: a pdf file containing one or more gray images. 
-# <outputfile.pdf> a multipage pdf file with a series of black-threshold settings will be created.+# output-file: a multipage pdf file with a series of black-threshold settings will be created.
 # #
 +# Example:
 +# Process page 4 from mozart.pdf: ./findblack mozart.pdf -p 4 test.pdf
  
 # This script uses imagemagick to convert and center an image on an a4 page. (With 2480x3508 pixels at a resolution of 300 dpi) # This script uses imagemagick to convert and center an image on an a4 page. (With 2480x3508 pixels at a resolution of 300 dpi)
Line 128: Line 267:
 # #
 # NB. This script is time consuming. So have a lot of patience. # NB. This script is time consuming. So have a lot of patience.
-# The line: "convert -size $pdim xc:white pbm:- | ...... | ps2pdf13 -sPAPERSIZE=$psize - output.pdf" will take about +# Each page will take about 15 seconds to execute on a single core AMD64 3500+ cpu.
-# 10 seconds for a single page to execute on a single core AMD64 3500+ cpu. +
-# Written on march 2009, m.nijdam+
  
 +# User text 
 +_fbhelp="Usage: findblack input-file [OPTION...] output-file\n\n  -p foo, processes page number foo\n          If this option is omitted page 1 will be processed by default\n\nExample:\nProcess page 5:\n  findblack mozart.pdf -p 5 result.pdf\nProcess the first page:\n  findblack mozart.pdf result.pdf"
 +_fbspdf="please supply a pdf file"
 +_fbcvrt="Can't find converted first page of pdf file. Check your pdf file"
  
-# start value for black-threshold is 0.05+# Unpaper settings, start value for black-threshold is 0.05, end value for black-threshold is 0.20, step = 0.05
 a=5 a=5
-# end value for black-threshold is 0.40 +b=20 
-b=40 +s=5
 donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite" donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
  
 +# Other settings; $$ references the current PID
 ptmpf="tempfile_findblack_$$" ptmpf="tempfile_findblack_$$"
 psize="a4" psize="a4"
Line 146: Line 287:
 ptxt="black threshold:" ptxt="black threshold:"
 pstr="-font Times-Roman -pointsize 36 -gravity NorthWest -annotate +5+5" pstr="-font Times-Roman -pointsize 36 -gravity NorthWest -annotate +5+5"
 +# Default page number
 +defpage=1
  
 # Find number of parameters passed # Find number of parameters passed
 if [ "$#" -le 1 ]; then if [ "$#" -le 1 ]; then
- echo "Need input filename and output filename for pdf file"+ echo "$_fbhelp"
  exit 1  exit 1
 fi fi
  
-ptst=`identify $1 | grep PDF` +# Store input-file name temporarily, because it's deleted by shift command. 
-if [ ! "$ptst" ]; then +arg1=$1 
- echo "please supply a pdf file"+ 
 +# Check OPTIONS 
 +if [ "$#" -eq 3 ]; then 
 + case "$2" in 
 + * ) 
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 + elif [ "$#" -eq 4 ]; then 
 + case "$2" in 
 + -p )  
 + shift 
 + defpage=$2 
 + shift 
 + ;; 
 + * )  
 + echo "$_fbhelp" 
 + exit 1 
 + ;; 
 + esac 
 +fi 
 + 
 +# only process further if there are two arguments left. 
 +if [ "$#" -eq 2 ]; then 
 + if [ ! "$(identify "$arg1" | grep PDF)" ]; then 
 + echo "$_fbspdf" 
 + exit 1
  else  else
 + # find last page, start counting from 0, necessary for proper deleting pdftoppm result file which may have extra 0
 + trail=""
 + lastPage="$(identify "$arg1" | tail -n 1 | sed 's/\(.*\)\[\(.*\)\].*/\2/')"
 + # lastPage counts from 0, so 9 means 10 pages; test the largest range first
 + if [ "$lastPage" -ge 9999 ]; then
 + trail="0000"
 + elif [ "$lastPage" -ge 999 ]; then
 + trail="000"
 + elif [ "$lastPage" -ge 99 ]; then
 + trail="00"
 + elif [ "$lastPage" -ge 9 ]; then
 + trail="0"
 + fi
  # convert first page to a pgm file with filename tmpfile_PID-1.pgm  # convert first page to a pgm file with filename tmpfile_PID-1.pgm
- pdftoppm -gray -r 300 -f 1 -l 1 $1 /tmp/$ptmpf+ pdftoppm -gray -r 300 -f $defpage -l $defpage "$arg1" $ptmpf
  # check for successful conversion to .pgm  # check for successful conversion to .pgm
- if [ ! -f "/tmp/$ptmpf-1.pgm" ]; then + if [ ! -f "$ptmpf-$trail$defpage.pgm" ]; then 
- echo "Can't find converted first page of pdf file. Check your pdf file"+ echo "$_fbcvrt"
- rm "/tmp/$ptmpf-1.pgm"+ rm "$ptmpf-$trail$defpage.pgm"
  exit 1  exit 1
  else  else
  pdflst=""  pdflst=""
 + 
  nmb=$a  nmb=$a
  while [ "$nmb" -le $b ]  while [ "$nmb" -le $b ]
Line 176: Line 358:
  str1="0.$nmb"  str1="0.$nmb"
  fi  fi
 + 
  # process unpaper  # process unpaper
- unpaper -b $str1 $donot -t pbm /tmp/$ptmpf-1.pgm /tmp/$ptmpf.pbm + unpaper -b $str1 $donot -t pbm $ptmpf-$trail$defpage.pgm $ptmpf.pbm 
 + 
  # Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf  # Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf
  # New to bash?  # New to bash?
Line 185: Line 367:
  # Sending result to next command with pipe: |  # Sending result to next command with pipe: |
  # Using this 'technique' we don't need to create temporary files.  # Using this 'technique' we don't need to create temporary files.
- convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center /tmp/$ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "/tmp/$ptmpf-$nmb.pdf" + convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center $ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf" 
-  +  
- pdflst="$pdflst /tmp/$ptmpf-$nmb.pdf" + pdflst="$pdflst $ptmpf-$nmb.pdf" 
- nmb=$(($nmb+1)) + nmb=$(($nmb+$s)) 
- rm /tmp/$ptmpf.pbm+ rm $ptmpf.pbm
  done  done
- rm /tmp/$ptmpf-1.pgm+ rm $ptmpf-$trail$defpage.pgm
  # Merge all pdf files in $pdflst into one single file with filename $2  # Merge all pdf files in $pdflst into one single file with filename $2
  gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=$2 $pdflst  gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=$2 $pdflst
Line 197: Line 379:
  fi  fi
  fi  fi
 +fi
 exit 0 exit 0
 </code> </code>
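findblack has to guess the name of the file that pdftoppm writes for the selected page. Prefixing a fixed `$trail` only matches when the page number has a single digit. The sketch below pads the page number to a given width instead. It assumes that pdftoppm pads page numbers with zeros to the number of digits in the document's page count, which holds for current poppler versions but may differ in older ones. The function name `padded_page` is hypothetical.

```shell
#!/bin/sh
# padded_page PAGE TOTAL
# Hypothetical helper: prints PAGE padded with leading zeros to the number
# of digits in TOTAL (the page count of the pdf file).
# Assumption: pdftoppm names its output files with this padding.
# PAGE must be given without leading zeros, otherwise printf may read it
# as an octal number.
padded_page() {
    page=$1
    total=$2
    width=${#total}              # number of digits in the page count
    printf "%0${width}d" "$page"
}
```

For a 12 page document, `padded_page 4 12` prints 04, so the expected file would be `"$ptmpf-$(padded_page "$defpage" "$total").pgm"`. The page count itself could be read with `pdfinfo "$arg1" | awk '/^Pages:/ {print $2}'`, since pdfinfo ships together with pdftoppm.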
 +
 +===== Bugs =====
 +[[software:unpaper_test:bugs|See the following page]]
software/unpaper_test.1236745005.txt.gz · Last modified: 2009/03/11 05:16 by admin