Unpaper test
To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.
I scanned a lot of sheet music and deskewed the scans manually with another application. All processing was done in 8-bit gray at a resolution of 300 dpi.
These tests should show which unpaper settings give a decent automatic conversion to black and white, including despeckling and noise removal.
Prerequisites
- Ubuntu 8.04
- ImageMagick
- unpaper 0.3 (installed by downloading the archive, extracting it manually and running make)
- Ghostscript
- pdftk
- test image: 8-bit gray at a resolution of 300 dpi (.pdf), converted to .pgm format. Scanner: Epson GT-10000+. Contrast was set manually to a value at which music showing through from the other side of the page is minimal and the image background is white instead of gray.
Convert a single multi-page pdf file to pgm files with pdftoppm (the result should be 300 dpi and gray):
pdftoppm -gray -r 300 inputfile.pdf outputfile
Check filetype with identify:
identify outputfile
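As an alternative check that does not need ImageMagick, the Netpbm magic number at the start of the file can be inspected: P2 marks a plain (ASCII) graymap and P5 a raw (binary) graymap. The sketch below is only an illustration; the function name is_pgm and the file name in the usage comment are made up for this example, and the actual file names written by pdftoppm depend on its version.

```shell
#!/bin/sh
# is_pgm FILE: succeed if FILE starts with a Netpbm graymap magic number.
# P2 = plain (ASCII) PGM, P5 = raw (binary) PGM.
# (is_pgm is a hypothetical helper, not part of unpaper or ImageMagick.)
is_pgm() {
    magic=$(head -c 2 "$1")
    case "$magic" in
        P2|P5) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage with a hypothetical file name:
# is_pgm outputfile-1.pgm && echo "gray image, ready for unpaper"
```

A bitmap (P4) or color pixmap (P6) fails this check, which is a quick hint that the -gray option was forgotten.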
The following script tests unpaper with different settings and produces a pdf with 10 pages, each page with a different setting. The relevant unpaper options are annotated on each page:
#!/bin/sh
donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
pdflst=""
ptmpf="tempfile_unpaptst_$$"
nmb=0
while [ "$nmb" -le 9 ]
do
    str1="0.$nmb"
    unpaper -b $str1 $donot -t pbm $ptmpf-1.pgm $ptmpf.pbm
    # convert inputfile option annotation outputfile
    # (braces are needed: $ptmpf_ca would be read as a variable named ptmpf_ca)
    convert $ptmpf.pbm -gravity center -background lightblue -font Open-Sans -pointsize 60 \
        caption:"$str1 \n $donot" -composite ${ptmpf}_ca.pbm
    convert -monochrome -density 300 -units PixelsPerInch ${ptmpf}_ca.pbm ps:- |
        ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
    rm $ptmpf.pbm ${ptmpf}_ca.pbm
    pdflst="$pdflst $ptmpf-$nmb.pdf"
    nmb=$(($nmb+1))
done
# merge all pdf documents with ghostscript. Although pdftk will give a shorter command, the identify command
# gives an error when it analyses the pdf file.
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=result.pdf $pdflst
rm $pdflst
exit 0
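The script builds the name of its captioned temporary file from the variable ptmpf plus the suffix _ca.pbm. In sh, such a name needs braces around the variable name, because an underscore and letters directly after it are read as part of the name. A minimal sketch of the difference:

```shell
#!/bin/sh
ptmpf="tempfile_unpaptst_$$"
unset ptmpf_ca    # make sure the demo starts from a clean state

# Without braces the shell looks up a variable named "ptmpf_ca",
# which is unset, so only ".pbm" remains.
wrong="$ptmpf_ca.pbm"

# With braces the variable name ends at the closing brace.
right="${ptmpf}_ca.pbm"

echo "without braces: $wrong"
echo "with braces:    $right"
```

A dash after the variable, as in $ptmpf-1.pgm, is safe without braces because a dash cannot be part of a variable name.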
Results
unpaper --noisefilter-intensity (1-25) (.pdf) |
---|
option: -ni [1 … 25] Above a value of 10, tiny clusters of about 3×3 pixels (9 pixels, and thus fewer than 10 pixels?) are removed. |
unpaper --white-threshold (.pdf) |
option: -w [0.0 … 0.9] At -w 0.6 a few light gray pixels suddenly become invisible. Higher and lower values show no difference. It is strange that this value has no continuous effect. |
unpaper --black-threshold (coarse) (.pdf) |
option: -b [0.0 … 0.9] Option -b 0.0 turns even the lightest gray spots into black pixels, which gives the page large black areas and makes it hard to read. Option -b 0.1 turns most of the white/gray background into white and turns remarks written in light gray pencil into black. Single staff lines are about 6 pixels high. Option -b 0.2 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single staff lines are about 4 pixels high. Option -b 0.3 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single staff lines are about 2 pixels high. Option -b 0.4 makes staff lines and other lines less than 3 pixels wide disappear almost completely. Strong black printed text is still visible. Normal printed text starts to fade. Option -b 0.5 and higher removes everything that was previously visible, leaving a white page. (Annotated text remains, because it is added after unpaper has run.) |
unpaper --black-threshold 0.10 ... 0.19 (.pdf) |
option: -b [0.10 … 0.19] Option -b 0.12 is a good compromise between keeping light gray pencil remarks and line thickness. Single staff lines are about 4 pixels high. |
unpaper --black-threshold 0.20 … 0.29 (.pdf) |
option: -b [0.20 … 0.29] At -b 0.20, printed black symbols (like the alla breve symbol) are still as visible as in the original. Single staff lines are about 3 pixels high. Stems are 2 pixels wide. Higher settings make all lines gradually thinner. Staff lines get spots where they are much thinner. At 0.29 and higher some stems disintegrate. |
unpaper --black-threshold 0.30 … 0.39 (.pdf) |
option: -b [0.30 … 0.39] At 0.30, symbols like crosses with thin vertical lines disintegrate. Staff lines are still continuous and between 1 and 2 pixels high. Above 0.34 staff lines largely disintegrate. |
100 pixels size: unpaper --blurfilter-intensity 0.00 ... 0.25 (.pdf); 8 pixels size: unpaper --blurfilter-intensity 0.00 ... 0.25 (.pdf); 4 pixels size: unpaper --blurfilter-intensity 0.00 ... 0.25 (.pdf) |
option: -b 0.12 -ls 100 -lp 50 -li [0.00 … 0.25]: At values larger than 0.07, large square parts (100×100 pixels) are removed. option: -b 0.12 -ls 8 -lp 4 -li [0.00 … 0.25]: When -li is at least 0.20, tiny isolated dots (dirt) of about 3×3 pixels are removed. |
Steps to find unpaper settings
Scan your raw material at 300 dpi in 8-bit gray.
1. black-threshold
With the following command, unpaper converts a gray image into a black and white image:
unpaper -b $black_threshold --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite -t pbm inputfile.pgm outputfile.pbm
The variable $black_threshold is a ratio between 0 and 1. In the tests above, a low ratio turns even light gray pixels black, so a low ratio yields much darker images and a higher ratio yields lighter ones.

Under normal conditions, when the original scan has good visible contrast, $black_threshold should be within the range [0.1 … 0.4]. When the raw material is quite dark, $black_threshold may be 0.1 higher than usual. If the sheet music contains pencil remarks that should be kept in the output, a value of 0.12 may be useful. To show fewer pencil remarks, a value of 0.35 may be used.
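To narrow down the value passed to -b, it can help to print a short series of candidates around a starting point, in the same way the tests above step through 0.10 … 0.19. The sketch below only generates the values; the function name threshold_series is made up for this example, and its arguments are given in hundredths to stay within integer shell arithmetic.

```shell
#!/bin/sh
# threshold_series FIRST LAST STEP
# Print black-threshold candidates as 0.NN, with FIRST, LAST and STEP
# given in hundredths (10 19 1 -> 0.10 0.11 ... 0.19).
# (threshold_series is a hypothetical helper, not part of unpaper.)
threshold_series() {
    n=$1
    while [ "$n" -le "$2" ]; do
        printf '0.%02d\n' "$n"
        n=$((n + $3))
    done
}

# Candidates around 0.12, the value that kept pencil remarks in the tests:
threshold_series 10 14 1
```

Each printed value could then be passed to unpaper with -b in a loop, one output file per value.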
2. removing black borders
Black borders at edges of your scan can be removed automatically with this option. See unpaper user documentation for details.
3. blurfilter
See pdf examples above.
4. grayfilter
See pdf examples above.
Scripting with unpaper
gray2black
Pdf files containing gray images can be converted automatically to black and white with unpaper, including its noise reduction and clean-up features, using the following script:
#!/bin/sh
# filename: gray2black
# This script processes a pdf file containing gray images with
# unpaper, imagemagick and ghostscript to a black and white pdf file.
#
# Usage: gray2black input-file [option] output-file
#
# option: -b [value]  specify black threshold value as being used in
#                     unpaper with -b. If omitted, default value will
#                     be used: -b 0.12
# input-file:  a pdf file containing sheet music with one or more gray
#              images.
# output-file: the resulting multipage black and white pdf file.
#
# Example:
# Process gray images in mozart.pdf: gray2black mozart.pdf -b 0.23 test.pdf
#
# This script uses imagemagick to convert and center an image on an
# a4 page (2480x3508 pixels at a resolution of 300 dpi) and
# ghostscript to create a small pdf file. By using pipes (|) together
# with reading from standard input and writing to standard output (-),
# fewer temporary files are needed.
#
# NB. This script is time consuming. So have a lot of patience.
# Each page will take about 15 seconds to execute on a single core
# AMD64 3500+ cpu.
#
# User text
_fbhelp="Usage: gray2black input-file [OPTION...] output-file\n\n -b [value], specify black threshold value as being used in\n unpaper with -b. If omitted, default value will\n be used: -b 0.12.\n\nExample:\nProcess pdf file with unpaper black threshold 0.23:\n gray2black mozart.pdf -b 0.23 result.pdf\nProcess pdf file with -b 0.12 settings:\n gray2black mozart.pdf result.pdf"
_fbspdf="please supply a pdf file"
_fbcvrt="Check your pdf file. Does it contain gray images?"

# Unpaper settings, default value for black-threshold is 0.12
b_threshold="0.12"
donot="--no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
filter="-ni 10 -ls 8 -lp 4 -li 0.20"

# Other settings
ptmpf="tempfile_gray2black_$$"

# Page dimensions
psize="a4"
horpix=2480
verpix=3508
pdim="$horpix"x"$verpix"
pres="300"

# Only resize if absolutely necessary. Values for resizing are for left, right, top and bottom border.
hborder=33
vborder=33
lhorpix=$horpix
lverpix=$verpix

# Store input-file name temporarily, because it's deleted by shift command.
arg1=$1

# Find number of parameters passed
if [ "$#" -le 1 ]; then
    echo "$_fbhelp"
    exit 1
fi

# Check OPTIONS
if [ "$#" -eq 3 ]; then
    case "$2" in
        * )
            echo "$_fbhelp"
            exit 1
            ;;
    esac
elif [ "$#" -eq 4 ]; then
    case "$2" in
        -b )
            shift
            b_threshold=$2
            shift
            ;;
        * )
            echo "$_fbhelp"
            exit 1
            ;;
    esac
fi

# only process further if there are two arguments left.
if [ "$#" -eq 2 ]; then
    if [ ! "$(identify "$arg1" | grep PDF)" ]; then
        echo "$_fbspdf"
        exit 1
    else
        # convert all pages to pgm files with filename $ptmpf-*.pgm
        pdftoppm -gray -r 300 "$arg1" $ptmpf

        # Check whether the document needs resizing. Iterating through all pages
        # makes sure all pages get uniform resizing later.
        resize=0
        # Iterate through files. We need to sort files in numerical order because pdftoppm generates
        # files without a leading zero. pdftoppm puts a '-' delimiter before the number, so cut the part before it.
        for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
            dimension="$(identify -format "%[fx:w]x%[fx:h]" $ptmpf'-'$filenum)"
            # width:  "${dimension%%x*}" -> removes from the end the longest part of dimension that matches x*
            # height: "${dimension##*x}" -> removes from the beginning the longest part of dimension that matches *x
            if [ ${dimension%%x*} -gt $lhorpix ]; then
                lhorpix=${dimension%%x*}
                resize=1
            fi
            if [ ${dimension##*x} -gt $lverpix ]; then
                lverpix=${dimension##*x}
                resize=1
            fi
        done

        # Resize document if necessary
        if [ $resize -eq 1 ]; then
            # largest page is too high, so adaptation must be vertical
            if [ $lverpix -gt $verpix ]; then
                vscale=$(( ($verpix-30)*100/$lverpix ))
            else
                vscale=100
            fi
            # largest page is too wide, so adaptation must be horizontal
            if [ $lhorpix -gt $horpix ]; then
                hscale=$(( ($horpix-30)*100/$lhorpix ))
            else
                hscale=100
            fi
            # if scaling is necessary both horizontally and vertically, take the smaller factor
            if [ $hscale -lt $vscale ]; then scale=$hscale; else scale=$vscale; fi
            echo "Some images exceed the maximum size. Now scaling to $scale%."
            # iterate through all files and resize all with the same percentage, use 8 bits per pixel
            for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
                mogrify -depth 8 -resize "$scale"% "$ptmpf-$filenum"
                echo "resizing file: $ptmpf-$filenum"
            done
        fi

        # apply unpaper onto each page
        for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
            # Parameter expansion: ${param%word} removes from the end the smallest part of param
            # that matches word and returns the rest. (p.72)
            unpaper -b $b_threshold $filter $donot -t pbm "$ptmpf-$filenum" "$ptmpf-"${filenum%.pgm}.pbm
            rm "$ptmpf-$filenum"
        done

        # center pbm page on an a4 canvas, convert to pdf
        pdflst=""
        for filenum in $(ls $ptmpf-*.pbm | cut -d'-' -f2 | sort -n); do
            convert -size $pdim xc:white miff:- |
                composite -density $pres -units PixelsPerInch -compose atop -gravity Center "$ptmpf-$filenum" - miff:- |
                convert -monochrome -density 300 -units PixelsPerInch - ps:- |
                ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-"${filenum%.pbm}.pdf
            rm "$ptmpf-$filenum"
            pdflst="$pdflst $ptmpf-"${filenum%.pbm}.pdf
        done

        # merge all pages into one file
        gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
        rm $pdflst
    fi
fi
exit 0
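The resize step relies on two things: splitting a WIDTHxHEIGHT string with parameter expansion, and integer arithmetic for the scale percentage. The following standalone sketch shows both without calling ImageMagick. The function name scale_percent is made up for this example; the page size (2480x3508) and the 30 pixel margin are taken from the script. It tests width and height separately, so an image that is both too wide and too tall is scaled by the smaller factor.

```shell
#!/bin/sh
# scale_percent WIDTHxHEIGHT
# Print the integer percentage needed to fit an image of the given pixel
# size onto an a4 page at 300 dpi (2480x3508), leaving a 30 pixel margin.
# Prints 100 when the image already fits.
# (scale_percent is a hypothetical helper written for this illustration.)
scale_percent() {
    horpix=2480
    verpix=3508
    w=${1%%x*}    # drop the longest suffix matching x*  -> width
    h=${1##*x}    # drop the longest prefix matching *x  -> height
    hscale=100
    vscale=100
    [ "$w" -gt "$horpix" ] && hscale=$(( (horpix - 30) * 100 / w ))
    [ "$h" -gt "$verpix" ] && vscale=$(( (verpix - 30) * 100 / h ))
    if [ "$hscale" -lt "$vscale" ]; then echo "$hscale"; else echo "$vscale"; fi
}

scale_percent 2600x3508    # too wide: prints 94
scale_percent 2480x3508    # fits: prints 100
```

Because shell arithmetic truncates, the result is rounded down, which keeps the scaled image inside the margin.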
findblack
Find a value for the black threshold with the following script:
#!/bin/sh
# filename: findblack
# This script helps to find unpaper's black threshold value (-b option) by
# creating a series of pdf pages, each with a different black threshold setting. By observing
# the pages afterwards, selecting a proper black threshold value should be easier. Values are
# annotated on top of each page.
#
# usage: findblack input-file [option] output-file
#
# option: -p [number]  which page should be used as source for unpaper. If not specified, first page will be taken.
# input-file:  a pdf file containing one or more gray images.
# output-file: a multipage pdf file with a series of black-threshold settings will be created.
#
# Example:
# Process page 4 from mozart.pdf: ./findblack mozart.pdf -p 4 test.pdf
#
# This script uses imagemagick to convert and center an image on an a4 page (2480x3508 pixels at a resolution of 300 dpi)
# and ghostscript to create a small pdf file. By using pipes (|) together with reading from standard input
# and writing to standard output (-), fewer temporary files are needed.
#
# NB. This script is time consuming. So have a lot of patience.
# Each page will take about 15 seconds to execute on a single core AMD64 3500+ cpu.

# User text
_fbhelp="Usage: findblack input-file [OPTION...] output-file\n\n -p foo, processes page number foo\n If this option is omitted page 1 will be processed by default\n\nExample:\nProcess page 5:\n findblack mozart.pdf -p 5 result.pdf\nProcess the first page:\n findblack mozart.pdf result.pdf"
_fbspdf="please supply a pdf file"
_fbcvrt="Can't find converted first page of pdf file. Check your pdf file"

# Unpaper settings in hundredths: with the values below the start value for
# black-threshold is 0.05 (a), the end value is 0.20 (b) and the step is 0.05 (s)
a=5
b=20
s=5
donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"

# Other settings; $$ references the current PID
ptmpf="tempfile_findblack_$$"
psize="a4"
pdim="2480x3508"
pres="300"
ptxt="black threshold:"
pstr="-font Times-Roman -pointsize 36 -gravity NorthWest -annotate +5+5"

# Default page number
defpage=1

# Find number of parameters passed
if [ "$#" -le 1 ]; then
    echo "$_fbhelp"
    exit 1
fi

# Store input-file name temporarily, because it's deleted by shift command.
arg1=$1

# Check OPTIONS
if [ "$#" -eq 3 ]; then
    case "$2" in
        * )
            echo "$_fbhelp"
            exit 1
            ;;
    esac
elif [ "$#" -eq 4 ]; then
    case "$2" in
        -p )
            shift
            defpage=$2
            shift
            ;;
        * )
            echo "$_fbhelp"
            exit 1
            ;;
    esac
fi

# only process further if there are two arguments left.
if [ "$#" -eq 2 ]; then
    if [ ! "$(identify "$arg1" | grep PDF)" ]; then
        echo "$_fbspdf"
        exit 1
    else
        # find last page, start counting from 0, necessary for proper deleting
        # pdftoppm result file which may have extra 0
        trail=""
        lastPage="$(identify "$arg1" | tail -n 1 | sed 's/\(.*\)\[\(.*\)\].*/\2/')"
        # test the largest range first, otherwise the later branches are never reached
        if [ "$lastPage" -ge 10000 ]; then
            trail="0000"
        elif [ "$lastPage" -ge 1000 ]; then
            trail="000"
        elif [ "$lastPage" -ge 100 ]; then
            trail="00"
        elif [ "$lastPage" -ge 10 ]; then
            trail="0"
        fi

        # convert the selected page to a pgm file with filename $ptmpf-$trail$defpage.pgm
        pdftoppm -gray -r 300 -f $defpage -l $defpage "$arg1" $ptmpf

        # check for successful conversion to .pgm
        if [ ! -f "$ptmpf-$trail$defpage.pgm" ]; then
            echo "$_fbcvrt"
            # clean up any result file that got a different name
            rm -f "$ptmpf"-*.pgm
            exit 1
        else
            pdflst=""
            nmb=$a
            while [ "$nmb" -le $b ]
            do
                # Cope with digits
                if [ "$nmb" -le 9 ]; then
                    str1="0.0$nmb"
                else
                    str1="0.$nmb"
                fi
                # process unpaper
                unpaper -b $str1 $donot -t pbm $ptmpf-$trail$defpage.pgm $ptmpf.pbm
                # Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf
                # New to shell scripting?
                # Read from standard input and write to standard output with a dash: -
                # Send the result to the next command with a pipe: |
                # Using this 'technique' we don't need to create temporary files.
                convert -size $pdim xc:white miff:- |
                    composite -density $pres -units PixelsPerInch -compose atop -gravity Center $ptmpf.pbm - miff:- |
                    convert - $pstr "$ptxt $str1" - |
                    convert -monochrome -density 300 -units PixelsPerInch - ps:- |
                    ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
                pdflst="$pdflst $ptmpf-$nmb.pdf"
                nmb=$(($nmb+$s))
                rm $ptmpf.pbm
            done
            rm $ptmpf-$trail$defpage.pgm
            # Merge all pdf files in $pdflst into one single file with filename $2
            gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
            rm $pdflst
        fi
    fi
fi
exit 0
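The script prepends zeros to the page number because the pdftoppm result file may carry extra zeros in its name. A more compact way to build such a name is to pad the page number with printf. The sketch below assumes that pdftoppm pads page numbers to the number of digits of the page count; this is an assumption that should be checked against the installed version. The function name padded_page is made up for this example.

```shell
#!/bin/sh
# padded_page PAGE PAGECOUNT
# Print PAGE left-padded with zeros to the number of digits in PAGECOUNT.
# ASSUMPTION: pdftoppm names its output files this way; verify with your version.
# (padded_page is a hypothetical helper written for this illustration.)
padded_page() {
    width=${#2}                    # number of digits in the page count
    printf "%0${width}d\n" "$1"
}

padded_page 4 12     # prints 04
padded_page 4 120    # prints 004
```

Unlike a fixed prefix of zeros, this also works when the selected page number itself has more than one digit.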