This shows you the differences between two versions of the page.
software:unpaper_test [2009/03/09 01:33] – admin | software:unpaper_test [2015/04/22 21:51] (current) – [findblack] admin | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Unpaper test ===== | + | ====== Unpaper test ====== |
To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.\\ I scanned a lot of sheet music and deskewed the scans manually with another application. All processing was done in 8-bit gray at a resolution of 300 dpi.\\ These tests should show which unpaper settings give a decent automatic conversion to black and white, with despeckling and noise removal. | To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.\\ I scanned a lot of sheet music and deskewed the scans manually with another application. All processing was done in 8-bit gray at a resolution of 300 dpi.\\ These tests should show which unpaper settings give a decent automatic conversion to black and white, with despeckling and noise removal. | ||
Line 6: | Line 6: | ||
* imagemagick | * imagemagick | ||
* unpaper 0.3 (installed by downloading the archive, extracting it manually and running make) | * unpaper 0.3 (installed by downloading the archive, extracting it manually and running make) | ||
- | * pdftk | + | * ghostscript |
- | * testimage 8-bit gray at a resolution of 300 dpi, .pgm format. (Scanner: Epson GT-10000+) contrast had been set manually at a value where music from the other page side ('see through' | + | * <del>pdftk</ |
+ | * [[http:// | ||
Automatic conversion of a single pdf file consisting of several pages to pgm files with pdftoppm (result should be 300 dpi and gray): | Automatic conversion of a single pdf file consisting of several pages to pgm files with pdftoppm (result should be 300 dpi and gray): | ||
Line 22: | Line 23: | ||
<code bash> | <code bash> | ||
#!/bin/sh | #!/bin/sh | ||
- | donot=" | + | |
+ | donot=" | ||
+ | |||
+ | pdflst="" | ||
+ | ptmpf=" | ||
nmb=0 | nmb=0 | ||
while [ " | while [ " | ||
do | do | ||
- | str1=" | + | str1=" |
- | unpaper $str1 | + | |
- | # convert inputfile option annotation outputfile | + | unpaper -b $str1 $donot -t pbm $ptmpf-1.pgm $ptmpf.pbm |
- | convert | + | |
- | convert -density 300 -units PixelsPerInch | + | # convert inputfile option annotation outputfile |
- | rm outc.pbm | + | convert |
- | rm outca.pbm | + | convert |
+ | rm $ptmpf.pbm $ptmpf_ca.pbm | ||
+ | pdflst=" | ||
nmb=$(($nmb+1)) | nmb=$(($nmb+1)) | ||
done | done | ||
- | pdftk out-*.pdf output result.pdf | + | # merge all pdf documents with ghostscript. Although pdftk will give a shorter command, the identify command |
- | rm out-*.pdf | + | # gives an error when it analyses the pdf file. |
+ | gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=result.pdf $pdflst | ||
+ | rm $pdflst | ||
exit 0 | exit 0 | ||
Line 42: | Line 52: | ||
===== Results ===== | ===== Results ===== | ||
- | ^ unpaper --noisefilter-intensity (.pdf) ^ | + | ^ [[http:// |
- | | option: -ni [0 ... 9]\\ Values ranging from 0 to 9 doesn' | + | | option: -ni [1 ... 25]\\ Above a value of 10 tiny clusters consisting of about 3x3 (=9 pixels? and thus less than 10 pixels?) are removed |
^ unpaper --white-treshold (.pdf) ^ | ^ unpaper --white-treshold (.pdf) ^ | ||
| option: -w [0.0 ... 0.9]\\ At -w 0.6 a few light gray pixels suddenly become invisible. Higher and lower values show no difference. It is strange that the effect does not change gradually with the value. | | option: -w [0.0 ... 0.9]\\ At -w 0.6 a few light gray pixels suddenly become invisible. Higher and lower values show no difference. It is strange that the effect does not change gradually with the value. | ||
- | ^ [[http:// | + | ^ [[http:// |
| option: -b [0.0 ... 0.9]\\ Option -b 0.0 turns even the slightest light gray spots into black pixels, leaving large black areas that make the page hard to read.\\ Option -b 0.1 turns most of the white/gray background white and turns remarks written in light gray pencil black. Single system lines are about 6 pixels high.\\ Option -b 0.2 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single system lines are about 4 pixels high.\\ Option -b 0.3 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single system lines are about 2 pixels high.\\ Option -b 0.4 makes system lines and other lines less than 3 pixels wide disappear almost completely. Strong black printed text is still visible. Normal printed text starts to fade.\\ Option -b 0.5 and higher removes everything that was previously visible, leaving a white page. (Annotated text remains, because it is added after unpaper has run.) | | option: -b [0.0 ... 0.9]\\ Option -b 0.0 turns even the slightest light gray spots into black pixels, leaving large black areas that make the page hard to read.\\ Option -b 0.1 turns most of the white/gray background white and turns remarks written in light gray pencil black. Single system lines are about 6 pixels high.\\ Option -b 0.2 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single system lines are about 4 pixels high.\\ Option -b 0.3 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single system lines are about 2 pixels high.\\ Option -b 0.4 makes system lines and other lines less than 3 pixels wide disappear almost completely. Strong black printed text is still visible. Normal printed text starts to fade.\\ Option -b 0.5 and higher removes everything that was previously visible, leaving a white page. 
(Annotated text remains, because it is added after unpaper has run.) | ||
- | ^ [[http:// | + | ^ [[http:// |
| option: -b [0.10 ... 0.19]\\ Option -b 0.12 gives the best trade-off between keeping light gray pencil remarks and line thickness. Single system lines are about 4 pixels high. | | | option: -b [0.10 ... 0.19]\\ Option -b 0.12 gives the best trade-off between keeping light gray pencil remarks and line thickness. Single system lines are about 4 pixels high. | | ||
^ unpaper --black-treshold 0.20 ... 0.29 (.pdf) ^ | ^ unpaper --black-treshold 0.20 ... 0.29 (.pdf) ^ | ||
Line 54: | Line 64: | ||
^ unpaper --black-treshold 0.30 ... 0.39 (.pdf) ^ | ^ unpaper --black-treshold 0.30 ... 0.39 (.pdf) ^ | ||
| option: -b [0.30 ... 0.39]\\ At a value of 0.30, symbols with thin vertical lines, such as crosses, start to disintegrate. System lines are still continuous and between 1 and 2 pixels high. At values higher than 0.34 system lines largely disintegrate. | | option: -b [0.30 ... 0.39]\\ At a value of 0.30, symbols with thin vertical lines, such as crosses, start to disintegrate. System lines are still continuous and between 1 and 2 pixels high. At values higher than 0.34 system lines largely disintegrate. | ||
+ | ^ 100 pixels size: [[http:// | ||
+ | | option: -b 0.12 -ls 100 -lp 50 -li [0.00 ... 0.25]: At values larger than 0.07 large square (100x100 pixels) parts are being removed.\\ option: -b 0.12 -ls 8 -lp 4 -li [0.00 ... 0.25]: When -li has a value of at least 0.20, tiny isolated dots (dirt), consisting of about 3x3 pixels are being removed. | ||
+ | |||
+ | ===== steps to find unpaper settings ===== | ||
+ | Scan your raw material at 300 dpi in 8-bit gray. | ||
+ | === 1. black-treshold === | ||
+ | With the following command, unpaper converts a gray image into a black and white image: | ||
+ | unpaper -b $black_treshold --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite -t pbm inputfile.pgm outputfile.pbm | ||
+ | The variable $black_treshold is a ratio between 0 and 1 that sets how dark a pixel must be before it is written as black. In the results above, a low ratio yields a much darker image (at 0.0 even the slightest light gray turns black) and a high ratio a lighter one (at 0.5 and higher the page stays white).\\ Under normal conditions, when the original scan has good visible contrast, $black_treshold should lie in the range [0.1 ... 0.4]. When the raw material is quite dark, $black_treshold may be 0.1 higher than usual. If the sheet music contains pencil remarks that should be kept in the output, a value of 0.12 may be useful. To show fewer pencil remarks, a value of 0.35 may be used. | ||
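As a rough sketch of how candidate values could be compared, the loop below only prints one unpaper command per threshold, so the list can be reviewed before anything is run. The helper name unpaper_cmd, the three sample thresholds and the out-*.pbm file names are made up for illustration; the flags are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: prints unpaper command lines, it does not run unpaper.
# unpaper_cmd and the file names are hypothetical; the flags come from this page.

donot="--no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align"

# unpaper_cmd THRESHOLD INPUT OUTPUT
# Prints the command line for one black threshold value.
unpaper_cmd() {
    printf 'unpaper -b %s %s --overwrite -t pbm %s %s\n' "$1" "$donot" "$2" "$3"
}

# Sample thresholds: keep pencil remarks (0.12), a middle value, fewer remarks (0.35).
for black_treshold in 0.12 0.23 0.35; do
    unpaper_cmd "$black_treshold" inputfile.pgm "out-$black_treshold.pbm"
done
```

Piping the printed lines to sh would execute them, assuming unpaper is installed and inputfile.pgm exists.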
+ | |||
+ | === 2. removing black borders === | ||
+ | Black borders at edges of your scan can be removed automatically with this option. See [[http:// | ||
+ | |||
+ | === 3. blurfilter === | ||
+ | See pdf examples above. | ||
+ | === 4. grayfilter === | ||
+ | See pdf examples above. | ||
+ | |||
+ | ===== Scripting with unpaper ===== | ||
+ | ==== gray2black ==== | ||
+ | Automatic conversion from pdf files, containing gray images, to black and white images with unpaper and its noise reduction and cleaning up features can be done with the following script: | ||
+ | <code bash> | ||
+ | #!/bin/sh | ||
+ | # filename: gray2black | ||
+ | # This script processes a pdf file containing gray images with | ||
+ | # unpaper, imagemagick and ghostscript to a black and white pdf file. | ||
+ | # | ||
+ | # Usage: gray2black input-file [option] output-file | ||
+ | # | ||
+ | # option: -b [value] specify black threshold value as being used in | ||
+ | # | ||
+ | # be used: -b 0.12 | ||
+ | # input-file: a pdf file containing sheetmusic with one or more gray | ||
+ | # | ||
+ | # output-file: | ||
+ | # settings will be created. | ||
+ | # | ||
+ | # Example: | ||
+ | # Process gray images in mozart.pdf: gray2black mozart.pdf -b 0.23 test.pdf | ||
+ | # | ||
+ | # This script uses imagemagick to convert and center an image on an | ||
+ | # a4 page. (With 2480x3508 pixels at a resolution of 300 dpi) and | ||
+ | # ghostscript to create a small pdf file. Piping (|) between | ||
+ | # commands, with - standing for standard input or output, keeps | ||
+ | # the number of temporary files down. | ||
+ | # | ||
+ | # NB. This script is time consuming. So have a lot of patience. | ||
+ | # Each page will take about 15 seconds to execute on a single core | ||
+ | # AMD64 3500+ cpu. | ||
+ | # | ||
+ | # User text | ||
+ | _fbhelp=" | ||
+ | _fbspdf=" | ||
+ | _fbcvrt=" | ||
+ | # Unpaper settings, start value for black-threshold is 0.12 | ||
+ | b_threshold=" | ||
+ | donot=" | ||
+ | filter=" | ||
+ | # Other settings | ||
+ | ptmpf=" | ||
+ | |||
+ | # Page dimensions | ||
+ | psize=" | ||
+ | horpix=2480 | ||
+ | verpix=3508 | ||
+ | pdim=" | ||
+ | pres=" | ||
+ | |||
+ | # Only resize if absolutely necessary. Values for resizing are for left, right, top and bottom border. | ||
+ | hborder=33 | ||
+ | vborder=33 | ||
+ | lhorpix=$horpix | ||
+ | lverpix=$verpix | ||
+ | |||
+ | # Store input-file name temporarily, | ||
+ | arg1=$1 | ||
+ | |||
+ | # Find number of parameters passed | ||
+ | if [ " | ||
+ | echo " | ||
+ | exit 1 | ||
+ | fi | ||
+ | |||
+ | # Check OPTIONS | ||
+ | if [ " | ||
+ | case " | ||
+ | * ) | ||
+ | echo " | ||
+ | exit 1 | ||
+ | ;; | ||
+ | esac | ||
+ | elif [ " | ||
+ | case " | ||
+ | -b ) | ||
+ | shift | ||
+ | b_threshold=$2 | ||
+ | shift | ||
+ | ;; | ||
+ | * ) | ||
+ | echo " | ||
+ | exit 1 | ||
+ | ;; | ||
+ | esac | ||
+ | fi | ||
+ | |||
+ | # only process further if there are two arguments left. | ||
+ | if [ " | ||
+ | if [ ! " | ||
+ | echo " | ||
+ | exit 1 | ||
+ | else | ||
+ | # convert the pages to pgm files named $ptmpf-*.pgm | ||
+ | pdftoppm -gray -r 300 " | ||
+ | |||
+ | # Check whether the document needs resizing. Iterating through all pages first makes sure that all pages are later resized uniformly. | ||
+ | |||
+ | resize=0 | ||
+ | # Iterate through files. We need to sort files in a numerical order because pdftoppm generates files without a leading zero. | ||
+ | # However before the numbers, a ' | ||
+ | for filenum in $(ls $ptmpf-*.pgm | cut -d' | ||
+ | dimension=" | ||
+ | # width: " | ||
+ | # height: " | ||
+ | if [ ${dimension%%x*} -gt $lhorpix ]; then | ||
+ | lhorpix=${dimension%%x*} | ||
+ | resize=1 | ||
+ | fi | ||
+ | if [ ${dimension## | ||
+ | lverpix=${dimension## | ||
+ | resize=1 | ||
+ | fi | ||
+ | done | ||
+ | # Resize document if necessary | ||
+ | if [ $resize -eq 1 ]; then | ||
+ | # image fits within horizontal boundary, so adaptation must be vertical | ||
+ | if [ $horpix -eq $lhorpix ]; then | ||
+ | vscale=$(( ($verpix-30)*100/ | ||
+ | else | ||
+ | vscale=100 | ||
+ | fi | ||
+ | # image fits within vertical boundary, so adaptation must be horizontal | ||
+ | if [ $verpix -eq $lverpix ]; then | ||
+ | hscale=$(( ($horpix-30)*100/ | ||
+ | else | ||
+ | hscale=100 | ||
+ | fi | ||
+ | # if scaling is necessary both horizontally and vertically, use the smaller percentage | ||
+ | if [ $hscale -lt $vscale ]; then scale=$hscale; | ||
+ | |||
+ | echo Some images exceed the maximum size. Now scaling to " | ||
+ | # iterate through all files and resize all with the same percentage, use 8 bits per pixel | ||
+ | for filenum in $(ls $ptmpf-*.pgm | cut -d' | ||
+ | mogrify -depth 8 -resize " | ||
+ | echo resizing file: " | ||
+ | done | ||
+ | fi | ||
+ | # apply unpaper onto each page | ||
+ | for filenum in $(ls $ptmpf-*.pgm | cut -d' | ||
+ | # Parameter expansion: ${param%word} removes from the end the smallest part of param that matches word and returns the rest. (p.72) | ||
+ | unpaper -b $b_threshold $filter $donot -t pbm " | ||
+ | rm " | ||
+ | done | ||
+ | |||
+ | # center pbm page on an a4 canvas, convert to pdf | ||
+ | pdflst="" | ||
+ | for filenum in $(ls $ptmpf-*.pbm | cut -d' | ||
+ | convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center " | ||
+ | rm " | ||
+ | pdflst=" | ||
+ | done | ||
+ | |||
+ | # merge all pages into one pdf file | ||
+ | gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH " | ||
+ | rm $pdflst | ||
+ | fi | ||
+ | fi | ||
+ | exit 0 | ||
+ | </ | ||
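The resize step in the script can be approximated with integer arithmetic alone. The sketch below is a simplified variant, not the script's exact logic: fit_percent is a hypothetical helper that returns one percentage for the whole document, chosen so that the largest page fits inside the target size minus a margin, and never above 100.

```shell
#!/bin/sh
# Sketch only: fit_percent is a hypothetical helper, not part of gray2black.

# fit_percent TARGET_W TARGET_H MAX_W MAX_H MARGIN
# Prints the integer percentage by which every page should be scaled so that
# the largest page (MAX_W x MAX_H) fits into the target minus MARGIN pixels.
fit_percent() {
    tw=$1 th=$2 mw=$3 mh=$4 margin=$5
    hscale=$(( (tw - margin) * 100 / mw ))
    vscale=$(( (th - margin) * 100 / mh ))
    # The tighter of the two directions decides the scale.
    if [ "$hscale" -lt "$vscale" ]; then scale=$hscale; else scale=$vscale; fi
    # Never enlarge pages that already fit.
    if [ "$scale" -gt 100 ]; then scale=100; fi
    echo "$scale"
}

# A4 at 300 dpi is 2480x3508 pixels; the widest scanned page here is 2600 pixels.
fit_percent 2480 3508 2600 3508 30
```

The printed number could then be handed to mogrify as a percentage geometry, for example -resize "94%", so that all pages shrink by the same factor.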
+ | |||
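The page loops in gray2black depend on visiting the pdftoppm output files in page order, which a plain alphabetical listing does not give (page 10 would sort before page 2). A minimal sketch of one way to get the numbers in order, assuming file names of the form prefix-N.pgm and a prefix without regular-expression metacharacters; page_numbers is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: page_numbers is a hypothetical helper, not part of gray2black.

# page_numbers PREFIX
# Reads file names on standard input, keeps those named PREFIX-<digits>.pgm
# and prints the page numbers in numerical order.
page_numbers() {
    sed -n "s/^$1-\([0-9][0-9]*\)\.pgm\$/\1/p" | sort -n
}

# Demonstration with made-up names instead of a real directory listing.
printf 'tmp-10.pgm\ntmp-2.pgm\ntmp-1.pgm\n' | page_numbers tmp
```

In a real run the input would come from something like ls "$ptmpf"-*.pgm instead of printf.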
+ | ==== findblack ==== | ||
+ | Find a value for the black threshold with the following script: | ||
+ | <code bash> | ||
+ | #!/bin/sh | ||
+ | # filename: findblack | ||
+ | # This script helps to find an unpaper' | ||
+ | # creating a series of pdf files each with a different black threshold setting. By observing | ||
+ | # the pages afterwards, selecting a proper black threshold value should be easier. Values are | ||
+ | # annotated on top of each page. | ||
+ | # | ||
+ | # usage: findblack input-file [option] output-file | ||
+ | # | ||
+ | # option: -p [number] which page should be used as source for unpaper. If not specified, first page will be taken. | ||
+ | # input-file: a pdf file containing one or more gray images. | ||
+ | # output-file: | ||
+ | # | ||
+ | # Example: | ||
+ | # Process page 4 from mozart.pdf: ./findblack mozart.pdf -p 4 test.pdf | ||
+ | # | ||
+ | # This script uses imagemagick to convert and center an image on an a4 page. (With 2480x3508 pixels at a resolution of 300 dpi) | ||
+ | # and ghostscript to create a small pdf file. Piping (|) between commands, with - standing for | ||
+ | # standard input or output, keeps the number of temporary files down. | ||
+ | # | ||
+ | # NB. This script is time consuming. So have a lot of patience. | ||
+ | # Each page will take about 15 seconds to execute on a single core AMD64 3500+ cpu. | ||
+ | |||
+ | # User text | ||
+ | _fbhelp=" | ||
+ | _fbspdf=" | ||
+ | _fbcvrt=" | ||
+ | |||
+ | # Unpaper settings, black-threshold in hundredths: start value a=5 (0.05), end value b=20 (0.20), step s=5 (0.05) | ||
+ | a=5 | ||
+ | b=20 | ||
+ | s=5 | ||
+ | donot=" | ||
+ | |||
+ | # Other settings; $$ references the current PID | ||
+ | ptmpf=" | ||
+ | psize=" | ||
+ | pdim=" | ||
+ | pres=" | ||
+ | ptxt=" | ||
+ | pstr=" | ||
+ | # Default page number | ||
+ | defpage=1 | ||
+ | |||
+ | # Find number of parameters passed | ||
+ | if [ " | ||
+ | echo " | ||
+ | exit 1 | ||
+ | fi | ||
+ | |||
+ | # Store input-file name temporarily, | ||
+ | arg1=$1 | ||
+ | |||
+ | # Check OPTIONS | ||
+ | if [ " | ||
+ | case " | ||
+ | * ) | ||
+ | echo " | ||
+ | exit 1 | ||
+ | ;; | ||
+ | esac | ||
+ | elif [ " | ||
+ | case " | ||
+ | -p ) | ||
+ | shift | ||
+ | defpage=$2 | ||
+ | shift | ||
+ | ;; | ||
+ | * ) | ||
+ | echo " | ||
+ | exit 1 | ||
+ | ;; | ||
+ | esac | ||
+ | fi | ||
+ | |||
+ | # only process further if there are two arguments left. | ||
+ | if [ " | ||
+ | if [ ! " | ||
+ | echo " | ||
+ | exit 1 | ||
+ | else | ||
+ | # Find the last page number. pdftoppm may zero-pad the page number in its output file name, so the padding ($trail) is needed to address and later delete that file. | ||
+ | trail="" | ||
+ | lastPage=" | ||
+ | if [ " | ||
+ | trail=" | ||
+ | elif [ " | ||
+ | trail=" | ||
+ | elif [ " | ||
+ | trail=" | ||
+ | elif [ " | ||
+ | trail=" | ||
+ | fi | ||
+ | # convert the selected page ($defpage) to a pgm file named $ptmpf-$trail$defpage.pgm | ||
+ | pdftoppm -gray -r 300 -f $defpage -l $defpage " | ||
+ | # check for successful conversion to .pgm | ||
+ | if [ ! -f " | ||
+ | echo " | ||
+ | rm " | ||
+ | exit 1 | ||
+ | else | ||
+ | pdflst="" | ||
+ | |||
+ | nmb=$a | ||
+ | while [ " | ||
+ | do | ||
+ | # Cope with digits | ||
+ | if [ " | ||
+ | str1=" | ||
+ | else | ||
+ | str1=" | ||
+ | fi | ||
+ | |||
+ | # process unpaper | ||
+ | unpaper -b $str1 $donot -t pbm $ptmpf-$trail$defpage.pgm $ptmpf.pbm | ||
+ | |||
+ | # Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf | ||
+ | # New to bash? | ||
+ | # Send to and retreive from ' | ||
+ | # Sending result to next command with pipe: | | ||
+ | # Using this ' | ||
+ | convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center $ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - " | ||
+ | |||
+ | pdflst=" | ||
+ | nmb=$(($nmb+$s)) | ||
+ | rm $ptmpf.pbm | ||
+ | done | ||
+ | rm $ptmpf-$trail$defpage.pgm | ||
+ | # Merge all pdf files in $pdflst into one single file with filename $2 | ||
+ | gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=$2 $pdflst | ||
+ | rm $pdflst | ||
+ | fi | ||
+ | fi | ||
+ | fi | ||
+ | exit 0 | ||
+ | </ | ||
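The digit handling in the findblack loop can also be left to printf. A minimal sketch, assuming thresholds are kept as whole hundredths as in the script (a=5 stands for 0.05); the function names fmt_threshold and thresholds are hypothetical.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not part of findblack.

# fmt_threshold N
# Turns a whole number of hundredths (0..99) into a ratio such as 0.05.
fmt_threshold() {
    printf '0.%02d\n' "$1"
}

# thresholds START END STEP
# Prints one ratio per line, from START to END inclusive.
thresholds() {
    n=$1
    while [ "$n" -le "$2" ]; do
        fmt_threshold "$n"
        n=$((n + $3))
    done
}

# Same settings as in the script: a=5, b=20, s=5.
thresholds 5 20 5
```

Each printed value can be passed to unpaper -b directly, which removes the need for the separate one-digit and two-digit branches.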
+ | |||
+ | ===== Bugs ===== | ||
+ | [[software: |