Unpaper test

To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.
I scanned a lot of sheet music and removed the skew manually with another application. All processing was done in 8-bit gray at a resolution of 300 dpi.
These tests should show which unpaper settings give a decent automatic conversion to black and white, including despeckling and noise removal.

prerequisites

ubuntu 8.04
imagemagick
unpaper 0.3 (installed by downloading the archive, extracting it manually and running make)
ghostscript
~~pdftk~~ ¹⁾
test image in 8-bit gray at a resolution of 300 dpi (.pdf), converted to .pgm format (scanner: Epson GT-10000+). Contrast was set manually to a value where music showing through from the other side of the page is minimal and the image background is white instead of gray.

Automatic conversion of a single pdf file consisting of several pages to pgm files with pdftoppm (result should be 300 dpi and gray):

pdftoppm -gray -r 300 inputfile.pdf outputfile

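Check the file type of the generated pages with identify (pdftoppm appends a page number and the .pgm extension to the output name):

identify outputfile-*.pgm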
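The header of a pgm file can also be read directly, which is a quick way to confirm that a page really is 8-bit gray (maximum value 255). This is a minimal sketch that assumes the simple three-line header layout without comment lines; the function name pgm_info is hypothetical and not part of any of the tools above:

```shell
#!/bin/sh
# pgm_info FILE
# Prints "MAGIC WIDTH HEIGHT MAXVAL" read from the header of a PGM file.
# Assumption: the header consists of exactly three text lines
# ("P5", "width height", "maxval") and contains no comment lines.
pgm_info() {
	{
		read -r magic
		read -r width height
		read -r maxval
	} < "$1"
	echo "$magic $width $height $maxval"
}

# Build a tiny 2x2 test image so the example runs without a scanner.
demo=$(mktemp) || exit 1
printf 'P5\n2 2\n255\n\000\377\377\000' > "$demo"
pgm_info "$demo"    # prints: P5 2 2 255
rm -f "$demo"
```

The header does not store a resolution, so the 300 dpi setting still has to be checked from the pixel dimensions.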

The following script tests unpaper with different settings and produces a pdf with 10 pages, each page with a different setting. The relevant unpaper options are printed on each page:

#!/bin/sh
 
donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
 
pdflst=""
ptmpf="tempfile_unpaptst_$$"
 
nmb=0
while [ "$nmb" -le 9 ]
do
	str1="0.$nmb"
 
	unpaper -b $str1 $donot -t pbm $ptmpf-1.pgm $ptmpf.pbm
 
	# convert inputfile option annotation outputfile
	# The braces in ${ptmpf}_ca are required; $ptmpf_ca would be read as a different (empty) variable.
	convert $ptmpf.pbm -gravity center -background lightblue -font Open-Sans -pointsize 60 caption:"$str1 \n $donot" -composite ${ptmpf}_ca.pbm
	convert -monochrome -density 300 -units PixelsPerInch ${ptmpf}_ca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
	rm $ptmpf.pbm ${ptmpf}_ca.pbm
	pdflst="$pdflst $ptmpf-$nmb.pdf"
	nmb=$(($nmb+1))
done
# merge all pdf documents with ghostscript. Although pdftk will give a shorter command, the identify command
# gives an error when it analyses the pdf file.
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=result.pdf $pdflst
rm $pdflst
 
exit 0
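The loop above only covers the coarse range 0.0 to 0.9. For the finer ranges, such as 0.10 to 0.19, the counter can be kept in hundredths and formatted into a ratio. A minimal sketch; the function name threshold_series is hypothetical and not part of unpaper:

```shell
#!/bin/sh
# threshold_series START END STEP
# Prints ratios from START to END (inclusive), one per line.
# All three arguments are given in hundredths and must stay in 0..99.
threshold_series() {
	n=$1
	while [ "$n" -le "$2" ]; do
		# %02d keeps the leading zero, so 5 becomes 0.05 and not 0.5
		printf '0.%02d\n' "$n"
		n=$((n + $3))
	done
}

# Example: prints 0.10, 0.13, 0.16 and 0.19, each on its own line
threshold_series 10 19 3
```

Each printed value can be passed to unpaper with -b in place of $str1.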

Results

unpaper --noisefilter-intensity (1-25) (.pdf)
option: -ni [1 … 25] Above a value of 10, tiny clusters of about 3×3 pixels (9 pixels, and thus fewer than 10?) are removed.
unpaper --white-threshold (.pdf)
option: -w [0.0 … 0.9] At -w 0.6 a few light gray pixels suddenly become invisible. Higher and lower values show no difference. It is strange that this value has no continuity.
unpaper --black-threshold (coarse) (.pdf)
option: -b [0.0 … 0.9]
Option -b 0.0 turns even the slightest light gray spots into black pixels, giving the page large black areas and making it hard to read.
Option -b 0.1 turns most of the white/gray background into white and turns remarks written in light gray pencil into black. Single system lines are about 6 pixels high.
Option -b 0.2 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single system lines are about 4 pixels high.
Option -b 0.3 makes remarks written in light gray pencil invisible. Darker gray is still visible. Single system lines are about 2 pixels high.
Option -b 0.4 makes system lines and other lines less than 3 pixels wide disappear almost completely. Strong black printed text is still visible. Normal printed text starts to fade.
Option -b 0.5 and higher removes everything that was previously visible, leaving a white page. (The annotated text remains, because it is added after unpaper has run.)
unpaper --black-threshold 0.10 … 0.19 (.pdf)
option: -b [0.10 … 0.19] Option -b 0.12 is an optimum between keeping light gray pencil remarks and line thickness. Single system lines are about 4 pixels high.
unpaper --black-threshold 0.20 … 0.29 (.pdf)
option: -b [0.20 … 0.29] At -b 0.20, printed black symbols (like the alla breve symbol) are still as visible as in the original. Single system lines are about 3 pixels high. Stems are 2 pixels wide. Higher settings make all lines gradually thinner, and system lines get spots where they are much thinner. At 0.29 and higher some stems disintegrate.
unpaper --black-threshold 0.30 … 0.39 (.pdf)
option: -b [0.30 … 0.39] At 0.30, symbols with thin vertical lines, such as crosses, disintegrate. System lines are still continuous and between 1 and 2 pixels high. At values higher than 0.34 system lines largely disintegrate.
100 pixels size: unpaper --blurfilter-intensity 0.00 … 0.25 (.pdf)
8 pixels size: unpaper --blurfilter-intensity 0.00 … 0.25 (.pdf)
4 pixels size: unpaper --blurfilter-intensity 0.00 … 0.25 (.pdf)
option: -b 0.12 -ls 100 -lp 50 -li [0.00 … 0.25]: At values larger than 0.07, large square parts (100×100 pixels) are removed.
option: -b 0.12 -ls 8 -lp 4 -li [0.00 … 0.25]: When -li has a value of at least 0.20, tiny isolated dots (dirt) of about 3×3 pixels are removed.
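The values that worked in these tests can be collected into one command line. The sketch below only prints the command, so it can be inspected before anything is overwritten. The function name unpaper_cmd is hypothetical; the options are the ones used elsewhere on this page, and the values fit 300 dpi gray scans only:

```shell
#!/bin/sh
# Settings taken from the test results (300 dpi, 8-bit gray scans).
b_threshold="0.12"                      # keeps light gray pencil remarks
filter="-ni 10 -ls 8 -lp 4 -li 0.20"    # noise filter plus a small blur filter
donot="--no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"

# unpaper_cmd INPUT OUTPUT
# Prints the unpaper command instead of running it (dry run).
unpaper_cmd() {
	echo "unpaper -b $b_threshold $filter $donot -t pbm $1 $2"
}

unpaper_cmd page-1.pgm page-1.pbm
```

To run the command for real, replace echo with a direct call to unpaper and leave the variables unquoted so that they split into separate options.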

steps to find unpaper settings

Scan your raw material at 300 dpi in 8-bit gray.

1. black-threshold

With the following command, unpaper converts a gray image into a black and white image:

unpaper -b $black_treshold --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite -t pbm inputfile.pgm outputfile.pbm

The variable $black_treshold is a ratio between 0 and 1. In the tests above a low ratio gives a much darker image: at -b 0.0 even the lightest gray turns black, while at -b 0.5 and higher almost nothing remains black. The ratio therefore appears to express how dark a pixel must be before it is kept as black, rather than a brightness value below which a pixel turns black. Under normal conditions, when the original scan has good visible contrast, $black_treshold should lie in the range [0.1 … 0.4]. When the raw material is quite dark, $black_treshold may be 0.1 higher than usual. If the sheet music contains pencil remarks that should be kept in the output, a value of 0.12 may be useful. To show fewer pencil remarks, a value of 0.35 may be used.
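These rules of thumb can be written down as a small helper that proposes a starting value. It is only a sketch: the function name pick_black_threshold and its yes/no arguments are hypothetical, and the result is a first guess to refine by looking at the output.

```shell
#!/bin/sh
# pick_black_threshold KEEP_PENCIL DARK_SOURCE
# Both arguments are "yes" or "no". Prints a starting value for -b.
# The arithmetic is done in hundredths because sh has no floating point.
pick_black_threshold() {
	if [ "$1" = "yes" ]; then
		t=12    # 0.12 keeps light gray pencil remarks
	else
		t=35    # 0.35 shows fewer pencil remarks
	fi
	if [ "$2" = "yes" ]; then
		t=$((t + 10))    # quite dark raw material: about 0.1 higher
	fi
	printf '0.%02d\n' "$t"
}

pick_black_threshold yes no     # prints 0.12
pick_black_threshold yes yes    # prints 0.22
```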

2. removing black borders

Black borders at edges of your scan can be removed automatically with this option. See unpaper user documentation for details.

3. blurfilter

See pdf examples above.

4. grayfilter

See pdf examples above.

Scripting with unpaper

gray2black

The following script automatically converts pdf files containing gray images to black and white, using unpaper with its noise reduction and clean-up features:

#!/bin/sh
# filename: gray2black
# This script processes a pdf file containing gray images with
# unpaper, imagemagick and ghostscript to a black and white pdf file.
# 
# Usage: gray2black input-file [option] output-file
#
# option: -b [value] specify black threshold value as being used in
#                     unpaper with -b. If omitted, default value will
#                     be used: -b 0.12
# input-file: a pdf file containing sheetmusic with one or more gray
#             images.
# output-file: a multipage black and white pdf file containing all
#              processed pages will be created.
#
# Example:
# Process gray images in mozart.pdf: gray2black mozart.pdf -b 0.23 test.pdf
# 
# This script uses imagemagick to convert and center an image on an
# a4 page. (With 2480x3508 pixels at a resolution of 300 dpi) and
# ghostscript to create a small pdf file.  By using efficiently the
# piping method (|) in conjunction with reading and writing to the
# standard input (-) we get less tempfiles.
#
# NB. This script is time consuming. So have a lot of patience.
# Each page will take about 15 seconds to execute on a single core
# AMD64 3500+ cpu.
#
# User text 
_fbhelp="Usage: gray2black input-file [OPTION...] output-file\n\n  -b [value], specify black threshold value as being used in\n              unpaper with -b. If omitted, default value will\n              be used: -b 0.12.\n\nExample:\nProcess pdf file with unpaper black threshold 0.23:\n  gray2black mozart.pdf -b 0.23 result.pdf\nProcess pdf file with -b 0.12 settings:\n  gray2black mozart.pdf result.pdf"
_fbspdf="please supply a pdf file"
_fbcvrt="Check your pdf file. Does it contain gray images?"
# Unpaper settings, start value for black-threshold is 0.12
b_threshold="0.12"
donot="--no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
filter="-ni 10 -ls 8 -lp 4 -li 0.20"
# Other settings
ptmpf="tempfile_gray2black_$$"
 
# Page dimensions
psize="a4"
horpix=2480
verpix=3508
pdim="$horpix"x"$verpix"
pres="300"
 
# Only resize if absolutely necessary. Values for resizing are for left, right, top and bottom border.
hborder=33
vborder=33
lhorpix=$horpix
lverpix=$verpix
 
# Store input-file name temporarily, because it's deleted by shift command.
arg1=$1
 
# Find number of parameters passed
if [ "$#" -le 1 ]; then
	echo "$_fbhelp"
	exit 1
fi
 
# Check OPTIONS
if [ "$#" -eq 3 ]; then
	# Three arguments can never be valid: the option needs a value.
	echo "$_fbhelp"
	exit 1
elif [ "$#" -eq 4 ]; then
	case "$2" in
		-b )
			shift
			b_threshold=$2
			shift
			;;
		* )
			echo "$_fbhelp"
			exit 1
			;;
	esac
fi
 
# only process further if there are two arguments left.
if [ "$#" -eq 2 ]; then
	if [ ! "$(identify "$arg1"|grep PDF)" ]; then
		echo "$_fbspdf"
		exit 1
	else
		# convert all pages to pgm files with filenames $ptmpf-*.pgm
		pdftoppm -gray -r 300 "$arg1" $ptmpf
 
		# Check whether the document needs resizing. Iterating through all pages first makes sure that all pages get uniform resizing later.
 
		resize=0
		# Iterate through files. We need to sort files in a numerical order because pdftoppm generates files without a leading zero. 
		# However before the numbers, a '-' delimiter character had been added by pdftoppm. Now cut the part before that.
		for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
			dimension="$(identify -format "%[fx:w]x%[fx:h]" $ptmpf'-'$filenum)"
			#  width: "${dimension%%x*}" -> From the end removes the longest part of dimension that matches x* and returns the rest.
			#  height: "${dimension##*x}" -> From the beginning removes the longest part of dimension that matches *x and returns the rest.
			if [ ${dimension%%x*} -gt $lhorpix ]; then
				lhorpix=${dimension%%x*}
				resize=1
			fi
			if [ ${dimension##*x} -gt $lverpix ]; then
				lverpix=${dimension##*x}
				resize=1
			fi
		done
		# Resize document if necessary
		if [ $resize -eq 1 ]; then
			# $lhorpix and $lverpix hold the largest width and height found.
			# Compute a percentage for each direction that exceeds the page.
			hscale=100
			vscale=100
			if [ $lhorpix -gt $horpix ]; then
				hscale=$(( ($horpix-30)*100/$lhorpix ))
			fi
			if [ $lverpix -gt $verpix ]; then
				vscale=$(( ($verpix-30)*100/$lverpix ))
			fi
			# use the smaller percentage, so the result fits in both directions
			if [ $hscale -lt $vscale ]; then scale=$hscale; else scale=$vscale; fi
 
			echo Some images exceed the maximum size. Now scaling to "$scale"%.
			# iterate through all files and resize all with the same percentage, use 8 bits per pixel
			for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
				mogrify -depth 8 -resize "$scale"% "$ptmpf-$filenum"
				echo resizing file: "$ptmpf-$filenum"
			done
		fi
		# apply unpaper onto each page
		for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
			# Parameter expansion: ${param%word} From the end removes the smallest part of param that matches word and returns the rest. (p.72)
			unpaper -b $b_threshold $filter $donot -t pbm "$ptmpf-$filenum" "$ptmpf-"${filenum%.pgm}.pbm
			rm "$ptmpf-$filenum"
		done
 
		# center pbm page on an a4 canvas, convert to pdf
		pdflst=""
		for filenum in $(ls $ptmpf-*.pbm | cut -d'-' -f2 | sort -n); do
			convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center "$ptmpf-$filenum" - miff:- | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-"${filenum%.pbm}.pdf
			rm "$ptmpf-$filenum"
			pdflst="$pdflst $ptmpf-"${filenum%.pbm}.pdf
		done
 
		# merge all pages into one pdf file
		gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
		rm $pdflst
	fi
fi
exit 0
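The resize step of gray2black comes down to one percentage per direction, of which the smaller one is applied to all pages. A minimal standalone sketch of that arithmetic, which also covers a page that is both too wide and too high. The function name fit_scale is hypothetical; the page size and the 30 pixel margin are the values used in the script:

```shell
#!/bin/sh
# fit_scale WIDTH HEIGHT
# Prints the percentage by which an image of WIDTH x HEIGHT pixels has to
# be scaled to fit on an a4 page at 300 dpi, keeping a 30 pixel margin.
fit_scale() {
	horpix=2480
	verpix=3508
	hscale=100
	vscale=100
	if [ "$1" -gt "$horpix" ]; then
		hscale=$(( (horpix - 30) * 100 / $1 ))
	fi
	if [ "$2" -gt "$verpix" ]; then
		vscale=$(( (verpix - 30) * 100 / $2 ))
	fi
	# The smaller percentage makes the image fit in both directions.
	if [ "$hscale" -lt "$vscale" ]; then echo "$hscale"; else echo "$vscale"; fi
}

fit_scale 2480 3508    # prints 100 (fits, no scaling)
fit_scale 2600 3508    # prints 94  (too wide only)
fit_scale 2500 4000    # prints 86  (too wide and too high)
```

Integer division rounds down, so the scaled image never ends up larger than the computed limit.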

findblack

Find a value for black threshold with following script:

#!/bin/sh
# filename: findblack
# This script helps to find a value for unpaper's black threshold (-b option) by
# creating a series of pdf files each with a different black threshold setting. By observing
# the pages afterwards, selecting a proper black threshold value should be easier. Values are
# annotated on top of each page.
#
# usage: findblack input-file [option] output-file
#
# option: -p [number] which page should be used as source for unpaper. If not specified, first page will be taken.
# input-file: a pdf file containing one or more gray images.
# output-file: a multipage pdf file with a series of black-threshold settings will be created.
#
# Example:
# Process page 4 from mozart.pdf: ./findblack mozart.pdf -p 4 test.pdf
# 
# This script uses imagemagick to convert and center an image on an a4 page. (With 2480x3508 pixels at a resolution of 300 dpi)
# and ghostscript to create a small pdf file.  By using efficiently the piping method (|) in conjunction with
# reading and writing to the standard input (-) we get less tempfiles.
#
# NB. This script is time consuming. So have a lot of patience.
# Each page will take about 15 seconds to execute on a single core AMD64 3500+ cpu.
 
# User text 
_fbhelp="Usage: findblack input-file [OPTION...] output-file\n\n  -p foo, processes page number foo\n          If this option is omitted page 1 will be processed by default\n\nExample:\nProcess page 5:\n  findblack mozart.pdf -p 5 result.pdf\nProcess the first page:\n  findblack mozart.pdf result.pdf"
_fbspdf="please supply a pdf file"
_fbcvrt="Can't find converted first page of pdf file. Check your pdf file"
 
# Unpaper settings in hundredths: start value for black-threshold is 0.05 (a), end value is 0.20 (b), step = 0.05 (s)
a=5
b=20
s=5
donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
 
# Other settings; $$ references the current PID
ptmpf="tempfile_findblack_$$"
psize="a4"
pdim="2480x3508"
pres="300"
ptxt="black threshold:"
pstr="-font Times-Roman -pointsize 36 -gravity NorthWest -annotate +5+5"
# Default page number
defpage=1
 
# Find number of parameters passed
if [ "$#" -le 1 ]; then
	echo "$_fbhelp"
	exit 1
fi
 
# Store input-file name temporarily, because it's deleted by shift command.
arg1=$1
 
# Check OPTIONS
if [ "$#" -eq 3 ]; then
	# Three arguments can never be valid: the option needs a value.
	echo "$_fbhelp"
	exit 1
elif [ "$#" -eq 4 ]; then
	case "$2" in
		-p )
			shift
			defpage=$2
			shift
			;;
		* )
			echo "$_fbhelp"
			exit 1
			;;
	esac
fi
 
# only process further if there are two arguments left.
if [ "$#" -eq 2 ]; then
	if [ ! "$(identify "$arg1" | grep PDF)" ]; then
		echo "$_fbspdf"
		exit 1
	else
		# Find the index of the last page (identify counts from 0). It determines the
		# leading zeros that pdftoppm may add to the page number in the result file name.
		# Test the largest range first; otherwise the first branch matches every value of 10 or more.
		trail=""
		lastPage="$(identify "$arg1" | tail -n 1 | sed 's/\(.*\)\[\(.*\)\].*/\2/')"
		if [ "$lastPage" -ge 10000 ]; then
			trail="0000"
		elif [ "$lastPage" -ge 1000 ]; then
			trail="000"
		elif [ "$lastPage" -ge 100 ]; then
			trail="00"
		elif [ "$lastPage" -ge 10 ]; then
			trail="0"
		fi
		# convert the selected page to a pgm file with filename $ptmpf-<page>.pgm
		pdftoppm -gray -r 300 -f $defpage -l $defpage "$arg1" $ptmpf
		# check for successful conversion to .pgm
		if [ ! -f "$ptmpf-$trail$defpage.pgm" ]; then
			echo "$_fbcvrt"
			# the expected file is missing; remove whatever pdftoppm produced under another name
			rm -f $ptmpf-*.pgm
			exit 1
		else
			pdflst=""
 
			nmb=$a
			while [ "$nmb" -le $b ]
			do
				# Cope with digits
				if [ "$nmb" -le 9 ]; then
					str1="0.0$nmb"
				else
					str1="0.$nmb"
				fi
 
				# process unpaper
				unpaper -b $str1 $donot -t pbm $ptmpf-$trail$defpage.pgm $ptmpf.pbm
 
				# Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf
				# New to bash?
				# Send to 'standard output' and retrieve from 'standard input' by using a dash: -
				# Sending result to next command with pipe: |
				# Using this 'technique' we don't need to create temporary files.
				convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center $ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
 
				pdflst="$pdflst $ptmpf-$nmb.pdf"
				nmb=$(($nmb+$s))
				rm $ptmpf.pbm
			done
			rm $ptmpf-$trail$defpage.pgm
			# Merge all pdf files in $pdflst into one single file with filename $2
			gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
			rm $pdflst
		fi
	fi
fi
exit 0
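Both scripts leave their temporary files behind when they stop halfway. One possible way around this, shown here as a sketch and not as part of the scripts above, is to keep all intermediate files in a private directory that is removed when the script exits. It assumes that the mktemp command is available:

```shell
#!/bin/sh
# Create a private scratch directory and remove it when the script exits,
# whether it finishes normally or stops early through an exit statement.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT

# Temporary files then live inside $workdir instead of the current directory.
ptmpf="$workdir/page"
echo "intermediate files use the prefix: $ptmpf"
```

With this in place the separate rm commands for intermediate files are no longer strictly needed, although removing large pgm files early still saves disk space.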

Bugs

See the following page

¹⁾

I get the following warning when I do an identify file.pdf: Warning: File has an invalid xref entry: 2. Rebuilding xref table.

Auditeon
