Differences

This shows you the differences between two versions of the page.

--- software:unpaper_test [2009/03/10 01:47] – admin
+++ software:unpaper_test [2015/04/22 21:51] (current) – [findblack] admin
@@ Line 1: / Line 1: @@
-====== Unpaper test =====
+====== Unpaper test ======
 To better understand the large number of options in unpaper, I ran several tests that show the differences between settings.\\ I scanned a lot of sheet music and deskewed the pages manually with another application. All actions were done in 8-bit gray at a resolution of 300 dpi.\\ This test should show which unpaper settings give a decent automatic conversion to black and white with despeckling and noise removal.
@@ Line 6: / Line 6: @@
   * imagemagick
   * unpaper 0.3 (installed by downloading the archive, extracting it manually and running make)
-  * pdftk
+  * ghostscript
+  * <del>pdftk</del> ((Running identify file.pdf on a pdf file merged with pdftk gives the following warning: Warning:  File has an invalid xref entry:  2.  Rebuilding xref table.))
   * [[http://www.auditeon.com/xyz/projects/unpaper/out.pdf|testimage 8-bit gray at a resolution of 300 dpi (.pdf)]], converted in .pgm format. (Scanner: Epson GT-10000+) contrast had been set manually at a value where music from the other page side ('see through') is minimal and image background is white instead of gray.
@@ Line 22: / Line 23: @@
 <code bash>
 #!/bin/sh
-donot="--no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
+donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
+pdflst=""
+ptmpf="tempfile_unpaptst_$$"
 nmb=0
 while [ "$nmb" -le 9 ]
 do
-	str1="-w 0.$nmb $donot -t pbm out.pgm outc.pbm"
+	str1="0.$nmb"
-	unpaper $str1
-# convert inputfile option annotation outputfile
+	unpaper -b $str1 $donot -t pbm $ptmpf-1.pgm $ptmpf.pbm
-	convert outc.pbm -gravity NorthWest -annotate 0 "unpaper $str1" outca.pbm
-	convert -density 300 -units PixelsPerInch outca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "out-$nmb.pdf"
+	# convert inputfile option annotation outputfile
-	rm outc.pbm
+	convert $ptmpf.pbm -gravity center -background lightblue -font Open-Sans -pointsize 60 caption:"$str1 \n $donot" -composite ${ptmpf}_ca.pbm
-	rm outca.pbm
+	convert -monochrome -density 300 -units PixelsPerInch ${ptmpf}_ca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
+	rm $ptmpf.pbm ${ptmpf}_ca.pbm
+	pdflst="$pdflst $ptmpf-$nmb.pdf"
 	nmb=$(($nmb+1))
 done
-pdftk out-*.pdf output result.pdf
+# merge all pdf documents with ghostscript. Although pdftk will give a shorter command, the identify command
-rm out-*.pdf
+# gives an error when it analyses the pdf file.
+gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=result.pdf $pdflst
+rm $pdflst
 exit 0
@@ Line 67: / Line 77: @@
 Black borders at edges of your scan can be removed automatically with this option. See [[http://unpaper.berlios.de/unpaper.html|unpaper user documentation]] for details.
-=== 3.
+=== 3. blurfilter ===
+See pdf examples above.
+=== 4. grayfilter ===
+See pdf examples above.
+===== Scripting with unpaper =====
+==== gray2black ====
+The following script automatically converts pdf files containing gray images to black and white, using unpaper's noise reduction and clean-up features:
+<code bash>
+#!/bin/sh
+# filename: gray2black
+# This script processes a pdf file containing gray images with
+# unpaper, imagemagick and ghostscript to a black and white pdf file.
+#
+# Usage: gray2black input-file [option] output-file
+#
+# option: -b [value] specify black threshold value as being used in
+#                     unpaper with -b. If omitted, default value will
+#                     be used: -b 0.12
+# input-file: a pdf file containing sheetmusic with one or more gray
+#             images.
+# output-file: a multipage black and white pdf file will be
+#              created.
+#
+# Example:
+# Process gray images in mozart.pdf: gray2black mozart.pdf -b 0.23 test.pdf
+#
+# This script uses imagemagick to convert and center an image on an
+# a4 page. (With 2480x3508 pixels at a resolution of 300 dpi) and
+# ghostscript to create a small pdf file.  Piping (|) between
+# commands that read standard input and write standard output (-)
+# keeps the number of temporary files low.
+#
+# NB. This script is time consuming. So have a lot of patience.
+# Each page will take about 15 seconds to execute on a single core
+# AMD64 3500+ cpu.
+#
+# User text
+_fbhelp="Usage: gray2black input-file [OPTION...] output-file\n\n  -b [value], specify black threshold value as being used in\n              unpaper with -b. If omitted, default value will\n              be used: -b 0.12.\n\nExample:\nProcess pdf file with unpaper black threshold 0.23:\n  gray2black mozart.pdf -b 0.23 result.pdf\nProcess pdf file with -b 0.12 settings:\n  gray2black mozart.pdf result.pdf"
+_fbspdf="please supply a pdf file"
+_fbcvrt="Check your pdf file. Does it contain gray images?"
+# Unpaper settings, start value for black-threshold is 0.12
+b_threshold="0.12"
+donot="--no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
+filter="-ni 10 -ls 8 -lp 4 -li 0.20"
+# Other settings
+ptmpf="tempfile_gray2black_$$"
+# Page dimensions
+psize="a4"
+horpix=2480
+verpix=3508
+pdim="$horpix"x"$verpix"
+pres="300"
+# Only resize if absolutely necessary. Values for resizing are for left, right, top and bottom border.
+hborder=33
+vborder=33
+lhorpix=$horpix
+lverpix=$verpix
+# Store input-file name temporarily, because it's deleted by shift command.
+arg1=$1
+# Find number of parameters passed
+if [ "$#" -le 1 ]; then
+	echo "$_fbhelp"
+	exit 1
+fi
+# Check OPTIONS
+if [ "$#" -eq 3 ]; then
+	case "$2" in
+		* )
+			echo "$_fbhelp"
+			exit 1
+			;;
+	esac
+	elif [ "$#" -eq 4 ]; then
+		case "$2" in
+			-b )
+				shift
+				b_threshold=$2
+				shift
+				;;
+			* )
+				echo "$_fbhelp"
+				exit 1
+				;;
+		esac
+fi
+# only process further if there are two arguments left.
+if [ "$#" -eq 2 ]; then
+	if [ ! "$(identify "$arg1"|grep PDF)" ]; then
+		echo "$_fbspdf"
+		exit 1
+	else
+		# convert all pages to pgm files with filenames $ptmpf-*.pgm
+		pdftoppm -gray -r 300 "$arg1" $ptmpf
+		# Check whether the document needs resizing. Iterating through all pages first ensures that all pages are resized uniformly later.
+		resize=0
+		# Iterate through files. We need to sort files in a numerical order because pdftoppm generates files without a leading zero.
+		# However before the numbers, a '-' delimiter character had been added by pdftoppm. Now cut the part before that.
+		for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
+			dimension="$(identify -format "%[fx:w]x%[fx:h]" $ptmpf'-'$filenum)"
+			#  width: "${dimension%%x*}" -> From the end removes the longest part of dimension that matches x* and returns the rest.
+			#  height: "${dimension##*x}" -> From the beginning removes the longest part of dimension that matches *x and returns the rest.
+			if [ ${dimension%%x*} -gt $lhorpix ]; then
+				lhorpix=${dimension%%x*}
+				resize=1
+			fi
+			if [ ${dimension##*x} -gt $lverpix ]; then
+				lverpix=${dimension##*x}
+				resize=1
+			fi
+		done
+		# Resize document if necessary
+		if [ $resize -eq 1 ]; then
+			# scale vertically only if some page is taller than the canvas
+			if [ $lverpix -gt $verpix ]; then
+				vscale=$(( ($verpix-30)*100/$lverpix ))
+			else
+				vscale=100
+			fi
+			# scale horizontally only if some page is wider than the canvas
+			if [ $lhorpix -gt $horpix ]; then
+				hscale=$(( ($horpix-30)*100/$lhorpix ))
+			else
+				hscale=100
+			fi
+			# use the smaller factor so that all pages fit both horizontally and vertically
+			if [ $hscale -lt $vscale ]; then scale=$hscale; else scale=$vscale; fi
+			echo Some images exceed the maximum size. Now scaling to "$scale"%.
+			# iterate through all files and resize all with the same percentage, use 8 bits per pixel
+			for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
+				mogrify -depth 8 -resize "$scale"% "$ptmpf-$filenum"
+				echo resizing file: "$ptmpf-$filenum"
+			done
+		fi
+		# apply unpaper onto each page
+		for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
+			# Parameter expansion: ${param%word} removes from the end the smallest part of param that matches word and returns the rest. (p.72)
+			unpaper -b $b_threshold $filter $donot -t pbm "$ptmpf-$filenum" "$ptmpf-"${filenum%.pgm}.pbm
+			rm "$ptmpf-$filenum"
+		done
+		# center pbm page on an a4 canvas, convert to pdf
+		pdflst=""
+		for filenum in $(ls $ptmpf-*.pbm | cut -d'-' -f2 | sort -n); do
+			convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center "$ptmpf-$filenum" - miff:- | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-"${filenum%.pbm}.pdf
+			rm "$ptmpf-$filenum"
+			pdflst="$pdflst $ptmpf-"${filenum%.pbm}.pdf
+		done
+		# merge all pages into one pdf file
+		gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
+		rm $pdflst
+	fi
+fi
+exit 0
+</code>
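The resize step in gray2black combines two ideas: splitting a WIDTHxHEIGHT string with parameter expansion, and picking one percentage that makes the largest page fit the canvas. The following is a minimal sketch of that calculation in isolation, not the author's script. The function name fit_scale, its argument order and the default margin of 30 pixels are assumptions made for this illustration (30 is the value hard-coded in the arithmetic above).

```shell
#!/bin/sh
# Hypothetical helper, not part of gray2black.
# Prints the percentage by which pages must be scaled so that the
# largest page fits on the canvas; prints 100 if nothing has to shrink.
# usage: fit_scale canvas_width canvas_height max_width max_height [margin]
fit_scale() {
	cw=$1; ch=$2; mw=$3; mh=$4; margin=${5:-30}
	hscale=100
	vscale=100
	# a factor below 100 is only needed in a direction that overflows
	if [ "$mw" -gt "$cw" ]; then
		hscale=$(( (cw - margin) * 100 / mw ))
	fi
	if [ "$mh" -gt "$ch" ]; then
		vscale=$(( (ch - margin) * 100 / mh ))
	fi
	# the smaller factor makes the page fit in both directions
	if [ "$hscale" -lt "$vscale" ]; then
		echo "$hscale"
	else
		echo "$vscale"
	fi
}

# Split a "WIDTHxHEIGHT" string as reported by identify.
dimension="2600x3700"
width=${dimension%%x*}    # drop the longest match of x* from the end
height=${dimension##*x}   # drop the longest match of *x from the start
fit_scale 2480 3508 "$width" "$height"
```

Integer division rounds down, so the scaled page is never larger than the canvas minus the margin.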
+==== findblack ====
+Find a suitable black threshold value with the following script:
+<code bash>
+#!/bin/sh
+# filename: findblack
+# This script helps to find a value for unpaper's black threshold (-b option) by
+# creating a series of pdf pages, each with a different black threshold setting. Comparing
+# the pages afterwards should make it easier to select a proper black threshold value. Values are
+# annotated at the top of each page.
+#
+# usage: findblack input-file [option] output-file
+#
+# option: -p [number] which page should be used as source for unpaper. If not specified, first page will be taken.
+# input-file: a pdf file containing one or more gray images.
+# output-file: a multipage pdf file with a series of black-threshold settings will be created.
+#
+# Example:
+# Process page 4 from mozart.pdf: ./findblack mozart.pdf -p 4 test.pdf
+#
+# This script uses imagemagick to convert and center an image on an a4 page. (With 2480x3508 pixels at a resolution of 300 dpi)
+# and ghostscript to create a small pdf file.  Piping (|) between commands that read standard input
+# and write standard output (-) keeps the number of temporary files low.
+#
+# NB. This script is time consuming. So have a lot of patience.
+# Each page will take about 15 seconds to execute on a single core AMD64 3500+ cpu.
+# User text
+_fbhelp="Usage: findblack input-file [OPTION...] output-file\n\n  -p foo, processes page number foo\n          If this option is omitted page 1 will be processed by default\n\nExample:\nProcess page 5:\n  findblack mozart.pdf -p 5 result.pdf\nProcess the first page:\n  findblack mozart.pdf result.pdf"
+_fbspdf="please supply a pdf file"
+_fbcvrt="Can't find converted first page of pdf file. Check your pdf file"
+# Unpaper settings in hundredths: start value for black-threshold is 0.05 (a), end value is 0.20 (b), step = 0.05 (s)
+a=5
+b=20
+s=5
+donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
+# Other settings; $$ references the current PID
+ptmpf="tempfile_findblack_$$"
+psize="a4"
+pdim="2480x3508"
+pres="300"
+ptxt="black threshold:"
+pstr="-font Times-Roman -pointsize 36 -gravity NorthWest -annotate +5+5"
+# Default page number
+defpage=1
+# Find number of parameters passed
+if [ "$#" -le 1 ]; then
+	echo "$_fbhelp"
+	exit 1
+fi
+# Store input-file name temporarily, because it's deleted by shift command.
+arg1=$1
+# Check OPTIONS
+if [ "$#" -eq 3 ]; then
+	case "$2" in
+		* )
+			echo "$_fbhelp"
+			exit 1
+			;;
+	esac
+	elif [ "$#" -eq 4 ]; then
+		case "$2" in
+			-p )
+				shift
+				defpage=$2
+				shift
+				;;
+			* )
+				echo "$_fbhelp"
+				exit 1
+				;;
+		esac
+fi
+# only process further if there are two arguments left.
+if [ "$#" -eq 2 ]; then
+	if [ ! "$(identify "$arg1" | grep PDF)" ]; then
+		echo "$_fbspdf"
+		exit 1
+	else
+		# find last page (counting starts at 0); needed to match the name of the pdftoppm result file, which may have extra leading zeros
+		trail=""
+		lastPage="$(identify "$arg1" | tail -n 1 | sed 's/\(.*\)\[\(.*\)\].*/\2/')"
+		# test the largest range first, otherwise the first branch catches every value of 10 and above
+		if [ "$lastPage" -ge 10000 ]; then
+			trail="0000"
+		elif [ "$lastPage" -ge 1000 ]; then
+			trail="000"
+		elif [ "$lastPage" -ge 100 ]; then
+			trail="00"
+		elif [ "$lastPage" -ge 10 ]; then
+			trail="0"
+		fi
+		# convert the selected page to a pgm file with filename $ptmpf-<page>.pgm
+		pdftoppm -gray -r 300 -f $defpage -l $defpage "$arg1" $ptmpf
+		# check for successful conversion to .pgm
+		if [ ! -f "$ptmpf-$trail$defpage.pgm" ]; then
+			echo "$_fbcvrt"
+			rm -f "$ptmpf"-*.pgm
+			exit 1
+		else
+			pdflst=""
+			nmb=$a
+			while [ "$nmb" -le $b ]
+			do
+				# Cope with digits
+				if [ "$nmb" -le 9 ]; then
+					str1="0.0$nmb"
+				else
+					str1="0.$nmb"
+				fi
+				# process unpaper
+				unpaper -b $str1 $donot -t pbm $ptmpf-$trail$defpage.pgm $ptmpf.pbm
+				# Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf
+				# New to bash?
+				# Read from standard input or write to standard output by using a dash: -
+				# Sending result to next command with pipe: |
+				# Using this 'technique' we don't need to create temporary files.
+				convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center $ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
+				pdflst="$pdflst $ptmpf-$nmb.pdf"
+				nmb=$(($nmb+$s))
+				rm $ptmpf.pbm
+			done
+			rm $ptmpf-$trail$defpage.pgm
+			# Merge all pdf files in $pdflst into one single file with filename $2
+			gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
+			rm $pdflst
+		fi
+	fi
+fi
+exit 0
+</code>
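Two details of findblack can be tried separately from the image tools: building the threshold strings from integer counters, and locating the pdftoppm result file when its page number may be zero-padded. The sketch below is an illustration under assumptions, not part of the script. The function names thresholds and page_file are hypothetical, and the padding-independent lookup is an alternative to computing the padding in advance.

```shell
#!/bin/sh
# Hypothetical helpers, not part of findblack.

# Print the threshold series, one value per line.
# Arguments are given in hundredths (0..99): start, end, step.
thresholds() {
	nmb=$1
	while [ "$nmb" -le "$2" ]; do
		printf '0.%02d\n' "$nmb"   # 5 becomes 0.05, 20 becomes 0.20
		nmb=$(( nmb + $3 ))
	done
}

# Print the name of the .pgm file that belongs to a page number,
# regardless of how many leading zeros the file name carries.
# usage: page_file prefix page
page_file() {
	for f in "$1"-*.pgm; do
		[ -f "$f" ] || continue
		n=${f#"$1"-}             # drop the prefix and the dash
		n=${n%.pgm}              # drop the extension
		n=${n#"${n%%[!0]*}"}     # drop leading zeros
		if [ "$n" = "$2" ]; then
			echo "$f"
			return 0
		fi
	done
	return 1
}

thresholds 5 20 5
```

With a lookup like page_file, the file name is taken from what is actually on disk, so the result does not depend on the padding rule of the installed pdftoppm version.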
+===== Bugs =====
+[[software:unpaper_test:bugs|See the following page]]