Differences

This shows you the differences between two versions of the page.

--- software:unpaper_test [2009/03/11 05:16] – admin
+++ software:unpaper_test [2015/04/22 21:51] (current) – [findblack] admin
@@ Line 1: / Line 1: @@
-====== Unpaper test =====
+====== Unpaper test ======
 To get a better grasp of unpaper's many options, I ran several tests that show the differences between settings.\\ I scanned a lot of sheet music and removed the skew manually with another application. Everything was done in 8-bit gray at a resolution of 300 dpi.\\ The tests should show which unpaper settings give a decent automatic conversion to black and white, with despeckling and noise removal.
@@ Line 34: / Line 34: @@
 	str1="0.$nmb"
-	unpaper -b $str1 $donot -t pbm /tmp/$ptmpf-1.pgm /tmp/$ptmpf.pbm
+	unpaper -b $str1 $donot -t pbm $ptmpf-1.pgm $ptmpf.pbm
 	# convert inputfile option annotation outputfile
-	convert /tmp/$ptmpf.pbm -gravity NorthWest -annotate 0 "unpaper -b $str1 $donot" /tmp/$ptmpf_ca.pbm
+	convert $ptmpf.pbm -gravity center -background lightblue -font Open-Sans -pointsize 60 caption:"$str1 \n $donot" -composite ${ptmpf}_ca.pbm
-	convert -monochrome -density 300 -units PixelsPerInch /tmp/$ptmpf_ca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "/tmp/$ptmpf-$nmb.pdf"
+	convert -monochrome -density 300 -units PixelsPerInch ${ptmpf}_ca.pbm ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
-	rm /tmp/$ptmpf.pbm /tmp/$ptmpf_ca.pbm
+	rm $ptmpf.pbm ${ptmpf}_ca.pbm
-	pdflst="$pdflst /tmp/$ptmpf-$nmb.pdf"
+	pdflst="$pdflst $ptmpf-$nmb.pdf"
 	nmb=$(($nmb+1))
 done
@@ Line 83: / Line 83: @@
 ===== Scripting with unpaper =====
+==== gray2black ====
 Automatic conversion from pdf files, containing gray images, to black and white images with unpaper and its noise reduction and cleaning up features can be done with the following script:
 <code bash>
 #!/bin/sh
-# filename: findblack.sh
+# filename: gray2black
-# This script helps to find the value for unpaper's black threshold value (-b option)
+# This script processes a pdf file containing gray images with
+# unpaper, imagemagick and ghostscript to a black and white pdf file.
+#
+# Usage: gray2black input-file [option] output-file
 #
-# usage: findblack.sh <inputfile.pdf> <outputfile.pdf>
+# option: -b [value] specify black threshold value as being used in
+#                     unpaper with -b. If omitted, default value will
+#                     be used: -b 0.12
+# input-file: a pdf file containing sheetmusic with one or more gray
+#             images.
+# output-file: a multipage pdf file with a series of black-threshold
+#              settings will be created.
 #
-# <inputfile.pdf> a pdf file containing one or more gray images.
+# Example:
-# <outputfile.pdf> a multipage pdf file with a series of black-threshold settings will be created.
+# Process gray images in mozart.pdf: gray2black mozart.pdf -b 0.23 test.pdf
+#
+# This script uses imagemagick to convert and center an image on an
+# a4 page (2480x3508 pixels at a resolution of 300 dpi) and
+# ghostscript to create a small pdf file. Piping (|) between the
+# commands, combined with reading from standard input and writing to
+# standard output (-), keeps the number of tempfiles low.
 #
-# Observe the outputfile.pdf and find the page which has the best value for black-threshold. (Text is embedded)
+# NB. This script is time consuming. So have a lot of patience.
+# Each page will take about 15 seconds to execute on a single core
+# AMD64 3500+ cpu.
 #
-#
+# User text
-# Use imagemagick to convert and center an image on an a4 page. (With 2480x3508 pixels at a resolution of 300 dpi).
+_fbhelp="Usage: gray2black input-file [OPTION...] output-file\n\n  -b [value], specify black threshold value as being used in\n              unpaper with -b. If omitted, default value will\n              be used: -b 0.12.\n\nExample:\nProcess pdf file with unpaper black threshold 0.23:\n  gray2black mozart.pdf -b 0.23 result.pdf\nProcess pdf file with -b 0.12 settings:\n  gray2black mozart.pdf result.pdf"
-# Use ghostscript to create a small pdf file.  By using efficiently the piping method (|) in conjunction with
+_fbspdf="please supply a pdf file"
-# reading and writing to the standard input (-) we get:
+_fbcvrt="Check your pdf file. Does it contain gray images?"
+# Unpaper settings, start value for black-threshold is 0.12
+b_threshold="0.12"
+donot="--no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
+filter="-ni 10 -ls 8 -lp 4 -li 0.20"
+# Other settings
+ptmpf="tempfile_gray2black_$$"
+# Page dimensions
 psize="a4"
-pdim="2480x3508"
+horpix=2480
+verpix=3508
+pdim="$horpix"x"$verpix"
 pres="300"
-convert -size $pdim xc:white pbm:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center inputfile - ps:- | ps2pdf13 -sPAPERSIZE=$psize - output.pdf
-# This will take about 5 seconds for a single page to execute on a single core AMD64 3500+ cpu.
+# Only resize if absolutely necessary. Values for resizing are for left, right, top and bottom border.
+hborder=33
+vborder=33
+lhorpix=$horpix
+lverpix=$verpix
+# Store input-file name temporarily, because it's deleted by shift command.
+arg1=$1
+# Find number of parameters passed
+if [ "$#" -le 1 ]; then
+	echo "$_fbhelp"
+	exit 1
+fi
+# Check OPTIONS
+if [ "$#" -eq 3 ]; then
+	case "$2" in
+		* )
+			echo "$_fbhelp"
+			exit 1
+			;;
+	esac
+	elif [ "$#" -eq 4 ]; then
+		case "$2" in
+			-b )
+				shift
+				b_threshold=$2
+				shift
+				;;
+			* )
+				echo "$_fbhelp"
+				exit 1
+				;;
+		esac
+fi
+# only process further if there are two arguments left.
+if [ "$#" -eq 2 ]; then
+	if [ ! "$(identify "$arg1"|grep PDF)" ]; then
+		echo "$_fbspdf"
+		exit 1
+	else
+		# convert all pages to pgm files with filenames $ptmpf-*.pgm
+		pdftoppm -gray -r 300 "$arg1" $ptmpf
+		# Check whether the document needs resizing. Iterating through all pages first makes sure all pages get uniform resizing later.
+		resize=0
+		# Iterate through the files. They must be sorted numerically because pdftoppm generates files without a leading zero.
+		# pdftoppm puts a '-' delimiter before the page number, so cut off the part before it.
+		for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
+			dimension="$(identify -format "%[fx:w]x%[fx:h]" $ptmpf'-'$filenum)"
+			#  width: "${dimension%%x*}" -> From the end removes the longest part of dimension that matches x* and returns the rest.
+			#  height: "${dimension##*x}" -> From the beginning removes the longest part of dimension that matches *x and returns the rest.
+			if [ ${dimension%%x*} -gt $lhorpix ]; then
+				lhorpix=${dimension%%x*}
+				resize=1
+			fi
+			if [ ${dimension##*x} -gt $lverpix ]; then
+				lverpix=${dimension##*x}
+				resize=1
+			fi
+		done
+		# Resize document if necessary
+		if [ $resize -eq 1 ]; then
+			# pages are too tall, so scale to fit vertically
+			if [ $lverpix -gt $verpix ]; then
+				vscale=$(( ($verpix-30)*100/$lverpix ))
+			else
+				vscale=100
+			fi
+			# pages are too wide, so scale to fit horizontally
+			if [ $lhorpix -gt $horpix ]; then
+				hscale=$(( ($horpix-30)*100/$lhorpix ))
+			else
+				hscale=100
+			fi
+			# if scaling is necessary in both directions, use the smaller factor
+			if [ $hscale -lt $vscale ]; then scale=$hscale; else scale=$vscale; fi
+			echo Some images exceed the maximum size. Now scaling to "$scale"%.
+			# iterate through all files and resize all with the same percentage, use 8 bits per pixel
+			for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
+				mogrify -depth 8 -resize "$scale"% "$ptmpf-$filenum"
+				echo resizing file: "$ptmpf-$filenum"
+			done
+		fi
+		# apply unpaper onto each page
+		for filenum in $(ls $ptmpf-*.pgm | cut -d'-' -f2 | sort -n); do
+			# Parameter expansion: ${param%word} removes, from the end, the smallest part of param that matches word and returns the rest. (p.72)
+			unpaper -b $b_threshold $filter $donot -t pbm "$ptmpf-$filenum" "$ptmpf-"${filenum%.pgm}.pbm
+			rm "$ptmpf-$filenum"
+		done
+		# center pbm page on an a4 canvas, convert to pdf
+		pdflst=""
+		for filenum in $(ls $ptmpf-*.pbm | cut -d'-' -f2 | sort -n); do
+			convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center "$ptmpf-$filenum" - miff:- | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-"${filenum%.pbm}.pdf
+			rm "$ptmpf-$filenum"
+			pdflst="$pdflst $ptmpf-"${filenum%.pbm}.pdf
+		done
+		# merge all pages into one file
+		gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$2" $pdflst
+		rm $pdflst
+	fi
+fi
+exit 0
 </code>
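The resize step relies on two shell features: splitting a "WIDTHxHEIGHT" string with parameter expansion, and integer arithmetic for the scale percentage. The sketch below isolates both. The function name `scale_percent` and the sample dimensions are made up for illustration, and it assumes that each axis is checked on its own and that the smaller percentage wins; the 30-pixel margin is the one hard-coded in gray2black.

```sh
#!/bin/sh
# Hypothetical helper: scale_percent PAGE_W PAGE_H IMG_W IMG_H
# Prints the integer percentage (at most 100) that makes an image fit
# on the page, leaving a 30-pixel margin on the axis that overflows.
scale_percent() {
	page_w=$1; page_h=$2; img_w=$3; img_h=$4
	hscale=100
	vscale=100
	# only an axis that exceeds the page needs scaling
	if [ "$img_w" -gt "$page_w" ]; then
		hscale=$(( (page_w - 30) * 100 / img_w ))
	fi
	if [ "$img_h" -gt "$page_h" ]; then
		vscale=$(( (page_h - 30) * 100 / img_h ))
	fi
	# the smaller percentage satisfies both axes
	if [ "$hscale" -lt "$vscale" ]; then
		echo "$hscale"
	else
		echo "$vscale"
	fi
}

dimension="3000x3508"      # sample string in the format produced with identify -format
width=${dimension%%x*}     # remove the longest suffix matching x*  -> 3000
height=${dimension##*x}    # remove the longest prefix matching *x  -> 3508
scale_percent 2480 3508 "$width" "$height"
```

Shell arithmetic truncates, so the percentage is rounded down and the scaled page never ends up larger than the canvas.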
-Find the value for black threshold with following script:
+==== findblack ====
+Find a value for the black threshold with the following script:
 <code bash>
 #!/bin/sh
-# filename: findblack.sh
+# filename: findblack
 # This script helps to find an unpaper's black threshold value (-b option) by
 # creating a series of pdf files each with a different black threshold setting. By observing
@@ Line 116: / Line 253: @@
 # annotated on top of each page.
 #
-# usage: findblack.sh [optional: pagenumber] <inputfile.pdf> <outputfile.pdf>
+# usage: findblack input-file [option] output-file
 #
-# pagenumber: which page should be used as source for unpaper. If not specified, first page will be taken.
+# option: -p [number] which page should be used as source for unpaper. If not specified, first page will be taken.
-# <inputfile.pdf> a pdf file containing one or more gray images.
+# input-file: a pdf file containing one or more gray images.
-# <outputfile.pdf> a multipage pdf file with a series of black-threshold settings will be created.
+# output-file: a multipage pdf file with a series of black-threshold settings will be created.
 #
+# Example:
+# Process page 4 from mozart.pdf: ./findblack mozart.pdf -p 4 test.pdf
 #
 # This script uses imagemagick to convert and center an image on an a4 page. (With 2480x3508 pixels at a resolution of 300 dpi)
@@ Line 128: / Line 267: @@
 #
 # NB. This script is time consuming. So have a lot of patience.
-# The line: "convert -size $pdim xc:white pbm:- | ...... | ps2pdf13 -sPAPERSIZE=$psize - output.pdf" will take about
+# Each page will take about 15 seconds to execute on a single core AMD64 3500+ cpu.
-# 10 seconds for a single page to execute on a single core AMD64 3500+ cpu.
-# Written on march 2009, m.nijdam
+# User text
+_fbhelp="Usage: findblack input-file [OPTION...] output-file\n\n  -p foo, processes page number foo\n          If this option is omitted page 1 will be processed by default\n\nExample:\nProcess page 5:\n  findblack mozart.pdf -p 5 result.pdf\nProcess the first page:\n  findblack mozart.pdf result.pdf"
+_fbspdf="please supply a pdf file"
+_fbcvrt="Can't find converted first page of pdf file. Check your pdf file"
-# start value for black-threshold is 0.05
+# Unpaper settings, in hundredths: start value for black-threshold is 0.05 (a), end value is 0.20 (b), step is 0.05 (s)
 a=5
-# end value for black-threshold is 0.40
+b=20
-b=40
+s=5
 donot="--no-blurfilter --no-noisefilter --no-grayfilter --no-mask-scan --no-mask-center --no-deskew --no-wipe --no-border --no-border-scan --no-border-align --overwrite"
+# Other settings; $$ references the current PID
 ptmpf="tempfile_findblack_$$"
 psize="a4"
@@ Line 146: / Line 287: @@
 ptxt="black threshold:"
 pstr="-font Times-Roman -pointsize 36 -gravity NorthWest -annotate +5+5"
+# Default page number
+defpage=1
 # Find number of parameters passed
 if [ "$#" -le 1 ]; then
-	echo "Need input filename and output filename for pdf file"
+	echo "$_fbhelp"
 	exit 1
 fi
-ptst=`identify $1 | grep PDF`
+# Store input-file name temporarily, because it's deleted by shift command.
-if [ ! "$ptst" ]; then
+arg1=$1
-	echo "please supply a pdf file"
+# Check OPTIONS
+if [ "$#" -eq 3 ]; then
+	case "$2" in
+		* )
+			echo "$_fbhelp"
+			exit 1
+			;;
+	esac
+	elif [ "$#" -eq 4 ]; then
+		case "$2" in
+			-p )
+				shift
+				defpage=$2
+				shift
+				;;
+			* )
+				echo "$_fbhelp"
+				exit 1
+				;;
+		esac
+fi
+# only process further if there are two arguments left.
+if [ "$#" -eq 2 ]; then
+	if [ ! "$(identify "$arg1" | grep PDF)" ]; then
+		echo "$_fbspdf"
+		exit 1
 	else
+		# find the index of the last page (identify counts from 0); needed to address the pdftoppm result file, whose page number may be zero-padded
+		# NB. the padding below is only right for a single-digit page number
+		trail=""
+		lastPage="$(identify "$arg1" | tail -n 1 | sed 's/\(.*\)\[\(.*\)\].*/\2/')"
+		# test the largest threshold first, otherwise the first branch would catch every larger value too
+		if [ "$lastPage" -ge 9999 ]; then
+			trail="0000"
+		elif [ "$lastPage" -ge 999 ]; then
+			trail="000"
+		elif [ "$lastPage" -ge 99 ]; then
+			trail="00"
+		elif [ "$lastPage" -ge 9 ]; then
+			trail="0"
+		fi
 		# convert the selected page to a pgm file with filename $ptmpf-<page>.pgm
-		pdftoppm -gray -r 300 -f 1 -l 1 $1 /tmp/$ptmpf
+		pdftoppm -gray -r 300 -f $defpage -l $defpage "$arg1" $ptmpf
 		# check for successful conversion to .pgm
-		if [ ! -f "/tmp/$ptmpf-1.pgm" ]; then
+		if [ ! -f "$ptmpf-$trail$defpage.pgm" ]; then
-			echo "Can't find converted first page of pdf file. Check your pdf file"
+			echo "$_fbcvrt"
-			rm "/tmp/$ptmpf-1.pgm"
+			rm -f "$ptmpf-$trail$defpage.pgm"
 			exit 1
 		else
 			pdflst=""
 			nmb=$a
 			while [ "$nmb" -le $b ]
@@ Line 176: / Line 358: @@
 					str1="0.$nmb"
 				fi
 				# process unpaper
-				unpaper -b $str1 $donot -t pbm /tmp/$ptmpf-1.pgm /tmp/$ptmpf.pbm
+				unpaper -b $str1 $donot -t pbm $ptmpf-$trail$defpage.pgm $ptmpf.pbm
 				# Create white A4 size canvas -> center pbm file into canvas -> convert pbm to ps -> convert ps to pdf
 				# New to bash?
@@ Line 185: / Line 367: @@
 				# Sending result to next command with pipe: |
 				# Using this 'technique' we don't need to create temporary files.
-				convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center /tmp/$ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "/tmp/$ptmpf-$nmb.pdf"
+				convert -size $pdim xc:white miff:- | composite -density $pres -units PixelsPerInch -compose atop -gravity Center $ptmpf.pbm - miff:- | convert - $pstr "$ptxt $str1" - | convert -monochrome -density 300 -units PixelsPerInch - ps:- | ps2pdf13 -sPAPERSIZE=a4 - "$ptmpf-$nmb.pdf"
-				pdflst="$pdflst /tmp/$ptmpf-$nmb.pdf"
+				pdflst="$pdflst $ptmpf-$nmb.pdf"
-				nmb=$(($nmb+1))
+				nmb=$(($nmb+$s))
-				rm /tmp/$ptmpf.pbm
+				rm $ptmpf.pbm
 			done
-			rm /tmp/$ptmpf-1.pgm
+			rm $ptmpf-$trail$defpage.pgm
 			# Merge all pdf files in $pdflst into one single file with filename $2
 			gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=$2 $pdflst
@@ Line 197: / Line 379: @@
 		fi
 	fi
+fi
 exit 0
 </code>
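The series of thresholds that findblack tries follows from three integers counted in hundredths: a start, an end and a step. A minimal sketch of that loop is given below. The function name `threshold_series` is hypothetical, and `printf` with zero-padding is used here as a stand-in for the script's own formatting of values below 0.10, which is not visible in this diff. The sketch only holds for values below 100 hundredths.

```sh
#!/bin/sh
# Hypothetical helper: threshold_series START END STEP
# All arguments are integers in hundredths (5 means 0.05).
# Prints one value per line, suitable for unpaper's -b option.
threshold_series() {
	nmb=$1
	while [ "$nmb" -le "$2" ]; do
		# %02d pads single digits with a zero, so 5 becomes 0.05 and not 0.5
		printf '0.%02d\n' "$nmb"
		nmb=$((nmb + $3))
	done
}

# the values from the script: a=5, b=20, s=5
threshold_series 5 20 5
```

With these settings the output pdf gets four pages, one per threshold, which keeps the run time manageable compared to stepping through every hundredth.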
+===== Bugs =====
+[[software:unpaper_test:bugs|See the following page]]