PDF documents <OT>

Bill Campbell linux-sxs
Thu Sep 14 14:01:22 PDT 2006


On Thu, Sep 14, 2006, Ric Moore wrote:
>I've finally managed to scan a series of documents, so I have 9 scanned
>images saved as PDF documents. How would I go about making one PDF
>document that contains each of them saved as one page each? I'm trying
>OpenOffice but I cannot find the command to insert a new page to copy
>the image to. I RTFM but cannot find it. I know this is slightly
>off-topic, but I could sure use some help. Any other method would be
>welcome as well. Ric

It's been a while since I was extensively involved with scanning
and OCR of documents, so this may be somewhat dated.

The OCR software I've used with Linux (Vividata) requires TIFF input
or perhaps other standard image formats for conversion.  The gocr
program supports PostScript and a wide variety of image formats.

I think you could probably use pdf2ps to convert the PDF file to
PostScript, then use convert from ImageMagick to create single-page
image files: ``convert -monochrome document.ps tif:pages''.
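
For anyone who wants to script those two steps, here is a rough
Python sketch of the same pipeline.  It assumes pdf2ps (from
Ghostscript) and convert (from ImageMagick) are on the PATH; the
function names build_commands and pdf_to_images are made up for
illustration, and the convert arguments are simply the ones quoted
above, not something I have re-tested.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_pattern="tif:pages"):
    """Return the two command lines: pdf2ps first, then convert.

    out_pattern defaults to the output argument quoted in the post;
    pass a different one if your ImageMagick needs it.
    """
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")  # document.pdf -> document.ps
    return [
        ["pdf2ps", str(pdf), str(ps)],
        ["convert", "-monochrome", str(ps), out_pattern],
    ]


def pdf_to_images(pdf_path, out_pattern="tif:pages"):
    """Run both steps, stopping at the first command that fails."""
    for cmd in build_commands(pdf_path, out_pattern):
        subprocess.run(cmd, check=True)


# Example (needs pdf2ps and convert installed):
# pdf_to_images("document.pdf")
```

If convert ends up writing one multi-page TIFF instead of one file
per page, adding ImageMagick's +adjoin option with a numbered output
name is the usual way to split it, but treat that as a suggestion to
try rather than a tested recipe.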

I played a bit with converting some court documents to HTML for
Groklaw, using Readiris Pro to OCR the PDF file and save it as an
RTF document (which required a fair amount of manual fiddling to
get rid of the line numbers and keep only the text of the
document).  I then loaded that RTF into OpenOffice.org and into
Microsoft Word from Office 2004 for Mac, saving it as HTML.
Finally I ran the HTML through a Python filter I wrote that
removes all the fancy formatting and fonts inserted by M$Word and
OpenOffice.org to create clean HTML.  The results are
available for viewing here:

	http://www.celestial.com/Members/bill/Drafts/pdf2html/
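
To give an idea of what such a cleanup filter can look like, here
is a minimal sketch using only Python's standard html.parser
module.  It is not the filter mentioned above; the class name
Cleaner, the function clean_html, and the lists of tags and
attributes to drop are all assumptions chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"font", "span"}    # wrapper tags removed, their text kept
DROP_ATTRS = {"style", "class", "lang", "face", "size", "color"}
SKIP_CONTENT = {"style"}        # elements dropped together with content


class Cleaner(HTMLParser):
    """Re-emit HTML without presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROP_TAGS:
            return
        kept = [(k, v) for k, v in attrs if k not in DROP_ATTRS]
        attr_text = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit only the opening form.
        if tag not in SKIP_CONTENT:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in DROP_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(text):
    """Return text with the listed tags and attributes stripped."""
    parser = Cleaner()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Comments and the doctype are silently dropped as well, since the
sketch does not override those handlers.  The drop lists would need
tuning against real word-processor output.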

Bill
--
INTERNET:   bill at Celestial.COM  Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/  PO Box 820; 6641 E. Mercer Way
FAX:            (206) 232-9186  Mercer Island, WA 98040-0820; (206) 236-1676

The day-to-day travails of the IBM programmer are so amusing to most of
us who are fortunate enough never to have been one -- like watching
Charlie Chaplin trying to cook a shoe.


