pdf data recovery

Kurt Wall kwall
Fri Nov 5 09:12:38 PST 2004


On Fri, Nov 05, 2004 at 06:01:48AM -0800, Shawn Tayler took 14 lines to write:
> Hi Guys,
> 
> Have any of you ever been able to recover text data from a pdf file?  I
> have a pdf doc with a 4 column listing on it.  I need to recover the data
> in those 4 columns.  Typing it all back in is not really an option.  The
> original doc that was used to make the pdf is not available.

pdftotext should do. I don't know how well it would handle columnar data,
but at least you'd get the text. You might try:

$ pdftotext -layout -nopgbrk infile.pdf 

-layout tries to preserve layout (columns, tables, etc.)
-nopgbrk doesn't insert ^L characters to signal page breaks
infile.pdf is your input PDF file; the output will be named infile.txt

By default, pdftotext outputs paragraphs without newlines, so I typically
pipe the output through fmt. Giving "-" as the output file name sends the
text to standard output. Thus:

$ pdftotext -layout -nopgbrk infile.pdf - | fmt > infile.txt
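For a four-column listing like the one in the question, fmt may join
adjacent rows when it refills lines, so it can be worth splitting the
-layout output into fields instead. What follows is only a rough sketch:
layout_to_tsv is a made-up helper name, and it assumes the columns are
separated by two or more spaces while the text inside a cell never
contains more than one space in a row. Real PDFs may not be that tidy.

```shell
# Hypothetical helper: convert "pdftotext -layout" text on stdin
# into tab-separated fields on stdout.
# Assumption: columns are separated by runs of 2+ spaces.
layout_to_tsv() {
    awk '
        {
            sub(/^ +/, "")        # drop leading indentation
            sub(/ +$/, "")        # drop trailing spaces
            if ($0 == "") next    # skip blank lines
            gsub(/  +/, "\t")     # 2+ spaces become one tab
            print
        }
    '
}
```

With the function defined in your shell, it would be used like this:

$ pdftotext -layout -nopgbrk infile.pdf - | layout_to_tsv > infile.tsv

The resulting file should load into a spreadsheet as tab-delimited text,
though rows with empty cells may end up with fewer than four fields and
need fixing by hand.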

Kurt
-- 
"In short, N is Richardian if, and only if, N is not Richardian."

