<div dir="ltr"><div>I used pdfsandwich in a project. I set up a Raspberry Pi, found all of pdfsandwich's dependencies, and wrote a script to automate the work (I had over 200 files to process). The script ran OCR on each file I put in the "original" directory and added the recognized text to the PDF. If a file was processed without errors, it was moved to a "Done" directory; otherwise it went to a "Fail" directory. The results weren't perfect, but they were close enough for my needs.</div><div><br></div><div>Laura Brody<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 12, 2022 at 8:38 PM Cesar Baquerizo via Filepro-list <<a href="mailto:filepro-list@lists.celestial.com">filepro-list@lists.celestial.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Look for pdfsandwich. That should do what you want. Lots of info at the site. <br>

<br>

Regards<br>


> On Jan 12, 2022, at 8:28 PM, Jose Lerebours via Filepro-list <<a href="mailto:filepro-list@lists.celestial.com" target="_blank">filepro-list@lists.celestial.com</a>> wrote:<br>

> <br>

> I have a GSA that wants data extracted from PDF documents, most of which are scanned<br>

> documents saved as PDF; which in essence makes them images saved as PDF.<br>

> <br>

> I have written code in PHP to save the PDF to PNG and extract TEXT from PNG but this is not proving<br>

> to be reliable since lots of characters are read wrong or not read at all.<br>

> <br>

> It is like pulling teeth, I want this done but do not ask me to get you "true" PDFs, the scanned<br>

> documents are all I can get ... type of scenario.<br>

> <br>

> So, my question is: is anyone here successfully extracting data from scanned documents and if so,<br>

> what are you using?<br>

> <br>

> Regards,<br>

> <br>

> <br>

> -- <br>

> Jose Lerebours<br>

> 954-559-7186<br>

> <a href="https://www.asisuites.com" rel="noreferrer" target="_blank">https://www.asisuites.com</a><br>

> Accounting - Retail - Wholesale - Distribution<br>

> Manufacturing - Warehousing - Transportation - eCommerce - Web Development<br>

> <br>

> _______________________________________________<br>

> Filepro-list mailing list<br>

> <a href="mailto:Filepro-list@lists.celestial.com" target="_blank">Filepro-list@lists.celestial.com</a><br>

> Subscribe/Unsubscribe/Subscription Changes<br>

> <a href="http://mailman.celestial.com/mailman/listinfo/filepro-list" rel="noreferrer" target="_blank">http://mailman.celestial.com/mailman/listinfo/filepro-list</a><br>


</blockquote></div>
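<div>For anyone who wants to try the batch workflow described at the top of this message, here is a rough sketch of how such a script might look in Python. It is not the script that was actually used. The directory names follow the message, but everything else is an assumption: that the <code>pdfsandwich</code> executable is on the PATH and accepts <code>-o</code> for the output file, that the OCR'd copy is saved as <code>name_ocr.pdf</code> next to the moved original, and the helper names <code>run_pdfsandwich</code> and <code>process_directory</code>, which are made up for illustration.</div>

```python
import shutil
import subprocess
from pathlib import Path


def run_pdfsandwich(pdf_path, out_path):
    """Run OCR on one PDF; return True only if the tool exited cleanly."""
    # Assumption: pdfsandwich is installed and takes "-o OUTPUT INPUT".
    try:
        result = subprocess.run(
            ["pdfsandwich", "-o", str(out_path), str(pdf_path)],
            capture_output=True,
        )
    except OSError:
        # The executable is missing or could not be started.
        return False
    return result.returncode == 0 and out_path.exists()


def process_directory(original, done, fail, ocr=run_pdfsandwich):
    """OCR every PDF in `original`, sorting the inputs into `done` or `fail`.

    Returns a dict mapping each input file name to True (success) or False.
    """
    original, done, fail = Path(original), Path(done), Path(fail)
    done.mkdir(parents=True, exist_ok=True)
    fail.mkdir(parents=True, exist_ok=True)

    results = {}
    for pdf in sorted(original.glob("*.pdf")):
        # Hypothetical naming: the searchable copy sits beside the original.
        out_path = done / (pdf.stem + "_ocr.pdf")
        ok = ocr(pdf, out_path)
        if ok:
            shutil.move(str(pdf), str(done / pdf.name))
        else:
            # Discard any partial output so "Done" holds only good files.
            out_path.unlink(missing_ok=True)
            shutil.move(str(pdf), str(fail / pdf.name))
        results[pdf.name] = ok
    return results


if __name__ == "__main__":
    for name, ok in process_directory("original", "Done", "Fail").items():
        print(name, "ok" if ok else "FAILED")
```

<div>The OCR step is passed in as a function so the sorting logic can be tried out with a stand-in before pointing it at a few hundred real scans. Files that land in "Fail" can simply be copied back into "original" and rerun.</div>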