MavEtJu's Distorted View of the World

pdftohtml on print-only PDF files

Posted on 2006-08-14 19:29:54, modified on 2006-08-14 19:35:59
Tags: Broken software

One of the jobs I have at the moment is parsing PDF files. Very easy: convert the PDF to HTML, convert the HTML to text and then parse it. Only sometimes the PDF files are protected:

[~] edwin@k7>pdftohtml 3798b854f6245a5b98ec0344aefd44b1.pdf
Error: Copying of text from this document is not allowed.

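Luckily there is an easy solution for this: convert the PDF to PostScript and back again, which gives a new PDF without the restriction (note that ps2pdf overwrites the original file here):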

[~] edwin@k7>pdf2ps 3798b854f6245a5b98ec0344aefd44b1.pdf 
[~] edwin@k7>ps2pdf 3798b854f6245a5b98ec0344aefd44b1.ps 
[~] edwin@k7>pdftohtml 3798b854f6245a5b98ec0344aefd44b1.pdf
Page-1
Page-2
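
The three commands above could be wrapped in a small shell function. This is only a sketch: the function name convert_printed_pdf and the "-printed" suffix are made up for illustration, and it assumes pdf2ps, ps2pdf and pdftohtml are on the PATH. Unlike the transcript above, it writes the regenerated PDF to a new file, so the original is left alone.

```shell
# convert_printed_pdf is a hypothetical helper name, not part of any tool.
# Usage: convert_printed_pdf file.pdf
convert_printed_pdf() {
    src=$1
    base=${src%.pdf}            # strip the .pdf extension
    printed=$base-printed.pdf   # assumed name for the regenerated PDF

    # PDF -> PostScript -> PDF, then run pdftohtml on the regenerated PDF.
    # The chain stops at the first command that fails.
    pdf2ps "$src" "$base.ps" &&
        ps2pdf "$base.ps" "$printed" &&
        rm -f "$base.ps" &&
        pdftohtml "$printed"
}
```

Calling convert_printed_pdf 3798b854f6245a5b98ec0344aefd44b1.pdf would then leave the original PDF in place and run pdftohtml on 3798b854f6245a5b98ec0344aefd44b1-printed.pdf.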

Mission accomplished! :-P
