Christiaan Hofman
2016-02-21 22:54:52 UTC
Hi everybody,
I have a problem with converting notes on scanned PDFs. Whenever I convert Skim notes to regular annotated PDFs, the character encoding of a scanned and OCR'd PDF gets lost. Likewise, when I convert PDFs that have been annotated in Adobe Reader to Skim notes, the encoding gets lost (and the annotations show up empty). I believe it only concerns those PDFs that have been OCR'd using Acrobat's ClearScan feature. (A few years ago I started using Acrobat's ClearScan feature for vectorized OCR to resolve scrolling issues on my Mac.)
https://discussions.apple.com/message/24531613#24531613
Since Skim is so smart to not mess with the original file, it all worked smoothly for years, until ... (I know I shouldn't have bought that damn iPad ...) And now trying to have my cake and eat it too.
So I'm trying to figure out where the issue lies. Acrobat Pro (and Reader) can annotate, save, and modify any OCR'd PDF perfectly fine, be they vectorized or not. But as soon as PDFKit comes in their way things get messy. I don't know whom to blame here, Adobe or Apple, but either Preview is just buggy (it can open and search text in those files perfectly fine, only upon annotating & saving it screws things up) or Adobe uses some cryptic Adobe format for ClearScanned PDFs that others are unable to deal with, or both ... I did try to export a PDF that Adobe had vectorized to PDF/X and it wouldn't do it, some preflight error.
Unfortunately I understand next to nothing about PDFs and how exactly they work, so I'm completely lost with this and looking for any inspiration, a fix, a workaround or whatever (except for converting all the PDFs to images, reassembling them to a PDF and then doing a non-ClearScan OCR, which is not feasible).
Any help very much appreciated!
Jan
It should be clear that the problem lies with Apple's PDFKit, as it always happens when PDFKit generates the PDF data. It's known that PDFKit sometimes has problems with encodings, especially for OCR'ed documents, and has had them for as long as PDFKit has existed (since 10.4). As the problem is in PDFKit, there is no workaround. The best you can do is file a bug report with Apple. But given that this bug has existed since 10.4 and they must have known about it just as long, well, you get some idea how responsive they are.
Christiaan