Discussion:
[Skim-app-users] Converting notes of OCR'd files
Christiaan Hofman
2016-02-21 22:54:52 UTC
Permalink
Hi everybody,
I have a problem with converting notes on scanned PDFs. Whenever I convert Skimnotes to regular annotated PDFs the character encoding of a scanned and OCR'd PDF gets lost. Likewise when I convert PDFs that have been annotated on Adobe Reader to skimnotes the encoding gets lost (and the annotations show up empty). I believe it only concerns those PDFs that have been OCR'd using Acrobat's ClearScan feature. (A few years ago I started using Acrobat's ClearScan feature for vectorized OCR to resolve scrolling issues on my Mac.)
https://discussions.apple.com/message/24531613#24531613
Since Skim is smart enough not to mess with the original file, it all worked smoothly for years, until ... (I know I shouldn't have bought that damn iPad ...) And now I'm trying to have my cake and eat it too.
So I'm trying to figure out where the issue lies. Acrobat Pro (and Reader) can annotate, save, and modify any OCR'd PDF perfectly fine, be they vectorized or not. But as soon as PDFKit gets in the way things get messy. I don't know whom to blame here, Adobe or Apple, but either Preview is just buggy (it can open and search text in those files perfectly fine; only upon annotating & saving does it screw things up) or Adobe uses some cryptic Adobe format for ClearScanned PDFs that others are unable to deal with, or both ... I did try to export a PDF that Adobe had vectorized to PDF/X and it wouldn't do it; there was some preflight error.
Unfortunately I understand next to nothing about PDFs and how exactly they work, so I'm completely lost with this and looking for any inspiration, a fix, a workaround or whatever (except for converting all the PDFs to images, reassembling them to a PDF and then doing a non-Clearscan OCR — not feasible).
Any help very much appreciated!
Jan
It should be clear that the problem lies with Apple’s PDFKit, as it always happens when PDFKit generates the PDF data. PDFKit is known to have problems with encodings at times, especially for OCR’ed documents, and has had them for as long as it has existed (since 10.4). As the bug is in PDFKit, there is no workaround. The best you can do is file a bug report with Apple. But given that this bug has existed since 10.4 and they must have known about it for as long, well, you get some idea of how responsive they are.

Christiaan
Jan David Hauck
2016-02-22 00:56:56 UTC
Permalink
I had one idea, but I'm not sure whether it might work: Do you think it
would be possible to keep a copy of the original PDF, then extract the
notes from the copy, trash that PDF and attach the notes to the unmodified
original as skim notes?
That can certainly work, as you keep the original PDF data.
The problem is that when Skim does the conversion, it doesn't even get me
the content of the notes. Before saving the file in Preview they must be
still there, since not only Acrobat but Preview as well displays notes plus
note text correctly. So if there was a way of extracting the notes before
modifying the PDF (just leaving the original Preview notes as they
are), maybe that could work?
I don’t really get what you’re saying here. We do extract the notes before
modifying the PDF, where else would we get it? (After modifying, there are
no notes to extract).
I am also not clear as to what “content” you are talking about that is
lost. The content of the notes is part of the note, and we just get that
when extracting. Perhaps you are confused with the text that may be
highlighted? That is not the content, it really is not part of the data of
the note.
By "content" I meant the value of the note (what you can modify in Skim by
clicking the note), i.e. the text that is read from the PDF upon creating the
note in Skim (highlights, underlines, etc.), and which "should" be read
upon extracting the notes.
Unfortunately the notes show up "empty," i.e., without content (except for
words in italics) in Skim. I've done the following: I opened the
modified PDF in Skim, converted the notes, saved in order to get the Skim
notes backup file, replaced the PDF with the original, and then opened the
original in Skim, reattaching the notes. The notes are empty but the PDF text
is readable (i.e., re-highlighting it copies the text into the note correctly)
:(
So upon extracting, or before extracting, PDFKit must have already screwed
up the PDF. When opening in Preview the notes' content is still visible
but it doesn't show up in the Skimnotes.

If you want to check, here's an example pdf:
https://www.dropbox.com/s/n4mz2zy620kri0i/Testfile.pdf?dl=0
and here is the extracted skimnotes file from that PDF:
https://www.dropbox.com/s/rcgnr072uevxmuj/Testfile.skim?dl=0
Or, if that doesn't work, since the notes' placement is extracted
correctly, is there a way after reattaching the notes to the original to
"read" the content that the notes point to and repopulate the notes with
that content?
If you mean the highlighted text rather than the actual content, that
certainly is not done automatically, and should not be done. You may go
over it using AppleScript perhaps, but that depends on whether the text is
readable by PDFKit.
Yes, I mean reinserting the highlighted text as content into the note. It
shouldn't be done automatically, of course, but if the above (preserving
the note content upon extracting) doesn't work, then maybe scripting this
would be a workaround. I just have no idea how to do that.
BTW, there is one bizarre fact I just figured out: the encoding does not
get messed up for words in italics. It's really weird, but I've tested
multiple documents, and whenever italics are recognized as such by the OCR
they preserve their encoding (which would also suggest the problem is on
Apple's end, not on Adobe's).
It seems to have to do with the font, something PDFKit is often very bad
at.
Christiaan
And I believe the issue is caused (as always) by PDFKit: When I open
such a PDF in Preview, as soon as I do a single annotation and save and
then close and open again, encoding is lost. This seems to have been a
https://discussions.apple.com/message/24531613#24531613
_______________________________________________
Skim-app-users mailing list
https://lists.sourceforge.net/lists/listinfo/skim-app-users
Christiaan Hofman
2016-02-22 01:09:44 UTC
Permalink
So upon extracting, or before extracting, PDFKit must have already screwed up the PDF. When opening in Preview the notes' content is still visible but it doesn't show up in the Skimnotes.
When converting notes we try to fill them with the highlighted text. But we get that text from the PDF’s text content using PDFKit. That is not always reliable: perhaps the location of the text is not accurately reflected in the PDF, and then we won’t find it. You should realize that in the PDF there is really no relation between the highlight note and the highlighted text, they just “happen” to be at the same location. My guess is that the bounds of the OCR’ed text are slightly bigger than the note, and then PDFKit can’t find it because it does not fall into the area where we are looking for it, which is the area covered by the note.
Yes, I mean reinserting the highlighted text as content into the note. It shouldn't be done automatically, of course, but if the above (preserving the note content upon extracting) doesn't work, then maybe scripting this would be a workaround. I just have no idea how to do that.
Well, it is not necessarily something that “should” be done. We just try to do it as well as we can. But we need PDFKit to get the text, so the result is only as good as what PDFKit can give us. And if your conversion process does not get it, AppleScript won’t get it either (because it does basically the same thing).

Christiaan
Jan David Hauck
2016-02-22 01:52:35 UTC
Permalink
When converting notes we try to fill them with the highlighted text. But we
get that text from the PDF’s text content using PDFKit. That is not always
reliable: perhaps the location of the text is not accurately reflected in
the PDF, and then we won’t find it. You should realize that in the PDF
there is really no relation between the highlight note and the highlighted
text, they just “happen” to be at the same location. My guess is that the
bounds of the OCR’ed text are slightly bigger than the note, and then PDFKit
can’t find it because it does not fall into the area where we are looking
for it, which is the area covered by the note.
But then how do Preview and Acrobat "see" the text and display it
correctly? At least Preview, since that's also using PDFKit?
I just tried a few things (this is all getting weirder):
Apparently Adobe Reader for iPad and Acrobat Pro on the Mac use different
ways of highlighting the PDFs. For a PDF highlighted with Adobe Reader for
iPad, the Skim conversion is not able to create the content of the notes,
but for a PDF highlighted by Acrobat Pro on the Mac, it does (encoding
still gets messed up, but at least my workaround above might work). Also
it looks like Adobe Reader highlights word by word, whereas Acrobat
highlights full sentences. So now I'll try to find a different iPad PDF
reader that will produce readable notes. I'll post any results later.

It's still a puzzle though, why Preview can see and display the text
correctly (before saving it)?

In any case, thanks so much for your help and clarifications (as always)!

J
Christiaan Hofman
2016-02-22 10:26:10 UTC
Permalink
But then how do Preview and Acrobat "see" the text and display it correctly? At least Preview, since that's also using PDFKit?
Perhaps they are more lenient in their attempt to figure out the text, i.e. they may search a somewhat expanded area for text. That can also be a downside, because they may find too much text (e.g. one or more characters before or after the highlight).

Also you should realize that Preview has two very big advantages over Skim in this respect: they know *how* PDFKit finds text in an area, so they can optimally adapt their guess to find the text behind the notes. And they only target one single OS version, while we support several. And unfortunately Apple’s PDFKit is very inconsistent between OS versions: sometimes it defines text in a rectangle to be text that is *completely inside* the rectangle, while on other OS versions it is defined as text that *intersects* with the rectangle.
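To make the difference between those two rules concrete, here is a minimal Python sketch. It does not call PDFKit at all: the Rect type, the word boxes, and all the numbers are made up for illustration, and the two functions only model "completely inside" versus "intersects".

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle: origin (x, y) plus width and height."""
    x: float
    y: float
    width: float
    height: float


def contains(outer: Rect, inner: Rect) -> bool:
    """Rule 1: inner must lie completely inside outer."""
    return (inner.x >= outer.x
            and inner.y >= outer.y
            and inner.x + inner.width <= outer.x + outer.width
            and inner.y + inner.height <= outer.y + outer.height)


def intersects(a: Rect, b: Rect) -> bool:
    """Rule 2: it is enough that the two rectangles overlap."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def text_in_rect(words, area, rule):
    """Return the words whose bounds match the area under the given rule."""
    return [text for text, bounds in words if rule(area, bounds)]


# Hypothetical highlight note and two hypothetical word boxes on the page.
highlight = Rect(100, 200, 80, 12)
words = [
    ("scanned", Rect(100, 201, 40, 10)),  # fits entirely inside the highlight
    ("PDF", Rect(142, 199, 30, 14)),      # slightly taller than the highlight
]

print(text_in_rect(words, highlight, contains))    # ['scanned']
print(text_in_rect(words, highlight, intersects))  # ['scanned', 'PDF']
```

The same note over the same text yields different results depending on which rule is in force, which is why a search area tuned for one rule can miss text under the other.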
I just saw the following in one particular PDF: when I save a PDF using PDFKit and then open it in Skim, the size of the text (not the visible size, just the registered size) becomes bigger. That means that for a highlight note in one version of the PDF the text will be *inside* the area of the highlight (and we will find it), but in the converted PDF that same highlight note will have its text expanded, and therefore the text will not fall inside its area (and we won’t find it). The size difference can be quite significant. There is no way to correct for that, because if we were much more lenient in order to get this bigger text, we would also get too much text for smaller fonts.
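The trade-off can be sketched in a few lines of Python. This is only an illustration under made-up numbers, not Skim's actual code: boxes are hypothetical (x0, y0, x1, y1) tuples, and "leniency" is modeled as simply padding the search area by a fixed margin.

```python
def inside(area, box):
    """True if box (x0, y0, x1, y1) lies completely within area."""
    return (area[0] <= box[0] and area[1] <= box[1]
            and box[2] <= area[2] and box[3] <= area[3])


def pad(area, margin):
    """Grow the search area by margin on every side (the 'lenient' search)."""
    return (area[0] - margin, area[1] - margin,
            area[2] + margin, area[3] + margin)


def find_text(words, area):
    """Return the words whose registered bounds fall inside the area."""
    return [text for text, box in words if inside(area, box)]


# Hypothetical highlight note; all coordinates are invented.
highlight = (100, 200, 180, 212)

# Same word before and after re-saving: the registered bounds have grown.
original = [("encoding", (102, 201, 178, 211))]
resaved = [("encoding", (102, 198, 178, 215))]

# A page set in a small font, where neighboring lines sit close together.
small_font = [
    ("above", (102, 213, 178, 216)),
    ("encoding", (102, 201, 178, 211)),
    ("below", (102, 196, 178, 199)),
]

lenient = pad(highlight, 4)

print(find_text(original, highlight))   # found with the strict area
print(find_text(resaved, highlight))    # missed: the bounds no longer fit
print(find_text(resaved, lenient))      # found again with the padded area
print(find_text(small_font, lenient))   # but now the neighbors come along too
```

A margin large enough to catch the grown bounds also swallows the adjacent lines on the small-font page, so no single margin works for both cases.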

Christiaan
Jan David Hauck
2016-02-23 04:49:44 UTC
Permalink
Thank you Christiaan, for taking the time to look into it. I've tried
other PDF readers for iPad, and it seems that Acrobat Reader for iPad
is just doing a bad job at highlighting. Highlights from either
Document 5 or Tiny PDF are converted very well to Skim notes. So please
don't worry about that.
I've now set up a duplicate of my main papers folder on Dropbox, which is
kept in sync via fswatch & rsync. Those PDFs will be annotated on the
iPad, and when I'm finished I drop them into an Import folder which is
synced back to my Mac.

Now may I ask for your expertise once more in order to get the reattachment
of the skimnotes to the original file working?
I'm working on a folder action script that whenever a file is added to the
Import folder will do the following: convert the notes to a skim notes
file, move that file to the directory where the original file is, and then
attach those notes to that file. I came up with the following so far:

on adding folder items to thisFolder after receiving theAddedItems
    -- Folder that holds the original, unmodified PDFs
    set papersFolder to "/Users/me/Documents/Academic/Library/PapersTestSync/"
    repeat with newFile in theAddedItems
        -- Folder actions pass aliases; shell commands need POSIX paths
        set newPath to POSIX path of newFile
        set fileName to name of (info for newFile)
        -- Locate the original PDF that has the same file name
        set originalFile to (do shell script "find " & quoted form of papersFolder & " -name " & quoted form of fileName)
        -- display dialog "I found the following file: " & originalFile

        do shell script "/Applications/Skim.app/Contents/SharedSupport/skimpdf unembed " & quoted form of newPath

        -- Now here I guess I need to specify which PDF to attach the notes to,
        -- and with this I am stuck.
    end repeat
end adding folder items to

I have successfully found the original file but how do I get the unembedded
notes attached to the file? I couldn't figure that out from the
documentation for skimpdf. With that I guess I should be able to make it
work.

Much appreciated!

Jan
Post by Jan David Hauck
Post by Jan David Hauck
I had one idea, but I'm not sure whether it might work: Do you think it
would be possible to keep a copy of the original PDF, then extract the
notes from the copy, trash that PDF and attach the notes to the unmodified
original as skim notes?
That can certainly work, as you keep the original PDF data.
The problem is that when Skim does the conversion, it doesn't even get
me the content of the notes. Before saving the file in Preview they must
be still there, since not only Acrobat but Preview as well displays notes
plus note text correctly. So if there was a way of extracting the notes
before before modifying the PDF (just leaving the original Preview notes as
they are), maybe that could work?
I don’t really get what you;re saying here. We do extract the notes
before modifying the PDF, where else would we get it? (After modifying,
there are no notes to extract).
I am also not clear as to what “content” you are talking about that is
lost. The content of the notes is part of the note, and we just get that
when extracting. Perhaps you are confused with the text that may be
highlighted? That is not the content, it really is not part of the data of
the note.
By "content" I meant the value of the note (what you can modify in Skim
by clicking the note.) The text which is read from the pdf upon creating
the note in Skim (highlights and underlines, etc.), and which "should" be
read upon extracting the notes.
Unfortunately the notes show up "empty," i.e., without content (except
for words in italics) in Skim. I've done the following: I've opened the
modified PDF in Skim, converted the notes, saved in order to get the Skim
notes backup file, replaced the PDF with the original and then open the
original in Skim reattaching the notes. Notes are empty but PDF text is
readable (i.e., re-highlighting it copies the text into the note correctly)
:(
So upon extracting, or before extracting, PDFKit must have already
screwed up the PDF. When opening in Preview the notes' content is still
visible but it doesn't show up in the Skimnotes.
When converting notes we try to fill it with the highlighted text. But we
get that text from the text context using PDFKit. That is not always
reliable, perhaps the location of the text is not accurately reflected in
the PDF, and then we won’t find it. You should realize that in the PDF
there is really no relation between the highlight note and the highlighted
text, they just “happen” to be at the same location.My guess is that the
bounds of the OCD’ed text is slightly bigger than the note, and then PDFKit
can’t find it because it does not fall into the area we are looking for it,
which is the area covered by the note.
But then how do Preview and Acrobat "see" the text and display it
correctly? At least Preview, since that'a also using PDFKit?
Perhaps they are more lenient in their attempt to figure out the text,
i.e. they may search a somewhat expanded area for text. That can also be a
downside, because they may find too much text (e.g. one or more characters
before or after the highlight).
Also you should realize that Preview has two very big advantages over Skim
in this respect: they know *how* PDFKit finds text in an area, so they can
optimally adapt their guess to find the text behind the notes. And they
only target one single OS version, while we support several. That matters
because PDFKit's behavior varies: on some OS versions it defines text in a
rectangle to be text that is *completely inside* the rectangle, while on
other OS versions it is defined as text that *intersects* with the rectangle.
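To make the difference between the two rules concrete, here is a minimal sketch in Python. It is not PDFKit code; the `Rect` class and the sample coordinates are hypothetical and only illustrate how the same highlight can find or miss a word depending on which rule is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def max_x(self) -> float:
        return self.x + self.width

    @property
    def max_y(self) -> float:
        return self.y + self.height

    def contains(self, other: "Rect") -> bool:
        """True if `other` lies completely inside this rectangle."""
        return (self.x <= other.x and self.y <= other.y
                and other.max_x <= self.max_x and other.max_y <= self.max_y)

    def intersects(self, other: "Rect") -> bool:
        """True if the two rectangles overlap in a region of non-zero area."""
        return (self.x < other.max_x and other.x < self.max_x
                and self.y < other.max_y and other.y < self.max_y)


# A highlight note, and a word whose registered bounds are one point
# taller than the highlight on each side (made-up numbers).
highlight = Rect(100.0, 500.0, 80.0, 12.0)
word = Rect(102.0, 499.0, 40.0, 14.0)

strict_hit = highlight.contains(word)     # "completely inside" rule
lenient_hit = highlight.intersects(word)  # "intersects" rule
print(strict_hit, lenient_hit)
```

With these numbers the word sticks out of the highlight vertically, so the "completely inside" rule misses it while the "intersects" rule finds it.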
Post by Jan David Hauck
https://www.dropbox.com/s/n4mz2zy620kri0i/Testfile.pdf?dl=0
https://www.dropbox.com/s/rcgnr072uevxmuj/Testfile.skim?dl=0
Or, if that doesn't work, since the notes' placement is extracted
correctly, is there a way after reattaching the notes to the original to
"read" the content that the notes point to and repopulate the notes with
that content?
If you mean the highlighted text rather than the actual content, that
certainly is not done automatically, and should not be done. You may go
over it using AppleScript perhaps, but that depends on whether the text is
readable by PDFKit.
Yes, I mean reinserting the highlighted text as content into the note.
It shouldn't be done automatically, of course, but if the above (preserving
the note content upon extracting) doesn't work, then maybe scripting this
would be a workaround. I only have no idea how to do that.
Well, it “should” not necessarily be done. We just try to do it as well
as we can. But we need PDFKit to get the text, so it is only as good as
PDFKit can get it. And if your conversion process does not get it,
AppleScript won’t get it either (because it does basically the same thing).
Apparently Adobe Reader for iPad and Acrobat Pro on the Mac use different
ways of highlighting the PDFs. For a PDF highlighted with Adobe Reader for
iPad, the Skim conversion is not able to create the content of the notes,
but for a PDF highlighted by Acrobat Pro on the Mac, it does (encoding
still gets messed up, but at least my workaround above might work). Also
it looks like Adobe Reader highlights word by word, whereas Acrobat
highlights full sentences. So now I'll try to find a different iPad PDF
reader that will produce readable notes. I'll post any results later.
It's still a puzzle though, why Preview can see and display the text
correctly (before saving it)?
In any case, thanks so much for your help and clarifications (as always)!
J
I just saw the following in one particular PDF: when I save a PDF using
PDFKit, and then open it in Skim, the size of the text (not the visible
size, just the registered size) becomes bigger. That means that in one
version of the same PDF the text of a highlight note will be *inside* the
area of the highlight (and we will find it), but in the converted PDF that
same highlight note will have its text expanded, and therefore it will not
fall inside its area (and we won’t find it). The size difference can be
quite significant. There is no way to correct for that, because if we were
much more lenient in order to get this bigger text, we would also get too
much text for smaller fonts.
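The trade-off can be sketched with a few made-up numbers. This is only an illustration in Python, not what Skim or PDFKit actually does; the helper names, the 4-point padding, and all coordinates are assumptions chosen for the example.

```python
def expand(rect, pad):
    """Grow an (x, y, width, height) rectangle by `pad` on every side."""
    x, y, w, h = rect
    return (x - pad, y - pad, w + 2 * pad, h + 2 * pad)


def inside(outer, inner):
    """True if `inner` lies completely within `outer`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def words_found(search_area, words):
    """Return the words whose bounds fall completely inside the search area."""
    return [text for text, bounds in words if inside(search_area, bounds)]


# Large font: after resaving, the registered bounds of the highlighted
# word have grown by 3 points on every side of the highlight.
large_highlight = (100.0, 500.0, 60.0, 12.0)
large_words = [("Highlighted", (97.0, 497.0, 66.0, 18.0))]

# Small font: the highlighted word fits, but a one-letter word sits
# just before it on the same line.
small_highlight = (100.0, 500.0, 30.0, 6.0)
small_words = [("a", (96.5, 500.5, 3.0, 5.0)),
               ("target", (101.0, 500.5, 28.0, 5.0))]

PAD = 4.0  # hypothetical leniency, in points

large_strict = words_found(large_highlight, large_words)
large_lenient = words_found(expand(large_highlight, PAD), large_words)
small_strict = words_found(small_highlight, small_words)
small_lenient = words_found(expand(small_highlight, PAD), small_words)
```

In this example the padding that recovers the enlarged word in the large-font case also pulls the neighbouring "a" into the small-font highlight, which is the "too much text" problem described above.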
Christiaan
_______________________________________________
Skim-app-users mailing list
https://lists.sourceforge.net/lists/listinfo/skim-app-users
Christiaan Hofman
2016-02-23 10:48:10 UTC
Permalink
Thank you Christiaan, for taking the time to look into it. I've tried other PDF readers for iPad, and it just seems that Acrobat Reader for iPad is doing a bad job at highlighting. Highlights from either Documents 5 or Tiny PDF are converted very well to Skim notes. So please don't worry about that.
I've now set up a duplicate of my main papers folder on Dropbox, which is synced via fswatch & rsync. Those PDFs will be annotated on the iPad, and when finished I drop them into an Import folder which is synced back to my Mac.
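Now may I ask for your expertise once more in order to get the reattachment of the Skim notes to the original file working?
I'm working on a folder action script that, whenever a file is added to the Import folder, converts the notes to a Skim notes file and moves that file to the directory where the original file is: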
on adding folder items to thisFolder after receiving theAddedItems
    repeat with newFile in theAddedItems
        -- shell commands need a POSIX path, not an alias
        set newPath to POSIX path of newFile
        set fileName to name of (info for newFile)
        set papersFolder to "/Users/me/Documents/Academic/Library/PapersTestSync/"
        -- look up the original PDF with the same file name in the papers folder
        set originalFile to (do shell script "find " & quoted form of papersFolder & " -name " & quoted form of fileName)
        -- display dialog "I found the following file:" & originalFile
        -- convert the annotations embedded in the PDF to Skim notes
        do shell script "/Applications/Skim.app/Contents/SharedSupport/skimpdf unembed " & quoted form of newPath
        -- now here I guess I need to specify which PDF the notes to attach to, and with this I am stuck.
    end repeat
end adding folder items to
I have successfully found the original file but how do I get the unembedded notes attached to the file? I couldn't figure that out from the documentation for skimpdf. With that I guess I should be able to make it work.
Much appreciated!
Jan
You should export the notes (that are now attached as skim notes rather than embedded) to a .skim file using skimnotes get. You can then attach those notes to the original file containing just the PDF data using skimnotes set.
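As a rough sketch of that two-step workflow, the Python snippet below only builds the two command lines; it does not run them. It assumes the usage `skimnotes get PDF_FILE NOTES_FILE` and `skimnotes set PDF_FILE NOTES_FILE`, and that the tool sits next to skimpdf inside the Skim application bundle. Both should be checked against the tool's own help output. The function name and the example paths are hypothetical.

```python
import shlex
from pathlib import Path

# Assumed location: next to the skimpdf tool used in the script above.
SKIMNOTES = "/Applications/Skim.app/Contents/SharedSupport/skimnotes"


def reattach_commands(annotated_pdf, original_pdf):
    """Build the two skimnotes invocations without running them.

    Assumes `skimnotes get PDF_FILE NOTES_FILE` and
    `skimnotes set PDF_FILE NOTES_FILE`; verify with the tool's help.
    """
    # Write the exported notes next to the annotated PDF, as <name>.skim
    notes_file = str(Path(annotated_pdf).with_suffix(".skim"))
    return [
        [SKIMNOTES, "get", str(annotated_pdf), notes_file],
        [SKIMNOTES, "set", str(original_pdf), notes_file],
    ]


# Hypothetical example paths
commands = reattach_commands("/tmp/Import/Paper.pdf", "/tmp/Library/Paper.pdf")
printable = [shlex.join(cmd) for cmd in commands]
for line in printable:
    print(line)
```

Each list can be passed to `subprocess.run(cmd, check=True)`; building argument lists rather than one shell string avoids quoting problems with spaces in file names.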

Christiaan
Jan David Hauck
2016-02-23 11:42:17 UTC
Permalink
Almost there!
Got the reattachment to work now with skimnotes. Thank you for pointing me
in the right direction!
(I could have figured that out myself.)

But, now I ran into a different problem:
When converting the notes with skimpdf, again I get "empty" notes :(

Same scenario as before with the Adobe Reader annotated PDFs, they get
converted properly but their values remain empty.
I tried with PDFs annotated by different readers and all give me the same
results.
And I tried opening these same PDFs in Skim and doing the conversion there,
and voilà, here the notes are extracted correctly.
Is there something that skimpdf does differently than Skim itself, when
opening the PDF and converting?
I could probably have the script open Skim to do the conversion but it's
more elegant via command line.
Anything I am missing here?

J
Christiaan Hofman
2016-02-23 12:47:17 UTC
Permalink
Post by Jan David Hauck
Is there something that skimpdf does differently than Skim itself, when opening the PDF and converting?
Skimpdf will never automatically fill the notes. That requires more information than just the input data, like information about the text in the PDF, something that is only available in the Skim app, not in the tool.

If your goal is to fill the highlight notes with the highlighted text, then I’m afraid skimpdf won’t be of any help to you; in fact it would do even less.

Christiaan
Jan David Hauck
2016-02-24 09:22:55 UTC
Permalink
Alright, then I'll invoke Skim for conversion first.
I'm glad it's supposed to be that way and not some other weird error.

One (hopefully last) question:
Is there a way to have the skimnotes tool add Skim notes to a PDF if the
file already has Skim notes? (Now it seems to always overwrite them.)
Christiaan Hofman
2016-02-24 10:06:54 UTC
Permalink
Post by Jan David Hauck
Is there a way to have the skimnotes tool add Skim notes to a PDF if the file already has Skim notes? (Now it seems to always overwrite them.)
No.

Christiaan
Jan David Hauck
2016-02-25 01:55:00 UTC
Permalink
Last words on the "embedding/converting notes from ClearScan OCR'd PDFs"
problem, just for the record, to save anyone who might stumble upon this
some hassle:

On iOS I tried Documents 5, Perfect Reader, Tiny PDF, and Adobe Reader.
All do a good job of saving the PDF (notes embedded) without screwing up
the character encoding (text remains readable upon opening the PDF later).
The first three also draw the notes in a smart way, so that the notes, when
converted to Skim notes and reattached to the original PDF, preserve their
content (i.e., Skim can read the text that the notes highlight, underline,
etc.). Adobe Reader unfortunately doesn't; it might be that the notes are
drawn too small or something, so their content doesn't show up after
converting (this is weird, since notes made in Acrobat Pro on the desktop
don't have that problem).
Tiny PDF has an additional feature to export the PDF either flattened or
without notes altogether. Here the encoding of the PDF is lost again, even
on iOS, so the PDF becomes unreadable. Nonetheless, saving as "embedded"
works fine.

Therefore it seems that (a) there is a bug in PDFKit that causes the PDF to
lose its encoding when saving with embedded notes (happens also when
deleting/reordering pages or otherwise modifying the PDF, other PDF
software is fine here); but (b) ClearScan PDFs by themselves are not
without problems.
Hence, better not use either of them :)

Thanks to Christiaan for his help figuring this all out!