February 27, 2015

PDF to HTML for eBook creation

My general goal is to take PDF files through an intermediate HTML/XML stage before making eBooks (mobi/epub) from them.
So the task is to somehow generate a clean HTML file out of a PDF.
Some of my general needs are:
  • keep paragraphs together (not mixing up <br /> with <p></p>)
  • get images and text together
  • keep character formatting
  • handle multiple columns (convert to single column)
  • skip page numbers
First of all, here's what I found important from the Google search results:

Thomas Levine's Parsing PDF files walk-through
Tools to use:
  • Basic file analysis tools (ls or another language’s equivalent)
  • PDF metadata tools (pdfinfo or an equivalent)
  • pdftotext
  • pdftohtml -xml
  • Inkscape via pdf2svg
  • PDFMiner
My own experiments:
With this PDF file, and another one that I made for this purpose.


PDFtoHTML
$ pdftohtml example.pdf
  • incorrectly displayed character encoding
    • <meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"> has to be entered in the HTML HEAD for proper character encoding 
    • could put a -enc UTF-8 (Case Sensitive!) as an option if needed, but the header meta still has to be entered manually. 
    • the meta gets entered in the document if the -noframes option is used.
  • no paragraphs, only breaks
  • a navigation html file was generated with two frames: one for a page index html, and one for the actual text.
    • this can be avoided using the -noframes option
    • the -s option generates a single output document (however, this only concatenates the HTML files, and links are not corrected)
  • pages are separated with horizontal rules
  • images are processed all right
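The manual step of entering the charset meta tag could be scripted. Here is a minimal sketch in Python; the function name ensure_utf8_meta is my own invention, and it assumes the file has an ordinary <head> tag:

```python
import re

META = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

def ensure_utf8_meta(html: str) -> str:
    """Insert a UTF-8 charset meta tag right after <head> if none is declared."""
    # If any meta tag already declares a charset, leave the document alone.
    if re.search(r"<meta[^>]+charset", html, flags=re.IGNORECASE):
        return html
    # Match <head> or <head ...>, but not <header>; only the first one is touched.
    return re.sub(
        r"(<head(?:\s[^>]*)?>)",
        lambda m: m.group(1) + "\n" + META,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
```

If the file has no <head> at all, the text is returned unchanged.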
 $ pdftohtml -c example.pdf
  • formatting is mostly strictly preserved
    • styles by css
    • absolute positions of paragraphs
      • some paragraphs are kept together, some are not recognized properly
    • left alignment is the only text alignment that is kept; everything else is rendered with an absolute left position.
    • bold and italics are preserved, but underline is rendered incorrectly
    • columns are preserved okay
    • font face changes do not show.
  • images are embedded in page background
  • every page is a separate html file
$ pdftohtml -xml example.pdf
  • stores formatting info about absolute positions, font size and line height
  • no info about paragraphs, images
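Since the -xml output keeps the vertical position and height of every line, paragraphs could in principle be guessed from the gaps between lines. A rough sketch in Python, with some assumptions: the element and attribute names (page, text, top, height) are what I remember seeing in the generated XML, the function name group_lines is made up, and the 1.5 gap factor is an arbitrary starting value that would need tuning per document:

```python
import xml.etree.ElementTree as ET

def group_lines(xml_text: str, gap_factor: float = 1.5):
    """Group the <text> lines of each page into paragraphs by vertical gap."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for page in root.iter("page"):
        current = []          # lines collected for the paragraph being built
        prev_top = None
        prev_height = None
        for node in page.iter("text"):
            top = float(node.get("top"))
            height = float(node.get("height"))
            # itertext() also picks up text inside nested <b>/<i> elements.
            line = "".join(node.itertext()).strip()
            if not line:
                continue
            # A gap clearly larger than one line height is taken as a paragraph break.
            if prev_top is not None and top - prev_top > gap_factor * prev_height:
                paragraphs.append(" ".join(current))
                current = []
            current.append(line)
            prev_top, prev_height = top, height
        if current:
            paragraphs.append(" ".join(current))
    return paragraphs
```

This does not handle multiple columns or paragraphs that continue on the next page; it only shows that the position data is enough to start from.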
Overall experience with PDFtoHTML: the output is of little use unless it is post-processed properly afterwards.

Postprocessing a file generated with the pdftohtml -enc UTF-8 -noframes -p -q example.pdf command:
Structure:
  • the style element and some meta elements in the HTML head are unnecessary.
  • <a name=[pagenumber]></a> marks the beginning of every page
  • <br/> in the middle of a line marks line break
  • <br/> at the end of the line marks paragraph break
  • <hr/> marks the end of every page
  • the last line before the end of page is probably a page number
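The rules above could be turned into a small cleaning script. Below is a minimal sketch in Python, not a finished tool: the function name clean_pdftohtml_body is hypothetical, it expects only the content of the body element, and it treats the last line of a page as a page number only when that line consists of digits, which is my own guess at a safe test:

```python
import re

def clean_pdftohtml_body(body: str) -> str:
    """Rebuild <p> paragraphs from the body of a pdftohtml -noframes file."""
    paragraphs = []
    # <hr/> closes each page, so splitting on it yields one chunk per page.
    for page in re.split(r"<hr\s*/?>", body):
        # Drop the page anchor (<a name=1></a>) that opens each page.
        page = re.sub(r'<a name="?\w+"?></a>', "", page)
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        # The last line of a page is probably a page number: drop it if it is digits only.
        if lines and re.fullmatch(r"\d+(<br\s*/?>)?", lines[-1]):
            lines.pop()
        buffer = []
        for ln in lines:
            # A <br/> at the end of the source line marks a paragraph break.
            ends_paragraph = re.search(r"<br\s*/?>$", ln) is not None
            # A <br/> inside the line is only a line break: turn it into a space.
            text = re.sub(r"\s*<br\s*/?>\s*", " ", ln).strip()
            if text:
                buffer.append(text)
            if ends_paragraph and buffer:
                paragraphs.append(" ".join(buffer))
                buffer = []
        # Flush whatever is left at the end of the page.
        if buffer:
            paragraphs.append(" ".join(buffer))
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)
```

A paragraph that runs across a page break still ends up as two paragraphs here, and inline tags such as <b> and <i> pass through untouched.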
PDFMiner
Download and Install PDFMiner.
Review the command line tools and their capacity.

$ pdf2txt.py -o example.html  example.pdf
or
$ pdf2txt.py -Y normal -o example.html  example.pdf 
  • no paragraphs, only breaks
    (paragraphs are not collected -- no easy way to restore them, unlike in pdftohtml)
  • formatting is not css but html span tags
    • Html Tidy can collect the formatting to the front of the html file as css, making it easier to review and modify:
      tidy -utf8 -c -o example_tidy.html  example.html
  • bold and italics are kept as font family style in span tag, underline is taken as an image (?)
  • images not processed
  • display is messy, text displayed on top of each other
  • code is quite all right.
$ pdf2txt.py -o example.xml  example.pdf
or
$ pdf2txt.py -Y exact -o example.html  example.pdf 
  • stores exact position of every single character
  • holds space for images
$ pdf2txt.py -t tag -o example.txt  example.pdf
  • stores pdf page data, with unformatted text content
$ pdf2txt.py -Y loose -o example.html  example.pdf
  • does not keep the line breaks; only the span styles indicate where the text changes. Paragraphs are not recognized properly
  • messy code
Overall experience with PDFMiner: this is not what I'm looking for. It stores either too much or too little information for my purposes, so in my case the output is of little use unless it is post-processed properly afterwards.


pdf2htmlEX
$ pdf2htmlEX example.pdf
  • omg wow amazing pretty output view!
    • everything looks exactly like the pdf file
  • (in exchange for an) extremely messy code :)
    • one page is one line, identified by a (div) id.
    • formatting is kept in classes (div, span, img, ...) by CSS:
      • @font-face {}
      • @media {}
      • .ff: font-family
      • (t) m: transformation matrix
      • v: vertical-align
      • ls: letter-spacing
      • sc: text-shadow
      • ws: word-spacing
      • _: display and width or margin-left
      • fc: color
      • fs: font-size
      • y: bottom
      • h: height
      • w: width
      • x: left
  • all file embedding can be turned off with --embed cfijo (will generate separate output files)
Altogether useless for my purposes. However, it is the best choice if your purpose is to display a PDF file as an HTML page on the web.

PdfMasher (GUI)
Does not keep formatting or images, but is specialized to keep proper order of the text.
  • as said, does not keep font formatting or image placeholders
  • with a little manual adjustment:
    • can be set to ignore page numbers (amazing!)
    • can be set to collect and link footnotes to be endnotes (amazing!)
  • html code is quite clear
  • paragraphs are preserved well where possible
Altogether, so far this is the best tool to prepare a simple text PDF for eBook creation.
Usage:
There are five types of elements that can be set in Edit mode:
  • Normal: will be default text
  • Title: will be Header tag H1 for once pressed, H2 for twice pressed, etc.
    • best to be filtered with sorting by Font Size
  • Footnote: will search for the reference and link it as endnote
    • best to be filtered with sorting by Font Size (or X or Y, respectively)
  • Ignore: will be ignored
    • Page numbers, footers and headers are best to be filtered with sorting by Y (or X, respectively)
  • To Fix: puts a FIXME sign in front of the paragraph. In HTML this becomes italic text.
These types can be set on the Table or on the Page tab.
Build options are:
  • Generate Markdown: generates a plain text file in the pdf directory with marks specifying the Title and To Fix parts, Ignored elements already ignored, and Footnotes already linked.
  • Edit Markdown: opens the markdown text file
  • Reveal Markdown: opens the directory in the default file browser containing the markdown
  • View HTML: generates the html file out of the markdown file, and opens it in the default web browser
Markdown signs:
  • # for H1, ## for H2, etc.
  • *FIXME* for italics
  • *** for horizontal rule
  • numbers for lists (quite annoying)
  • more on markdown usage
  • I accidentally found an eBook creation software that works from text like this markdown text, so I'll just leave a link here for notice.
Formatting of the text can be fixed manually in the markdown form or in the html form.
When saving as MOBI or EPUB there will be Table of Contents and navigation generated from the headings. The book Start will be set to the first heading.
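Since the Table of Contents comes from the headings, it is worth checking them in the markdown file before building. A small sketch in Python, assuming the headings use the # style listed above; the function name list_headings is made up for illustration:

```python
import re

def list_headings(markdown: str):
    """Return (level, title) pairs for headings written as '# Title', '## Title', etc."""
    headings = []
    for line in markdown.splitlines():
        # One to six '#' signs, at least one space, then the title text.
        match = re.match(r"^(#{1,6})\s+(.*?)\s*#*\s*$", line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings
```

Printing the result with an indent per level gives a quick preview of what the generated Table of Contents might look like.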

Summary:
  • To export all images from your file for further usage,
    use pdf2htmlEX --embed cfijo example.pdf.
    • can be opened in LibreOffice
  • To get a very simple html with proper paragraphs, endnotes, and headings, but without font formatting,
    use PDFMasher (GUI).
  • To get a fair html code with most of the images and font formatting but messed up paragraphing,
    use pdftohtml -enc UTF-8 -noframes -p -q example.pdf
    • cannot be opened in LibreOffice
That's it so far.
Probably the best way would be to learn sed and write an HTML cleaning script for myself, but that's likely in the distant future.
