January 28, 2014

Creating searchable PDFs on Ubuntu, 2nd try

Requirements:

  • image layer over text layer
  • correct character encoding for the Hungarian ű and ő characters
  • good placement of words and lines
  • reasonably good recognition
  • handling of multi-column layouts
The 1st try was Tesseract's hOCR output, merged with a PNM image by hocr2pdf.
  • image layer over text layer - YES
  • correct character encoding for the Hungarian ű and ő characters - NO
  • good placement of words and lines - NO
  • reasonably good recognition - YES
  • handling of multi-column layouts - NO
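For reference, the 1st try can be sketched as a small Python wrapper around the two tools. The post does not give the exact commands that were used, so the invocations below are assumptions based on the usual command lines (tesseract with the hocr config, and hocr2pdf reading the hOCR markup from standard input); the function names are made up for this sketch.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(image, out_base, lang="hun"):
    """Build the Tesseract call; Tesseract adds the file extension itself."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]


def hocr2pdf_cmd(image, pdf):
    """Build the hocr2pdf call; the hOCR markup is passed on standard input."""
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def make_searchable_pdf(image, pdf, lang="hun"):
    """Run OCR on the image, then merge the image and the hOCR text into a PDF."""
    out_base = str(Path(pdf).with_suffix(""))
    subprocess.run(tesseract_cmd(image, out_base, lang), check=True)

    # Assumption: depending on the Tesseract version, the hOCR output is
    # written as <out_base>.hocr or <out_base>.html.
    candidates = [Path(out_base + ".hocr"), Path(out_base + ".html")]
    hocr = next((p for p in candidates if p.exists()), None)
    if hocr is None:
        raise FileNotFoundError("no hOCR output found for " + out_base)

    with hocr.open("rb") as markup:
        subprocess.run(hocr2pdf_cmd(image, pdf), stdin=markup, check=True)
```

Calling make_searchable_pdf("page.pnm", "page.pdf") would run both steps, provided tesseract, the hun language data, and hocr2pdf are installed.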
The strangest thing is that hOCR editor does not handle Tesseract's hOCR output well. Actually, it does not handle it at all: it shows HTML tags and everything else where the editable text should be...
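Without a working hOCR editor, one way to look at word placement is to read the bounding boxes straight out of the hOCR file. The following is only a minimal sketch using the Python standard library; it assumes that words are marked with class ocrx_word and carry a "bbox x0 y0 x1 y1" entry in their title attribute, which may differ between Tesseract versions. The names HocrWords and hocr_words are made up for this sketch.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for every element of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, (x0, y0, x1, y1)) entries
        self._bbox = None  # bbox of the word currently being read
        self._text = []    # text fragments of that word
        self._depth = 0    # nesting depth of tags inside the word

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def hocr_words(hocr_text):
    """Return a list of (word, (x0, y0, x1, y1)) in document order."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Printing the result for one page shows the pixel coordinates of each word, which makes it easier to see whether the bad placement is already in the hOCR file or only appears after the merge into the PDF. It also shows whether ű and ő survive in the hOCR text itself.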

2nd try is: OCRopus
I had no success installing and using OCRopus.
  • "recognize" does not support selecting a language, or at least I could not find a Hungarian data file for it.
    It is invoked like this: ocroscript recognize input.pnm > output.html
  • rec-tess-complete should run the recognition through Tesseract and load its language files with the --tesslanguage=hun option, but instead I got this error:
    Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/hun.unicharset
  • so I unpacked the hun.traineddata like this:
    combine_tessdata -u hun.traineddata hun.
  • and put the resulting files into /usr/share/tesseract-ocr/tessdata/
  • however, I then got this error:
    Error: Illegal malloc request size!
    Fatal error: No error trap defined!
    Signal_termination_handler called with signal 2001
  • then I tried with --tesslanguage=eng and it gave me:
    ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:52: attempt to call global 'hardcoded_version_string' (a nil value)
  • so I searched and found a patch, and installed it like this:
    patch /usr/share/ocropus/scripts/rec-tess-complete.lua rec-tess-complete3_r1308.patch
  • and now it gives me this (with "eng"):
    ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:61: Leptonica is disabled, please compile with it or don't use it!
I already have the newest Tesseract installed, but I did not manage to install the newest OCRopus. There were too many unknowns for me, with Python and everything...
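The "Unable to load unicharset file" error above names a file that only exists after hun.traineddata has been unpacked. A small check like the following can confirm what is actually present in the tessdata directory before running rec-tess-complete again. This is only a sketch: the helper names are made up, and the directory is the one from the error message, which may be different on other systems.

```python
from pathlib import Path

# Directory taken from the error message in the post; adjust if needed.
TESSDATA_DIR = "/usr/share/tesseract-ocr/tessdata"


def language_files(tessdata_dir, lang):
    """Return the sorted names of all files that start with '<lang>.'."""
    return sorted(p.name for p in Path(tessdata_dir).glob(lang + ".*"))


def has_unicharset(tessdata_dir, lang):
    """True if the unpacked '<lang>.unicharset' file exists."""
    return (Path(tessdata_dir) / (lang + ".unicharset")).is_file()


if __name__ == "__main__":
    print(language_files(TESSDATA_DIR, "hun"))
    print("hun.unicharset present:", has_unicharset(TESSDATA_DIR, "hun"))
```

Note that this only tells whether the file is there. It cannot tell whether the old Tesseract code inside OCRopus can read the format of a newer language file, which may be what the malloc error is about.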

Results with OCRopus 0.3.1-2 "recognize", merged with hocr2pdf:
  • image layer over text layer - YES
  • correct character encoding for the Hungarian ű and ő characters - NO
  • good placement of words and lines - NO (it makes the characters so large that I cannot even tell which line they belong to)
  • reasonably good recognition - NO (because of the English training data)
  • handling of multi-column layouts - DON'T KNOW (the text was too big, it was impossible to tell)
Maybe the big text was caused by the DPI of the image... I should check this, at least to be able to judge the layout handling... Nope, changing it did not help at all.
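The reasoning behind the DPI suspicion can be put into numbers. hOCR coordinates are in image pixels, while PDF pages are measured in points (1/72 inch), so a wrong DPI value scales the text layer relative to the page. The sketch below is just this arithmetic with made-up function names; it is not taken from hocr2pdf, and since changing the DPI did not help here, it only illustrates the hypothesis.

```python
def px_to_pt(pixels, dpi):
    """Convert a length in image pixels to PDF points (72 points per inch)."""
    return pixels * 72.0 / dpi


def scale_error(true_dpi, assumed_dpi):
    """Factor by which text comes out too large if assumed_dpi is used instead of true_dpi."""
    return true_dpi / assumed_dpi


if __name__ == "__main__":
    # An A4 page scanned at 300 DPI is about 2480 pixels wide.
    print(px_to_pt(2480, 300))   # page width in points at the correct DPI
    print(scale_error(300, 72))  # enlargement if 72 DPI is assumed instead
```

If a 300 DPI scan were treated as 72 DPI, everything would come out roughly four times too large, which would look similar to the oversized text described above.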


...to be continued with:

3rd try is: Cuneiform
4th try: Adobe Acrobat XI on Windows
