Alice@Ubuntu: Creating searchable PDFs on Ubuntu 2nd try

January 28, 2014

Creating searchable PDFs on Ubuntu 2nd try

Requirements:

image layer over text layer
correct character encoding for the Hungarian ű and ő characters
good placement of words and lines
reasonably good recognition
handling of multi-column layouts

The 1st try was Tesseract's hOCR output, merged with the PNM page image into a PDF using hocr2pdf.

image layer over text layer - YES
correct character encoding for the Hungarian ű and ő characters - NO
good placement of words and lines - NO
reasonably good recognition - YES
handling of multi-column layouts - NO
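
For reference, the first-try pipeline rated above can be sketched roughly like this. It is only a sketch: make_searchable_pdf is a made-up helper name, it assumes hun.traineddata is installed, and it assumes the hOCR file ends up as BASE.hocr (newer Tesseract) or BASE.html (older Tesseract).

```shell
#!/bin/sh
# Sketch of the first-try pipeline: Tesseract writes hOCR, hocr2pdf merges it
# with the page image. make_searchable_pdf is a hypothetical helper name.
make_searchable_pdf() {
    img=$1            # page image, e.g. input.pnm
    lang=${2:-hun}    # Tesseract language code; needs the matching traineddata
    base=${img%.*}    # output base name without the extension

    # The "hocr" config file makes Tesseract emit hOCR instead of plain text.
    tesseract "$img" "$base" -l "$lang" hocr || return 1

    # Assumption: newer Tesseract builds write BASE.hocr, older ones BASE.html.
    if [ -f "$base.hocr" ]; then
        hocr=$base.hocr
    else
        hocr=$base.html
    fi
    [ -f "$hocr" ] || return 1

    # hocr2pdf reads the hOCR on stdin and places the image over the text.
    hocr2pdf -i "$img" -o "$base.pdf" < "$hocr"
}
```

Calling make_searchable_pdf input.pnm hun should leave input.pdf next to the image, provided both tools are installed.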

The strangest thing is that hOCR editor does not handle Tesseract's hOCR output well. Actually, it does not handle it at all: it shows HTML tags and everything else where the editable text should be...

2nd try is: OCRopus

I had no success installing and using OCRopus.

"recognize" does not support selecting a language, and/or I could not find a Hungarian data file for it.
It works like this: ocroscript recognize input.pnm > output.html
rec-tess-complete should run the recognition through Tesseract and load language files via the --tesslanguage=hun option, but instead I got this error:
Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/hun.unicharset
So I unpacked hun.traineddata like this:
combine_tessdata -u hun.traineddata hun.
and put the files into /usr/share/tesseract-ocr/tessdata/
However, I then got this error:
Error: Illegal malloc request size!
Fatal error: No error trap defined!
Signal_termination_handler called with signal 2001
Then I tried --tesslanguage=eng and it gave me:
ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:52: attempt to call global 'hardcoded_version_string' (a nil value)
So I searched, found a patch, and applied it like this:
patch /usr/share/ocropus/scripts/rec-tess-complete.lua rec-tess-complete3_r1308.patch
and now it gives me (with "eng"):
ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:61: Leptonica is disabled, please compile with it or don't use it!
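
The unpacking step above can be tried in a scratch directory first, so nothing in the system tessdata folder is touched until the pieces look right. A minimal sketch: unpack_lang is a made-up helper name, the default tessdata path is the one from the error message above, and ./unpacked is an arbitrary output directory. It only checks that the .unicharset component exists; it does nothing about the malloc error.

```shell
#!/bin/sh
# Sketch: unpack a traineddata file into a scratch directory and report
# whether the .unicharset component named in the error message is present.
# unpack_lang is a hypothetical helper name.
unpack_lang() {
    lang=$1
    tessdata=${2:-/usr/share/tesseract-ocr/tessdata}
    out=${3:-./unpacked}

    if [ ! -f "$tessdata/$lang.traineddata" ]; then
        echo "missing: $tessdata/$lang.traineddata" >&2
        return 1
    fi

    mkdir -p "$out" || return 1
    # The trailing dot matters: components are written as PREFIX + name,
    # e.g. unpacked/hun.unicharset.
    combine_tessdata -u "$tessdata/$lang.traineddata" "$out/$lang." || return 1

    [ -f "$out/$lang.unicharset" ] && echo "found: $out/$lang.unicharset"
}
```

For example, unpack_lang hun would read hun.traineddata from the default tessdata directory and write the components under ./unpacked/.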

I already have the newest Tesseract installed, but I did not manage to install the newest OCRopus. It had too many unknowns with Python and all...

Results with OCRopus 0.3.1-2 "recognize", merged with hocr2pdf:

image layer over text layer - YES
correct character encoding for the Hungarian ű and ő characters - NO
good placement of words and lines - NO (the characters come out so large that I cannot even tell which line they belong to)
reasonably good recognition - NO (because of the English training data)
handling of multi-column layouts - DON'T KNOW (the text was too big, it was impossible to tell)

Maybe the big text was because of the DPI of the image... I should check this, to at least be able to judge the layout handling... Nope, it did not help... at all.
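
A note for anyone repeating the DPI check: PNM files store no resolution at all, so the merge step has to assume one. One way to rule the resolution in or out is to state it explicitly. A minimal sketch, assuming hocr2pdf's -r (resolution override) option and a 300 dpi scan; merge_with_dpi is a made-up helper name.

```shell
#!/bin/sh
# Sketch: merge hOCR and image while stating the scan resolution explicitly.
# merge_with_dpi is a hypothetical helper; 300 is an assumed default.
merge_with_dpi() {
    img=$1
    hocr=$2
    pdf=$3
    dpi=${4:-300}

    # Reject anything that is not made of digits only.
    case $dpi in
        ''|*[!0-9]*) echo "dpi must be a whole number" >&2; return 2 ;;
    esac

    # -r overrides whatever resolution hocr2pdf would otherwise assume.
    hocr2pdf -r "$dpi" -i "$img" -o "$pdf" < "$hocr"
}
```

For example, merge_with_dpi input.pnm output.html output.pdf 300. Trying a couple of values (150, 300, 600) shows quickly whether the text size in the PDF reacts to the resolution at all.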

...to be continued with:

3rd try is: Cuneiform

4th try: Adobe Acrobat XI on Windows
