
OCRopus 0.2 optical character recognition + layout analysis

Posted: Thu 25 Sep 2008, 12:00
by disciple
This is an OCR program using the tesseract engine but with layout analysis. In my tests it seems "layout analysis" just means it recognizes columns of text, reads each of them and then strings them all together including headers, footers etc. It is noticeably less accurate than Tesseract, and probably only a little better than the OCR engine in Microsoft Office. This is still much better than gocr :)
I recommend using Tesseract itself (here) unless you intend to scan pages with parallel columns.
It produces an HTML file that copies and pastes nicely from a browser (probably not Dillo, as it lacks UTF support) to a word processor.

1. Install from here (1121kb).
2. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 kb)
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you have a different language, you'd better make them ;)
3. Extract the language file into /usr/local/share (the files should end up in /usr/local/share/tessdata).
4. Run with e.g.
ocroscript rec-tess /path/some_scan.png > /other_path/scan.html
N.B. doesn't work with tiffs, as I disabled libtiff support in tesseract because of a bug that they tell me will be fixed in the next version. Convert to something else.
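
A minimal sketch of step 4 that also works around the TIFF limitation, assuming ImageMagick's `convert` is installed (it is not part of this package) and that the TIFF holds a single page. The function names `html_name` and `ocr_scan` are made up for illustration; only the `ocroscript rec-tess` call comes from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: swap the file extension for .html
html_name() {
    printf '%s.html\n' "${1%.*}"
}

# Hypothetical helper: convert a TIFF to PNG if needed, then run OCRopus
ocr_scan() {
    src=$1
    case $src in
        *.tif|*.tiff|*.TIF|*.TIFF)
            # Assumption: ImageMagick is available and the TIFF is single-page
            png="${src%.*}.png"
            convert "$src" "$png" || return 1
            src=$png
            ;;
    esac
    # Same command as in step 4, with the output written next to the scan
    ocroscript rec-tess "$src" > "$(html_name "$src")"
}

# Only run when a scan is passed on the command line
if [ "$#" -ge 1 ]; then
    ocr_scan "$1"
fi
```

Saved as, say, ocr_scan.sh, it would be run as `sh ocr_scan.sh /path/some_scan.tif` and leave the result in /path/some_scan.html.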

Ocropus can also be compiled against a language modelling program and a program for making vector images of diagrams in a scan. I didn't look hard, but there doesn't seem to be a ready-to-go way to use these (or aspell, which I think I did compile against), so I didn't bother.

There were also two HUGE files produced by the install that I didn't include for the same reason - a US dictionary and a file for neural network modelling.

BTW unlike tesseract, I think ocropus converts to black-and-white, so there is no advantage in colour images.

Extra OCR related tools

Posted: Sat 11 Apr 2009, 12:19
by disciple
Also check out the extra tools I posted in the Tesseract thread.
http://www.murga-linux.com/puppy/viewto ... 332#279332
and the following post.

Posted: Sat 12 Feb 2011, 23:13
by miriam
A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.

The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet. It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).

Posted: Sat 12 Feb 2011, 23:51
by disciple
miriam wrote:A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.
Where is it out? You haven't built from trunk?
miriam wrote:The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet.
It is easy. Instead of running `make install` run `new2dir make install`.
miriam wrote:It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).
I thought they said it was actually faster, but I could be wrong.

The other OCR project worth trying is cuneiform. The last (free, but Windows only, and the interface is in Russian) version before they open-sourced it did an exceptional job, recognising layout, formatting and tables. Unfortunately they haven't managed to open-source the table recognition yet :(
If you do want to use the old Russian version, the interface is actually identical to the current Windows release (which is available in English). I've taken screenshots of this to find my way around the Russian version.

There are a few guis around for the linux version.
http://symmetrica.net/cuneiform-linux/yagf-en.html (QT + aspell - looks good)
http://en.altlinux.org/Cuneiform-Qt (QT)
http://code.google.com/p/cuneiform-gui/ (Java)
http://code.google.com/p/simplegui4cuneiform/ (zenity dialogs)
http://wiki.ubuntuusers.de/Cuneiform-Li ... g-in-XSane (script to integrate with Xsane and Imagemagick)
EDIT also https://code.google.com/p/ocrfeeder/ (Py/GTK) which I mentioned in the tesseract thread can use cuneiform.

Posted: Sun 13 Feb 2011, 01:22
by rcrsn51
Tesseract 3 is twice as big as Tesseract 2 when packaged as a PET and is somewhat slower. But it does a good job of detecting columns. And it's compatible with Peasyscan.

Posted: Sun 13 Feb 2011, 02:39
by Sit Heel Speak
In case anyone does wish to attempt building OCRopus from trunk, I have just posted a .pet of mercurial, here.

Posted: Sun 13 Feb 2011, 05:00
by miriam
Wow, thanks disciple. Now that I have the name of the new2dir command I looked it up and have learned lots about making pets, including the really essential part that I didn't know about dir2pet. I've been compiling programs for my various machines for many years. In future I'll make pets for Puppy and share them around. Yay!

I haven't actually tried OCRopus yet. I simply read the details on their GoogleCode page http://code.google.com/p/ocropus/ and on the Wikipedia page http://en.wikipedia.org/wiki/OCRopus. However, I expect I will try it in the near future. I've long felt that neural networks are the only sensible way to get reliable OCR. I'm especially interested that OCRopus can have the code to read handwriting enabled (it is disabled by default).

It is a bit hard to tell how tesseract 3 compares with the older tesseract 2 because my current machine (until I get a newer one) is soooo slow.

Thanks for the info about cuneiform, but I'm very reluctant to put any effort into getting Wine working on my machine after having finally rid myself of the last traces of pesky M$ stuff. :)

Posted: Sun 13 Feb 2011, 05:56
by disciple
Please note: the only reason anyone would want to run the old Windows version of cuneiform is for table recognition. I believe the Linux version should be just as capable apart from that.

Posted: Sun 13 Feb 2011, 09:57
by miriam
Oops, sorry. My bad. My eyes glazed over at the mention of Windows and I am a little embarrassed to admit I didn't read the part that followed properly.

Much more interesting than I thought. Thank you.
Downloading now... Yeow! 25MB! Big.
But it does sound like a very cool program.
http://www.cuneiform.ru/eng/
http://en.wikipedia.org/wiki/CuneiForm_(software)

Weird... the text of this post disappeared, but is here when I edit it... I deleted it all and added stuff back in line by line.
Huh. It is the Wikipedia address. I can't make it a link -- the parentheses probably confuse the bulletin board software.

Posted: Thu 19 Nov 2015, 08:25
by greengeek
Hi, does anyone have an OCR package that may work on Slacko 5.6 please? I will be happy with anything that works no matter how restrictive (i.e. I don't need table recognition or anything fancy - just the ability to recognise/analyze an image of a few words in a single font).
cheers!

Posted: Thu 19 Nov 2015, 12:19
by rcrsn51
Peasyscan has supported Tesseract OCR since 2010.

Posted: Thu 19 Nov 2015, 20:02
by greengeek
Many thanks rcrsn51!! I already have peasyscan grafted into my Slacko 5.6 derivative so I was able to grab Tesseract and pic2txt from your link here and now I have OCR. Fantastic!
Thanks so much.
(ps: I am getting really good results if I use mtPaint to scale the image size up before feeding the file to pic2txt - this upscaling seems to increase recognition integrity greatly. I also played with gamma, brightness and the "sharpen" function but none of those helped as much as simply making the image bigger - which particularly seemed to help with recognition of spaces between words).
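
The upscaling step described above can be sketched on the command line. This assumes ImageMagick's `convert` in place of mtPaint, and it calls `tesseract` directly rather than through pic2txt, whose options are not shown in this thread. It also assumes a Tesseract build that reads PNG files. The helper names and the 200% default are illustrative guesses, not values from the post.

```shell
#!/bin/sh
# Hypothetical helper: name for the enlarged copy, e.g. scan.png -> scan-up.png
upscaled_name() {
    printf '%s-up.png\n' "${1%.*}"
}

# Hypothetical helper: enlarge the image, then run Tesseract on the copy
upscale_and_ocr() {
    src=$1
    pct=${2:-200}                      # scale in percent; 200 is only a guess
    big=$(upscaled_name "$src")
    # Assumption: ImageMagick is installed
    convert "$src" -resize "${pct}%" "$big" || return 1
    # tesseract writes its text to <output base>.txt
    tesseract "$big" "${big%.png}"
}

# Only run when an image is passed on the command line
if [ "$#" -ge 1 ]; then
    upscale_and_ocr "$@"
fi
```

Since results in this thread vary with image quality, it may be worth trying a few percentages (for instance 110, 150 and 200) and comparing the text output.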

Puppyocr works with Slacko 5.5, sure

Posted: Wed 03 Feb 2016, 11:50
by Pelo
greengeek, Puppyocr works with Slacko 5.5, for sure.
I have converted a dozen documents to text with it, so puppyocr should work in Slacko 5.6.
About scaling documents: on the contrary, although I was hoping for better results, enlarging them made Puppy OCR recognize them less well. Strange.
I use puppyocr on judgements from the French Revolution, typewritten when the first typewriters came into use.

Re: Puppyocr works with Slacko 5.5, sure

Posted: Fri 05 Feb 2016, 13:58
by rcrsn51
Pelo wrote:About scaling documents: on the contrary, although I was hoping for better results, enlarging them made Puppy OCR recognize them less well.
I tried this with the latest pic2txt and Tesseract3. It worked, but only a small up-scaling was needed, like 110%.

It would depend a lot on the quality of the image.

In English Puppy OCR has an easier job

Posted: Tue 12 Apr 2016, 16:35
by Pelo
The main flaw with Puppy OCR is its handling of accents and punctuation. Sure, in English Puppy OCR has an easier job. Try it on "Où vais-je aller à la pêche ?" (Where shall I go fishing?).
See the remarks of OUI, our dear Franco-German colleague; he wants a 64-bit OCR :) OCR requested by Oui.