How to make optical character recognition in Java 100% accurate

azhar · November 11, 2018, 3:08am

I need suggestions regarding OCR in Java.
I have used Tesseract, but my problems are:

how to best prepare an image for OCR,
how to remove noise,
how to differentiate between similar characters,
the best image size, bit depth per pixel, and so on.

All suggestions welcome.
Thank you.
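
For the image-preparation questions in the post above, here is a minimal sketch of three common steps: upscaling to 8-bit grayscale, a 3x3 median filter for speckle noise, and Otsu binarization. It uses only java.awt, not Tesseract itself. The class name OcrPreprocess, its method names, and the choice of steps are assumptions for illustration, not a recipe from this thread, and the right scale factor depends on the scan.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.util.Arrays;

// Hypothetical helper class; names and step order are illustrative assumptions.
class OcrPreprocess {

    /** Converts to 8-bit grayscale and resizes by the given factor (bicubic). */
    static BufferedImage toScaledGray(BufferedImage src, double scale) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(src.getHeight() * scale));
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return gray;
    }

    /** 3x3 median filter: removes isolated speckles while keeping edges fairly sharp. */
    static BufferedImage median3x3(BufferedImage gray) {
        int w = gray.getWidth(), h = gray.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // Clamp coordinates so border pixels reuse their nearest neighbor.
                        int xx = Math.min(Math.max(x + dx, 0), w - 1);
                        int yy = Math.min(Math.max(y + dy, 0), h - 1);
                        window[n++] = gray.getRaster().getSample(xx, yy, 0);
                    }
                }
                Arrays.sort(window);
                out.getRaster().setSample(x, y, 0, window[4]);
            }
        }
        return out;
    }

    /** Otsu's method: picks the threshold that maximizes between-class variance. */
    static int otsuThreshold(BufferedImage gray) {
        int[] hist = new int[256];
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                hist[gray.getRaster().getSample(x, y, 0)]++;
            }
        }
        long total = (long) gray.getWidth() * gray.getHeight();
        double sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (double) i * hist[i];

        double sumBg = 0, best = -1;
        long countBg = 0;
        int bestT = 0;
        for (int t = 0; t < 256; t++) {
            countBg += hist[t];
            if (countBg == 0) continue;
            long countFg = total - countBg;
            if (countFg == 0) break;
            sumBg += (double) t * hist[t];
            double meanBg = sumBg / countBg;
            double meanFg = (sumAll - sumBg) / countFg;
            double between = (double) countBg * countFg * (meanBg - meanFg) * (meanBg - meanFg);
            if (between > best) {
                best = between;
                bestT = t;
            }
        }
        return bestT;
    }

    /** Pixels brighter than the threshold become white (1), the rest black (0). */
    static BufferedImage binarize(BufferedImage gray, int threshold) {
        int w = gray.getWidth(), h = gray.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = gray.getRaster().getSample(x, y, 0);
                out.getRaster().setSample(x, y, 0, v > threshold ? 1 : 0);
            }
        }
        return out;
    }
}
```

A plausible order is toScaledGray, then median3x3, then binarize(img, otsuThreshold(img)), with the result handed to the OCR engine. Whether each step helps depends on the source images, so it is worth comparing recognition results with and without it.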

pspeed · November 11, 2018, 5:16am

You might want to ask on a different forum since this is not really a game development topic… not sure how many people’s interests would intersect with that.

thetoucher · November 11, 2018, 6:37am

Google, Amazon and Microsoft have OCR solutions as part of their web services offerings, which use virtually unlimited computing resources and years of honed machine learning. So, as far as I know, there is no Java solution that is good enough. If it were me, I would be using AWS to do it for me.

jayfella · November 11, 2018, 8:33am

Neural networks.

It’s pretty cool for drawing stuff on screen like in the Black & White game series.

danielp · November 12, 2018, 6:10am

Neural networks are powerful, but they require tremendous compute resources and a large training set to be used effectively. For a lot of cases they’re probably the best tool available, but I regard them as rather a sledgehammer solution - they’re powerful and get a lot done, but lack the finesse and efficiency of a more tailored solution.

Regarding the original problem, I’m going to stop right at the title. 100% accuracy is impossible. Period. Can’t be done. Humans are much, much, much better at reading handwriting than computers are, but it’s still not hard to find handwriting samples that are virtually or completely illegible to a number of people. (I probably have a less than 50% accuracy rate at reading doctors’ handwritten notes, and I use cursive handwriting far more than printed handwriting.)

Whenever you’re looking at something like image processing/recognition, you’re guaranteed a lower than 100% success rate, because even a number of tasks that humans can perform intuitively (and far better than the best machine implementations) have a non-100% success rate. Making out written letters is not easy when it’s anything other than a printed page, and even that has issues (if you can read a page with 100% accuracy, switch the font - sooner or later you’ll find a font that’ll screw up your recognizer). You’ll be best off using an existing service like the others have suggested.
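
To put a number on "accuracy" when comparing fonts or settings, one common measure is the character error rate: the edit distance between the OCR output and the known correct text, divided by the length of the correct text. The sketch below is a plain Levenshtein implementation. The class and method names are hypothetical, and the measure itself is an assumption about how one might evaluate results, not something the posters specified.

```java
// Hypothetical helper for scoring OCR output against known ground-truth text.
class OcrAccuracy {

    /** Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap the two rows so only O(length of b) memory is needed.
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Character error rate relative to the ground truth; 0.0 means a perfect match. */
    static double characterErrorRate(String truth, String ocr) {
        if (truth.isEmpty()) return ocr.isEmpty() ? 0.0 : 1.0;
        return (double) editDistance(truth, ocr) / truth.length();
    }
}
```

For example, reading "hello" as "he11o" is two substitutions out of five characters, so characterErrorRate returns 0.4. Running the same check across several fonts shows how quickly a recognizer that looks perfect on one font degrades on another.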

azhar · November 12, 2018, 11:19am

What is a .traineddata file, and what does it contain?
Also, what is a box file, and why is it used in OCR?

How is a .traineddata file generated from a font or box file?

Regarding 100% accuracy: what are the obstacles that prevent it?
Is it noise in the image?
Is it the size of the image?
Is it the bit depth per pixel?

thetoucher · November 12, 2018, 11:22am

Did you try turning it off and on again?

pspeed · November 12, 2018, 3:32pm

Humans don’t even achieve 100% accuracy in OCR.

noncom · November 17, 2018, 12:33pm

@azhar as others have noted, this topic is far beyond this forum, so you should be asking in different places. From my experience, one direction to go would be to google “JavaCV OCR” and start your research from there. OCR involves a lot of math, so you should be ready to dive into it and study it yourself too. Non-Java sources should be of interest too.