Tesseract OCR in Java

This article is maintained by the team at commabot.

Before using Tesseract in Java, you need to install it on your system. Tesseract is available for Windows, Linux, and macOS.

To use Tesseract in Java, you need a Java wrapper. Tess4J is a popular choice. It is a JNA wrapper for the Tesseract API and can be easily integrated into Java projects.

Using Maven

If you are using Maven, add the following dependency to your pom.xml:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.x.x</version> <!-- Replace with the latest version -->
</dependency>

Using Gradle

If you're using Gradle, you can include Tess4J in your build.gradle file. Add the following line in the dependencies section:

implementation 'net.sourceforge.tess4j:tess4j:4.x.x' // Replace with the latest version

This will automatically download and include Tess4J in your project.

Downloading JAR Files Manually

If your project does not use a build tool like Maven or Gradle, you can download the JAR files directly and add them to your project's classpath.

Visit the Tess4J SourceForge page or the GitHub repository.
Download the necessary JAR files and any dependencies.
Add these JAR files to your project's build path. In most IDEs, you can do this by right-clicking on the project and selecting something like "Properties" or "Project Structure" and then navigating to the "Libraries" or "Dependencies" section to add the JAR files.

Using an Integrated Development Environment (IDE)

Many IDEs like Eclipse, IntelliJ IDEA, or NetBeans have options to manage dependencies easily:

Eclipse: Use the built-in Maven or Gradle support, or manually add JARs to the build path via the project properties.
IntelliJ IDEA: Similarly, use its built-in Maven or Gradle support, or go to "File" → "Project Structure" → "Libraries" to add JAR files manually.
NetBeans: Use the "Projects" tab, right-click on the "Libraries" folder in your project, and select "Add JAR/Folder" to include the Tess4J JARs.
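
Whichever route you take, it can help to confirm that the Tess4J classes are actually visible to your application before writing any OCR code. The following is a minimal sketch, not part of Tess4J: ClasspathCheck and isOnClasspath are hypothetical names chosen for illustration, and the only Tess4J detail it relies on is the net.sourceforge.tess4j.Tesseract class used later in this article.

```java
// Hypothetical helper (not part of Tess4J): reports whether a class
// can be found on the application's classpath.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String tesseractClass = "net.sourceforge.tess4j.Tesseract";
        System.out.println(isOnClasspath(tesseractClass)
                ? "Tess4J is on the classpath."
                : "Tess4J was not found; check your dependency setup.");
    }
}
```

If the second message is printed, the dependency was probably not resolved or the JAR was not added to the build path.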

Write Java Code to Use Tesseract

Here's a simple example of how to use Tess4J in a Java application:

import net.sourceforge.tess4j.*;

import java.io.File;

public class TesseractExample {

    public static void main(String[] args) {
        File imageFile = new File("path/to/your/image.jpg");
        ITesseract instance = new Tesseract();  // JNA Interface Mapping

        try {
            instance.setLanguage("eng"); // Setting language to English
            instance.setDatapath("path/to/tessdata");

            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

Ensure the tessdata directory is correctly set in your Java code. This directory contains language files needed for OCR. Its location can vary based on your operating system and how you installed Tesseract.

Windows: If you used a pre-built binary or installer, the tessdata directory is usually located in the installation directory of Tesseract, often in C:\Program Files\Tesseract-OCR\tessdata.
Linux: For Linux installations via package managers like apt or yum, the tessdata files are generally located in a shared directory like /usr/share/tesseract-ocr/4.00/tessdata or /usr/local/share/tessdata.
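macOS: If installed via Homebrew, it's typically located in /usr/local/Cellar/tesseract/{version}/share/tessdata, where {version} is the installed version of Tesseract. On Apple Silicon Macs, Homebrew installs under /opt/homebrew instead of /usr/local, so the path starts with /opt/homebrew/Cellar.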
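
Because the location varies, one option is to probe a few candidate directories at startup instead of hard-coding a single path. The sketch below is an assumption-laden illustration rather than a Tess4J feature: TessdataLocator and firstExisting are hypothetical names, and the candidate list simply repeats the Windows and Linux paths mentioned above (the Homebrew path is omitted because it depends on the installed version). It uses only the Java standard library.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

// Hypothetical helper (not part of Tess4J): finds the first tessdata
// directory that exists among a list of candidate locations.
final class TessdataLocator {

    // Candidates taken from the locations listed above; adjust for your installation.
    static final List<String> DEFAULT_CANDIDATES = List.of(
            "C:\\Program Files\\Tesseract-OCR\\tessdata",
            "/usr/share/tesseract-ocr/4.00/tessdata",
            "/usr/local/share/tessdata");

    // Returns the first candidate that exists and is a directory, if any.
    static Optional<Path> firstExisting(List<String> candidates) {
        for (String candidate : candidates) {
            Path path = Paths.get(candidate);
            if (Files.isDirectory(path)) {
                return Optional.of(path);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> dir = firstExisting(DEFAULT_CANDIDATES);
        System.out.println(dir.map(p -> "Found tessdata at " + p)
                .orElse("No tessdata directory found; set the path manually."));
    }
}
```

The resulting path, converted with toString(), could then be passed to setDatapath in the example above. Falling back to an explicit, user-supplied path remains the safer choice when none of the candidates exist.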

This is a basic guide. For more complex use cases, consult the Tesseract and Tess4J documentation.
