PHP and OCR: Extracting Text from Images Using Tesseract OCR

Introduction

Optical Character Recognition (OCR) is a technique that allows extracting text from images. This is useful for scanning documents, recognizing text from receipts, invoices, or even handwritten notes.

With Tesseract OCR, a powerful open-source text recognition engine, we can integrate image-to-text conversion directly into PHP applications.

In this guide, we will cover:

✅ Installing Tesseract OCR for PHP
✅ Extracting text from images using Tesseract in PHP
✅ Processing scanned documents for better text recognition
✅ Handling multiple languages and output formats

By the end, you'll have a working OCR system in PHP that can read text from images dynamically. 🚀

1. Installing Tesseract OCR on Your System

For Linux (Ubuntu/Debian)

Run the following command to install Tesseract:

sudo apt update
sudo apt install tesseract-ocr

Verify the installation:

tesseract -v

For Windows

  1. Download and install Tesseract OCR from GitHub Releases.
  2. Add the Tesseract-OCR directory to your system's PATH environment variable.
  3. Verify installation by running:
    tesseract -v
    

For macOS

Use Homebrew:

brew install tesseract

✅ Now, Tesseract OCR is installed and ready for PHP integration.
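
Before wiring Tesseract into PHP, it can help to confirm from PHP itself that the binary is reachable, since the web server's PATH may differ from the one in your shell. The following is a minimal sketch; findExecutable() is a hypothetical helper written for this article, not part of PHP or Tesseract.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return the full path of $name if it is found in one
 * of the directories listed in $pathEnv, or null otherwise.
 */
function findExecutable(string $name, string $pathEnv): ?string
{
    // On Windows, executables carry an extension such as .exe.
    $suffixes = DIRECTORY_SEPARATOR === '\\' ? ['.exe', '.bat', '.cmd', ''] : [''];

    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        foreach ($suffixes as $suffix) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name . $suffix;
            if (is_file($candidate) && is_executable($candidate)) {
                return $candidate;
            }
        }
    }
    return null;
}

// Check the PATH that this PHP process actually sees.
$path = findExecutable('tesseract', (string) getenv('PATH'));
echo $path !== null
    ? "Tesseract found at: $path\n"
    : "Tesseract was not found on PATH; check the installation steps.\n";
```

If the check fails under the web server but works on the command line, the PATH of the PHP process is the likely culprit.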

2. Installing the PHP Tesseract Wrapper

PHP doesn't have native OCR support, so we use thiagoalessio/tesseract-ocr-for-php, a PHP wrapper for Tesseract (published on Packagist as thiagoalessio/tesseract_ocr).

Install via Composer:

composer require thiagoalessio/tesseract_ocr

✅ This package allows executing Tesseract from PHP easily.

3. Extracting Text from an Image in PHP

Now, let's write a simple PHP script to extract text from an image using Tesseract.

Example: Extract Text from an Image

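<?php
// Load Composer's autoloader so the wrapper class is available.
require 'vendor/autoload.php';

use thiagoalessio\TesseractOCR\TesseractOCR;

// Run Tesseract on the image and capture the recognized text.
$text = (new TesseractOCR('text_image.png'))->run();
echo "Extracted Text: " . $text;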

Explanation:

✅ TesseractOCR('text_image.png') loads the image.
✅ run() executes Tesseract OCR and returns recognized text.
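
Since run() hands the file straight to Tesseract, a missing or non-image file only surfaces as a failed command. A small guard in front of it can give a clearer error. This is a sketch under the assumption that checking the file header with PHP's built-in getimagesize() is sufficient; assertReadableImage() is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical guard: throw if $path is not a readable image file.
 * Returns the detected MIME type (for example "image/png") on success.
 */
function assertReadableImage(string $path): string
{
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("File not found or not readable: $path");
    }

    // getimagesize() inspects the file header and returns false for non-images.
    $info = @getimagesize($path);
    if ($info === false) {
        throw new InvalidArgumentException("Not a recognized image: $path");
    }

    return $info['mime'];
}

// Usage sketch (requires the wrapper from step 2 and a real image):
// assertReadableImage('text_image.png');
// $text = (new TesseractOCR('text_image.png'))->run();
```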

4. Extracting Text from Scanned Documents (JPEG, PNG, PDF)

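Tesseract handles scanned documents in image formats such as JPEG, PNG, and TIFF. It does not read PDF files directly, so convert each PDF page to an image first. Let's extract text from a scanned invoice or receipt.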

Example: Extract Text from a Scanned Document

$text = (new TesseractOCR('invoice.jpg'))
    ->lang('eng') // Specify English language
    ->run();

echo "Extracted Invoice Text: " . $text;

Best Practices for Better Text Extraction:

✅ Use high-resolution images (300 DPI recommended).
✅ Ensure text contrast is high (black text on white background works best).
✅ Prefer lossless formats such as .tif or .png, since JPEG compression artifacts can blur character edges.
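
To check whether a scan meets the 300 DPI recommendation, divide its pixel width by the physical width of the page. The snippet below is a small worked example; estimateDpi() is a hypothetical helper, and the page width is something you need to know or assume.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: effective DPI of a scan, given its pixel width and
 * the physical width of the scanned page in inches.
 */
function estimateDpi(int $pixelWidth, float $pageWidthInches): float
{
    if ($pixelWidth <= 0 || $pageWidthInches <= 0.0) {
        throw new InvalidArgumentException('Width values must be positive.');
    }
    return $pixelWidth / $pageWidthInches;
}

// A US Letter page is 8.5 inches wide, so 2550 pixels across means 300 DPI.
$dpi = estimateDpi(2550, 8.5);
echo $dpi >= 300.0
    ? "Resolution looks sufficient ($dpi DPI)\n"
    : "Consider rescanning at a higher resolution ($dpi DPI)\n";
```

The pixel width itself can be read with getimagesize(), which returns the width as the first element of its result.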

5. Extracting Text from Multiple Languages

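Tesseract supports multiple languages, including French, German, Spanish, Chinese, and Arabic. Each language needs its trained data installed (for example, the tesseract-ocr-spa package on Ubuntu/Debian for Spanish).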

Example: Extracting Multilingual Text

$text = (new TesseractOCR('multilang.png'))
    ->lang('eng+spa') // English and Spanish
    ->run();

echo "Extracted Text: " . $text;

✅ Use multiple language codes (eng+fra+deu) for multilingual OCR processing.
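
When the language list comes from user input or configuration, it is worth validating it before building the eng+spa style string. The sketch below assumes you already have the list of installed languages (for instance, parsed from the output of tesseract --list-langs); buildLangSpec() is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: join language codes with "+" for Tesseract, after
 * checking each one against the list of installed languages.
 *
 * @param string[] $requested e.g. ['eng', 'spa']
 * @param string[] $installed e.g. parsed from `tesseract --list-langs`
 */
function buildLangSpec(array $requested, array $installed): string
{
    if ($requested === []) {
        throw new InvalidArgumentException('At least one language is required.');
    }

    $missing = array_diff($requested, $installed);
    if ($missing !== []) {
        throw new RuntimeException('Language data not installed: ' . implode(', ', $missing));
    }

    return implode('+', $requested);
}

// The installed list is hard-coded here purely for illustration.
echo buildLangSpec(['eng', 'spa'], ['eng', 'fra', 'spa', 'osd']), "\n";

// Usage sketch with the wrapper:
// $text = (new TesseractOCR('multilang.png'))->lang(buildLangSpec($wanted, $installed))->run();
```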

6. Converting Images to Searchable PDF with OCR

Sometimes, we need to convert scanned images into searchable PDFs. Tesseract can do this using:

tesseract document.jpg output_pdf pdf

✅ This command generates a searchable PDF (output_pdf.pdf).

To integrate this in PHP:

exec("tesseract document.jpg output_pdf pdf", $output, $exitCode);
echo $exitCode === 0
    ? "Searchable PDF created: output_pdf.pdf"
    : "Tesseract failed with exit code $exitCode";

✅ Useful for document management systems!
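
In a real application the file names usually come from uploads rather than being hard-coded, so they should be quoted before reaching the shell. Here is a minimal sketch using PHP's escapeshellarg(); buildPdfCommand() and createSearchablePdf() are hypothetical helpers, not part of any library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build the Tesseract command for a searchable PDF,
 * quoting both paths so spaces or shell metacharacters are passed literally.
 */
function buildPdfCommand(string $inputImage, string $outputBase): string
{
    return sprintf(
        'tesseract %s %s pdf',
        escapeshellarg($inputImage),
        escapeshellarg($outputBase)
    );
}

/**
 * Run the command and report success based on Tesseract's exit code.
 */
function createSearchablePdf(string $inputImage, string $outputBase): bool
{
    exec(buildPdfCommand($inputImage, $outputBase) . ' 2>&1', $outputLines, $exitCode);
    return $exitCode === 0;
}

// Usage sketch (needs Tesseract installed and a real image):
// if (createSearchablePdf('uploads/my scan.jpg', 'output_pdf')) {
//     echo "Searchable PDF created: output_pdf.pdf";
// }
```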

7. Preprocessing Images for Better OCR Accuracy

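OCR results improve when images are preprocessed. ImageMagick can upscale an image and reduce it to pure black and white before it reaches Tesseract.

Example: Preprocess Image with ImageMagick

convert input.png -resize 300% -monochrome processed.png

✅ Enhances text clarity before OCR processing. Add -negate only when the source has light text on a dark background, because Tesseract works best with dark text on a light background. On ImageMagick 7, the command is magick instead of convert.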
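
To see why reducing an image to black and white helps, consider a simplified fixed-threshold version of the idea. This is an illustration only: ImageMagick's -monochrome uses its own algorithm, and binarize() below is a hypothetical function operating on a plain array of grayscale values rather than a real image.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: turn a grid of grayscale values (0 = black, 255 = white)
 * into pure black and white using a fixed threshold.
 *
 * @param int[][] $gray
 * @return int[][] grid containing only 0 or 255
 */
function binarize(array $gray, int $threshold = 128): array
{
    $result = [];
    foreach ($gray as $y => $row) {
        foreach ($row as $x => $value) {
            // Values at or above the threshold become white, the rest black.
            $result[$y][$x] = $value >= $threshold ? 255 : 0;
        }
    }
    return $result;
}

// Faint gray text (90) on an off-white page (200) becomes black on white.
$page = [
    [200, 200, 200],
    [200,  90, 200],
];
print_r(binarize($page));
```

After thresholding, every pixel is either text or background, which removes the soft gray edges that tend to confuse character recognition.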

8. Handling Noisy or Blurry Images

If OCR output is still poor after preprocessing, try adjusting Tesseract's page segmentation mode (PSM) and OCR engine mode (OEM):

$text = (new TesseractOCR('noisy_text.jpg'))
    ->psm(6) // Assume a single uniform block of text
    ->oem(1) // Neural network OCR engine
    ->run();

✅ Use psm() and oem() options to improve OCR accuracy.
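
The right PSM value depends on the layout of the image, not on how noisy it is. As a rough guide, the sketch below maps a few layout descriptions to mode numbers taken from tesseract --help-psm; choosePsm() and its layout keywords are hypothetical names chosen for this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical lookup: map a rough description of the image layout to a
 * Tesseract page segmentation mode (values as listed by `tesseract --help-psm`).
 */
function choosePsm(string $layout): int
{
    $modes = [
        'auto'   => 3,  // fully automatic page segmentation (the default)
        'column' => 4,  // single column of text of variable sizes
        'block'  => 6,  // single uniform block of text
        'line'   => 7,  // single text line
        'word'   => 8,  // single word
        'sparse' => 11, // sparse text, in no particular order
    ];

    if (!isset($modes[$layout])) {
        throw new InvalidArgumentException("Unknown layout: $layout");
    }
    return $modes[$layout];
}

// Usage sketch with the wrapper, e.g. for a cropped single-line serial number:
// $text = (new TesseractOCR('serial.png'))->psm(choosePsm('line'))->run();
echo choosePsm('block'), "\n";
```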

9. Storing Extracted Text in a Database

After extracting text, save it to a MySQL database for later use.

Example: Save Extracted Text to MySQL

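// Make mysqli throw exceptions on errors instead of failing silently.
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
// Example credentials only; replace them with your own database account.
$conn = new mysqli("localhost", "root", "", "ocr_db");

$text = (new TesseractOCR('receipt.jpg'))->run();
// A prepared statement keeps the OCR text from being interpreted as SQL.
$stmt = $conn->prepare("INSERT INTO extracted_text (content) VALUES (?)");
$stmt->bind_param("s", $text);
$stmt->execute();

echo "Text stored successfully!";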

✅ This makes OCR data searchable within a web application.
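
Raw OCR output often contains irregular spacing and stray blank lines, which can get in the way of searching. One option is to normalize the text before storing it. The following is a minimal sketch; normalizeOcrText() is a hypothetical helper, and the exact cleanup rules are an assumption you may want to adjust for your documents.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: tidy OCR output before it is stored or indexed.
 */
function normalizeOcrText(string $raw): string
{
    // Unify line endings and drop form feeds (sometimes used as page separators).
    $text = str_replace(["\r\n", "\r"], "\n", $raw);
    $text = str_replace("\f", '', $text);

    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;

    // Remove spaces next to line breaks, then limit blank lines to one.
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    return trim($text);
}

// Usage sketch: clean the text right before binding it to the INSERT statement.
// $text = normalizeOcrText((new TesseractOCR('receipt.jpg'))->run());
```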

10. Automating OCR with a Batch Processor

To process many images, loop over them:

$files = glob("images/*.jpg");

// Create the output directory if it does not exist yet.
if (!is_dir("text_outputs")) {
    mkdir("text_outputs", 0775, true);
}

foreach ($files as $file) {
    $text = (new TesseractOCR($file))->run();
    file_put_contents("text_outputs/" . basename($file) . ".txt", $text);
}

echo "Batch OCR processing completed!";

✅ Processes multiple images at once and saves text to .txt files.
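
In a larger batch, one unreadable image should not abort the whole run. A possible variation is to pass the OCR step in as a callable and collect failures in a report. This is a sketch, not the wrapper's own API: batchOcr() is a hypothetical function, and the callable is assumed to throw an exception when recognition fails.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical batch runner. $ocr is any callable that takes an image path
 * and returns its text, so a failure on one file does not stop the batch.
 *
 * @param string[] $files
 * @return array{ok: string[], failed: array<string, string>}
 */
function batchOcr(array $files, string $outDir, callable $ocr): array
{
    if (!is_dir($outDir) && !mkdir($outDir, 0775, true) && !is_dir($outDir)) {
        throw new RuntimeException("Cannot create output directory: $outDir");
    }

    $report = ['ok' => [], 'failed' => []];
    foreach ($files as $file) {
        try {
            $target = $outDir . '/' . basename($file) . '.txt';
            file_put_contents($target, $ocr($file));
            $report['ok'][] = $target;
        } catch (Throwable $e) {
            // Record the failure and move on to the next image.
            $report['failed'][$file] = $e->getMessage();
        }
    }
    return $report;
}

// Usage sketch with the wrapper from step 2:
// $report = batchOcr(glob('images/*.jpg'), 'text_outputs',
//     fn (string $f): string => (new TesseractOCR($f))->run());
```

Keeping the OCR call behind a callable also makes the loop easy to test without Tesseract installed.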

Best Practices for PHP OCR Integration

✅ Use high-quality images for better OCR results.
✅ Optimize images with ImageMagick before processing.
✅ Use language-specific models for multilingual OCR.
✅ Batch process large datasets efficiently.
✅ Store extracted text in a database for searchability.

Conclusion

Using PHP and Tesseract OCR, we can extract text from scanned documents, receipts, invoices, and multilingual texts. By following best practices, you can improve accuracy and automate OCR tasks.

✅ Tesseract OCR provides powerful text extraction.
✅ PHP makes it easy to integrate OCR into web applications.
✅ Preprocessing images enhances OCR accuracy.

With this powerful PHP OCR system, you can automate text recognition from images for your applications. 🚀
