How to Convert PDF to Text in PHP

Introduction

Extracting text from PDFs is crucial for data analysis, search indexing, and automated document processing. Some PDFs contain searchable text, while others are scanned images that require OCR (Optical Character Recognition).

With PHP, we can:

✅ Extract text from searchable PDFs using PDFParser
✅ Convert scanned PDFs to text using OCR (Tesseract)
✅ Process multi-page PDFs and extract structured text
✅ Save extracted text to a database for indexing and search

By the end of this guide, you'll be able to extract text from PDFs and process it programmatically in PHP. 🚀

1. Installing Required Libraries (PDFParser and Tesseract OCR)

To extract text from PDFs, we use:

✔ PDFParser (Smalot) – Reads text-based PDFs.
✔ Tesseract OCR – Converts scanned PDFs into text.

Install PDFParser via Composer

composer require smalot/pdfparser

Install Tesseract OCR (for Scanned PDFs)

For Linux (Ubuntu/Debian):

sudo apt install tesseract-ocr

For Windows:

  1. Download Tesseract OCR from GitHub Releases.
  2. Add the Tesseract-OCR install directory to your system's PATH variable.

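✅ PHP can now extract text from text-based PDFs. The OCR examples in sections 5 and 6 also need the Imagick extension (with Ghostscript installed so it can read PDFs) and the thiagoalessio/tesseract_ocr Composer package.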

2. Extracting Text from a Searchable PDF Using PDFParser

If a PDF contains selectable text, we can extract it using PDFParser.

Example: Convert Searchable PDF to Text

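require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile('document.pdf'); // Throws an Exception if the file cannot be parsed

$text = $pdf->getText();
echo nl2br(htmlspecialchars($text)); // Escape HTML, then turn newlines into <br>

Explanation:

✅ parseFile('document.pdf') – Loads and parses the PDF.
✅ getText() – Extracts plain text from all pages.
✅ htmlspecialchars($text) – Escapes characters such as < and & so the text is safe to print in HTML.
✅ nl2br(...) – Converts newlines to <br> tags so line breaks survive in the browser.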

🔹 Best for PDFs generated from word processors (e.g., invoices, reports).

3. Extracting Text from a Specific Page of a PDF

For large PDFs, extract only a specific page.

Example: Extract Text from Page 2

$pages = $pdf->getPages();
$pageText = $pages[1]->getText(); // Page index starts at 0, so [1] is page 2
echo nl2br(htmlspecialchars($pageText));

✅ Useful for processing specific sections of long PDFs.
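
Indexing the page array directly fails when the requested page does not exist, for example when asking for page 2 of a one-page PDF. Here is a minimal sketch of a bounds-checked lookup. The helper name `pageTextByNumber` is hypothetical, and it only assumes that each page object exposes `getText()`, as the pages returned by `getPages()` do.

```php
<?php
/**
 * Return the text of a 1-based page number, or null if the page is missing.
 * Hypothetical helper: $pages is the array returned by $pdf->getPages().
 */
function pageTextByNumber(array $pages, int $pageNumber): ?string
{
    $index = $pageNumber - 1; // the page array is zero-indexed

    if ($index < 0 || !isset($pages[$index])) {
        return null; // page number out of range
    }

    return $pages[$index]->getText();
}
```

With the parser from section 2, a call such as `pageTextByNumber($pdf->getPages(), 2)` returns the text of page 2, or null when the document is shorter than that.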

4. Extracting Metadata from a PDF (Title, Author, Creation Date)

Some PDFs store metadata, which can be extracted.

Example: Extract PDF Metadata

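$details = $pdf->getDetails();
foreach ($details as $key => $value) {
    // Some values (e.g. keyword lists) are arrays, so join them before printing
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo htmlspecialchars("$key: $value") . "<br>";
}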

Sample Output:

Title: Annual Report  
Author: John Doe  
CreationDate: 2023-10-12  

✅ Great for document tracking and management systems.
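
Metadata values are not always plain strings: some entries can be arrays, which print as "Array" when echoed directly. One possible way to normalize them before display or storage is sketched below. The helper name `flattenDetails` is hypothetical; its input is assumed to be the array returned by `getDetails()`.

```php
<?php
/**
 * Flatten metadata values so each one can be printed as a string.
 * Hypothetical helper; $details is the array returned by $pdf->getDetails().
 */
function flattenDetails(array $details): array
{
    $flat = [];
    foreach ($details as $key => $value) {
        if (is_array($value)) {
            // Join nested values; the recursive call handles arrays of arrays
            $value = implode(', ', flattenDetails($value));
        }
        $flat[$key] = (string) $value;
    }
    return $flat;
}
```

The result keeps the original keys, so it can be dropped into the `foreach` loop above in place of `$details`.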

5. Extracting Text from Scanned PDFs Using Tesseract OCR

If a PDF only contains images (scanned documents), OCR (Optical Character Recognition) is needed.
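
A scanned PDF usually shows up as empty or near-empty output from `getText()`. A rough heuristic for deciding when to fall back to OCR could look like the sketch below. The helper name `looksScanned` and the threshold of 20 visible characters per page are assumptions for illustration, not values taken from either library.

```php
<?php
/**
 * Guess whether a PDF is scanned, based on how little text the parser found.
 * Hypothetical helper; the 20-characters-per-page threshold is an assumption.
 */
function looksScanned(string $extractedText, int $pageCount, int $minCharsPerPage = 20): bool
{
    if ($pageCount < 1) {
        return true; // nothing readable, treat as needing OCR
    }

    // Ignore whitespace so blank lines do not count as content
    $visible = preg_replace('/\s+/', '', $extractedText) ?? '';

    return (strlen($visible) / $pageCount) < $minCharsPerPage;
}
```

With the parser from section 2, this could be called as `looksScanned($pdf->getText(), count($pdf->getPages()))`. Tune the threshold against your own documents.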

Example: Convert a Scanned PDF to an Image

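// Requires the Imagick extension and Ghostscript; the global Imagick class needs no "use" statement
$imagick = new Imagick();
$imagick->setResolution(300, 300); // Set before reading; a higher DPI improves OCR accuracy
$imagick->readImage('scanned_document.pdf[0]'); // Extract first page
$imagick->setImageFormat('png');
$imagick->writeImage('scanned_page.png');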

✅ Converts a PDF page into an image for OCR processing.

6. Running OCR on the Extracted Image (Tesseract OCR)

Once the PDF is converted into an image, use Tesseract OCR to extract text.

Example: Extract Text from Image Using OCR

require 'vendor/autoload.php';

use thiagoalessio\TesseractOCR\TesseractOCR;

$text = (new TesseractOCR('scanned_page.png'))->run();
echo "Extracted Text: " . nl2br(htmlspecialchars($text));

Why Use Tesseract?

✔ Works well on clean printed text; handwriting is recognized poorly.
✔ Supports multiple languages (eng, spa, fra, etc.).
✔ Best for processing scanned invoices, receipts, and forms.

🔹 To extract text from multi-page scanned PDFs, repeat for each page.
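
The per-page repetition can be sketched as a loop that rasterizes each page and then runs OCR on the resulting image. In this sketch the function names (`ocrPdfPages`, `rasterizeWithImagick`, `recognizeWithTesseract`), the temporary file naming, the 300 DPI setting, and the `eng` language are all assumptions. The two steps are passed in as callables so the loop itself does not depend on either library.

```php
<?php
/**
 * OCR every page of a scanned PDF and return the text keyed by page number.
 * Hypothetical helper. $rasterize(string $pdfPath, int $pageIndex): string
 * must return an image path; $recognize(string $imagePath): string must
 * return the recognized text.
 */
function ocrPdfPages(string $pdfPath, int $pageCount, callable $rasterize, callable $recognize): array
{
    $textByPage = [];
    for ($i = 0; $i < $pageCount; $i++) {
        $imagePath = $rasterize($pdfPath, $i);        // page -> image
        $textByPage[$i + 1] = $recognize($imagePath); // image -> text
    }
    return $textByPage;
}

/** Render one PDF page to a PNG with Imagick (needs Ghostscript). */
function rasterizeWithImagick(string $pdfPath, int $pageIndex): string
{
    $imagick = new \Imagick();
    $imagick->setResolution(300, 300); // must be set before reading the page
    $imagick->readImage($pdfPath . '[' . $pageIndex . ']');
    $imagick->setImageFormat('png');

    $imagePath = sys_get_temp_dir() . '/ocr_page_' . $pageIndex . '.png';
    $imagick->writeImage($imagePath);
    $imagick->clear();

    return $imagePath;
}

/** Run Tesseract on one image through the thiagoalessio wrapper. */
function recognizeWithTesseract(string $imagePath): string
{
    return (new \thiagoalessio\TesseractOCR\TesseractOCR($imagePath))
        ->lang('eng') // language is an assumption; pass others as needed
        ->run();
}
```

A call such as `ocrPdfPages('scanned_document.pdf', $pageCount, 'rasterizeWithImagick', 'recognizeWithTesseract')` would then return an array of page number to text. The page count has to be supplied by the caller; one option is Imagick's `pingImage()` followed by `getNumberImages()`.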

7. Extracting Text from Multi-Page PDFs Automatically

For multi-page PDFs, loop through each page and extract text dynamically.

Example: Extract Text from Each Page

$pages = $pdf->getPages();
foreach ($pages as $index => $page) {
    echo "Page " . ($index + 1) . ":<br>";
    // Escape the page text before printing it as HTML
    echo nl2br(htmlspecialchars($page->getText())) . "<br><hr>";
}

✅ Ensures structured extraction from PDFs with multiple sections.
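
When the page text is going to be stored or processed further rather than echoed, it can help to collect it into records first. The sketch below is one way to do that. The helper name `pagesToRecords` and the record shape are assumptions, and the page objects only need a `getText()` method.

```php
<?php
/**
 * Collect page texts into a list of ['page' => n, 'text' => ...] records.
 * Hypothetical helper; $pages is the array returned by $pdf->getPages().
 */
function pagesToRecords(array $pages): array
{
    $records = [];
    foreach (array_values($pages) as $index => $page) {
        $records[] = [
            'page' => $index + 1,             // 1-based, for display
            'text' => trim($page->getText()), // drop leading/trailing blank space
        ];
    }
    return $records;
}
```

The resulting array can be passed to `json_encode()` or inserted row by row into a database table.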

8. Storing Extracted PDF Text in a Database

To store extracted text for later searches, save it in MySQL.

Example: Save PDF Text to MySQL Database

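mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // Throw exceptions on database errors
$conn = new mysqli("localhost", "root", "", "pdf_texts"); // Example credentials; use your own
$conn->set_charset("utf8mb4");

$text = $pdf->getText();
$stmt = $conn->prepare("INSERT INTO extracted_text (content) VALUES (?)");
$stmt->bind_param("s", $text);
$stmt->execute();

echo "Text stored successfully!"; // Reached only if no exception was thrown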

✅ Makes PDFs searchable within a web application.
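
The examples use a table named `extracted_text` with a `content` column but never define it. A possible schema is sketched below. Everything beyond those two names (the `id` and `created_at` columns, the index name `ft_content`, the engine and charset) is an assumption. The FULLTEXT index is included because `MATCH ... AGAINST` queries require one.

```php
<?php
/**
 * SQL for a table that can hold the extracted text.
 * Assumed schema: only the table name and the content column come from the examples.
 */
function extractedTextSchema(): string
{
    return 'CREATE TABLE IF NOT EXISTS extracted_text (
        id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        content LONGTEXT NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        FULLTEXT KEY ft_content (content)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4';
}

/** Create the table on an open mysqli connection. */
function createExtractedTextTable(mysqli $conn): void
{
    if ($conn->query(extractedTextSchema()) === false) {
        throw new RuntimeException('Could not create table: ' . $conn->error);
    }
}
```

Calling `createExtractedTextTable($conn)` once, before the first insert, is enough, because of the `IF NOT EXISTS` clause.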

9. Searching Extracted PDF Text in a Database

Once stored, search for keywords in the extracted text. MySQL's MATCH ... AGAINST only works if the content column has a FULLTEXT index.

Example: Search for a Keyword in Extracted Text

$query = "SELECT * FROM extracted_text WHERE MATCH(content) AGAINST('invoice')";
$result = $conn->query($query);

while ($row = $result->fetch_assoc()) {
    // Escape stored text before printing it as HTML
    echo htmlspecialchars($row['content']) . "<br>";
}

✅ Allows full-text search inside stored PDFs!
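
In a real application the keyword usually comes from user input, so it should be bound as a parameter instead of being written into the SQL string. Here is a minimal sketch. The function name `searchExtractedText`, the `score` alias, and the default limit of 20 are assumptions, and `get_result()` requires the mysqlnd driver.

```php
<?php
/**
 * Full-text search with the keyword bound as a parameter.
 * Hypothetical helper; assumes a FULLTEXT index on extracted_text.content.
 */
function searchExtractedText(mysqli $conn, string $keyword, int $limit = 20): array
{
    $sql = 'SELECT content, MATCH(content) AGAINST (?) AS score
            FROM extracted_text
            WHERE MATCH(content) AGAINST (?)
            ORDER BY score DESC
            LIMIT ?';

    $stmt = $conn->prepare($sql);
    if ($stmt === false) {
        throw new RuntimeException('Prepare failed: ' . $conn->error);
    }

    // The keyword is bound twice: once for the score, once for the filter
    $stmt->bind_param('ssi', $keyword, $keyword, $limit);
    $stmt->execute();

    $rows = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
    $stmt->close();

    return $rows;
}
```

Each returned row holds the stored text and its relevance score. Remember to pass `content` through `htmlspecialchars()` before printing it.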

10. Automating PDF to Text Conversion for Multiple PDFs

To process multiple PDFs, use a batch conversion loop.

Example: Extract Text from Multiple PDFs

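$files = glob("pdfs/*.pdf");

foreach ($files as $file) {
    $pdf = $parser->parseFile($file);
    // basename($file, ".pdf") drops the extension: pdfs/report.pdf -> text_outputs/report.txt
    // The text_outputs directory must already exist
    file_put_contents("text_outputs/" . basename($file, ".pdf") . ".txt", $pdf->getText());
}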

echo "Batch PDF text extraction completed!";

✅ Great for bulk processing text-based PDFs. Scanned documents still need the OCR steps from sections 5 and 6.
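
In a large batch, one damaged or encrypted file should not abort the whole run. The sketch below wraps each conversion in a try/catch and returns a report. The function name `convertBatch` and the report layout are assumptions, and the extraction step is passed in as a callable so that either PDFParser or the OCR route can be plugged in.

```php
<?php
/**
 * Convert a list of PDFs to .txt files and report which ones failed.
 * Hypothetical helper. $extract(string $pdfPath): string returns the text,
 * for example: fn($f) => $parser->parseFile($f)->getText()
 */
function convertBatch(array $files, string $outputDir, callable $extract): array
{
    if (!is_dir($outputDir) && !mkdir($outputDir, 0775, true)) {
        throw new RuntimeException("Cannot create $outputDir");
    }

    $report = ['converted' => [], 'failed' => []];
    foreach ($files as $file) {
        // pdfs/report.pdf -> <outputDir>/report.txt
        $target = $outputDir . '/' . pathinfo($file, PATHINFO_FILENAME) . '.txt';
        try {
            file_put_contents($target, $extract($file));
            $report['converted'][] = $target;
        } catch (Throwable $e) {
            // Record the failure and continue with the next file
            $report['failed'][$file] = $e->getMessage();
        }
    }
    return $report;
}
```

After the run, `$report['failed']` maps each problem file to its error message, which makes it easy to retry those files or route them to OCR.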

Best Practices for Converting PDFs to Text in PHP

✔ Use PDFParser for text-based PDFs (fast & accurate).
✔ Use OCR (Tesseract) for scanned PDFs (works best on clean printed text).
✔ Store extracted text in a database for searchability.
✔ Optimize scanned PDFs before OCR processing (increase contrast).
✔ Automate multi-page and multi-file text extraction for efficiency.
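
The contrast tip can be sketched with Imagick: convert the page image to grayscale and stretch its contrast before handing it to Tesseract. The helper name `prepareForOcr` and the choice of these two operations are assumptions. Which steps help most depends on the scans, so treat this as a starting point.

```php
<?php
/**
 * Prepare a page image for OCR: grayscale, then stretch the contrast.
 * Hypothetical helper; requires the Imagick extension.
 */
function prepareForOcr(string $inputPath, string $outputPath): void
{
    $image = new \Imagick($inputPath);
    $image->transformImageColorspace(\Imagick::COLORSPACE_GRAY); // drop color information
    $image->normalizeImage();      // stretch contrast across the full range
    $image->setImageFormat('png');
    $image->writeImage($outputPath);
    $image->clear();
}
```

For example, `prepareForOcr('scanned_page.png', 'scanned_page_clean.png')` could run between the rasterizing step in section 5 and the OCR step in section 6.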

Conclusion

With PDFParser and Tesseract OCR, you can extract text from PDFs, process scanned documents, and automate text conversion in PHP.

✅ Extract text from standard PDFs effortlessly.
✅ Convert scanned PDFs into text using OCR.
✅ Store and search extracted text in a database.
✅ Automate PDF processing for bulk documents.

By implementing these techniques, you can build powerful document management and text extraction systems in PHP! 🚀