Extracting Text from PDFs in PHP Using Smalot's PDFParser

Introduction

Extracting text from PDF files in PHP is useful for data analysis, search indexing, document processing, and automation. PDFs may contain searchable text or scanned images, and each requires a different extraction technique.

In this guide, we will cover:

✅ Installing PDFParser (Smalot) for text extraction
✅ Extracting text from normal (searchable) PDFs
✅ Handling structured data (tables, paragraphs, and metadata)
✅ Extracting text from scanned PDFs using OCR (Tesseract OCR)

By the end, you'll have a fully functional PHP solution to extract text from PDFs dynamically. 🚀

1. Installing PDFParser in PHP

PDFParser (smalot/pdfparser) is a standalone PHP library, published under the Smalot vendor name, for parsing PDF files and extracting their text and metadata.

Install PDFParser via Composer

composer require smalot/pdfparser

Include it in your PHP script:

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

✅ PDFParser is now ready for use!

2. Extracting Plain Text from a PDF in PHP

Let’s extract text from a basic, searchable PDF file.

Example: Extract Text from a PDF File

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile('document.pdf');

$text = $pdf->getText();
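echo nl2br(htmlspecialchars($text)); // Escape for HTML, then turn newlines into <br> tags

Explanation:

✅ parseFile('document.pdf') loads and parses the PDF.
✅ getText() extracts plain text from the whole document.
✅ nl2br(htmlspecialchars($text)) escapes the text for HTML output and converts newlines to <br> tags, so line breaks show up in the browser.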

🔹 This works best with searchable PDFs (not scanned ones).
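
A quick way to notice that a PDF is probably scanned is to check how little text came back. The sketch below is a heuristic, not part of PDFParser: the function name looksScanned and the 20-characters-per-page threshold are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): guesses whether a PDF
// is scanned from how little text was extracted per page.
function looksScanned(string $text, int $pageCount, int $minCharsPerPage = 20): bool
{
    // Count only non-whitespace bytes; byte length is good enough for a heuristic.
    $visible = preg_replace('/\s+/', '', $text) ?? '';

    // Treat a document with zero reported pages as having one page.
    return strlen($visible) < $minCharsPerPage * max(1, $pageCount);
}
```

With the objects from the example above, this could be called as looksScanned($pdf->getText(), count($pdf->getPages())). A true result suggests falling back to OCR.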

3. Extracting Text from a Specific Page

For multi-page PDFs, extract text from a specific page.

Example: Extract Text from Page 2

$pages = $pdf->getPages();
if (isset($pages[1])) { // Page index starts at 0, so 1 is the second page
    echo nl2br(htmlspecialchars($pages[1]->getText()));
}

✅ Useful when you only need part of a large PDF. Note that parseFile() still parses the whole document; only the text extraction is limited to the chosen page.
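
When several consecutive pages are needed, the zero-based index is easy to get wrong. Here is a minimal sketch of a range helper that takes 1-based page numbers. The name extractPageRange is hypothetical, and it only assumes that each element of the array has a getText() method, as the objects returned by getPages() do.

```php
<?php
// Hypothetical helper: returns the text of 1-based pages $from..$to,
// keyed by page number. $pages is the array returned by $pdf->getPages().
function extractPageRange(array $pages, int $from, int $to): array
{
    $pages = array_values($pages); // make sure indexes run 0..n-1
    $count = count($pages);

    if ($from < 1 || $to < $from || $to > $count) {
        throw new OutOfRangeException(
            "Pages $from-$to requested, but the document has $count page(s)."
        );
    }

    $texts = [];
    for ($n = $from; $n <= $to; $n++) {
        // Convert the 1-based page number to the 0-based array index.
        $texts[$n] = $pages[$n - 1]->getText();
    }

    return $texts;
}
```

For example, extractPageRange($pdf->getPages(), 2, 4) would return an array with keys 2, 3, and 4.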

4. Extracting Metadata from a PDF

PDFs store metadata like title, author, and creation date.

Example: Extract PDF Metadata

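$details = $pdf->getDetails();
foreach ($details as $key => $value) {
    if (is_array($value)) {
        $value = implode(', ', $value); // Some properties hold several values
    }
    echo htmlspecialchars("$key: $value") . "<br>";
}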

Sample Output:

Title: Annual Report  
Author: John Doe  
CreationDate: 2023-10-12

✅ Great for document tracking and management systems.
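
If the metadata is going into a database or an API response rather than straight to the page, it helps to normalize it first. The following is a sketch under the assumption that values may be scalars or arrays; flattenDetails is a made-up name, not a PDFParser method, and the arrow function needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper: turns the array from $pdf->getDetails() into a flat
// array of trimmed strings that is safe to print or store.
function flattenDetails(array $details): array
{
    $flat = [];
    foreach ($details as $key => $value) {
        if (is_array($value)) {
            // Join list-like values; anything nested is JSON-encoded.
            $value = implode(', ', array_map(
                fn($v) => is_scalar($v) ? (string) $v : json_encode($v),
                $value
            ));
        }
        $flat[(string) $key] = trim((string) $value);
    }

    return $flat;
}
```

Calling flattenDetails($pdf->getDetails()) gives one string per property, which can then be passed to json_encode() or bound to a prepared statement.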

5. Handling Structured Data (Tables and Paragraphs)

Extracting tables from PDFs is tricky because a PDF stores positioned text fragments, not rows and columns, so the extracted text often loses the table layout.

Example: Extract Text with Line Breaks

$text = preg_replace('/\n{2,}/', "\n", $pdf->getText()); // Remove excessive line breaks
echo nl2br($text);

✅ Collapses runs of blank lines so the output is more compact. It does not rebuild columns, so table rows still come out as plain lines of text.
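
For simple tables, the extracted lines can sometimes be split back into cells. The sketch below assumes that columns are separated by a tab or by two or more spaces in the extracted text, which depends entirely on how the PDF was produced; splitTableRows is a hypothetical name.

```php
<?php
// Hypothetical helper: splits extracted text into rows of cells, treating a
// tab or a run of two or more spaces as a column gap. This is a heuristic
// and will not work for every PDF.
function splitTableRows(string $text): array
{
    $rows = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        $rows[] = preg_split('/\t+| {2,}/', $line);
    }

    return $rows;
}
```

Each row comes back as an array of cell strings. Cells that contain single spaces, such as "Annual Report", stay intact.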

6. Extracting Text from Scanned PDFs (Using OCR)

Scanned PDFs contain images, not text, requiring OCR (Optical Character Recognition) to extract text.

Install Tesseract OCR for PHP

composer require thiagoalessio/tesseract_ocr
sudo apt install tesseract-ocr

Example: Extract Text from a Scanned PDF

use thiagoalessio\TesseractOCR\TesseractOCR;

$text = (new TesseractOCR('scanned_page.png'))->run();
echo $text;

✅ Tesseract reads images, not PDF files, so convert each PDF page to an image first (scanned_page.png above is one such page).
✅ Then run OCR on each page image separately and join the results.
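
The convert-then-OCR flow could look roughly like this. It is a sketch with several assumptions that the steps above do not cover: the pdftoppm command-line tool (from poppler-utils) is installed, vendor/autoload.php has already been required, and the function names ocrScannedPdf and sortPageImages are made up for this example.

```php
<?php
use thiagoalessio\TesseractOCR\TesseractOCR;

// Hypothetical helper: orders page images by the number before ".png",
// so "page-10.png" comes after "page-2.png".
function sortPageImages(array $files): array
{
    $pageNumber = function (string $file): int {
        return preg_match('/-(\d+)\.png$/', $file, $m) ? (int) $m[1] : 0;
    };
    usort($files, fn($a, $b) => $pageNumber($a) <=> $pageNumber($b));

    return $files;
}

// Hypothetical helper: renders every page to PNG, then runs OCR page by page.
// Assumes pdftoppm is on the PATH and $workDir is a writable directory.
function ocrScannedPdf(string $pdfPath, string $workDir): string
{
    $prefix = rtrim($workDir, '/') . '/page';

    // Render each page as a 300 DPI PNG named page-1.png, page-2.png, ...
    // (pdftoppm zero-pads the number for longer documents).
    $cmd = sprintf(
        'pdftoppm -r 300 -png %s %s',
        escapeshellarg($pdfPath),
        escapeshellarg($prefix)
    );
    exec($cmd, $output, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("pdftoppm failed with exit code $exitCode");
    }

    $pages = [];
    foreach (sortPageImages(glob($prefix . '-*.png') ?: []) as $image) {
        $pages[] = (new TesseractOCR($image))->run();
    }

    return implode("\n\n", $pages);
}
```

A call such as ocrScannedPdf('scanned.pdf', sys_get_temp_dir()) would return the recognized text of all pages, separated by blank lines. The page images are left in the work directory, so clean them up once you are done.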

7. Extracting Tables from PDFs Using Tabula

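PDFParser returns table contents as plain lines of text. Use Tabula, a Java-based tool with a Python wrapper (tabula-py), to extract structured tables.

Install tabula-py (Python and Java Required)

pip install tabula-py

Extract Tables to CSV (Command Line)

python -c "import tabula; tabula.convert_into('document.pdf', 'table.csv', output_format='csv', pages='all')"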

✅ Tabula extracts structured tables better than PHP alone.
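
Once Tabula has written table.csv, the data can be pulled back into PHP. Below is a minimal sketch that assumes the first CSV line holds the column names; readCsvTable is a hypothetical helper, and only built-in PHP functions are used.

```php
<?php
// Hypothetical helper: loads a CSV file (such as the table.csv written by
// Tabula) into associative rows, using the first line as column names.
function readCsvTable(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $header = null;
    $rows = [];
    while (($cells = fgetcsv($handle, 0, ',', '"', '')) !== false) {
        if ($cells === [null]) {
            continue; // fgetcsv returns [null] for a blank line
        }
        if ($header === null) {
            $header = $cells; // first non-blank line holds the column names
            continue;
        }
        // Pad or trim ragged rows so array_combine gets matching lengths.
        $cells = array_slice(array_pad($cells, count($header), ''), 0, count($header));
        $rows[] = array_combine($header, $cells);
    }
    fclose($handle);

    return $rows;
}
```

For instance, readCsvTable('table.csv') returns an array in which each row is keyed by column name, ready for json_encode() or a database insert.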

8. Converting Extracted Text to JSON Format

For API integrations, store extracted text in JSON format.

Example: Convert PDF Text to JSON

$textArray = explode("\n", $pdf->getText()); // Convert to array
$jsonData = json_encode($textArray, JSON_PRETTY_PRINT | JSON_INVALID_UTF8_SUBSTITUTE); // Replace invalid UTF-8 bytes instead of failing

file_put_contents("text_output.json", $jsonData);
echo "Text saved as JSON!";

✅ Useful for document processing APIs and storage.
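
An API consumer usually wants more structure than a flat list of lines. As a sketch, the text could be grouped per page instead. The field names (source, pageCount, pages) and the function name pagesToJson are assumptions for illustration, and the flags require PHP 7.3 or later.

```php
<?php
// Hypothetical helper: builds a JSON document with one entry per page.
// $pageTexts is a list of strings, e.g. collected from each page's getText().
function pagesToJson(string $source, array $pageTexts): string
{
    $pages = [];
    foreach (array_values($pageTexts) as $i => $text) {
        $pages[] = ['page' => $i + 1, 'text' => trim($text)];
    }

    return json_encode(
        ['source' => $source, 'pageCount' => count($pages), 'pages' => $pages],
        // Throw on failure, and substitute invalid UTF-8 bytes rather than abort.
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE
            | JSON_INVALID_UTF8_SUBSTITUTE | JSON_THROW_ON_ERROR
    );
}
```

The page texts can be collected with a loop over $pdf->getPages() and passed in as the second argument.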

9. Storing Extracted Text in a MySQL Database

Save extracted text into a database for searching and retrieval.

Example: Store Extracted Text in MySQL

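mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // Throw exceptions on database errors
$conn = new mysqli("localhost", "root", "", "documents_db"); // Replace with your own credentials
$conn->set_charset("utf8mb4"); // Store non-ASCII text correctly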

$text = $pdf->getText();
$stmt = $conn->prepare("INSERT INTO extracted_text (content) VALUES (?)");
$stmt->bind_param("s", $text);
$stmt->execute();

echo "Text stored successfully!";

✅ Keeps the text of each PDF in one place for later search and retrieval (fast keyword search needs a FULLTEXT index on the content column).

10. Searching Text Inside Extracted PDFs

Once stored in a database, search for keywords inside PDFs.

Example: Full-Text Search in MySQL

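ALTER TABLE extracted_text ADD FULLTEXT(content); -- Run once: MATCH ... AGAINST requires a FULLTEXT index
SELECT * FROM extracted_text WHERE MATCH(content) AGAINST('invoice');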

✅ Efficient way to search for specific content within PDFs.
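
When the keywords come from user input, it is safer to build the search expression in PHP and bind it as a parameter. The sketch below prepares a string for MySQL's boolean full-text mode in which every word is required; buildBooleanQuery is a hypothetical helper, and it assumes the content column has a FULLTEXT index.

```php
<?php
// Hypothetical helper: turns user keywords into a MySQL BOOLEAN MODE
// expression in which every word is required,
// e.g. ['invoice', '2023'] becomes '+invoice +2023'.
function buildBooleanQuery(array $terms): string
{
    $parts = [];
    foreach ($terms as $term) {
        // Strip characters that act as operators in boolean full-text mode.
        $clean = trim(preg_replace('/[+\-<>()~*"@]+/', ' ', (string) $term));
        foreach (preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $parts[] = '+' . $word;
        }
    }

    return implode(' ', $parts);
}
```

The result would then be bound to a prepared statement such as SELECT * FROM extracted_text WHERE MATCH(content) AGAINST(? IN BOOLEAN MODE), using bind_param() as in the insert example.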

Best Practices for Extracting Text from PDFs in PHP

✅ Use PDFParser for extracting text from standard PDFs.
✅ Use OCR (Tesseract) for scanned PDFs.
✅ Preprocess text (remove extra spaces, new lines).
✅ Store extracted text in JSON or a database for easy retrieval.
✅ Use Tabula for extracting structured tables from PDFs.

Conclusion

Extracting text from PDFs in PHP is critical for document automation, search engines, and text processing.

✅ PDFParser is a solid pure-PHP choice for extracting text from searchable PDFs.
✅ Tesseract OCR helps process scanned PDFs.
✅ Tabula is ideal for extracting tables from PDFs.

With these techniques, you can build powerful PDF text processing solutions in PHP. 🚀