Introduction
Extracting text from PDFs is crucial for data analysis, search indexing, and automated document processing. Some PDFs contain searchable text, while others are scanned images that require OCR (Optical Character Recognition).
With PHP, we can:
- Extract text from searchable PDFs using PDFParser
- Convert scanned PDFs to text using OCR (Tesseract)
- Process multi-page PDFs and extract structured text
- Save extracted text to a database for indexing and search
By the end of this guide, you'll be able to extract text from PDFs and process it programmatically in PHP.
1. Installing Required Libraries (PDFParser and Tesseract OCR)
To extract text from PDFs, we use:
- PDFParser (Smalot): reads text-based PDFs.
- Tesseract OCR: converts scanned PDFs into text.
Install PDFParser via Composer
composer require smalot/pdfparser
Install Tesseract OCR (for Scanned PDFs)
For Linux (Ubuntu/Debian):
sudo apt install tesseract-ocr
For Windows:
- Download Tesseract OCR from GitHub Releases.
- Add the Tesseract-OCR installation folder to your system's PATH variable.
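Now, PHP is ready to extract text from searchable PDFs. The OCR steps in sections 5 and 6 also rely on the Imagick extension and the thiagoalessio/tesseract_ocr Composer package.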
2. Extracting Text from a Searchable PDF Using PDFParser
If a PDF contains selectable text, we can extract it using PDFParser.
Example: Convert Searchable PDF to Text
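require 'vendor/autoload.php';
use Smalot\PdfParser\Parser;
$parser = new Parser();
$pdf = $parser->parseFile('document.pdf'); // Load and parse the PDF
$text = $pdf->getText(); // Plain text of all pages
// Escape HTML special characters, then turn line breaks into <br> tags
echo nl2br(htmlspecialchars($text));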
Explanation:
- parseFile('document.pdf'): loads the PDF.
- getText(): extracts plain text.
- nl2br($text): converts newlines to <br> tags so line breaks survive in HTML output.
Best for PDFs generated from word processors (e.g., invoices, reports).
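Text pulled out of a PDF often carries stray spaces and extra blank lines left over from the page layout. Before indexing it, a cleanup pass along the following lines may help. This is only a minimal sketch: normalizeExtractedText() is a hypothetical helper, not part of PDFParser, and the sample string stands in for the result of $pdf->getText().

```php
<?php
// Hypothetical helper (not part of PDFParser): tidy text returned by getText().
function normalizeExtractedText(string $text): string
{
    // Unify Windows and old Mac line endings to "\n"
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Remove the space left before or after each line break
    $text = preg_replace('/ ?\n ?/', "\n", $text);
    // Allow at most one blank line between paragraphs
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}

// Stand-in string; in practice this would be $pdf->getText()
$raw = "Invoice   #42 \r\n\r\n\r\n\r\nTotal:\t 100 USD  ";
echo normalizeExtractedText($raw) . PHP_EOL;
```

The rules here are deliberately conservative; whether to keep blank lines at all depends on how the text will be used later.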
3. Extracting Text from a Specific Page of a PDF
For large PDFs, extract only a specific page.
Example: Extract Text from Page 2
$pages = $pdf->getPages();
// Page indexes start at 0, so page 2 is $pages[1]
if (isset($pages[1])) {
    echo nl2br(htmlspecialchars($pages[1]->getText()));
}
Useful for processing specific sections of long PDFs.
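Because readers think in page numbers that start at 1 while the pages array starts at 0, a small lookup helper can avoid off-by-one mistakes and missing-page errors. The sketch below is illustrative only: textOfPage() is a hypothetical function, and $pageTexts is assumed to hold the getText() result of each page.

```php
<?php
// Hypothetical helper: return the text of a 1-based page number, or null when
// the document has no such page. $pageTexts stands in for the getText()
// result of each entry returned by $pdf->getPages().
function textOfPage(array $pageTexts, int $pageNumber): ?string
{
    $index = $pageNumber - 1; // Pages are numbered from 1, arrays from 0
    return $pageTexts[$index] ?? null;
}

$pageTexts = ['Cover page', 'Table of contents', 'Chapter 1'];
echo textOfPage($pageTexts, 2) . PHP_EOL;                  // Second page
echo (textOfPage($pageTexts, 9) ?? '(no such page)') . PHP_EOL;
```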
4. Extracting Metadata from a PDF (Title, Author, Creation Date)
Some PDFs store metadata, which can be extracted.
Example: Extract PDF Metadata
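$details = $pdf->getDetails();
foreach ($details as $key => $value) {
    // Some properties come back as arrays; flatten them before printing
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo "$key: $value <br>";
}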
Sample Output:
Title: Annual Report
Author: John Doe
CreationDate: 2023-10-12
Great for document tracking and management systems.
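A tracking system usually needs only a few metadata fields, and not every PDF provides all of them. One possible way to pick fields with a fallback is sketched below. summarizeDetails() is a hypothetical helper, and the sample array (including the Keywords and Subject fields) is made-up data shaped like the output above, not something every PDF will contain.

```php
<?php
// Hypothetical helper: pick selected metadata fields from the array returned
// by getDetails(), flattening array values and filling gaps with a default.
function summarizeDetails(array $details, array $fields, string $default = 'n/a'): array
{
    $summary = [];
    foreach ($fields as $field) {
        $value = $details[$field] ?? $default;
        if (is_array($value)) {
            $value = implode(', ', $value);
        }
        $summary[$field] = (string) $value;
    }
    return $summary;
}

// Stand-in data; real PDFs expose different sets of properties
$details = [
    'Title'    => 'Annual Report',
    'Author'   => 'John Doe',
    'Keywords' => ['finance', '2023'],
];
print_r(summarizeDetails($details, ['Title', 'Keywords', 'Subject']));
```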
5. Extracting Text from Scanned PDFs Using Tesseract OCR
If a PDF only contains images (scanned documents), OCR (Optical Character Recognition) is needed.
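In practice you often do not know in advance whether a file is scanned. A rough heuristic is to try normal extraction first and fall back to OCR when almost no text comes back. The sketch below assumes that an image-only PDF yields an empty or near-empty string; needsOcr() and its 20-character threshold are hypothetical choices, not part of any library.

```php
<?php
// Hypothetical heuristic: treat a PDF as scanned when its text layer is
// (almost) empty. The 20-character threshold is an arbitrary assumption.
function needsOcr(string $extractedText, int $minChars = 20): bool
{
    // Ignore whitespace so pages made only of line breaks do not count as text
    $visible = preg_replace('/\s+/', '', $extractedText);
    return strlen($visible) < $minChars;
}

// Stand-in strings; in practice pass the result of $pdf->getText()
var_dump(needsOcr("\n \n"));
var_dump(needsOcr("Annual Report 2023: revenue grew in every region."));
```

A threshold that is too high would send short but valid documents through OCR, so it is worth tuning against your own files.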
Example: Convert a Scanned PDF to an Image
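// Requires the Imagick extension (and Ghostscript for reading PDFs)
$imagick = new Imagick();
$imagick->setResolution(300, 300); // Render at 300 DPI; must be set before readImage()
$imagick->readImage('scanned_document.pdf[0]'); // Extract first page
$imagick->setImageFormat('png');
$imagick->writeImage('scanned_page.png');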
Converts a PDF page into an image for OCR processing.
6. Running OCR on the Extracted Image (Tesseract OCR)
Once the PDF is converted into an image, use Tesseract OCR to extract text.
Example: Extract Text from Image Using OCR
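// Requires: composer require thiagoalessio/tesseract_ocr
require 'vendor/autoload.php';
use thiagoalessio\TesseractOCR\TesseractOCR;
$text = (new TesseractOCR('scanned_page.png'))->run();
echo "Extracted Text: " . nl2br(htmlspecialchars($text));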
Why Use Tesseract?
- Works best with clean, printed text; results on handwriting are much less reliable.
- Supports multiple languages (eng, spa, fra, etc.).
- Best for processing scanned invoices, receipts, and forms.
To extract text from multi-page scanned PDFs, repeat for each page.
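The PHP wrapper ultimately calls the tesseract command-line program, so the same step can be expressed as a plain shell command. The sketch below builds such a command for one or more languages, using the CLI form tesseract <image> stdout -l <languages>. buildTesseractCommand() is a hypothetical helper, and the actual call is left commented out because it needs Tesseract and the language data installed.

```php
<?php
// Hypothetical helper: build a shell command that prints OCR text to stdout.
function buildTesseractCommand(string $imagePath, array $languages = ['eng']): string
{
    // Tesseract accepts several languages joined with "+", e.g. "eng+spa"
    $lang = implode('+', $languages);
    // escapeshellarg() guards against spaces and shell metacharacters
    return 'tesseract ' . escapeshellarg($imagePath) . ' stdout -l ' . escapeshellarg($lang);
}

$command = buildTesseractCommand('scanned_page.png', ['eng', 'spa']);
echo $command . PHP_EOL;
// $text = shell_exec($command); // Uncomment on a machine with Tesseract installed
```

For a multi-page scan, the same command would be built once per page image.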
7. Extracting Text from Multi-Page PDFs Automatically
For multi-page PDFs, loop through each page and extract text dynamically.
Example: Extract Text from Each Page
$pages = $pdf->getPages();
foreach ($pages as $index => $page) {
    echo "Page " . ($index + 1) . ":<br>";
    // Escape the text so characters like < and & display correctly in HTML
    echo nl2br(htmlspecialchars($page->getText())) . "<br><hr>";
}
Ensures structured extraction from PDFs with multiple sections.
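If the per-page text is meant for later processing rather than direct display, it can be more convenient to collect it as records that keep the page number. The following sketch shows one possible shape. pagesToRecords() is a hypothetical helper, and the input strings stand in for the getText() result of each page.

```php
<?php
// Hypothetical helper: turn per-page text into records that keep the page number.
function pagesToRecords(array $pageTexts): array
{
    $records = [];
    foreach (array_values($pageTexts) as $index => $text) {
        $records[] = ['page' => $index + 1, 'text' => trim($text)];
    }
    return $records;
}

// Stand-in strings; in practice collect $page->getText() for each page
$records = pagesToRecords(["Summary\n", " Details "]);
echo json_encode($records) . PHP_EOL;
```

Records like these can be stored one row per page, which lets a search result point to the page where a match occurred.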
8. Storing Extracted PDF Text in a Database
To store extracted text for later searches, save it in MySQL.
Example: Save PDF Text to MySQL Database
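// Connection details are placeholders; use your own credentials
$conn = new mysqli("localhost", "root", "", "pdf_texts");
$text = $pdf->getText();
// A prepared statement keeps the extracted text from breaking the query
$stmt = $conn->prepare("INSERT INTO extracted_text (content) VALUES (?)");
$stmt->bind_param("s", $text);
if ($stmt->execute()) {
    echo "Text stored successfully!";
}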
Makes PDFs searchable within a web application.
9. Searching Extracted PDF Text in a Database
Once stored, search for keywords in the extracted text.
Example: Search for a Keyword in Extracted Text
// MATCH ... AGAINST requires a FULLTEXT index on the content column
$query = "SELECT * FROM extracted_text WHERE MATCH(content) AGAINST('invoice')";
$result = $conn->query($query);
while ($row = $result->fetch_assoc()) {
    echo nl2br(htmlspecialchars($row['content'])) . "<br>";
}
Allows full-text search inside stored PDFs!
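The queries in sections 8 and 9 assume a table named extracted_text with a content column, and the search only works if that column has a FULLTEXT index. The sketch below shows one possible table definition and a search that binds the keyword as a parameter, which matters once the keyword comes from user input. The id column, the index name, and searchExtractedText() are illustrative assumptions, not taken from the original setup.

```php
<?php
// Assumed schema for the extracted_text table. Only the content column
// appears in the queries above; the id column is an illustrative extra.
const EXTRACTED_TEXT_SCHEMA = <<<SQL
CREATE TABLE IF NOT EXISTS extracted_text (
    id INT AUTO_INCREMENT PRIMARY KEY,
    content LONGTEXT NOT NULL,
    FULLTEXT KEY content_fulltext (content)
)
SQL;

// Hypothetical helper: run a full-text search with a bound keyword so that
// user input never ends up inside the SQL string.
function searchExtractedText(mysqli $conn, string $keyword): array
{
    $stmt = $conn->prepare(
        "SELECT content FROM extracted_text WHERE MATCH(content) AGAINST(?)"
    );
    $stmt->bind_param("s", $keyword);
    $stmt->execute();
    // get_result() requires the mysqlnd driver
    return $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
}

echo EXTRACTED_TEXT_SCHEMA . PHP_EOL;
// $rows = searchExtractedText($conn, 'invoice'); // Needs a live MySQL connection
```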
10. Automating PDF to Text Conversion for Multiple PDFs
To process multiple PDFs, use a batch conversion loop.
Example: Extract Text from Multiple PDFs
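// $parser is the Parser instance created in section 2
$files = glob("pdfs/*.pdf");
foreach ($files as $file) {
    $pdf = $parser->parseFile($file);
    // basename($file, ".pdf") drops the extension: report.txt, not report.pdf.txt
    file_put_contents("text_outputs/" . basename($file, ".pdf") . ".txt", $pdf->getText());
}
echo "Batch PDF text extraction completed!";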
Great for bulk processing text-based PDFs; scanned files still need the OCR steps from sections 5 and 6.
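When many files are processed, it helps to derive the output file name in one place so that odd extensions or trailing slashes do not produce surprising paths. The following is a small sketch under that assumption; outputPathFor() is a hypothetical helper and the file names are made up.

```php
<?php
// Hypothetical helper: map "pdfs/report.pdf" to "text_outputs/report.txt".
function outputPathFor(string $pdfPath, string $outputDir): string
{
    // PATHINFO_FILENAME drops both the directory and the final extension
    $name = pathinfo($pdfPath, PATHINFO_FILENAME);
    return rtrim($outputDir, '/') . '/' . $name . '.txt';
}

echo outputPathFor('pdfs/report.pdf', 'text_outputs/') . PHP_EOL;
echo outputPathFor('pdfs/q3.summary.PDF', 'text_outputs') . PHP_EOL;
```

In a real batch run it is also worth wrapping the parsing call for each file in a try/catch block, so that one damaged PDF does not stop the whole loop.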
Best Practices for Converting PDFs to Text in PHP
- Use PDFParser for text-based PDFs (fast and accurate).
- Use OCR (Tesseract) for scanned PDFs; expect the best results on clean, printed text.
- Store extracted text in a database for searchability.
- Optimize scanned PDFs before OCR processing (increase contrast).
- Automate multi-page and multi-file text extraction for efficiency.
Conclusion
With PDFParser and Tesseract OCR, you can extract text from PDFs, process scanned documents, and automate text conversion in PHP.
- Extract text from standard PDFs effortlessly.
- Convert scanned PDFs into text using OCR.
- Store and search extracted text in a database.
- Automate PDF processing for bulk documents.
By implementing these techniques, you can build powerful document management and text extraction systems in PHP!