How to Convert PDF to HTML in PHP

Introduction

Converting PDF to HTML allows for better document accessibility, searchability, and responsive display on websites. This is useful for:

Displaying PDFs as interactive HTML without requiring downloads
Extracting text, images, and formatting for web-based processing
Enhancing SEO by making PDFs readable by search engines

In this guide, we'll cover:

Extracting text from PDFs and converting it to HTML
Extracting and embedding images from PDFs into HTML
Handling both text-based and scanned PDFs
Generating searchable web pages from PDF documents

By the end, you'll have the building blocks of an automated PDF-to-HTML conversion pipeline in PHP. 🚀

1. Installing Required Libraries (PDFParser, FPDI, and Imagick)

To extract content from PDFs, we use:

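PDFParser (Smalot) – Extracts text from PDFs
FPDI – Imports pages from existing PDFs into new PDF documents (it builds on FPDF or TCPDF, and the examples in this guide do not call it directly)
Imagick – Renders PDF pages as images (PHP extension; needs ImageMagick and Ghostscript to read PDFs)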

Install via Composer (Recommended)

composer require smalot/pdfparser
composer require setasign/fpdi
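Imagick is a PHP extension rather than a Composer package, so install it through PECL and enable it in php.ini:

pecl install imagick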

Include Libraries in Your PHP Script

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;
use setasign\Fpdi\Fpdi;
// Imagick is a global class provided by the extension, so it needs no use statement.

Now, your system is ready to convert PDFs into HTML.
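
Before running the examples, it can help to confirm that the dependencies are actually available. The following is a minimal sketch of such a check; missingDependencies() is a hypothetical helper written for this guide, not part of any of the libraries, and it assumes the script sits next to Composer's vendor directory.

```php
<?php
// Sketch: report which extensions and classes used in this guide are missing.
// missingDependencies() is a hypothetical helper, not a library function.

function missingDependencies(array $extensions, array $classes): array
{
    $missing = [];
    foreach ($extensions as $ext) {
        // extension_loaded() checks for compiled or loaded PHP extensions.
        if (!extension_loaded($ext)) {
            $missing[] = "extension: $ext";
        }
    }
    foreach ($classes as $class) {
        // class_exists() triggers the autoloader, so Composer packages are found.
        if (!class_exists($class)) {
            $missing[] = "class: $class";
        }
    }
    return $missing;
}

// Load Composer's autoloader if it is present (assumed location).
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require __DIR__ . '/vendor/autoload.php';
}

$missing = missingDependencies(['imagick'], ['Smalot\PdfParser\Parser']);

echo ($missing ? 'Missing: ' . implode(', ', $missing) : 'All dependencies found.') . PHP_EOL;
```

If the script reports a missing extension, install it with PECL; a missing class usually means a Composer package has not been installed yet.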

2. Extracting Text from a PDF and Converting to HTML

If a PDF contains searchable text, we can extract and display it as HTML.

Example: Convert PDF Text to HTML

$parser = new Parser();
$pdf = $parser->parseFile('document.pdf');

// Escape HTML special characters first, then preserve line breaks
$text = nl2br(htmlspecialchars($pdf->getText(), ENT_QUOTES, 'UTF-8'));
$html = "<html><body><h1>Extracted PDF Content</h1><p>$text</p></body></html>";

file_put_contents('converted.html', $html);

echo "PDF converted to HTML successfully!";

Explanation:

Extracts text from a PDF and converts it to HTML format
Uses nl2br() to preserve original line breaks
Saves the HTML output as converted.html

🔹 Works best with text-based PDFs (not scanned ones).
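
Extracted text often contains characters such as < and &, and putting everything in a single paragraph loses the document's structure. As a rough sketch, the helper below escapes the text and turns blank-line-separated blocks into separate paragraphs. textToHtmlParagraphs() is a hypothetical name, and treating blank lines as paragraph breaks is an assumption that will not hold for every PDF.

```php
<?php
// Sketch: turn plain extracted text into escaped HTML paragraphs.
// textToHtmlParagraphs() is a hypothetical helper for illustration.

function textToHtmlParagraphs(string $text): string
{
    // Normalize Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Assumption: one or more blank lines separate paragraphs.
    $blocks = preg_split('/\n\s*\n/', trim($text));

    $html = '';
    foreach ($blocks as $block) {
        $block = trim($block);
        if ($block === '') {
            continue;
        }
        // Escape first, then convert the remaining single line breaks.
        $escaped = htmlspecialchars($block, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
        $html .= '<p>' . nl2br($escaped) . "</p>\n";
    }
    return $html;
}

echo textToHtmlParagraphs("Total < 100 & rising\nsecond line\n\nNext paragraph");
```

The result can replace the single <p>$text</p> in the example above, for instance by passing $pdf->getText() to the helper.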

3. Extracting Images from a PDF and Embedding in HTML

Many PDFs include logos, charts, and illustrations that need to be preserved in the HTML version.

Example: Extract Images from PDF and Embed in HTML

$imagick = new Imagick();
$imagick->setResolution(300, 300);
$imagick->readImage('document.pdf');

$html = "<html><body><h1>Extracted PDF Images</h1>";

foreach ($imagick as $index => $img) {
    $imageFile = "image_$index.jpg";
    $img->setImageFormat('jpeg');
    $img->writeImage($imageFile);
    $html .= "<img src='$imageFile' alt='PDF Image'><br>";
}

$html .= "</body></html>";

file_put_contents('pdf_images.html', $html);

echo "Images extracted and embedded in HTML!";

Explanation:

Renders each page of the PDF as a 300 DPI JPEG (Imagick rasterizes whole pages; it does not pull out the individual embedded images)
Saves the page images and embeds them into an HTML file

🔹 Ideal for scanned PDFs with graphical content.
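
The HTML-building part of the example can be separated from the rendering step, which makes it easier to escape file names and give each image a meaningful alt text. A possible sketch follows; buildImageGalleryHtml() is a hypothetical helper, and the loading="lazy" attribute is an optional addition that was not in the original example.

```php
<?php
// Sketch: build the gallery page from a list of already written image files.
// buildImageGalleryHtml() is a hypothetical helper for illustration.

function buildImageGalleryHtml(array $imageFiles, string $title): string
{
    $html = '<html><body><h1>' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . "</h1>\n";

    foreach (array_values($imageFiles) as $i => $file) {
        // Escape the path so unusual file names cannot break the markup.
        $src = htmlspecialchars($file, ENT_QUOTES, 'UTF-8');
        $alt = 'PDF page ' . ($i + 1);
        // loading="lazy" defers off-screen images, which helps with long PDFs.
        $html .= "<img src=\"$src\" alt=\"$alt\" loading=\"lazy\"><br>\n";
    }

    return $html . "</body></html>\n";
}

echo buildImageGalleryHtml(['image_0.jpg', 'image_1.jpg'], 'Extracted PDF Images');
```

In the loop above, each $imageFile would be collected into an array and passed to this helper once all pages have been written.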

4. Handling Scanned PDFs Using OCR (Tesseract OCR)

If a PDF contains scanned images of text, use OCR (Optical Character Recognition) to extract text.

Install Tesseract OCR

For Linux (Ubuntu/Debian):

sudo apt install tesseract-ocr

For Windows:

  1. Download Tesseract OCR from GitHub Releases.
  2. Add Tesseract-OCR to your system’s PATH variable.

Example: Convert Scanned PDF Text to HTML Using OCR

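// Requires the PHP wrapper: composer require thiagoalessio/tesseract_ocr
use thiagoalessio\TesseractOCR\TesseractOCR;

// Convert the first page ([0]) of the PDF to an image
$imagick = new Imagick();
$imagick->setResolution(300, 300); // Set before reading; a higher DPI improves OCR accuracy
$imagick->readImage('scanned_document.pdf[0]');
$imagick->setImageFormat('png');
$imagick->writeImage('scanned_page.png');

// Run OCR on the extracted image
$text = (new TesseractOCR('scanned_page.png'))->run();

// Escape the OCR output before placing it in HTML
$text = nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
$html = "<html><body><h1>Extracted OCR Text</h1><p>$text</p></body></html>";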
file_put_contents('ocr_text.html', $html);

echo "Scanned PDF converted to searchable HTML!";

Why Use OCR?

Extracts text from scanned documents
Supports multiple languages (eng, spa, fra, etc.) when the matching Tesseract language data is installed
Converts printed PDFs into searchable HTML

🔹 Great for digitizing invoices, contracts, and historical documents.
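
The example above only processes the first page. A sketch of one way to OCR every page is shown below. ocrPages() is a hypothetical helper that takes the OCR step as a callable, the file names are placeholders, and the real Imagick and Tesseract calls only run when the extension, the wrapper package, and the input file are all present.

```php
<?php
// Sketch: OCR every page of a scanned PDF instead of only page 0.
// ocrPages() is a hypothetical helper; the OCR step is injected as a callable.

function ocrPages(array $imageFiles, callable $ocr): array
{
    $texts = [];
    foreach ($imageFiles as $file) {
        $texts[] = trim($ocr($file));
    }
    return $texts;
}

// Load Composer's autoloader if it is present (assumed location).
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require __DIR__ . '/vendor/autoload.php';
}

// Real usage, attempted only when the dependencies and the input file exist.
if (extension_loaded('imagick')
    && class_exists('thiagoalessio\TesseractOCR\TesseractOCR')
    && is_file('scanned_document.pdf')) {

    $imagick = new Imagick();
    $imagick->setResolution(300, 300);            // must be set before reading
    $imagick->readImage('scanned_document.pdf');  // no [0] suffix: reads all pages

    $files = [];
    foreach ($imagick as $i => $page) {
        $page->setImageFormat('png');
        $file = "scanned_page_$i.png";
        $page->writeImage($file);
        $files[] = $file;
    }

    $texts = ocrPages($files, function (string $file): string {
        // lang() selects the Tesseract language data; 'eng' is an assumption.
        return (new \thiagoalessio\TesseractOCR\TesseractOCR($file))->lang('eng')->run();
    });

    foreach ($texts as $i => $text) {
        echo 'Page ' . ($i + 1) . ': ' . strlen($text) . " bytes of text\n";
    }
}
```

Passing the OCR step as a callable keeps the loop independent of Tesseract, so a different engine could be swapped in later.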

5. Converting Multi-Page PDFs to HTML Dynamically

For multi-page PDFs, process each page separately and generate a paginated HTML output.

Example: Convert Multi-Page PDFs to HTML

$parser = new Parser();
$pdf = $parser->parseFile('multipage.pdf');

$html = "<html><body><h1>Multi-Page PDF Content</h1>";

foreach ($pdf->getPages() as $index => $page) {
    $html .= "<h2>Page " . ($index + 1) . "</h2>";
    $html .= "<p>" . nl2br(htmlspecialchars($page->getText(), ENT_QUOTES, 'UTF-8')) . "</p>";
}

$html .= "</body></html>";

file_put_contents('multipage.html', $html);

echo "Multi-page PDF converted to HTML!";

Generates separate sections per page for better readability.
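
For longer documents, per-page anchors and a small navigation bar make the output easier to browse. The following is a minimal sketch under the assumption that the page texts have already been collected into an array; buildPaginatedHtml() and the page-N id scheme are hypothetical choices made for this illustration.

```php
<?php
// Sketch: build page sections with anchors and a simple navigation bar.
// buildPaginatedHtml() and the "page-N" ids are hypothetical.

function buildPaginatedHtml(array $pageTexts): string
{
    $nav = '';
    $body = '';

    foreach (array_values($pageTexts) as $i => $text) {
        $n = $i + 1;
        $nav .= "<a href=\"#page-$n\">Page $n</a> ";

        // Escape the page text, then keep its line breaks.
        $escaped = nl2br(htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'));
        $body .= "<section id=\"page-$n\"><h2>Page $n</h2><p>$escaped</p></section>\n";
    }

    return '<nav>' . trim($nav) . "</nav>\n" . $body;
}

echo buildPaginatedHtml(['First page text', 'Second page text']);
```

With PDFParser, the input array could be produced by calling getText() on each entry returned by $pdf->getPages().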

6. Saving Extracted Text and Images in a Database

To store extracted PDF content, save it in a MySQL database.

Example: Save Extracted PDF Data to MySQL

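// Example credentials only; use a dedicated database user in production
$conn = new mysqli("localhost", "root", "", "pdf_data");

// $pdf is the document parsed with PDFParser in the earlier examples
$text = $pdf->getText();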
$stmt = $conn->prepare("INSERT INTO extracted_pdfs (content) VALUES (?)");
$stmt->bind_param("s", $text);
$stmt->execute();

echo "PDF text stored in the database!";

Allows searching and retrieval of extracted PDF data dynamically.
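
The insert above assumes that the extracted_pdfs table already exists. One possible schema is sketched below; the id and created_at columns and the index name ft_content are assumptions, and the FULLTEXT index is included because MySQL full-text queries on the content column need one.

```php
<?php
// Sketch: a possible schema for the extracted_pdfs table.
// Column names other than "content" and the index name are assumptions.

function extractedPdfsSchema(): string
{
    return <<<SQL
CREATE TABLE IF NOT EXISTS extracted_pdfs (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    content LONGTEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FULLTEXT KEY ft_content (content)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
SQL;
}

// Runs the statement on an open connection (hypothetical helper).
function createExtractedPdfsTable(mysqli $conn): void
{
    if (!$conn->query(extractedPdfsSchema())) {
        throw new RuntimeException('Could not create table: ' . $conn->error);
    }
}

echo extractedPdfsSchema() . PHP_EOL;
```

Calling createExtractedPdfsTable($conn) once, before the first insert, is enough because of the IF NOT EXISTS clause.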

7. Searching Extracted PDF Content on a Website

Once stored in a database, users can search for keywords inside extracted PDFs.

Example: Search for Keywords in Extracted PDFs

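// MATCH ... AGAINST requires a FULLTEXT index on the content column
$query = "SELECT * FROM extracted_pdfs WHERE MATCH(content) AGAINST('invoice')";
$result = $conn->query($query);

while ($row = $result->fetch_assoc()) {
    // Escape stored text before printing it into the page
    echo "<p>" . htmlspecialchars($row['content'], ENT_QUOTES, 'UTF-8') . "</p>";
}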

Creates a web-based searchable PDF archive!
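
On a real site the keyword comes from the user, so it should be bound as a parameter rather than written into the SQL string, and the results should be escaped before display. The sketch below shows one way to do both. searchExtractedPdfs() and highlightKeyword() are hypothetical helpers, and get_result() assumes PHP was built with the mysqlnd driver.

```php
<?php
// Sketch: keyword search with a bound parameter, plus safe highlighting.
// Both functions are hypothetical helpers for illustration.

function searchExtractedPdfs(mysqli $conn, string $keyword): array
{
    $stmt = $conn->prepare(
        "SELECT content FROM extracted_pdfs WHERE MATCH(content) AGAINST(?)"
    );
    $stmt->bind_param("s", $keyword);
    $stmt->execute();
    // get_result() is available when mysqli uses the mysqlnd driver.
    return $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
}

function highlightKeyword(string $text, string $keyword): string
{
    $flags = ENT_QUOTES | ENT_SUBSTITUTE;
    if ($keyword === '') {
        return htmlspecialchars($text, $flags, 'UTF-8');
    }

    // Split on the keyword and keep the matches (odd indexes) in the result.
    $pattern = '/(' . preg_quote($keyword, '/') . ')/iu';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        // For example, invalid UTF-8 input: fall back to plain escaping.
        return htmlspecialchars($text, $flags, 'UTF-8');
    }

    $html = '';
    foreach ($parts as $i => $part) {
        // Escape each piece separately so <mark> never lands inside an entity.
        $safe = htmlspecialchars($part, $flags, 'UTF-8');
        $html .= ($i % 2 === 1) ? "<mark>$safe</mark>" : $safe;
    }
    return $html;
}

echo highlightKeyword('Invoice <#1> & invoice', 'invoice') . PHP_EOL;
```

Each row returned by searchExtractedPdfs() could then be printed with highlightKeyword($row['content'], $keyword).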

Best Practices for Converting PDFs to HTML in PHP

Use PDFParser for text-based PDFs (faster and more accurate than OCR).
Use OCR (Tesseract) for scanned PDFs; it works best on printed text and is much less reliable on handwriting.
Embed extracted images for better document visualization.
Store extracted text in a database for easy retrieval and search.
Paginate multi-page PDFs for better readability in HTML.
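
Choosing between PDFParser and OCR can be automated with a simple heuristic: if the parser returns almost no text, the PDF is probably scanned. The sketch below illustrates the idea; looksScanned() is a hypothetical helper, and the threshold of 50 characters per page is an arbitrary assumption that should be tuned on real documents.

```php
<?php
// Sketch: guess whether a PDF is scanned from how little text was extracted.
// looksScanned() and its default threshold are assumptions, not a library API.

function looksScanned(string $extractedText, int $pageCount, int $minCharsPerPage = 50): bool
{
    // Count non-whitespace bytes; good enough for a rough heuristic.
    $chars = strlen(preg_replace('/\s+/', '', $extractedText));

    // Treat the document as scanned when the average per page is very low.
    return $chars < $minCharsPerPage * max(1, $pageCount);
}

// A 3-page document with no extractable text is sent to OCR.
echo looksScanned('', 3) ? "Use OCR\n" : "Use PDFParser output\n";
```

In practice the inputs would come from $pdf->getText() and count($pdf->getPages()), with the OCR path used only when the function returns true.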

Conclusion

With PDFParser, Imagick, and Tesseract OCR, you can convert PDFs into structured HTML, making them searchable, accessible, and interactive.

Extract text and images from PDFs dynamically.
Convert scanned PDFs into searchable content using OCR.
Store and search PDF content in a database.
Display PDFs as interactive web pages.

By implementing these techniques, you can seamlessly integrate PDF content into web applications in PHP! 🚀
