Python: Extract Images from PDF

Python has several libraries that make the task of extracting images from PDFs quite straightforward. Each of these libraries has its unique features and methods for handling PDFs and extracting images. Here's a list of the most common libraries.

  1. PDFMiner: A tool more focused on extracting text but also capable of handling image extraction. It allows detailed analysis and retrieval of images and is well-suited for complex PDFs.

  2. PyMuPDF (imported as fitz): Known for its speed and efficiency, this library extracts embedded images in good quality and also supports a wide range of other PDF manipulations. Its performance makes it a strong choice for both text and image extraction.

  3. PyPDF4: A fork of PyPDF2, this library provides enhancements and bug fixes over PyPDF2. It includes functionalities for reading PDFs, extracting text, and some capabilities for image extraction.

  4. Wand: A Python binding for ImageMagick, Wand is used for converting PDF pages into images. It's a powerful tool for comprehensive image manipulation and can be used for high-quality conversions of PDF pages into image files.

  5. ReportLab: A library for generating PDFs. It does not parse existing PDFs, so it cannot extract images on its own, but it is useful for placing images extracted with another library into a new PDF.

Each of these libraries has its strengths and ideal use cases, so the choice depends on the specific requirements of your task, such as the complexity of the PDF, the quality of images needed, and the level of control over the extraction process.

The Extraction Process

The process of extracting images using these libraries typically involves reading the PDF file, iterating over its pages, and then extracting and saving the images found. Each library handles these steps in its own way, as the sections below show.

PDFMiner

PDFMiner is built around layout analysis rather than one-call image extraction. It lets you process PDF layout objects, and you handle the image objects yourself. Here's how to install it:

pip install pdfminer.six

Here's a simple script to give you a starting point:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams, LTContainer, LTImage
from pdfminer.converter import PDFPageAggregator

def find_images(layout_obj):
    # Images are often nested inside containers such as LTFigure,
    # so walk the layout tree recursively instead of checking only the top level.
    if isinstance(layout_obj, LTImage):
        yield layout_obj
    elif isinstance(layout_obj, LTContainer):
        for child in layout_obj:
            yield from find_images(child)

def extract_images_from_pdf(pdf_file):
    with open(pdf_file, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)

        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        for page_num, page in enumerate(PDFPage.create_pages(doc), start=1):
            interpreter.process_page(page)
            layout = device.get_result()
            for image in find_images(layout):
                # The page number keeps names unique when pages reuse names like "Im0".
                # The raw stream is only a valid .jpg if the image is JPEG-encoded.
                with open(f'page{page_num}-{image.name}.jpg', 'wb') as img_file:
                    img_file.write(image.stream.get_rawdata())

# Example usage
pdf_path = 'your_pdf_file.pdf'
extract_images_from_pdf(pdf_path)

How It Works:

  • This script uses PDFMiner to parse the PDF file and then processes each page.

  • It looks for LTImage objects, which represent images.

  • Once an image object is found, it extracts the raw data and saves it as a file.

  • This script assumes the images are in JPEG format, which might not always be the case. You may need to modify the script based on the actual image formats in your PDF.
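One way to relax the JPEG assumption in the last point is to look at the first bytes of each extracted stream before choosing a file extension. The following is a minimal sketch using only the standard library; the function name `guess_image_extension` and the `.bin` fallback are illustrative choices, not part of PDFMiner. Streams that fall through to `.bin` are usually decoded or compressed pixel data rather than a complete image file, and need to be re-encoded with an imaging library.

```python
def guess_image_extension(data: bytes) -> str:
    """Guess a file extension from the leading 'magic' bytes of a stream.

    Hypothetical helper: returns ".bin" when the data is not a
    recognizable standalone image file.
    """
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"   # JPEG start-of-image marker
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"   # PNG signature
    if data.startswith(b"\x00\x00\x00\x0cjP  \r\n\x87\n"):
        return ".jp2"   # JPEG 2000 signature box
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return ".tif"   # little-endian or big-endian TIFF
    return ".bin"       # unknown: probably not a complete image file
```

In the script above, the result could replace the hard-coded `.jpg` suffix when building the output file name.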

PyMuPDF

PyMuPDF lets you iterate over the images referenced by each page of a document. Install it with:

pip install PyMuPDF

This command will download and install PyMuPDF and its dependencies.

To give you a taste of how image extraction works in Python, let’s look at a simple example using PyMuPDF:

import fitz  # PyMuPDF

pdf_file = "example.pdf"
doc = fitz.open(pdf_file)

for i in range(len(doc)):
    for img in doc.get_page_images(i):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # Pixmap.save() is the current name; the older writePNG() is deprecated
        pix.save("page%s-%s.png" % (i, xref))
        pix = None  # release the pixmap's memory

In this code snippet, we open a PDF file, iterate through its pages, and extract each image, saving them as PNG files.
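The snippet above re-encodes every image as PNG. When the original encoding matters, PyMuPDF's `Document.extract_image(xref)` returns the stored image bytes together with a suggested extension. The sketch below is one possible way to use it. The function name `save_embedded_images` and the output naming scheme are assumptions, and the function is written against any object that offers `len()`, `get_page_images()` and `extract_image()`, so it does not import PyMuPDF itself.

```python
import os

def save_embedded_images(doc, out_dir="."):
    """Write each embedded image once, in the format reported by the document.

    `doc` is expected to behave like a PyMuPDF document. Returns the list
    of file paths written. (Hypothetical helper, not a PyMuPDF function.)
    """
    os.makedirs(out_dir, exist_ok=True)
    seen = set()   # an image reused on several pages shares one xref
    saved = []
    for page_number in range(len(doc)):
        for img in doc.get_page_images(page_number):
            xref = img[0]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)   # dict with "image" bytes and "ext"
            path = os.path.join(out_dir, "img-%d.%s" % (xref, info["ext"]))
            with open(path, "wb") as fh:
                fh.write(info["image"])
            saved.append(path)
    return saved

# Typical use (requires PyMuPDF):
#   import fitz
#   doc = fitz.open("example.pdf")
#   print(save_embedded_images(doc, "images"))
```

Skipping repeated xrefs avoids writing the same logo or background image once per page.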

PyPDF4

PyPDF4 is a fork of PyPDF2 with additional improvements and bug fixes. It's commonly used for various PDF manipulations, including extracting text and images.

pip install PyPDF4 Pillow

Pillow is needed because the example rebuilds images from pixel data. Here is an example of how to use it:

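from PyPDF4 import PdfFileReader
from PIL import Image

pdf_file = 'example.pdf'

with open(pdf_file, "rb") as fh:  # the reader needs the file open while it works
    pdf = PdfFileReader(fh)

    for page_num in range(pdf.getNumPages()):
        page = pdf.getPage(page_num)
        if '/XObject' not in page['/Resources']:
            continue
        xObject = page['/Resources']['/XObject'].getObject()
        for obj in xObject:
            image = xObject[obj]
            if image['/Subtype'] != '/Image':
                continue
            size = (image['/Width'], image['/Height'])
            name = "page%d-%s" % (page_num, obj[1:])  # obj[1:] drops the leading "/"

            # Only the two simplest color spaces are handled in this example
            color_space = image.get('/ColorSpace')
            if color_space == '/DeviceRGB':
                mode = "RGB"
            elif color_space == '/DeviceGray':
                mode = "L"
            else:
                mode = None

            img_filter = image.get('/Filter')
            if img_filter == '/DCTDecode':
                # The stream already holds a complete JPEG file: write it unchanged
                with open(name + ".jpg", "wb") as out:
                    out.write(image._data)
            elif (img_filter == '/FlateDecode' and mode
                  and image.get('/BitsPerComponent') == 8):
                # getData() decompresses the stream into raw pixel data
                img = Image.frombytes(mode, size, image.getData())
                img.save(name + ".png")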

This script extracts images embedded in a PDF and saves them as JPEG or PNG depending on their format.
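Choosing the Pillow mode from the PDF color space is the step most likely to go wrong when rebuilding an image from decoded pixel data. The sketch below maps the three device color spaces to Pillow modes and computes how many bytes the decoded data should contain, which is a cheap check before calling `Image.frombytes`. The names `COLOR_SPACES` and `raw_image_layout` are illustrative, and the sketch assumes 8 bits per component; indexed and ICC-based color spaces are deliberately rejected rather than guessed.

```python
# PDF color space name -> (Pillow mode, channels per pixel)
COLOR_SPACES = {
    "/DeviceGray": ("L", 1),
    "/DeviceRGB": ("RGB", 3),
    "/DeviceCMYK": ("CMYK", 4),
}

def raw_image_layout(color_space, width, height, bits_per_component=8):
    """Return (pillow_mode, expected_byte_count) for decoded pixel data.

    Hypothetical helper. Raises ValueError for anything this sketch
    does not cover, so callers can skip the image instead of saving garbage.
    """
    if bits_per_component != 8:
        raise ValueError("only 8 bits per component is handled here")
    try:
        mode, channels = COLOR_SPACES[str(color_space)]
    except KeyError:
        raise ValueError("unsupported color space: %r" % (color_space,)) from None
    return mode, width * height * channels
```

If the length of the decoded data differs from the expected byte count, the image most likely uses a layout this sketch does not cover and is better skipped or handled separately.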

ReportLab

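ReportLab cannot read existing PDFs, so it is not a tool for extracting content from them. It is excellent for creating new PDFs, which makes it a good partner for a library that does the extraction. The examples below use the legacy PyPDF2 API (PdfFileReader), which was removed in PyPDF2 3.0, so install an earlier version:

pip install "PyPDF2<3" reportlab Pillow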

PyPDF2 can be used to extract images from the PDF:

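import PyPDF2

def extract_images(pdf_path):
    """Return the raw bytes of every JPEG (DCTDecode) image in the PDF."""
    images = []

    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)  # legacy API, PyPDF2 < 3.0

        for page_num in range(pdf_reader.numPages):
            page = pdf_reader.getPage(page_num)
            if '/XObject' not in page['/Resources']:
                continue
            xObject = page['/Resources']['/XObject'].getObject()
            for obj in xObject:
                image = xObject[obj]
                if image['/Subtype'] != '/Image':
                    continue
                # A DCTDecode stream is already a complete JPEG file, so keep
                # its bytes unchanged. Other filters yield bare pixel data that
                # would need re-encoding before it can be used as an image file.
                if image.get('/Filter') == '/DCTDecode':
                    images.append(image._data)

    return images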

# Extract images from a PDF file
pdf_path = 'example.pdf'
extracted_images = extract_images(pdf_path)

Now, you can use ReportLab to process or embed these images into a new PDF. Here's a basic example of creating a new PDF with these images:

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import ImageReader
from io import BytesIO
from PIL import Image

def create_pdf_with_images(images, output_pdf_path):
    c = canvas.Canvas(output_pdf_path, pagesize=letter)

    for i, img_data in enumerate(images):
        # img_data must be a complete image file (e.g. JPEG bytes) that Pillow can open
        img = Image.open(BytesIO(img_data))
        # drawImage() expects a file name or an ImageReader, not a bare byte stream
        c.drawImage(ImageReader(img), 100, 750 - 100 * i, width=img.width / 2, height=img.height / 2)

    c.save()

# Create a new PDF with the extracted images
new_pdf_path = 'new_example.pdf'
create_pdf_with_images(extracted_images, new_pdf_path)

Notes:

  1. This example extracts raw image data using PyPDF2 and then uses ReportLab to create a new PDF with these images.

  2. The images are added to the PDF with ReportLab's drawImage function. The position and size of the images can be adjusted as needed.

  3. Ensure that the PDFs you are working with are not encrypted or protected, as this may prevent image extraction.
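Regarding note 2, a fixed vertical step can overlap tall images or run off the bottom of the page once there are more than a handful of them. The sketch below computes a position for each image so that they stack from the top margin downward and continue on a new page when space runs out. The function name `stack_positions` and the default margin and gap are assumptions; the default page height of 792 points matches ReportLab's letter size.

```python
def stack_positions(heights, page_height=792.0, margin=36.0, gap=12.0):
    """Assign each image a (page_index, y) pair, stacking top to bottom.

    `heights` are the drawn heights in points. `y` is the bottom edge of
    the image, matching ReportLab's lower-left origin. (Hypothetical helper.)
    """
    positions = []
    page = 0
    top = page_height - margin        # upper edge of the free area
    cursor = top
    for h in heights:
        # Start a new page if the image does not fit, unless the current
        # page is still empty (an oversized image gets a page to itself).
        if cursor - h < margin and cursor < top:
            page += 1
            cursor = top
        y = cursor - h
        positions.append((page, y))
        cursor = y - gap
    return positions
```

When drawing, each change in `page_index` corresponds to a call to the canvas's `showPage()` before the next `drawImage()`.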

Conclusion

These examples illustrate the flexibility of Python in handling PDFs and extracting images. Depending on your specific needs – whether it's extracting embedded images or converting entire pages into images – there's a Python library that can help. Remember, the choice of library may depend on factors like the complexity of the PDF, the desired image quality, and the specific requirements of your project.