Extract images from PDF in C#

The article is maintained by the team at commabot.

Let's start by invoking the NuGet spirits. In Visual Studio's Package Manager Console, run:

Install-Package iTextSharp

The Sorcery Code

Now, for the real magic. We'll dive into the PDF, navigate its streamy underbelly, and extract the images.

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class PdfImageExtractor
{
    public static void ExtractImages(string pdfPath)
    {
        using (var reader = new PdfReader(pdfPath))
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                parser.ProcessContent(i, new MyImageRenderListener(pdfPath, i));
            }
        }
    }

    private class MyImageRenderListener : IRenderListener
    {
        private string _pdfPath;
        private int _pageNumber;

        public MyImageRenderListener(string pdfPath, int pageNumber)
        {
            _pdfPath = pdfPath;
            _pageNumber = pageNumber;
        }

        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderText(TextRenderInfo renderInfo) { }

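        public void RenderImage(ImageRenderInfo renderInfo)
        {
            PdfImageObject imageObj = renderInfo.GetImage();
            if (imageObj == null) return;

            // GetImageAsBytes() returns a plain byte[], so there is nothing to dispose.
            byte[] imageBytes = imageObj.GetImageAsBytes();
            if (imageBytes == null) return;

            // The Guid keeps file names unique; the bytes are written exactly as returned.
            string outputFileName = Path.Combine("ExtractedImages", $"ExtractedImg_Page{_pageNumber}_{Guid.NewGuid()}.png");
            File.WriteAllBytes(outputFileName, imageBytes);
        }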
    }
}

Invocation

Summon the method with the path to your PDF:

PdfImageExtractor.ExtractImages(@"path\to\your\pdfFile.pdf");

Reality Check

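Make sure the "ExtractedImages" folder exists before you run this, or modify the code to create it dynamically. The path is relative, so it resolves against the current working directory, and File.WriteAllBytes throws a DirectoryNotFoundException if the folder is missing. This method pulls every image it finds on each page and saves it. The Guid in the file name is there to avoid overwrites, but you might want a more sophisticated naming scheme based on your needs.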
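If you'd rather not rely on a pre-made folder and random Guids, here is a minimal sketch of one possible approach. ImageFileNamer is a hypothetical helper (it is not part of iTextSharp): it creates the output folder on construction and hands out predictable names built from the PDF's file name, the page number, and a per-page counter. The naming pattern itself is just an assumption; adjust it to taste.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of iTextSharp): creates the output folder
// and builds predictable, non-colliding file paths for extracted images.
public class ImageFileNamer
{
    private readonly string _outputDir;
    private readonly string _baseName;
    private readonly Dictionary<int, int> _countPerPage = new Dictionary<int, int>();

    public ImageFileNamer(string pdfPath, string outputDir)
    {
        _outputDir = outputDir;
        _baseName = Path.GetFileNameWithoutExtension(pdfPath);

        // Does nothing if the folder already exists.
        Directory.CreateDirectory(outputDir);
    }

    // Returns the next free path for the given page, numbering images
    // per page in the order they are requested.
    public string Next(int pageNumber, string extension)
    {
        int count;
        _countPerPage.TryGetValue(pageNumber, out count); // count stays 0 for a new page
        count++;
        _countPerPage[pageNumber] = count;

        return Path.Combine(_outputDir, $"{_baseName}_p{pageNumber:D3}_{count:D2}.{extension}");
    }
}
```

Since ExtractImages builds a new listener for every page, you would create a single ImageFileNamer before the loop and pass it into each listener, then call Next(_pageNumber, "png") in RenderImage instead of assembling the name by hand.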

A Note on Images in PDFs

Images in PDFs can be tricky: they might be split into pieces, masked, or transformed in ways that make the extracted file look different from what you see on the page. Also, this code writes whatever bytes GetImageAsBytes() returns, and those are not always PNG data (an embedded JPEG, for example, typically comes out as a JPEG), so the hard-coded .png extension can be misleading. You might need additional handling for format detection or conversion depending on your use case.
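One hedged way to deal with the extension problem is to look at the first few bytes of the extracted data and pick the extension from the file signature. The sketch below is an illustration, not iTextSharp functionality: ImageFormatSniffer and GuessExtension are hypothetical names, and only a handful of common signatures (PNG, JPEG, TIFF, JPEG 2000) are checked. If your iTextSharp version exposes PdfImageObject.GetFileType(), that is likely the more direct route.

```csharp
using System;

// Hypothetical helper (not part of iTextSharp): guesses a file extension
// from the leading "magic bytes" of the extracted image data.
public static class ImageFormatSniffer
{
    public static string GuessExtension(byte[] data)
    {
        if (data == null) return "bin";

        // PNG signature: 89 50 4E 47 0D 0A 1A 0A
        if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return "png";

        // JPEG files start with FF D8 FF
        if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return "jpg";

        // TIFF: "II*\0" (little-endian) or "MM\0*" (big-endian)
        if (StartsWith(data, 0x49, 0x49, 0x2A, 0x00) ||
            StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) return "tif";

        // JPEG 2000 (JP2 container) signature box
        if (StartsWith(data, 0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20)) return "jp2";

        // Unknown format: fall back to a neutral extension.
        return "bin";
    }

    private static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

In RenderImage you could then call ImageFormatSniffer.GuessExtension(imageBytes) and use the result in place of the fixed "png" when building the file name. This only renames the output; it does not convert anything to PNG.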