Using C# and .NET for PDF Text Extraction

SautinSoft.Pdf reads PDF files from C# or VB.NET applications at very high speed: it can read the text of a 1,000-page PDF file (almost 500,000 words) in just 3 seconds.

Text extraction is fairly easy to perform. With a simple API and just a few lines of code, the entire text content from a PDF file can be extracted in a single String, ready for your further processing.

Extracting text from PDF documents underpins tasks such as data mining, information retrieval, content analysis, and document management. Once the text has been extracted, it can be searched, edited, analyzed, or repurposed programmatically. Whether you are a researcher, a data analyst, a content creator, or a developer, text extraction simplifies working with textual information stored in PDF format.

Below is a step-by-step guide on how to extract text from PDF documents using PDF.Net.

Input file: simple text.pdf

Step-by-Step Guide

  1. Create a New Project

    Open Visual Studio and create a new Console Application project.

  2. Add PDF.Net Reference

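    Install the SautinSoft.Pdf package from NuGet, for example through the NuGet Package Manager in Visual Studio or by running dotnet add package SautinSoft.Pdf in the project folder.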

  3. Write the Code to Extract Text

    Below is a sample code snippet to extract text from a PDF document:

    using System;
    using System.IO;
    using SautinSoft;
    using SautinSoft.Pdf;
    using SautinSoft.Pdf.Content;

    namespace Sample
    {
        class Sample
        {
            /// <summary>
            /// Extract text from a PDF document.
            /// </summary>
            /// <remarks>
            /// Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/read-text-from-pdf-files.php
            /// </remarks>
            static void Main(string[] args)
            {
                // Path to the input PDF file
                //string pdfFile = @"C:\path\to\your\document.pdf";
                string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
                try
                {
                    // Load the document; the using block disposes it when done
                    using (var document = PdfDocument.Load(pdfFile))
                    {
                        // Extract and print the text page by page
                        foreach (var page in document.Pages)
                        {
                            var text = page.Content.GetText(new PdfTextOptions
                            {
                                FontFace = new PdfFontFace("Consolas"),
                                Order = PdfTextOrder.Reading,
                                Whitespaces = PdfTextWhitespaces.Space | PdfTextWhitespaces.Blank | PdfTextWhitespaces.NewLine
                            }).ToString();
                            Console.WriteLine(text);
                        }
                    }
                }
                catch (Exception ex)
                {
                    // Report load or extraction failures instead of crashing
                    Console.WriteLine("Error: " + ex.Message);
                }
            }
        }
    }

  4. Run the Application

    Build and run your application. If everything is set up correctly, the text of each page in the specified PDF file is extracted and printed to the console.
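The introduction mentions getting the entire text content as a single String. The sketch below shows one way this might be done. It reuses only the calls from the sample in step 3 (PdfDocument.Load, document.Pages, page.Content.GetText, PdfTextOptions with Order) and combines the pages with a StringBuilder. It assumes that PdfTextOptions can be created with only Order set. The class name SingleStringSample and the output file name are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

namespace Sample
{
    class SingleStringSample
    {
        static void Main(string[] args)
        {
            // Input path as in the sample above; the .txt output path is a hypothetical choice.
            string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
            string txtFile = Path.ChangeExtension(pdfFile, ".txt");

            var allText = new StringBuilder();

            using (var document = PdfDocument.Load(pdfFile))
            {
                foreach (var page in document.Pages)
                {
                    // Same call as in step 3, keeping only the reading-order option
                    // (assumption: the other options may be left at their defaults).
                    string pageText = page.Content.GetText(new PdfTextOptions
                    {
                        Order = PdfTextOrder.Reading
                    }).ToString();

                    // Separate pages with a line break.
                    allText.AppendLine(pageText);
                }
            }

            // The whole document as one string, ready for further processing.
            string text = allText.ToString();
            File.WriteAllText(txtFile, text);
            Console.WriteLine("Extracted " + text.Length + " characters to " + txtFile);
        }
    }
}
```

Collecting the pages into one string first keeps the extraction step separate from whatever follows, such as searching, indexing, or saving to a file.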

Additional Features

    PDF.Net offers various other features for handling PDF documents, such as:
  • Extracting images from PDF files.
  • Converting PDF to other formats like DOCX, HTML, and images.
  • Merging and splitting PDF files.
  • Adding and reading interactive forms.

Conclusion

Extracting text from PDF documents using PDF.Net is a simple and efficient process. With just a few lines of code, you can integrate powerful PDF text extraction capabilities into your applications. Whether you are working on a small project or a large-scale application, PDF.Net provides the tools you need to handle PDF documents effectively.


We have been developing .Net components since 2002. We know the PDF, DOCX, RTF, HTML, XLSX, and image formats. If you need help creating, modifying, or converting documents in various formats, we can help. We will write any code example for you absolutely free of charge.