Step-by-Step Tutorial: Extracting Text from PDFs with C# and .NET

Extracting text from a PDF takes only a few lines of code. With a simple API, the entire text content of a PDF file can be extracted into a single string, ready for further processing. SautinSoft.PDF can read PDF files from C# or VB.NET applications at very high speed; it can read the text of a 1,000-page PDF file (almost 500,000 words) in just 3 seconds.

Extracting text from PDF documents is essential for tasks such as data mining, information retrieval, content analysis, and document management. Once the text has been extracted automatically, it can be searched, edited, analyzed, and repurposed like any other string. Whether you are a researcher, a data analyst, a content creator, or a developer, text extraction simplifies working with textual information stored in PDF format.
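As a minimal sketch of the "search" use case mentioned above, the following snippet finds which pages mention a keyword once the text of each page is available as a string. The names `TextSearch` and `FindPages` are hypothetical helpers written for this illustration, not part of SautinSoft.PDF, and the page texts are stand-in strings rather than output from a real PDF.

```csharp
using System;
using System.Collections.Generic;

static class TextSearch
{
    // Hypothetical helper (not part of SautinSoft.PDF).
    // Returns the 1-based numbers of the pages whose text contains
    // the keyword, ignoring case.
    public static List<int> FindPages(IReadOnlyList<string> pageTexts, string keyword)
    {
        var hits = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (pageTexts[i].IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                hits.Add(i + 1);
        }
        return hits;
    }
}

class SearchDemo
{
    static void Main()
    {
        // Stand-in strings; in a real application each entry would hold
        // the text extracted from one page of the PDF.
        var pages = new[] { "Invoice 2024", "Terms and conditions", "invoice summary" };

        // Pages 1 and 3 contain "invoice" in some letter case.
        Console.WriteLine(string.Join(", ", TextSearch.FindPages(pages, "invoice")));
    }
}
```

The comparison uses `StringComparison.OrdinalIgnoreCase`, which is a reasonable default for simple keyword lookups; culture-aware matching would need a different comparison mode.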

Follow these steps to extract text from PDF documents using PDF.Net:

  1. Add the SautinSoft.PDF package from NuGet.
  2. Load the PDF document.
  3. Write the text of each page to the console.

Input file: simple text.pdf

Complete code (C#)

using System;
using System.IO;
using SautinSoft;
using SautinSoft.Pdf;
using SautinSoft.Pdf.Content;

namespace Sample
{
    class Sample
    {
        /// <summary>
        /// Read text from PDF.
        /// </summary>
        /// <remarks>
        /// Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/read-text-from-pdf-files.php
        /// </remarks>
        static void Main(string[] args)
        {
            // Before starting this example, please get a free 100-day trial key:
            // https://sautinsoft.com/start-for-free/

            // Apply the key here:
            // PdfDocument.SetLicense("...");
            
            string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");

            // Load PDF Document.
            using (var document = PdfDocument.Load(pdfFile))
            {
                foreach (var page in document.Pages)
                {
                    // Write the text of the current page to the console.
                    Console.WriteLine(page.Content.ToString());
                }
            }
        }
    }
}

Complete code (VB.NET)

Option Infer On

Imports System
Imports System.IO
Imports SautinSoft
Imports SautinSoft.Pdf
Imports SautinSoft.Pdf.Content

Namespace Sample
	Friend Class Sample
		''' <summary>
		''' Read text from PDF.
		''' </summary>
		''' <remarks>
		''' Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/read-text-from-pdf-files.php
		''' </remarks>
		Shared Sub Main(ByVal args() As String)
			' Before starting this example, please get a free 100-day trial key:
			' https://sautinsoft.com/start-for-free/

			' Apply the key here:
			' PdfDocument.SetLicense("...")

			Dim pdfFile As String = Path.GetFullPath("..\..\..\simple text.pdf")

			' Load PDF Document.
			Using document = PdfDocument.Load(pdfFile)
				For Each page In document.Pages
					' Write the text of the current page to the console.
					Console.WriteLine(page.Content.ToString())
				Next page
			End Using
		End Sub
	End Class
End Namespace
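The samples above print each page separately. If the whole document is wanted as a single string for further processing, the per-page strings can be joined afterwards. The sketch below assumes the page texts have already been collected into a list (for example, by adding `page.Content.ToString()` to a `List<string>` inside the `foreach` loop shown earlier). `PageText`, `Combine`, and `CountWords` are hypothetical helper names, not part of SautinSoft.PDF.

```csharp
using System;
using System.Collections.Generic;

static class PageText
{
    // Hypothetical helper: joins per-page text into one string,
    // separating pages with a blank line.
    public static string Combine(IEnumerable<string> pageTexts)
    {
        return string.Join("\n\n", pageTexts);
    }

    // Hypothetical helper: counts whitespace-separated words.
    // Passing a null separator array makes Split use whitespace.
    public static int CountWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}

class CombineDemo
{
    static void Main()
    {
        // Stand-in strings; in a real application these would be the
        // values collected from page.Content.ToString() for each page.
        var pages = new List<string> { "First page text", "Second page" };

        string all = PageText.Combine(pages);
        Console.WriteLine(all);
        Console.WriteLine("Words: " + PageText.CountWords(all));
    }
}
```

Joining after extraction keeps the page boundaries visible in the combined string, which helps if later processing needs to map a match back to a page.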
