Retrieving Text from a Custom Rectangle with C# and .NET

Extracting text from a PDF document by coordinates is useful for tasks such as data extraction, form field analysis, and content filtering. You define the rectangle that contains the desired text and retrieve only the text elements located inside it. This helps automate data processing, document analysis, and information retrieval.
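Before defining the rectangle, it helps to know which units and origin the coordinates use. The sketch below assumes the default PDF user space: units of 1/72 inch (points), with the origin at the bottom-left corner of the page, which is consistent with the sample values later in this guide (bottom 510, top 720). The `PdfUnits` class and its methods are hypothetical helpers written for illustration; they are not part of PDF.Net.

```csharp
using System;

// Hypothetical helper, not part of PDF.Net: converts a rectangle measured
// in millimetres from the top-left page corner into PDF points measured
// from the bottom-left corner (assumed default PDF user space).
public static class PdfUnits
{
    public const double PointsPerInch = 72.0;
    public const double MillimetresPerInch = 25.4;

    public static double MmToPoints(double mm) =>
        mm / MillimetresPerInch * PointsPerInch;

    public static (double Left, double Bottom, double Right, double Top) FromTopLeftMm(
        double xMm, double yMm, double widthMm, double heightMm, double pageHeightPoints)
    {
        double left = MmToPoints(xMm);
        double right = MmToPoints(xMm + widthMm);
        // The vertical axis is flipped: a distance from the top edge
        // becomes "page height minus that distance" from the bottom edge.
        double top = pageHeightPoints - MmToPoints(yMm);
        double bottom = pageHeightPoints - MmToPoints(yMm + heightMm);
        return (left, bottom, right, top);
    }

    public static void Main()
    {
        // A rectangle 25.4 mm from the left and top edges, 50.8 mm wide and
        // 25.4 mm high, on a page assumed to be 842 points tall (roughly A4).
        var area = FromTopLeftMm(25.4, 25.4, 50.8, 25.4, 842);
        Console.WriteLine($"Left={area.Left:F1}, Bottom={area.Bottom:F1}, " +
                          $"Right={area.Right:F1}, Top={area.Top:F1}");
    }
}
```

The resulting four values can be used as the `areaLeft`, `areaBottom`, `areaRight`, and `areaTop` variables in the sample, provided the page really uses the assumed coordinate system.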

Below is a step-by-step guide to extract text at given coordinates from PDF documents using PDF.Net.


Step-by-Step Guide

  1. Create a New Project

    Open Visual Studio and create a new Console Application project.

  2. Add PDF.Net Reference

    Download the PDF.Net library and add it to your project. You can do this by right-clicking on your project in the Solution Explorer, selecting "Add Reference," and browsing to the PDF.Net DLL.

  3. Write the Code to Extract Content

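    The sample iterates over the content elements of the first page and prints every text element whose bounds lie inside the given rectangle. The complete code is listed in the next step.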

  4. Full Code

    using System;
    using System.IO;
    using SautinSoft;
    using SautinSoft.Pdf;
    using SautinSoft.Pdf.Content;
    
    class Program
    {
        /// <summary>
        /// Reading text
        /// </summary>
        /// <remarks>
        /// Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/reading-text-from-specific-rectangular-area.php
        /// </remarks>
        static void Main()
        {
            // Before starting this example, please get a free 100-day trial key:
            // https://sautinsoft.com/start-for-free/
    
            // Apply the key here:
            // PdfDocument.SetLicense("...");
    
            string pdfFile = Path.GetFullPath(@"..\..\..\simple text.pdf");
            var pageIndex = 0;
            double areaLeft = 200, areaRight = 520, areaBottom = 510, areaTop = 720;
            using (var document = PdfDocument.Load(pdfFile))
            {
                // Retrieve first page object.
                var page = document.Pages[pageIndex];
                // Retrieve text content elements that are inside specified area on the first page.
                var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
                while (contentEnumerator.MoveNext())
                {
                    if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                    {
                        var textElement = (PdfTextContent)contentEnumerator.Current;
                        var bounds = textElement.Bounds;
                        contentEnumerator.Transform.Transform(bounds);
    
                        if (bounds.Left > areaLeft && bounds.Right < areaRight &&
                            bounds.Bottom > areaBottom && bounds.Top < areaTop)
                        {
                            // Read the text of an element located in a given area
                            Console.Write(textElement.ToString());
                        }
                    }
                }
            }
        }
    }


    Option Infer On
    
    Imports System
    Imports System.IO
    Imports SautinSoft
    Imports SautinSoft.Pdf
    Imports SautinSoft.Pdf.Content
    
    Friend Class Program
    	''' <summary>
    	''' Reading text
    	''' </summary>
    	''' <remarks>
    	''' Details: https://sautinsoft.com/products/pdf/help/net/developer-guide/reading-text-from-specific-rectangular-area.php
    	''' </remarks>
    	Shared Sub Main()
    		' Before starting this example, please get a free license:
    		' https://sautinsoft.com/start-for-free/
    
    		' Apply the key here:
    		' PdfDocument.SetLicense("...")
    
    		Dim pdfFile As String = Path.GetFullPath("..\..\..\simple text.pdf")
    		Dim pageIndex = 0
    		Dim areaLeft As Double = 200, areaRight As Double = 520, areaBottom As Double = 510, areaTop As Double = 720
    		Using document = PdfDocument.Load(pdfFile)
    			' Retrieve first page object.
    			Dim page = document.Pages(pageIndex)
    			' Retrieve text content elements that are inside specified area on the first page.
    			Dim contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator()
    			Do While contentEnumerator.MoveNext()
    				If contentEnumerator.Current.ElementType = PdfContentElementType.Text Then
    					Dim textElement = CType(contentEnumerator.Current, PdfTextContent)
    					Dim bounds = textElement.Bounds
    					contentEnumerator.Transform.Transform(bounds)
    
    					If bounds.Left > areaLeft AndAlso bounds.Right < areaRight AndAlso bounds.Bottom > areaBottom AndAlso bounds.Top < areaTop Then
    						' Read the text of an element located in a given area
    						Console.Write(textElement.ToString())
    					End If
    				End If
    			Loop
    		End Using
    	End Sub
    End Class
    


  5. Run the Application

    Build and run your application. If everything is set up correctly, the text located inside the specified rectangle on the first page is written to the console.
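If the run prints less text than expected, one likely cause is the strict containment test used in the sample: a text element whose bounds cross an edge of the rectangle is skipped entirely. The following library-independent sketch places that test next to a looser overlap test. The `Area` type, its method names, and the element bounds are hypothetical and chosen for illustration; only the rectangle values come from the sample.

```csharp
using System;

// Hypothetical helper type for illustration; it is not part of PDF.Net.
// Coordinates follow the sample: Left < Right and Bottom < Top.
public readonly struct Area
{
    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }

    public Area(double left, double bottom, double right, double top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    // Same comparison as in the sample: the element must lie
    // strictly inside the area on all four sides.
    public bool StrictlyContains(Area element) =>
        element.Left > Left && element.Right < Right &&
        element.Bottom > Bottom && element.Top < Top;

    // Looser alternative: the element only has to overlap the area.
    public bool Overlaps(Area element) =>
        element.Left < Right && element.Right > Left &&
        element.Bottom < Top && element.Top > Bottom;
}

public static class AreaDemo
{
    public static void Main()
    {
        var area = new Area(200, 510, 520, 720);        // rectangle from the sample
        var inside = new Area(210, 600, 400, 612);      // made-up element bounds
        var straddling = new Area(180, 600, 400, 612);  // starts left of the area

        Console.WriteLine(area.StrictlyContains(inside));      // True
        Console.WriteLine(area.StrictlyContains(straddling));  // False
        Console.WriteLine(area.Overlaps(straddling));          // True
    }
}
```

Which test fits depends on the document: strict containment avoids picking up neighbouring text, while the overlap test keeps elements that are only partly inside the rectangle.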

Additional Features

    PDF.Net offers various other features for handling PDF documents, such as:
  • Extracting images from PDF files.
  • Converting PDF to other formats like DOCX, HTML, and images.
  • Merging and splitting PDF files.
  • Adding and reading interactive forms.

Conclusion

Extracting text from a specified rectangle in a PDF document can be done in C# with the SautinSoft.Pdf library. You define the boundaries of the area, iterate over the page's content elements, and keep the text elements whose bounds fall inside that area.



We have been developing .NET components since 2002. We know the PDF, DOCX, RTF, HTML, XLSX, and image formats. If you need help creating, modifying, or converting documents in various formats, we can help. We will write any code example for you free of charge.