PDF Analyzer

PDF content analysis

Operating mode

PDF Analyzer: Training Mode and Preview Mode

At the top of the PDF Analyzer window, you can switch between two modes: VIEW and TRAIN. TRAIN mode is where you write scripts, and VIEW mode is where you preview the document. Use the UP and DOWN buttons on the left to move between the pages of the document.

PDF

The ISO-standard PDF/A format is supported. Click PICK to select a PDF file, or specify one with the %FILENAME% variable.

PASSWORD

The password for the selected PDF file (optional). Alternatively, you can store the password in a text file in the workspace and enter the name of that text file in this field.

Input Object

The input object is a data object retrieved from the analysis of a PDF file. It contains all text objects, line objects, etc., in a unified coordinate system, as well as the necessary parsing functions.

Coordinate System

The input uses the Page Normalized Coordinate (PNC) system. The origin is at the top-left corner, and the coordinates of each page are normalized to adjacent integer ranges along the y-axis. For example, the first page spans [0,0] to [1,1], the second page spans [0,1] to [1,2], and so on. In this way, the entire document can be treated as one continuous coordinate system with x in [0, 1] and y in [0, N], where N is the number of pages (page k, counting from 1, covers y from k-1 to k).
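The mapping between a position on a single page and its PNC position can be sketched in a few lines. This is an illustration only, written in Python rather than the analyzer's own scripting language; the function names `to_pnc` and `from_pnc` are hypothetical and are not part of the input object.

```python
from typing import Tuple


def to_pnc(page_index: int, x: float, y: float) -> Tuple[float, float]:
    """Map a page-local normalized point to PNC.

    page_index is zero-based (0 is the first page); x and y are
    normalized to [0, 1] within that page. Hypothetical helper.
    """
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("x and y must be normalized to [0, 1]")
    # x is unchanged; y is shifted down by one unit per preceding page.
    return (x, page_index + y)


def from_pnc(x: float, y: float) -> Tuple[int, float, float]:
    """Split a PNC point into (zero-based page index, local x, local y).

    A y value that falls exactly on a page boundary is assigned to the
    page that starts there, not the page that ends there.
    """
    if y < 0:
        raise ValueError("y must be non-negative")
    page_index = int(y)  # floor, since y is non-negative
    return (page_index, x, y - page_index)


print(to_pnc(1, 0.5, 0.25))   # (0.5, 1.25): a quarter of the way down page 2
print(from_pnc(0.5, 1.25))    # (1, 0.5, 0.25)
```

Because every page occupies exactly one unit of y, the integer part of a y coordinate identifies the page and the fractional part gives the position within it.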

Text Object

Basic properties of a text object

Parsing Functions

Parsing logic of PDF files

The input object provides several parsing and utility functions to help users locate target objects within the document. The overall parsing logic involves narrowing down the collection of text objects based on spatial or textual conditions and then using relative relationships to find the target objects.
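As a rough sketch of this narrow-then-relate logic, the Python below filters a collection of text objects by region and by text, then looks to the right of a matched label for its value. Everything here is an assumption made for illustration: the `TextObject` fields and the helpers `within`, `containing`, and `first_to_right` are hypothetical stand-ins, not the parsing functions the input object actually provides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextObject:
    """Hypothetical text object: its text and top-left corner in PNC."""
    text: str
    x: float
    y: float


def within(objs: List[TextObject], x0: float, y0: float,
           x1: float, y1: float) -> List[TextObject]:
    """Spatial condition: keep objects whose corner lies in the box."""
    return [o for o in objs if x0 <= o.x <= x1 and y0 <= o.y <= y1]


def containing(objs: List[TextObject], needle: str) -> List[TextObject]:
    """Textual condition: keep objects whose text contains needle."""
    return [o for o in objs if needle in o.text]


def first_to_right(objs: List[TextObject], anchor: TextObject,
                   tolerance: float = 0.01) -> Optional[TextObject]:
    """Relative relationship: nearest object to the right on the same row."""
    row = [o for o in objs
           if o.x > anchor.x and abs(o.y - anchor.y) <= tolerance]
    return min(row, key=lambda o: o.x, default=None)


objects = [
    TextObject("Invoice No.", 0.10, 0.12),
    TextObject("INV-0042", 0.30, 0.12),
    TextObject("Total", 0.10, 1.80),
    TextObject("199.00", 0.30, 1.80),
]

page_one = within(objects, 0.0, 0.0, 1.0, 1.0)   # narrow to the first page
label = containing(page_one, "Invoice")[0]        # narrow by text
value = first_to_right(page_one, label)           # then use relative position
print(value.text)  # INV-0042
```

Anchoring on a stable label and resolving the value relative to it tends to survive small layout shifts better than hard-coding the value's absolute coordinates.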

Utility Functions

Viewer and CodeGen

In PDF Analyzer's Viewer, you can not only mark text objects and coordinates but also generate function code automatically through mouse and keyboard operations. Afterward, you can copy the generated code, paste it into TRAIN mode, and make minor modifications, which reduces the time needed for programming.

  • To capture an object: click the text object.

  • To generate boundaries: drag the mouse to create a selection box in the desired area.

  • To perform directional analysis: with the cursor positioned on a key object, press Shift+Arrow Key to resolve the first object encountered in that direction.

  • To perform range analysis: position the cursor on the starting key object, press Shift, move the cursor to the ending key object, and click to select the objects between them.

  • To perform relative range analysis: place the cursor on the starting object, press Shift, move the cursor to the desired area, and drag to create a selection box for analysis.
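To make the range analysis gesture more concrete, the sketch below selects the text objects that lie between two key objects. It assumes that "between" means inside the rectangle spanned by the two key objects' positions, which the documentation does not confirm. The `TextObject` type and the `between` function are hypothetical, written in Python for illustration, and do not represent the code the Viewer generates.

```python
from typing import List, NamedTuple


class TextObject(NamedTuple):
    """Hypothetical text object: its text and top-left corner in PNC."""
    text: str
    x: float
    y: float


def between(objs: List[TextObject], start: TextObject,
            end: TextObject) -> List[TextObject]:
    """Return objects inside the rectangle spanned by start and end.

    The two key objects themselves are excluded. Treating the range as
    a rectangle is an assumption of this sketch.
    """
    x0, x1 = sorted((start.x, end.x))
    y0, y1 = sorted((start.y, end.y))
    return [o for o in objs
            if x0 <= o.x <= x1 and y0 <= o.y <= y1
            and o not in (start, end)]


header = TextObject("Item", 0.10, 0.40)
footer = TextObject("Total", 0.10, 0.70)
objects = [
    header,
    TextObject("Widget", 0.10, 0.45),
    TextObject("Gadget", 0.10, 0.50),
    footer,
    TextObject("Notes", 0.10, 0.90),
]

print([o.text for o in between(objects, header, footer)])  # ['Widget', 'Gadget']
```

Since PNC coordinates are continuous across pages, the same comparison works unchanged when the starting and ending key objects sit on different pages.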

Output Object

Each key added to the output object is written to the workspace as a .txt file. The file name is the key, and the file content is the corresponding value.
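The effect of the output object can be pictured as a mapping from keys to text that is flushed to disk, one file per key. The Python sketch below mimics that behavior under stated assumptions: `write_output` is a hypothetical helper, a plain dictionary stands in for the output object, and a temporary directory stands in for the workspace. The analyzer performs the real write itself.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_output(output: Dict[str, str], workspace: Path) -> List[Path]:
    """Write each key/value pair to <workspace>/<key>.txt (hypothetical)."""
    workspace.mkdir(parents=True, exist_ok=True)
    written = []
    for key, value in output.items():
        path = workspace / f"{key}.txt"
        path.write_text(str(value), encoding="utf-8")
        written.append(path)
    return written


# A temporary directory stands in for the workspace in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    output = {"invoice_no": "INV-0042", "total": "199.00"}
    paths = write_output(output, Path(tmp))
    names = sorted(p.name for p in paths)
    contents = (Path(tmp) / "invoice_no.txt").read_text(encoding="utf-8")

print(names)     # ['invoice_no.txt', 'total.txt']
print(contents)  # INV-0042
```

Because the key becomes the file name, it is safest to choose keys that are valid file names on the target system.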

Examples
