PDF Analyzer

PDF content analysis

Operating mode

At the top of the PDF Analyzer window, you can switch between two modes: VIEW and TRAIN. In TRAIN mode, you write scripts; in VIEW mode, you preview the document. Use the UP and DOWN buttons on the left to move through the pages of the document.

PDF

ISO standard PDF/A format is supported. Click PICK or use the %FILENAME% variable to select a PDF file.

PASSWORD

Password for the selected PDF file (optional). The password can also be stored in a text file placed in the workspace. In that case, fill this field with the filename of that text file.

Input Object

The input object is a data object retrieved from the analysis of a PDF file. It contains all text objects, line objects, etc., in a unified coordinate system, as well as the necessary parsing functions.

Doc{
    totalPages,    // The total number of pages in the file
    pageInfo[],    // Page information, such as the pixel width/height of the page
    textData[],    // Text object array
    hLines[],      // Horizontal line array
    vLines[],      // Vertical line array
    blocks[],      // Block array (composed of vertical and horizontal lines)
    tables[],      // Table array (composed of blocks)
}

Coordinate System

The input uses the Page Normalized Coordinate (PNC) system. The origin is at the top-left corner, and the coordinates for each page are normalized to adjacent integer ranges. For example, the coordinate range for the first page is [0,0] to [1,1], the second page is [0,1] to [1,2], and so on. In this way, the entire document can be viewed as one continuous coordinate system with x in [0, 1] and y in [0, N], where N is the number of pages.
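
The page arithmetic described above can be sketched as follows. This is only an illustration: `pncToPage` and `pageToPnc` are hypothetical helpers, not part of the PDF Analyzer API, and the sketch assumes that page numbers start at 1 and that page k covers y values from k-1 up to (but not including) k.

```javascript
// Hypothetical helpers (not part of the PDF Analyzer API).
// Assumption: pages are numbered from 1 and page k spans y in [k-1, k).

// Split a PNC point into a page number and page-local coordinates.
function pncToPage(point) {
  const page = Math.floor(point.y) + 1;
  return { page, x: point.x, y: point.y - (page - 1) };
}

// Combine a page number and page-local coordinates into a PNC point.
function pageToPnc(page, x, y) {
  return { x, y: (page - 1) + y };
}

console.log(pncToPage({ x: 0.25, y: 1.5 })); // { page: 2, x: 0.25, y: 0.5 }
console.log(pageToPnc(3, 0.5, 0.25));        // { x: 0.5, y: 2.25 }
```

A point halfway down the second page therefore has y = 1.5 in document coordinates.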

Text Object

textObject{
    text,            // The content of the text object
    LT{},            // Coordinates of the top left corner of the text object
    LB{},            // Coordinates of the bottom left corner of the text object
    RT{},            // Coordinates of the top right corner of the text object
    RB{},            // Coordinates of the bottom right corner of the text object
    C{},             // Coordinates of the center point of the text object
    w,               // Width of the text object
    h,               // Height of the text object
    block{},         // Record the relationship between the text object and the document table
    isDeleted(),     // Detect whether the text object is crossed by a horizontal line
    detectLine(dir), // Detect and return the nearest line, dir can be "up", "down", "left", or "right"
    detectGroup(),   // Detect and return all text objects with the same adjacent lines, which can be used to detect all text objects in the same block  
}
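
To experiment with the geometry fields outside the analyzer, a text object can be mocked from its four edges. The sketch below is an illustration under stated assumptions: `makeTextObject` and `isCrossedBy` are hypothetical helpers, the `{x1, x2, y}` line shape is invented for the example, and the built-in `isDeleted()` may apply different rules.

```javascript
// Hypothetical mock; the field names mirror the textObject listing above.
function makeTextObject(text, left, top, right, bottom) {
  return {
    text,
    LT: { x: left, y: top },
    LB: { x: left, y: bottom },
    RT: { x: right, y: top },
    RB: { x: right, y: bottom },
    C: { x: (left + right) / 2, y: (top + bottom) / 2 },
    w: right - left,
    h: bottom - top,
  };
}

// Assumed line shape {x1, x2, y}: a horizontal segment in PNC coordinates.
// True when the segment lies strictly between the top and bottom edges
// of the text object and overlaps it horizontally.
function isCrossedBy(textObj, hLine) {
  const insideVertically = hLine.y > textObj.LT.y && hLine.y < textObj.LB.y;
  const overlapsHorizontally = hLine.x1 < textObj.RT.x && hLine.x2 > textObj.LT.x;
  return insideVertically && overlapsHorizontally;
}

const word = makeTextObject("423022", 0.25, 0.5, 0.75, 0.625);
console.log(word.w, word.h, word.C); // 0.5 0.125 { x: 0.5, y: 0.5625 }
console.log(isCrossedBy(word, { x1: 0.125, x2: 0.875, y: 0.5625 })); // true
console.log(isCrossedBy(word, { x1: 0.125, x2: 0.875, y: 0.75 }));   // false
```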

Parsing Functions

The input object provides several parsing and utility functions to help users locate target objects within the document. The overall parsing logic involves narrowing down the collection of text objects based on spatial or textual conditions and then using relative relationships to find the target objects.
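
The two-step logic described above (narrow down, then follow a relative relationship) can be sketched on mock data. This is not how the analyzer is implemented: `findByText` and `nearestRight` are hypothetical helpers, the sample values are invented, and page k is assumed to span y in [k-1, k).

```javascript
// Mock data: a tiny stand-in for input.textData (hypothetical values).
const textData = [
  { text: "PRODUCTION ORDER", C: { x: 0.25, y: 0.125 } },
  { text: "423022",           C: { x: 0.5,  y: 0.125 } },
  { text: "PRODUCTION ORDER", C: { x: 0.25, y: 1.125 } }, // same label on page 2
];

// Step 1: narrow the collection by text and by page.
function findByText(objs, keyName, page) {
  return objs.filter(o => o.text === keyName && Math.floor(o.C.y) + 1 === page);
}

// Step 2: follow a relative relationship, here
// "the nearest object to the right on roughly the same line".
function nearestRight(objs, keyObj, tolerance = 0.01) {
  const candidates = objs.filter(
    o => o !== keyObj && o.C.x > keyObj.C.x && Math.abs(o.C.y - keyObj.C.y) <= tolerance
  );
  candidates.sort((a, b) => a.C.x - b.C.x);
  return candidates[0] ?? null;
}

const [key] = findByText(textData, "PRODUCTION ORDER", 1);
const value = nearestRight(textData, key);
console.log(value.text); // 423022
```

The functions listed below package this pattern: `getKeyObj` covers the first step, and `resolve`, `resolveRange`, and `resolveTable` cover the second.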

textObj = input.getKeyObj( keyName, keyBounds, options )

// keyName: the text string of the text object to find
// keyBounds: object defining the search range. Format: {"top":num,"bottom":num,"left":num,"right":num,"page":num}
// options: options object, which can provide the following options
// "regExp": use a regular expression to filter the search range
// "ignoreHeader": ignore the data in the top region of each page, which is defined by a percentage of the page's height
// "ignoreFooter": ignore the data in the bottom region of each page, which is defined by a percentage of the page's height
// "all": Boolean value; if true, the function returns all found text objects as an array

textObj = input.resolve( target, options )

// target: target object, including a key object (keyObj) or a keyName-keyBounds pair, plus a resolution direction (valPos), resolution range (valBounds), or relative resolution range (relValBounds)
// options: options object, which can provide the following options
// "regExp": use a regular expression to filter the search range
// "ignoreHeader": ignore the data in the top region of each page, which is defined by a percentage of the page's height
// "ignoreFooter": ignore the data in the bottom region of each page, which is defined by a percentage of the page's height

textObjs[] = input.resolveRange( target, options )

// target: target object, containing a combination of a starting key object (startKeyObj) or (startKeyName/startKeyBounds), an ending key object (endKeyObj) or (endKeyName/endKeyBounds), resolution direction (valPos), resolution range (valBounds), or relative resolution range (relValBounds)
// options: options object, which can provide the following options
// "regExp": use a regular expression to filter the search range
// "ignoreHeader": ignore the data in the top region of each page, which is defined by a percentage of the page's height
// "ignoreFooter": ignore the data in the bottom region of each page, which is defined by a percentage of the page's height

[Experimental feature] textObjs[] = input.resolveTable( target, options )

// target: target object, including a key object (keyObj) or a keyName-keyBounds pair, plus a resolution direction (valPos), resolution range (valBounds), or relative resolution range (relValBounds)
// valType: resolution type, set on the target object; can be ROW, ROW(Num), COL, COL(Num), or REPEAT
// The ROW/COL type parses the entire row/column of data
// The ROW(Num)/COL(Num) type parses the Num values to the right of or below the key object
// The REPEAT type parses the data at the position of the key object in all similar tables
// options: options object, which can provide the following options
// "regExp": use a regular expression to filter the search range
// "includeKey": Boolean value; if true, the return value of row and column parsing includes the key value of the target
// "tableRefs": an array of objects with the format {"refText": string, "R": num, "C": num}, used to define which content tables must contain to be considered "similar"
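
The listings above mention a relative resolution range (relValBounds) without stating its reference point. One plausible reading, sketched below, is that each offset is added to the position of the key object. This is an assumption, not documented behavior: `toAbsoluteBounds` is a hypothetical helper, and measuring from the top-left corner (LT) is a guess made for the illustration.

```javascript
// Hypothetical helper: turns offsets relative to a key object into absolute PNC bounds.
// Assumption: offsets are measured from the key object's top-left corner (LT).
function toAbsoluteBounds(keyObj, rel) {
  return {
    top: keyObj.LT.y + rel.top,
    bottom: keyObj.LT.y + rel.bottom,
    left: keyObj.LT.x + rel.left,
    right: keyObj.LT.x + rel.right,
  };
}

// Mock key object with invented coordinates.
const keyObj = { text: "Mfrag Fac:", LT: { x: 0.5, y: 0.25 } };
const bounds = toAbsoluteBounds(keyObj, { top: -0.125, right: 0.25, bottom: 0.125, left: 0 });
console.log(bounds); // { top: 0.125, bottom: 0.375, left: 0.5, right: 0.75 }
```

Under this reading, a negative `top` extends the range slightly above the key object, which is how the relValBounds example later in this page appears to be used.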

Utility Functions

strings[] = input.textGrouping( textObjs, option )

// textObjs: array of text objects that need to be reorganized
// option: arrangement option, which can be LINES, TDLR, or LRTD (default)
// Note: Text objects are first grouped where the line height is less than 1.5 times or the spacing is less than 2 character widths. The groups are then sorted in TDLR or LRTD order, and the strings from each group are output in sequence.
// Note 2: In LINES mode, text objects with similar heights on the same line are grouped together, and the strings are returned line by line.

textObjs[] = input.textSort( textObjs, mode = 0 )

// textObjs: array of text objects that need to be reorganized
// mode: sorting mode; currently supports left-to-right, top-to-bottom object sorting
// Note: returns a new sorted array
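
A left-to-right, top-to-bottom sort like the one described for `textSort` can be approximated as below. This is a stand-alone sketch, not the analyzer's implementation: `sortReadingOrder` is a hypothetical name, and bucketing objects into rows of height `rowHeight` is an assumption made so that objects on the same visual line with slightly different y values stay together.

```javascript
// Hypothetical re-implementation for illustration; input.textSort may differ.
// Assumption: objects whose top edges fall in the same rowHeight-sized band share a row.
function sortReadingOrder(textObjs, rowHeight = 0.01) {
  const row = o => Math.floor(o.LT.y / rowHeight);
  // Copy first so the caller's array is left untouched, then sort by row, then by x.
  return [...textObjs].sort((a, b) => row(a) - row(b) || a.LT.x - b.LT.x);
}

// Mock text objects with invented coordinates.
const unsorted = [
  { text: "FOB",      LT: { x: 0.25, y: 0.25 } },
  { text: "SEA",      LT: { x: 0.5,  y: 0.126 } },
  { text: "Ship Via", LT: { x: 0.25, y: 0.125 } },
];
console.log(sortReadingOrder(unsorted).map(o => o.text)); // [ 'Ship Via', 'SEA', 'FOB' ]
```
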

num = input.checkBounds( keyBounds, textObj )

// keyBounds: the range to be tested. Format: {top: num, bottom: num, left: num, right: num}
// textObj: the text object to be tested
// Returns: 0 for no overlap, 1 for partial overlap, 2 for object contained within the range
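
The three return values of `checkBounds` correspond to a standard rectangle test, sketched below. `classifyOverlap` is a hypothetical stand-in written for this illustration. It assumes that all four sides of the range are given, that the range and the text object share the same PNC coordinates, and that touching edges count as overlapping; the built-in function may treat edge cases differently.

```javascript
// Hypothetical stand-in for input.checkBounds, for illustration only.
// Returns 0 (no overlap), 1 (partial overlap), or 2 (object fully inside the range).
function classifyOverlap(bounds, textObj) {
  const left = textObj.LT.x, top = textObj.LT.y;
  const right = textObj.RB.x, bottom = textObj.RB.y;

  const contained =
    left >= bounds.left && right <= bounds.right &&
    top >= bounds.top && bottom <= bounds.bottom;
  if (contained) return 2;

  const disjoint =
    right < bounds.left || left > bounds.right ||
    bottom < bounds.top || top > bounds.bottom;
  return disjoint ? 0 : 1;
}

const range = { top: 0, bottom: 0.5, left: 0, right: 0.5 };
const inside   = { LT: { x: 0.125, y: 0.125 }, RB: { x: 0.25,  y: 0.25 } };
const straddle = { LT: { x: 0.375, y: 0.375 }, RB: { x: 0.625, y: 0.625 } };
const outside  = { LT: { x: 0.75,  y: 0.75 },  RB: { x: 0.875, y: 0.875 } };
console.log(classifyOverlap(range, inside));   // 2
console.log(classifyOverlap(range, straddle)); // 1
console.log(classifyOverlap(range, outside));  // 0
```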

Viewer and CodeGen

In PDF Analyzer's Viewer, you can not only mark text objects and coordinates but also generate function code automatically through mouse and keyboard operations. After completing the operations, users can simply copy and paste the generated code into the training mode and make minor modifications, which reduces the time needed for programming.

  • To capture an object:

  • To generate boundaries:

  • To perform the directional analysis:

  • To perform range analysis:

  • To perform relative range analysis:

Output Object

Each key added to the output object will be output as a .txt file in the workspace. The file name will be the key, and the text content will be the corresponding value.
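
The key-to-file mapping can be pictured with a plain object. In the sketch below, `output` is only a local stand-in for the object the analyzer provides, and `describeOutputFiles` is a hypothetical helper that lists the files the paragraph above implies; nothing is written to disk.

```javascript
// Stand-in for the analyzer's output object (the real one is provided by the tool).
const output = {};
output.productOrderNum = "423022";
output.issueDate = "06/24/20";

// Illustration only: list the .txt files implied by the keys of the output object.
function describeOutputFiles(out) {
  return Object.entries(out).map(([key, value]) => ({
    file: `${key}.txt`,
    content: String(value),
  }));
}

console.log(describeOutputFiles(output));
```

With the two keys above, the workspace would be expected to contain `productOrderNum.txt` and `issueDate.txt`.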

Examples

// 1. Resolve a single value by name:
let target = { keyName:"PRODUCTION ORDER", keyBounds:{page:1}, valPos:"RIGHT" }
let productOrderNum = input.resolve( target );
// result: productOrderNum.text = "423022"


// 2. Resolve a single value by object:
let orderObj = input.getKeyObj("PRODUCTION ORDER", {page:1})
let target = {keyObj: orderObj, valPos:"RIGHT"}
let productOrderNum = input.resolve( target );
// result: productOrderNum.text = "423022"


// 3. Resolve a single value with a regExp filter:
let target = { keyName:"Coord #", keyBounds:{page:1}, valPos:"RIGHT" }
let options = { regExp:/\d{2}\/\d{2}\/\d{2}/ }
let issueDate = input.resolve( target, options );
// result: issueDate.text = "06/24/20"


// 4. Resolve multiple values by start/end key objects:
let startObj = input.getKeyObj( "Ship Type", {page:1} )
let endObj = input.getKeyObj( "Ship Terms", {page:1} )
let target = {startKeyObj:startObj, endKeyObj:endObj, valPos:"RIGHT"}
let interData = input.resolveRange(target)
// Result:  interData[0].text = "AW"
// Result:  interData[1].text = "Ship Via"
// Result:  interData[2].text = "SEA"


// 5. Resolve multiple values by valBounds:
let sellerObj = input.getKeyObj("To (Seller):", {page:1})
let codeObj = input.getKeyObj("Code:", {page:1})
let accObj = input.getKeyObj("Acc Sup:", {page:1})
let target = {valBounds:{top:sellerObj.LB.y, right:accObj.LB.x, bottom:codeObj.LT.y, page:1 } }
let address = input.resolveRange(target)
// Result:  address[0].text = "TAIEASY INTERNATIONAL. CO., LTD"
// Result:  address[1].text = "11F., NO. 1, JIHU RD., NEIHU DIST.,"
// Result:  address[2].text = "TAIPEI CITY 114,"
// Result:  address[3].text = "TAIWAN (R.O.C.)"
// Result:  address[4].text = "TAIPEI CITY, TAIWAN"


// 6. Resolve multiple values by relValBounds:
let facObj = input.getKeyObj("Mfrag Fac:", {page:1})
// the relative range is measured from the key object, so pass it in the target
let target = {keyObj:facObj, relValBounds:{top: -0.01, right:0.25, bottom:0.09, left:0.0} }
let location = input.resolveRange(target)
// Result:  location[0].text = "FORMOSA VIET NAM TEXTILE INDUSTRY"
// Result:  location[1].text = "CO.,LTD"
// Result:  location[2].text = "MY XUAN A2 INDUSTRIAL ZONE, MY XUAN"
// Result:  location[3].text = "WARD,"
// Result:  location[4].text = "PHU MY TOWN, BA RIA - VUNG TAU"


// 7. Resolve a value with ignore options:
let buyerObj = input.getKeyObj("By Buyer",{page:1})
let target = {keyObj:buyerObj, valPos:"DOWN"}
let options = {ignoreHeader:0.22, ignoreFooter:0.85}
// resolve returns a single text object, so .text can be read directly
let downVal = input.resolve(target, options)
// downVal.text = "To (Seller)"


// 8. Resolve block values in a table:
let target = {keyName:"DESC", keyBounds:{page:1}, valPos:"DOWN"}
let blockVal = input.resolve(target)
// blockVal.text = "TRAINER PANT"
let allBlockVal = input.resolveTable({keyObj:blockVal})
// allBlockVal.text = "TRAINER PANT
//                     YOUTH"


// 9. Resolve row block values in a table:
let target = {keyName:"Ship Via", keyBounds:{page:1}, valType:"ROW"}
// ROW: the entire row that contains the key
let blockVals = input.resolveTable(target)
// blockVals[0].text = "Ship Type"
// blockVals[1].text = "AW"
// blockVals[2].text = "Ship Via"
// blockVals[3].text = "SEA"
// blockVals[4].text = "Ship Terms"
// blockVals[5].text = "FOB"
// ROW0: the values to the right of the key (as in example 10)
target.valType = "ROW0"
blockVals = input.resolveTable(target);
// blockVals[0].text = "SEA"
// blockVals[1].text = "Ship Terms"
// blockVals[2].text = "FOB"
// ROW2: the first 2 values to the right of the key
target.valType = "ROW2"
blockVals = input.resolveTable(target);
// blockVals[0].text = "SEA"
// blockVals[1].text = "Ship Terms"


// 10. Resolve column/row block values in a table:
let target = {keyName:"Coord #", keyBounds:{page:1}, valType:"COL5"}
let options = {includeKey:true}
// get 6 headers in the first column (the key plus 5 values below it)
let colVals = input.resolveTable(target, options)
// get the row values for each header
colVals.forEach( head => {
    let rowVals = input.resolveTable({keyObj:head, valType:"ROW0"})
})
// for "Coord #" row
//     rowVals[0].text = "TRAINING"
//     rowVals[1].text = "PO Issue Date"
//     rowVals[2].text = "06/24/20"
// for "Season" row
//     rowVals[0].text = "204HO"
//     rowVals[1].text = "Last Revised Date"
//     rowVals[2].text = "06/24/20"
// for "Payment Terms" row
//     rowVals[0].text = "75 DAY TERMS THROUGH GT NEXUS"
// for "Ship Type" row
//     rowVals[0].text = "AW"
//     rowVals[1].text = "Ship Via"
//     rowVals[2].text = "SEA"
//     rowVals[3].text = "Ship Terms"
//     rowVals[4].text = "FOB"
// for "Ref#" row
//     rowVals[0].text = "204HO NTM056 YRR PNP A"
// for "Bene" row
//     rowVals[0].text = "NTM064 TAIEASY INTERNATIONAL CO.,LTD"


// 11. Resolve repeated block values in tables:
// find the first origin
let target = {keyName:"Country of Origin", keyBounds:{page:1}, valPos:"RIGHT"}
let originVal = input.resolve(target)
// originVal.text = "Vietnam"
// resolve the origins in other, similar tables
let options = { tableRefs: [
                {refText:"Item:", R:0, C:0},
                {refText:"Country of Origin", R:1, C:0}
                ]}
let blockVals = input.resolveTable({keyObj:originVal, valType:"REPEAT"}, options)
// blockVals[0].text = "VietNam"
// blockVals[1].text = "Taiwan"
// blockVals[2].text = "China"
