In this guide, we will talk about how to extract a table locked in a PDF file. We are also going to write a Node.js function to help with that.
It’s a complete guide to PDF table extraction in Node.js.
Tabula: The PDF Table Extraction Library
We are going to use the Tabula open-source project.
I am going to assume that you are using this library as a programmer. For that, Tabula is available as a JAR file that you run with the Java CLI program. You can download the JAR file from the GitHub Releases Page.
Go ahead and download it and save it on your machine. In order to run it, you will also need the “java” CLI program installed.
A detailed guide on how to install Java on Windows, macOS & Ubuntu can be found here.
Also, here is a sample PDF with a table. You can use this PDF if you don’t have a particular PDF file in mind for exploring the library.
Quick Start
So, let us quickly try to get some results using the CLI. Run the command below:
$> java -jar tabula-1.0.5-jar-with-dependencies.jar --pages 1-9 PDF_File.pdf --outfile output.csv
In the above line, the JAR is run with Java, and we ask the Tabula library to go over pages 1 to 9. Then we give it the PDF and ask for the result to be stored in an output CSV file.
Below, let’s look at some of the most important command line arguments you can use and how to use them.
Common Command Line Arguments
The complete list of command line arguments can be found on the GitHub Page.
Below are some of the important arguments.
Pages: --pages 1-2
This is easy to understand. It’s the page, or the range of pages, that you want Tabula to analyze: a single page such as 2, or a range such as 1-9.
It is used something like…
$> java -jar tabula-1.0.5-jar-with-dependencies.jar --pages 1-9 PDF_File.pdf --outfile output.csv
As you can see, in the above example, pages 1 to 9 are being analyzed.
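If the pages value comes from user input, it can help to check it before handing it to the command line. The sketch below is only an illustration: isValidPagesValue is a hypothetical helper, and it assumes the accepted forms are a single page, a range, a comma-separated list of those, or the word all, which is how I understand Tabula's help text. Adjust it if your version behaves differently.

```javascript
// Hypothetical helper: returns true when `pages` looks like a value
// that the --pages option accepts.
// Assumed forms: "3", "1-9", "1-3,5-7" or "all".
function isValidPagesValue(pages) {
  if (typeof pages !== 'string') return false;
  if (pages === 'all') return true;

  // Digits, optionally followed by "-digits", repeated with commas
  const rangePattern = /^\d+(-\d+)?(,\d+(-\d+)?)*$/;
  if (!rangePattern.test(pages)) return false;

  // Every range must start at page 1 or later and must not run backwards
  return pages.split(',').every((part) => {
    const [start, end = start] = part.split('-').map(Number);
    return start >= 1 && end >= start;
  });
}

console.log(isValidPagesValue('1-9'));
console.log(isValidPagesValue('9-1'));
```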
Area To Analyze: --area 269.875,12.75,790.5,561
Here the idea is that we create a box by specifying the coordinates of its top-left corner (y1,x1) and its bottom-right corner (y2,x2). The following is the order in which the coordinates are specified.
y1,x1,y2,x2
The numbers are in “pts” or “points” (one point is 1/72 of an inch). Coming up with these numbers takes a little more work, so I have written a whole section about it. Check out “Extracting A Table At A Particular Location From A PDF” below.
Here is what that might look like:
$> java -jar tabula-1.0.5-jar-with-dependencies.jar --pages 1 --area 620.479,337.881,657.153,482.809 unicode.pdf --outfile output.csv
Format: --format CSV
Here you have 3 options: CSV, TSV, or JSON. CSV is the default.
In order to use this option, you can do something like:
$> java -jar tabula-1.0.5-jar-with-dependencies.jar --pages 1 --area 620.479,337.881,657.153,482.809 --format JSON unicode.pdf --outfile output.json
As you can see above the output format is set to JSON.
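Once you have the JSON output, you will usually want plain rows of text out of it. The sketch below assumes that the JSON is an array of tables, that each table has a data property holding rows, and that each row is an array of cell objects with a text field. That is the shape I have seen, but please check it against your own output file. The name tablesToRows and the sample data are made up for illustration.

```javascript
// Hypothetical helper: turns Tabula's JSON output into plain arrays of strings.
// Assumed shape: [ { data: [ [ { text: '...' }, ... ], ... ] }, ... ]
function tablesToRows(tabulaJson) {
  // Accept either the raw JSON string or an already parsed array
  const tables = typeof tabulaJson === 'string' ? JSON.parse(tabulaJson) : tabulaJson;

  // One entry per table, each holding rows of cell text
  return tables.map((table) =>
    table.data.map((row) => row.map((cell) => cell.text))
  );
}

// Made-up sample in the assumed shape (real output carries more fields)
const sampleJson = [
  {
    data: [
      [{ text: 'Number of Coils' }, { text: 'Number of Paperclips' }],
      [{ text: '5' }, { text: '3, 5, 4' }]
    ]
  }
];

console.log(tablesToRows(sampleJson));
```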
Lattice Mode: --lattice
This is an interesting one. Lattice means that the library should look for a table on the page where all the cells have boxes around them. Something like a spreadsheet.
So, take this below page as an example.
As you can see, there are 2 tables on the page. When Tabula analyzes this page with lattice mode turned off, it picks up a lot of extra content. With lattice mode turned on, the output is just the 2 tables. See the screenshot below.
Password: --password <password>
Well, as you can guess, if the PDF file has a password on it, this option can be passed in order to unlock the file first and then extract the tables. Below is how you would use it:
$> java -jar tabula-1.0.5-jar-with-dependencies.jar --pages 1-9 PDF_File.pdf --password MYPASSWORD --outfile output.csv
Output File: --outfile <file_path>
This one is quite obvious, and we have already used it many times. It’s just the output file path. Note that if you do not provide this option, the output is printed directly to the console.
$> java -jar tabula-1.0.5-jar-with-dependencies.jar --pages 1 --format JSON unicode.pdf --outfile output.json
Extracting A Table At A Particular Location From A PDF
All the other options we have talked about above are fairly obvious. The only one that needs more explanation is the --area option.
Using that option, you can specify a box. The table to be extracted is found in this box. Now, in order to specify the box, you need 4 coordinates: y1,x1,y2,x2
And, these coordinates are specified in “pts” or “points”.
Now, the best way I have found to define this box is by using a visual program. We will be using the free program Inkscape. Inkscape is an amazing open-source project. It is full-fledged vector design software, something like Adobe Illustrator or CorelDRAW. We will be using a very small part of it for this task.
It will be easier to understand things if you see a short video. So, I have created a small video about the same.
So, with that, you can run your command like so:
$> java -jar tabula-1.0.5-jar-with-dependencies.jar --pages 1 --area 620.479,337.881,657.153,482.809 --format JSON unicode.pdf --outfile output.json
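A drawing program typically reports a rectangle as X, Y, width and height, and not always in points. The small helper below turns such a rectangle into the y1,x1,y2,x2 string. It is only a sketch: boxToArea is a hypothetical name, the px conversion assumes 96 pixels per inch, and the rectangle is assumed to be measured from the top-left corner of the page with y growing downwards. If your program measures y from the bottom of the page, you will need to flip it against the page height first.

```javascript
// Conversion factors to points (1 pt = 1/72 inch).
// The px factor assumes 96 pixels per inch.
const POINTS_PER_UNIT = { pt: 1, in: 72, mm: 72 / 25.4, px: 72 / 96 };

// Hypothetical helper: converts a rectangle given as x, y, width, height
// into the "y1,x1,y2,x2" string used by the --area option.
// Assumes the origin is the top-left corner of the page.
function boxToArea({ x, y, width, height }, unit = 'pt') {
  const factor = POINTS_PER_UNIT[unit];
  if (factor === undefined) {
    throw new Error(`Unsupported unit: ${unit}`);
  }
  // Convert to points and keep at most 3 decimal places
  const toPoints = (value) => Math.round(value * factor * 1000) / 1000;
  return [y, x, y + height, x + width].map(toPoints).join(',');
}

// A rectangle 3in wide and 4in tall, placed 1in from the left and 2in from the top
console.log(boxToArea({ x: 1, y: 2, width: 3, height: 4 }, 'in'));
// prints "144,72,432,288"
```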
A Node.js Function To Use
So, now that you understand how the PDF extraction library is used on the command line, let’s look at how to use it in Node.js. I have created a function that takes a file as the first parameter and the options you would like to use as the second parameter. As is usually done in Node and JS-based libraries, all the options are passed in the form of a plain JavaScript object.
Here is the Node.js function. I have commented the code where needed so that you can understand what is going on. But, it’s really not very complex.
const { execSync } = require('child_process');

let extractTable = (pdfFilePath, options) => {
    const PATH_TO_JAVA = '/usr/bin/java';
    const PATH_TO_JAR = '/home/main/Downloads/tabula-1.0.5-jar-with-dependencies.jar';

    // Collecting the options
    let pages = options?.pages;
    let topLeftY = options?.area?.topLeft?.y;
    let topLeftX = options?.area?.topLeft?.x;
    let bottomRightY = options?.area?.bottomRight?.y;
    let bottomRightX = options?.area?.bottomRight?.x;
    let latticeMode = options?.latticeMode;
    let outputFormat = options?.outputFormat;
    let password = options?.password;
    let outputFileName = options?.outputFileName;
    let optionsArray = [];

    // Constructing the command line options one by one
    if (pages != undefined) {
        optionsArray.push(`--pages ${pages}`);
    }
    if (
        topLeftX != undefined && topLeftY != undefined &&
        bottomRightX != undefined && bottomRightY != undefined
    ) {
        optionsArray.push(`--area ${topLeftY},${topLeftX},${bottomRightY},${bottomRightX}`);
    }
    if (latticeMode == true) {
        optionsArray.push(`--lattice`);
    }
    if (outputFormat != undefined) {
        optionsArray.push(`--format ${outputFormat}`);
    }
    if (password != undefined) {
        optionsArray.push(`--password ${password}`);
    }

    // Making the full command
    let command = `${PATH_TO_JAVA} -jar ${PATH_TO_JAR} ${optionsArray.join(' ')} ${pdfFilePath}`;

    // If an output file location is given, the result is written there;
    // otherwise the data is returned as part of the result object
    if (outputFileName != undefined) {
        command = `${command} -o ${outputFileName}`;
    }

    // Calling the command via execSync
    try {
        let stdout = execSync(command);
        if (outputFileName != undefined) {
            return { success: true };
        }
        if (outputFormat == 'JSON') {
            return { success: true, result: JSON.parse(stdout.toString()) };
        }
        // CSV (the default) and TSV are returned as plain text
        return { success: true, result: stdout.toString() };
    } catch (err) {
        return { success: false, error: err.toString() };
    }
};

// Using the function
let result = extractTable('pdf_with_table.pdf', {
    pages: '1',
    latticeMode: true,
    outputFormat: 'CSV',
    area: {
        topLeft: { y: 190.8, x: 35.8 },
        bottomRight: { y: 481.5, x: 580.5 }
    }
});
console.log(JSON.stringify(result, null, 4));
One of the things used above is execSync. It executes a command synchronously in Node.js, which means the function waits for Tabula to finish before it returns. The official docs for the same can be found here.
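One thing to keep in mind is that the function builds a single command string, so a file path or password containing spaces or shell characters may be misread by the shell. A possible variation, sketched below, is to collect the arguments in an array and pass them to execFileSync, which runs the program without going through a shell. This is not what the function above does; it is just an alternative to consider. The names buildTabulaArgs and runTabula are hypothetical, and the option names are the same as in the options object used in this guide.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: builds the Tabula arguments as an array,
// one array element per command line token.
function buildTabulaArgs(pdfFilePath, options = {}) {
  const args = [];
  if (options.pages !== undefined) {
    args.push('--pages', String(options.pages));
  }
  const { topLeft, bottomRight } = options.area ?? {};
  if (topLeft && bottomRight) {
    // Order is y1,x1,y2,x2
    args.push('--area', [topLeft.y, topLeft.x, bottomRight.y, bottomRight.x].join(','));
  }
  if (options.latticeMode === true) {
    args.push('--lattice');
  }
  if (options.outputFormat !== undefined) {
    args.push('--format', options.outputFormat);
  }
  if (options.password !== undefined) {
    args.push('--password', options.password);
  }
  if (options.outputFileName !== undefined) {
    args.push('--outfile', options.outputFileName);
  }
  args.push(pdfFilePath);
  return args;
}

// Hypothetical wrapper: runs Java directly, without a shell,
// and returns whatever Tabula printed to stdout as a string.
function runTabula(javaPath, jarPath, pdfFilePath, options) {
  const args = ['-jar', jarPath, ...buildTabulaArgs(pdfFilePath, options)];
  return execFileSync(javaPath, args).toString();
}

// The space in the file name needs no quoting here
console.log(buildTabulaArgs('my report.pdf', { pages: '1-3', latticeMode: true }));
```

Because every argument is its own array element, nothing needs to be escaped by hand.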
At the top of the function, in PATH_TO_JAVA and PATH_TO_JAR, you have to provide the paths to your java CLI installation and the JAR file. If you are on some variant of Ubuntu, you can get the path to Java by typing “which java” on the command line.
The path to the JAR is based on where you put it.
Below are all the possibilities for the options object that is also passed in when calling the function.
{
    pages: '1-3', // This can also be something like '2'
    latticeMode: true, // This is optional. See above for what this does
    outputFormat: 'CSV', // Options are: JSON, CSV & TSV. CSV is default
    area: {
        topLeft: { y: 190.8, x: 35.8 },
        bottomRight: { y: 481.5, x: 580.5 }
    }, // This is explained in the above section. But it's optional
    password: 'mypass', // Optional. Can be passed if needed
    outputFileName: '/path/output.csv' // If not given, the data is returned in the result
}
If the outputFileName is provided the result is something like this:
{ success: true }
On the other hand, if no outputFileName is given, the response will be something like this:
{ "success": true, "result": "Number of Coils,Number of Paperclips\r\n5,\"3, 5, 4\"\r\n10,\"7, 8, 6\"\r\n15,\"11, 10, 12\"\r\n20,\"15, 13, 14\"\r\n" }
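The CSV comes back as one string, and as you can see, some cells contain commas inside quotes, so a plain split on commas will not work. Below is a minimal parser sketch for turning such a string into rows. parseCsv is a hypothetical helper written for this guide, not something Tabula provides. It handles quoted fields and doubled quotes, but for anything serious a well-tested CSV library is probably the better choice.

```javascript
// Hypothetical helper: parses CSV text into an array of rows,
// where each row is an array of strings.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        // A doubled quote inside a quoted field is a literal quote
        field += '"';
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n') {
      // End of a line: close the current field and the current row
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }

  // Keep a last line that has no trailing newline
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

const csvText = 'Number of Coils,Number of Paperclips\r\n5,"3, 5, 4"\r\n10,"7, 8, 6"\r\n';
console.log(parseCsv(csvText));
```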
And in case you get an error, the response will be something like this:
{ "success": false, "error": "Error: Command failed: ....some error message" }
Installing Java & The Library On An Ubuntu Server
Now, finally, when you need to run this on a server, you will need to install Java. I am going to assume you are using an Ubuntu server for the below instructions. You can install the “java” CLI using something like:
$> sudo apt update
$> sudo apt install default-jre
$> java -version
On running the last command, you should get something like this:
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
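If you would like your Node.js code to make the same check before it tries to run Tabula, something like the sketch below could work. isJavaAvailable is a hypothetical helper. It simply runs the java program with -version and reports whether that succeeded, so a missing or broken installation shows up early with a clear answer.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: returns true when `<javaPath> -version` can be
// started and exits with status 0.
function isJavaAvailable(javaPath = 'java') {
  const result = spawnSync(javaPath, ['-version']);
  // result.error is set when the program could not be started at all,
  // for example when the path does not exist
  return result.error === undefined && result.status === 0;
}

if (!isJavaAvailable()) {
  console.log('Java was not found. Please install it before running Tabula.');
}
```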
DigitalOcean makes great guides about how to install and set up things on a server. Here is the guide that talks about the above. You might find another guide on the DigitalOcean website for your desired OS.
Conclusion
With that, you have everything you need for PDF table extraction in Node.js. I hope you found this article useful.