Convert PDF to Text (Using Apache PDFBox)


In this post we will see how to convert a PDF to text, that is, how to extract the text from a PDF file.
We will use Apache PDFBox, an open-source Java library and one of the projects hosted on the Apache.org website.

Apache PDFBox is powerful and handy. Whether you are a full-fledged Java programmer or an ordinary end user, you can use it through the APIs it provides or from the command line. We will see both ways. Here I am using Apache PDFBox 2.0.


Download the library
Download the library from the Apache PDFBox download page (pdfbox-app-2.0.3.jar).

Using Java APIs:

First, add the jar file pdfbox-app-2.0.3.jar to your project. To extract text from a PDF, load the file with PDDocument.load, pass the document to a PDFTextStripper, and call getText.
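The extraction steps can be sketched as follows. This is a minimal example against the PDFBox 2.0 API, not the only way to do it; the class name PdfToText, the helper extractText, and the file name sample.pdf are placeholders chosen for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical class name; requires pdfbox-app-2.0.x.jar on the classpath.
public class PdfToText {

    // Loads the PDF, extracts the text of all pages, and closes the document.
    static String extractText(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is a placeholder; pass a real path as the first argument.
        File pdfFile = new File(args.length > 0 ? args[0] : "sample.pdf");
        System.out.println(extractText(pdfFile));
    }
}
```

The try-with-resources block makes sure the document is closed even if extraction fails. Note that encrypted or image-only (scanned) PDFs may need extra handling that this sketch does not cover.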

Apart from text extraction, Apache PDFBox provides many other APIs; check them out in the API section of the PDFBox website.

Using Commandline:

Syntax for text extraction:

java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]

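Example (input.pdf and output.txt are placeholder file names):

java -jar pdfbox-app-2.0.3.jar ExtractText input.pdf output.txt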


More command-line parameters are listed in the command-line tools section of the PDFBox documentation.

With either method, you can loop over multiple files, or parallelize the work for bulk processing.
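As a rough illustration of bulk processing with the Java API, the sketch below converts every PDF in a directory using a parallel stream. The class name BulkPdfToText and the helpers convert and convertAll are hypothetical, and the sketch assumes each file is handled with its own PDDocument and PDFTextStripper so that no PDFBox object is shared between threads.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical class name; requires pdfbox-app-2.0.x.jar on the classpath.
public class BulkPdfToText {

    // Converts one PDF and writes the text next to it, e.g. report.pdf -> report.txt.
    static Path convert(Path pdf) {
        String name = pdf.getFileName().toString();
        Path txt = pdf.resolveSibling(name.substring(0, name.length() - 4) + ".txt");
        try (PDDocument document = PDDocument.load(pdf.toFile())) {
            String text = new PDFTextStripper().getText(document);
            Files.write(txt, text.getBytes(StandardCharsets.UTF_8));
            return txt;
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to convert " + pdf, e);
        }
    }

    // Converts every *.pdf file directly inside the directory, several at a time.
    static List<Path> convertAll(Path directory) throws IOException {
        List<Path> pdfs;
        try (Stream<Path> files = Files.list(directory)) {
            pdfs = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .collect(Collectors.toList());
        }
        // Each file gets its own document and stripper; nothing is shared across threads.
        return pdfs.parallelStream()
                .map(BulkPdfToText::convert)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given.
        Path directory = Paths.get(args.length > 0 ? args[0] : ".");
        convertAll(directory).forEach(System.out::println);
    }
}
```

A parallel stream is the simplest option here; for very large batches, an ExecutorService with a fixed thread count gives more control over memory use, since each open document is held in memory while it is processed.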
