PDF to TXT Converter

Convert the PDF files in a directory into TXT files


Description

This project uses the tesseract and magick packages in R to convert .pdf files to .png images, which are then read by the Tesseract optical character recognition (OCR) engine and written out as .txt files.
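The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names `txt_path_for` and `pdf_to_txt`, the one-file-per-page layout, the 300 dpi density, and the English engine are all assumptions. It relies on `magick::image_read_pdf()` (which needs the pdftools package installed) and `tesseract::ocr()`.

```r
# Sketch only: assumes the magick, pdftools, and tesseract packages are installed.
# Function and argument names are hypothetical, not taken from the app.

# Build the output path for one page, e.g. "out/report_page_001.txt".
txt_path_for <- function(pdf_path, page, out_dir) {
  stem <- tools::file_path_sans_ext(basename(pdf_path))
  file.path(out_dir, sprintf("%s_page_%03d.txt", stem, page))
}

# Convert every page of one PDF to a .txt file and return the output paths.
pdf_to_txt <- function(pdf_path, out_dir = tempdir(), density = 300) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  # Render each page of the PDF as an image (density is an assumed setting).
  pages <- magick::image_read_pdf(pdf_path, density = density)
  engine <- tesseract::tesseract("eng")  # assumed language

  out_files <- character(length(pages))
  for (i in seq_along(pages)) {
    # Write the page to a temporary .png, then run OCR on that file.
    png_path <- tempfile(fileext = ".png")
    magick::image_write(pages[i], path = png_path, format = "png")
    text <- tesseract::ocr(png_path, engine = engine)

    out_files[i] <- txt_path_for(pdf_path, i, out_dir)
    writeLines(text, out_files[i])
    unlink(png_path)
  }
  out_files
}
```

Calling `pdf_to_txt("example.pdf", out_dir = "txt")` (with a hypothetical `example.pdf`) would return one .txt path per page. Writing an intermediate .png mirrors the description above, and a higher density generally gives the OCR engine more detail at the cost of speed.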

Installation

Use the app by downloading this repository and running it locally in RStudio.

Alternatively, run the following command in your R console to download and launch the app:

shiny::runGitHub("litProject", "josue-SH", subdir = "litProject")

Use

Upload a PDF file from your computer to the Shiny app, then click the “Convert to TXT” button. I’m not sure what the maximum number of pages it can convert is. Once the app has finished reading the PDF and running the OCR, the download button activates, and pressing it downloads the .txt files as a zip file.

Special thanks to Nicholas Horton for hosting this on his server!

Roadmap

There are a few features I would like to add to the app:

  • Switching languages for the tesseract engine
  • Converting directly from URLs
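Neither feature exists in the app yet. As a rough sketch of how they might be approached, the helpers below are hypothetical: `is_url`, `fetch_pdf`, and `engine_for` are illustrative names, and the only package calls used are `tesseract::tesseract_info()`, `tesseract::tesseract_download()`, `tesseract::tesseract()`, and `utils::download.file()`.

```r
# Hypothetical helpers; none of these names come from the app.

# TRUE when the input looks like an http(s) URL rather than a local path.
is_url <- function(x) grepl("^https?://", x, ignore.case = TRUE)

# Return a local path to the PDF, downloading it first when given a URL.
fetch_pdf <- function(src) {
  if (!is_url(src)) return(src)
  dest <- tempfile(fileext = ".pdf")
  utils::download.file(src, dest, mode = "wb", quiet = TRUE)
  dest
}

# Create an OCR engine for a language code such as "spa", downloading the
# training data first if it is not already available locally.
engine_for <- function(lang = "eng") {
  if (!lang %in% tesseract::tesseract_info()$available) {
    tesseract::tesseract_download(lang)
  }
  tesseract::tesseract(lang)
}
```

The engine returned by `engine_for("spa")` could be passed to `tesseract::ocr()` through its `engine` argument, and `fetch_pdf()` lets the rest of the pipeline keep working with local file paths regardless of where the PDF came from.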

Code

View the code for this project on GitHub.