OCR all PDFs in a folder by cron

I will not write a long explanation, just this short one.

The cronjob looks like this:

* * * * * sh -c 'flock -n /tmp/my.lockfile /home/user/scripts/ocr_pdf_scan.sh'

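This starts sh (a shell, like bash) and runs the command given after -c. The flock command takes a lock on the lock file, creating the file if it does not exist, and -n makes it give up immediately if the lock is already held. That prevents a new run of the cron job from starting while the previous one is still running.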
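To see what flock -n does without waiting for cron, here is a small sketch. The temporary lock file and the sleep commands are stand-ins for this demo only; they are not part of my setup.

```shell
# Hypothetical lock path for this demo; the cron job uses /tmp/my.lockfile.
lock=$(mktemp)

# First job: take the lock and hold it for 3 seconds in the background.
flock -n "$lock" sleep 3 &
sleep 1     # give the background job time to acquire the lock

# Second job: -n means "do not wait", so flock fails right away
# and the command behind it is never started.
if flock -n "$lock" true; then
    second="ran"
else
    second="skipped"
fi
echo "second job while locked: $second"

wait        # the first job finishes and the lock is released

# Third job: the lock is free again, so the command runs.
if flock -n "$lock" true; then
    third="ran"
else
    third="skipped"
fi
echo "third job after release: $third"

rm -f "$lock"
```

In the cron job this means a run that finds the lock taken simply exits, and the next minute gets another chance.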

The cron job runs every minute, so in crontab it has * * * * * (five stars in front of the command).

In ocr_pdf_scan.sh I have this:


find /home/user/scans/ocr -type f \( -iname "*.pdf" -and -not -iname "*_ocr.pdf" \) |
while IFS= read -r file ; do
    # IFS= and -r keep leading spaces and backslashes in file names intact.
    out=/home/user/scans/ocr_ready/"$(basename "$file" ".pdf")_ocr.pdf"
    # Remember the scan's modification time, OCR it, restore the time on the
    # result, and delete the original only if every step succeeded.
    OLDTimestamp=$(stat -c "%Y" "$file") &&
    ocrmypdf -q -l deu+eng --rotate-pages --rotate-pages-threshold 8 -c -s "$file" "$out" &&
    touch -d "@$OLDTimestamp" "$out" &&
    rm "$file"
done

exit 0

Edit 31.07.2023: Added the stat/touch part to reset the modification time after ocrmypdf processed the file.
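The stat/touch step can be tried on its own. This is a minimal sketch that assumes GNU stat and GNU touch (as on most Linux systems); the file names and the date are made up, and a plain touch stands in for ocrmypdf writing the output file.

```shell
# Throwaway directory with made-up file names.
demo=$(mktemp -d)
src="$demo/scan.pdf"
out="$demo/scan_ocr.pdf"

# Pretend the scan was made on 2020-01-02; the date is arbitrary.
touch -d "2020-01-02 03:04:05" "$src"

# Read the modification time as seconds since the epoch.
old=$(stat -c "%Y" "$src")

# A freshly written output file gets the current time ...
touch "$out"
# ... so set it back to the source's time (GNU touch accepts @SECONDS).
touch -d "@$old" "$out"

new=$(stat -c "%Y" "$out")
echo "source: $old  output: $new"

rm -r "$demo"
```

Both numbers should be identical, which is why the OCRed file keeps the date of the original scan.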

This searches for all files ending in “.pdf” but not in “_ocr.pdf” and pipes their paths to the while read loop. For each file found, the loop runs ocrmypdf on the path stored in $file and writes the result to another folder. The output name is the original filename with .pdf stripped and “_ocr.pdf” appended.
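The filter and the renaming can be checked without running any OCR. The sketch below uses a temporary folder and invented file names, so nothing in the real scan folders is touched.

```shell
# Throwaway directory with made-up file names, for illustration only.
demo=$(mktemp -d)
touch "$demo/invoice.pdf" "$demo/Letter 01.PDF" "$demo/old_ocr.pdf" "$demo/notes.txt"

# Same filter as in the script: PDFs (case-insensitive), but not *_ocr.pdf.
matches=$(find "$demo" -type f \( -iname "*.pdf" -and -not -iname "*_ocr.pdf" \) | sort)
echo "$matches"

# basename drops the directory and the given suffix.
# The suffix comparison is case-sensitive, unlike -iname.
name1=$(basename "$demo/invoice.pdf" ".pdf")_ocr.pdf
name2=$(basename "$demo/Letter 01.PDF" ".pdf")_ocr.pdf
echo "$name1"   # invoice_ocr.pdf
echo "$name2"   # Letter 01.PDF_ocr.pdf

rm -r "$demo"
```

Only invoice.pdf and Letter 01.PDF are picked up. Note the second output name: a file with an upper-case .PDF extension is found by -iname, but basename does not strip the extension, so the result ends in .PDF_ocr.pdf.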

I also tried writing the output to the same folder, and that works too, but for some reason I set it up like this.
