How to search multiple pdf documents for words on Linux

When it comes to text search within a pdf document, pretty much every pdf reader supports it (be it Adobe Reader or any third-party pdf viewer). However, it becomes tricky when there is more than one pdf document to search.

On Linux, there are command-line tools (e.g., pdftotext or pdfgrep) that can be used to do a simple search on multiple pdf documents at once. Compared to these command-line utilities, a desktop application called recoll is a much more advanced and user-friendly text search tool. In this tutorial, I will describe how to search multiple pdf documents for text by using recoll.
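For comparison, here is a minimal sketch of the simple command-line approach, assuming pdftotext (from poppler-utils) is installed. The function name search_pdfs is made up for illustration and is not part of any package.

```shell
# search_pdfs PATTERN [DIR]
# Hypothetical helper (not part of any package): print the path of every
# pdf file under DIR whose extracted text contains PATTERN (case-insensitive).
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r file; do
        # "pdftotext FILE -" writes the extracted text to standard output.
        if pdftotext "$file" - 2>/dev/null | grep -qi -- "$pattern"; then
            printf '%s\n' "$file"
        fi
    done
}

# Example: search_pdfs "virtual machine" ~/Documents
```

If pdfgrep is installed, running pdfgrep -r -i -n "virtual machine" ~/Documents should give a similar result and also print page numbers. Either way, every search re-reads every file, which is the main limitation an index-based tool avoids.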

What is Recoll?

recoll is an open-source desktop application specializing in text search. recoll maintains a database index for all document files in a target storage location (e.g., a specific folder, home directory, disk drive, etc.). The document index contains text extracted from document files with external helper programs. Using the document index, recoll can perform more advanced queries than a simple regular-expression-based search.
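To make the idea of a document index more concrete, here is a toy sketch of a word-to-file index built from plain text files. This is only an illustration of the general technique, not how recoll is implemented (recoll relies on the Xapian library for its index), and the names build_index and lookup are hypothetical.

```shell
# build_index DIR
# Toy illustration: print one "word<TAB>file" line for every distinct
# lowercase word found in each .txt file directly under DIR.
build_index() {
    for f in "$1"/*.txt; do
        [ -f "$f" ] || continue
        # Split on non-alphanumeric characters, lowercase, de-duplicate.
        tr -cs '[:alnum:]' '\n' < "$f" | tr '[:upper:]' '[:lower:]' | sort -u |
            while IFS= read -r word; do
                if [ -n "$word" ]; then
                    printf '%s\t%s\n' "$word" "$f"
                fi
            done
    done
}

# lookup WORD INDEXFILE
# Print the files listed for WORD, without re-reading the documents.
lookup() {
    awk -F '\t' -v w="$1" '$1 == w { print $2 }' "$2"
}
```

Once the index file exists, a lookup only scans the index rather than the documents themselves, which is why index-based search stays fast as the document collection grows.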

The powerful features of recoll include:

  • Supports multiple document formats (e.g., pdf, doc, text, html, mailbox).
  • Automatically indexes document contents from files, emails, email attachments, compressed archives, etc.
  • Indexes web pages you visited (with the help of a Firefox extension).
  • Supports multiple languages and character sets (Unicode-based).
  • Supports advanced search, such as proximity search and filtering based on file type, file system location, modification time, and file size.
  • Supports search with multiple entry fields such as document title, keyword, author, etc.

Install Recoll on Linux

To install recoll and external helper programs on Debian, Ubuntu, or Linux Mint:

$ sudo apt-get install recoll poppler-utils antiword

To install recoll and external helper programs on Fedora (on older releases, use yum instead of dnf):

$ sudo dnf install recoll poppler-utils antiword

To install recoll on CentOS or RHEL, first enable the EPEL repository, and then run:

$ sudo yum install recoll poppler-utils antiword
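After installation, it may be worth confirming that recoll and the helper programs are actually on your PATH. The following is a small sketch for that; check_helpers is a hypothetical function name, not something shipped with recoll.

```shell
# check_helpers CMD...
# Hypothetical helper: report which of the given commands are not on PATH.
# Returns 0 if all are found, 1 otherwise.
check_helpers() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check_helpers recoll pdftotext antiword
```

If pdftotext is reported as missing, pdf documents will most likely not be indexed until poppler-utils is installed.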

Build a Document Index with Recoll

To launch recoll, simply run:

$ recoll

The first time you launch recoll, you will see a first-run dialog. Here you are asked to choose one of two menus before indexing starts: (1) "Indexing configuration", which controls how to build the document index, or (2) "Indexing schedule", which controls how often to update the index. For now, click on the "Indexing configuration" menu.

In the configuration window, you will see "Top directories" (directories which contain documents to search) and "Skipped paths" (file system paths to avoid when building a document index) under the "General parameters" tab. In this example, I add "~/Documents" to the "Top directories" field.

Under the "Local parameters" tab, you can specify other indexing criteria, such as file names to skip, max file size, etc. Once you are done, go ahead and create a document index. The index building process uses external programs (e.g., pdftotext for pdf documents, antiword for MS Word documents) to extract text from individual documents, and creates an index out of the extracted text.
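The same settings can also be kept in a plain text configuration file and the index built from a terminal. The sketch below assumes the topdirs and skippedPaths variables of recoll.conf and the recollindex command with its -c option for choosing a configuration directory. The function name write_recoll_conf, the ~/.recoll-test directory, and the skipped "tmp" path are made up for illustration.

```shell
# write_recoll_conf CONFDIR TOPDIR
# Hypothetical helper: create a minimal recoll.conf in CONFDIR that indexes
# TOPDIR and skips its "tmp" subdirectory (the skipped path is only an example).
write_recoll_conf() {
    confdir=$1
    topdir=$2
    mkdir -p "$confdir"
    cat > "$confdir/recoll.conf" <<EOF
# Directories to index ("Top directories" in the GUI)
topdirs = $topdir
# Paths to leave out ("Skipped paths" in the GUI)
skippedPaths = $topdir/tmp
EOF
}

# Example (a separate configuration directory, so ~/.recoll stays untouched):
#   write_recoll_conf ~/.recoll-test ~/Documents
#   recollindex -c ~/.recoll-test
```

Using a separate configuration directory is a cautious choice here: it lets you experiment without overwriting whatever the GUI has already stored in the default one.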

Once an initial document index is built, you can check which document types have been indexed by going to the "Help"-->"Show indexed types" menu. Make sure that the "application/pdf" MIME type is included.

Search Multiple PDF Documents for Text

You are now ready to search your documents. Enter any word or phrase (in quotes) to search for.

The search result shows a list of pdf documents, along with document snippets and page numbers that match the search query. For example, searching for the phrase "virtual machine" lists the pdf documents that contain it. You can check document previews, or open the matched documents in an external pdf viewer.

Using recoll, you can also search for pdf documents that contain specific word(s) in the document title. For example, by typing "title:kernel" in the search query, you can find documents which contain "kernel" in their titles.

Using the advanced search option, you can define various other search criteria, such as file type, file system location, modification time, and file size.
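Queries like these can also be run from a terminal. The sketch below assumes your recoll build supports the command-line mode (recoll -t) and the ext: field of the recoll query language. The helper name pdf_query is hypothetical.

```shell
# pdf_query TERMS...
# Hypothetical helper: build a recoll query string that restricts TERMS
# to pdf files through the ext: field of the recoll query language.
pdf_query() {
    printf '%s ext:pdf\n' "$*"
}

# Examples, assuming recoll was built with its command-line mode (-t):
#   recoll -t -q "$(pdf_query '"virtual machine"')"   # phrase search
#   recoll -t -q "$(pdf_query 'title:kernel')"        # title search
```

Note the nested quoting in the phrase example: the inner double quotes are part of the query, so that recoll treats "virtual machine" as a phrase rather than two separate words.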

As documents are added, updated or removed, you will need to update the existing document index. You can do it manually by clicking on the "Update Index" menu.

You can also update the existing document index automatically, either with a periodic cron job or with a background daemon process.
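As a rough sketch of the cron approach, the helper below adds a nightly recollindex run to a crontab without duplicating an existing entry. The function name with_recoll_job and the default 3 a.m. schedule are assumptions made for illustration.

```shell
# with_recoll_job [SCHEDULE]
# Hypothetical helper: read a crontab on standard input and print it with
# exactly one recollindex entry (default schedule: every day at 03:00).
with_recoll_job() {
    schedule=${1:-"0 3 * * *"}
    # Drop any earlier recollindex entry so the job is not duplicated.
    grep -v 'recollindex' || true
    printf '%s recollindex\n' "$schedule"
}

# Example: preview the result first, then install it.
#   crontab -l 2>/dev/null | with_recoll_job
#   crontab -l 2>/dev/null | with_recoll_job | crontab -
```

For the background daemon alternative, recollindex -m starts real-time monitoring of the indexed directories, provided your recoll build includes monitoring support.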

Dan Nanni is the founder and also a regular contributor of Xmodulo.com. He is a Linux/FOSS enthusiast who loves to get his hands dirty with his Linux box. He likes to procrastinate when he is supposed to be busy and productive. When he is otherwise free, he likes to watch movies and shop for the coolest gadgets.

8 thoughts on “How to search multiple pdf documents for words on Linux”

  1. Tracker is also quite adept at this and many other file types. Still not as good as Drive's "OCR everything" method, though. Scans and pictures I've taken show up there.

  2. Particularly useful in recoll is the advanced search, where one can limit text search to just pdf, from other text docs.

    mlocate has become very slow after I use encrypted home. recoll is also fast in locating files.

  3. Call me old-fashioned, which I probably am, but I try to avoid things that need to index your files, like nepomuk and its elder brother beagle. The only exception is mlocate

  4. Thanks for bringing this to my attention. I have thousands of self-created
    pdf's from Scan&OCR jobs, and have sometimes felt the need for something
    similar to the comprehensive search through PDF's that you could do
    in Adobe Reader (which I came to know under Windows when I had to use
    that for work).

    Of course I scripted my way around that limitation, but this here is something
    I will look into.
    Thanks again.

  5. recoll is a fantastic piece of software - very fast, very useful. I found similar software to be cumbersome, heavy, and cpu intensive just at the worst moment, when one is working on something else.
    I don't schedule recoll. It automatically updates its indexes every time it is run.

    In the past few months I began using claws-mail instead of gmail's web client. How refreshing to be using email software again that was not designed under the assumption the user is an idiot. With that short diversion, I can now go on to say that recoll does a fantastic job of indexing your email and it is lightning fast.

    That it indexes

  6. To search inside searchable PDF, you can always use: less *.pdf | grep
    For my own, I have stuffed it into a little tool "lgrep".
    BTW, you can modify less to read many more text containing files.

  7. I recently began the process of going "paperless", which involves scanning a lot of documents into pdf using scan2pdf. With recoll, finding exactly what I want is quick and easy. Scan2pdf also OCRs the documents, which means you can search with recoll for things you know a document said and still find it.
