We use GitHub issues to keep track of all issues. Please do not report bugs or issues in this blog’s comments. Instead, post them on GitHub as an issue. Before submitting a comment with an issue, please use GitHub search to look for existing issues (both open and closed) that may be similar.

The PDF (Portable Document Format) was born out of The Camelot Project to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of PostScript (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.

A PDF file defines instructions to place characters (and other components) at precise x,y coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly — by placing words as they would appear in a spreadsheet.
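To make the idea concrete, here is a minimal sketch of how words could be recovered from positioned characters on a single line of text. The tuple layout `(x, width, char)`, the function name `group_into_words` and the `max_gap` threshold are assumptions made for illustration; this is not how any particular PDF library stores or groups glyphs.

```python
def group_into_words(chars, max_gap=1.5):
    """Group positioned characters on one text line into words.

    chars: list of (x, width, char) tuples (hypothetical layout).
    A new word starts whenever the horizontal gap between the end of
    the previous character and the start of the next exceeds max_gap.
    """
    words = []
    current = ""
    prev_end = None
    for x, width, ch in sorted(chars):  # left to right
        if prev_end is not None and x - prev_end > max_gap:
            words.append(current)
            current = ""
        current += ch
        prev_end = x + width
    if current:
        words.append(current)
    return words


# Made-up glyph positions: two tight clusters separated by a wide gap.
glyphs = [(10.0, 5.0, "P"), (15.2, 5.0, "D"), (20.4, 5.0, "F"),
          (40.0, 5.0, "t"), (45.1, 5.0, "o"), (50.3, 5.0, "p")]
print(group_into_words(glyphs))  # ['PDF', 'top']
```

The same gap-based reasoning, applied to words instead of characters, is what makes a column boundary visible in a table.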

The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, a format that was not designed for tabular data in the first place!

Camelot: PDF table extraction for humans

Today, we’re pleased to announce the release of Camelot, a Python library and command-line tool that makes it easy for anyone to extract data tables trapped inside PDF files! You can check out the documentation at Read the Docs and follow the development on GitHub.

How to install Camelot

Installation is easy! After installing the dependencies, you can install Camelot using pip (the recommended tool for installing Python packages):

$ pip install camelot-py

How to use Camelot

Extracting tables from a PDF using Camelot is very simple. Here’s how you do it. (Here’s the PDF used in the following example.)

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!

You can also check out the command-line interface.

Why use Camelot?

  • Camelot gives you complete control over table extraction by letting you tweak its settings.
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • You can export tables to multiple formats, including CSV, JSON, Excel and HTML.
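As a rough sketch of the second point, tables can be filtered on the `accuracy` and `whitespace` values from `parsing_report` shown earlier. The function name `keep_good_tables` and the threshold values are assumptions chosen for illustration, and the stand-in objects below only mimic the `parsing_report` attribute so the example runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return only the tables whose parsing report passes both thresholds.

    The thresholds are illustrative defaults, not recommended values.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            good.append(table)
    return good


# Stand-ins for extracted tables (made-up reports, for illustration only).
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24,
                                    "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0,
                                    "order": 2, "page": 1}),
]
good = keep_good_tables(sample)
print(len(good))  # 1
```

With real output, the same function could be called on the result of `camelot.read_pdf('foo.pdf')`.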

Okay, but why another PDF table extraction library?

TL;DR: Total control for better table extraction

Many people use open (Tabula, pdf-table-extract) and closed-source (smallpdf, pdftables) tools to extract tables from PDFs. But they either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table.

We created Camelot to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!

You can check out a comparison of Camelot’s output with other open-source PDF table extraction libraries.

The longer read

We’ve often needed to extract data trapped inside PDFs.

The first tool that we tried was Tabula, which has nice user and command-line interfaces, but it either worked perfectly or failed miserably. When it failed, it was difficult to tweak the settings — such as the image thresholding parameters, which influence table detection and can lead to a better output.

We also tried closed-source tools like smallpdf and pdftables, which worked slightly better than Tabula. But then again, they also didn’t allow tweaking and cost money.

When these full-blown PDF table extraction tools didn’t work, we tried pdftotext (an open-source command-line utility). pdftotext extracts text from a PDF while preserving the layout, using spaces. After getting the text, we had to write Python scripts with complicated regexes (regular expressions) to convert the text into tables. This wasn’t scalable, since we had to change the regexes for each new table layout.
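A simplified sketch of that kind of script is shown below. It assumes layout-preserved text in which columns are separated by two or more spaces; the sample text and the function name `layout_text_to_rows` are made up for illustration and are not taken from the scripts described above.

```python
import re


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumes columns are separated by runs of two or more whitespace
    characters, while single spaces stay inside a cell.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


# Made-up sample resembling layout-preserved text output.
sample = """
State          Cases    Deaths
Kerala           120         3
Tamil Nadu        85         1
"""
rows = layout_text_to_rows(sample)
print(rows[2])  # ['Tamil Nadu', '85', '1']
```

The weakness is easy to see: an empty cell, or a cell that itself contains two consecutive spaces, shifts every later column, so the pattern has to be adjusted for each new layout.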

We clearly needed a tweakable PDF table extraction tool, so we started developing one in December 2015. We started with the idea of giving the tool back to the community, which had given us so many open-source tools to work with.

We knew that Tabula classifies PDF tables into two classes. It has two methods to extract these different classes: Lattice (to extract tables with clearly defined lines between cells) and Stream (to extract tables with spaces between cells). We named Camelot’s table extraction flavors, Lattice and Stream, after Tabula’s methods.
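In Camelot, the flavor is selected with the `flavor` keyword of `read_pdf`. The sketch below tries Lattice first and falls back to Stream when no table is found. The helper name `read_with_fallback` and the injectable `reader` argument are assumptions added so the logic can be exercised without a PDF; they are not part of Camelot.

```python
def read_with_fallback(path, reader=None):
    """Try the Lattice flavor first, then Stream if nothing was found.

    reader is a hypothetical hook for testing; by default it is
    camelot.read_pdf, imported lazily so this module loads without Camelot.
    """
    if reader is None:
        import camelot
        reader = camelot.read_pdf
    tables = reader(path, flavor="lattice")
    if len(tables) == 0:
        # No ruled tables detected; retry with whitespace-based parsing.
        tables = reader(path, flavor="stream")
    return tables


# Usage with Camelot installed:
# tables = read_with_fallback("foo.pdf")
```

Whether a fallback like this is appropriate depends on the documents; for a known layout, passing the flavor explicitly is simpler.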

Tabula uses a combination of scraping the vector elements and raster lines. Since we wanted to use Python, OpenCV was the obvious choice to do image processing. After more exploration, we settled on morphological transformations, which gave the exact line segments. From here, representing the table trapped inside a PDF was straightforward.
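To illustrate what a morphological transformation does here, the pure-Python sketch below keeps only horizontal runs of foreground pixels that are at least `min_length` long. On a binary image this is equivalent to a morphological opening with a 1 x `min_length` structuring element. It is a toy stand-in written for this explanation, not Camelot's implementation; with OpenCV the same effect would typically come from an erosion followed by a dilation with a rectangular kernel.

```python
def keep_horizontal_runs(image, min_length):
    """Morphological opening with a horizontal line, on a binary image.

    image: list of rows, each a list of 0/1 pixels.
    Runs of 1s shorter than min_length (text strokes, vertical rules)
    are erased; longer runs (horizontal table lines) survive.
    """
    result = []
    for row in image:
        out = [0] * len(row)
        start = None
        # The trailing 0 is a sentinel that closes a run ending at the edge.
        for x, pixel in enumerate(row + [0]):
            if pixel and start is None:
                start = x
            elif not pixel and start is not None:
                if x - start >= min_length:
                    out[start:x] = [1] * (x - start)
                start = None
        result.append(out)
    return result


# Made-up 4 x 6 image: two horizontal rules with short strokes between them.
image = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
]
lines = keep_horizontal_runs(image, min_length=4)
print(lines[0], lines[1])  # [1, 1, 1, 1, 1, 1] [0, 0, 0, 0, 0, 0]
```

Running the same operation on the transposed image would isolate vertical lines, and the two results together outline the cells of a ruled table.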

To get more information on how Lattice and Stream work in Camelot, check out the “How It Works” section of the documentation.

How we use Camelot

We’ve battle tested Camelot by using it in a variety of projects, both for one-off and automated table extraction.

For Atlan Grid, our curated data from 600+ sources and partners, we identified open data sources for India (primarily PDF reports) for each of the 17 Sustainable Development Goals. For example, one of our sources for Goal 3 (“Good Health and Well-Being for People”) is the National Family Health Survey (NFHS) report released by IIPS. To get data from these PDF sources, we created an internal web interface built on top of Camelot, where our data analysts could upload PDF reports and extract tables in their preferred format.

We also set up an ETL workflow using Apache Airflow to track disease outbreaks in India. The workflow scrapes the Integrated Disease Surveillance Programme (IDSP) website for weekly PDFs of disease outbreak data, and then it extracts tables from the PDFs using Camelot, sends alerts to our team, and loads the data into a data warehouse.

To infinity and beyond!

Camelot has some limitations. (We’re developing solutions!) Here are a couple of them:

  • When using Stream, tables aren’t autodetected. Stream treats the whole page as a single table, which gives bad output when there are multiple tables on the page.
  • Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, “If you can click-and-drag to select text in your table in a PDF viewer… then your PDF is text-based”.)

You can check out the GitHub repository for more information.

You can help too — every contribution counts! Check out the Contributor’s Guide for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements. You can also head to the issue tracker and look for issues labeled “help wanted” and “good first issue”.

We urge organizations to release open data in a “data friendly” format like CSV. But while tables are trapped inside PDF files, there’s Camelot 🙂


Note: This blog was updated on 2nd November 2018 after we learnt that Tabula uses a combination of scraping the vector elements and raster lines, and not the Hough Transform as mentioned in this blog.


Photo by Jason Wong on Unsplash

Author

Data Engineering at Atlan

34 Comments

  13. Ravender Singh Dahiya

    Hello vinayak
    camelot.read_pdf('foo.pdf') is not working. Is there any change in the latest version? I downloaded it only today.

    It is giving error that it cannot find the file (whereas the file is present there)

  14. Why don’t you guys compare PDFPlumber extraction part with Camelot extract part. Also in your results you are not able to extract merged cells properly. CSV will not be able to handle it, so you might need to think of Excel output

  15. is there any way in which we can access table and cell coordinates? like in tabula we get the json with in the format specified below?
    {u’width’: 44.12999725341797, u’top’: 166.88, u’height’: 10.020000457763672, u’text’: u’suraj’, u’left’: 102.62}

  16. Hi Vinayak,
    I am not able to import camelot after a successful install. Can you please help me out?
    —————————————————————–
    >>> import camelot
    Traceback (most recent call last):
    File “”, line 1, in
    File “/home/vineet/anaconda/lib/python2.7/site-packages/camelot/__init__.py”, line 8, in
    from .io import read_pdf
    File “/home/vineet/anaconda/lib/python2.7/site-packages/camelot/io.py”, line 4, in
    from .handlers import PDFHandler
    File “/home/vineet/anaconda/lib/python2.7/site-packages/camelot/handlers.py”, line 9, in
    from .parsers import Stream, Lattice
    File “/home/vineet/anaconda/lib/python2.7/site-packages/camelot/parsers/__init__.py”, line 4, in
    from .lattice import Lattice
    File “/home/vineet/anaconda/lib/python2.7/site-packages/camelot/parsers/lattice.py”, line 18, in
    from ..image_processing import (adaptive_threshold, find_lines,
    File “/home/vineet/anaconda/lib/python2.7/site-packages/camelot/image_processing.py”, line 5, in
    import cv2
    ImportError: No module named cv2
    >>>
    —————————————————————–

    • You need to install OpenCV to run Camelot, which you can do by installing Camelot with “pip install camelot-py[cv]”. If you face any issue, please file it on GitHub.

  17. Bhupender Kumar

    Hi Vinayak,

    I am getting the following error:

    ” File “C:\Users\UserName\AppData\Roaming\Python\Python36\site-packages\camelot\image_processing.py”, line 38, in adaptive_threshold
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    error: OpenCV(3.4.4) C:\projects\opencv-python\opencv\modules\imgproc\src\color.cpp:181: error: (-215:Assertion failed) !_src.empty() in function ‘cv::cvtColor'”

    I am using it as follows:

    import camelot
    tables = camelot.read_pdf("C:\\Users\\UserName\\Desktop\\foo.pdf")

    Please tell me how to solve the issue.

  18. Hi Vinayak,

    I have some PDFs where a table starts on page 1 and ends on page 2. That is, a table in the PDF spans 2 pages. Tabula doesn’t seem to give a good result in that case. Is something like this feasible with Camelot?

  20. Hi Vinayak,
    I am getting an error while running your example. Can you help me out?
    PS.: I installed python3-ghostscript

    Traceback (most recent call last):
    File “C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py”, line 193, in get_executable
    raise ValueError
    ValueError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File “C:/Users/tomas/Desktop/Google Drive/lang/z-pdfcamelot.py”, line 6, in
    tables = camelot.read_pdf(pdf)
    File “C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\io.py”, line 101, in read_pdf
    tables = p.parse(flavor=flavor, suppress_stdout=suppress_stdout, **kwargs)
    File “C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\handlers.py”, line 156, in parse
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout)
    File “C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py”, line 361, in extract_tables
    self._generate_image()
    File “C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py”, line 220, in _generate_image
    gs = get_executable()
    File “C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py”, line 206, in get_executable
    ‘Please make sure that Ghostscript is installed’
    camelot.parsers.lattice.GhostscriptNotFound: Please make sure that Ghostscript is installed and available on the PATH environment variable

  21. Hello vinayak

    I am trying to use your code, but it is throwing an import error: “ImportError: cannot import name ‘TableList’ from ‘camelot.core’ (C:\Users\NITESH\PycharmProjects\pdf_to_excel\venv\lib\site-packages\camelot\core\__init__.py)”. Could you please suggest how to fix it? I imported cv2 before importing camelot.

    Thanks in Advance.

  22. Hi Vinayak,
    Camelot is too good and working perfectly, only problem i faced during converting pdf to excel is every header of an excel is repeating twice.

  23. Kallol Samanta

    Hi,

    Thanks for the good work. I have one doubt: I am uploading a PDF with multiple pages using read_pdf.
    I am getting output for only one page. How do I get the output for every page?

  24. It does not work

    #!/usr/bin/env python
    import camelot

    tables = camelot.read_pdf('upower15.pdf')
    print tables
