Document Intelligence Series — Part 2: Transformer for Table Detection & Extraction

Rajat Roy
5 min read · Oct 16, 2023

Introduction

This is the second part of the Document Intelligence series, where I continue to explore techniques that can be applied to documents. This article introduces an approach to table detection and extraction using transformer-based models.

The Table Transformer is built on the Detection Transformer (DETR for short), an encoder-decoder transformer originally introduced by Facebook AI Research; Microsoft trained and released the table-specific checkpoints used below. I recommend watching this video to understand how DETR works in full.

Let's see how to use it.

Code

The experiment was performed in Google Colab, and I'm sharing the code from my notebook here.

1. Install the libraries
!pip install -q git+https://github.com/huggingface/transformers.git

!sudo apt install tesseract-ocr

!pip install -q timm pytesseract
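For context on the dependencies: timm provides the convolutional backbone that DETR-family models in transformers rely on, and pytesseract is a Python wrapper around the Tesseract OCR engine installed by the apt command above.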

2. Import the libraries

import matplotlib.pyplot as plt

%matplotlib inline

import numpy as np
import pandas as pd
import pytesseract
from pytesseract import Output

from huggingface_hub import hf_hub_download
from PIL import Image

import torch

from transformers import DetrFeatureExtractor
from transformers import TableTransformerForObjectDetection
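A quick compatibility note: in recent versions of transformers, DetrFeatureExtractor is deprecated in favor of DetrImageProcessor, which exposes the same preprocessing call and the same post_process_object_detection method, so it can be swapped in below with no other changes.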

3. Initialize model

# handles resizing/normalization before the model and rescaling of boxes after
feature_extractor = DetrFeatureExtractor()
# this checkpoint recognizes structure (rows, columns, headers, cells)
# inside an image that is already cropped to a table
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
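Since this checkpoint expects an image that is already mostly a table, if you are starting from a full document page you can first locate the tables with Microsoft's companion detection checkpoint, microsoft/table-transformer-detection. A minimal sketch of that extra step (my addition, not part of the original notebook; page.png is a hypothetical page image):

detection_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

page = Image.open("page.png").convert("RGB")  # hypothetical full-page image
enc = feature_extractor(page, return_tensors="pt")
with torch.no_grad():
    det_out = detection_model(**enc)
sizes = torch.tensor([page.size[::-1]])  # (height, width)
det = feature_extractor.post_process_object_detection(det_out, threshold=0.9, target_sizes=sizes)[0]
tables = [page.crop(tuple(box.tolist())) for box in det['boxes']]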

4. Load an image

Here, I've downloaded an example image from the Hugging Face Hub.

file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_table.png")
image = Image.open(file_path).convert("RGB")
width, height = image.size
# resize only for display; inference below runs on the full-resolution image
image.resize((int(width*0.5), int(height*0.5)))
Image with Table

5. Run inference on the image

encoding = feature_extractor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# target_sizes tells the post-processor the original (height, width) so the
# normalized predictions can be rescaled back to pixel coordinates
target_sizes = torch.tensor([image.size[::-1]])
results = feature_extractor.post_process_object_detection(outputs, threshold=0.6, target_sizes=target_sizes)[0]
results
Output:

{'scores': tensor([0.7818, 0.9094, 0.8190, 0.9996, 0.9995, 0.7614, 0.9992, 0.7655, 0.7816,
0.8138, 0.9999, 0.7961, 0.8562, 0.9973, 0.9996, 0.9995, 0.9995, 0.6596,
0.9082, 0.9802, 0.9996, 0.7440, 0.9283, 0.6625, 0.9690, 1.0000]),
'labels': tensor([2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 3, 1, 2, 5, 2,
5, 0]),
'boxes': tensor([[ 55.1625, 473.8491, 1992.9238, 536.2683],
[ 55.4169, 760.5422, 1993.0583, 808.9153],
[ 55.5170, 302.8054, 1993.0104, 364.1293],
[1370.9203, 128.7774, 1607.9895, 856.0112],
[ 961.6448, 128.3521, 1151.3483, 855.8168],
[ 55.3619, 423.8597, 1993.0262, 472.4117],
[1153.1301, 127.8595, 1372.5353, 855.7993],
[ 55.5993, 253.6965, 1993.0969, 302.4271],
[ 55.4433, 194.0559, 1993.0710, 253.7692],
[ 55.7711, 361.3059, 1992.7142, 424.7832],
[ 55.7679, 126.7103, 331.8651, 853.5883],
[ 55.1420, 534.7481, 1992.9456, 594.9164],
[ 55.9185, 698.5151, 1993.0920, 761.0674],
[ 55.2845, 129.6848, 1992.8042, 194.8529],
[1609.8851, 128.6986, 1832.0500, 856.1185],
[ 625.6287, 126.0535, 953.5483, 852.9497],
[1834.7311, 131.1194, 1993.4819, 855.7679],
[ 54.7878, 697.0790, 1993.8176, 851.9252],
[ 55.4953, 807.7474, 1991.9313, 854.0126],
[ 54.9738, 130.3264, 1992.7706, 194.1180],
[ 335.9736, 125.3920, 624.2226, 853.0720],
[ 55.5878, 642.4491, 1992.8743, 699.4342],
[ 56.0965, 197.0991, 330.3431, 368.5917],
[ 55.5552, 593.7886, 1992.8492, 642.4099],
[ 55.4703, 371.5856, 326.4745, 850.0078],
[ 55.9349, 128.0338, 1992.4685, 850.6671]])}

6. Decode the labels

label_dict = model.config.id2label
label_dict
{0: 'table',
1: 'table column',
2: 'table row',
3: 'table column header',
4: 'table projected row header',
5: 'table spanning cell'}
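The raw tensors above are hard to scan, so here is a small helper (my addition, not in the original notebook) that prints each detection with its decoded label:

for score, label, box in zip(results['scores'].tolist(), results['labels'].tolist(), results['boxes'].tolist()):
    xmin, ymin, xmax, ymax = box
    print(f"{label_dict[label]}: score={score:.2f}, box=({xmin:.0f}, {ymin:.0f}, {xmax:.0f}, {ymax:.0f})")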

7. Crop the ROI

Now, let's crop each table row and the column header out of the image, using the box coordinates and labels in the results.

labels, boxes = results['labels'], results['boxes']

column_header = None
table_rows = []
for label, (xmin, ymin, xmax, ymax) in zip(labels.tolist(), boxes.tolist()):
    label = label_dict[label]
    if label in ['table row', 'table column header']:
        cropped_image = image.crop((xmin, ymin, xmax, ymax))
        if label == "table column header":
            column_header = cropped_image
        else:
            table_rows.append(cropped_image)
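One caveat worth flagging (not covered in the original notebook): DETR returns its detections in no particular order, which you can see in the box tensor above, where the first three row boxes start at y ≈ 474, 761, and 303. To make the extracted rows follow the reading order of the table, sort the crops by their top coordinate, for example like this:

# re-collect the row crops together with their ymin so they can be sorted
# top-to-bottom; otherwise the OCR'd rows come out in detection order
row_crops = []
for label, (xmin, ymin, xmax, ymax) in zip(labels.tolist(), boxes.tolist()):
    if label_dict[label] == 'table row':
        row_crops.append((ymin, image.crop((xmin, ymin, xmax, ymax))))
table_rows = [crop for _, crop in sorted(row_crops, key=lambda t: t[0])]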

It would look something like this.

column_header
Cropped column header
table_rows[0]
Cropped table row

8. Extract columns and rows from the table

For extraction, pytesseract can be used. First, start by extracting the column headers.

ext_df = pytesseract.image_to_data(column_header, output_type=Output.DATAFRAME, config="--psm 1")
ext_df = ext_df.dropna(subset=['text'])
ext_df['text'] = ext_df['text'].str.strip()
ext_df = ext_df[ext_df['text'].apply(len) > 1]
ext_df = ext_df.reset_index(drop=True)
extracted_columns = ext_df['text'].values.tolist()
# build an empty dataframe with the OCR'd header names as columns
ext_result_df = pd.DataFrame(columns=extracted_columns)
ext_result_df

Next, extract each table row and append it to this dataframe as a new record.

Then apply a post-processing pass to drop rows where OCR found nothing.

for table_row in table_rows:
    ext_df = pytesseract.image_to_data(table_row, output_type=Output.DATAFRAME, config="--psm 1")
    ext_df = ext_df.dropna(subset=['text'])
    # ext_df['text'] = ext_df['text'].str.strip()
    # ext_df = ext_df[ext_df['text'].apply(len) > 1]
    ext_df = ext_df.reset_index(drop=True)
    # pair the OCR'd words with the header names positionally (nth word to nth column)
    data = dict(zip(extracted_columns, ext_df['text'].values.tolist()))
    for col in extracted_columns:
        if col not in data:
            data[col] = np.nan
    row_df = pd.DataFrame(data=data, index=[0])
    ext_result_df = pd.concat([ext_result_df, row_df])

final_ext_df = pd.DataFrame()
for _, row in ext_result_df.iterrows():
    # skip rows where every column came back empty
    missing_count = row.isna().sum()
    if missing_count == len(ext_result_df.columns):
        continue
    row_df = pd.DataFrame(row).T
    final_ext_df = pd.concat([final_ext_df, row_df])

final_ext_df = final_ext_df.reset_index(drop=True)
final_ext_df.head()
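The positional zip above pairs the nth OCR'd word with the nth column, which breaks as soon as a cell contains more than one word or OCR drops a token. A more robust variant (a sketch of an alternative, not the method from the original notebook) crops each row-column intersection from the model's boxes and OCRs one cell at a time:

# collect column and row boxes from the structure-recognition results,
# sorted left-to-right and top-to-bottom respectively
dets = list(zip(results['labels'].tolist(), results['boxes'].tolist()))
col_boxes = sorted((b for l, b in dets if label_dict[l] == 'table column'), key=lambda b: b[0])
row_boxes = sorted((b for l, b in dets if label_dict[l] == 'table row'), key=lambda b: b[1])

cell_rows = []
for rx0, ry0, rx1, ry1 in row_boxes:
    cells = []
    for cx0, _, cx1, _ in col_boxes:
        # a cell is the column's x-range intersected with the row's y-range
        cell = image.crop((cx0, ry0, cx1, ry1))
        # psm 7 treats the crop as a single line of text
        cells.append(pytesseract.image_to_string(cell, config="--psm 7").strip())
    cell_rows.append(cells)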

Conclusion

So this was a quick walkthrough of using the Table Transformer, a DETR-based model, for table detection and extraction.

The notebook with the entire code is available here.

This article is the 2nd part of the Document Intelligence Series. If you are interested in learning more, the 1st part of the series is available here.

Please drop a like and follow for more such articles.

Reference

The model implementation is based on Niels Rogge's GitHub repo, which is available here. I encourage you to go through this repo; it has tutorials covering an entire library of transformer models for a number of use cases.

This message is for you!!

🚀 Hello, fellow knowledge enthusiast! 🌟

Searching for hands-on wisdom in Data Science, AI, and Python? You’re in the right place! 👩‍💻💡

I’m on a mission to demystify complexity, unleashing real-time applications that fuel your success. Let’s embark on this thrilling voyage of discovery!

Come, be a part of this exciting journey. I’m striving to reach 150 followers by year-end. 📈 Your follow is the boost I crave.

Fuel your curiosity, surge ahead. 🚀📊

Follow now and unlock the world of practical tech!

