Google Cloud Document AI Basics
This simple example shows how to use a custom extractor in Google's Doc AI to process W-2s and use a PDF as part of the context to Gemini.
Google Cloud’s Document AI (Doc AI) helps organizations automate the processing, extraction, and classification of massive amounts of documents.
Doc AI has a lot of capabilities and use cases. Here are a few ways it can help organizations. They’re tailored towards the public sector since those are the customers I help; however, these use cases also apply to private companies.
Doc AI Example Use Cases
Processing Applications
- Automating the extraction of key data from applications such as services/benefits, driver’s licenses, and building permits.
Tax Document Processing
- Extracting information from tax forms (W-2s, 1040s, etc.) for faster processing and auditing. We’ll focus on this example.
Healthcare Administration
- Processing medical documents, such as medical records and insurance claims, for faster payment.
Unemployment
- Streamlining the collection of various documents, speeding up adjudication, and reducing the time it takes to process benefits.
Let’s Get Started!
In this blog post, we’ll review how to create a custom document extractor for W-2 forms, use the Doc AI API to extract information from a document, and pass the W-2 PDF to Gemini to summarize the document.
Create a Custom Processor
Rather than going over the steps to create a custom extractor in this blog post, you can reference the Document AI Workbench — Custom Document Extractor Google codelab. The codelab does an excellent job of showing you, step by step, how to easily create, train, test, validate, and deploy a custom processor using the Doc AI Workbench without writing any code.
Here’s what one of the W-2s looks like after you’ve labeled it in Doc AI Workbench. You can choose from three training methods with a custom extractor. I chose the one that uses Gemini 1.5 Flash. The Gen AI training method requires about 50 documents for the best results. You can learn more about the training methods here.
You can view evaluation metrics and upload a document to test as well.
Application Overview
Our application is very simple. You upload a W-2 PDF, Doc AI extracts the key items, Gemini 2.0 Flash summarizes the PDF, and the results are displayed as shown below. Rather than go through the entire application, I’ll just show the code for document extraction and for summarization using Gemini 2.0 Flash. I plan on sharing the entire code on GitHub soon.
Here’s the sample W-2 we’ll upload.
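The flow described in the overview can be sketched roughly as follows. This is only an illustration, not the application's actual code: analyze_w2 is a hypothetical name, and the extract and summarize arguments stand in for whatever functions perform the Doc AI and Gemini calls. Each step gets its own in-memory copy of the PDF so that one step reading the file does not leave the other with an exhausted stream.

```python
import io


def analyze_w2(pdf_bytes, extract, summarize):
    """Run extraction and summarization on the same PDF bytes.

    extract and summarize are hypothetical callables that each accept a
    binary file-like object; they are passed in so this sketch does not
    depend on any cloud service.
    """
    # Give each step a fresh stream positioned at the start of the PDF.
    fields = extract(io.BytesIO(pdf_bytes))
    summary = summarize(io.BytesIO(pdf_bytes))
    return {"fields": fields, "summary": summary}
```

Passing the two steps in as arguments also makes it easy to try the flow locally with stub functions before wiring in the real services.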
Doc AI Code
Here’s the code for Doc AI and an explanation of what it does.
from google.cloud import documentai
import os

def process_document(file):
    try:
        # Initialize Document AI client
        client = documentai.DocumentProcessorServiceClient()
        # Configure processor path
        LOCATION = 'us'  # Format is 'us' or 'eu'
        PROJECT_ID = os.getenv('PROJECT_ID')
        PROCESSOR_ID = os.getenv('PROCESSOR_ID')
        if not PROJECT_ID or not PROCESSOR_ID:
            raise ValueError("PROJECT_ID and PROCESSOR_ID must be set in .env file")
        PROCESSOR_PATH = f"projects/{PROJECT_ID}/locations/{LOCATION}/processors/{PROCESSOR_ID}"
        print(f"Using processor path: {PROCESSOR_PATH}")
        # Read file content
        file_content = file.read()
        print(f"Read file content, size: {len(file_content)} bytes")
        # Configure the process request
        raw_document = documentai.RawDocument(
            content=file_content,
            mime_type="application/pdf"
        )
        # Process the document
        request = documentai.ProcessRequest(
            name=PROCESSOR_PATH,
            raw_document=raw_document
        )
        print("Sending request to Doc AI...")
        result = client.process_document(request=request)
        print("Received response from Doc AI")
        document = result.document
        # Extract entities from the processed document
        extracted_data = {}
        for entity in document.entities:
            extracted_data[entity.type_] = entity.mention_text
        print(f"Extracted {len(extracted_data)} entities")
        return extracted_data
    except Exception as e:
        print(f"Error in process_document: {str(e)}")
        raise
- Import libraries: Import the Doc AI library.
- Doc AI processor: Get the Doc AI processor information from the workbench.
- Read and configure file: Read the file into the file_content variable. Load the PDF into the raw_document variable so that Doc AI can scan it.
- Process document: Send the document to Doc AI. Save the results to the document variable.
- Extract key data: The extracted_data variable is a dictionary. It gets the entities in the document and returns them.
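One thing to be aware of in the extraction step: because extracted_data is keyed by entity type, a type that appears more than once keeps only its last value. If that matters for your forms, a variation like the sketch below may help. It assumes each entity exposes type_, mention_text, and confidence attributes; group_entities, the min_confidence threshold, and the sample field names are hypothetical and not taken from the original application.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(entities, min_confidence=0.0):
    """Collect mention text per entity type, keeping repeated types."""
    grouped = defaultdict(list)
    for entity in entities:
        # Skip entities the processor was not confident about.
        if entity.confidence >= min_confidence:
            grouped[entity.type_].append(entity.mention_text)
    return dict(grouped)


# Stand-in objects with made-up field names, used only to try the helper
# without calling Doc AI.
sample_entities = [
    SimpleNamespace(type_="wages", mention_text="50000.00", confidence=0.98),
    SimpleNamespace(type_="state", mention_text="CA", confidence=0.95),
    SimpleNamespace(type_="state", mention_text="NV", confidence=0.91),
    SimpleNamespace(type_="employer", mention_text="Acme", confidence=0.40),
]
grouped_sample = group_entities(sample_entities, min_confidence=0.5)
```

In the function above, the same idea would apply to document.entities in place of the sample list.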
Here’s the final output.
Summarize PDF Using Gemini
I’m using the Gemini 2.0 Flash model to create a summary of the W-2.
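import google.generativeai as genai
import os

def get_summary(file):
    # Read the API key from the environment and configure the Gemini client
    api_key = os.getenv('GEMINI_API_KEY')
    genai.configure(api_key=api_key)

    # Upload the PDF; "PDF Path" is a placeholder for the path to the W-2 PDF
    sample_pdf = genai.upload_file(path="PDF Path", display_name="file")

    # Send the uploaded PDF and a text instruction in a single request
    model = genai.GenerativeModel(model_name="gemini-2.0-flash")
    response = model.generate_content(
        contents=[sample_pdf, "Give me a summary of this pdf file."]
    )
    print(response.text)
    return response.text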
The code is really simple. One of the things I love about Gemini 2.0 is that you can give it a PDF or a TXT file directly in the prompt request or even provide multimodal prompts. There’s no need for me to build RAG or do other preprocessing. Simply put the PDF inside the model.generate_content prompt request as shown in the code above.
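Since upload_file is called with a path in the code above, a web upload that arrives as a file object has to be written to disk first. Here is a minimal sketch of one way to do that using only the standard library. The helper name save_upload_to_temp_pdf is hypothetical, and the check on the first bytes is a loose sanity test, not real PDF validation.

```python
import os
import tempfile


def save_upload_to_temp_pdf(file):
    """Write a readable binary file object to a temporary .pdf file.

    Returns the path of the new file. The caller is responsible for
    deleting it when done.
    """
    data = file.read()
    # Loose sanity check: PDF files normally begin with the bytes "%PDF".
    if not data.startswith(b"%PDF"):
        raise ValueError("Uploaded file does not look like a PDF")
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    return path
```

The returned path could then be passed as the path argument to genai.upload_file, and the temporary file removed with os.remove once the summary comes back.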
Here are the results from Gemini 2.0 Flash.