Google Cloud Document AI Basics

This simple example shows how to use a custom extractor in Google's Doc AI to process W-2s and use a PDF as part of the context to Gemini.

Imran Burki

Apr. 30, 25 · Tutorial


Google Cloud’s Document AI (Doc AI) helps organizations automate the processing, extraction, and classification of massive amounts of documents.

Doc AI supports many use cases; here are a few ways it can help organizations. They're tailored to the public sector because those are the customers I work with, but they apply equally to private companies.

Doc AI Example Use Cases

Processing Applications

Automating the extraction of key data from applications such as services/benefits, driver’s licenses, and building permits.

Tax Document Processing

Extracting information from tax forms (W-2s, 1040s, etc.) for faster processing and auditing. We’ll focus on this example.

Healthcare Administration

Processing medical documents, such as medical records and insurance claims, for faster payment.

Unemployment

Streamlining the collection of supporting documents, speeding up adjudication, and reducing the time it takes to process benefits.

Let’s Get Started!

In this blog post, we’ll review how to create a custom document extractor for W-2 forms, use the Doc AI API to extract information from a document, and pass the W-2 PDF to Gemini to summarize the document.

Create a Custom Processor

Rather than going over the steps to create a custom extractor in this blog post, you can reference the Document AI Workbench — Custom Document Extractor Google codelab. The codelab does an excellent job of showing you, step by step, how to easily create, train, test, validate, and deploy a custom processor using the Doc AI Workbench without writing any code.

Here’s what one of the W-2s looks like after you’ve labeled it in Doc AI Workbench. A custom extractor offers three training methods; I chose the one that uses Gemini 1.5 Flash. The Gen AI training method requires about 50 documents for the best results. You can learn more about the training methods here.

Evaluation metrics

Application Overview

Our application is very simple. You upload a W-2 PDF, Doc AI extracts the key items, Gemini 2.0 Flash summarizes the PDF, and the results are displayed as shown below. Rather than go through the entire application, I’ll show only the code for document extraction and for summarization with Gemini 2.0 Flash. I plan on sharing the entire code on GitHub soon.
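The flow described above can be sketched as a small orchestration function. This is a minimal sketch, not the application's actual code: `run_pipeline` and its `extract` and `summarize` parameters are hypothetical names standing in for the `process_document` and `get_summary` functions shown later. It reads the upload once and hands each step its own in-memory stream, because a file object that has already been read to the end returns no bytes on a second `read()`.

```python
import io
from typing import BinaryIO, Callable


def run_pipeline(
    upload: BinaryIO,
    extract: Callable[[BinaryIO], dict],
    summarize: Callable[[BinaryIO], str],
) -> dict:
    """Run extraction and summarization over one uploaded PDF (hypothetical helper)."""
    pdf_bytes = upload.read()  # read the upload exactly once
    if not pdf_bytes:
        raise ValueError("Uploaded file is empty")

    # Each step gets a fresh stream positioned at byte 0.
    fields = extract(io.BytesIO(pdf_bytes))
    summary = summarize(io.BytesIO(pdf_bytes))
    return {"fields": fields, "summary": summary}
```

In the application this would be called as `run_pipeline(uploaded_file, process_document, get_summary)`, assuming both functions accept a file-like object.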

Here’s the sample W-2 we’ll upload.

W-2 First Page

W-2 Second Page

Doc AI Code

Here’s the code for Doc AI and an explanation of what it does.

Python

from google.cloud import documentai
import os

def process_document(file):
    try:
        # Initialize Document AI client
        client = documentai.DocumentProcessorServiceClient()
        
        # Configure processor path
        LOCATION = 'us'  # Format is 'us' or 'eu'
        PROJECT_ID = os.getenv('PROJECT_ID')
        PROCESSOR_ID = os.getenv('PROCESSOR_ID')
        
        if not PROJECT_ID or not PROCESSOR_ID:
            raise ValueError("PROJECT_ID and PROCESSOR_ID must be set as environment variables (for example, via a .env file)")
        
        PROCESSOR_PATH = f"projects/{PROJECT_ID}/locations/{LOCATION}/processors/{PROCESSOR_ID}"
        print(f"Using processor path: {PROCESSOR_PATH}")
        
        # Read file content
        file_content = file.read()
        print(f"Read file content, size: {len(file_content)} bytes")
        
        # Configure the process request
        raw_document = documentai.RawDocument(
            content=file_content,
            mime_type="application/pdf"
        )
        
        # Process the document
        request = documentai.ProcessRequest(
            name=PROCESSOR_PATH,
            raw_document=raw_document
        )
        
        print("Sending request to Doc AI...")
        result = client.process_document(request=request)
        print("Received response from Doc AI")
        
        document = result.document
        
        # Extract entities from the processed document
        extracted_data = {}
        for entity in document.entities:
            extracted_data[entity.type_] = entity.mention_text
            
        print(f"Extracted {len(extracted_data)} entities")
        return extracted_data
        
    except Exception as e:
        print(f"Error in process_document: {str(e)}")
        raise
  

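Import libraries: Import the Doc AI client library and os.
Doc AI processor: Build the processor path from the location and the PROJECT_ID and PROCESSOR_ID environment variables. The processor ID belongs to the custom extractor you created in the workbench.
Read and configure file: Read the file into the file_content variable, then wrap it in a RawDocument (the raw_document variable) with the PDF MIME type so that Doc AI can process it.
Process document: Send the request to Doc AI and save the document from the result to the document variable.
Extract key data: Loop over the document's entities and store each one in the extracted_data dictionary, keyed by entity type with the extracted text as the value. The function returns this dictionary.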
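One detail worth adding to the processor step: `LOCATION` affects more than the processor path. Google's Document AI samples note that a processor outside `us` must be reached through a regional endpoint of the form `{location}-documentai.googleapis.com`, passed to the client through `ClientOptions(api_endpoint=...)`. The helper below is a sketch with a hypothetical name (`processor_settings`) and made-up IDs; it only builds the two strings, so it runs without credentials.

```python
def processor_settings(project_id: str, processor_id: str, location: str = "us") -> dict:
    """Build the regional endpoint and processor path (hypothetical helper)."""
    if not project_id or not processor_id:
        raise ValueError("project_id and processor_id are required")
    return {
        "api_endpoint": f"{location}-documentai.googleapis.com",
        "processor_path": (
            f"projects/{project_id}/locations/{location}/processors/{processor_id}"
        ),
    }


# "my-project" and "abc123" are placeholder values, not real IDs.
settings = processor_settings("my-project", "abc123", location="eu")
print(settings["api_endpoint"])
print(settings["processor_path"])

# To use the endpoint (requires google-cloud-documentai and credentials):
# from google.api_core.client_options import ClientOptions
# client = documentai.DocumentProcessorServiceClient(
#     client_options=ClientOptions(api_endpoint=settings["api_endpoint"])
# )
```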

Here’s the final output.

Doc AI Output
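One caveat with the dictionary built in `process_document`: if the processor returns two entities of the same type (for example, a W-2 that reports wages for two states), the later value overwrites the earlier one. A possible alternative is to collect values into lists and keep each entity's confidence score. The sketch below uses `SimpleNamespace` stand-ins rather than real Document AI entities so it runs offline; `group_entities`, the field names, and the values are hypothetical, while `type_`, `mention_text`, and `confidence` are the entity attributes it relies on.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(entities) -> dict:
    """Group entities by type so repeated types are kept, not overwritten."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.type_].append(
            {"value": entity.mention_text, "confidence": round(entity.confidence, 2)}
        )
    return dict(grouped)


# Made-up stand-ins for document.entities, for illustration only.
sample = [
    SimpleNamespace(type_="state_wages", mention_text="41,000.00", confidence=0.97),
    SimpleNamespace(type_="state_wages", mention_text="9,500.00", confidence=0.91),
    SimpleNamespace(type_="employer_name", mention_text="Acme Corp", confidence=0.99),
]
result = group_entities(sample)
print(result)
```

With real output you would call `group_entities(document.entities)`. Keeping the confidence alongside each value also makes it easier to flag low-confidence fields for manual review.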

Summarize PDF Using Gemini

I’m using the Gemini 2.0 Flash model to create a summary of the W-2.

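Python

import google.generativeai as genai
import os

def get_summary(file):
    # Note: the file argument is not used here; the PDF is uploaded from a path instead.

    # Configure the SDK with the API key from the environment
    api_key = os.getenv('GEMINI_API_KEY')
    genai.configure(api_key=api_key)

    # Upload the PDF through the File API ("PDF Path" is a placeholder)
    sample_pdf = genai.upload_file(path="PDF Path", display_name="file")

    model = genai.GenerativeModel(model_name="gemini-2.0-flash")

    # Pass the uploaded PDF and the instruction together as the prompt
    response = model.generate_content(
        contents=[sample_pdf, "Give me a summary of this pdf file."]
    )
    print(response.text)

    return response.text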

The code is short. One of the things I love about Gemini 2.0 is that you can give it a PDF or a TXT file directly in the prompt request, or even provide multimodal prompts. There’s no need for me to build a RAG pipeline or do other preprocessing. Simply put the PDF in the contents of the model.generate_content call, as shown in the code above.
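As a sketch of passing the PDF in the request without the upload step: the `google.generativeai` SDK also accepts an inline part in `contents`, given as a dict with `mime_type` and `data` (raw bytes) keys. The helper name `pdf_part`, the header check, and the `w2.pdf` path are assumptions for illustration, not part of the original application. Inline data counts toward the request size limit, so larger files are better sent with `upload_file`.

```python
def pdf_part(pdf_bytes: bytes) -> dict:
    """Wrap raw PDF bytes as an inline content part (hypothetical helper)."""
    # Every PDF file starts with the "%PDF-" marker; reject anything else early.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("Data does not look like a PDF")
    return {"mime_type": "application/pdf", "data": pdf_bytes}


# Usage with the model from get_summary (needs an API key, so left as comments):
# with open("w2.pdf", "rb") as f:  # "w2.pdf" is a placeholder path
#     part = pdf_part(f.read())
# response = model.generate_content([part, "Give me a summary of this pdf file."])
```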

Here are the results from Gemini 2.0 Flash.

Gemini Summarization
