Parent Document Retrieval (PDR): Useful Technique in RAG

PDR allows RAG systems to respond to complex queries with accuracy and nuance. Explore more in this step-by-step implementation using LangChain, OpenAI, and more.

By Intiaz Shaik · Aug. 22, 2024 · Tutorial

What Is Parent Document Retrieval (PDR)?

Parent Document Retrieval is a technique used in state-of-the-art RAG systems: relevant child passages or snippets are matched against a query, and the full parent documents they came from are then retrieved. This enriched context is passed to the RAG model, enabling more comprehensive, information-rich responses to complex or nuanced questions.

Major steps in parent document retrieval in RAG models include:

  • Data preprocessing: Break long documents into manageable chunks
  • Create embeddings: Convert the chunks into numerical vectors for efficient search
  • User query: The user submits a question
  • Chunk retrieval: The model retrieves the chunks whose embeddings are most similar to the query embedding
  • Find parent documents: Identify the original documents (or larger sections of them) from which those chunks were taken
  • Parent document retrieval: Return the full parent documents to provide richer context for the response

PDR: High-level overview
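
Before diving into the step-by-step walkthrough, here is a minimal sketch of how these steps map onto the LangChain components used in the rest of this tutorial. The file path, collection name, and query string are placeholders; each step is expanded in detail below:

Python
 
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings

docs = TextLoader('/path/to/your/document.txt').load()           # load the parent documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)  # small chunks for search
retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="pdr_overview",            # indexes child-chunk embeddings
                       embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),                                      # maps child chunks back to parents
    child_splitter=child_splitter,
)
retriever.add_documents(docs)                     # split into children, embed them, store the parents
parents = retriever.invoke("your question here")  # search children, return their parent documents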

Step-By-Step Implementation

Implementing parent document retrieval involves four stages:

1. Prepare the Data

First, we will set up the environment and preprocess the data for our parent document retrieval RAG system.

A. Import Necessary Modules

We will import the required modules from the installed libraries to set up our PDR system:

Python
 
import os  # needed below to set and read the OpenAI API key

from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings


These libraries and modules will be used throughout the steps that follow.

B. Set Up the OpenAI API Key

We are using an OpenAI LLM for response generation, so we will need an OpenAI API key. Set the OPENAI_API_KEY environment variable with your key:

Python
 
os.environ["OPENAI_API_KEY"] = ""  # Add your OpenAI API key here
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
if OPENAI_API_KEY == "":
    raise ValueError("Please set the OPENAI_API_KEY environment variable")


C. Define the Text Embedding Function

We will leverage OpenAI's embeddings to represent our text data:

Python
 
embeddings = OpenAIEmbeddings()

 

D. Load Text Data

Now, read in the text documents you would like to retrieve from. You can use the TextLoader class to read text files:

Python
 
loaders = [
    TextLoader('/path/to/your/document1.txt'),
    TextLoader('/path/to/your/document2.txt'),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())  # load each file into LangChain Document objects


2. Retrieve Full Documents

Here, we will set up the system to retrieve the full parent documents from which relevant child passages are drawn.

A. Full Document Splitting

We'll use RecursiveCharacterTextSplitter to split the loaded documents into smaller text chunks of a desired size. These child documents will allow us to search efficiently for relevant passages:

Python
 
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

 

B. Vector Store and Storage Setup

In this section, we will use a Chroma vector store to index the embeddings of the child documents and an InMemoryStore to keep track of the full parent documents associated with them:

Python
 
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()  # embeds the child chunks for similarity search
)
store = InMemoryStore()  # holds the full parent documents, keyed by ID


C. Parent Document Retriever

Now, let us instantiate a ParentDocumentRetriever object. This class encapsulates the core logic for retrieving full parent documents based on child-document similarity.

Python
 
full_doc_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter
)


D. Adding Documents

Feed the loaded documents into the ParentDocumentRetriever using its add_documents method:

Python
 
full_doc_retriever.add_documents(docs)
print(list(store.yield_keys()))  # List document IDs in the store

 

E. Similarity Search and Retrieval

Now that the retriever is set up, you can retrieve the child documents most relevant to a query and then fetch the corresponding full parent documents:

Python
 
# Retrieve the child chunks most similar to the query
sub_docs = vectorstore.similarity_search("What is LangSmith?", k=2)
print(len(sub_docs))
print(sub_docs[0].page_content)

# Retrieve the full parent documents the matching chunks came from
retrieved_docs = full_doc_retriever.invoke("What is LangSmith?")
print(len(retrieved_docs[0].page_content))
print(retrieved_docs[0].page_content)
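
Note that sub_docs[0] is a child chunk of at most roughly 400 characters (the child splitter's chunk size), while retrieved_docs[0] is the full parent document that chunk came from, so its page_content will typically be much longer.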


3. Retrieve Larger Chunks

Sometimes it is not desirable to fetch the full parent document, for instance, when documents are extremely large. Here is how you would fetch larger chunks from the parent documents instead:

  • Text splitting for chunks and parents:
    • Use two instances of RecursiveCharacterTextSplitter:
      • One with a larger chunk size to create the parent documents.
      • One with a smaller chunk size to create the child documents (text snippets) from those parents.
  • Vector store and storage setup (as in full-document retrieval):
    • Create a Chroma vector store that indexes the embeddings of the child documents.
    • Use an InMemoryStore to hold the chunks of the parent documents.

A. Parent Document Retriever

This retriever addresses a fundamental tension in RAG: whole documents are often too large to retrieve precisely, while small chunks may not contain sufficient context on their own. It chops documents into small chunks, and those chunks are indexed for retrieval. After a query, however, instead of returning those small chunks, it returns the larger parent chunks they came from, providing richer context for generation.

Python
 
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # larger parent chunks
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # smaller child chunks for search
vectorstore = Chroma(
    collection_name="split_parents",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()
big_chunks_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
# Adding documents
big_chunks_retriever.add_documents(docs)
print(len(list(store.yield_keys())))  # Number of parent chunks in the store

 

B. Similarity Search and Retrieval

The process is the same as for full-document retrieval: we search for relevant child documents and then fetch the corresponding larger chunks of the parent documents.

Python
 
# Retrieve the child chunks most similar to the query
sub_docs = vectorstore.similarity_search("What is LangSmith?", k=2)
print(len(sub_docs))
print(sub_docs[0].page_content)

# Retrieve the larger parent chunks the matching child chunks came from
retrieved_docs = big_chunks_retriever.invoke("What is LangSmith?")
print(len(retrieved_docs))
print(len(retrieved_docs[0].page_content))
print(retrieved_docs[0].page_content)
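
Because the parent splitter here uses chunk_size=2000, each retrieved parent is a chunk of at most roughly 2,000 characters rather than an entire document, as in the previous section.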

 

4. Integrate With RetrievalQA

Now that you have a parent document retriever, you can integrate it with a RetrievalQA chain to perform question-answering using the retrieved parent documents:

Python
 
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",  # "stuff" concatenates the retrieved chunks directly into the prompt
    retriever=big_chunks_retriever
)
query = "What is LangSmith?"
response = qa.invoke(query)
print(response)
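
RetrievalQA returns a dictionary rather than a plain string; assuming the default output key, the generated answer is available under "result":

Python
 
print(response["result"])  # the generated answer; the original question is echoed back under "query"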


Conclusion

PDR considerably improves a RAG model's ability to produce accurate, context-rich responses. By retrieving full parent documents rather than isolated snippets, complex questions can be answered both in depth and accurately, a basic requirement for sophisticated AI applications.


