Updated 8 months ago

Hey guys. I am using the following loader (PDFReader) to load my PDF file as input data.
My question is: how can I use just the first page, or a specific page, of my file when querying it?
Below you can see how I build the index and the rest:
Plain Text
PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path('/content/4.pdf'))

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=256, chunk_overlap=50)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(llm=llm)
Plain Text
invoice_type = query_engine.query("""
What is the month and year of the period of consumption of this invoice?
I want just dates as your response without words
                              """)

I mean, how can I use only the data on the first page of the PDF? In some cases similar text appears on different pages, which confuses the LLM, but I know in advance which page holds the relevant data.
3 comments
The page number is in the metadata. You could use a metadata filter, or do better processing on your data before ingesting it.
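To see which key the reader actually uses for the page number, it can help to look at the metadata on the loaded documents first. Here is a minimal sketch, assuming each document exposes a `.metadata` dict. `metadata_keys` is a hypothetical helper, and the `SimpleNamespace` objects and the `page_label` / `file_name` keys are only stand-ins for real loader output:

```python
from collections import Counter
from types import SimpleNamespace


def metadata_keys(documents):
    """Count how many documents carry each metadata key."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.metadata.keys())
    return dict(counts)


# Stand-in objects so the sketch runs without a PDF; with real data, pass the
# list returned by loader.load_data(...) instead.
sample_docs = [
    SimpleNamespace(metadata={"page_label": "1", "file_name": "4.pdf"}),
    SimpleNamespace(metadata={"page_label": "2", "file_name": "4.pdf"}),
]

print(metadata_keys(sample_docs))   # which keys exist, and on how many documents
print(sample_docs[0].metadata)      # the raw metadata of the first page
```

Whatever key shows up there for the page is the one to use in a filter or when pre-selecting pages.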
Yes, you are right, I can filter by page number, but I am not sure when and how.
I tried sending just the first page of documents to VectorStoreIndex, but I got an error. I tried the following:
Plain Text
index = VectorStoreIndex.from_documents(
        documents[0], service_context=service_context
    )

As you can see, I used documents[0] (the first page) instead of documents (the whole file).
Thanks a lot in advance for your help.
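A likely cause of that error (an assumption, since the traceback isn't shown): `from_documents` expects a list of documents, while `documents[0]` is a single document. Wrapping it as `[documents[0]]` or slicing with `documents[:1]` keeps it a list. For picking arbitrary pages before indexing, here is a rough sketch. `select_pages` is a hypothetical helper, the `page_label` key is an assumption about what the reader stores, and the `SimpleNamespace` objects are stand-ins for real documents:

```python
from types import SimpleNamespace


def select_pages(documents, pages, key="page_label"):
    """Return a list of the documents whose metadata[key] matches one of pages.

    Values are compared as strings, so 1 and "1" select the same page.
    Documents without the key are skipped.
    """
    wanted = {str(p) for p in pages}
    return [
        doc
        for doc in documents
        if key in doc.metadata and str(doc.metadata[key]) in wanted
    ]


# Stand-in documents, one per page, so the sketch runs without a PDF.
docs = [
    SimpleNamespace(text=f"page {n}", metadata={"page_label": str(n)})
    for n in (1, 2, 3)
]

first_page = select_pages(docs, [1])
print([d.text for d in first_page])

# With real documents, the resulting list (not a single document) is what
# would be passed on, e.g.:
# index = VectorStoreIndex.from_documents(first_page, service_context=service_context)
```

This is the "process before ingesting" route: the index only ever sees the selected pages, so similar text on other pages cannot be retrieved.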
Assuming the page number is in the metadata, you can put everything into an index and use a filter, something like:

Plain Text
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

filters = MetadataFilters(
    filters=[
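        # "page_num" is a guess at the key; check documents[0].metadata for the
        # field your reader actually sets (and whether its value is a string).
        MetadataFilter(
            key="page_num", operator=FilterOperator.EQ, value="1"
        ),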
    ]
)

query_engine = index.as_query_engine(filters=filters)