Updated 3 months ago

I may be doing something wrong with respect to getting line numbers.

My citations are not accurate at the line-number level, although the page numbers seem spot on. Is there an art to fetching the line number that differs significantly from getting the page number?
8 comments
Do you have some method for obtaining the line number? I think page numbers are included by default with PDFs. You can check the metadata on your nodes with:

Plain Text
print(response.source_nodes)
Sure, let me get that
Plain Text
### STAMP



# Perform a query
query_str = "what is the installation date?"
response = query_engine.query(query_str)
print(f"Debug: Query Response: {response.response}")  # Debugging


# Extract the source nodes from the response
source_nodes = response.source_nodes


# New Debug Statements
for node in source_nodes:
    print(f"Queried Node Metadata: {node.metadata}")  # Debugging


# Extract metadata from the source nodes
metadata_list = [node.metadata for node in source_nodes]

# Print the answer
print("Answer:", response.response)

# Loop through the metadata list and format each metadata dictionary
for idx, metadata in enumerate(metadata_list):
    print(f"Debug: Metadata for Source Node {idx + 1}: {metadata}")  # Debugging
    # Placeholder: this assumes the answer starts on the 3rd line of the chunk.
    # The fixed offset of 2 is for demonstration only, so the printed line number
    # is only correct if the answer really sits at that position in the chunk.
    answer_line_number = metadata['starting_line_number'] + 2  # Replace 2 with the actual line offset

    print(f"\n--- Citation for Source Node {idx + 1} ---")
    print(f"Filename: {metadata['filename']}")
    print(f"Document Title: {metadata['document_title']}")
    print(f"Page Number: {metadata['page_number']}")
    print(f"Line Number: {answer_line_number}")
Does the other metadata work?
yes, I think so
page number is mostly accurate
I guess there might be an issue with your implementation for line numbers then. I'm not really sure of the best way to count those. It might also be a bit difficult since LLMs don't really understand line numbers.
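One possible way to replace the hard-coded offset is to look for a short quote from the answer inside the chunk text and count the newlines before it. This is only a sketch: `line_offset_in_chunk` and `citation_line` are hypothetical helper names, not part of any library, and it assumes the quote appears verbatim in the chunk and that lines are separated by `\n`.

```python
from typing import Optional


def line_offset_in_chunk(chunk_text: str, snippet: str) -> Optional[int]:
    """Return the 0-based line offset of the first occurrence of `snippet`
    in `chunk_text`, or None if the snippet is not found verbatim."""
    pos = chunk_text.find(snippet)
    if pos == -1:
        return None
    # Every newline before the match position moves the match down one line.
    return chunk_text.count("\n", 0, pos)


def citation_line(starting_line_number: int, chunk_text: str, snippet: str) -> Optional[int]:
    """Combine the chunk's starting line number with the offset of the snippet."""
    offset = line_offset_in_chunk(chunk_text, snippet)
    if offset is None:
        return None
    return starting_line_number + offset


if __name__ == "__main__":
    chunk = "Header\nParties\nInstallation date: 2023-05-01\nFooter"
    print(citation_line(10, chunk, "Installation date"))
```

If the index splits each document into smaller nodes, `starting_line_number` may still describe the start of the original chunk rather than the start of the retrieved node. In that case the same search could be run with the node's text as the snippet against the original chunk to recover where the node begins.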
line numbers are defined here:

Plain Text
line_number = 1

# Create Document objects with extended metadata
documents = []
for doc_idx, (chunk, page_number) in enumerate(chunks):  # Note the unpacking of (chunk, page_number)
    print(f"Debug: Creating Document Object {doc_idx + 1} for Page {page_number}")  # Debugging
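    # Count lines without adding an extra one when the chunk ends with a newline,
    # otherwise every following starting_line_number drifts by one per chunk.
    line_count_in_chunk = chunk.count('\n') + (1 if chunk and not chunk.endswith('\n') else 0)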

    # Debug print statements
    print(f"Debug: Chunk {doc_idx}, Page {page_number}, Line Count {line_count_in_chunk}")
    print("Debug: Chunk Content:", chunk[:50])  # Print first 50 characters of the chunk

    metadata = {
        "source_doc_idx": doc_idx,
        "filename": "1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf",
        "page_number": page_number,  # Using the actual page number
        "document_title": "Customer Contract - Stockwood Dr - Woodstock - GA",
        "line_count": line_count_in_chunk,
        "starting_line_number": line_number
    }
    print(f"Debug: Adding Metadata for Document Object {doc_idx + 1} for Page {page_number}")  # Debugging
    print(f"Debug: Metadata: {metadata}")  # Debugging
    documents.append(Document(text=chunk, metadata=metadata))

    # Update the line_number for the next chunk
    line_number += line_count_in_chunk  # Moved inside the loop

# New Debug Statements
for doc in documents:
    print(f"Document Metadata: {doc.metadata}")
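A possible reason the line numbers look off while the page numbers look right is that the counter above runs across the whole document, so `starting_line_number` is a document-wide position rather than a position on the cited page. If per-page line numbers are what the citation should show (an assumption, the thread does not say), a minimal sketch could restart the counter whenever the page changes. `count_lines` and `assign_line_numbers` are hypothetical helpers, and the sketch assumes chunks are in reading order, do not overlap, and end on line boundaries.

```python
from typing import Dict, List, Tuple


def count_lines(text: str) -> int:
    """Count lines separated by '\n', ignoring a single trailing newline."""
    if not text:
        return 0
    return text.count("\n") + (0 if text.endswith("\n") else 1)


def assign_line_numbers(chunks: List[Tuple[str, int]]) -> List[Dict[str, int]]:
    """For each (text, page_number) chunk, compute a starting line number
    that restarts at 1 on every new page."""
    results = []
    current_page = None
    next_line = 1
    for text, page_number in chunks:
        if page_number != current_page:
            # New page: restart the line counter.
            current_page = page_number
            next_line = 1
        line_count = count_lines(text)
        results.append({
            "page_number": page_number,
            "starting_line_number": next_line,
            "line_count": line_count,
        })
        next_line += line_count
    return results


if __name__ == "__main__":
    demo = [("a\nb\n", 1), ("c", 1), ("d\ne", 2)]
    for entry in assign_line_numbers(demo):
        print(entry)
```

If chunks overlap or are cut mid-line, the counts will still drift, so in that case it may be more reliable to locate each chunk inside the full page text and count the newlines before it.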