I may be doing something wrong with

At a glance

The community member is having trouble getting accurate line numbers when citing sources, even though the page numbers seem correct. They ask if there is a specific technique for fetching line numbers that is different from getting page numbers.

In the comments, another community member suggests checking the metadata of the source nodes to see whether the line number information is available. The original poster then shares code that derives the line number from the metadata, but it relies on a hard-coded offset, so it is not a complete solution.

The original poster confirms that the other metadata works and that the page number is mostly accurate. The other member suggests the problem may lie in the line-number implementation and notes that reliable line numbers may be hard to get, since LLMs do not really understand the concept.

ttarpus

I may be doing something wrong with respect to getting line numbers.

I am not getting the most accurate citations when it comes down to the line number. The page numbers seem spot on. Is there an art to fetching the line number that is significantly different than getting the page number?

8 comments

TTeemu

Do you have some method for obtaining the line number? I think page numbers are included by default with PDFs. You can check the metadata on your nodes with:

Plain Text

print(response.source_nodes)

ttarpus

Sure, let me get that

ttarpus

Plain Text

### STAMP



# Perform a query
query_str = "what is the installation date?"
response = query_engine.query(query_str)
print(f"Debug: Query Response: {response.response}")  # Debugging


# Extract the source nodes from the response
source_nodes = response.source_nodes


# New Debug Statements
for node in source_nodes:
    print(f"Queried Node Metadata: {node.metadata}")  # Debugging


# Extract metadata from the source nodes
metadata_list = [node.metadata for node in source_nodes]

# Print the answer
print("Answer:", response.response)

# Loop through the metadata list and format each metadata dictionary
for idx, metadata in enumerate(metadata_list):
    print(f"Debug: Metadata for Source Node {idx + 1}: {metadata}")  # Debugging
    # Assuming you can find the line number of the answer within the chunk
    # For demonstration, let's say the answer starts at the 3rd line of the chunk
    answer_line_number = metadata['starting_line_number'] + 2  # Replace 2 with the actual line offset

    print(f"\n--- Citation for Source Node {idx + 1} ---")
    print(f"Filename: {metadata['filename']}")
    print(f"Document Title: {metadata['document_title']}")
    print(f"Page Number: {metadata['page_number']}")
    print(f"Line Number: {answer_line_number}")
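The hard-coded `+ 2` above is a placeholder, so the printed line number is only right when the answer happens to sit on the third line of the chunk. A minimal sketch of computing the offset instead, assuming you have the chunk's text and a verbatim snippet from it to look for (the helper names `line_offset_in_chunk` and `cited_line_number` and the sample text are made up for illustration):

```python
from typing import Optional


def line_offset_in_chunk(chunk_text: str, snippet: str) -> Optional[int]:
    """Return the 0-based line offset of `snippet` inside `chunk_text`.

    Returns None when the snippet does not occur verbatim in the chunk.
    """
    pos = chunk_text.find(snippet)
    if pos == -1:
        return None
    # Every newline before the match pushes the snippet one line further down.
    return chunk_text.count("\n", 0, pos)


def cited_line_number(metadata: dict, chunk_text: str, snippet: str) -> Optional[int]:
    """Combine the chunk's starting line with the snippet's offset in the chunk."""
    offset = line_offset_in_chunk(chunk_text, snippet)
    if offset is None:
        return None
    return metadata["starting_line_number"] + offset


# Hypothetical chunk and metadata, only to show the arithmetic.
example_chunk = "Customer Contract\nSite details\nInstallation date: see schedule\n"
example_metadata = {"starting_line_number": 10}
print(cited_line_number(example_metadata, example_chunk, "Installation date"))
```

This only works for exact substrings. The text in `response.response` is usually a paraphrase, so it will often not be found in the chunk; you would need a verbatim quote from the source or some fuzzy matching, which is not shown here.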

TTeemu

Does the other metadata work?

ttarpus

yes, I think so

ttarpus

page number is mostly accurate

TTeemu

I guess there might be an issue with your implementation for line numbers then. I'm not really sure of the best way to count those. It might also be a bit difficult since LLMs don't really understand line numbers.

ttarpus

line numbers are defined here:

Plain Text

line_number = 1

# Create Document objects with extended metadata
documents = []
for doc_idx, (chunk, page_number) in enumerate(chunks):  # Note the unpacking of (chunk, page_number)
    print(f"Debug: Creating Document Object {doc_idx + 1} for Page {page_number}")  # Debugging
    line_count_in_chunk = chunk.count('\n') + 1

    # Debug print statements
    print(f"Debug: Chunk {doc_idx}, Page {page_number}, Line Count {line_count_in_chunk}")
    print("Debug: Chunk Content:", chunk[:50])  # Print first 50 characters of the chunk

    metadata = {
        "source_doc_idx": doc_idx,
        "filename": "1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf",
        "page_number": page_number,  # Using the actual page number
        "document_title": "Customer Contract - Stockwood Dr - Woodstock - GA",
        "line_count": line_count_in_chunk,
        "starting_line_number": line_number
    }
    print(f"Debug: Adding Metadata for Document Object {doc_idx + 1} for Page {page_number}")  # Debugging
    print(f"Debug: Metadata: {metadata}")  # Debugging
    documents.append(Document(text=chunk, metadata=metadata))

    # Update the line_number for the next chunk
    line_number += line_count_in_chunk  # Moved inside the loop

# New Debug Statements
for doc in documents:
    print(f"Document Metadata: {doc.metadata}")
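One possible source of drift in the loop above: `chunk.count('\n') + 1` is added to `line_number` for every chunk. That is only right if exactly one newline was dropped between consecutive chunks. If a chunk ends with a newline, or the splitter cuts in the middle of a line, the running total gains one extra line per chunk. Also, `starting_line_number` runs across the whole document, while the citation prints it next to a page number. Below is a rough sketch of an alternative count, assuming the chunks are consecutive, non-overlapping slices of the original text with newlines preserved (the function name `assign_line_numbers` and the `reset_per_page` flag are made up for illustration):

```python
from typing import Dict, List, Tuple


def assign_line_numbers(
    chunks: List[Tuple[str, int]], reset_per_page: bool = False
) -> List[Dict[str, int]]:
    """Compute starting line numbers for (chunk_text, page_number) pairs.

    Assumes chunks are consecutive, non-overlapping slices of the source text
    and that newlines were kept when the text was split.
    """
    results = []
    next_line = 1
    previous_page = None
    for text, page_number in chunks:
        if reset_per_page and page_number != previous_page:
            # Restart the count so line numbers are relative to the page.
            next_line = 1
        newlines = text.count("\n")
        # A trailing newline ends the last line, it does not start a new one.
        has_partial_last_line = bool(text) and not text.endswith("\n")
        line_count = newlines + (1 if has_partial_last_line else 0)
        results.append(
            {
                "page_number": page_number,
                "starting_line_number": next_line,
                "line_count": line_count,
            }
        )
        # Advance by newlines only: a chunk that stops mid-line leaves the
        # next chunk starting on that same line.
        next_line += newlines
        previous_page = page_number
    return results


# Made-up chunks: the second one stops mid-line, the third starts a new page.
sample_chunks = [("a\nb\n", 1), ("c\nd", 1), ("e\nf\n", 2)]
for row in assign_line_numbers(sample_chunks):
    print(row)
```

If your splitter overlaps chunks or strips whitespace, these assumptions do not hold. In that case, finding each chunk's character offset in the original text and counting the newlines before it would be a safer basis.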
