Find answers from the community

Provo
Joined September 25, 2024
Could this be a potential bug, or am I doing something wrong?
I'm using DocumentSummaryIndex in combination with its LLM retriever:
Plain Text
    doc_summary_index = DocumentSummaryIndex.from_documents(
        documents=documents,
        transformations=[splitter],
        response_synthesizer=response_synthesizer,
        show_progress=True,
    )

    retriever = DocumentSummaryIndexLLMRetriever(
        index=doc_summary_index,
        llm=llm,
        # choice_select_prompt=None,
        # choice_batch_size=10,
        # choice_top_k=1,
        # format_node_batch_fn=None,
        # parse_choice_select_answer_fn=None,
    )


I somehow get the following error:
Plain Text
Traceback (most recent call last):
  File "/home/_DEV/maas-ai-gmbh-new/main.py", line 424, in handleBasicQuestionWithDocumentSummaryIndex
    retrieved_nodes = retriever.retrieve(question)
  File "/home/_DEV/maas-ai-gmbh-new/maas-ai-gmbh-new/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 274, in wrapper
    result = func(*args, **kwargs)
  File "/home/_DEV/maas-ai-gmbh-new/maas-ai-gmbh-new/lib/python3.10/site-packages/llama_index/core/base/base_retriever.py", line 244, in retrieve
    nodes = self._retrieve(query_bundle)
  File "/home/_DEV/maas-ai-gmbh-new/maas-ai-gmbh-new/lib/python3.10/site-packages/llama_index/core/indices/document_summary/retrievers.py", line 98, in _retrieve
    raw_choices, relevances = self._parse_choice_select_answer_fn(
  File "/home/_DEV/maas-ai-gmbh-new/maas-ai-gmbh-new/lib/python3.10/site-packages/llama_index/core/indices/utils.py", line 104, in default_parse_choice_select_answer_fn
    answer_num = int(line_tokens[0].split(":")[1].strip())
IndexError: list index out of range

I'm following this tutorial: https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/
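Editor's note: the traceback points at default_parse_choice_select_answer_fn in llama_index/core/indices/utils.py, which splits each line of the LLM's answer on ":" and raises IndexError when the model emits a line that doesn't match the expected "Doc: <n>, Relevance: <m>" shape (a preamble sentence, a blank line, chatter). One workaround is to pass a more tolerant parser via the retriever's parse_choice_select_answer_fn argument (visible among the commented-out parameters above). The sketch below is an assumption, not the library's official fix: the return shape (a list of choice numbers plus a parallel list of relevance scores) is inferred from the raw_choices, relevances unpacking in the traceback, so verify it against your installed llama_index version.

```python
import re


def tolerant_parse_choice_select_answer(answer: str, num_choices: int, **_ignored):
    """Parse lines like 'Doc: 2, Relevance: 7' from the LLM's answer.

    Unlike the default parser, lines that don't match the expected
    format are skipped instead of raising IndexError. Extra keyword
    arguments (e.g. a possible raise_error flag) are absorbed, since
    the exact call signature may vary between versions.
    """
    choices, relevances = [], []
    for line in answer.splitlines():
        match = re.search(r"Doc:\s*(\d+).*?Relevance:\s*(\d+(?:\.\d+)?)", line)
        if match is None:
            continue  # not a 'Doc: n, Relevance: m' line -- ignore it
        doc_num = int(match.group(1))
        if 1 <= doc_num <= num_choices:  # drop out-of-range choices
            choices.append(doc_num)
            relevances.append(float(match.group(2)))
    return choices, relevances


# Hypothetical wiring -- the kwarg name matches the commented-out
# parameter in the snippet above:
# retriever = DocumentSummaryIndexLLMRetriever(
#     index=doc_summary_index,
#     llm=llm,
#     parse_choice_select_answer_fn=tolerant_parse_choice_select_answer,
# )
```

Printing the raw answer inside the parser is also a quick way to see exactly what your model returns when the default parser fails.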
3 comments
I am currently building a LlamaIndex RAG that processes user requests with unstructured data; most of the time it's PDFs.
It all works fine. The only problem I have is that there can be PDFs that aren't scanned properly, and because I process
them with OCR, I get gibberish output which confuses the LLM.

Is there any way in LlamaIndex to clean those up? I mean, the text extracted from them doesn't make sense at all. Is there
any kind of IngestionSanitizer? Or how should I do that?
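Editor's note: as far as I know there is no built-in "IngestionSanitizer" in LlamaIndex. A common approach is to filter gibberish text yourself before indexing, e.g. with a cheap heuristic that checks what fraction of tokens look like real words. The helper below is a sketch under that assumption; the function name, the regex, and the 0.5 threshold are all invented for illustration, so tune them on your own OCR output.

```python
import re

# A token counts as "wordlike" if, after stripping surrounding
# punctuation, it is alphabetic (apostrophes/hyphens allowed) and
# 2-20 characters long.
_WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'\-]{1,19}")


def looks_like_gibberish(text: str, threshold: float = 0.5) -> bool:
    """Heuristic OCR-quality check: True if fewer than `threshold`
    of the whitespace-separated tokens look like real words."""
    tokens = text.split()
    if not tokens:
        return True  # empty text is useless either way
    wordlike = sum(
        1 for t in tokens if _WORDLIKE.fullmatch(t.strip(".,;:!?()[]\""))
    )
    return wordlike / len(tokens) < threshold


# Hypothetical use before building the index: keep only readable docs.
# documents = [d for d in documents if not looks_like_gibberish(d.text)]
```

The same check could also be wrapped in a custom transformation inside an ingestion pipeline so badly scanned chunks are dropped per-node rather than per-document; either way, eyeball a sample of what gets filtered before trusting any threshold.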
5 comments
@Logan M when can we use it with LlamaIndex? 😂
3 comments