skittythecat
Offline, last seen 3 months ago
Joined September 25, 2024

Parsing

I have been looking for ways to parse PDFs into titles, sections, paragraphs, and tables, as well as to caption figures. My goal is a very robust indexer that maximizes the LLM's understanding of PDF snippets by using things like section titles as metadata, chunking at section boundaries, keeping tables together, recognizing column headers, and so on.

It seems that none of the PDF integrations for LlamaIndex do all this; they all just rip the text from the PDF without these structural elements. Unfortunately, I can't use third-party APIs for this, as the PDFs are sensitive company data.

I tried looking into the libraries behind the integrations, including PyMuPDF, pdfminer.six, and so on, and found that they can do things like produce an HTML file with the right layout, but they still don't recover the actual structure of the PDF in terms of titles, sections, paragraphs, and so on. LlamaIndex does have the unstructured reader integration, but its usage is extremely simple: it runs the low-res auto partition and then just concatenates all the bits of text like the other readers, so again it's not what I need.

However, when I looked into the unstructured library itself, it is the most powerful one I have found so far. They seem to throw everything they can find at the problem, including OCR and other computer-vision / machine-learning models. What was lacking, however, was converting the list of elements they find into a structure that is easily consumed by a file reader or parser. For example, if I could convert to Markdown, there are parsers that will turn the headings into metadata.

It seems like this last step is fairly straightforward and I'm wondering if someone has already written some code that does it.
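
In case it helps clarify what I mean, here is a rough sketch of that last step: flattening unstructured's element list into Markdown so a Markdown-aware parser can pick the headings up as metadata. This assumes the hi_res strategy and that attributes like category and metadata.text_as_html look the way they do in the version I have installed.

Plain Text
from unstructured.partition.pdf import partition_pdf

def elements_to_markdown(pdf_path: str) -> str:
    # Partition the PDF into typed elements (Title, NarrativeText, Table, ...).
    elements = partition_pdf(filename=pdf_path, strategy="hi_res", infer_table_structure=True)
    lines = []
    for el in elements:
        if el.category == "Title":
            lines.append(f"## {el.text}")  # section titles become Markdown headings
        elif el.category == "Table":
            # Keep the table together as one block; text_as_html preserves column headers when available.
            html = getattr(el.metadata, "text_as_html", None)
            lines.append(html or el.text)
        elif el.category == "ListItem":
            lines.append(f"- {el.text}")
        else:
            lines.append(el.text)  # paragraphs, captions, etc.
    return "\n\n".join(lines)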

Finally, I am wondering if there is another library or integration that I missed that would do everything I want.
6 comments
Is there a good way to finish creating embeddings before inserting nodes into a remote index? I'm doing this by building an in-memory index, then accessing the SimpleVectorStoreData property to create nodes with embeddings, and using those to build my VectorStoreIndex over an Azure Cognitive Search storage context.

Plain Text
        logger.info("Indexing may take a while.  Please wait...")
        # First create an in memory index to get embeddings for all nodes.
        local_vector_store_index = VectorStoreIndex.from_documents(documents)
        nodes = list(local_vector_store_index.docstore.docs.values())
        local_vector_store_data: SimpleVectorStoreData = local_vector_store_index.vector_store._data
        for node in nodes:
            node.embedding = local_vector_store_data.embedding_dict[node.node_id]

        ################## This is an Azure Vector Store 
        VectorStoreIndex(nodes, storage_context=azure_storage_context)
        logger.info(f"Your index {self.index_name} should now be populated at {self.service_endpoint} (go check)")
9 comments
Below is what llama-index provides. So if I wrapped LangChain's equivalent in a class that looks like that one...
15 comments
Is there a way to get the embeddings out of an index loaded from a persisted VectorStoreIndex on disk?
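
To illustrate what I mean, the best I can come up with is reaching into the default SimpleVectorStore's private data after loading, which feels fragile (persist_dir is just an example path):

Plain Text
from llama_index import StorageContext, load_index_from_storage

# Rebuild the index from its persisted directory.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# For the default SimpleVectorStore this maps node_id -> embedding vector,
# but _data is a private attribute, so it may change between versions.
embedding_dict = index.vector_store._data.embedding_dict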
1 comment
I really liked this stuff you guys did on putting RAG in production. The ideas seem promising, but I still find that they fail on various edge cases, particularly in technical documents. For example the windowing idea, where you index one sentence at a time, but then the LLM sees a larger portion of the document. I end up with a lot of garbage in my index :D. Also, PDFs with tables 🙄. Are you doing some more work / discussion around this? https://docs.llamaindex.ai/en/stable/end_to_end_tutorials/dev_practices/production_rag.html
9 comments
I'm getting an error about embedding rate limits from Azure OpenAI that I wasn't getting before, and the error message tells me to wait 24 hours. This is with an embed batch size of 500; it doesn't happen with a batch size of 100.

However, I didn't have to wait 24 hours. I just reduced the batch size and it worked.

Unanswered questions:
  1. Did Azure OpenAI recently reduce max batch size?
  2. Why does it say 86400s when it's not true?
Plain Text
2024-06-06 14:54:59,932 - llama_index.embeddings.openai.utils - WARNING - Retrying llama_index.embeddings.openai.base.get_embeddings in 0.6736039165786247 seconds as it raised RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the Embeddings_Create Operation under Azure OpenAI API version 2024-02-15-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 86400 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}.
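
For reference, the workaround was just lowering the batch size on the embedding model, roughly like this (the deployment name is a placeholder, and the endpoint/credential arguments are left out):

Plain Text
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# embed_batch_size controls how many texts go into one Embeddings_Create request;
# dropping it from 500 to 100 kept each call under the S0 tier's rate limit.
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="my-embedding-deployment",  # placeholder, not a real deployment
    embed_batch_size=100,
)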
1 comment
When I try to run llamaindex-cli I get

Plain Text
$ llamaindex-cli-tool.exe upgrade file.py 
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "...\venv\Scripts\cli-tool.exe\__main__.py", line 4, in <module>
  File "...\venv\Lib\site-packages\package\cli\command_line.py", line 4, in <module>
    from package.cli.module import ModuleCLI, default_modulecli_persist_dir
  File "...\venv\Lib\site-packages\package\cli\module\__init__.py", line 1, in <module>
    from package.cli.module.base import ModuleCLI, default_modulecli_persist_dir
  File "...\venv\Lib\site-packages\package\cli\module\base.py", line 9, in <module>
    from package.core import (
ImportError: cannot import name 'SimpleDirectoryReader' from 'package.core' (unknown location)


Here's some pip freeze output, for reference:

Plain Text
llama-hub==0.0.79.post1
llama-index==0.10.14
llama-index-agent-openai==0.1.5
llama-index-cli==0.1.7
llama-index-core==0.10.14.post1
llama-index-embeddings-openai==0.1.6
llama-index-indices-managed-llama-cloud==0.1.3
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.7
llama-index-multi-modal-llms-openai==0.1.4
llama-index-program-openai==0.1.4
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.6
llama-index-readers-llama-parse==0.1.3
llama-index-vector-stores-chroma==0.1.5
llama-parse==0.3.5
llamaindex-py-client==0.1.13
19 comments
I'm looking for tools relating to Evaluation, as Jerry Liu explains here https://youtu.be/TRjq7t2Ms5I?si=7GtUTie9nD6_OdKa&t=339
1 comment
I've been building RAG systems with LlamaIndex for a few months now, both for internal productivity tools and enterprise app software features, and I'm missing a piece of the puzzle. I would like my users to be able to maintain their own vector db indices with a no-code browser interface, de-duplication, and CRUD functionality. Does anyone know of a project with features like this?

Edit: I should add that I must use Azure's vector db, or generally be able to use the db provider of my choosing.
2 comments
Separate issue: with LlamaIndex 0.8.59 I got an error about instantiating a retriever with an abstract method when calling index.as_query_engine() on a SummaryIndex.
3 comments
Arize Phoenix is not tracing queries to a summary index in a Jupyter notebook today.

Plain Text
import phoenix as px
from llama_index import set_global_handler
set_global_handler("arize_phoenix")
session = px.launch_app()


Plain Text
index = SummaryIndex.from_documents(documents, storage_context=storage_context, service_context=service_context)
query_engine = index.as_query_engine(text_qa_template=text_qa_template, refine_template=refine_template)
response = query_engine.query("")


No trace in phoenix.

LlamaIndex 0.8.57 and Arize Phoenix 0.0.51
1 comment
Hello, I'm looking at the context chat engine for a RAG chat application and noticing that only the last user message is embedded to query the db. Can I use a query engine (of which there are many more choices) to get the same effect as a chat engine by manually handling the chat history and such? Is there a good tutorial for this?
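
To make the question concrete, here is a rough sketch of what I mean by manually handling the chat history with a plain query engine (all the names here are made up for illustration):

Plain Text
# Keep the chat history ourselves and fold it into the retrieval query,
# so the query engine sees more context than just the last user message.
chat_history = []  # list of (role, message) tuples

def chat_turn(query_engine, user_message: str) -> str:
    # Prepend the most recent turns to the query string before retrieval.
    history_text = "\n".join(f"{role}: {msg}" for role, msg in chat_history[-6:])
    combined_query = f"{history_text}\nuser: {user_message}" if chat_history else user_message

    response = query_engine.query(combined_query)

    chat_history.append(("user", user_message))
    chat_history.append(("assistant", str(response)))
    return str(response)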
21 comments
Are there diagrams / flow charts for the Data Agents? I find it hard to reason about these things and use them for solving complex problems without a mental map of the flow of information and prompt stages. The "how the indices work" diagrams were super helpful and it would be good to have something like that for the newer features too.
2 comments
I'm looking for an in-memory or on-premise vector store db where llama_index supports metadata filters.
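
By metadata filters I mean something along these lines (a sketch based on the filter classes in the docs; the key and value are made up):

Plain Text
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Restrict retrieval to nodes whose metadata matches the filter.
filters = MetadataFilters(filters=[ExactMatchFilter(key="section", value="Introduction")])
retriever = index.as_retriever(similarity_top_k=4, filters=filters)
nodes = retriever.retrieve("What does the introduction say?")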
1 comment
If I build an index over some documents and then re-scrape the document source, I can use refresh() to update documents with the same doc_id but different text, as well as add new documents... but I'd also like to delete documents that are not present in the new scrape. Does llama-index have a built-in management utility for this too?
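
In case there isn't one, this is roughly the manual fallback I had in mind (ref_doc_info, refresh_ref_docs, and delete_ref_doc usage is assumed from the version I'm on):

Plain Text
# doc_id of every document in the fresh scrape
new_doc_ids = {doc.doc_id for doc in documents}

# Update changed documents and insert new ones.
index.refresh_ref_docs(documents)

# Delete anything in the index that is no longer present in the new scrape.
for ref_doc_id in list(index.ref_doc_info.keys()):
    if ref_doc_id not in new_doc_ids:
        index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)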
3 comments
I have been experimenting with searching (fake) enterprise data and generating answers from the search results, and I find that the REFINE method is very slow: SIMPLE_SUMMARIZE takes <7s while REFINE takes ~40s. Where can I find more information on query performance, and on ways to improve on SIMPLE_SUMMARIZE that don't make ~6 sequential calls to the LLM? My index is a vector store, but it doesn't have to be...

Plain Text
qa_prompt = QuestionAnswerPrompt(prompt_prefix_template.format(
    context_str="{context_str}",
    query_str="{query_str}"))

refine_template_string = CHAT_REFINE_PROMPT_TMPL.format(
    context_msg="{context_msg}",
    query_str="{query_str}",
    existing_answer="{existing_answer}")
my_refine_prompt = RefinePrompt(refine_template_string)

# REFINE makes roughly one LLM call per retrieved chunk (similarity_top_k=6 here),
# which is why it is so much slower than SIMPLE_SUMMARIZE's single call.
query_engine_refine = index.as_query_engine(
    text_qa_template=qa_prompt,
    refine_template=my_refine_prompt,
    response_mode=ResponseMode.REFINE,
    similarity_top_k=6,
)
query_engine_simple = index.as_query_engine(
    text_qa_template=qa_prompt,
    response_mode=ResponseMode.SIMPLE_SUMMARIZE,
    similarity_top_k=4,
)
4 comments