
Hi there, I have been working with ingestion pipelines and docstores, and I am finding that for a large document with a large number of nodes there can be a significant performance hit when doing any document management like delete/add. This is because a put is done for every node action (in delete, e.g. https://github.com/run-llama/llama_index/blob/a24292c79424affeeb47920b327c20eca5ba85ff/llama-index-core/llama_index/core/storage/docstore/keyval_docstore.py#L485), and depending on the number of remaining nodes, it can take a while. Would it make more sense to wait until all the nodes are removed before doing the put for ref_doc_info?
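For illustration, here is a minimal sketch (not LlamaIndex's actual docstore code) contrasting a put of the ref_doc_info entry on every node removal with a single batched put at the end. The toy KV store below is an assumption, used only to make the per-put cost visible:

Python
import json

# Illustrative sketch only -- a toy KV store, not LlamaIndex's KVDocumentStore.
# Every put re-serializes the whole ref_doc_info entry, which is what makes
# one put per removed node expensive for a document with many nodes.
class ToyKVStore:
    def __init__(self) -> None:
        self._data: dict[str, str] = {}
        self.put_calls = 0

    def put(self, key: str, val: dict) -> None:
        self.put_calls += 1
        self._data[key] = json.dumps(val)  # full re-serialization on every put

    def get(self, key: str) -> dict:
        return json.loads(self._data[key])

def delete_ref_doc_per_node(kv: ToyKVStore, ref_doc_id: str) -> None:
    """Current-style behavior: put ref_doc_info after removing each node."""
    info = kv.get(ref_doc_id)
    for node_id in list(info["node_ids"]):
        info["node_ids"].remove(node_id)
        kv.put(ref_doc_id, info)  # one put per node removal

def delete_ref_doc_batched(kv: ToyKVStore, ref_doc_id: str) -> None:
    """Proposed behavior: drop all node ids first, then put ref_doc_info once."""
    info = kv.get(ref_doc_id)
    info["node_ids"] = []
    kv.put(ref_doc_id, info)  # single put at the end

kv = ToyKVStore()
kv.put("doc-1", {"node_ids": [f"node-{i}" for i in range(1000)]})
delete_ref_doc_per_node(kv, "doc-1")
print(kv.put_calls)  # 1001 puts here, vs. 2 with the batched version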
I think in general, the ingestion pipeline makes too many calls to the vector store and docstore -- it should be batching everything at the end πŸ˜…

Very open to any improvements or PRs here!
ok, I'll create one for delete for review
Is this a bug or a feature?
(for the GitHub issue)
mmm... let's call this an enhancement
I don't see that as an option when I create an issue
or do I just create the PR
oh yeah, I guess it's a feature then lol

But you can just create a PR too
The other issue is that the ingestion pipeline doesn't support Pinecone serverless (e.g. metadata filtering for delete is not allowed).
Ugh, I know 😅 There is a PR somewhere for this, but I don't think it's fully baked yet.
Ok. If you point me to the PR, I can see if I can have someone on my team help out.
What's interesting about the high number of nodes is that the document chunking results in about a thousand nodes, but once the DocumentSummaryIndex gets created, the ref_doc_info balloons to 1M+ refs. So there's something strange going on there; I'll have to look into that. If that number of refs is unavoidable, we may have to namespace by document and delete the namespace in the KV store directly.
@Logan M I tracked down the cause of the 1M refs. It looks like if the docstore is used to store multiple indexes, each index will cause an exponential increase in refs. Here is a basic notebook to see the issue (I didn't test it this late on a Friday, but it should work)

The code that's causing it is this: https://github.com/run-llama/llama_index/blob/dd6910757fa846370d3e04183838dee7f0ddec28/llama-index-legacy/llama_index/legacy/storage/docstore/keyval_docstore.py#L103
Once the refs are that high, any KV put on the refs takes an inordinate amount of time.
I'm not sure how refs are used in general, so I don't know how the merge routine could change without breaking things downstream.
Let me know if I should create a bug in GitHub.
Fixed the example so it runs (it needs an OpenAI key, or it fails with a connection error). If you repeatedly run the last step, the refs will continue to grow.
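A minimal sketch of the kind of reproduction described above: build more than one index over the same docstore and check how many node ids each ref doc tracks. The document texts and ids are placeholders, and an OpenAI key is assumed for the embedding and summary calls:

Python
# Minimal repro sketch (assumed setup): two indexes built over one docstore.
# Requires an OpenAI API key for the embedding and summary calls.
from llama_index.core import (
    Document,
    DocumentSummaryIndex,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
storage_context = StorageContext.from_defaults(docstore=docstore)

docs = [Document(text=f"Some text for document {i}.", id_=f"doc-{i}") for i in range(3)]

# Build two indexes that share the same docstore.
VectorStoreIndex.from_documents(docs, storage_context=storage_context)
DocumentSummaryIndex.from_documents(docs, storage_context=storage_context)

# See how many node ids each ref doc now tracks; re-running the index
# construction above keeps growing these counts.
for ref_doc_id, info in (docstore.get_all_ref_doc_info() or {}).items():
    print(ref_doc_id, len(info.node_ids))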
@Logan M question: I am trying to use this updated version in my application, and I refer to it in requirements.txt as the git repo, but it always tries to get the wheel for 10.51 instead of using the updated core (because it's defined in poetry as such). How can I make it use the latest?
I did something hacky-looking by using my local copy and this in requirements.txt, which seems to have worked:

Plain Text
../llama_index/llama-index-core
../llama_index
Yeah, the llama-index package is just a starter wrapper on several packages (including core). You could just skip installing that.
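As a hedged alternative to the relative-path hack above, requirements.txt can point at just the local core package as an editable install; the path below is an assumption about the local checkout layout:

Plain Text
# editable install of only the local llama-index-core checkout (path is illustrative)
-e ../llama_index/llama-index-core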