🤔 doc... store...

At a glance

The post asks whether a "doc... store..." is a file format for storing LlamaIndex Documents. The comments explain that a "doc store" is an object/class that holds nodes and information about their parent documents, essentially a key-value store. Community members discuss integrating Apache Solr for search, whether document stores should support efficient retrieval, and using file metadata to inform LLMs. They also discuss the responsibilities of different classes, such as BaseRetriever, which has two methods to implement: _retrieve() and, optionally, _aretrieve().

Mr Pebble

🤔 doc... store...

Is that like a file format for storing LlamaIndex Documents?

27 comments

Logan M

It's just an object/class that holds nodes and information about their parent documents

Logan M

Essentially it's a giant key-value store
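
A minimal sketch of that description, in plain Python rather than the library's own classes. `Node`, `ToyDocStore`, and every method name here are hypothetical stand-ins chosen for illustration; the point is only that lookups go by key and that each node keeps a pointer back to its parent document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A chunk of text plus a pointer back to its parent document."""
    node_id: str
    text: str
    ref_doc_id: Optional[str] = None  # id of the parent document, if any


@dataclass
class ToyDocStore:
    """Hypothetical stand-in for a docstore: dict lookups only, no search."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    # parent document id -> ids of the nodes that were split from it
    ref_doc_info: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node
        if node.ref_doc_id is not None:
            self.ref_doc_info.setdefault(node.ref_doc_id, []).append(node.node_id)

    def get(self, node_id: str) -> Node:
        return self.nodes[node_id]

    def nodes_for_doc(self, ref_doc_id: str) -> List[Node]:
        return [self.nodes[i] for i in self.ref_doc_info.get(ref_doc_id, [])]


store = ToyDocStore()
store.add(Node("n1", "Solr is a search platform.", ref_doc_id="doc-a"))
store.add(Node("n2", "It builds an inverted index.", ref_doc_id="doc-a"))
```

Nothing in this sketch ranks or searches; it only answers "give me the node with this id" and "which nodes came from this document".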

Mr Pebble

Oh. Okay. Thanks.

Voyager2

Hey @Logan M

Voyager2

I think many orgs currently use Apache Solr for their search platforms. If we could provide an integration with Solr just like we did for ES, that would be great! And I’m currently working on this

Voyager2

Also, do you think the notion of document stores being purely key-value stores might change? The code seems to suggest it might.

Voyager2

For example, what if instead of splitting documents into nodes and then storing them, we want to do a traditional keyword search on the query and let the LLM summarize the retrieved results?

Mr Pebble

I'd actually like to see LLMs use normal file metadata to describe file contents and then also be able to use that to identify which files are relevant to it/us

Mr Pebble

Being able to use that to inform things like SimpleDirectoryReader or CSVReader so it knows what it's looking at via extra_info seems prudent.
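
A rough sketch of the file-metadata idea using only the standard library. The function name `file_metadata` and the dictionary keys are illustrative choices, and the assumption that a reader can accept a path-to-dict callable like this one (and attach the result as extra_info) should be checked against the reader's documentation for the version in use.

```python
import mimetypes
import os
from datetime import datetime, timezone
from typing import Dict


def file_metadata(path: str) -> Dict[str, object]:
    """Collect ordinary filesystem metadata for one file."""
    stat = os.stat(path)
    mime_type, _ = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        "file_type": mime_type,  # None when the extension is not recognized
        "file_size": stat.st_size,  # size in bytes
        "last_modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }

# Assumption: a reader that takes a path -> dict callable could be handed
# this function so each loaded document carries the dict as extra metadata.
```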

Voyager2

Interesting idea

Voyager2

I think this is already possible by using the QueryPipeline though.

Voyager2

Or maybe you'd only need to make small extensions to extract file metadata

Mr Pebble

I actually suspect we're going to want to change file headers now.

Mr Pebble

Files are no longer accessible only to us. Extra metadata for music files, such as lyrics and tempo, would be meaningful and valuable because it lets our own AI models make better use of our data.

Voyager2

Hey @Logan M
Sorry to tag you again. I was just wondering if you have any thoughts on this, namely integrating Solr into LlamaIndex just like we have for ES, and also whether the use case of keyword retrieval on an inverted index (which current docstores do not support) is still relevant in the RAG era.

Logan M

It's definitely possible to add this. I don't think retrieval techniques belong on the docstore though; that's either a retriever or a vector store

Logan M

Open to contributions in any case

Voyager2

But what if the docstore supports efficient retrieval of documents of interest?

Voyager2

As is the case with ES and Solr

Voyager2

Otherwise, I don’t see the point of distinguishing docstores and kvstores in the codebase. Most of the docstores inherit from KVDocumentStore instead of BaseDocumentStore anyway

Logan M

Yeah, a docstore is just a key-value lookup interface

I don't see why you can't define a retriever on top of the same collection a docstore uses.

Logan M

imo docstores are not for retrieval, just key/val lookup and metadata tracking

Logan M

Just trying to have clear responsibilities for classes 🙂

Voyager2

I see. What is the base class for a retriever?

Logan M

BaseRetriever

Logan M

It has two methods to implement: _retrieve() and optionally _aretrieve()
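
A sketch of what "a retriever on top of the same collection a docstore uses" could look like. To stay runnable without the library installed, the `BaseRetriever` below is a simplified stand-in that only mirrors the two hooks named above; it is not the real LlamaIndex class, whose signatures use the library's own query and node types. `KeywordRetriever`, `collection`, and `top_k` are hypothetical names, and the scoring is plain word overlap rather than a real inverted index such as Solr's.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (node id, score)


class BaseRetriever(ABC):
    """Simplified stand-in for the library base class (NOT the real one)."""

    def retrieve(self, query: str) -> List[Scored]:
        # Public entry point delegates to the subclass hook.
        return self._retrieve(query)

    @abstractmethod
    def _retrieve(self, query: str) -> List[Scored]:
        ...

    async def _aretrieve(self, query: str) -> List[Scored]:
        # Optional async hook; by default it reuses the sync version.
        return self._retrieve(query)


class KeywordRetriever(BaseRetriever):
    """Naive keyword overlap over a key-value collection of node texts."""

    def __init__(self, collection: Dict[str, str], top_k: int = 2) -> None:
        self._collection = collection
        self._top_k = top_k

    def _retrieve(self, query: str) -> List[Scored]:
        terms = set(query.lower().split())
        scored: List[Scored] = []
        for node_id, text in self._collection.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((node_id, float(overlap)))
        # Highest overlap first; ties keep insertion order.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[: self._top_k]


collection = {
    "n1": "solr builds an inverted index",
    "n2": "docstores are key value lookups",
    "n3": "an inverted index maps terms to documents",
}
retriever = KeywordRetriever(collection)
results = retriever.retrieve("inverted index")
```

The key-value collection stays a dumb lookup table, and all ranking logic lives in the retriever, which matches the split of responsibilities described in the thread.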

Voyager2

Perfect! I'll have a look at those. Thank you Logan!
