Updated 4 months ago

At a glance

The post is about creating a search-based chatbot that can handle various file types (PDF, TXT, RTF, etc.) in a multi-tenant environment like AWS. The community members discuss the following key points:

1. Whether to use a vector database like Qdrant or create an index for each user in the multi-tenant system.

2. The ability to load the index from AWS services like S3 or EFS.

3. The possibility of specifying metadata like user ID in Qdrant to avoid creating too many collections and instead use filters when querying.

4. How to retrieve source information like page numbers, line numbers, and references when querying PDF or TXT files.

The community members provide suggestions and share their experiences, such as using the save_to_string and load_from_string functions to make the index data easily uploadable, and setting the extra_info dictionary for each document or node to store additional metadata. They also discuss the trade-offs between using a single file + index versus a vector database for supporting a large number of files and users.

There is no explicitly marked answer in the comments, but the community members are open to contributions in this area.
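The save_to_string / load_from_string idea mentioned above boils down to serializing the index to a string so any blob store can hold it. A minimal stdlib sketch of that round trip; the index contents here are a toy stand-in, and the boto3 S3 call is shown only as a hedged comment (bucket and key names are illustrative):

```python
import json

# Toy stand-in for save_to_string / load_from_string: the index is reduced
# to a single string so any blob store (S3, EFS, a database row) can hold it.
index_data = {"vectors": [[0.1, 0.2]], "docs": ["hello"]}
blob = json.dumps(index_data)   # save_to_string equivalent
restored = json.loads(blob)     # load_from_string equivalent

# With boto3 (assumed installed and configured), the upload would be roughly:
#   s3 = boto3.client("s3")
#   s3.put_object(Bucket="my-bucket", Key="user-42/index.json", Body=blob)

print(restored == index_data)  # True
```

The same string can later be fetched back from S3 and handed to load_from_string to rebuild the index per request.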

Hello,
I want to create a search-based chatbot over users' files (PDF, TXT, RTF, etc.). Some questions around that:
  1. In an environment like AWS, should a vector DB like Qdrant be used, or should we just create an index per user, since we have a multi-tenant system?
  2. Can we load an index into, or load one from, something like AWS S3, EFS, etc.?
  3. If we use Qdrant, can we specify metadata like a user ID so we don't create too many indexes (collections) in the vector DB, but instead define a filter when querying so that only documents specific to that user are extracted?
  4. If we are querying PDF or TXT files, how can we get the source page number, line number, references, etc.?
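The filtering idea from question 3 can be sketched with a plain in-memory list standing in for the vector store; the user_id field and store layout are illustrative assumptions, not Qdrant's actual API (in real Qdrant this would be a payload filter on a single shared collection):

```python
# Illustrative stand-in for a multi-tenant vector store: one shared store,
# per-document metadata, and a tenant filter applied before retrieval.
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)

store = [
    Doc("alice's report", {"user_id": "alice"}),
    Doc("bob's notes", {"user_id": "bob"}),
    Doc("alice's memo", {"user_id": "alice"}),
]

def query(store, user_id):
    # Filter by tenant first, then run retrieval over the survivors.
    return [d for d in store if d.metadata.get("user_id") == user_id]

print([d.text for d in query(store, "alice")])
# prints ["alice's report", "alice's memo"]
```

The point is that one collection with a metadata filter scales to many tenants without a per-user collection explosion.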
7 comments
  1. Depends on how much data will be in each index. At a certain point, GPTSimpleVectorIndex will slow down because everything (embedding vectors AND documents) is loaded into memory
  1. You might be interested in the save_to_string and load_from_string functions, to make the index data easily uploadable on S3 or others
  1. You are actually the second person to ask about this today! @jerryjliu0 I think this isn't supported for qdrant yet right? Very open to a PR if you want to take a stab at it @hammad
  1. You can set the "extra_info" dict of each document and/or node object before inserting into the index.
document.extra_info = {"file_name": "my_file.txt"}, or with nodes, node.node_info = {"file_name": "my_file.txt"}
Then you can check response.source_nodes after getting a query response to see that info dict
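The extra_info / source_nodes idea above, applied to question 4 (page numbers for PDFs), can be sketched in plain Python; the per-page node layout and the keyword-overlap scoring are toy assumptions standing in for real embedding retrieval:

```python
# Sketch of carrying source info (page number, file name) through a query.
# The scoring is a toy keyword overlap, not real embedding similarity.
def build_nodes(pages, file_name):
    # One node per PDF page, with its page number stored as metadata.
    return [
        {"text": text, "extra_info": {"file_name": file_name, "page": i + 1}}
        for i, text in enumerate(pages)
    ]

def query_with_source(nodes, question):
    words = set(question.lower().split())
    best = max(nodes, key=lambda n: len(words & set(n["text"].lower().split())))
    # Return the answer text along with its source metadata.
    return best["text"], best["extra_info"]

nodes = build_nodes(
    ["intro text", "refund policy details", "contact info"], "handbook.pdf"
)
text, source = query_with_source(nodes, "what is the refund policy")
print(source)  # {'file_name': 'handbook.pdf', 'page': 2}
```

Because each node carries its metadata, whatever node the retriever returns already knows which file and page it came from.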
can we specify this while reading a directory? And for PDFs, should we parse manually and specify the page number etc?
Yea you can do this in the directory reader!

Plain Text
filename_fn = lambda filename: {'file_name': filename}
documents = SimpleDirectoryReader('data/', file_metadata=filename_fn).load_data()
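The file_metadata callable above can return whatever dict you like. A stdlib-only sketch of a richer version; the extra keys (extension, size_bytes) are illustrative assumptions, not anything SimpleDirectoryReader requires:

```python
import os

# A richer file_metadata callable: still takes a path, still returns a dict.
# Keys beyond "file_name" are illustrative; add whatever your filters need.
def file_metadata(path):
    return {
        "file_name": os.path.basename(path),
        "extension": os.path.splitext(path)[1],
        # Guarded so the function also works on paths that don't exist yet.
        "size_bytes": os.path.getsize(path) if os.path.exists(path) else None,
    }

print(file_metadata("data/report.pdf"))
```

Passing this in place of filename_fn would stamp each loaded document with all three fields, so they surface later in response.source_nodes.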
we got a long way just using a single file + index. Works really nicely. A vector DB seems only necessary if you need to support lots of files and lots of users
@Logan M can we somehow set extra_info via SimpleDirectoryReader to treat each file differently?
for pinecone we do, don't think we have one for qdrant yet 😮
super open to contributions here