Updated 4 months ago

At a glance

The post is about creating a search-based chatbot that can handle various file types (PDF, TXT, RTF, etc.) in a multi-tenant environment like AWS. The community members discuss the following key points:

1. Whether to use a vector database like Qdrant or create an index for each user in the multi-tenant system.

2. The ability to load the index from AWS services like S3 or EFS.

3. The possibility of specifying metadata like user ID in Qdrant to avoid creating too many collections and instead use filters when querying.

4. How to retrieve source information like page numbers, line numbers, and references when querying PDF or TXT files.

The community members provide suggestions and share their experiences, such as using the save_to_string and load_from_string functions to make the index data easily uploadable, and setting the extra_info dictionary for each document or node to store additional metadata. They also discuss the trade-offs between using a single file + index versus a vector database for supporting a large number of files and users.

There is no explicitly marked answer in the comments, but the community members are open to contributions in this area.
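The save_to_string / load_from_string idea mentioned above boils down to serializing the index to a string so any blob store can hold it. A minimal stdlib sketch of that round trip; the index contents here are a toy stand-in, and the boto3 S3 call is shown only as a hedged comment (bucket and key names are illustrative):

```python
import json

# Toy stand-in for save_to_string / load_from_string: the index is reduced
# to a single string so any blob store (S3, EFS, a database row) can hold it.
index_data = {"vectors": [[0.1, 0.2]], "docs": ["hello"]}
blob = json.dumps(index_data)   # save_to_string equivalent
restored = json.loads(blob)     # load_from_string equivalent

# With boto3 (assumed installed and configured), the upload would be roughly:
#   s3 = boto3.client("s3")
#   s3.put_object(Bucket="my-bucket", Key="user-42/index.json", Body=blob)

print(restored == index_data)  # True
```

The same string can later be fetched back from S3 and handed to load_from_string to rebuild the index per request.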

Hello,
I want to create a search-based chatbot over users' files (PDF, TXT, RTF, etc.). Some questions around that:
  1. In an environment like AWS, should a vector DB like Qdrant be used, or should we just create an index per user, since we have a multi-tenant system?
  2. Can we load an index into, or load one from, something like AWS S3, EFS, etc.?
  3. If we use Qdrant, can we specify metadata like a user ID so we don't create too many indexes (collections) in the vector DB, but instead define a filter when querying so that only documents specific to that user are extracted?
  4. If we are querying PDF or TXT files, how can we get the source page number, line number, references, etc.?
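The filtering idea from question 3 can be sketched with a plain in-memory list standing in for the vector store; the user_id field and store layout are illustrative assumptions, not Qdrant's actual API (in real Qdrant this would be a payload filter on a single shared collection):

```python
# Illustrative stand-in for a multi-tenant vector store: one shared store,
# per-document metadata, and a tenant filter applied before retrieval.
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)

store = [
    Doc("alice's report", {"user_id": "alice"}),
    Doc("bob's notes", {"user_id": "bob"}),
    Doc("alice's memo", {"user_id": "alice"}),
]

def query(store, user_id):
    # Filter by tenant first, then run retrieval over the survivors.
    return [d for d in store if d.metadata.get("user_id") == user_id]

print([d.text for d in query(store, "alice")])
# prints ["alice's report", "alice's memo"]
```

The point is that one collection with a metadata filter scales to many tenants without a per-user collection explosion.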
7 comments
  1. Depends on how much data will be in each index. At a certain point, GPTSimpleVectorIndex will slow down because everything (embedding vectors AND documents) is loaded into memory
  1. You might be interested in the save_to_string and load_from_string functions, to make the index data easily uploadable on S3 or others
  1. You are actually the second person to ask about this today! @jerryjliu0 I think this isn't supported for qdrant yet right? Very open to a PR if you want to take a stab at it @hammad
  1. You can set the "extra_info" dict of each document and/or node object before inserting into the index.
document.extra_info = {"file_name": "my_file.txt"}, or with nodes, node.node_info = {"file_name": "my_file.txt"}
Then you can check response.source_nodes after getting a query response to see that info dict
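The extra_info / source_nodes idea above, applied to question 4 (page numbers for PDFs), can be sketched in plain Python; the per-page node layout and the keyword-overlap scoring are toy assumptions standing in for real embedding retrieval:

```python
# Sketch of carrying source info (page number, file name) through a query.
# The scoring is a toy keyword overlap, not real embedding similarity.
def build_nodes(pages, file_name):
    # One node per PDF page, with its page number stored as metadata.
    return [
        {"text": text, "extra_info": {"file_name": file_name, "page": i + 1}}
        for i, text in enumerate(pages)
    ]

def query_with_source(nodes, question):
    words = set(question.lower().split())
    best = max(nodes, key=lambda n: len(words & set(n["text"].lower().split())))
    # Return the answer text along with its source metadata.
    return best["text"], best["extra_info"]

nodes = build_nodes(
    ["intro text", "refund policy details", "contact info"], "handbook.pdf"
)
text, source = query_with_source(nodes, "what is the refund policy")
print(source)  # {'file_name': 'handbook.pdf', 'page': 2}
```

Because each node carries its metadata, whatever node the retriever returns already knows which file and page it came from.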
can we specify this while reading a directory? And for PDFs, should we parse manually and specify the page number etc?
Yea you can do this in the directory reader!

Plain Text
filename_fn = lambda filename: {'file_name': filename}
documents = SimpleDirectoryReader('data/', file_metadata=filename_fn).load_data()
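The file_metadata callable above can return whatever dict you like. A stdlib-only sketch of a richer version; the extra keys (extension, size_bytes) are illustrative assumptions, not anything SimpleDirectoryReader requires:

```python
import os

# A richer file_metadata callable: still takes a path, still returns a dict.
# Keys beyond "file_name" are illustrative; add whatever your filters need.
def file_metadata(path):
    return {
        "file_name": os.path.basename(path),
        "extension": os.path.splitext(path)[1],
        # Guarded so the function also works on paths that don't exist yet.
        "size_bytes": os.path.getsize(path) if os.path.exists(path) else None,
    }

print(file_metadata("data/report.pdf"))
```

Passing this in place of filename_fn would stamp each loaded document with all three fields, so they surface later in response.source_nodes.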
we got a long way just using a single file + index. Works really nicely. A vector DB seems only necessary if you need to support lots of files and lots of users
@Logan M can we somehow set extra_info via SimpleDirectoryReader to treat each file differently?
for pinecone we do, don't think we have one for qdrant yet 😮
super open to contributions here