I'm a startup founder, and we're preparing to set up our vectorstore to train an LLM. I'm curious how we should set up our server architecture to separate some of our data and make sure the LLM can't access it in case of prompt hacking. At the same time, we want the LLM to be able to retrieve answers and generate responses from the vectorstore that will host our customers' data. Does anyone have advice on structuring the Azure database, or videos or documents I should review?
You shouldn't connect an LLM to an index that hosts data you don't want accessed, unless you use an approach like Pinecone namespaces. If a prompt is the only thing stopping someone from accessing the underlying information, that will not work.
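To illustrate why namespaces help: the isolation happens at the query layer, before anything ever reaches the model. Here's a toy in-memory sketch (not a real vector DB client; the class and record names are invented for illustration) of why a namespace-scoped query can never return another namespace's records:

```python
from collections import defaultdict

class ToyVectorStore:
    """Toy stand-in for a namespaced vector store (e.g. Pinecone namespaces).

    Records live in separate buckets per namespace, and a query runs against
    exactly one bucket, so other namespaces are unreachable by construction.
    No prompt can widen the search.
    """

    def __init__(self):
        self._namespaces = defaultdict(dict)  # namespace -> {id: text}

    def upsert(self, namespace, record_id, text):
        self._namespaces[namespace][record_id] = text

    def query(self, namespace, keyword):
        # Only the caller-supplied namespace is searched.
        bucket = self._namespaces[namespace]
        return [text for text in bucket.values() if keyword in text]

store = ToyVectorStore()
store.upsert("public-docs", "d1", "pricing page: plans start at $10")
store.upsert("internal", "s1", "secret roadmap: acquire CompetitorCo")

# The retrieval layer pins the namespace; the LLM only ever sees these hits.
# A malicious prompt asking about the roadmap can't reach "internal",
# because the query never touches that bucket.
hits = store.query("public-docs", "pricing")
```

The point is that the boundary is enforced in code you control, not in instructions the model is asked to follow.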
Can you control the output, though? If you ask me for a fruit, and I have a database of all sorts of food, can I tell the AI "You may only output objects that have the label fruit"? Or is there still a security risk?
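The safe version of that "only output fruit" rule is a metadata filter enforced in the retrieval query itself, not an instruction in the prompt. A toy sketch of the idea (the data and function names are made up; real stores such as Pinecone expose a similar `filter` argument on queries):

```python
# Toy food "database" with a metadata label on each record.
FOODS = [
    {"name": "apple",  "label": "fruit"},
    {"name": "banana", "label": "fruit"},
    {"name": "steak",  "label": "meat"},
]

def retrieve(keyword="", label=None):
    """Return matching records, hard-filtered by label before the LLM sees them."""
    results = [f for f in FOODS if keyword in f["name"]]
    if label is not None:
        # Server-side filter: non-matching rows never enter the LLM's context.
        results = [f for f in results if f["label"] == label]
    return results

# The application (not the end user) pins label="fruit", so even a hijacked
# prompt can't make the model leak "steak" -- it was never retrieved.
fruit_only = retrieve(label="fruit")
```

If the filter lives only in the prompt ("please don't mention non-fruit"), a prompt injection can override it; if it lives in the query, there's nothing to override.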
Okay, what if there is no prompting done by the end user? What if we as a company control all the prompting? Are there still security risks? The only risk I'm aware of is prompt injection / prompt hacking. Are there other risks to building a vectorstore? We want to host on Azure AI so that our data isn't shared with OpenAI (they have different policies).
I still think it's risky and would most likely run afoul of regulations like the EU GDPR. Although I'm not exactly sure why you'd store those types of information together in one index.
You mentioned you're worried about someone breaking the prompt and having the model reveal information about your users. But you shouldn't be including data about customers in an index that other customers can also query.
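Concretely, the safer pattern is one partition (namespace or index) per customer, with the partition chosen from the authenticated session rather than from anything in the prompt. A toy sketch, with all names invented:

```python
# Toy multi-tenant retrieval: the namespace is derived from the
# authenticated customer, never from user input or the prompt text.

DOCS_BY_TENANT = {
    "customer-a": ["A's invoice history", "A's support tickets"],
    "customer-b": ["B's invoice history"],
}

def retrieve_for_session(session):
    """Look up docs only in the namespace tied to the logged-in customer."""
    namespace = f"customer-{session['customer_id']}"  # from auth, not the prompt
    return DOCS_BY_TENANT.get(namespace, [])

# Customer A's session can only ever pull from customer A's partition;
# no prompt wording can redirect the lookup to customer B.
docs = retrieve_for_session({"customer_id": "a"})
```

With this layout, "prompt hacking" can at worst mangle the answer over a customer's own documents, not exfiltrate someone else's.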