
Updated 2 months ago

Data

I'm a startup founder, and we're preparing to set up our vectorstore to train an LLM. I'm curious how we should set up our server architecture to separate some of our data and make sure the LLM can't access it in case of prompt hacking. At the same time, we want our LLM to be able to retrieve answers and generate responses from the vectorstore, which will host our customers' data. Does anyone have advice on structuring the Azure Database, or videos or documents I should review?
13 comments
You shouldn't connect an LLM to an index that hosts data you don't want accessed unless you go with an approach like Pinecone namespaces. If a prompt is the only thing stopping someone from accessing the underlying information, that will not work.
Can you control the output though? If you ask me for a fruit, and I have a database of all sorts of food, can I tell the AI "You may only output objects that have the label fruit"? Or is there still a security risk?
With some strict filtering using namespaces/metadata it's possible, but any prompt-based restrictions are a massive security risk.
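To make the difference concrete, here is a minimal sketch of the metadata filtering described above, using the fruit example from the question. It is not tied to Pinecone or Azure: `RECORDS`, `cosine`, and `search` are hypothetical names for a toy in-memory store. The point is only that the label filter runs in application code before anything is handed to the LLM, so no prompt can widen it.

```python
import math

# Hypothetical in-memory store: (embedding, text, metadata) per record.
RECORDS = [
    ([1.0, 0.0], "apple", {"label": "fruit"}),
    ([0.9, 0.1], "banana", {"label": "fruit"}),
    ([0.95, 0.05], "beef jerky", {"label": "meat"}),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, allowed_label, top_k=2):
    """Return the top_k texts whose metadata label matches allowed_label.

    allowed_label is chosen by the server, never parsed from the prompt,
    so records with other labels are dropped before ranking.
    """
    candidates = [r for r in RECORDS if r[2]["label"] == allowed_label]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r[0]), reverse=True)
    return [text for _, text, _ in ranked[:top_k]]


if __name__ == "__main__":
    print(search([1.0, 0.0], "fruit"))
```

Only the results of `search` would be placed in the LLM's context, so the model never sees records outside the allowed label.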
Okay, what if there is no prompting done by the end user? What if we as a company control all the prompting? Are there still security risks? The only risk I'm aware of is prompt injection/prompt hacking. Are there other risks to building a vectorstore? We want to host on AzureAI so that they don't share our data with OpenAI (they have different policies).
I still think it's risky, and it would most likely go against regulations like the EU GDPR. Although I'm not exactly sure why you'd store those types of information together in 1 index.
Also, OAI by default won't train on API inputs
Their policies are relatively similar imo
Azure is just a bit more advanced on security
We won't be in the EU fortunately, but if users opt in/sign a contract I think it will be fine
What do you mean by 'not sure why you would store in 1 index?'
I've read the docs and watched some tutorials, but I'm not an expert. Just want to say I appreciate your help so far
You mentioned you're worried about someone breaking the prompt and having the model give out information from your users. But you shouldn't be putting data about customers into an index that other customers can also query.
You should either have multiple indices or use one of the other solutions I outlined
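As a rough illustration of the "multiple indices" option, the sketch below keeps one index per customer and picks the index from the authenticated tenant ID rather than from anything in the prompt. `TenantStore` and its methods are hypothetical, and the keyword match is only a stand-in for a real vector similarity search.

```python
from collections import defaultdict


class TenantStore:
    """Toy store that keeps a separate index (here, a list) per tenant."""

    def __init__(self):
        self._indices = defaultdict(list)

    def add(self, tenant_id, text):
        """Store a document in the index that belongs to tenant_id only."""
        self._indices[tenant_id].append(text)

    def query(self, tenant_id, keyword):
        """Search only the caller's own index.

        tenant_id should come from the authenticated session on the
        server, so a crafted prompt cannot point at another tenant's data.
        """
        docs = self._indices.get(tenant_id, [])
        return [d for d in docs if keyword.lower() in d.lower()]


if __name__ == "__main__":
    store = TenantStore()
    store.add("acme", "Acme invoice 42")
    store.add("globex", "Globex invoice 7")
    print(store.query("acme", "invoice"))
```

With this layout, a query for one customer has no code path to another customer's documents, whatever the prompt says.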