I'm a startup founder, and we're preparing to set up our vectorstore to train an LLM. I'm curious how we should set up our server architecture to separate some of our data and make sure the LLM can't access it in case of prompt hacking. At the same time, we want the LLM to be able to retrieve answers and generate responses from the vectorstore that will host our customers' data. Does anyone have advice on structuring the Azure database, or videos or documents I should review?
You shouldn't connect an LLM to an index that hosts data you don't want accessed, unless you use an approach like Pinecone namespaces. If a prompt is the only thing stopping someone from accessing the underlying information, that will not work.
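To illustrate why namespaces help: the isolation happens at the query layer, before anything ever reaches the model. Here's a toy in-memory sketch (not a real vector DB client; the class and record names are invented for illustration) of why a namespace-scoped query can never return another namespace's records:

```python
from collections import defaultdict

class ToyVectorStore:
    """Toy stand-in for a namespaced vector store (e.g. Pinecone namespaces).

    Records live in separate buckets per namespace, and a query runs against
    exactly one bucket, so other namespaces are unreachable by construction.
    No prompt can widen the search.
    """

    def __init__(self):
        self._namespaces = defaultdict(dict)  # namespace -> {id: text}

    def upsert(self, namespace, record_id, text):
        self._namespaces[namespace][record_id] = text

    def query(self, namespace, keyword):
        # Only the caller-supplied namespace is searched.
        bucket = self._namespaces[namespace]
        return [text for text in bucket.values() if keyword in text]

store = ToyVectorStore()
store.upsert("public-docs", "d1", "pricing page: plans start at $10")
store.upsert("internal", "s1", "secret roadmap: acquire CompetitorCo")

# The retrieval layer pins the namespace; the LLM only ever sees these hits.
# A malicious prompt asking about the roadmap can't reach "internal",
# because the query never touches that bucket.
hits = store.query("public-docs", "pricing")
```

The point is that the boundary is enforced in code you control, not in instructions the model is asked to follow.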
Can you control the output, though? If you ask me for a fruit, and I have a database of all sorts of food, can I tell the AI "You may only output objects that have the label fruit"? Or is there still a security risk?
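The safe version of that "only output fruit" rule is a metadata filter enforced in the retrieval query itself, not an instruction in the prompt. A toy sketch of the idea (the data and function names are made up; real stores such as Pinecone expose a similar `filter` argument on queries):

```python
# Toy food "database" with a metadata label on each record.
FOODS = [
    {"name": "apple",  "label": "fruit"},
    {"name": "banana", "label": "fruit"},
    {"name": "steak",  "label": "meat"},
]

def retrieve(keyword="", label=None):
    """Return matching records, hard-filtered by label before the LLM sees them."""
    results = [f for f in FOODS if keyword in f["name"]]
    if label is not None:
        # Server-side filter: non-matching rows never enter the LLM's context.
        results = [f for f in results if f["label"] == label]
    return results

# The application (not the end user) pins label="fruit", so even a hijacked
# prompt can't make the model leak "steak" -- it was never retrieved.
fruit_only = retrieve(label="fruit")
```

If the filter lives only in the prompt ("please don't mention non-fruit"), a prompt injection can override it; if it lives in the query, there's nothing to override.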
Okay, what if there is no prompting done by the end user? What if we as a company control all the prompting? Are there still security risks? The only risk I'm aware of is prompt injection / prompt hacking. Are there other risks to building a vectorstore? We want to host on Azure AI so that our data isn't shared with OpenAI (they have different policies).
I still think it's risky and would most likely run afoul of regulations like the EU GDPR. Although I'm not exactly sure why you'd store those types of information together in one index.
You mentioned you're worried about someone breaking the prompt and having the model reveal information about your users. But you shouldn't be including data about customers in an index that other customers can also query.
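Concretely, the safer pattern is one partition (namespace or index) per customer, with the partition chosen from the authenticated session rather than from anything in the prompt. A toy sketch, with all names invented:

```python
# Toy multi-tenant retrieval: the namespace is derived from the
# authenticated customer, never from user input or the prompt text.

DOCS_BY_TENANT = {
    "customer-a": ["A's invoice history", "A's support tickets"],
    "customer-b": ["B's invoice history"],
}

def retrieve_for_session(session):
    """Look up docs only in the namespace tied to the logged-in customer."""
    namespace = f"customer-{session['customer_id']}"  # from auth, not the prompt
    return DOCS_BY_TENANT.get(namespace, [])

# Customer A's session can only ever pull from customer A's partition;
# no prompt wording can redirect the lookup to customer B.
docs = retrieve_for_session({"customer_id": "a"})
```

With this layout, "prompt hacking" can at worst mangle the answer over a customer's own documents, not exfiltrate someone else's.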