Updated 5 months ago

Privacy

hey y'all, I just have kind of a dumb newbie question regarding data privacy and LlamaIndex: what, if any, private info is exposed when you send an index + query to an LLM? Say I wanted to analyze my private health care data, would I be at any risk of exposing personal info? Is this dependent on the language model used? Thanks in advance!

11 comments

Logan M

Yeah, with default settings, all data is sent to OpenAI over their API

So at that point you are subject to their privacy policies

You can definitely run a local LLM and embedding model, but you'll need some powerful hardware to run them

Logan M

(And also, open-source models aren't that great yet 🥲)

krieger_

dang okay i sort of figured that would be the case

krieger_

I'm just confused: if I'm using something like a simplevectorindex, doesn't that construct the index locally without sending anything to the LLM? So the only data the LLM gets is the index itself, right?

krieger_

also thanks for answering me! love how active this community is

Logan M

When you construct the index, you still need to generate embeddings for the data, so the data gets sent to the embed model (which is OpenAI by default)

Embedding models are easier to run locally though, if that helps
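To make that data flow concrete, here is a rough, library-free sketch. It is not LlamaIndex code: `RecordingEmbedder` and `ToyVectorIndex` are hypothetical names, and the bag-of-words "embedding" is only a stand-in for a real model. The point is that the embed model receives the plain text of every chunk at build time and the query text at query time, so a hosted embed model sees all of it.

```python
import math
from collections import Counter


class RecordingEmbedder:
    """Hypothetical stand-in for an embed model; logs every text it is given."""

    def __init__(self):
        self.seen = []  # everything a hosted embedding provider would receive

    def embed(self, text):
        self.seen.append(text)
        # Word counts stand in for a real embedding vector.
        return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical toy index: stores (chunk, vector) pairs in memory."""

    def __init__(self, chunks, embedder):
        self.embedder = embedder
        # Index construction: each chunk's plain text goes to the embed model.
        self.entries = [(chunk, embedder.embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=1):
        # Query time: the query text goes to the embed model as well.
        query_vec = self.embedder.embed(query)
        ranked = sorted(
            self.entries, key=lambda entry: cosine(query_vec, entry[1]), reverse=True
        )
        return [chunk for chunk, _ in ranked[:top_k]]


chunks = ["patient reports mild asthma", "invoice for dental cleaning"]
embedder = RecordingEmbedder()
index = ToyVectorIndex(chunks, embedder)
top = index.retrieve("asthma symptoms")

# Both chunks and the query were handed to the embed model.
print(embedder.seen)
```

If the embedder runs on your own machine, that log never leaves it; if it is a hosted API, everything in `seen` is what the provider receives.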

Logan M

Custom (local) embeddings
https://gpt-index.readthedocs.io/en/latest/how_to/customization/embeddings.html#custom-embeddings

Logan M

Local LLM
https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-huggingface-llm

Logan M

The embed model and the LLM both eventually read the plain text of the data you indexed

Logan M

If you only need retrieval, you can set response_mode="no_text" to get back just the matching nodes, without sending anything to the LLM.

This still requires an embed model, but you could run that locally as linked above (it might still complain about a missing OpenAI key; if so, just set the key to a random string)

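# `index` is the index you already built
query_engine = index.as_query_engine(response_mode="no_text")
response = query_engine.query("query")
# source_nodes holds the retrieved nodes; the LLM is not called in this mode
print(response.source_nodes)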
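As a rough illustration of what retrieval-only mode changes, here is a small library-free sketch. It is not how LlamaIndex is implemented: `RecordingLLM` and `run_query` are hypothetical, and the prompt format is made up. It only shows the idea that in "no_text" mode the retrieved text is returned to you and never placed in an LLM prompt, while the default path puts the retrieved plain text into the prompt.

```python
class RecordingLLM:
    """Hypothetical stand-in for an LLM; logs every prompt it receives."""

    def __init__(self):
        self.prompts = []  # everything a hosted LLM provider would receive

    def complete(self, prompt):
        self.prompts.append(prompt)
        return "(answer)"


def run_query(question, retrieved_nodes, llm, response_mode="default"):
    """Hypothetical query step: skip the LLM entirely when response_mode is 'no_text'."""
    if response_mode == "no_text":
        # Retrieval only: hand back the nodes, make no LLM call.
        return {"response": None, "source_nodes": retrieved_nodes}
    # Otherwise the retrieved plain text is embedded in the prompt.
    prompt = "Context:\n" + "\n".join(retrieved_nodes) + "\nQuestion: " + question
    return {"response": llm.complete(prompt), "source_nodes": retrieved_nodes}


nodes = ["patient reports mild asthma"]
llm = RecordingLLM()

retrieval_only = run_query("any asthma?", nodes, llm, response_mode="no_text")
prompts_after_no_text = list(llm.prompts)  # snapshot: still empty here

full = run_query("any asthma?", nodes, llm)  # this call does reach the LLM

print(prompts_after_no_text)
print(llm.prompts)
```

So with a local embed model plus retrieval-only queries, nothing in this toy pipeline would reach a third party; as soon as a hosted LLM synthesizes the answer, the retrieved text goes with the prompt.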

krieger_

got it, this is super helpful! For some reason in my head I totally ignored the embeddings 🤦
