ChatGPT remembers??

I just discovered something weird, maybe someone can share some insights...

I indexed 3 different knowledge bases with GPTSimpleVectorIndex; let's call them index1, index2 & index3.

My goal is to build a graph on top but for now I checked them separately. I use gpt-3.5-turbo as my model of choice.

llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=2000))

So I asked the same question querying index1, then index2, followed by index3. The answer I got at the end led me to believe that the bot remembered the last 2 queries!

Even though I did not use any memory object (as in LangChain), the bot knew that I had already asked this question and that it had received new information since I queried another index.

Question: Does that mean that if my bot uses one OpenAI API key for everything, questions and answers from different users may bleed into one another? User A asks question 1, then user B asks question 2, but the answer takes question 1 (and answer 1) into consideration...
I'm not even sure if that is a bad thing, but how can I avoid this?
19 comments
That's... super weird?

What makes you think it was remembering?
The answer. I queried the same question multiple times. I'm playing around with 9 indexes, but they represent basically 3 docs indexed once with vector, once with tree and once with keyword. I queried them all with the same query and one of the results was something like "I apologize for my last response, with new information I just received...."
I'll try to reproduce it
that might have something to do with the refine process πŸ€”
I don't think I used it. Here's the "head" of my code, dunno what to call this:
# imports (module paths as of the llama_index / langchain versions used here)
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from llama_index import LangchainEmbedding, LLMPredictor, PromptHelper, ServiceContext
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser

chunk_len = 1024
chunk_overlap = 50

embed_model = LangchainEmbedding(OpenAIEmbeddings(query_model_name="text-embedding-ada-002"))
splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False)
llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=2000))
prompt_helper_gpt3 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt3)
service_context_gpt3 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt3, prompt_helper=prompt_helper_gpt3, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)
(I went up on the chunk_len)
Do you have any insights on the chunk_len and max_tokens I chose with gpt-3.5-turbo? Does that even make sense 😄
I think this will likely activate the refine process, but the parameters in general look fine (max_tokens seems a little big, but that's up to you lol)

Anytime a query retrieves nodes whose text does not all fit into a single LLM call, the text is split and the answer is refined across a few LLM calls.
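
Roughly, the flow looks like this (a simplified sketch of the idea, not the actual library code; llm here just stands for whatever predictor your service context holds):

def create_and_refine(llm, query, text_chunks):
    # first chunk: answer the query from scratch
    answer = llm(f"Context: {text_chunks[0]}\nAnswer the query: {query}")
    # every further chunk: ask the LLM to update its previous answer
    for chunk in text_chunks[1:]:
        answer = llm(
            f"Original query: {query}\n"
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            "Refine the existing answer using the new context, only if needed."
        )
    return answer

That "existing answer" step is most likely what you saw: a reply like "I apologize for my last response, with new information I just received..." is the model reacting to the refine prompt inside a single query, not to your earlier queries against the other indexes.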

gpt-3.5 has been pretty bad at this lately tbh.

I've been working on a new refine prompt that might help you out. Here's an example of how to use it.

Plain Text
from langchain.prompts.chat import (
    AIMessagePromptTemplate,
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

from llama_index.prompts.prompts import RefinePrompt

# Refine Prompt
CHAT_REFINE_PROMPT_TMPL_MSGS = [
    HumanMessagePromptTemplate.from_template("{query_str}"),
    AIMessagePromptTemplate.from_template("{existing_answer}"),
    HumanMessagePromptTemplate.from_template(
        "I have more context below which can be used "
        "(only if needed) to update your previous answer.\n"
        "------------\n"
        "{context_msg}\n"
        "------------\n"
        "Given the new context, update the previous answer to better "
        "answer my previous query."
        "If the previous answer remains the same, repeat it verbatim. "
        "Never reference the new context or my previous query directly.",
    ),
]


CHAT_REFINE_PROMPT_LC = ChatPromptTemplate.from_messages(CHAT_REFINE_PROMPT_TMPL_MSGS)
CHAT_REFINE_PROMPT = RefinePrompt.from_langchain_prompt(CHAT_REFINE_PROMPT_LC)
...
index.query("my query", similarity_top_k=3, refine_template=CHAT_REFINE_PROMPT)
nice, will give this a shot...
Btw, I just tried a tree index with the same docs and got this error:
"This model's maximum context length is 4097 tokens. However, you requested 4838 tokens (2838 in the messages, 2000 in the completion). Please reduce the length of the messages or completion."

with max_tokens=2000, chunk_len = 1000 & chunk_overlap = 50. I don't get it, I thought indexing chunked the text automatically under the hood.
Yea, the math starts to get complicated when max_tokens is so big. Tbh I would either lower the max_tokens or lower the chunk length.

LlamaIndex has to leave 2000 tokens of room for the completion in every prompt it builds. When the max input size is already 4097 and the chunk length is 1000, that gets too restrictive.
So chunk size 1000 and max_tokens 1000 makes sense? Is there some kind of formula I can use as a guideline, like chunk_size + max_tokens <= max_input_size?
I think a safer formula is (chunksize * 2) + max_tokens + 100 <= max_input_size
The tree index has to summarize two nodes. Plus some buffer for the prompt templates
I think you can get away with a max_tokens of ~500 in most cases tbh.
Unless your use case requires super long outputs
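
To make that concrete with the numbers from your error (gpt-3.5-turbo's window is 4097 tokens):

max_input_size = 4097   # gpt-3.5-turbo context window
chunk_len = 1000
print((chunk_len * 2) + 2000 + 100)  # 4100 > 4097 -> too tight with max_tokens=2000
print((chunk_len * 2) + 500 + 100)   # 2600 <= 4097 -> plenty of room with max_tokens=500
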
one last Q πŸ™‚

I'm trying stuff out like this, but the difference between my gpt3 vs gpt4 service_context is only relevant for querying the index and prompting the LLM; for indexing it's the same, right?:


chunk_len = 500
chunk_overlap = 50

embed_model = LangchainEmbedding(OpenAIEmbeddings(query_model_name="text-embedding-ada-002"))
splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False)

llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=1000))
prompt_helper_gpt3 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt3)
service_context_gpt3 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt3, prompt_helper=prompt_helper_gpt3, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)

llm_predictor_gpt4 = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name='gpt-4',max_tokens=1000))
prompt_helper_gpt4 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt4)
service_context_gpt4 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt4, prompt_helper=prompt_helper_gpt4, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)
The LLM (gpt3 or gpt4) is used during index construction for Tree and KG indexes. List and Vector indexes do not use it.

(All of them use the LLM during queries)
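
So for your GPTSimpleVectorIndex you can build once and query with either service context. A rough sketch with the same llama_index API used above ("data" and the file names are just placeholders):

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Build the vector index once -- only the embed model is used at construction time.
documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context_gpt3)
index.save_to_disk("index1.json")

# Reload with the gpt-4 service context so queries are answered with gpt-4 instead.
index_gpt4 = GPTSimpleVectorIndex.load_from_disk("index1.json", service_context=service_context_gpt4)
response = index_gpt4.query("my query", similarity_top_k=3)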