Updated 2 years ago

ChatGPT remembers??

I just discovered something weird, maybe someone can share some insights...

I indexed 3 different knowledge bases with GPTSimpleVectorIndex, let's call them index1, index2 & index3.

My goal is to build a graph on top but for now I checked them separately. I use gpt-3.5-turbo as my model of choice.

llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=2000))

So I asked the same questions querying index1, then index2 followed by index3. The answer I got at the end led me to understand that the bot remembered the last 2 queries!

Even though I did not use any memory object (as in langchain), the bot knew that I already asked this question and that it received new information since I queried another index.

Question: Does that mean that if my bot uses one OpenAI API key for everything, questions and answers from different users may bleed into one another? User A asks question 1, then user B asks question 2, but the answer takes question 1 (and answer 1) into consideration....
I'm not even sure if that is a bad thing, but how can I avoid this?
19 comments
That's... super weird?

What makes you think it was remembering?
The answer. I queried the same question multiple times. I'm playing around with 9 indexes, but they basically represent 3 docs, each indexed once with vector, once with tree and once with keyword. I queried them all with the same query and one of the results was something like "I apologize for my last response, with new information I just received...."
I'll try to reproduce it
that might have something to do with the refine process 🤔
I don't think I used it. Here's the "head" code (not sure what to call this):
chunk_len = 1024
chunk_overlap = 50

embed_model = LangchainEmbedding(OpenAIEmbeddings(query_model_name="text-embedding-ada-002"))
splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False)
llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=2000))
prompt_helper_gpt3 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt3)
service_context_gpt3 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt3, prompt_helper=prompt_helper_gpt3, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)
(went up in chunk len)
do you have any insights on the chunk_len and max_tokens I chose with gpt-3.5-turbo? does that even make sense 😄
I think this will likely activate the refine process, but the parameters in general look fine (max_tokens seems a little big, but that's up to you lol)

Anytime a query retrieves nodes whose combined text does not fit into a single LLM call, the text is split and the answer is refined across several LLM calls, with each call receiving the previous answer as part of its prompt.
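To make the refine process concrete, here is a rough sketch of the loop in plain Python. It is not LlamaIndex's actual implementation: `refine_answer`, `fake_llm` and the prompt wording are all made up for illustration. The point it shows is that the "memory" is just the previous answer being pasted into the next prompt within a single query, which is likely where replies like "I apologize for my last response" come from.

```python
from typing import Callable, List


def refine_answer(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then refine once per remaining chunk."""
    if not chunks:
        raise ValueError("need at least one chunk of context")
    # First call: a normal question-answering prompt over the first chunk.
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {query}")
    # Later calls: the previous answer is pasted back into the prompt.
    for chunk in chunks[1:]:
        answer = llm(
            f"Question: {query}\n"
            f"Existing answer: {answer}\n"
            f"New context:\n{chunk}\n"
            "Update the existing answer only if the new context helps."
        )
    return answer


# Hypothetical stand-in for a chat model: records each prompt it receives.
prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"answer v{len(prompts_seen)}"


final = refine_answer("What is X?", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(final)              # the answer after the last refine step
print(len(prompts_seen))  # one LLM call per chunk
```

Nothing here is carried over between separate calls to `refine_answer`; the earlier answer only exists inside the prompts built during one query.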

gpt-3.5 has been pretty bad at this lately tbh.

I've been working on a new refine prompt that might help you out. Here's an example on how to use it.

from langchain.prompts.chat import (
    AIMessagePromptTemplate,
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

from llama_index.prompts.prompts import RefinePrompt

# Refine Prompt
CHAT_REFINE_PROMPT_TMPL_MSGS = [
    HumanMessagePromptTemplate.from_template("{query_str}"),
    AIMessagePromptTemplate.from_template("{existing_answer}"),
    HumanMessagePromptTemplate.from_template(
        "I have more context below which can be used "
        "(only if needed) to update your previous answer.\n"
        "------------\n"
        "{context_msg}\n"
        "------------\n"
        "Given the new context, update the previous answer to better "
        "answer my previous query. "
        "If the previous answer remains the same, repeat it verbatim. "
        "Never reference the new context or my previous query directly.",
    ),
]


CHAT_REFINE_PROMPT_LC = ChatPromptTemplate.from_messages(CHAT_REFINE_PROMPT_TMPL_MSGS)
CHAT_REFINE_PROMPT = RefinePrompt.from_langchain_prompt(CHAT_REFINE_PROMPT_LC)
...
index.query("my query", similarity_top_k=3, refine_template=CHAT_REFINE_PROMPT)
nice, will give this a shot...
btw, just tried treeindex with same docs and got this error:
"This model's maximum context length is 4097 tokens. However, you requested 4838 tokens (2838 in the messages, 2000 in the completion). Please reduce the length of the messages or completion."

with max_tokens=2000, chunk_len = 1000 & chunk_overlap = 50. I don't get it, I thought indexing handled chunking automatically under the hood.
Yea the math starts to get complicated when max_tokens is so big. Tbh I would either lower the max tokens or lower the chunk length.

LlamaIndex has to ensure that there are always 2000 tokens of room left for the completion. When the max input size is only 4097 and the chunk length is 1000, this gets too restrictive.
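For reference, the arithmetic behind that error message can be checked with a couple of lines. This is only a sketch using the numbers quoted in the error above; `fits_context` and `prompt_budget` are hypothetical helper names, not part of any library.

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int = 4097) -> bool:
    """True if the prompt plus the reserved completion fit in the context window."""
    return prompt_tokens + max_tokens <= context_window


def prompt_budget(max_tokens: int, context_window: int = 4097) -> int:
    """Tokens left for the prompt once max_tokens is reserved for the completion."""
    return context_window - max_tokens


# Numbers from the error: 2838 tokens in the messages, 2000 reserved for the completion.
print(fits_context(2838, 2000))  # False: 4838 requested > 4097 available
print(prompt_budget(2000))       # 2097 tokens left for the whole prompt
print(prompt_budget(500))        # 3597 tokens left with a smaller max_tokens
```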
so chunksize 1000 and max_tokens 1000 makes sense? is there some kind of formula I can use as a guideline, like chunksize + max_tokens <= max input size?
I think a safer formula is (chunksize * 2) + max_tokens + 100 <= max_input_size
The tree index has to summarize two nodes. Plus some buffer for the prompt templates
I think you can get away with a max_tokens of ~500 in most cases tbh.
Unless your use case requires super long outputs
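The rule of thumb above is easy to turn into a quick check. This is just a sketch of that guideline, not something LlamaIndex enforces in this exact form; the function name `tree_index_budget_ok` and its parameter names are made up, and the 100-token template buffer is the rough estimate from the formula.

```python
def tree_index_budget_ok(
    chunk_size: int,
    max_tokens: int,
    max_input_size: int = 4097,
    template_buffer: int = 100,
) -> bool:
    """Rule of thumb: two chunks + completion + prompt-template buffer must fit."""
    return chunk_size * 2 + max_tokens + template_buffer <= max_input_size


# Settings that produced the error: 2000 + 2000 + 100 = 4100 > 4097.
print(tree_index_budget_ok(chunk_size=1000, max_tokens=2000))  # False
# Smaller completion budget with the same chunk size: 2600 <= 4097.
print(tree_index_budget_ok(chunk_size=1000, max_tokens=500))   # True
# Smaller chunks and a 1000-token completion: 2100 <= 4097.
print(tree_index_budget_ok(chunk_size=500, max_tokens=1000))   # True
```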
one last Q 🙂

I'm trying stuff out like this. The difference between my gpt3 and gpt4 service_context only matters for querying the index and prompting the LLM, but for indexing it's the same, right?


chunk_len = 500
chunk_overlap = 50

embed_model = LangchainEmbedding(OpenAIEmbeddings(query_model_name="text-embedding-ada-002"))
splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False)

llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=1000))
prompt_helper_gpt3 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt3)
service_context_gpt3 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt3, prompt_helper=prompt_helper_gpt3, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)

llm_predictor_gpt4 = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name='gpt-4',max_tokens=1000))
prompt_helper_gpt4 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt4)
service_context_gpt4 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt4, prompt_helper=prompt_helper_gpt4, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)
The LLM (gpt3 or gpt4) is used during index construction for Tree and KG indexes. List and Vector indexes do not use it.

(All of them use the LLM during queries)