Hi everyone, I am playing around with the Claude model using Bedrock in LlamaIndex, like this: Bedrock(model="anthropic.claude-v2"). When I check the generated prompts using token_counter.llm_token_counts, I notice there are two duplicate prompt events. Does anyone have any idea why we send two identical back-to-back prompt calls to the LLM?
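(Roughly how I am checking, for reference — the prompt and prompt_token_count fields below assume LlamaIndex's TokenCountingEvent shape:)
Plain Text
# Sketch: print each recorded LLM event to eyeball duplicates.
for event in token_counter.llm_token_counts:
    print(event.prompt_token_count, event.prompt[:80])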
Are you sure they are duplicates?
sounds like a bug in the token counter maybe?
yes, the exact same
is it a bug in the token counter, or in the way we generate the prompt? Because I also track prompts, and it seems like we send two calls
What does your setup/query look like?
If it is actually duplicating calls, I should be able to replicate using your setup pretty easily
Plain Text
import tiktoken
from llama_index import ServiceContext, set_global_service_context
from llama_index.callbacks import TokenCountingHandler, CallbackManager
from llama_index.llms import Bedrock


class QA():
    def __init__(self):
        self.CHUNK_SIZE = 256
        self.MODEL_NAME = "gpt-3.5-turbo"
        self.NUM_OF_PRODUCT_REQUIRED_PAGES = 5
        self.SIMILARITY_TOP_K = 2

    def get_model(self):
        token_counter = TokenCountingHandler(
            tokenizer=tiktoken.encoding_for_model(self.MODEL_NAME).encode
        )
        callback_manager = CallbackManager([token_counter])
        llm_predictor = Bedrock(model="anthropic.claude-v2", profile_name='ABC')

        # llm_predictor = OpenAI(temperature=0, model=self.MODEL_NAME)
        # service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size=self.CHUNK_SIZE)
        service_context = ServiceContext.from_defaults(llm=llm_predictor, chunk_size=self.CHUNK_SIZE, callback_manager=callback_manager)
        set_global_service_context(service_context)
        return service_context, token_counter
and then you just used that in a vector index query engine I suppose?
@Logan M yes that is correct
is it a bug you think?
I track calls to LLMs, there are two exact same prompts
How did you setup the query engine? Just index.as_query_engine()?

I'd love to reproduce this but so far, I cannot

Plain Text
from llama_index.callbacks import TokenCountingHandler, CallbackManager
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs/examples/data/paul_graham").load_data()

service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([TokenCountingHandler()]), chunk_size=256
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine(similarity_top_k=2)

response = query_engine.query("What is the best way to raise money for a startup?")
This only calls the LLM once
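(For completeness, the way I verified the call count: the same setup, just keeping a reference to the handler so the recorded events can be counted:)
Plain Text
from llama_index.callbacks import TokenCountingHandler, CallbackManager
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# Identical repro, but the handler is named so its events can be inspected.
token_counter = TokenCountingHandler()
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([token_counter]), chunk_size=256
)

documents = SimpleDirectoryReader("./docs/examples/data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What is the best way to raise money for a startup?")

print(len(token_counter.llm_token_counts))  # -> 1 in my run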
@Logan M tried again with this code:
Plain Text
import tiktoken
from llama_index import Document, ServiceContext, VectorStoreIndex, set_global_service_context
from llama_index.callbacks import TokenCountingHandler, CallbackManager
from llama_index.llms import Bedrock

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)
callback_manager = CallbackManager([token_counter])
llm_predictor = Bedrock(model="anthropic.claude-v2")

service_context = ServiceContext.from_defaults(llm=llm_predictor, chunk_size=256, callback_manager=callback_manager)
set_global_service_context(service_context)

pages_text = ['Investors write checks when the idea they hear is compelling, when they are persuaded that the team of founders can realize its vision, and that the opportunity described is real and sufficiently large. When founders are ready to tell this story, they can raise money. And usually when you can raise money, you should.']
documents = [Document(text=t) for t in pages_text]

custom_llm_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
custom_llm_query_engine = custom_llm_index.as_query_engine(similarity_top_k=2)
question = "how to raise money for a startup"
response = custom_llm_query_engine.query(question)
token_counter.llm_token_counts
I see two calls to the LLM
this is my output: a list of two identical events
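(Comparing the two entries directly — again assuming the TokenCountingEvent fields — they are character-for-character the same:)
Plain Text
# Sketch: confirm the two recorded events are true duplicates.
first, second = token_counter.llm_token_counts
print(first.prompt == second.prompt)                        # True
print(first.prompt_token_count, second.prompt_token_count)  # same count twice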
wow this one is a doozy

Stepped through with a debugger, added a ton of print statements

I can confirm that the API is only called once, but the actual event is getting logged twice (for reasons that are very complex)
I put a print statement just before the API call in the Bedrock LLM class; it only gets hit once
I am using Arize AI and it shows two calls
yes because the callback is getting called twice
(hence the two token counting events)
If you debug further into the actual code, or enable debug logs for the bedrock client (if it has any), you'd see only one network request being made
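(For example, since the Bedrock LLM goes through boto3/botocore under the hood, a sketch like this would surface every request it actually sends over the wire:)
Plain Text
import logging
import boto3

# Log every HTTP request botocore makes; one POST to the Bedrock
# endpoint per query means one real LLM call.
boto3.set_stream_logger("botocore", logging.DEBUG)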
I wrote our entire callback system, I can state this with 100% confidence
the callback is just being called twice, for a pretty complex reason that I guess I need to fix at some point
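(If the duplicate entries get in your way meanwhile, a rough workaround sketch — assuming each TokenCountingEvent carries the event_id of the callback event that produced it, so the duplicated callback yields two entries with the same id — is to dedupe on that id:)
Plain Text
# Hypothetical workaround: collapse duplicate entries that share an
# event_id before reporting token usage.
seen = set()
unique_events = []
for event in token_counter.llm_token_counts:
    if event.event_id not in seen:
        seen.add(event.event_id)
        unique_events.append(event)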