Updated 8 months ago

Hi, I'm not sure I understand how QP performance works

At a glance

The community member is trying to understand why the runtime of an LLM (Large Language Model) call, "llm1", varies between two different programs, with the first program taking 1 minute and the second taking 3 minutes. They provide code snippets showing the setup of a Query Pipeline (QP) with various components, including llm1.

In the comments, another community member suggests that the runtime difference could be due to factors like cache hits, model reloading, or other processes using compute resources and slowing down the LLM calls. The original poster then explores these possibilities, asking about the impact of adding more modules on cache hit rate, model reloading, and potential pre-loading of modules.

Ultimately, the community members conclude that the runtime difference is likely due to cache hits, where the addition of more modules clears the temporary cache used for the first LLM, leading to the longer runtime in the second program.

Hi, I'm not sure I understand how QP (Query Pipeline) performance works.

Indeed:
First program: LLM1 takes 1 min.
Second program: LLM1 takes 3 min.
Plain Text
    def get_query_pipeline(self):
        """Create & Return the Query Pipeline of database generation"""

        qp = QP(
            modules={
                "input": InputComponent(),
                "process_retriever": self.process_retriever_component,
                "table_creation_prompt": self.table_creation_prompt,
                "llm1": self.llm1,
                "python_output_parser": self.python_parser_component,
            },
            verbose=True,
        )

        qp.add_link("input", "process_retriever")
        qp.add_link("input", "table_creation_prompt", dest_key="query_str")
        qp.add_link(
            "process_retriever", "table_creation_prompt", dest_key="retrieved_nodes"
        )

        qp.add_chain(["table_creation_prompt", "llm1", "python_output_parser"])

        return qp

VS
Plain Text
    def get_query_pipeline(self):
        """Create & Return the Query Pipeline of database generation"""

        qp = QP(
            modules={
                "input": InputComponent(),
                "process_retriever": self.process_retriever_component,
                "table_creation_prompt": self.table_creation_prompt,
                "llm1": self.llm1,
                "python_output_parser": self.python_parser_component,
                "table_insert_prompt": self.table_insert_prompt,
                "llm2": self.llm1,
                "python_output_parser1": self.python_parser_component,
            },
            verbose=True,
        )

        qp.add_link("input", "process_retriever")
        qp.add_link("input", "table_creation_prompt", dest_key="query_str")
        qp.add_link(
            "process_retriever", "table_creation_prompt", dest_key="retrieved_nodes"
        )

        qp.add_chain(["table_creation_prompt", "llm1", "python_output_parser"])
        
        ...
        return qp
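A minimal way to time each variant end-to-end from the calling side might look like the sketch below; the builder object and the query string are placeholders, not something from the snippets above.
Python
import time

qp = builder.get_query_pipeline()  # either of the two variants above

start = time.perf_counter()
# The keyword just feeds the InputComponent; match whatever key the pipeline expects.
result = qp.run(input="Generate the tables for the target database")
print(f"full pipeline run: {time.perf_counter() - start:.1f} s")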
8 comments
How are you measuring the runtime of llm1?
What LLM are you using?
I'm using Ollama to serve the LLM.

And in ollama serve, you have access to logs where the runtime of each API call to the LLM is displayed.

-> I'm using Llama 3.
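Those same durations are also returned in the API response itself; a minimal sketch for pulling them out (assuming Ollama on the default localhost:11434 and the llama3 model tag):
Python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Say hello.", "stream": False},
).json()

# All durations are reported in nanoseconds.
print("load_duration       :", resp.get("load_duration", 0) / 1e9, "s")  # time spent (re)loading the model
print("prompt_eval_duration:", resp.get("prompt_eval_duration", 0) / 1e9, "s")
print("eval_duration       :", resp.get("eval_duration", 0) / 1e9, "s")
print("total_duration      :", resp.get("total_duration", 0) / 1e9, "s")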
Ah, with Ollama, it could be a few things:
  • cache hits
  • model reloading due to inactivity (see the keep_alive sketch below)
  • other processes using compute and slowing down llm calls
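To rule out the reloading case, Ollama's keep_alive request option (a standard API parameter, default around five minutes) can pin the model in memory. Whether the llama-index Ollama wrapper exposes it directly depends on the version, so the raw API form is sketched here:
Python
import requests

# An empty generate request just loads the model; keep_alive=-1 keeps it
# resident until Ollama is restarted (a duration string like "30m" also works).
requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "keep_alive": -1},
)
Running `ollama ps` on the command line then shows which models are currently loaded and when they are due to be unloaded.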
Thanks for your answer :D, I'll try again several times, both with and without the extra modules.

I'm kind of interested in how it works behind the scenes...

  • Cache hits: Is the cache hit rate lower with more modules?
  • Model reloading: But having more modules shouldn't affect the performance of the first LLM much in this case, should it?
  • Other processes: Is there some kind of pre-loading of each module?
Alright, I think I figured out what is happening. It's probably just cache hits, as you said: with more modules, the temp cache used for the first LLM gets cleared.

Thank you so much x)
Great! πŸ’ͺ
Yea ollama does a lot of caching and optimizations
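A rough way to see that caching outside the pipeline (llama3 and a default local Ollama assumed; the prompts are placeholders): repeat a prompt, then interleave a different one, and watch prompt_eval_duration change.
Python
import requests

def prompt_eval_seconds(prompt: str) -> float:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
    ).json()
    return r.get("prompt_eval_duration", 0) / 1e9  # prompt-processing time only

print("first call   :", prompt_eval_seconds("long table-creation prompt ..."))
print("repeat       :", prompt_eval_seconds("long table-creation prompt ..."))    # typically far cheaper (cache hit)
print("other prompt :", prompt_eval_seconds("a different table-insert prompt ..."))
print("repeat again :", prompt_eval_seconds("long table-creation prompt ..."))    # the cost tends to come back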