I'm setting `Settings.callback_manager = callback_manager` in Django, where `callback_manager` just has `llama_debug` and a `TokenCountingHandler`. When I use `SentenceTransformerRerank` directly in my code, it takes up a lot of CPU / RAM.
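One thing I'm considering (a minimal sketch, with a placeholder model name and `top_n`, assuming the reranker can safely be shared across requests) is constructing the reranker once at module level so the cross-encoder model is only loaded once per Django process:

```python
# Sketch: load the cross-encoder once per process instead of per request.
# The model name and top_n are placeholders, not the values from my setup.
from llama_index.core.postprocessor import SentenceTransformerRerank

# Module-level instance: constructing SentenceTransformerRerank loads the
# underlying sentence-transformers model, so doing it once should avoid the
# repeated CPU/RAM spike on every request.
RERANKER = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3,
)

def build_query_engine(index):
    # Reuse the shared reranker as a node postprocessor.
    return index.as_query_engine(node_postprocessors=[RERANKER])
```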
In `predict`, when synthesizing, why is `response = self._llm.complete(formatted_prompt)` used instead of `chat_response = self._llm.chat(messages)`? For reference, here is the `predict` method:
```python
def predict(
    self,
    prompt: BasePromptTemplate,
    output_cls: Optional[BaseModel] = None,
    **prompt_args: Any,
) -> str:
    """Predict."""
    self._log_template_data(prompt, **prompt_args)

    if output_cls is not None:
        output = self._run_program(output_cls, prompt, **prompt_args)
    elif self._llm.metadata.is_chat_model:
        messages = prompt.format_messages(llm=self._llm, **prompt_args)
        messages = self._extend_messages(messages)
        chat_response = self._llm.chat(messages)
        output = chat_response.message.content or ""
    else:
        formatted_prompt = prompt.format(llm=self._llm, **prompt_args)
        formatted_prompt = self._extend_prompt(formatted_prompt)
        response = self._llm.complete(formatted_prompt)
        output = response.text
```
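From that code, the chat vs. complete branch is chosen from the LLM's metadata, so I can check which path a given LLM will take (a minimal sketch; the OpenAI model name is only an example):

```python
# Sketch: checking which branch predict() will take for a given LLM.
# The OpenAI model name here is just an example.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

# True  -> predict() formats chat messages and calls llm.chat(...)
# False -> predict() formats a plain prompt and calls llm.complete(...)
print(llm.metadata.is_chat_model)
```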
{ "__computed__": { "latency_ms": 1.436, "error_count": 0, "cumulative_token_count": { "total": 0, "prompt": 0, "completion": 0 }, "cumulative_error_count": 0 } }
Calling `get_prompts` on a query engine gives back two prompts: `response_synthesizer:text_qa_template` and `response_synthesizer:refine_template`. How can I tell, programmatically, when it uses the `refine_template` instead of the `qa_template`?
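For reference, this is how I'm listing the prompts (a minimal sketch; `index` is assumed to be an existing index):

```python
# Sketch: listing the prompt templates attached to a query engine.
# `index` is assumed to be an existing VectorStoreIndex.
query_engine = index.as_query_engine()

prompts = query_engine.get_prompts()
print(list(prompts.keys()))
# ['response_synthesizer:text_qa_template', 'response_synthesizer:refine_template']
```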
I'm getting "asyncio.run() cannot be called from a running event loop" in my API call. This wasn't the case with 0.9.x, and I wasn't using nest_asyncio previously either.
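These are the two workarounds I've been experimenting with (a minimal sketch; whether either is the intended fix is exactly my question, and `query_engine` is assumed to already exist):

```python
# Sketch of two workarounds inside an async API handler.

# Option 1: allow nested event loops (requires the nest_asyncio package).
import nest_asyncio
nest_asyncio.apply()

# Option 2: stay fully async and await the query instead of the sync API.
# `query_engine` is assumed to be an existing query engine.
async def handle_request(question: str) -> str:
    response = await query_engine.aquery(question)
    return str(response)
```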
I'm using an ObjectIndex to fetch the correct SQL table mappings. Should I just call `index.as_query_engine()`
and ask each question? I'd like it to run in parallel if possible though, like SubQuestionQueryEngine does. If there are any best practices for that I'd love to know! Thanks.

My ingestion pipeline looks like this:

```python
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="NOTES")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=0),
        HuggingFaceEmbedding(model_name='XXXXX'),
    ],
    vector_store=vector_store,
)
pipeline.run(documents=NOTES)
```
When I add `QuestionsAnsweredExtractor(questions=3, llm=llm)` to the transformations, I see that it generates metadata like:

```python
{'questions_this_excerpt_can_answer': '1. How many countries does Uber operate in?\n2. What is the total gross bookings of Uber in 2019?\n3. How many trips did Uber facilitate in 2019?'}
```

My understanding is that the text is what gets searched. So how does it know how to search `questions_this_excerpt_can_answer` without specifying it as a metadata filter, using Qdrant for example?
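What I think is happening, and would like to confirm, is that the extracted metadata is included in the text that gets embedded, so it is searchable without any filter. A minimal sketch of how I'd check that (`nodes` is assumed to be the output of the pipeline):

```python
# Sketch: inspecting what actually gets embedded for a node.
# `nodes` is assumed to be the list of nodes produced by the ingestion pipeline.
from llama_index.core.schema import MetadataMode

node = nodes[0]

# Text as seen by the embedding model; if the metadata (e.g.
# questions_this_excerpt_can_answer) shows up here, it is part of what gets searched.
print(node.get_content(metadata_mode=MetadataMode.EMBED))

# Keys can be excluded from the embedded text if that's not wanted.
node.excluded_embed_metadata_keys = ["questions_this_excerpt_can_answer"]
```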
With a chunk_size of 512 I am able to get proper results from similarity search. But when I don't pass anything into the service context (so the chunk_size is 1024 by default in LlamaIndex, I assume), I get no results back. I'd like to see what the chunks end up looking like to figure out what's wrong.

```python
vector_query_engine_index = VectorStoreIndex.from_documents(
    documents,
    use_async=True,
    service_context=service_context,
)
```
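To inspect the chunks, this is what I was planning to try (a minimal sketch; chunk_size=1024 is just the default I'm assuming the service context uses):

```python
# Sketch: splitting the documents directly so the resulting chunks can be inspected.
# chunk_size=1024 mirrors the default I'm assuming applies without a service context.
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

print(len(nodes))
print(nodes[0].get_content()[:500])  # peek at the first chunk
```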
```python
llm = AzureOpenAI(
    engine="my-custom-llm",
    model="gpt-35-turbo-16k",
    temperature=0.0,
    azure_endpoint="https://<your-resource-name>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2023-07-01-preview",
)
```
Did `api_base` get deprecated and become `azure_endpoint`?

I have an `id` field in my metadata and I want to be able to dynamically filter on that `id`: I want to be able to pass that `id` as a filter to Qdrant. Is that possible?
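Something like this is what I have in mind (a minimal sketch; the key "id" and the value are placeholders for my dynamic value, and `index` is assumed to be the index built on the Qdrant collection):

```python
# Sketch: passing a metadata filter through to the Qdrant-backed index at query time.
# The key "id" and the value "1234" are placeholders; `index` is assumed to be the
# VectorStoreIndex built on top of the Qdrant collection.
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(filters=[ExactMatchFilter(key="id", value="1234")])

query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("What does this note say?")
```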