I'm setting `Settings.callback_manager = callback_manager` in Django, where `callback_manager` just has `llama_debug` and a `TokenCountingHandler`. When I use `SentenceTransformerRerank` directly in my code, it takes up a lot of CPU / RAM.
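One thing I'm considering (a minimal sketch, with a placeholder model name and `top_n`, assuming the reranker can safely be shared across requests) is constructing the reranker once at module level so the cross-encoder model is only loaded once per Django process:

```python
# Sketch: load the cross-encoder once per process instead of per request.
# The model name and top_n are placeholders, not the values from my setup.
from llama_index.core.postprocessor import SentenceTransformerRerank

# Module-level instance: constructing SentenceTransformerRerank loads the
# underlying sentence-transformers model, so doing it once should avoid the
# repeated CPU/RAM spike on every request.
RERANKER = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3,
)

def build_query_engine(index):
    # Reuse the shared reranker as a node postprocessor.
    return index.as_query_engine(node_postprocessors=[RERANKER])
```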
In `predict`, when synthesizing, why is `response = self._llm.complete(formatted_prompt)` used instead of `chat_response = self._llm.chat(messages)`? For reference, here is the `predict` method:
```python
def predict(
    self,
    prompt: BasePromptTemplate,
    output_cls: Optional[BaseModel] = None,
    **prompt_args: Any,
) -> str:
    """Predict."""
    self._log_template_data(prompt, **prompt_args)

    if output_cls is not None:
        output = self._run_program(output_cls, prompt, **prompt_args)
    elif self._llm.metadata.is_chat_model:
        messages = prompt.format_messages(llm=self._llm, **prompt_args)
        messages = self._extend_messages(messages)
        chat_response = self._llm.chat(messages)
        output = chat_response.message.content or ""
    else:
        formatted_prompt = prompt.format(llm=self._llm, **prompt_args)
        formatted_prompt = self._extend_prompt(formatted_prompt)
        response = self._llm.complete(formatted_prompt)
        output = response.text
```
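From that code, the chat vs. complete branch is chosen from the LLM's metadata, so I can check which path a given LLM will take (a minimal sketch; the OpenAI model name is only an example):

```python
# Sketch: checking which branch predict() will take for a given LLM.
# The OpenAI model name here is just an example.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

# True  -> predict() formats chat messages and calls llm.chat(...)
# False -> predict() formats a plain prompt and calls llm.complete(...)
print(llm.metadata.is_chat_model)
```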
{ "__computed__": { "latency_ms": 1.436, "error_count": 0, "cumulative_token_count": { "total": 0, "prompt": 0, "completion": 0 }, "cumulative_error_count": 0 } }
Calling `get_prompts` on a query engine gives back two prompts: `response_synthesizer:text_qa_template` and `response_synthesizer:refine_template`. How can I tell, programmatically, when it uses the `refine_template` instead of the `qa_template`?
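For reference, this is how I'm listing the prompts (a minimal sketch; `index` is assumed to be an existing index):

```python
# Sketch: listing the prompt templates attached to a query engine.
# `index` is assumed to be an existing VectorStoreIndex.
query_engine = index.as_query_engine()

prompts = query_engine.get_prompts()
print(list(prompts.keys()))
# ['response_synthesizer:text_qa_template', 'response_synthesizer:refine_template']
```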
I'm getting "asyncio.run() cannot be called from a running event loop" in my API call. This wasn't the case with 0.9.x, and I wasn't using nest_asyncio previously either.
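These are the two workarounds I've been experimenting with (a minimal sketch; whether either is the intended fix is exactly my question, and `query_engine` is assumed to already exist):

```python
# Sketch of two workarounds inside an async API handler.

# Option 1: allow nested event loops (requires the nest_asyncio package).
import nest_asyncio
nest_asyncio.apply()

# Option 2: stay fully async and await the query instead of the sync API.
# `query_engine` is assumed to be an existing query engine.
async def handle_request(question: str) -> str:
    response = await query_engine.aquery(question)
    return str(response)
```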
I'm using an ObjectIndex to fetch the correct SQL table mappings. Should I just call `index.as_query_engine()`
and ask each question? I'd like it to run in parallel if possible though, like SubQuestionQueryEngine does. If there are any best practices for that I'd love to know! Thanks.

My ingestion pipeline looks like this:

```python
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="NOTES")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=0),
        HuggingFaceEmbedding(model_name='XXXXX'),
    ],
    vector_store=vector_store,
)
pipeline.run(documents=NOTES)
```
When I add `QuestionsAnsweredExtractor(questions=3, llm=llm)` to the transformations, I see that it generates metadata like:

```python
{'questions_this_excerpt_can_answer': '1. How many countries does Uber operate in?\n2. What is the total gross bookings of Uber in 2019?\n3. How many trips did Uber facilitate in 2019?'}
```

My understanding is that the text is what gets searched. So how does it know how to search `questions_this_excerpt_can_answer` without specifying it as a metadata filter, using Qdrant for example?
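What I think is happening, and would like to confirm, is that the extracted metadata is included in the text that gets embedded, so it is searchable without any filter. A minimal sketch of how I'd check that (`nodes` is assumed to be the output of the pipeline):

```python
# Sketch: inspecting what actually gets embedded for a node.
# `nodes` is assumed to be the list of nodes produced by the ingestion pipeline.
from llama_index.core.schema import MetadataMode

node = nodes[0]

# Text as seen by the embedding model; if the metadata (e.g.
# questions_this_excerpt_can_answer) shows up here, it is part of what gets searched.
print(node.get_content(metadata_mode=MetadataMode.EMBED))

# Keys can be excluded from the embedded text if that's not wanted.
node.excluded_embed_metadata_keys = ["questions_this_excerpt_can_answer"]
```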
With a chunk_size of 512 I am able to get proper results from similarity search. But when I don't pass anything into the service context (so the chunk_size is 1024 by default in LlamaIndex, I assume), I get no results back. I'd like to see what the chunks end up looking like to figure out what's wrong.

```python
vector_query_engine_index = VectorStoreIndex.from_documents(
    documents,
    use_async=True,
    service_context=service_context,
)
```
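To inspect the chunks, this is what I was planning to try (a minimal sketch; chunk_size=1024 is just the default I'm assuming the service context uses):

```python
# Sketch: splitting the documents directly so the resulting chunks can be inspected.
# chunk_size=1024 mirrors the default I'm assuming applies without a service context.
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

print(len(nodes))
print(nodes[0].get_content()[:500])  # peek at the first chunk
```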
```python
llm = AzureOpenAI(
    engine="my-custom-llm",
    model="gpt-35-turbo-16k",
    temperature=0.0,
    azure_endpoint="https://<your-resource-name>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2023-07-01-preview",
)
```
Did `api_base` get deprecated and become `azure_endpoint`?

I have an `id` field in my metadata and I want to be able to dynamically filter on that `id`: I want to be able to pass that `id` as a filter to Qdrant. Is that possible?
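Something like this is what I have in mind (a minimal sketch; the key "id" and the value are placeholders for my dynamic value, and `index` is assumed to be the index built on the Qdrant collection):

```python
# Sketch: passing a metadata filter through to the Qdrant-backed index at query time.
# The key "id" and the value "1234" are placeholders; `index` is assumed to be the
# VectorStoreIndex built on top of the Qdrant collection.
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(filters=[ExactMatchFilter(key="id", value="1234")])

query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("What does this note say?")
```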