It can sometimes fail to find an answer depending on the query, right?
with the same documents in the index?
let me do some trial and error
is there any way we can force it to search the entire context for a keyword?
temperature expands the range of possible responses, as per my understanding
so what does temperature = 0.25 mean?
and getting this -
'VectorStoreIndex' object has no attribute 'insert_ref_doc'
I think it is insert not insert_ref_doc
ha yea you are right, it's just insert(document)
forgot we never added that renamed method
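e.g. just something like this (using the vector_index name from your later snippet; illustrative):

from llama_index import Document

# insert a single document into an already-built index
document = Document(text="some new content to index")
vector_index.insert(document)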
# define custom retriever combining vector + keyword retrieval
# import path for the 0.8/0.9-era llama_index package; adjust for your version
from llama_index.indices.postprocessor import SentenceTransformerRerank

vector_retriever = vector_index.as_retriever(similarity_top_k=10)
keyword_retriever = keyword_index.as_retriever()
custom_retriever = CustomRetriever(vector_retriever, keyword_retriever)

# re-rank retrieved nodes with a cross-encoder, keeping the top 5
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=5
)
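fwiw the CustomRetriever here is basically the hybrid AND/OR retriever from the LlamaIndex docs -- roughly this sketch (simplified):

from typing import List

from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from llama_index.schema import NodeWithScore


class CustomRetriever(BaseRetriever):
    """Hybrid retriever that combines vector and keyword retrieval."""

    def __init__(self, vector_retriever, keyword_retriever, mode="AND"):
        self._vector_retriever = vector_retriever
        self._keyword_retriever = keyword_retriever
        self._mode = mode  # "AND" = intersection of results, "OR" = union
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        keyword_nodes = self._keyword_retriever.retrieve(query_bundle)

        vector_ids = {n.node.node_id for n in vector_nodes}
        keyword_ids = {n.node.node_id for n in keyword_nodes}

        # merge nodes from both retrievers, keyed by node id
        combined = {n.node.node_id: n for n in vector_nodes}
        combined.update({n.node.node_id: n for n in keyword_nodes})

        if self._mode == "AND":
            keep_ids = vector_ids & keyword_ids
        else:
            keep_ids = vector_ids | keyword_ids

        return [combined[rid] for rid in keep_ids]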
I think we should add similarity_top_k to keyword_retriever also?
service_context is only used at the time of querying? Right?
there's no similarity for keywords -- it just fetches all nodes that contain the same keywords as the query
service context contains multiple things. The embedding model for example is used during vector index construction and querying
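e.g. something like this (the model choices are just placeholders):

from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding

# the embed_model is used both when building the vector index and when embedding queries;
# the llm is used at query time to synthesize the final answer
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.25),
    embed_model=OpenAIEmbedding(),
)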
ok! so we pass service_context in both cases then?
and then I'll disturb you again
Is the service_context passed correctly -
vector_index = VectorStoreIndex([], storage_context=storage_context, service_context=service_context)
keyword_index = SimpleKeywordTableIndex([], storage_context=storage_context, service_context=service_context)
does it support streaming?
custom_query_engine = RetrieverQueryEngine(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
    streaming=True,
    node_postprocessors=[rerank],
)
I tried to pass it this way
but that didn't turn out to work
before, I was doing it this way
vectorIndex.as_query_engine(response_mode="tree_summarize", similarity_top_k=10, streaming=True, node_postprocessors=[rerank])
you can set streaming in the response synthesizer
response_synthesizer = get_response_synthesizer(..., streaming=True)
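putting it together, something like this (reusing your custom_retriever and rerank from above; illustrative, exact kwargs may vary a bit by version):

from llama_index import get_response_synthesizer
from llama_index.query_engine import RetrieverQueryEngine

# streaming is configured on the response synthesizer, not on the query engine itself
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)
custom_query_engine = RetrieverQueryEngine(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[rerank],
)

streaming_response = custom_query_engine.query("your question here")
streaming_response.print_response_stream()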
and found that streaming should be passed there
That seems to work but the response time has increased
Any suggestions for decreasing that?
maybe changing the re-ranker to top 2 instead of top 3? Response times do just vary depending on how busy openai is too
Yeah but we were using openai before also
But maybe we're doing a deeper search now
That's why it's increased
Hey! Doing good! hbu?
About passing chat history
Do you know how we can do that
In llama as well as in the openai chat completion api
Ok so when using a chat engine, is it necessary to send chat history?
Or does it remember by itself?
It will remember it by itself within a session -- every new chat engine starts with a fresh memory
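but if you do want to seed it with prior history, roughly like this (vector_index and the example messages are just placeholders; the raw openai call uses the pre-1.0 SDK):

import openai
from llama_index.llms import ChatMessage, MessageRole

# LlamaIndex: pass earlier turns explicitly into the chat engine
chat_engine = vector_index.as_chat_engine(chat_mode="condense_question")
history = [
    ChatMessage(role=MessageRole.USER, content="What is the refund policy?"),
    ChatMessage(role=MessageRole.ASSISTANT, content="Refunds are allowed within 30 days."),
]
response = chat_engine.chat("And what about exchanges?", chat_history=history)

# raw OpenAI chat completion: history is just the messages list you send each call
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "What is the refund policy?"},
        {"role": "assistant", "content": "Refunds are allowed within 30 days."},
        {"role": "user", "content": "And what about exchanges?"},
    ],
)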
does llama support gpt-4?
from llama_index.llms import OpenAI
llm = OpenAI(model="gpt-4", temperature=0.1)
service_context = ServiceContext.from_defaults(llm=llm)
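and then any index built with that service_context will use gpt-4, e.g. (the ./data path is just a placeholder):

from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("Summarize the docs"))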
Which is the best vector store you'd suggest?
One that has great accuracy, speed, and results
I am using qdrant but the results are not so convincing
I don't think the results will change much between vector stores tbh -- they all do the same thing, which is vector similarity
Maybe weaviate is worth trying?
And are there any limitations in terms of document count
So that it generates good results?
Not really size limits, but more like things will work better if your collections are organized (i.e. per topic, etc.)
As an index gets bigger, you may also need to play with the top k
Ok that's good, let me try the one you suggested
can we host weaviate locally?
what is embedded installation?
Yea it's a weird way to name it haha
yeah! they should have called it app-integrated
'embedded' takes the mind to circuits
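fwiw embedded just means Weaviate spins up inside your Python process, no separate server to host -- roughly like this with the v3 weaviate-client (illustrative):

import weaviate
from weaviate.embedded import EmbeddedOptions

# starts a local Weaviate instance inside this Python process
client = weaviate.Client(embedded_options=EmbeddedOptions())
print(client.is_ready())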
I don't have that much data to index, around 5 GB of Markdown files for a RAG chatbot, but I will need the vector database to support 300k concurrent users. In that case which one would you recommend? Qdrant vs Milvus vs Weaviate vs Chroma vs Pinecone?