Find answers from the community

BanaanBakje
Offline, last seen 3 months ago
Joined September 25, 2024
I am really confused about how to get more control over retrieval from a vector store (Milvus 2.4).
I have tried hybrid search with RRFRanker (https://docs.llamaindex.ai/en/stable/examples/vector_stores/MilvusHybridIndexDemo/), but I get this error:
Plain Text
pymilvus.exceptions.MilvusException: <MilvusException: (code=2, message=Fail connecting to server on localhost:19530, illegal connection params or server unavailable)>
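
For reference, here is roughly the setup I am trying, based on that demo (a minimal sketch; the uri, dim, and data path are assumptions for my local standalone instance):
Plain Text
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

documents = SimpleDirectoryReader("./data").load_data()  # hypothetical data folder

# hybrid search needs a sparse index as well, hence enable_sparse=True;
# RRFRanker fuses the dense and sparse result lists
vector_store = MilvusVectorStore(
    uri="http://localhost:19530",  # assumes Milvus standalone is reachable here
    dim=1536,                      # must match the embedding model's output dimension
    enable_sparse=True,
    hybrid_ranker="RRFRanker",
    hybrid_ranker_params={"k": 60},
    overwrite=True,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

The error itself suggests nothing is listening on localhost:19530 in the first place, so I am also double-checking that the Milvus container is actually up.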

Also, since I am using index.as_chat_engine, I am quite confused: 99% of the docs are about the query engine, which works slightly differently, and of those, 90% use nodes instead of vector stores, so I'm kind of stuck.

I have found that I can use node postprocessors in the chat engine, which I still have to test:
Plain Text
from llama_index.core.postprocessor import SimilarityPostprocessor

# drop retrieved nodes scoring below the cutoff before they reach the LLM
chat_engine = index.as_chat_engine(
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7)
    ],
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are an expert Q&A system that is trusted around the world. "
        "Answer questions based on the given context."
    ),
)

What I would like is either to get the hybrid method working on Milvus or, preferably, to have more control over the retrieval function. I'm not sure yet whether a node postprocessor is enough for me, or whether I really need that retrieval control, but I would like to play around with it since it seems to be available (see the sketch below).
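
From what I can tell, the chat engine can also be built around an explicitly constructed retriever, which looks like the control I am after. A minimal sketch, assuming the sparse-enabled Milvus index from above plus a context chat engine:
Plain Text
from llama_index.core.chat_engine import ContextChatEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# build the retriever by hand instead of letting as_chat_engine do it;
# "hybrid" query mode only works against a sparse-enabled vector store
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
)

chat_engine = ContextChatEngine.from_defaults(
    retriever=retriever,
    memory=memory,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
    system_prompt="You are an expert Q&A system that is trusted around the world. Answer questions based on the given context.",
)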

I could be wrong; I'm somewhat confused because there are so many different ways to do things, so I wouldn't be surprised if postprocessors turn out to be the same thing.

TL;DR: I need a little guidance on retrieval for a Milvus vector store with index.as_chat_engine.
2 comments
How do I make this llama pack work with a local LLM?
https://docs.llamaindex.ai/en/stable/examples/llama_hub/llama_pack_resume.html
https://github.com/run-llama/llama-hub/tree/2c42ff046d99c1ed667ef067735e77364f9b6b7a/llama_hub/llama_packs/resume_screener
I am getting all kinds of errors trying to make this work: from not using OpenAI, to not using an embed model, to TreeSummarize issues, to JSON issues.

I tried using the pack directly in my code, passing my own LLM like this:
Plain Text
from llama_index.llama_pack import download_llama_pack

# fetch the pack class locally, then instantiate it with a non-OpenAI LLM
ResumeScreenerPack = download_llama_pack("ResumeScreenerPack", "./resume_screener_pack")

resume_screener = ResumeScreenerPack(
    job_description=job_description,
    criteria=[
        "2+ years of experience in one or more of the following areas: machine learning, recommendation systems, pattern recognition, data mining, artificial intelligence, or related technical field",
        "Experience demonstrating technical leadership working with teams, owning projects, defining and setting technical direction for projects",
        "Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.",
    ],
    llm=llm,
)

I also changed this line to add an embed model:
Plain Text
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
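
Since the pack builds its own ServiceContext internally, another thing to try (an assumption on my side, not from the pack's docs) is setting a global service context so every component defaults to the local models:
Plain Text
from llama_index import ServiceContext, set_global_service_context

# every component that builds a ServiceContext from defaults, including the pack,
# will now pick up the local llm and embed model
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)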

Then I used my own PDF, of course:
Plain Text
response = resume_screener.run(resume_path="pathtofile")

I still can't get it to work.
I don't know; normally I'm used to using the query engine. Could it be that something is missing before we can use a local LLM?
4 comments
BanaanBakje · RAG

Is chunking the best way to create a knowledge base? Is it better to make specific nodes, like Q&A, or to improve the text of the data if its quality is low? Is it better to control the chunks, or are there other strategies? Or is chunking reliable enough with lots of data? Are there guides you would recommend for business-grade RAG that go in depth about this and give reliable output even if the data quality is low?
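
For context, by "controlling the chunks" I mean explicit chunking parameters rather than the defaults. A minimal sketch of that, assuming a recent llama_index and a local data folder (sizes are arbitrary):
Plain Text
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()  # hypothetical folder

# explicit chunking instead of the defaults: smaller chunks with overlap,
# so a fact is less likely to be split across a chunk boundary
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)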
2 comments
My model is not offloading onto the GPU. I have tried many things all week; only oobabooga seems able to do it via n_gpu_layers, and all the other scripts I tried seem to ignore that setting.
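
For comparison, the place where n_gpu_layers has to end up is the llama-cpp-python constructor itself; if a script takes the flag but never passes it through, it is silently ignored. A minimal sketch, assuming a GGUF model and a CUDA-enabled build of llama-cpp-python (the model path is hypothetical):
Plain Text
from llama_cpp import Llama

# n_gpu_layers must reach the Llama() constructor;
# -1 offloads all layers, a smaller number offloads that many
llm = Llama(
    model_path="./models/model.gguf",  # hypothetical path
    n_gpu_layers=-1,
    n_ctx=4096,
)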
46 comments
I have seen Ollama around, but I'm not sure if that's an option; I will check it out though.
1 comment
I use OpenAILike as the LLM, so I'm thinking that's why, but I'm not sure what other LLM class to use, since I use vLLM as the inference server.
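
For reference, this is roughly how my OpenAILike LLM is wired to vLLM (a sketch; the endpoint and model name are placeholders):
Plain Text
from llama_index.llms import OpenAILike

llm = OpenAILike(
    model="my-served-model",              # placeholder for the model vLLM serves
    api_base="http://localhost:8000/v1",  # placeholder vLLM endpoint
    api_key="EMPTY",                  # vLLM ignores the key, but it must be non-empty
    is_chat_model=True,               # use the /chat/completions style endpoint
    is_function_calling_model=False,  # plain vLLM-served models don't do OpenAI tool calls
)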
6 comments
Hello,
I noticed the chat engine repeats a single .chat(question) call a few times before it answers the next question. I have 3 questions, but it keeps looping the first one about 7 times before moving on to the next (which it doesn't loop), so it's pretty weird.
Plain Text
response_text = chat_engine.chat(question)
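
For completeness, the surrounding loop looks roughly like this (a sketch; the question list is a placeholder):
Plain Text
questions = ["question 1", "question 2", "question 3"]  # placeholders

for question in questions:
    response_text = chat_engine.chat(question)
    print(response_text)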
11 comments
BanaanBakje · vLLM

I am trying to get a chat engine running with my vLLM OpenAI docker.
I can't find anything in the docs about using the vLLM OpenAI docker, but it should behave like using OpenAI.
Yet nothing I do works the way plain OpenAI does according to the LlamaIndex docs.
I can't use the OpenAI import from LlamaIndex, because I have to use the credentials from the vLLM docker, which has an empty API key, and an empty key is not allowed.
I also can't use the OpenAI import from the openai package (which I do use with the vLLM docker, where it does work).
The reason I can't use that import is this error:
Plain Text
service_context = ServiceContext.from_defaults(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/service_context.py", line 184, in from_defaults
    llm_metadata=llm_predictor.metadata,
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llm_predictor/base.py", line 148, in metadata
    return self._llm.metadata
AttributeError: 'OpenAI' object has no attribute 'metadata'
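
From the traceback, the problem seems to be that ServiceContext expects a LlamaIndex LLM object (with a .metadata property), while the openai package's OpenAI client has no such attribute. What I think should work instead is the OpenAILike wrapper pointed at the docker endpoint (a sketch; api_base and model name are placeholders for my container):
Plain Text
from llama_index import ServiceContext
from llama_index.llms import OpenAILike

# a LlamaIndex LLM object, so ServiceContext can read llm.metadata;
# the raw openai.OpenAI client has no such attribute
llm = OpenAILike(
    model="my-served-model",              # placeholder: whatever name the container serves
    api_base="http://localhost:8000/v1",  # placeholder endpoint
    api_key="EMPTY",                      # vLLM ignores it, but it must be non-empty
    is_chat_model=True,
)
service_context = ServiceContext.from_defaults(llm=llm)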

My script is:
28 comments
B
L
I am trying to connect multiple clients to the same LLM, so the model gets loaded only once and can handle multiple clients.
I am using this script; it loads the LLM twice and bugs out with more than 1 connection: https://github.com/Josh-ee/LlamaIndex_RAG_Memory/blob/main/example_UI_app.py
I had it working where the model only loads once, but then I got a segmentation fault.
I believe I have to put the LLM into a new async function, but I don't know how to handle it, if that is actually the solution (see the sketch below).
Question: how do I make the script load 1 model and let multiple clients each have their own chat that isn't influenced by other users?
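
What I have in mind is roughly this shape: load the model once at startup, and give each client its own chat engine and memory on top of the shared LLM and index (a sketch; the session bookkeeping is my assumption, not taken from the linked script):
Plain Text
from llama_index.memory import ChatMemoryBuffer

# llm and index are created once at startup, so the model loads only once;
# the shared LLM travels with the index's service context
sessions = {}  # client_id -> chat engine with its own memory

def get_chat_engine(client_id):
    if client_id not in sessions:
        sessions[client_id] = index.as_chat_engine(
            chat_mode="context",
            memory=ChatMemoryBuffer.from_defaults(token_limit=3000),
        )
    return sessions[client_id]

response = get_chat_engine("user-42").chat("hello")

Separate memories keep the chats independent; whether concurrent .chat() calls into one model are safe depends on the backend, so a lock or request queue around the LLM call may still be needed (which could also explain the segmentation fault).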
44 comments