hmmm, does the auto-merge query engine load something into VRAM that was not loaded before?
I'm hosting both the LLM and the embedding model in a different service. However, since the new llama-index upgrade, the RAG pipeline started loading things into VRAM πŸ€”
can you share some code?

If I'm remembering right, the auto-merge query engine relies on a docstore, which, unless you are using Redis or MongoDB, lives in memory
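If you want that out of process memory, something like a Redis-backed docstore should work -- a rough sketch (host, port, namespace, and the llama-index-storage-docstore-redis package are assumptions about your setup):
Python
from llama_index.core import StorageContext
from llama_index.storage.docstore.redis import RedisDocumentStore

# Assumes a Redis instance at localhost:6379 and that
# llama-index-storage-docstore-redis is installed.
docstore = RedisDocumentStore.from_host_and_port(
    host="localhost", port=6379, namespace="automerge_docstore"
)

# Pass this storage context when building the index so the
# auto-merging retriever reads parent nodes from Redis instead of RAM.
storage_context = StorageContext.from_defaults(docstore=docstore)
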
oh VRAM? Even weirder haha
VRAM is purely used for LLM or embedding models πŸ€”
I posted the error output, which narrows down the function that puts unwanted things on VRAM. It's connected to vLLM.
Did you change anything with your vllm usage? Maybe the vllm package updated?
We will refactor this internally, but you might want to consider refactoring as well, as others might not want the additional VRAM usage either. In our case it blows up quickly as we batch calls on the llama-index side.
nope still on 0.3.0
is this a llama-index issue though, or vllm? I think llama-index really just calls the LLM πŸ˜…
it's this function: /opt/conda/lib/python3.10/site-packages/llama_index/llms/vllm/base.py
that's the LLM class, yes
So the fundamental problem we were having is that vllm does not like being imported in that function's __del__ without an active CUDA GPU on the system
We are just trying to use the http server
In a GPU-less environment
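For context, roughly what our setup looks like -- llama-index only acts as an HTTP client against the remote vLLM server (the URL and parameters here are placeholders):
Python
from llama_index.llms.vllm import VllmServer

# Placeholder URL; the actual vLLM server runs in a separate,
# GPU-backed service and exposes its /generate endpoint over HTTP.
llm = VllmServer(
    api_url="http://vllm-host:8000/generate",
    max_new_tokens=256,
    temperature=0.1,
)

print(llm.complete("Hello").text)
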
Also, the vllm generate endpoint is all but deprecated, so y'all may want to consider refactoring it into an OpenAI derivative?
but that's another low-priority issue lol
Have you considered running vllm with the OpenAI API and using the OpenAILike llm class?
(Probably much more reliable/tested tbh)
Yeah we probably should...
It's complicated though
we were there before πŸ˜‰
Yeah, I can't remember why we even switched now?
switched to VllmServer πŸ˜„
Prompt syntax I think
doesn't vllm auto-format prompts if you use the chat endpoints? That was my understanding
it's been a while πŸ˜…
Like we couldn't figure out how to feed a proper raw prompt or something, and there may have been other issues as well
wait really? What does it do with message dicts? lol
I'm trying to remember myself, but it wasn't satisfactory lol
Well, last time we used it over the OpenAI endpoint we sure had to prompt it with the right template. However, that might have changed, as everything moves so fast πŸ˜„
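My understanding is the chat endpoint applies the model's chat template to the message dicts server-side -- roughly the equivalent of this sketch (the model name is just an example):
Python
from transformers import AutoTokenizer

# Roughly what a chat endpoint does with message dicts: the model's
# chat template turns them into a single raw prompt string.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [{"role": "user", "content": "What happened at Interleaf?"}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # e.g. "<s>[INST] What happened at Interleaf? [/INST]"
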
With OpenAILike, you may have to explicitly set is_chat_model=True in the constructor to use the chat endpoints πŸ‘€
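Roughly, assuming vLLM is serving its OpenAI-compatible API (the model name, URL, api_key, and context window here are placeholders):
Python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai_like import OpenAILike

# Assumes vLLM was started with its OpenAI-compatible server, e.g.
#   python -m vllm.entrypoints.openai.api_server --model <model>
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    api_base="http://vllm-host:8000/v1",
    api_key="fake",
    is_chat_model=True,    # route requests through the chat endpoint
    context_window=32768,  # llama-index can't infer this for custom models
)

print(llm.chat([ChatMessage(role="user", content="Hello!")]))
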
But back to vllm-server -- not 100% sure how to solve this issue πŸ˜…
we will find a way. Right now we will simply restrict GPU limits on other services so the whole pipeline does not OOM...
@Logan M On another topic:
Would the new Llama-ingest service be able to infer scanned-in tables in documents?
Is it able to do OCR reliably?
Does it scan & ingest content of embedded images in a Document?
Do you have capabilities to find subtables in Excel spreadsheets as unstructured does?
Our other concern was using a non-ChatML model, at the time
But I think we have a better grasp on forming the custom prompt syntax with llama-index now... maybe πŸ˜›
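If I remember the API right, the LLM wrappers accept messages_to_prompt / completion_to_prompt callables, so the template can be hand-rolled -- a sketch with a made-up prompt format and placeholder endpoint:
Python
from llama_index.llms.openai_like import OpenAILike

# Made-up prompt format for a non-ChatML model; adjust to whatever
# template the model was actually trained with.
def messages_to_prompt(messages):
    prompt = ""
    for m in messages:
        prompt += f"<|{m.role.value}|>\n{m.content}\n"
    return prompt + "<|assistant|>\n"

def completion_to_prompt(completion):
    return f"<|user|>\n{completion}\n<|assistant|>\n"

llm = OpenAILike(
    model="my-non-chatml-model",          # placeholder
    api_base="http://vllm-host:8000/v1",  # placeholder
    api_key="fake",
    is_chat_model=False,  # send hand-formatted raw prompts to the completion endpoint
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
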
I think it does quite well tbh -- I encourage you to give the API a shot and see what you think. The 1000 page/day limit is fairly roomy to try a few tough examples πŸ’ͺ
I will see. The tough examples are quite sensitive, so I can't push them into an external service just like that. But do I understand correctly that llama-ingest is only for PDFs?
Do you use an LVLM for it?
Yeah, only for PDFs right now. More document types are planned. And eventually (hopefully) something like enterprise deployments.

The pipeline right now is a complicated mix of OCR and processing πŸ™‚
@kapa.ai Can I use Neo4j as vector storage backend to store documents used by auto-merge query engine?
Let me check the knowledge base for you. One moment! :books:
----------------------
Yes, you can use Neo4j as a vector storage backend to store documents used by the auto-merge query engine in LlamaIndex.

Here is an example of how you can do this:

First, you need to install the necessary packages:
Plain Text
%pip install llama-index-vector-stores-neo4jvector
!pip install llama-index


Then, you can initiate the Neo4j vector wrapper:
Plain Text
from llama_index.vector_stores.neo4jvector import Neo4jVectorStore

username = "neo4j"
password = "pleaseletmein"
url = "bolt://localhost:7687"
embed_dim = 1536

neo4j_vector = Neo4jVectorStore(username, password, url, embed_dim)


After that, you can load your documents and build the VectorStoreIndex:
Plain Text
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

storage_context = StorageContext.from_defaults(vector_store=neo4j_vector)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)


Finally, you can use the index as a query engine:
Plain Text
query_engine = index.as_query_engine()
response = query_engine.query("What happened at interleaf?")


For more details, you can refer to the Neo4jVectorDemo notebook on GitHub.
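
For the auto-merge query engine specifically, a rough sketch that reuses the documents and neo4j_vector objects from above (note the docstore still defaults to in-memory unless you swap it out, e.g. for Redis):
Python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever

# Build a node hierarchy: leaf nodes get embedded into Neo4j, while all
# nodes (parents included) go into the docstore so retrieved leaves can
# be merged back into their parents.
node_parser = HierarchicalNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

storage_context = StorageContext.from_defaults(vector_store=neo4j_vector)
storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context=storage_context,
)
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What happened at interleaf?")
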
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------