hmmm, does the auto-merge query engine load something into VRAM that was not loaded before?
I'm hosting both the LLM and the embedding model in a different service. However, since the new llama-index upgrade, the RAG pipeline started loading things into VRAM πŸ€”
can you share some code?

If I'm remembering right, the auto-merge query engine relies on a docstore, which, unless you are using Redis or MongoDB, lives in memory
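If you want that out of process memory, something like a Redis-backed docstore should work -- a rough sketch (host, port, namespace, and the llama-index-storage-docstore-redis package are assumptions about your setup):
Python
from llama_index.core import StorageContext
from llama_index.storage.docstore.redis import RedisDocumentStore

# Assumes a Redis instance at localhost:6379 and that
# llama-index-storage-docstore-redis is installed.
docstore = RedisDocumentStore.from_host_and_port(
    host="localhost", port=6379, namespace="automerge_docstore"
)

# Pass this storage context when building the index so the
# auto-merging retriever reads parent nodes from Redis instead of RAM.
storage_context = StorageContext.from_defaults(docstore=docstore)
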
oh VRAM? Even weirder haha
VRAM is purely used for LLM or embedding models πŸ€”
I posted the error output, which narrows down the function that puts unwanted things on VRAM. It's connected to vLLM.
Did you change anything with your vllm usage? Maybe the vllm package updated?
We will refactor this internally, but you might want to consider refactoring as well, as others might not want the additional VRAM usage either. In our case it blows up quickly as we batch calls on the llama-index side.
nope still on 0.3.0
is this a llama-index issue though, or vllm? I think llama-index really just calls the LLM πŸ˜…
it's this function: /opt/conda/lib/python3.10/site-packages/llama_index/llms/vllm/base.py
that's the LLM class, yes
So the fundamental problem we were having is that vllm does not like being imported in that function's __del__ without an active CUDA GPU on the system
We are just trying to use the http server
In a GPU-less environment
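For context, roughly what our setup looks like -- llama-index only acts as an HTTP client against the remote vLLM server (the URL and parameters here are placeholders):
Python
from llama_index.llms.vllm import VllmServer

# Placeholder URL; the actual vLLM server runs in a separate,
# GPU-backed service and exposes its /generate endpoint over HTTP.
llm = VllmServer(
    api_url="http://vllm-host:8000/generate",
    max_new_tokens=256,
    temperature=0.1,
)

print(llm.complete("Hello").text)
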
Also, the vllm generate endpoint is all but deprecated, so y'all may want to consider refactoring it into an OpenAI derivative?
but that's another low-priority issue lol
Have you considered running vllm with the OpenAI API and using the OpenAILike llm class?
(Probably much more reliable/tested tbh)
Yeah we probably should...
It's complicated though
we were there before πŸ˜‰
Yeah, I can't remember why we even switched now?
switched to VllmServer πŸ˜„
Prompt syntax I think
doesn't vllm auto-format prompts if you use the chat endpoints? That was my understanding
it's been a while πŸ˜…
Like we couldn't figure out how to feed a proper raw prompt or something, and there may have been other issues as well
wait really? What does it do with message dicts? lol
I'm trying to remember myself, but it wasn't satisfactory lol
Well, last time we used it over the OpenAI endpoint we sure had to prompt it with the right template. However, that might have changed, as everything moves so fast πŸ˜„
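My understanding is the chat endpoint applies the model's chat template to the message dicts server-side -- roughly the equivalent of this sketch (the model name is just an example):
Python
from transformers import AutoTokenizer

# Roughly what a chat endpoint does with message dicts: the model's
# chat template turns them into a single raw prompt string.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [{"role": "user", "content": "What happened at Interleaf?"}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # e.g. "<s>[INST] What happened at Interleaf? [/INST]"
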
With OpenAILike, you may have to explicitly set is_chat_model=True in the constructor to use the chat endpoints πŸ‘€
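Roughly, assuming vLLM is serving its OpenAI-compatible API (the model name, URL, api_key, and context window here are placeholders):
Python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai_like import OpenAILike

# Assumes vLLM was started with its OpenAI-compatible server, e.g.
#   python -m vllm.entrypoints.openai.api_server --model <model>
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    api_base="http://vllm-host:8000/v1",
    api_key="fake",
    is_chat_model=True,    # route requests through the chat endpoint
    context_window=32768,  # llama-index can't infer this for custom models
)

print(llm.chat([ChatMessage(role="user", content="Hello!")]))
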
But back to vllm-server -- not 100% sure how to solve this issue πŸ˜…
we will find a way. Right now we will simply restrict GPU limits on other services so the whole pipeline does not OOM...
@Logan M On another topic:
Would the new Llama-ingest service be able to infer scanned-in tables in documents?
Is it able to do OCR reliably?
Does it scan & ingest content of embedded images in a Document?
Do you have capabilities to find subtables in Excel spreadsheets as unstructured does?
Our other concern was using a non-ChatML model, at the time
But I think we have a better grasp on forming the custom prompt syntax with llama-index now... maybe πŸ˜›
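If I remember the API right, the LLM wrappers accept messages_to_prompt / completion_to_prompt callables, so the template can be hand-rolled -- a sketch with a made-up prompt format and placeholder endpoint:
Python
from llama_index.llms.openai_like import OpenAILike

# Made-up prompt format for a non-ChatML model; adjust to whatever
# template the model was actually trained with.
def messages_to_prompt(messages):
    prompt = ""
    for m in messages:
        prompt += f"<|{m.role.value}|>\n{m.content}\n"
    return prompt + "<|assistant|>\n"

def completion_to_prompt(completion):
    return f"<|user|>\n{completion}\n<|assistant|>\n"

llm = OpenAILike(
    model="my-non-chatml-model",          # placeholder
    api_base="http://vllm-host:8000/v1",  # placeholder
    api_key="fake",
    is_chat_model=False,  # send hand-formatted raw prompts to the completion endpoint
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
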
I think it does quite well tbh -- I encourage you to give the API a shot and see what you think. The 1000 page/day limit is fairly roomy to try a few tough examples πŸ’ͺ
I will see. The tough examples are quite sensitive, so I can't push them into an external service just like that. But do I understand correctly that llama-ingest is only for PDFs?
Do you use an LVLM for it?
Yeah, only for PDFs right now. More document types are planned. And eventually (hopefully) something like enterprise deployments.

The pipeline right now is a complicated mix of OCR and processing πŸ™‚
@kapa.ai Can I use Neo4j as vector storage backend to store documents used by auto-merge query engine?
Let me check the knowledge base for you. One moment! :books:
----------------------
Yes, you can use Neo4j as a vector storage backend to store documents used by the auto-merge query engine in LlamaIndex.

Here is an example of how you can do this:

First, you need to install the necessary packages:
Plain Text
%pip install llama-index-vector-stores-neo4jvector
!pip install llama-index


Then, you can initiate the Neo4j vector wrapper:
Plain Text
from llama_index.vector_stores.neo4jvector import Neo4jVectorStore

username = "neo4j"
password = "pleaseletmein"
url = "bolt://localhost:7687"
embed_dim = 1536

neo4j_vector = Neo4jVectorStore(username, password, url, embed_dim)


After that, you can load your documents and build the VectorStoreIndex:
Plain Text
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

storage_context = StorageContext.from_defaults(vector_store=neo4j_vector)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)


Finally, you can use the index as a query engine:
Plain Text
query_engine = index.as_query_engine()
response = query_engine.query("What happened at interleaf?")


For more details, you can refer to the Neo4jVectorDemo notebook on GitHub.
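
For the auto-merge query engine specifically, a rough sketch that reuses the documents and neo4j_vector objects from above (note the docstore still defaults to in-memory unless you swap it out, e.g. for Redis):
Python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever

# Build a node hierarchy: leaf nodes get embedded into Neo4j, while all
# nodes (parents included) go into the docstore so retrieved leaves can
# be merged back into their parents.
node_parser = HierarchicalNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

storage_context = StorageContext.from_defaults(vector_store=neo4j_vector)
storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context=storage_context,
)
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What happened at interleaf?")
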
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------