Open source LLMs

If you don't mind me asking, what makes you think it would be challenging? I ask only because if this is more effort than it's worth, I might try another approach. I don't want to spin my wheels here, fighting against the grain.
Just a general trend I see with open source llms tbh

They usually struggle to follow instructions, or struggle to give consistent structured outputs

Although I think I remember you mentioning you tried a few 70B models, so it's surprising they didn't work well πŸ€” most of my observations have been with much smaller models (13B or less)

I guess my question is, were you ensuring your prompts were formatted properly for the model you were using? Usually if you deviate from the format the model was fine-tuned on, results will be... not great lol
Hm - unsure about that.
I've been asking general questions. Here's what my proof of concept is:
I'm creating a SQLAlchemy database that describes computers: a computers table with columns (hostname, mac_address, ip_address, group_name) and an open_port table with columns (hostname, ip_address, port_number). All primed with dummy data.

What I've found so far is that if I tell LlamaIndex about only the computers table, I can ask things like "how many computers are on the network" and get the correct response (only test so far). When I add in the second table, it starts to get confused by the same question.
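
For context, here's a rough sketch of the schema (the column types are just illustrative, not necessarily exactly what I have):

Plain Text
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Computer(Base):
    __tablename__ = "computers"
    hostname = Column(String, primary_key=True)
    mac_address = Column(String)
    ip_address = Column(String)
    group_name = Column(String)

class OpenPort(Base):
    __tablename__ = "open_port"
    hostname = Column(String, primary_key=True)
    ip_address = Column(String)
    port_number = Column(Integer, primary_key=True)

# in-memory engine; the dummy rows get inserted elsewhere
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
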
How would I go about confirming the format?
Do you mind sharing how you set up the LLM on your latest attempt?
(One second, joining a meeting)
Just to double check
no worries haha
Plain Text
selected_model = '/data/models/Llama-2-70b-hf/'
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    # query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="cpu",
    # change these settings below depending on your GPU
    #model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
# service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)
service_context = ServiceContext.from_defaults(llm=llm)
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=list_tables,
    service_context=service_context,
)

If that's what you mean.
That's it! And I already see the issue πŸ˜…
one sec, lemme type this out
In the huggingface LLM, just set the query wrapper prompt. It will look something like this

Plain Text
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

query_wrapper_prompt=(
    f"{BOS}{B_INST} {B_SYS}{system_prompt_str.strip()}{E_SYS}"
    f"{completion.strip()} {E_INST}"
)

If you aren't using a system prompt, then it would look like this

Plain Text
query_wrapper_prompt=(
    f"{BOS}{B_INST} "
    f"{completion.strip()} {E_INST}"
)
I think this should help quite a bit? Although this may or may not be specific to the llama2-chat model, unsure
llama2 has probably the strangest prompting requirements lol
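To make it concrete, here's a rough sketch of what the final wrapped prompt should look like at runtime (assuming no system prompt):

Plain Text
BOS, B_INST, E_INST = "<s>", "[INST]", "[/INST]"

# the wrapper just surrounds the user query with llama2's instruction tokens
query_str = "how many computers are on the network"
prompt = f"{BOS}{B_INST} {query_str.strip()} {E_INST}"
print(prompt)
# <s>[INST] how many computers are on the network [/INST]
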
Awesome, thank you very much! I'll probably have some time to play around with this later today/tomorrow.
I REALLY appreciate your input here.
Is there documentation (that I probably missed) that can help further my knowledge here? Not sure I would have figured that out on my own.
Meta really buried this. We only found out about it after digging through their GitHub repo.

NORMALLY this information is available on the model card for the LLM

For example, the quantized version of the model from the community has a proper example
https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML

(note that they exclude the BOS/EOS tokens -- we found you get slightly better outputs with them πŸ€·β€β™‚οΈ )
So yea, it kind of relies on the community providing proper input examples lol
Thank you very much for this info! I didn't get a chance to apply it today, but will do it tomorrow.
(let me know if I should take this discussion elsewhere)
I'm wondering: where does completion come from in here:
Plain Text
    f"{completion.strip()} {E_INST}"

So trying to work through what you gave me, and applying it to what I have, I have:

Plain Text
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


query_wrapper_prompt=(
    f"{BOS}{B_INST} "
    "{query_str} "
    f"{E_INST}"
)
wrapper = SimpleInputPrompt(query_wrapper_prompt, prompt_type=PromptType.SIMPLE_INPUT)
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    # system_prompt=system_prompt,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=wrapper,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="cpu",
    # change these settings below depending on your GPU
    #model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)


I got "{query_str} " and effectively the whole wrapper = SimpleInputPrompt(query_wrapper_prompt, prompt_type=PromptType.SIMPLE_INPUT) after reviewing docs.

I'm currently executing the query against the SQL database (on CPU, so slow), but I wanted to ask if you see anything wrong in what I'm doing right now.

I wasn't sure where {completion.strip()} came from in your example so I really just swapped it out for "{query_str} " (per my understanding of the docs)
Yea my bad, I think you got it right. The example I gave was copy-pasted from someone who was implementing the LLM call method, so they had access to more variables lol
That looks good so far!
tbh though, you might get faster results using llama-cpp-python and our LlamaCPP integration.

If you install it, you can compile it to run on any GPU (I use it on my M2 mac). It's not "fast" per se, but it will be way faster than huggingface on CPU lol

The llamaCPP integration provides a slightly different (easier?) way to format model inputs. We have internal utils to format llama2 specifically, and you can see them used here (messages_to_prompt, completion_to_prompt)
https://github.com/jerryjliu/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb

Tbh, we should probably add a similar functionality to the huggingfaceLLM lol
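
Roughly what I mean is something like this (a sketch based on that notebook; the model path and kwargs are placeholders, adjust for your setup):

Plain Text
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    # placeholder path: point this at a local GGML/GGUF llama-2 chat model,
    # or use model_url to have one downloaded for you
    model_path="/data/models/llama-2-13b-chat.ggmlv3.q4_0.bin",
    temperature=0.0,
    max_new_tokens=256,
    context_window=3900,
    # set n_gpu_layers > 0 if llama-cpp-python was compiled with GPU support
    model_kwargs={"n_gpu_layers": 1},
    # these utils handle the llama2 [INST]/<<SYS>> formatting for you
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
# then plug it into the service context the same way as before:
# service_context = ServiceContext.from_defaults(llm=llm)
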
No problem at all - I just appreciate the direction. I'm still very new to this library and trying to learn it, so any input you have is great.
Your suggestion is working! The predicted SQL statement for my query looks correct! Waiting on the second part to see how that goes. I definitely did not get good results before your help.

I'll take a look at the other things you mentioned. I am surprised at how slow this is.
This might also be helpful: https://github.com/jerryjliu/llama_index/pull/7283

Noticed that some of our parsing for getting the actual SQL query out of the response could be much better
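
To illustrate what I mean by parsing: it's basically pulling the raw SQL out of whatever text the model wraps around it. A minimal sketch of that kind of post-processing (not the actual code in the PR):

Plain Text
import re

def extract_sql(response: str) -> str:
    """Pull a SQL statement out of an LLM response that may surround it with
    prose or markdown fences. Illustrative only."""
    # prefer a fenced ```sql ... ``` block if the model produced one
    fenced = re.search(r"```(?:sql)?\s*(.*?)```", response, re.DOTALL | re.IGNORECASE)
    if fenced:
        return fenced.group(1).strip()
    # otherwise grab from the first SQL keyword up to a semicolon or end of text
    stmt = re.search(r"\b(SELECT|INSERT|UPDATE|DELETE|WITH)\b.*?(;|\Z)",
                     response, re.DOTALL | re.IGNORECASE)
    return stmt.group(0).strip() if stmt else response.strip()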