Open source LLMs

If you don't mind me asking, what makes you think it would be challenging? I ask only because if this is more effort than it's worth, I might try another approach. I don't want to spin my wheels here, fighting against the grain.
Just a general trend I see with open source llms tbh

They usually struggle to follow instructions, or struggle to give consistent structured outputs

Although I think I remember you mentioning you tried a few 70B models, so it's surprising they didn't work well πŸ€” most of my observations have been with much smaller models (13B or less)

I guess my question is, were you ensuring your prompts were formatted properly for the model you were using? Usually if you deviate from the format the model was fine-tuned on, results will be... not great lol
Hm - unsure about that.
I've been asking general questions. Here's what my proof of concept is:
I'm creating a SQLAlchemy database that describes computers: a computers table with columns (hostname, mac_address, ip_address, group_name) and an open_port table with columns (hostname, ip_address, port_number). All primed with dummy data.

What I've found so far is that if I tell LlamaIndex about only the computers table, I can ask things like "how many computers are on the network" and get the correct response (only test so far). When I add in the second table, it starts to get confused by the same question.
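
For context, here's a rough sketch of the schema (the column types are just illustrative, not necessarily exactly what I have):

Plain Text
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Computer(Base):
    __tablename__ = "computers"
    hostname = Column(String, primary_key=True)
    mac_address = Column(String)
    ip_address = Column(String)
    group_name = Column(String)

class OpenPort(Base):
    __tablename__ = "open_port"
    hostname = Column(String, primary_key=True)
    ip_address = Column(String)
    port_number = Column(Integer, primary_key=True)

# in-memory engine; the dummy rows get inserted elsewhere
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
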
How would I go about confirming the format?
Do you mind sharing how you set up the LLM on your latest attempt?
(One second, joining a meeting)
Just to double check
no worries haha
Plain Text
selected_model = '/data/models/Llama-2-70b-hf/'
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    # query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="cpu",
    # change these settings below depending on your GPU
    #model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
# service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)
service_context = ServiceContext.from_defaults(llm=llm)
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=list_tables,
    service_context=service_context,
)

If that's what you mean.
That's it! And I already see the issue πŸ˜…
one sec, lemme type this out
In the huggingface LLM, just set the query wrapper prompt. It will look something like this

Plain Text
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

query_wrapper_prompt=(
    f"{BOS}{B_INST} {B_SYS}{system_prompt_str.strip()}{E_SYS}"
    f"{completion.strip()} {E_INST}"
)

If you aren't using a system prompt, then it would look like this

Plain Text
query_wrapper_prompt=(
    f"{BOS}{B_INST} "
    f"{completion.strip()} {E_INST}"
)
I think this should help quite a bit? Although this may or may not be specific to the llama2-chat model, unsure
llama2 has probably the strangest prompting requirements lol
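To make it concrete, here's a rough sketch of what the final wrapped prompt should look like at runtime (assuming no system prompt):

Plain Text
BOS, B_INST, E_INST = "<s>", "[INST]", "[/INST]"

# the wrapper just surrounds the user query with llama2's instruction tokens
query_str = "how many computers are on the network"
prompt = f"{BOS}{B_INST} {query_str.strip()} {E_INST}"
print(prompt)
# <s>[INST] how many computers are on the network [/INST]
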
Awesome, thank you very much! I'll probably have some time to play around with this later today/tomorrow.
I REALLY appreciate your input here.
Is there documentation (that I probably missed) that can help further my knowledge here? Not sure I would have figured that out on my own.
Meta really buried this. We only found out about it after digging through their GitHub repo.

NORMALLY this information is available on the model card for the LLM

For example, the quantized version of the model from the community has a proper example
https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML

(note that they exclude the BOS/EOS tokens -- we found you get slightly better outputs with them πŸ€·β€β™‚οΈ )
So yea, it kind of relies on the community providing proper input examples lol
Thank you very much for this info! I didn't get a chance to apply it today, but will do it tomorrow.
(let me know if I should take this discussion elsewhere)
I'm wondering: where does completion come from in here:
Plain Text
    f"{completion.strip()} {E_INST}"

So trying to work through what you gave me, and applying it to what I have, I have:

Plain Text
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


query_wrapper_prompt=(
    f"{BOS}{B_INST} "
    "{query_str} "
    f"{E_INST}"
)
wrapper = SimpleInputPrompt(query_wrapper_prompt, prompt_type=PromptType.SIMPLE_INPUT)
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    # system_prompt=system_prompt,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=wrapper,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="cpu",
    # change these settings below depending on your GPU
    #model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)


I got "{query_str} " and effectively the whole wrapper = SimpleInputPrompt(query_wrapper_prompt, prompt_type=PromptType.SIMPLE_INPUT) after reviewing docs.

I'm currently executing the query against the SQL database (on CPU, so slow), but I wanted to ask if you see anything wrong in what I'm doing right now.

I wasn't sure where {completion.strip()} came from in your example so I really just swapped it out for "{query_str} " (per my understanding of the docs)
Yea my bad, I think you got it right. The example I gave was copy-pasted from someone who was implementing the LLM call method, so they had access to more variables lol
That looks good so far!
tbh though, you might get faster results using llama-cpp-python and our LlamaCPP integration.

If you install it, you can compile it to run on any GPU (I use it on my M2 mac). It's not "fast" per se, but it will be way faster than huggingface on CPU lol

The llamaCPP integration provides a slightly different (easier?) way to format model inputs. We have internal utils to format llama2 specifically, and you can see them used here (messages_to_prompt, completion_to_prompt)
https://github.com/jerryjliu/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb

Tbh, we should probably add a similar functionality to the huggingfaceLLM lol
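
Roughly what I mean is something like this (a sketch based on that notebook; the model path and kwargs are placeholders, adjust for your setup):

Plain Text
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    # placeholder path: point this at a local GGML/GGUF llama-2 chat model,
    # or use model_url to have one downloaded for you
    model_path="/data/models/llama-2-13b-chat.ggmlv3.q4_0.bin",
    temperature=0.0,
    max_new_tokens=256,
    context_window=3900,
    # set n_gpu_layers > 0 if llama-cpp-python was compiled with GPU support
    model_kwargs={"n_gpu_layers": 1},
    # these utils handle the llama2 [INST]/<<SYS>> formatting for you
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
# then plug it into the service context the same way as before:
# service_context = ServiceContext.from_defaults(llm=llm)
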
No problem at all - I just appreciate the direction. I'm still very new to this library and trying to learn it, so any input you have is great.
Your suggestion is working! The predicted SQL statement for my query looks correct! Waiting on the second part to see how that goes. I definitely did not get good results before your help.

I'll take a look at the other things you mentioned. I am surprised at how slow this is.
This might also be helpful: https://github.com/jerryjliu/llama_index/pull/7283

Noticed that some of our parsing for getting the actual SQL query out of the response could be much better
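
To illustrate what I mean by parsing: it's basically pulling the raw SQL out of whatever text the model wraps around it. A minimal sketch of that kind of post-processing (not the actual code in the PR):

Plain Text
import re

def extract_sql(response: str) -> str:
    """Pull a SQL statement out of an LLM response that may surround it with
    prose or markdown fences. Illustrative only."""
    # prefer a fenced ```sql ... ``` block if the model produced one
    fenced = re.search(r"```(?:sql)?\s*(.*?)```", response, re.DOTALL | re.IGNORECASE)
    if fenced:
        return fenced.group(1).strip()
    # otherwise grab from the first SQL keyword up to a semicolon or end of text
    stmt = re.search(r"\b(SELECT|INSERT|UPDATE|DELETE|WITH)\b.*?(;|\Z)",
                     response, re.DOTALL | re.IGNORECASE)
    return stmt.group(0).strip() if stmt else response.strip()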