One more question I have: after the OpenAI LLM models, which is the best free and open-source model we can use?
I'll let you know on this. I'm trying it myself.
Loading the model is a pain.
So far the best luck I've had is with Wizard Vicuna 13B HF, if you have the hardware.
It works well. It's slow to respond, can take around 20 seconds, but the quality is decent. I'm also using it with LlamaIndex against my own documents and it works decently.
Hey @Ichigø, thanks for the response, appreciate it. Can you please share the notebook or a reference for Wizard Vicuna 13B?
I wanted to try that as well.
@dev_blockchain I host my stuff on AWS SageMaker so I can't really share it straightforwardly, but if you just go on Hugging Face and look for Wizard Vicuna by TheBloke or something like that, you should be able to see it.
mname = "TheBloke/wizard-vicuna-13B-HF"
tokenizer = LlamaTokenizer.from_pretrained(mname)
model = LlamaForCausalLM.from_pretrained(mname, load_in_8bit=True, device_map="auto", torch_dtype=torch.float16)
def format_prompt(prompt: str) -> str:
prompt_template=f"### Human: {prompt} \n### Assistant:"
return prompt_template
class customLLM(LLM):
def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
pr = format_prompt(prompt)
generation_config = GenerationConfig(
max_new_tokens=5000,
temperature=0.1,
repetition_penalty=1.0,
)
inputs = tokenizer(pr, padding=False, add_special_tokens=False, return_tensors="pt").to(model.device)
with torch.inference_mode():
tokens = model.generate(**inputs, generation_config=generation_config)
return tokenizer.decode(tokens[0], skip_special_tokens=True)
@property
def _identifying_params(self) -> Mapping[str, Any]:
return {"name_of_model": model}
@property
def _llm_type(self) -> str:
return "custom"
This should get you started.
Below that it's mainly LlamaIndex and LangChain stuff.
Thanks for the help @Ichigø, will keep you updated on my progress with this.
Really appreciate it.
No worries, I've been dealing with this for the last month.
Thanks to @Logan M lol, he's been dealing with my questions.
Hey @Ichigø, I have implemented the above code. Can you please check my notebook and look into the errors I am having?
Also, about this notebook: I am trying to use it as a query LLM on my custom data.
@Logan M please take a look as well if you have any thoughts or a reference for the same.
Add this line somewhere near the top: from typing import Optional, List, Mapping, Any
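(For reference, the service_context used in the snippet below isn't shown anywhere; here's a minimal sketch of how it could be built from the customLLM above, assuming llama_index 0.5.x and a local Hugging Face embedding model picked purely as an example:)
from llama_index import LLMPredictor, ServiceContext, LangchainEmbedding
from langchain.embeddings import HuggingFaceEmbeddings  # needs sentence-transformers installed

llm_predictor = LLMPredictor(llm=customLLM())
# local embedding model so indexing/querying doesn't fall back to OpenAI; model name is just an example
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, embed_model=embed_model)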
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('./data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
index.save_to_disk('index.json')

query_text = "In Backup and Recovery, what is Customer Data?"
response = index.query(query_text, response_mode="compact", service_context=service_context, similarity_top_k=1)
print(response)
# prints: Answer: Customer Data

query_text = "From where can we access files stored on Smallpdf?"
response = index.query(query_text, response_mode="compact", service_context=service_context, similarity_top_k=1)
print(response)
@Logan M after that, can I use the model like this? Because for now I believe it needs more than 15 GB of GPU memory. So I just wanted to know if these are the next steps or something different.
Great, let me work on this, and I will update you so others can also get help from it.
Perfect! You could also use LangChain to further make it a chatbot with memory
It would be sick if LlamaIndex could just have a memory feature
Llama index is less focused on chat, and more on data retrieval/answering questions using your data 👀 I think memory is a pretty low priority at the moment, but maybe someday
@Logan M oh yeah, wanted to ask: does LangChain memory save the chat somewhere, or is it just stored in RAM and flushed when the session stops?
Because it's odd that whenever I use the OpenAI version, it always just tells me “the new context did not provide blah blah so the original answer stays the same”
It never answers me anything on the first try
It's like it stores the whole conversation
And then I say give the original answer and it gives it to me
So it must be storing it on OpenAI's server side
Are you using GPT-3.5? This is a super common problem with GPT-3.5 and llama index 😦
This response comes from the answer refinement part of llama index. It used to work, but then openai "updated" gpt-3.5
I have a custom refine prompt I can share, if you want. It seemed to help
Depending on your llama index version, there are two ways at the bottom
from langchain.prompts.chat import (
    AIMessagePromptTemplate,
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from llama_index.prompts.prompts import RefinePrompt

# Refine Prompt
CHAT_REFINE_PROMPT_TMPL_MSGS = [
    HumanMessagePromptTemplate.from_template("{query_str}"),
    AIMessagePromptTemplate.from_template("{existing_answer}"),
    HumanMessagePromptTemplate.from_template(
        "I have more context below which can be used "
        "(only if needed) to update your previous answer.\n"
        "------------\n"
        "{context_msg}\n"
        "------------\n"
        "Given the new context, update the previous answer to better "
        "answer my previous query. "
        "If the previous answer remains the same, repeat it verbatim. "
        "Never reference the new context or my previous query directly.",
    ),
]

CHAT_REFINE_PROMPT_LC = ChatPromptTemplate.from_messages(CHAT_REFINE_PROMPT_TMPL_MSGS)
CHAT_REFINE_PROMPT = RefinePrompt.from_langchain_prompt(CHAT_REFINE_PROMPT_LC)
...
# v0.6.x
query_engine = index.as_query_engine(..., refine_template=CHAT_REFINE_PROMPT)
# v0.5.x
response = index.query(..., refine_template=CHAT_REFINE_PROMPT)
Thank you so much! I'll check this out
It's giving me an error now @Logan M
raise ValueError(f"One input key expected got {prompt_input_keys}")
ValueError: One input key expected got ['refine_template', 'input']
I was adding this to LangChain
Hey @Ichigø @Logan M, getting this error:
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
I was running this command @Ichigø:
print(customLLM()._call("Tell me something about New York City."))
Hey @Ichigø @Logan M, do we also have conversations or anything, not only queries? So that we can have a little chat.
You'll want to integrate with langchain for that 👍
Basically the idea is you use llama index as a custom tool for a langchain agent
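Roughly like this, as an untested sketch; the tool name and description are just placeholders, and you can swap in whichever LLM you want driving the agent (a 13B local model may be flaky as an agent):
from langchain.agents import Tool, initialize_agent
from langchain.memory import ConversationBufferMemory

def query_index(q: str) -> str:
    # wrap the llama_index query so langchain can call it as a tool
    return str(index.query(q, service_context=service_context, similarity_top_k=1))

tools = [
    Tool(
        name="doc-index",  # placeholder name
        func=query_index,
        description="Useful for answering questions about the uploaded documents.",
    )
]

memory = ConversationBufferMemory(memory_key="chat_history")
agent = initialize_agent(
    tools,
    customLLM(),  # or any other LLM for the agent itself
    agent="conversational-react-description",
    memory=memory,
)
print(agent.run("What is Customer Data in Backup and Recovery?"))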
Oh great, thanks. One more important question: how can I print the context while I am querying?
If you are using the approach in the notebook, you can use a wrapper function instead of a lambda. In the wrapper, you can print anything you want
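Something like this, as a rough sketch reusing the index/service_context names from the notebook:
# instead of e.g. func=lambda q: str(index.query(q, ...)), use a wrapper so you can log whatever you want
def query_index(q: str) -> str:
    print(f"Query going to the index: {q}")
    response = index.query(q, response_mode="compact", service_context=service_context, similarity_top_k=1)
    print(f"Response from the index: {response}")
    return str(response)

# then pass func=query_index to the Tool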
Hey @Ichigø, the Vicuna is running now, but I think it's handling one request at a time, or am I doing something wrong??
Nah, that's how it works 😅 you need multiple instances of the model to handle requests in parallel... which requires a lot of hardware
Oh thanks, I thought the issue was with me 😅
Hey @Logan M, I tried to get the context but I'm still not able to. Do you have any reference for this? Like, when I pass a query I want to see which context the response is coming from.
Check response.source_nodes I think
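Something like this, as a rough sketch; the exact attribute names depend on your llama_index version:
response = index.query(query_text, service_context=service_context, similarity_top_k=1)
print(response.get_formatted_sources())  # quick look at the retrieved chunks
for node in response.source_nodes:
    print(node.source_text)  # 0.5.x; on 0.6.x it's node.node.get_text() instead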