Flan is a little tricky to use.
Can I see how you setup the index? Did you use a prompt helper?
from typing import Any, List, Mapping, Optional

import torch
from transformers import pipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms.base import LLM
from llama_index import GPTListIndex, LangchainEmbedding, LLMPredictor, PromptHelper, ServiceContext, SimpleDirectoryReader

class CustomLLM(LLM):
    model_name = "google/flan-t5-large"
    pipeline = pipeline("text-generation", model=model_name, device="cuda:0", model_kwargs={"torch_dtype": torch.bfloat16})

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)
        print(prompt)
        # num_output is set below alongside the prompt helper settings
        response = self.pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]
        # only return newly generated tokens
        return response[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"

llm_predictor = LLMPredictor(llm=CustomLLM())
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
# prompt_helper is defined further down with its settings
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, embed_model=embed_model)
# Load your data
documents = SimpleDirectoryReader('data').load_data()
index = GPTListIndex.from_documents(documents, service_context=service_context)
index.save_to_disk('index.json')
new_index = GPTListIndex.load_from_disk('index.json', service_context=service_context)
# Query and print response
# query with embed_model specified
response = new_index.query(
    "how much was spent at COMCAST?",
    mode="embedding",
    verbose=True,
    service_context=service_context
)
print(response)
I see you're passing in the prompt helper; what are the settings for that? I think that will be the main thing to tweak
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
Really appreciate the help!
Flan's max input size is very small (512)
Try with something like these settings
max_input_size=512
num_output=256
max_chunk_overlap=20
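Roughly like this, rebuilding the service context so the new prompt helper is picked up (just a sketch, reusing the objects from your snippet above):

# Flan-sized prompt helper settings
prompt_helper = PromptHelper(max_input_size=512, num_output=256, max_chunk_overlap=20)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, embed_model=embed_model)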
However, by default Flan outputs 512 tokens, so we need to find the setting to change that to 256 in the pipeline
UPDATE: ah I see you are setting this
Running now looks better already!
Snap, so it looked promising, but then I started seeing:
We have provided an existing answer:
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
FOOD MARKET INC,Groceries,Sale,-63.75,
01/01/2022,01/02/2022,LYFT *2 RIDES 12-30,Travel,Sale,-54.23,
12/30/2021,01/02/2022,TAQUERIA DOWNTOWN CATE,Food & Drink,Sale,-23.35,
01/01/2022,01/02/2022,DD DOORDASH CHIPOTLE,Food & Drink,Sale,-26.60,
but in the end it was an Empty Response
Yeaaa sounds familiar
I see you are actually setting the num output in the pipeline. You might need to reduce it further (maybe 150?)
The way Flan works is a little different from GPT models (it's an encoder/decoder model vs. decoder-only models), which makes it a little tricky
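One thing that might help illustrate it: transformers has a dedicated "text2text-generation" pipeline for seq2seq models like flan-t5, and that one returns only the newly generated text (no prompt echo), so the prompt-slicing trick isn't needed there. A rough, untested sketch of what the generation step could look like with that task and the smaller output budget:

# sketch only: flan-t5 is an encoder/decoder model, so the seq2seq pipeline task fits it
seq2seq_pipe = pipeline("text2text-generation", model="google/flan-t5-large", device="cuda:0", model_kwargs={"torch_dtype": torch.bfloat16})

def flan_call(prompt: str) -> str:
    # seq2seq pipeline output is just the answer, no prompt prefix to strip
    return seq2seq_pipe(prompt, max_new_tokens=150)[0]["generated_text"]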
is there a better model you'd recommend?
goal was to avoid the GPT costs while doing dev : (
and then upgrade once i have a built solution
How much VRAM do you have access to?
32 GB, but I can spin up a VM if needed
Oh cool!
This model might be interesting to try
https://huggingface.co/facebook/opt-iml-max-1.3b
And of course I'm sure you've seen all the GitHub repos with things like Alpaca, LLaMA, GPT4All. All of those would be good options too, but they take a bit more setup since they aren't on Hugging Face
Just be aware that depending on the model you are using, you need to adjust the prompt helper to the input size of that model
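For example, with opt-iml-max-1.3b (a decoder-only model with a 2048-token context), the swap would look roughly like this (a sketch, untested, reusing the CustomLLM pattern above):

# decoder-only model, so "text-generation" plus prompt slicing still applies
model_name = "facebook/opt-iml-max-1.3b"
opt_pipe = pipeline("text-generation", model=model_name, device="cuda:0", model_kwargs={"torch_dtype": torch.bfloat16})

# match the prompt helper to this model's larger input size
prompt_helper = PromptHelper(max_input_size=2048, num_output=256, max_chunk_overlap=20)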
Also, note #2: depending on which LLM you use, you might get pretty varied performance/quality of answers
but hopefully they are mostly the same
haha yeah, hard to beat ChatGPT : (
but appreciate the help! One question when using Hugging Face:
is the data being passed outside of my local machine?