I used the text-generation class because the input is a raw prompt, and the model needs to read that and "continue" the prompt by answering the question.
You could use a question-answering pipeline too, or really anything as long as it returns an answer, but the question-answering pipeline expects a specific format (the question and context have to be passed in separately), which would require more complex string parsing to organize the prompt.
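Just to make the text-generation approach concrete, here's a minimal sketch of the kind of wrapper being discussed, loosely following the custom-LLM example from the LlamaIndex docs of that era (a LangChain `LLM` subclass around a HuggingFace pipeline). The exact imports, class names, and the `max_new_tokens` value are assumptions and depend on your llama_index/langchain versions:

```python
from transformers import pipeline
from langchain.llms.base import LLM

model_name = "facebook/opt-iml-1.3b"  # swap in whatever model you're testing
pipe = pipeline("text-generation", model=model_name)

class CustomLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop=None) -> str:
        # The text-generation pipeline echoes the prompt back, so strip it off
        # and return only the newly generated continuation (the "answer").
        generated = pipe(prompt, max_new_tokens=256)[0]["generated_text"]
        return generated[len(prompt):]
```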
So when you input some documents into an index, they get broken into chunks so that they hopefully fit into the model's context window at query time.
Then when you query (this is for a vector index), the top_k nodes that best match the query are retrieved. An answer to the query is then refined across all of those nodes: once the model gives an answer, LlamaIndex presents the LLM with the next node's context and asks if it needs to change the previous answer.
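As a rough sketch of that retrieve-then-refine flow (using ~0.6-era llama_index class names; newer releases rename these, e.g. `VectorStoreIndex`, and the `./docs` folder and query string are just placeholders):

```python
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs").load_data()  # your documentation files
index = GPTVectorStoreIndex.from_documents(documents)    # chunks + embeds them

# similarity_top_k controls how many matching nodes are retrieved; the default
# response mode then drafts an answer on the first node and asks the LLM to
# refine that answer against each remaining node in turn.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("How do I configure logging?")
print(response)
```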
Got it, thank you very much! For context, I'm trying to create a bot to read an entire documentation set, like the example you provided. I got weird results though when I tried various models: for example, flan-t5 gave empty responses while opt-iml-1.3b gave decent results, which I thought was curious.
Do you think this is due to the "customization" of the LLM or due to the LLM itself? Could customizing the CustomLLM make flan-t5 give reasonable results like the opt-iml?
Flan has a very limited context size, which means that if it needs to refine an answer, it can quickly run out of room in the input (since the refine step needs the prompt template + question + context + previous answer).
Even without the refine step, the 512-token limit makes things a little tricky.
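One way to work within that limit (same era and assumptions as the sketches above) is to use smaller chunks at index time and a PromptHelper that knows the window is tiny. The specific numbers are just guesses, and the argument names changed in later releases (e.g. `context_window` / `chunk_overlap_ratio`), so check your version:

```python
from llama_index import (
    GPTVectorStoreIndex,
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
)

llm_predictor = LLMPredictor(llm=CustomLLM())  # the wrapper sketched earlier
prompt_helper = PromptHelper(
    max_input_size=512,  # flan-t5's input limit
    num_output=128,      # leave room for the generated answer
    max_chunk_overlap=20,
)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
    chunk_size_limit=256,  # smaller chunks so prompt + context + answer still fit
)

documents = SimpleDirectoryReader("./docs").load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
```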