
I am new to the whole LLM ecosystem

I am new to the whole LLM ecosystem, trying to wrap my head around llama_index and langchain. Could someone help me understand the difference between GPTVectorStoreIndex/GPTListIndex/GPTSimpleVectorIndex?
Those are pretty outdated names; it's just VectorStoreIndex and ListIndex now

A list index sends all nodes in the index to the LLM

The vector store index only sends the top k most relevant nodes to the LLM
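A rough sketch of the difference in plain Python (this is not llama_index code; the node texts are made up, and the keyword-overlap scoring is a toy stand-in for the embedding similarity a real vector index uses):

```python
# Toy illustration of ListIndex vs VectorStoreIndex behaviour.

def list_index_context(nodes: list[str]) -> list[str]:
    """A list index sends every node to the LLM."""
    return nodes

def vector_index_context(nodes: list[str], query: str, top_k: int = 2) -> list[str]:
    """A vector store index sends only the top-k most relevant nodes.
    Real indexes rank by embedding similarity; keyword overlap stands in here."""
    def score(node: str) -> int:
        return len(set(node.lower().split()) & set(query.lower().split()))
    return sorted(nodes, key=score, reverse=True)[:top_k]

nodes = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Bananas are rich in potassium.",
]
print(len(list_index_context(nodes)))  # → 3 (everything goes to the LLM)
print(vector_index_context(nodes, "Where is the Eiffel Tower?", top_k=1))
# → ['The Eiffel Tower is in Paris.']
```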
Thanks @Logan M !
How is fine-tuning a model different from using VectorStoreIndex/ListIndex from llama_index? Should we always fine-tune a model for our custom use case, or should we use VectorStoreIndex (from llama_index) for querying our data? I'm just having a hard time wrapping my head around it πŸ™
Using llama index, there is no training. You rely on an LLM that is hopefully smart enough to understand general instructions. And rather than relying on the actual facts the LLM knows, you give it relevant text to use when answering a query

This is powerful because it's a) fast to iterate on and b) debuggable
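The "give it relevant text" step is really just prompt assembly. A minimal sketch (the template wording here is invented, not llama_index's actual prompt):

```python
def build_rag_prompt(context_chunks: list[str], question: str) -> str:
    """Stuff retrieved text into the prompt so the LLM answers from it
    rather than from whatever facts it memorized in training."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    ["The warranty period is 24 months."],
    "How long is the warranty?",
)
print(prompt)
```

Because the retrieved chunks are right there in the prompt, you can inspect them directly when an answer looks wrong, which is where the debuggability comes from.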

Fine tuning is where you actually train a model and update the model weights. This requires high quality training data, and usually a lot of it. Since this process is pretty labour heavy, it's only recommended for specific use cases (teaching an LLM to write in a certain style, very specific tasks, etc.)

If something doesn't work well after fine-tuning, it's very slow to figure out what went wrong and then retrain
Hope that makes some sense πŸ˜…
Thanks @Logan M !
So if we are using ListIndex, then all the nodes' data gets sent over to the LLM as input text? What happens if we have a very large custom dataset in the ListIndex (for example, our custom domain knowledge set)?
So llama index does something called answer refinement

Basically chunk the text according to the model's max input size

Get an initial answer

Pass that answer and the next chunk of text to the LLM. The LLM then has to answer the query again by either repeating the existing answer or updating it with the new context.
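The refine loop above, sketched as pseudocode (this is not the library's implementation; `llm` is any callable that takes a prompt string and returns an answer string, and the prompt wording is made up):

```python
def refine(chunks: list[str], query: str, llm) -> str:
    # Get an initial answer from the first chunk
    answer = llm(f"Context: {chunks[0]}\nQuestion: {query}\nAnswer:")
    # Then refine it against each remaining chunk in turn
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            f"Question: {query}\n"
            "Repeat the existing answer, or update it using the new context:"
        )
    return answer
```

Note this makes one LLM call per chunk, so the cost grows linearly with the size of the index.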

The alternative is using response_mode="tree_summarize", which asks the query against every node, then builds a bottom-up tree from all those responses until there's one node left, which it returns
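A sketch of that bottom-up build (again pseudocode, not the library's code; `llm` and the prompt text are stand-ins, and the fan-out of 2 is an arbitrary choice):

```python
def tree_summarize(nodes: list[str], query: str, llm, fanout: int = 2) -> str:
    # Ask the query against every node...
    answers = [llm(f"Context: {n}\nQuestion: {query}") for n in nodes]
    # ...then merge answers bottom-up until a single one remains
    while len(answers) > 1:
        answers = [
            llm(f"Question: {query}\nCombine these answers: "
                + " | ".join(answers[i:i + fanout]))
            for i in range(0, len(answers), fanout)
        ]
    return answers[0]
```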
Thanks @Logan M !
I am new to LLM ecosystem (huggingface, llama_index, langchain), trying to learn llama_index. Thank you so much for helping out πŸ™
I am trying to build a few applications with open datasets, index them using llama_index, and hoping a decent enough LLM will be able to help analyse the datasets and provide relevant information:

1: This is the first use case and first dataset: https://drive.google.com/file/d/1WFvu8dnVwZV5WuluHFS_eCMJv3qOaXr1/view
This has lending loan information and whether the loan was a bad loan or not. Based on this dataset (which could actually be huge for a production dataset), I want the LLM to predict whether an upcoming loan request could result in a bad/good loan in the future.

2: This is the second use case and second dataset: https://huggingface.co/datasets/consumer-finance-complaints/viewer/default/train?row=0
This has consumer finance complaints against financial institutions. Based on this dataset, I want the LLM to summarise things like:
  • The consumers' biggest pain points
  • For a given financial institution, the most commonly occurring issue
3: This is the third use case and third dataset: https://huggingface.co/datasets/PolyAI/banking77
Given a labelled dataset of customer queries, I want the LLM to understand an upcoming live customer query over the phone (based on previous similar labels) and then respond correctly.

Are the use cases I listed above even solvable by LLMs today? Are LLMs mature enough to understand numeric relations (like in the first use case) and then make predictions? Is passing the datasets through llama_index good enough, or do we need to fine-tune an existing LLM for these use cases?

I am having a hard time wrapping my head around all this, hoping you could help? πŸ™
@Logan M Requesting you to clarify the above query please πŸ™ appreciate your help on this Logan
Yea ngl this is a lot to unpack πŸ˜…

For the first case, non-LLM models will be much better (i.e. a transformer-based classifier, like RoBERTa or similar)

For the second use case, llama-index and LLMs in general are a good fit. You could put things in a list index and ask it to summarise those facts. Or, use a more structured approach with a pydantic program, which can extract user-defined fields from text

Use case 3 could also maybe work with llama index, if you had a vector index to retrieve similar cases to base its reasoning off of πŸ€”
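A toy version of "retrieve similar labelled cases": given past queries with intent labels, find the closest stored query to a new one. The example queries and labels are made up in the style of banking77, and word overlap is a stand-in for the embedding similarity a real vector index would use:

```python
def most_similar_label(labelled: list[tuple[str, str]], new_query: str) -> str:
    """Return the label of the stored query most similar to new_query."""
    def overlap(text: str) -> int:
        return len(set(text.lower().split()) & set(new_query.lower().split()))
    _, label = max(labelled, key=lambda pair: overlap(pair[0]))
    return label

examples = [
    ("my card has not arrived yet", "card_arrival"),
    ("i want to top up my account", "top_up"),
]
print(most_similar_label(examples, "when will my card arrive"))  # → card_arrival
```

In the full setup, the retrieved examples (not just the label) would be handed to the LLM as context so it can reason about the live query before responding.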
I mentioned a lot of stuff here. I think it's probably a good idea to read our docs, and understand some more concepts.

I would start here
https://gpt-index.readthedocs.io/en/stable/getting_started/concepts.html
Thanks a ton @Logan M ! πŸ™