Hey everyone, I have a general question: I'd like to fine-tune a model on data from several sources, which I'm loading with the llama-index loaders. What is the best way to save this data and fine-tune an LLM on it? Should I build a vector DB without embeddings and, once training starts, fetch the data from the DB? Or should I just save the data to GCS/S3 and load it from there? If the second option is the "correct" one, is there a built-in way to do that with llama-index?
I think if you are fine-tuning locally, then keeping the data on the same machine would be the fastest. If you are going to fine-tune on a different server, then you can put the training data in a DB and fetch it over there for fine-tuning.
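A minimal sketch of that DB approach, assuming the loaders give you back objects with text and metadata (the table name, column names, and dummy documents here are made up for illustration):

```python
import json
import sqlite3

# Stand-in for the output of a llama-index loader; real Document objects
# expose text and metadata, but plain dicts keep the sketch self-contained.
documents = [
    {"id": "doc-1", "text": "first document text", "metadata": {"source": "notion"}},
    {"id": "doc-2", "text": "second document text", "metadata": {"source": "gdrive"}},
]

conn = sqlite3.connect("training_data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, text TEXT, metadata TEXT)"
)
for doc in documents:
    conn.execute(
        "INSERT OR REPLACE INTO docs VALUES (?, ?, ?)",
        (doc["id"], doc["text"], json.dumps(doc["metadata"])),
    )
conn.commit()

# Later, on the training server, pull everything back out for fine-tuning.
rows = conn.execute("SELECT text, metadata FROM docs").fetchall()
training_texts = [text for text, _ in rows]
```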
I intend to use a different server. I was thinking of maybe creating an index without embeddings and simply retrieving the data from it. Is there a way to simply save documents to disk without creating an index?
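One way to do this without any index at all is to dump the documents to a JSONL file yourself and then copy that file wherever the training runs. A minimal sketch, assuming the loaded documents expose `text` and `metadata` attributes (that's how recent llama-index versions name them; older releases used `extra_info`, so check against your version):

```python
import json
from pathlib import Path

def save_documents(documents, path: str) -> None:
    """Write loaded documents to a JSONL file, one document per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for doc in documents:
            # Only the raw text and metadata are needed for fine-tuning,
            # so nothing index-related is stored.
            f.write(json.dumps({"text": doc.text, "metadata": doc.metadata}) + "\n")

def load_documents(path: str) -> list[dict]:
    """Read the JSONL file back into plain dicts on the training server."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

From there you could copy the JSONL file to GCS/S3 (e.g. with `gsutil cp` or `aws s3 cp`) and load it on the training server, with no index or DB in between.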