Hey everyone, I have a general question: I'd like to fine-tune a model on data from several sources, which I'm loading with the llama-index loaders. What is the best way to save this data and fine-tune an LLM on it? Should I build a vector DB without embeddings and, once training starts, fetch the data from the DB? Or should I just save the data to GCS/S3 and load it from there? If the second option is the "correct" one, is there a built-in way to do that with llama-index?
I think if you are fine-tuning locally, then keeping the data on the same machine would be the fastest. If you are going to fine-tune on a different server, then you can put the training data in a DB and fetch it over there for fine-tuning.
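A minimal sketch of that DB approach, assuming the loaders give you back objects with text and metadata (the table name, column names, and dummy documents here are made up for illustration):

```python
import json
import sqlite3

# Stand-in for the output of a llama-index loader; real Document objects
# expose text and metadata, but plain dicts keep the sketch self-contained.
documents = [
    {"id": "doc-1", "text": "first document text", "metadata": {"source": "notion"}},
    {"id": "doc-2", "text": "second document text", "metadata": {"source": "gdrive"}},
]

conn = sqlite3.connect("training_data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, text TEXT, metadata TEXT)"
)
for doc in documents:
    conn.execute(
        "INSERT OR REPLACE INTO docs VALUES (?, ?, ?)",
        (doc["id"], doc["text"], json.dumps(doc["metadata"])),
    )
conn.commit()

# Later, on the training server, pull everything back out for fine-tuning.
rows = conn.execute("SELECT text, metadata FROM docs").fetchall()
training_texts = [text for text, _ in rows]
```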
I intend to use a different server. I was thinking of maybe creating an index without embeddings and simply retrieving the data from it. Is there a way to simply save documents to disk without creating an index?
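One way to do this without any index at all is to dump the documents to a JSONL file yourself and then copy that file wherever the training runs. A minimal sketch, assuming the loaded documents expose `text` and `metadata` attributes (that's how recent llama-index versions name them; older releases used `extra_info`, so check against your version):

```python
import json
from pathlib import Path

def save_documents(documents, path: str) -> None:
    """Write loaded documents to a JSONL file, one document per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for doc in documents:
            # Only the raw text and metadata are needed for fine-tuning,
            # so nothing index-related is stored.
            f.write(json.dumps({"text": doc.text, "metadata": doc.metadata}) + "\n")

def load_documents(path: str) -> list[dict]:
    """Read the JSONL file back into plain dicts on the training server."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

From there you could copy the JSONL file to GCS/S3 (e.g. with `gsutil cp` or `aws s3 cp`) and load it on the training server, with no index or DB in between.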