I just can't reliably get the right context for the LLM. This is my current code:
from llama_index import (ServiceContext, SimpleDirectoryReader,
                         VectorStoreIndex, download_loader, set_global_service_context)
from langchain.embeddings import HuggingFaceBgeEmbeddings

# embedding model + LLM wired into the global service context
embed_model = HuggingFaceBgeEmbeddings(model_name="sentence-transformers/gtr-t5-xxl")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)

# parse the exported Confluence HTML pages with the unstructured loader
UnstructuredReader = download_loader('UnstructuredReader')
dir_reader = SimpleDirectoryReader('./data', file_extractor={
    ".html": UnstructuredReader(),
})
documents = dir_reader.load_data()

index = VectorStoreIndex.from_documents(documents, show_progress=True)
Sometimes the context is good, sometimes it's completely off-topic.
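This is the kind of check I mean, using the index from the code above (the query string and top_k are just examples):

# dump the chunks the retriever pulls for a test question, to eyeball relevance
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("How do I request VPN access?")  # example query
for n in nodes:
    print(n.score, n.node.get_content()[:200])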
My entire dataset is HTML exported from a company Confluence.
I went through https://huggingface.co/spaces/mteb/leaderboard and tried most of the top multilingual models. Some are better, some are worse, but they all fail to find relevant context in 70% of the cases.
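By "fail" I mean the right page doesn't show up among the retrieved chunks; a rough sketch of how I measure that, with made-up question/page pairs as examples:

# rough hit-rate check: is the expected page among the retrieved chunks?
test_cases = [  # hand-labelled pairs, made up here as examples
    ("How do I request VPN access?", "VPN Access Guide"),
    ("Who approves travel expenses?", "Travel Policy"),
]
retriever = index.as_retriever(similarity_top_k=5)
hits = 0
for question, expected_page in test_cases:
    nodes = retriever.retrieve(question)
    if any(expected_page in n.node.get_content() for n in nodes):
        hits += 1
print(f"hit rate: {hits / len(test_cases):.0%}")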
What else can I try to improve accuracy?
Edit: just found a benchmark specifically for German, will test the top model there:
https://github.com/ClimSocAna/tecb-debut
The question still stands though: any other ways to improve?