i see in the tgis server logs what looks like a request to summarize the summaries
It builds a bottom up tree of summaries
well i thought the purpose of this summary exercise was to generate summaries of each document to help improve index retrieval or whatever
Correct -- and tree summarize is the best way to do this
A bottom-up tree will do the best job of capturing details from across the entire document, especially if the document/folder is large and doesn't fit into the LLM's context window
So if we create summaries of each folder, we can do recursive retrieval based on those summaries
As a way of narrowing the retrieval scope when answering a query
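e.g. a minimal sketch with llama_index's DocumentSummaryIndex (the ./data/troubleshooting path is a stand-in for your layout):

from llama_index import (
    DocumentSummaryIndex,
    SimpleDirectoryReader,
    get_response_synthesizer,
)

# load one folder's worth of docs
documents = SimpleDirectoryReader('./data/troubleshooting').load_data()

# tree_summarize is the bottom-up tree you're seeing in the TGIS logs:
# summarize the chunks, then summarize the summaries
response_synthesizer = get_response_synthesizer(response_mode='tree_summarize')

# one summary per document; at query time retrieval matches the query
# against the summaries first, then drills into the matching docs' chunks
index = DocumentSummaryIndex.from_documents(
    documents, response_synthesizer=response_synthesizer
)
query_engine = index.as_query_engine()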
ah, yeah, that's a problem, because the summaries are all going to be "this is troubleshooting information"
and/or each single document would need to be in its own folder
does the summarization happen at the directory level and not the individual file level? even so, these summaries are pretty weaksauce
basically it takes all the documents you gave it and attempts to summarize them. So I think that should be a whole folder?
My bet is indeed falcon sucks haha
I would expect the summaries to say something like "Information on troubleshooting X, Y, and Z" with some specifics, in order to be helpful
is 7B the biggest model you can run?
i only have access to a ~20G GPU at present
The only other Falcon option is 40b which didn't fit
unless there's some FP option I can pass to the TGIS server
but like I said, this whole folder is troubleshooting stuff
so if you summarize a folder of 50 troubleshooting documents, the summary is just "troubleshooting documents"
The text is a general troubleshooting guide for troubleshooting issues with Red Hat OpenShift containers in a cluster environment. It is most useful for understanding the resolution process and the steps taken to address the issue.
^^ yes, I already knew that, because this entire repo is "a troubleshooting guide for OpenShift"
so this type of folder-level summarization isn't helpful (with the current organization)
if I have to go read all the docs and reorganize them into different folders, at that point i've made these documents easy to browse, so what value does llm/chat serve then?
i'm really just trying to understand here
oh, I thought this already had some folder organization
Thus far I have failed at:
- whole documents
- sentence window
- summarization
😅
there are 32 documents in the troubleshooting folder
hey, great, i already know the troubleshooting folder contains troubleshooting documents 😂
well if there's only 32, then we could do it per-document then 🤔
the entire repo looks like it has ~400 docs. 32 are troubleshooting. ~60 are general knowledgebase, etc etc
but even when i tried full docs on ONLY the troubleshooting folder, the answers were bad
i'm about to try mpt-7b-instruct for giggles
But I guess it's good to narrow down the cause of the bad answers
- falcon is 💩?
- Should we be using a better embedding model?
these documents are also "weird"
code/yaml/cli samples, markdown, just all over the place
i tried to get some sample "questions" from the SRE team
this looks like it should be easy to answer
How do I get the SSH key for a cluster from XX?
i'll just try it against different models
maybe we need better parsers for these file types too, to help with ingestion 🙂
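(on the parser side: SimpleDirectoryReader has a file_extractor hook to route extensions to format-aware readers -- rough sketch below, and note the MarkdownReader import path varies across llama_index versions:)

from llama_index import SimpleDirectoryReader
# NOTE: import path differs across llama_index versions
from llama_index.readers.file.markdown_reader import MarkdownReader

# parse .md files with a markdown-aware reader instead of as plain text
documents = SimpleDirectoryReader(
    './data',
    file_extractor={'.md': MarkdownReader()},
).load_data()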
On the embedding side, I think I saw you were using mpnet-base-v2, which tbh is a bad model
Could try setting something like embed_model='local:BAAI/bge-base-en' in the service context
embed_model is set to the HF default:
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
are you saying the default is that mpnet and it's 💩?
it's like... bottom of the leaderboard 😅
bge-base is pretty good. The jump to large is not worth the increased model size imo
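concretely, something like this (a sketch -- wire the service_context into however you're building the index):

from llama_index import ServiceContext, VectorStoreIndex

# the 'local:' prefix tells llama_index to run the HF embedding model in-process
service_context = ServiceContext.from_defaults(
    embed_model='local:BAAI/bge-base-en'
)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)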
so this is a weird question that i don't know how to phrase correctly -- is there a way to make the embedding process run "over there" on the GPU (via TGIS?) or does that always run "locally"
ah none of these are that huge tho
yea embeddings are pretty tiny -- TGIS miiight have embedding support, but tbh I haven't looked into it
python will gladly completely destroy my computer when it tries to run the llm on cpu
like legit totally locks the machine
There is also a smaller version of bge too if it is also locking your machine 🙂
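or pin the local embedder to the GPU -- langchain's wrapper passes model_kwargs through to sentence-transformers (a sketch, assuming the same wrapper you already have):

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(
        model_name='BAAI/bge-base-en',
        # run the embedding model on the GPU instead of locking up the CPU
        model_kwargs={'device': 'cuda'},
    )
)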
nah this baai/bge is working
INFO:llama_index.vector_stores.redis:Creating index pg_essays
Creating index pg_essays
INFO:llama_index.vector_stores.redis:Added 287 documents to index pg_essays
Added 287 documents to index pg_essays
Done indexing!
Nice! The embeddings are helping then 🙂
mpt/embeddings seems to be working better
answers are curt but that's ok
just need to figure out how to store the filename in the index so that i can display the "sourced" files back to the user
that's probably the metadata stuff
Yup exactly. If you are using SimpleDirectoryReader, there's a neat trick for this
from llama_index import SimpleDirectoryReader
filename_fn = lambda filename: {'file_name': filename}
# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader('./data', file_metadata=filename_fn).load_data()
Then response.source_nodes[0].node.metadata
will have it for example
basically inserts a metadata hook based on the filename
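so displaying the sources back to the user is just (sketch; query_engine being whatever engine you built over the index):

response = query_engine.query('How do I get the SSH key for a cluster?')

# each source node carries the metadata dict that filename_fn set at ingest time
for source in response.source_nodes:
    print(source.node.metadata['file_name'])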
reindexing and trying again