When you set up the metadata extractors, you should just need to pass in the LLM
i.e. KeywordExtractor(llm=llm)
Thank you Logan! What is the best approach to build an LLM helper for a wiki in my case?
I have up to 200,000 documents in several wiki spaces. One of the biggest is ~76,000.
Documents are business texts explaining technical details of business processes.
Sometimes, we have several versions of the process used for different clients.
I would like to build a system that answers questions most accurately.
I have a couple of lead business analysts who help me to evaluate the performance.
We want to build a system that can answer info retrieval, analysis, and interpretation kind of questions.
Today, I found that another group in the company had already created a similar system using Microsoft vector and full-text search tools. Their solution works ok.
I think a technical solution based on your framework could be better. Especially with metadata extractors. However, this approach seems to be very slow and a bit expensive.
What development approach would you recommend to do it?
Cost and time are also factors. I need to present a plan today that can show results in less than a week.
If my approach wins, I will make sure to share the result and write about it.
I agree, the current extractors are slow -- I don't think there's any way to speed it up right now though.
If you have a machine with CUDA available, I would try the entity extractor. But all other extractors will be way too slow for 200,000 documents in a week
I also want to add a keyword extractor that runs locally on CUDA this week
This week I only need 76,000, but that still doesn't look doable.
My question was actually about the approach in general. What should be done to beat simple vector plus full-text search?
The Microsoft approach is simple, costs less, and produces results that the BAs are almost OK with.
The source is a Confluence wiki with text, tables, and many attachments that I am not reading at the moment.
I think valid approaches beyond that depend on your data actually
It sounds like you want hybrid search, so I would definitely use a vector db that supports hybrid search (e.g. Weaviate, Pinecone, Elasticsearch, Postgres)
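To make the hybrid search idea a bit more concrete, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion. It is only an illustration and not necessarily what any of the databases above do internally. The doc ids, the two input rankings, and the constant k=60 are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list.

    Each doc gets 1 / (k + rank) from every list it appears in, so docs
    ranked well by both keyword and vector search end up near the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for stability.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one question from two separate searches.
keyword_hits = ["doc_billing_v2", "doc_billing_v1", "doc_invoices"]
vector_hits = ["doc_invoices", "doc_billing_v2", "doc_refunds"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

A doc that shows up in both lists (exact words and semantic match) outranks one that only shows up in a single list, which is roughly the behaviour you'd want from hybrid search.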
If your documents are organized into a few high-level topics, you can use multiple indexes with a sub-question query engine or router retriever
If you have a ton of different topics, you can use an object retriever on top of your indexes, and then use those in a sub-question query engine or router retriever. The idea here would be to reduce the search scope to only the most relevant sub-indexes using similarity across descriptions/summaries of those sub-indexes
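Here is a rough sketch of that "reduce the search scope" step in plain Python, just to show the idea. It is not the framework's object retriever: a word-count cosine similarity stands in for a real embedding model, and the index names, descriptions, and the pick_indexes helper are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pick_indexes(query, descriptions, top_k=2):
    """Return up to top_k sub-index names whose descriptions match the query."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(d)), name) for name, d in descriptions.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Drop sub-indexes with no overlap at all.
    return [name for score, name in scored[:top_k] if score > 0]


# Hypothetical sub-indexes, each summarised in one line.
descriptions = {
    "troubleshooting": "how to diagnose and fix errors in payment processing",
    "onboarding": "setup steps for new clients and new team members",
    "reporting": "monthly reports and dashboards for business analysts",
}

selected = pick_indexes("how do I fix a payment error", descriptions)
print(selected)
```

Only the selected sub-indexes would then be queried, so the expensive search runs over a small part of the wiki instead of all of it.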
Yes, I have a ton of different topics in the wiki. In this case I am not sure what kind of different indexes should be built in the same wiki space. The most relevant search would include both the exact words of the question and synonyms or variations.
Also, I think that in this case it may be beneficial to load the wiki tree structure: the links between the documents. Currently, to load Confluence, I have to download it to my computer and load the HTML from there.
What multiple indexes would you recommend in this case? What would be the optimal storage for all the retrieved data (docs, metadata, vectors)? Which of the ones you mentioned (Weaviate, Pinecone, Elasticsearch, Postgres) makes the most sense and is inexpensive?
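As a small illustration of loading the links between documents from downloaded HTML, here is a sketch using only the Python standard library. It assumes the exported pages are plain HTML with ordinary `<a href>` links; a real Confluence export may be laid out differently, and the sample page and file names are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in one HTML page, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical snippet of an exported wiki page.
page_html = """
<h1>Order process</h1>
<p>See <a href="order-process-client-a.html">the client A variant</a>
and <a href="billing.html">billing</a>.</p>
"""

links = extract_links(page_html)
print(links)
```

The collected links could be stored in each document's metadata so that related pages (for example, client-specific versions of a process) can be pulled in at query time.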
I think they all make sense tbh. Pinecone is cloud hosted only though, Weaviate can be local or cloud, Postgres is a local db, and I haven't used Elasticsearch yet but I know it's popular in enterprise
Personally, I think Weaviate probably has the best search ability, but the local setup might be annoying lol
Thank you! What about the indexes in my case? What would be the distinction between these indexes?
That's up to you I suppose, and how your data is organized and what is easiest
For example, I might have one index for troubleshooting documentation, and another index for onboarding documentation. Although you can see how this might not scale well if you have a lot of categories like this, which is why I suggested a higher-level filter to retrieve relevant indexes for a query