When moving data from sources like Slack

When moving data from sources like Slack or Notion to a vector database, how should I transform the data before embedding it to make it most useful for my LLM? Any guidance on chunking, adding metadata, or other transforms? Are you using any tools or frameworks for that, or are you writing most of the code yourself?

Some vector DBs like Weaviate offer embedding as part of the product, but doesn’t that limit me in terms of the transformations I can do beforehand? Is it a bad idea to lock in to a vendor’s pre-packaged embedding?
If you are using llama-index, it doesn't use the embeddings offered by the vector DB; it will use the embedding model that you've set up

Embeddings themselves are a little lock-in by nature: the queries and nodes all need to be embedded with the same model for the similarity search to work. If you switch embedding models, you need to re-embed everything
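
For example, roughly (legacy llama-index APIs from this era; import paths vary between versions, and the embedding model here is just an assumption):

```python
from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding

# placeholder data: in practice this comes from your Slack/Notion loaders
documents = [Document(text="example text pulled from Slack or Notion")]

# Whatever model you configure here embeds both the nodes at index time and
# the queries at search time, independent of any embedding feature the
# vector DB itself offers.
service_context = ServiceContext.from_defaults(embed_model=OpenAIEmbedding())

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```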
RE: processing data, it kind of depends. For Slack, I would personally organize it so that each thread is a document.

For Notion, it probably makes sense for each page to be its own document.

There's a lot you could add to metadata (author, timestamp), and you can control whether the metadata is used for embeddings, the LLM, both, or neither.

We also offer some metadata extraction here:
https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/index/metadata_extraction.html
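
A rough sketch of both ideas together, with placeholder Slack data (older llama-index releases call `metadata` `extra_info`, so adjust to your version):

```python
from llama_index import Document

# placeholder data: in practice this would come from the Slack API,
# with each thread's messages concatenated into one text blob
slack_threads = [
    {"text": "full thread text...", "author": "alice", "ts": "1694000000.000100", "channel": "#support"},
]

documents = [
    Document(
        text=thread["text"],
        metadata={
            "author": thread["author"],
            "timestamp": thread["ts"],
            "channel": thread["channel"],
        },
        # keep the timestamp out of the embedded text,
        # but still show it to the LLM at query time
        excluded_embed_metadata_keys=["timestamp"],
    )
    for thread in slack_threads
]
```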
Thanks for the answer! How do I handle running a batch job of extracting, chunking, and loading on a schedule, while making sure I won't re-process documents that are already in the vector store? Do I have to take care of that myself, or is there some tooling provided?
Mostly looking for best practices here 🙂
We have a refresh_ref_docs function that works if you set the doc_id of each document to something consistent. This helps avoid duplicate inserts

Buuuut it doesn't work for vector store integrations yet (it has to be a unique implementation for each vector store)

It's actually on my todo list to add some support this week for the popular vector dbs (weaviate, pinecone, qdrant, chroma)
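
In the meantime, here's roughly how refresh_ref_docs is meant to be used with the default local storage — the Notion loader below is a placeholder, and the doc_id comes straight from the source so it's stable across runs:

```python
from llama_index import Document, VectorStoreIndex

def load_notion_pages():
    # placeholder for your real Notion loader: yields (page_id, page_text)
    yield "8f2c1a", "page text..."

# doc_id is derived from the source, so it is identical on every run
docs = [
    Document(text=text, doc_id=f"notion-{page_id}")
    for page_id, text in load_notion_pages()
]

index = VectorStoreIndex.from_documents(docs)

# Later, on a schedule: rebuild `docs` from the source and call refresh.
# Docs whose doc_id already exists are skipped; docs whose content changed are updated.
refreshed = index.refresh_ref_docs(docs)
```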
Interesting! Where can I get a good consistent ID from? I was looking into building something like that, but the best I could come up with was something like "<id>_<chunk number>", where id is a consistent ID of the source document and chunk number is the index of the chunk within that document (which I had to split up because it was too big to fit). However, if I do this and a source document gets shorter, I could miss some trailing chunks
Oh and also then I would just have that running via a cron job or something similar?
For consistent doc_ids, you can also leverage the data source. You could use the page name from Notion. For Slack, I'm assuming there's a message/thread ID?

A cron job makes sense. I should note that by default, refresh_ref_docs will only insert doc_ids that aren't already inserted.

But, I'm thinking there probably needs to be an option to let it delete documents that are inserted, but not present in the new list of documents? 🤔

Also a cron job makes sense yes
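
Until that option exists, a rough sketch of a sync step you could call from the cron job (assumes the default docstore-backed storage, since ref_doc_info/delete_ref_doc rely on the docstore; build_documents is a placeholder for your own extract/chunk step):

```python
def sync(index, build_documents):
    """Insert/update fresh documents, then drop anything gone from the source."""
    docs = build_documents()          # your own extract/chunk step for Slack/Notion
    index.refresh_ref_docs(docs)      # insert new docs, update changed ones

    # delete doc_ids that are in the index but no longer present at the source
    fresh_ids = {d.doc_id for d in docs}
    for stale_id in set(index.ref_doc_info) - fresh_ids:
        index.delete_ref_doc(stale_id, delete_from_docstore=True)
```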
Yeah, there is some nuance to it, that's why I asked. Thanks anyway for your detailed answers.