When moving data from sources like Slack

When moving data from sources like Slack or Notion to a vector database, how should I transform the data before embedding it to make it most useful for my LLM? Any guidance on chunking, adding metadata, or other transforms? Are you using any tools or frameworks for that, or are you writing most of the code yourself?

Some vector DBs like Weaviate offer embedding as part of the product, but doesn’t that limit me in terms of the transformations I can do beforehand? Is it a bad idea to lock in to a vendor’s pre-packaged embedding?
If you are using llama-index, it doesn't use the embeddings offered by the vector DB; it will use the embedding model that you've set up

Embeddings themselves are a little lock-in by nature: the queries and nodes all need to be embedded with the same model for the similarity search to work. If you switch embedding models, you need to re-embed everything
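
For example, roughly (legacy llama-index APIs from this era; import paths vary between versions, and the embedding model here is just an assumption):

```python
from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding

# placeholder data: in practice this comes from your Slack/Notion loaders
documents = [Document(text="example text pulled from Slack or Notion")]

# Whatever model you configure here embeds both the nodes at index time and
# the queries at search time, independent of any embedding feature the
# vector DB itself offers.
service_context = ServiceContext.from_defaults(embed_model=OpenAIEmbedding())

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```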
RE: processing data, it kind of depends. For Slack, I would personally organize it so that each thread is a document.

For Notion, it probably makes sense for each page to be its own document.

There's a lot you could add to metadata (author, timestamp), and you can control whether the metadata is used for embeddings, the LLM, both, or neither.

We also offer some metadata extraction here:
https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/index/metadata_extraction.html
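
A rough sketch of both ideas together, with placeholder Slack data (older llama-index releases call `metadata` `extra_info`, so adjust to your version):

```python
from llama_index import Document

# placeholder data: in practice this would come from the Slack API,
# with each thread's messages concatenated into one text blob
slack_threads = [
    {"text": "full thread text...", "author": "alice", "ts": "1694000000.000100", "channel": "#support"},
]

documents = [
    Document(
        text=thread["text"],
        metadata={
            "author": thread["author"],
            "timestamp": thread["ts"],
            "channel": thread["channel"],
        },
        # keep the timestamp out of the embedded text,
        # but still show it to the LLM at query time
        excluded_embed_metadata_keys=["timestamp"],
    )
    for thread in slack_threads
]
```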
Thanks for the answer! How do I handle running a batch job of extracting, chunking, and loading on a schedule, while making sure I won't re-process documents that are already in the vector store? Do I have to take care of that myself, or is there some tooling provided?
Mostly looking for best practices here 🙂
We have a refresh_ref_docs function that works if you set the doc_id of each document to something consistent. This helps avoid duplicate inserts

Buuuut it doesn't work for vector store integrations yet (it has to be a unique implementation for each vector store)

It's actually on my todo list to add some support this week for the popular vector dbs (weaviate, pinecone, qdrant, chroma)
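
In the meantime, here's roughly how refresh_ref_docs is meant to be used with the default local storage — the Notion loader below is a placeholder, and the doc_id comes straight from the source so it's stable across runs:

```python
from llama_index import Document, VectorStoreIndex

def load_notion_pages():
    # placeholder for your real Notion loader: yields (page_id, page_text)
    yield "8f2c1a", "page text..."

# doc_id is derived from the source, so it is identical on every run
docs = [
    Document(text=text, doc_id=f"notion-{page_id}")
    for page_id, text in load_notion_pages()
]

index = VectorStoreIndex.from_documents(docs)

# Later, on a schedule: rebuild `docs` from the source and call refresh.
# Docs whose doc_id already exists are skipped; docs whose content changed are updated.
refreshed = index.refresh_ref_docs(docs)
```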
Interesting! Where can I get a good consistent ID from? I was looking into building something like that, but the best I could come up with was something like "<id>_<chunk number>", where id is a consistent ID of the source document and chunk number is the index of the chunk within that document (which I had to split up because it was too big to fit). However, if I do this and a source document gets shorter, I could miss some trailing chunks
Oh and also then I would just have that running via a cron job or something similar?
For consistent doc_ids, you can also leverage the data source. You could use the page name from Notion. For Slack, I'm assuming there's a message/thread ID?

A cron job makes sense. I should note that by default, refresh_ref_docs will only insert doc_ids that aren't already inserted.

But, I'm thinking there probably needs to be an option to let it delete documents that are inserted, but not present in the new list of documents? 🤔

Also a cron job makes sense yes
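
Until that option exists, a rough sketch of a sync step you could call from the cron job (assumes the default docstore-backed storage, since ref_doc_info/delete_ref_doc rely on the docstore; build_documents is a placeholder for your own extract/chunk step):

```python
def sync(index, build_documents):
    """Insert/update fresh documents, then drop anything gone from the source."""
    docs = build_documents()          # your own extract/chunk step for Slack/Notion
    index.refresh_ref_docs(docs)      # insert new docs, update changed ones

    # delete doc_ids that are in the index but no longer present at the source
    fresh_ids = {d.doc_id for d in docs}
    for stale_id in set(index.ref_doc_info) - fresh_ids:
        index.delete_ref_doc(stale_id, delete_from_docstore=True)
```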
Yeah, there is some nuance to it, that's why I asked. Thanks anyway for your detailed answers.