Some preliminary thoughts about how to do it, but nothing concrete.
It's extremely complex (and uses a lot more resources than dense embeddings).
Things that need refactoring to support it:
- the node class assumes a single dense vector
- all our embedding model classes assume a single dense vector
- all our vector stores assume a single dense vector for retrieval
These are not easy things to fix.
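
To make the single-vector assumption concrete, here's a rough sketch. It is not code from the library. Every name in it is hypothetical, and I'm assuming the alternative is a multi-vector representation (several vectors per document) scored with a MaxSim-style late interaction, which this thread doesn't actually spell out.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query: Vector, doc: Vector) -> float:
    """Single-vector case: one vector per query and per document."""
    return dot(query, doc)


def multi_vector_score(query: List[Vector], doc: List[Vector]) -> float:
    """Assumed MaxSim-style scoring (hypothetical helper).

    For each query vector, take its best-matching document vector,
    then sum those best matches.
    """
    return sum(max(dot(q, d) for d in doc) for q in query)


def stored_floats(num_vectors: int, dim: int) -> int:
    """Number of floats stored per document for a given representation."""
    return num_vectors * dim
```

The two scoring functions take different input shapes (`Vector` vs `List[Vector]`), which is why the node, embedding model, and vector store layers would all need to change together. `stored_floats` shows where the extra resource use comes from: a document with 100 vectors of dimension 128 stores 100 times as many floats as a single 128-dimensional vector.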
Re: performance, I can't really comment. It largely depends on what your data looks like. IMO Cohere multimodal should be fine in most cases.