This code loads and embeds approximately 3000 markdown files, but embedding takes too long and building the index is slow. Is there a way to improve this?

Python
# Imports assume llama_index 0.9.x; paths differ in 0.10+.
from llama_index import SimpleDirectoryReader, Document, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter

# Load all markdown files and merge them into a single document.
documents = SimpleDirectoryReader("./markdown").load_data()
doc_text = "\n\n".join([d.get_content() for d in documents])
docs = [Document(text=doc_text)]

llm = OpenAI(model="gpt-3.5-turbo")

# Build one vector index per chunk size; the whole corpus is re-split
# and re-embedded for every chunk size.
chunk_sizes = [128, 256, 512, 1024]
nodes_list = []
vector_indices = []
for chunk_size in chunk_sizes:
    print(f"Chunk Size: {chunk_size}")
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 2)
    nodes = splitter.get_nodes_from_documents(docs)
    for node in nodes:
        # Tag each node with its chunk size, but exclude the tag from
        # embedding and LLM inputs.
        node.metadata["chunk_size"] = chunk_size
        node.excluded_embed_metadata_keys = ["chunk_size"]
        node.excluded_llm_metadata_keys = ["chunk_size"]
    nodes_list.append(nodes)
    vector_index = VectorStoreIndex(nodes)
    vector_indices.append(vector_index)
    print(vector_indices)
There should be an embed_batch_size kwarg on your embedder setup; HuggingFaceEmbedding has it, at least.
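For example, a minimal sketch (import path assumes llama_index 0.9.x; the model name and batch size are illustrative):

Python
from llama_index.embeddings import HuggingFaceEmbedding

# A larger embed_batch_size means fewer calls to the embedding model.
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # illustrative model choice
    embed_batch_size=50,                  # illustrative batch size
)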
What does that mean?
I didn't quite understand.
Also, I'm using OpenAI's embedding model.
If no embedding model is specified, the default OpenAI embedding model is used. Is there a way to speed up embedding while still using the OpenAI embedding model?
Not sure, I've only used custom embedders. I would assume it's in the docs somewhere if it's possible.
You can set OpenAI as the embedder manually and then set the params when you do.
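Something like this, as a rough sketch (assumes llama_index 0.9.x; embed_batch_size=50 is just an example value):

Python
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding

# Configure the OpenAI embedder explicitly with a larger batch size,
# then pass it through a service context when building the index.
embed_model = OpenAIEmbedding(embed_batch_size=50)
service_context = ServiceContext.from_defaults(embed_model=embed_model)
vector_index = VectorStoreIndex(nodes, service_context=service_context)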
Okay, is there any way to improve the above code other than embedding? Chunks and stuff like that.
Not sure, do you have timing details?
Since the number of files is huge, I would recommend using a vector store: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html#vector-store-options-feature-support

For batch size:

Python
embed_model = EmbedModel(..., embed_batch_size=50)
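For instance, a sketch with Chroma, one of the supported stores (assumes llama_index 0.9.x and the chromadb package; the path and collection name are illustrative):

Python
import chromadb
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Persist embeddings in a local Chroma collection so they are computed
# once instead of on every run.
db = chromadb.PersistentClient(path="./chroma_db")         # illustrative path
collection = db.get_or_create_collection("markdown_docs")  # illustrative name
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)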