This code loads and embeds approximately 3000 markdown files, but embedding takes too long and building the index is slow. Is there a way to improve this?

Python
# Imports assume llama_index 0.9.x; paths differ in 0.10+.
from llama_index import SimpleDirectoryReader, Document, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter

# Load all markdown files and merge them into a single document.
documents = SimpleDirectoryReader("./markdown").load_data()
doc_text = "\n\n".join([d.get_content() for d in documents])
docs = [Document(text=doc_text)]

llm = OpenAI(model="gpt-3.5-turbo")

# Build one vector index per chunk size; the whole corpus is re-split
# and re-embedded for every chunk size.
chunk_sizes = [128, 256, 512, 1024]
nodes_list = []
vector_indices = []
for chunk_size in chunk_sizes:
    print(f"Chunk Size: {chunk_size}")
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 2)
    nodes = splitter.get_nodes_from_documents(docs)
    for node in nodes:
        # Tag each node with its chunk size, but exclude the tag from
        # embedding and LLM inputs.
        node.metadata["chunk_size"] = chunk_size
        node.excluded_embed_metadata_keys = ["chunk_size"]
        node.excluded_llm_metadata_keys = ["chunk_size"]
    nodes_list.append(nodes)
    vector_index = VectorStoreIndex(nodes)
    vector_indices.append(vector_index)
    print(vector_indices)
There should be an embed_batch_size kwarg on your embedder setup; HuggingFaceEmbedding has it, at least.
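For example, a minimal sketch (import path assumes llama_index 0.9.x; the model name and batch size are illustrative):

Python
from llama_index.embeddings import HuggingFaceEmbedding

# A larger embed_batch_size means fewer calls to the embedding model.
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # illustrative model choice
    embed_batch_size=50,                  # illustrative batch size
)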
What does that mean?
I didn't quite understand.
Also, I'm using OpenAI's embedding model.
If no embedding model is specified, the default OpenAI embedding model is used. Is there a way to speed up embedding while still using the OpenAI embedding model?
Not sure, I've only used custom embedders. I would assume it's in the docs somewhere if it's possible.
You can set OpenAI as the embedder manually and then set the params when you do.
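Something like this, as a rough sketch (assumes llama_index 0.9.x; embed_batch_size=50 is just an example value):

Python
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding

# Configure the OpenAI embedder explicitly with a larger batch size,
# then pass it through a service context when building the index.
embed_model = OpenAIEmbedding(embed_batch_size=50)
service_context = ServiceContext.from_defaults(embed_model=embed_model)
vector_index = VectorStoreIndex(nodes, service_context=service_context)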
Okay, is there any way to improve the above code other than embedding? Chunks and stuff like that.
Not sure, do you have timing details?
Since the number of files is huge, I would recommend using a vector store: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html#vector-store-options-feature-support

For batch size:

Python
embed_model = EmbedModel(..., embed_batch_size=50)
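For instance, a sketch with Chroma, one of the supported stores (assumes llama_index 0.9.x and the chromadb package; the path and collection name are illustrative):

Python
import chromadb
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Persist embeddings in a local Chroma collection so they are computed
# once instead of on every run.
db = chromadb.PersistentClient(path="./chroma_db")         # illustrative path
collection = db.get_or_create_collection("markdown_docs")  # illustrative name
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)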