Bros, is it possible to embed md files

Bros, is it possible to embed md files in llama index? And if there are about 3000 files... is it possible to embed all of these files?
yes sure, why not ๐Ÿ™‚
But how? Just with a for loop? Even so, it seems like it would cost a huge amount in API fees.
embeddings are very cheap. like, extremely cheap.
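To put "extremely cheap" in perspective, a back-of-the-envelope estimate (both numbers here are assumptions for illustration, not measured values: ~1,000 tokens per file, and ada-002-era pricing of roughly $0.0001 per 1K tokens):

```python
# Rough cost estimate for embedding 3000 markdown files.
# Assumed, not measured: avg tokens per file and price per 1K tokens.
num_files = 3000
avg_tokens_per_file = 1000      # assumption
price_per_1k_tokens = 0.0001    # assumption (ada-002-era pricing)

total_tokens = num_files * avg_tokens_per_file
cost = total_tokens / 1000 * price_per_1k_tokens
print(f"~{total_tokens:,} tokens -> ~${cost:.2f}")  # ~3,000,000 tokens -> ~$0.30
```

Under those assumptions, embedding the whole corpus costs well under a dollar.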

You just load the data and toss it into a vector index. With 3000 documents, I would use something like Qdrant or Chroma as the vector store though.
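The "load the data" step, independent of any vector store, amounts to walking the directory for .md files. A minimal plain-Python sketch (`find_markdown_files` is a hypothetical helper name, not a llama_index API):

```python
from pathlib import Path

def find_markdown_files(root: str) -> list:
    """Recursively collect all .md files under root, sorted for a stable order."""
    return sorted(Path(root).rglob("*.md"))
```

`SimpleDirectoryReader` does essentially this (plus parsing file contents into Document objects) when pointed at a directory.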
@Logan M This code loads and embeds approximately 3000 markdown files, but embedding takes too long and loading is slow. Is there a way to improve this?

Python
# Imports added for completeness; these paths assume a recent llama-index
# where the core classes live under llama_index.core.
from llama_index.core import Document, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

documents = SimpleDirectoryReader("./markdown").load_data()

# Join every file into one big Document, then re-split it below.
doc_text = "\n\n".join([d.get_content() for d in documents])
docs = [Document(text=doc_text)]

llm = OpenAI(model="gpt-3.5-turbo")

chunk_sizes = [128, 256, 512, 1024]
nodes_list = []
vector_indices = []
for chunk_size in chunk_sizes:
    print(f"Chunk Size: {chunk_size}")
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 2)
    nodes = splitter.get_nodes_from_documents(docs)
    for node in nodes:
        node.metadata["chunk_size"] = chunk_size
        node.excluded_embed_metadata_keys = ["chunk_size"]
        node.excluded_llm_metadata_keys = ["chunk_size"]
    nodes_list.append(nodes)
    vector_index = VectorStoreIndex(nodes)  # embeds every node via the OpenAI API
    vector_indices.append(vector_index)
print(vector_indices)
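One common speed-up for a corpus this size is sending texts to the embedding API in batches rather than one request per node (llama_index exposes this as an `embed_batch_size` option on its embedding models; check the version you're on). The batching idea itself, sketched in plain Python with a stub standing in for the real API call:

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed texts in fixed-size batches: fewer HTTP round trips than one call per text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
    return vectors

# Stub in place of a real embedding API; records batch sizes to show the effect.
calls = []
def fake_embed(batch):
    calls.append(len(batch))
    return [[0.0] * 3 for _ in batch]

vecs = embed_in_batches([f"doc {i}" for i in range(250)], fake_embed, batch_size=100)
print(len(vecs), len(calls))  # 250 vectors produced by only 3 calls
```

Note that the example script above also embeds the entire corpus four times (once per chunk size), so dropping chunk sizes you don't need is the other obvious win.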