Bros, is it possible to embed md files

Bros, is it possible to embed md files in LlamaIndex? And if there are about 3000 files... is it possible to embed all of them?
yes sure, why not ๐Ÿ™‚
But how? Just with a for loop? Even so, it seems like it would cost a huge amount in API fees.
embeddings are very cheap. like, extremely cheap.
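Rough arithmetic (assuming ada-002 at about $0.0001 per 1K tokens and ~1K tokens per file): 3,000 files ≈ 3M tokens ≈ $0.30 to embed everything.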

You just load the data and toss it into a vector index. With 3000 documents, I would use something like qdrant or chroma though
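A minimal sketch of that with chroma as the store (it assumes the chromadb and llama-index-vector-stores-chroma packages are installed; the paths and collection name here are made up):

Plain Text
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load the markdown files from disk.
documents = SimpleDirectoryReader("./markdown").load_data()

# Keep the vectors in a persistent Chroma collection instead of in memory.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("markdown_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Chunk, embed, and insert everything in one call.
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
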
@Logan M This code loads and embeds approximately 3000 markdown files, but embedding takes too long and the files take a long time to load. Is there a way to improve this?

Plain Text
# Imports assume the llama-index >= 0.10 package layout.
from llama_index.core import Document, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# Load every markdown file, then merge them into one Document so each
# splitter below re-chunks the same combined text.
documents = SimpleDirectoryReader("./markdown").load_data()
doc_text = "\n\n".join([d.get_content() for d in documents])
docs = [Document(text=doc_text)]

# LLM for the later query step; it is not used while embedding.
llm = OpenAI(model="gpt-3.5-turbo")

# Build one vector index per chunk size, tagging each node with its
# chunk size while keeping that tag out of embeddings and LLM prompts.
chunk_sizes = [128, 256, 512, 1024]
nodes_list = []
vector_indices = []
for chunk_size in chunk_sizes:
    print(f"Chunk Size: {chunk_size}")
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 2)
    nodes = splitter.get_nodes_from_documents(docs)
    for node in nodes:
        node.metadata["chunk_size"] = chunk_size
        node.excluded_embed_metadata_keys = ["chunk_size"]
        node.excluded_llm_metadata_keys = ["chunk_size"]
    nodes_list.append(nodes)
    vector_index = VectorStoreIndex(nodes)
    vector_indices.append(vector_index)
print(vector_indices)
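
One likely speedup, as a sketch (it assumes the default OpenAI embedding model and the llama-index >= 0.10 Settings API): batch more chunks into each embedding request and show a progress bar. Note also that the loop above embeds the whole corpus once per chunk size, so four sizes means four times the embedding work.

Plain Text
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Larger batches mean fewer API round-trips; 2048 inputs is OpenAI's
# per-request ceiling (treat the exact number as an assumption).
Settings.embed_model = OpenAIEmbedding(embed_batch_size=2048)

# show_progress displays a progress bar while the nodes are embedded.
vector_index = VectorStoreIndex(nodes, show_progress=True)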