
sadai
Hello guys! I have a question about managing a multi-level node structure within a vector database.

Imagine the following situation:

I have a markdown file with headings ranging from h1 down to (theoretically) h4-h5. I want to create a vector database that stores chunks of these .md files. My embedding model, bge-large, has a passage length limit of 512 tokens, so in order to embed my documents effectively I will probably need to split them.

I'm planning to split my documents recursively by heading sections: start with an h1 section; if it is longer than the embedding model's max length, split it into the h2 sections inside it, and so on until the chunk fits. If there are no deeper headings left, just fall back to something like TokenTextSplitter.
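To make this concrete, here is a rough sketch of the recursive heading split I have in mind (split_by_headings and recursive_split are made-up helper names, and token_len stands for whatever token counter the embedding model's tokenizer provides):

Plain Text
import re

def split_by_headings(md_text, level):
    """Split markdown text into the sections that start at the given heading level."""
    pattern = re.compile(rf"^{'#' * level} ", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(md_text)]
    if not starts:
        return [md_text]
    sections = []
    if starts[0] > 0:
        sections.append(md_text[:starts[0]])  # preamble before the first heading
    bounds = starts + [len(md_text)]
    sections += [md_text[a:b] for a, b in zip(bounds, bounds[1:])]
    return sections

def recursive_split(md_text, token_len, max_tokens=512, level=1, max_level=5):
    """Recurse into deeper heading levels until each chunk fits the embedding model."""
    if token_len(md_text) <= max_tokens:
        return [md_text]
    if level > max_level:
        # nothing deeper to split on -> hand this piece to TokenTextSplitter instead
        return [md_text]
    chunks = []
    for section in split_by_headings(md_text, level):
        chunks += recursive_split(section, token_len, max_tokens, level + 1, max_level)
    return chunks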

Now, since I want to be able to reconstruct my larger sections using parent node information, I would want to store the larger, unembeddable nodes in my collection as well (nodes without embeddings).

Is it possible to store nodes with and without embeddings side by side within one vector DB? Does that make sense at all, or are there better approaches I'm not aware of?

Or do I even need to store them at all? Maybe it is possible to reconstruct them from the child nodes using the parent node info?
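What I currently picture for the parent/child bookkeeping, as a rough sketch (full_h1_section and child_chunk are placeholders, and I'm not sure this is the intended pattern):

Plain Text
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
from llama_index.core.storage.docstore import SimpleDocumentStore

parent = TextNode(text=full_h1_section)   # too long to embed
child = TextNode(text=child_chunk)        # fits into the 512-token limit

# wire up the hierarchy so the parent can be looked up from the child later
child.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=parent.node_id)
parent.relationships[NodeRelationship.CHILD] = [RelatedNodeInfo(node_id=child.node_id)]

# keep the unembedded parents in a docstore, send only the children to the vector store
docstore = SimpleDocumentStore()
docstore.add_documents([parent])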

Appreciate any help!
3 comments
sadai

Chunk size

Hello guys! Has anyone ever tried to pass the tokenizer of a Hugging Face model to a TokenTextSplitter?

I tried to use the tokenizer of the BAAI/bge-large-en-v1.5 model and set chunk_size = 512, so I assumed the average chunk size in bge tokens would be somewhere near this value.

But it seems that's not the case -> Node statistics (tokens): min: 132, max: 230, mean: 215.989898989899, median: 216.0

My code:
Plain Text
from transformers import AutoTokenizer

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# embedding model and its tokenizer (BAAI/bge-large-en-v1.5)
embeddings = HuggingFaceEmbedding(model_name=embedding_model, device="cuda")
embeddings_tokenizer = AutoTokenizer.from_pretrained(embedding_model).encode

pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_size // 20,
            tokenizer=embeddings_tokenizer,
        ),
        # QuestionsAnsweredExtractor(questions=3, llm=llm),
        embeddings,
    ],
    vector_store=vector_store,
    # docstore=SimpleDocumentStore(),
)

nodes = pipeline.run(documents=documents, show_progress=True, num_workers=1)


What is my mistake? This seems to work as expected when I do not pass any tokenizer (i.e. when it falls back to the default one, tiktoken, I assume).
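For reference, this is roughly how I compute the node statistics above (a sketch; counting without special tokens is my own choice here):

Plain Text
import statistics
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
lengths = [len(tok.encode(n.get_content(), add_special_tokens=False)) for n in nodes]
print(f"Node statistics (tokens): min: {min(lengths)}, max: {max(lengths)}, "
      f"mean: {statistics.mean(lengths)}, median: {statistics.median(lengths)}")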
2 comments
sadai

Chat

Hello guys! I have a problem with ChatPromptTemplate when combined with QueryPipeline.

For some reason, it raises an error: "Input is not stringable" when processing the system message.

The code snippet is here:

Plain Text
from llama_index.core.llms import ChatMessage
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.query_pipeline import QueryPipeline

system_prompt = "You are a helpful assistant."

user_prompt = "Give me five reasons to try {query}."

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(role="system", content=system_prompt),
        ChatMessage(role="user", content=user_prompt),
    ]
)

pipeline = QueryPipeline(chain=[prompt, llm])

output = pipeline.run(query="diving")


It seems like QueryPipeline does not expect a ChatMessage to be passed? It is not included in the set of stringable types.

I checked the docs and only found mentions of PromptTemplate, with no examples for ChatPromptTemplate. Does this mean QueryPipeline cannot be used with ChatPromptTemplates, or do I have a mistake somewhere?

I tried a regular prompt and it seems to work, but I need to define a custom system prompt in order to use my model properly (Hermes 2 Pro uses a custom system prompt for JSON mode).
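A workaround I'm considering in the meantime (just a sketch, not verified): format the ChatPromptTemplate myself and call the LLM directly instead of putting the template inside the QueryPipeline:

Plain Text
# format the template into chat messages and skip the pipeline for this step
messages = prompt.format_messages(query="diving")
response = llm.chat(messages)
print(response.message.content)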
32 comments
Seems like it's fixed: https://github.com/run-llama/llama_index/pull/12273

But the fix is not included in the latest version, 0.10.23... so it seems we need to apply it manually for now.

UPD: it works after manually updating the files mentioned in the commit.
1 comment