
sadai
Hello guys! I have a question about managing a multi-level node structure within a vector database.

Imagine the following situation:

I have a markdown file with headings ranging from h1 down to (theoretically) h4-h5. I want to create a vector database that stores chunks of these .md files. My embedding model, bge-large, has a passage length limit of 512 tokens, so in order to embed my documents effectively I will probably need to split them.

I'm planning to split my documents recursively by heading sections: start with an h1 section; if it is longer than the embedding model's max length, split it into the h2 sections inside it, and so on until the chunk fits. If there are no deeper headings left, just fall back to something like TokenTextSplitter.
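To make this concrete, here is a rough sketch of the recursive heading split I have in mind (split_by_headings and recursive_split are made-up helper names, and token_len stands for whatever token counter the embedding model's tokenizer provides):

Plain Text
import re

def split_by_headings(md_text, level):
    """Split markdown text into the sections that start at the given heading level."""
    pattern = re.compile(rf"^{'#' * level} ", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(md_text)]
    if not starts:
        return [md_text]
    sections = []
    if starts[0] > 0:
        sections.append(md_text[:starts[0]])  # preamble before the first heading
    bounds = starts + [len(md_text)]
    sections += [md_text[a:b] for a, b in zip(bounds, bounds[1:])]
    return sections

def recursive_split(md_text, token_len, max_tokens=512, level=1, max_level=5):
    """Recurse into deeper heading levels until each chunk fits the embedding model."""
    if token_len(md_text) <= max_tokens:
        return [md_text]
    if level > max_level:
        # nothing deeper to split on -> hand this piece to TokenTextSplitter instead
        return [md_text]
    chunks = []
    for section in split_by_headings(md_text, level):
        chunks += recursive_split(section, token_len, max_tokens, level + 1, max_level)
    return chunks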

Now, since I want to be able to reconstruct my larger sections using parent node information, I would want to store the larger, unembeddable nodes in my collection as well (nodes without embeddings).

Is it possible to store nodes with and without embeddings side by side within one vector DB? Does that make sense at all, or are there better approaches I'm not aware of?

Or do I even need to store them at all? Maybe it is possible to reconstruct them from the child nodes using the parent node info?
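What I currently picture for the parent/child bookkeeping, as a rough sketch (full_h1_section and child_chunk are placeholders, and I'm not sure this is the intended pattern):

Plain Text
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
from llama_index.core.storage.docstore import SimpleDocumentStore

parent = TextNode(text=full_h1_section)   # too long to embed
child = TextNode(text=child_chunk)        # fits into the 512-token limit

# wire up the hierarchy so the parent can be looked up from the child later
child.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=parent.node_id)
parent.relationships[NodeRelationship.CHILD] = [RelatedNodeInfo(node_id=child.node_id)]

# keep the unembedded parents in a docstore, send only the children to the vector store
docstore = SimpleDocumentStore()
docstore.add_documents([parent])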

Appreciate any help!
3 comments
sadai

Chunk size

Hello guys! Has anyone ever tried to pass the tokenizer of a Hugging Face model to a TokenTextSplitter?

I tried to use the tokenizer of the BAAI/bge-large-en-v1.5 model and set chunk_size = 512, so I assumed the average chunk size in bge tokens would be somewhere near this value.

But it seems that's not the case -> Node statistics (tokens): min: 132, max: 230, mean: 215.989898989899, median: 216.0

My code:
Plain Text
from transformers import AutoTokenizer

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# embedding model and its tokenizer (BAAI/bge-large-en-v1.5)
embeddings = HuggingFaceEmbedding(model_name=embedding_model, device="cuda")
embeddings_tokenizer = AutoTokenizer.from_pretrained(embedding_model).encode

pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_size // 20,
            tokenizer=embeddings_tokenizer,
        ),
        # QuestionsAnsweredExtractor(questions=3, llm=llm),
        embeddings,
    ],
    vector_store=vector_store,
    # docstore=SimpleDocumentStore(),
)

nodes = pipeline.run(documents=documents, show_progress=True, num_workers=1)


What is my mistake? This seems to work as expected when I do not pass any tokenizer (i.e. when it falls back to the default one, tiktoken, I assume).
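For reference, this is roughly how I compute the node statistics above (a sketch; counting without special tokens is my own choice here):

Plain Text
import statistics
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
lengths = [len(tok.encode(n.get_content(), add_special_tokens=False)) for n in nodes]
print(f"Node statistics (tokens): min: {min(lengths)}, max: {max(lengths)}, "
      f"mean: {statistics.mean(lengths)}, median: {statistics.median(lengths)}")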
2 comments
sadai

Chat

Hello guys! I have a problem with ChatPromptTemplate when combined with QueryPipeline.

For some reason, it raises an error: "Input is not stringable" when processing the system message.

The code snippet is here:

Plain Text
from llama_index.core.llms import ChatMessage
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.query_pipeline import QueryPipeline

system_prompt = "You are a helpful assistant."

user_prompt = "Give me five reasons to try {query}."

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(role="system", content=system_prompt),
        ChatMessage(role="user", content=user_prompt),
    ]
)

pipeline = QueryPipeline(chain=[prompt, llm])

output = pipeline.run(query="diving")


It seems like QueryPipeline does not expect a ChatMessage to be passed? It is not included in the set of stringable types.

I checked the docs and only found mentions of PromptTemplate, with no examples for ChatPromptTemplate. Does this mean QueryPipeline cannot be used with ChatPromptTemplates, or do I have a mistake somewhere?

I tried a regular prompt and it seems to work, but I need to define a custom system prompt in order to use my model properly (Hermes 2 Pro uses a custom system prompt for JSON mode).
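A workaround I'm considering in the meantime (just a sketch, not verified): format the ChatPromptTemplate myself and call the LLM directly instead of putting the template inside the QueryPipeline:

Plain Text
# format the template into chat messages and skip the pipeline for this step
messages = prompt.format_messages(query="diving")
response = llm.chat(messages)
print(response.message.content)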
32 comments
Seems like it's fixed: https://github.com/run-llama/llama_index/pull/12273

But the fix is not included in the latest version, 0.10.23... so it seems we need to apply it manually for now.

UPD: it works after manually updating the files mentioned in the commit.
1 comment