
At a glance

The community member currently uses a system that processes various file types (JSON, Markdown, PDF, HTML, DOCX) and stores them in a Qdrant vector database. They are using a SentenceWindowNodeParser for all file types, but have discovered that LlamaIndex offers specialized node parsers for JSON, Markdown, and HTML. The community member plans to switch to using the specialized node parsers for those file types, while continuing to use SentenceWindowNodeParser for PDF and DOCX files.

The community member has two questions:

  1. Whether others foresee any issues with this approach
  2. Whether there is a way to make the query code Node Parser agnostic

In the comments, another community member suggests that the node parsers can be chained together, and provides an example of how to do this. They also mention that the node parsers are intended to read the raw file text (e.g., raw HTML).

Another community member comments that the query code is fine, and that the node_postprocessor will leave the content alone if the target key isn't found.

I currently use a system that processes JSON, Markdown, PDF, HTML, and DOCX files, storing them in a Qdrant vector database. The database is then queried in a separate session.

At the moment, I employ the following Node Parser for all file types:

Plain Text
from llama_index.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)


However, I've discovered that LlamaIndex offers specialized node parsers for JSON, Markdown, and HTML. Consequently, I plan to switch to MarkdownNodeParser, JSONNodeParser, and HTMLNodeParser for those respective formats, while continuing to use SentenceWindowNodeParser for PDF and DOCX files.
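
For reference, here is a rough sketch of how I picture the per-format routing (the extension mapping and the parser_for helper are illustrative, not my current code, and the imports assume a pre-0.10 llama_index, matching the ServiceContext usage below):

Plain Text
from pathlib import Path

from llama_index.node_parser import (
    HTMLNodeParser,
    JSONNodeParser,
    MarkdownNodeParser,
    SentenceWindowNodeParser,
)

# Specialized parsers for the structured formats.
FORMAT_PARSERS = {
    ".html": HTMLNodeParser.from_defaults(),
    ".json": JSONNodeParser.from_defaults(),
    ".md": MarkdownNodeParser.from_defaults(),
}

# PDF and DOCX keep the sentence-window parser from above.
sentence_window_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)


def parser_for(path: str):
    """Pick a node parser based on the file extension (illustrative helper)."""
    return FORMAT_PARSERS.get(Path(path).suffix.lower(), sentence_window_parser)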

I have two questions:

  1. Do you foresee any issues with this approach?
  2. My query code is as follows:
Plain Text
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.postprocessor import MetadataReplacementPostProcessor

service_context = ServiceContext.from_defaults(llm=llm,
                                               node_parser=node_parser,
                                               embed_model=embed_model)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                           service_context=service_context)
chat_engine = index.as_chat_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
    vector_store_kwargs={"qdrant_filters": filters})


This setup is specifically tailored to the SentenceWindowNodeParser due to the node_parser parameter in ServiceContext.from_defaults and the node_postprocessors configuration in index.as_chat_engine():

Plain Text
node_postprocessors=[
    MetadataReplacementPostProcessor(target_metadata_key="window")
],


Is there a way to make the query code Node Parser agnostic?
7 comments
actually, those node parsers are intended to be chained with other node parsers.

So you could chain an html node parser into sentence window 👀 But up to you

Also keep in mind those node parsers are intended to read the raw file text (i.e. raw HTML)
I think your query code is fine, the node_postprocessor will leave the content alone if the target key isn't found
Thanks @Logan M! Do you have a code example of creating a node parser by chaining two or more node parsers?

Regarding the query, the ServiceContext is needed at query time, and that requires a node parser. Could I skip passing the node parser there?
Plain Text
nodes = node_parser1(documents)
nodes = node_parser2(nodes)
ez pz πŸ™‚
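
Spelled out for the HTML-into-sentence-window case mentioned above, that chaining might look roughly like this (a sketch: page.html is a placeholder path, and the raw text is loaded by hand because these parsers expect the raw file contents):

Plain Text
from pathlib import Path

from llama_index import Document
from llama_index.node_parser import HTMLNodeParser, SentenceWindowNodeParser

# The HTML parser wants the raw HTML text, so load the file as-is.
raw_html = Path("page.html").read_text()
documents = [Document(text=raw_html)]

# First pass: split on HTML structure.
html_parser = HTMLNodeParser.from_defaults()
nodes = html_parser(documents)

# Second pass: add sentence-window metadata on top of the HTML nodes.
window_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = window_parser(nodes)
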
I think you could skip it there
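
Concretely, the node-parser-agnostic query setup could presumably look like this (a sketch; llm, embed_model, vector_store, and filters are the same objects as in the original snippet):

Plain Text
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.postprocessor import MetadataReplacementPostProcessor

# No node_parser here: the nodes were already built and stored at ingest time.
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                           service_context=service_context)
chat_engine = index.as_chat_engine(
    similarity_top_k=2,
    # Safe for non-sentence-window nodes too: if the "window" key is missing,
    # the postprocessor leaves the node content alone.
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
    vector_store_kwargs={"qdrant_filters": filters})
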
Awesome, thank you so much!