
Updated 3 months ago

I currently use a system that processes JSON, Markdown, PDF, HTML, and DOCX files, storing them in a Qdrant vector database. The database is then queried in a separate session.

At the moment, I employ the following Node Parser for all file types:

Plain Text
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)


However, I've discovered that LlamaIndex offers specialized node parsers for JSON, Markdown, and HTML. Consequently, I plan to switch to MarkdownNodeParser, JSONNodeParser, and HTMLNodeParser for those respective formats, while continuing to use SentenceWindowNodeParser for PDF and DOCX files.
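A minimal sketch of that routing, assuming a small helper that maps file extensions to parser factories. `pick_node_parser` and `parser_by_suffix` are hypothetical names; the actual LlamaIndex factories (e.g. `HTMLNodeParser.from_defaults`, `SentenceWindowNodeParser.from_defaults`) would be supplied from your own setup:

```python
from pathlib import Path

def pick_node_parser(path, parser_by_suffix, default_factory):
    """Return a node parser for the file at `path`.

    `parser_by_suffix` maps extensions to zero-argument factories, e.g.
    {".html": HTMLNodeParser.from_defaults,
     ".json": JSONNodeParser.from_defaults,
     ".md":   MarkdownNodeParser.from_defaults},
    with a SentenceWindowNodeParser factory as the fallback for PDF,
    DOCX, and anything else.
    """
    # Normalize the extension so "doc.MD" and "doc.md" route the same way.
    factory = parser_by_suffix.get(Path(path).suffix.lower(), default_factory)
    return factory()
```

This keeps the per-format decision in one place, so the ingestion code only calls `pick_node_parser(path, ...)` regardless of file type.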

I have two questions:

  1. Do you foresee any issues with this approach?
  2. My query code is as follows:
Plain Text
service_context = ServiceContext.from_defaults(llm=llm,
                                               node_parser=node_parser,
                                               embed_model=embed_model)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                           service_context=service_context)
chat_engine = index.as_chat_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
    vector_store_kwargs={"qdrant_filters": filters})


This setup is specifically tailored to the SentenceWindowNodeParser due to the node_parser parameter in ServiceContext.from_defaults and the node_postprocessors configuration in index.as_chat_engine():

Plain Text
node_postprocessors=[
    MetadataReplacementPostProcessor(target_metadata_key="window")
],


Is there a way to make the query code Node Parser agnostic?
7 comments
Actually, those node parsers are intended to be chained with other node parsers.

So you could chain an HTML node parser into a sentence window parser 👀 But it's up to you

Also keep in mind those node parsers are intended to read the raw file text (i.e. raw HTML)
I think your query code is fine, the node_postprocessor will leave the content alone if the target key isn't found
Thanks @Logan M ! Do you have any code example of creating a node parser by chaining two or more node parsers?

Regarding the query, the ServiceContext is needed at query time, and that requires a node parser. Could I skip passing the node parser there?
Plain Text
# e.g. node_parser1 = HTMLNodeParser.from_defaults()
#      node_parser2 = SentenceWindowNodeParser.from_defaults()
nodes = node_parser1(documents)  # first parser splits the raw documents
nodes = node_parser2(nodes)      # second parser re-splits the resulting nodes
ez pz πŸ™‚
I think you could skip it there
Awesome, thank you so much!