I currently use a system that processes JSON, Markdown, PDF, HTML, and DOCX files, storing them in a Qdrant vector database. The database is then queried in a separate session.
At the moment, I use the following node parser for all file types:
```python
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
```
However, I've discovered that LlamaIndex offers specialized node parsers for JSON, Markdown, and HTML. Consequently, I plan to switch to MarkdownNodeParser, JSONNodeParser, and HTMLNodeParser for those respective formats, while continuing to use SentenceWindowNodeParser for PDF and DOCX files.
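To make the per-format plan concrete, here is a minimal sketch of the routing idea: pick a parser based on the file extension, falling back to the sentence-window parser for everything else. The helper name `pick_parser` and the use of strings in place of the actual LlamaIndex parser classes are my own illustration, not part of the original setup; in real code the dictionary values would be parser factories such as `lambda: MarkdownNodeParser()`.

```python
from pathlib import Path

# Hypothetical routing table: file extension -> name of the LlamaIndex
# node parser to instantiate. In real code, store factories instead of
# strings, e.g. ".md": lambda: MarkdownNodeParser().
PARSER_BY_SUFFIX = {
    ".json": "JSONNodeParser",
    ".md": "MarkdownNodeParser",
    ".html": "HTMLNodeParser",
    ".htm": "HTMLNodeParser",
    # PDF and DOCX keep the sentence-window parser
    ".pdf": "SentenceWindowNodeParser",
    ".docx": "SentenceWindowNodeParser",
}

def pick_parser(path: str) -> str:
    """Return the parser name for a file, defaulting to the sentence-window parser."""
    return PARSER_BY_SUFFIX.get(Path(path).suffix.lower(), "SentenceWindowNodeParser")
```

One design point worth noting: the lookup is case-insensitive on the suffix, and unknown formats silently fall back to `SentenceWindowNodeParser`, which matches the current behavior of parsing everything with one parser.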
I have two questions:
- Do you foresee any issues with this approach?
- My query code is as follows:
```python
service_context = ServiceContext.from_defaults(
    llm=llm,
    node_parser=node_parser,
    embed_model=embed_model,
)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)
chat_engine = index.as_chat_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
    vector_store_kwargs={"qdrant_filters": filters},
)
```
This setup is specifically tailored to the SentenceWindowNodeParser, because of the `node_parser` parameter in `ServiceContext.from_defaults` and the `node_postprocessors` configuration in `index.as_chat_engine()`:

```python
node_postprocessors=[
    MetadataReplacementPostProcessor(target_metadata_key="window")
],
```
Is there a way to make the query code Node Parser agnostic?