
Updated 8 months ago

Split by HTML header | 🦜🔗 LangChain

At a glance

The community member is looking for a LlamaIndex equivalent of LangChain's HTMLHeaderTextSplitter, because HTMLNodeParser gives unsatisfactory output on their HTML. They also tried wrapping the LangChain splitter with LangchainNodeParser, but that fails because LlamaIndex expects a list of strings while LangChain returns a list of Document objects.

The comments suggest a custom solution: either run the LangChain splitter and convert its output to LlamaIndex nodes/documents, or write a custom component for the IngestionPipeline. A link to the LlamaIndex documentation on implementing custom transformations is provided.

Hello, is there a LlamaIndex variant of LangChain's HTMLHeaderTextSplitter?

I have tried HTMLNodeParser, but the output I'm getting for my HTML is not great.

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/HTML_header_metadata/

I have tried wrapping it with LangchainNodeParser, but it fails because LlamaIndex expects a list of strings while LangChain returns a list of Document objects.
5 comments
If it's just splitting by tag, that sounds pretty easy to implement yourself (or just use the LangChain splitter and convert the output to llama-index nodes/documents).
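For the "implement it yourself" route, here is a minimal sketch using only the Python standard library. The names `HeaderSplitter` and `split_html_by_headers` are made up for illustration and are not part of LangChain or llama-index. It assumes reasonably well-formed HTML, tracks only `h1` to `h3` by default, and does not filter out `<script>` or `<style>` content.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect text chunks, each tagged with the headers it sits under."""

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = header_tags  # ordered from outermost to deepest
        self.current_headers = {}       # e.g. {"h1": "Intro", "h2": "Setup"}
        self.in_header = None           # tag name while inside a header element
        self.header_text = []           # text pieces of the header being read
        self.buffer = []                # text pieces of the current chunk
        self.chunks = []                # finished {"text", "metadata"} dicts

    def flush(self):
        # Joining pieces with spaces is a simplification: it also inserts a
        # space where an inline tag splits a word.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            self.chunks.append(
                {"text": text, "metadata": dict(self.current_headers)}
            )
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self.flush()
            # A new header replaces itself and every deeper header.
            level = self.header_tags.index(tag)
            for deeper in self.header_tags[level:]:
                self.current_headers.pop(deeper, None)
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            title = " ".join("".join(self.header_text).split())
            self.current_headers[tag] = title
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.header_text.append(data)
        else:
            self.buffer.append(data)


def split_html_by_headers(html, header_tags=("h1", "h2", "h3")):
    parser = HeaderSplitter(header_tags)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the trailing chunk after the last header
    return parser.chunks
```

Each returned dict carries a `text` and a `metadata` entry, so it should map directly onto a llama-index `Document(text=..., metadata=...)` if you want to feed the result into the rest of your pipeline.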
Agreed, I think I will go the custom route. I was initially looking for an option that would integrate nicely with the IngestionPipeline.

That being said, is there documentation regarding implementing custom text splitters in LlamaIndex?
Not a custom text splitter, but creating a custom component for a pipeline is super easy.