Split by HTML header | 🦜🔗 LangChain

At a glance

The community member is looking for a LlamaIndex equivalent of LangChain's HTMLHeaderTextSplitter. They tried HTMLNodeParser, but its output was not satisfactory. They also tried wrapping the LangChain splitter with LangchainNodeParser, but that fails because LlamaIndex expects a list of strings while LangChain returns a list of Document objects.

The comments suggest a custom solution: either run the LangChain splitter and convert its output to LlamaIndex nodes/documents, or write a custom component for the IngestionPipeline. A link to the LlamaIndex documentation on custom transformations is provided.

gamecode8

Hello, is there a LlamaIndex variant of LangChain's HTMLHeaderTextSplitter?

I have tried HTMLNodeParser, but the output I'm getting for my HTML is not great.

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/HTML_header_metadata/

I have tried wrapping it with LangchainNodeParser, but it fails because LlamaIndex expects a list of strings while LangChain returns a list of Document objects.

5 comments

Logan M

If it's just splitting by tag, that sounds pretty easy to implement yourself (or just use the LangChain splitter and convert the output to llama-index nodes/documents).
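For the "implement it yourself" route, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not LangChain's implementation. The names `HEADER_TAGS`, `HeaderSplitter`, and `split_html_by_header` are hypothetical, and the choice to split on `h1` to `h3` is an assumption. Each section carries the headers that were active when its text appeared, which roughly mirrors the header metadata idea.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

# Assumption: split on these tags, listed from shallowest to deepest.
HEADER_TAGS = ("h1", "h2", "h3")


class HeaderSplitter(HTMLParser):
    """Collect text into sections, starting a new section at each header tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: List[Tuple[Dict[str, str], str]] = []
        self._headers: Dict[str, str] = {}      # currently active headers
        self._in_header: Optional[str] = None   # header tag being read, if any
        self._header_buf: List[str] = []
        self._text_buf: List[str] = []

    def flush(self) -> None:
        """Emit the buffered body text as a section (if there is any)."""
        text = " ".join(" ".join(self._text_buf).split())
        if text:
            self.sections.append((dict(self._headers), text))
        self._text_buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self.flush()  # a header closes the previous section
            self._in_header = tag
            self._header_buf = []

    def handle_endtag(self, tag):
        if tag == self._in_header:
            title = " ".join("".join(self._header_buf).split())
            # A new header replaces its own level and clears deeper ones.
            for level in HEADER_TAGS[HEADER_TAGS.index(tag):]:
                self._headers.pop(level, None)
            self._headers[tag] = title
            self._in_header = None

    def handle_data(self, data):
        if self._in_header:
            self._header_buf.append(data)
        else:
            self._text_buf.append(data)


def split_html_by_header(html: str) -> List[Tuple[Dict[str, str], str]]:
    """Return (active_headers, text) pairs in document order."""
    parser = HeaderSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit whatever follows the last header
    return parser.sections
```

Each `(headers, text)` pair can then be turned into a llama-index node or document, with `headers` as its metadata. This sketch does not filter `<script>` or `<style>` content, so real pages may need extra handling.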

gamecode8

Agreed, I think I will go the custom route. I was initially looking for an option that would integrate nicely with the IngestionPipeline.

That being said, is there documentation on implementing custom text splitters in LlamaIndex?

Logan M

Not a custom text splitter, but creating a custom component for a pipeline is super easy.

Logan M

https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/transformations/#custom-transformations
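A rough sketch of what such a custom pipeline component could look like, following the `TransformComponent` pattern described on the linked page. The helper `split_on_headers`, the factory `make_header_transform`, the class name, and the `"header"` metadata key are all hypothetical. The regex-based splitting is a deliberate simplification: it drops any text before the first header and ignores header nesting. llama-index is imported inside the factory so the splitting helper stays usable on its own.

```python
import re
from typing import List, Tuple

# Matches <h1>...</h1> through <h6>...</h6>, including attributes on the open tag.
HEADER_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def split_on_headers(html: str) -> List[Tuple[str, str]]:
    """Return (header_text, body_text) pairs, one per header found."""
    matches = list(HEADER_RE.finditer(html))
    sections = []
    for i, match in enumerate(matches):
        # The body runs from the end of this header to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        header = " ".join(TAG_RE.sub("", match.group(2)).split())
        body = " ".join(TAG_RE.sub(" ", html[match.end():end]).split())
        sections.append((header, body))
    return sections


def make_header_transform():
    """Build the pipeline component (requires llama-index to be installed)."""
    from llama_index.core.schema import TextNode, TransformComponent

    class HTMLHeaderTransform(TransformComponent):
        def __call__(self, nodes, **kwargs):
            out = []
            for node in nodes:
                sections = split_on_headers(node.get_content())
                if not sections:
                    out.append(node)  # no headers found: pass the node through
                    continue
                for header, body in sections:
                    metadata = {**node.metadata, "header": header}
                    out.append(TextNode(text=body, metadata=metadata))
            return out

    return HTMLHeaderTransform()
```

The returned component should be usable like any other transformation, for example `IngestionPipeline(transformations=[make_header_transform()])`, followed by `pipeline.run(documents=[...])` on documents whose text is raw HTML.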

gamecode8

Thanks Logan
