
Updated 6 months ago

Split by HTML header | 🦜🔗 LangChain

Hello, is there a LlamaIndex equivalent of LangChain's HTMLHeaderTextSplitter?

I have tried HTMLNodeParser, but the output I'm getting for my HTML is not great.

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/HTML_header_metadata/

I have also tried wrapping it with LangchainNodeParser, but that fails because LlamaIndex expects a list of strings while LangChain returns a list of Document objects.
5 comments
If it's just splitting by tag, that sounds pretty easy to implement yourself (or just use the LangChain splitter and convert the output to llama-index nodes/documents).
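For the "implement it yourself" route, here is a minimal sketch using only the standard library's `html.parser`. The names `HeaderSplitter`, `split_html_by_header`, and `HEADER_TAGS` are made up for illustration, and the choice of `h1` to `h3` as split points is an assumption. It does not try to skip `<script>` or `<style>` content, so treat it as a starting point rather than a drop-in replacement for the LangChain splitter.

```python
from html.parser import HTMLParser

HEADER_TAGS = ("h1", "h2", "h3")  # assumed set of headers to split on


class HeaderSplitter(HTMLParser):
    """Collect text sections, each tagged with the headers above it."""

    def __init__(self):
        super().__init__()
        self.sections = []      # list of {"text": str, "metadata": dict}
        self._headers = {}      # active header text per level, e.g. {"h1": "Guide"}
        self._in_header = None  # header tag currently being read, if any
        self._hbuf = []         # text chunks of the header being read
        self._buf = []          # text chunks of the current section

    def _flush(self):
        # Normalize whitespace and store the section if it has any text.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.sections.append({"text": text, "metadata": dict(self._headers)})
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._flush()  # a new header closes the previous section
            # A new header resets its own level and every deeper level.
            for t in HEADER_TAGS[HEADER_TAGS.index(tag):]:
                self._headers.pop(t, None)
            self._in_header = tag
            self._hbuf = []

    def handle_endtag(self, tag):
        if tag == self._in_header:
            self._headers[tag] = " ".join("".join(self._hbuf).split())
            self._in_header = None

    def handle_data(self, data):
        if self._in_header:
            self._hbuf.append(data)
        else:
            self._buf.append(data)


def split_html_by_header(html):
    parser = HeaderSplitter()
    parser.feed(html)
    parser.close()    # flush any text the parser is still buffering
    parser._flush()   # store the final section
    return parser.sections
```

Each returned section could then be wrapped in a llama-index document, for example `Document(text=s["text"], metadata=s["metadata"])` with `Document` imported from `llama_index.core`.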
Agreed, I think I will go the custom route. I was initially looking for an option that would integrate nicely with the IngestionPipeline.

That being said, is there any documentation on implementing custom text splitters in LlamaIndex?
Not for a custom text splitter specifically, but creating a custom component for a pipeline is super easy.
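As a rough illustration of what such a component can look like: in llama-index, pipeline steps subclass `TransformComponent` and implement `__call__(nodes, **kwargs)`, returning the transformed nodes. The sketch below is a hypothetical `StripHtmlTags` component, not something from the library. The regex is a crude assumption, and it assumes text nodes that expose `get_content()` and `set_content()`. The `try`/`except` around the import is only there so the snippet still loads where llama-index is not installed.

```python
import re

try:
    from llama_index.core.schema import TransformComponent
except ImportError:
    # Fallback so the sketch still imports without llama-index installed.
    TransformComponent = object

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, for illustration only


class StripHtmlTags(TransformComponent):
    """Hypothetical component: remove leftover tags and drop empty nodes."""

    def __call__(self, nodes, **kwargs):
        kept = []
        for node in nodes:
            # Replace tags with spaces, then collapse runs of whitespace.
            text = TAG_RE.sub(" ", node.get_content())
            text = " ".join(text.split())
            if text:
                node.set_content(text)
                kept.append(node)
        return kept
```

It would then be passed alongside the other steps, along the lines of `IngestionPipeline(transformations=[StripHtmlTags(), ...])`. A header splitter could be wrapped the same way, returning one new node per section instead of editing nodes in place.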