
Updated 12 months ago

How do I embed HTML data from files?


oosiworx

I'd like to embed HTML data from files. How would I do that?

8 comments

WhiteFang_Jr

You want to read from HTML files? Could you explain a bit more?

oosiworx

OK, I will export data from our Confluence and then load it into a vector store. The main issue is that the AI box cannot access Confluence, so I cannot run a simple crawler like the one I found in the docs. So the idea is to export that data into files and then read those files, while still having them handled as HTML files.

oosiworx

For my training project I just had text files with no HTML, and I did it like this: documents = SimpleDirectoryReader(subdir).load_data()

oosiworx

So now the content will be HTML, and I wonder how to get the files treated as HTML and not just as plain text.

WhiteFang_Jr

So basically you want to extract the main text from HTML files, right?

oosiworx

Yes, I'd like to get the same treatment as if I used something like this: documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)

WhiteFang_Jr

https://github.com/run-llama/llama_index/blob/4394c7f11e907c4a7c9926ae98eb53e6d60a1619/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/main_content_extractor/base.py#L40C13-L40C46

https://github.com/run-llama/llama_index/blob/4394c7f11e907c4a7c9926ae98eb53e6d60a1619/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/simple_web/base.py#L61

I found these two. You can write your own reader, using the existing readers as a guide: get the text from each file and pass it to either of the two libraries used at the lines linked above.
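A rough sketch of what such a reader could look like, not a definitive implementation. It assumes the exported files end in .html or .htm and are UTF-8 encoded. The names html_to_text and load_html_dir are made up for this example. It uses html2text (the library the linked SimpleWebPageReader line calls) when it is installed, and otherwise falls back to a much cruder tag stripper built on Python's standard html.parser, which is a stand-in and not what LlamaIndex does.

```python
from html.parser import HTMLParser
from pathlib import Path

try:
    import html2text  # third-party; optional here
except ImportError:
    html2text = None


class _TextExtractor(HTMLParser):
    """Stdlib fallback: collect visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert an HTML string to text (hypothetical helper for this sketch)."""
    if html2text is not None:
        return html2text.html2text(html)
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def load_html_dir(directory):
    """Return one {"text", "metadata"} dict per .html/.htm file under directory."""
    records = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in (".html", ".htm"):
            html = path.read_text(encoding="utf-8", errors="replace")
            records.append(
                {"text": html_to_text(html), "metadata": {"file_name": path.name}}
            )
    return records
```

Each record could then be wrapped in a LlamaIndex Document, for example Document(text=rec["text"], metadata=rec["metadata"]), and indexed the same way as the documents returned by SimpleDirectoryReader. To get the main-content extraction from the first link instead, html_to_text would be the place to swap that library in.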

oosiworx

Oh cool, thank you very much 🙂
