HTML parsing

Hello, I have a question about loading HTML files. I'm following the tutorial here (https://github.com/jerryjliu/llama_index/blob/main/examples/chatbot/Chatbot_SEC.ipynb), but with my own HTML file. However, I'm getting this error for some HTML files:

Plain Text
INFO:unstructured:Reading document from string ...
INFO:unstructured:Reading document ...
Traceback (most recent call last):
  File "/Users/user/crawl/index.py", line 14, in <module>
    html = loader.load_data(file=Path(f'./output1.html'))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/crawl/venv/lib/python3.11/site-packages/llama_index/readers/llamahub_modules/file/unstructured/base.py", line 36, in load_data
    elements = partition(str(file))
               ^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/partition/auto.py", line 86, in partition
    elements = partition_html(
               ^^^^^^^^^^^^^^^
  File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/partition/html.py", line 85, in partition_html
    layout_elements = document_to_element_list(document, include_page_breaks=include_page_breaks)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/partition/common.py", line 71, in document_to_element_list
    num_pages = len(document.pages)
                    ^^^^^^^^^^^^^^
  File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/documents/xml.py", line 52, in pages
    self._pages = self._read()
                  ^^^^^^^^^^^^
  File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 101, in _read
    etree.strip_elements(self.document_tree, ["script"])
  File "src/lxml/cleanup.pxi", line 100, in lxml.etree.strip_elements
  File "src/lxml/apihelpers.pxi", line 41, in lxml.etree._documentOrRaise
TypeError: Invalid input object: NoneType
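
One possible workaround (just a sketch, not from the thread): skip unstructured entirely, extract the text yourself with BeautifulSoup, and build a Document directly. This assumes the 0.5-era llama_index API where Document takes raw text:

Plain Text
from pathlib import Path

from bs4 import BeautifulSoup
from llama_index import Document, GPTSimpleVectorIndex

# Extract plain text from the problematic HTML file ourselves,
# bypassing unstructured's partition_html entirely
html = Path("./output1.html").read_text(encoding="utf-8")
text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

# Wrap the extracted text in a Document and index it as usual
index = GPTSimpleVectorIndex.from_documents([Document(text)])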
No worries! I will try to clarify πŸ˜…

So when you index a bunch of documents, they get broken into chunks and embeddings are generated for each chunk. They are chunked using the node parser, which has a default chunk_size_limit of 3900 and a default overlap of 200. If you set chunk_size_limit directly in the service context, then that will become the chunk size limit for this step

Then, during queries, the index retrieves the top k chunks of text (assuming you have a vector index here). If the text from the chunk + the prompt template + the query is bigger than max_input_size minus num_output, it breaks up the text into multiple chunks. This process is controlled by the prompt helper settings.

If top k is bigger than one, it refines an answer over the chunks. After getting a response from the first chunk, it sends the next chunk + prompt template + query + previous answer to the LLM to get an updated answer

So... pretty complicated haha llama index is always trying to make sure the text sent to the LLM isn't too big
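
To make those settings concrete, here's a minimal sketch of where chunk_size_limit and the prompt helper settings plug in, assuming the 0.5-era ServiceContext / PromptHelper API the thread is using (all numbers are illustrative):

Plain Text
from llama_index import (
    GPTSimpleVectorIndex,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
)

# Controls how text is re-chunked at query time
prompt_helper = PromptHelper(max_input_size=4096, num_output=256, max_chunk_overlap=200)

# chunk_size_limit controls the node parser's chunking at index time
service_context = ServiceContext.from_defaults(
    chunk_size_limit=1024,
    prompt_helper=prompt_helper,
)

documents = SimpleDirectoryReader("./data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

# top k chunks are retrieved, then refined into a single answer
response = index.query("my query", similarity_top_k=2)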
Oh I see, I understand now πŸ™πŸ» Thank you so much!
I feel like giving chatgpt 3.5 context makes it much less smart, is this really the case...?
Are you getting answers like "The previous answer remains the same" stuff?

There's been a ton of problems with gpt-3.5 lately. Especially with the refine prompt.

It really feels like they downgraded the model lol

I have a refine prompt that I've been working on. I can share it if you want to try it out?
not really, it is just refusing to use the tools and context I provided, it insists on "fetching stuff on the internet" (which I didn't know it was capable of, maybe it's just lying to me)
maybe they downgraded to get more paying users for gpt4 XD
please do! would love to try it
this is my exact conspiracy theory too LOL 10x cheaper than davinci-003 seemed too good to be true
one sec, I'll get the code!
Plain Text
from langchain.prompts.chat import (
    AIMessagePromptTemplate,
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

from llama_index.prompts.prompts import RefinePrompt

# Refine Prompt
CHAT_REFINE_PROMPT_TMPL_MSGS = [
    HumanMessagePromptTemplate.from_template("{query_str}"),
    AIMessagePromptTemplate.from_template("{existing_answer}"),
    HumanMessagePromptTemplate.from_template(
        "I have more context below which can be used "
        "(only if needed) to update your previous answer.\n"
        "------------\n"
        "{context_msg}\n"
        "------------\n"
        "Given the new context, update the previous answer to better "
        "answer my previous query."
        "If the previous answer remains the same, repeat it verbatim. "
        "Never reference the new context or my previous query directly.",
    ),
]


CHAT_REFINE_PROMPT_LC = ChatPromptTemplate.from_messages(CHAT_REFINE_PROMPT_TMPL_MSGS)
CHAT_REFINE_PROMPT = RefinePrompt.from_langchain_prompt(CHAT_REFINE_PROMPT_LC)
...
index.query("my query", similarity_top_k=3, refine_template=CHAT_REFINE_PROMPT)
In my (limited) testing, this seemed to improve the quality of the refine process with gpt3.5. Lately lots of people have been complaining about it giving answers like "The new context is not relevant, so the previous answer remains the same" which is very unhelpful lol
thank you! I'll try it tmr and get back with my experience with it πŸ™‚
is davinci-003 or gpt3.5 better...?
Davinci-003 is much better, at least in my experience (hallucinates less, better at following instructions)

And of course gpt4 is king. But it's also still on waitlists, and is also expensive
Oh I see! The refine prompt is working well btw πŸ™‚
I'm just curious, are you working on this full time? Is llama index backed by a company or just a community
Amazing! Maybe I should make a PR for it. Always scary to change something used so frequently though lol
Nah this is my spare time thing. I work full time elsewhere as a machine learning engineer.

Llama Index might be a company/full time thing someday though πŸ™
Oh I see, that's very cool πŸ™‚ So do you have to design algorithms as an ML engineer?
It's more of training/designing models and datasets. Been doing lots of work with document analysis mostly (extracting key information/line items from invoices), some product categorization, and a few other random projects lol
sounds pretty cool ngl
It's not bad haha but tbh working on the llama index stuff has been much more interesting πŸ˜„
i know that feeling πŸ™‚ working at a big company can be boring and your work can be very isolated
btw, you mentioned that we could save the json in memory, I was thinking of caching it in redis, would that make any sense at all?
Yea that definitely makes sense! You could use save_to_string and cache the entire string. I'm not sure how well that will scale though as the index gets bigger

There are some other options, like using qdrant or chroma, etc. For the docstore, there is also a recently added mongodb support, and I think official redis support is coming soon (but that's just for the documents)
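
For the redis idea, a minimal sketch using the old save_to_string / load_from_string API (key name and connection details are hypothetical):

Plain Text
import redis

from llama_index import GPTSimpleVectorIndex

r = redis.Redis(host="localhost", port=6379)

# Cache the whole serialized index as one string
r.set("my_index", index.save_to_string())

# Later: rebuild the index from the cached string
index = GPTSimpleVectorIndex.load_from_string(r.get("my_index").decode("utf-8"))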
Just to clarify things, GPTSimpleVectorIndex is specific to llamaindex and is llama's specific way of vectorizing text...? I'm looking at this page (https://gpt-index.readthedocs.io/en/latest/how_to/integrations/vector_stores.html) and it seems like chroma has its own GPTChromaIndex
Yea pretty much! I mean, it will vectorize text the same way for all vector indexes, it's just differences in how the vectors are stored (simple is all local and in memory + save to disk, others are more like dedicated databases)
oh i see! The chroma example looks a bit too simplified, I read it is purely in memory, so does it mean I don't have to start an "instance" of chroma (unlike other vector DBs)?
also, from some blog posts i see that they do
Plain Text
from llama_index import GPTSimpleVectorIndex


Whereas in the docs its

Plain Text
from gpt_index import GPTSimpleVectorIndex


I'm guessing they are the same? πŸ˜…
Yeaaa they are the same. There was a renaming at some point, and now it's complicated haha always use llama_index tho

And yea, normally with these vector store integrations you'll have an "instance" of that vector store running somewhere already
oh that's very convenient then! so much easier to serve it as an api this way, thanks!
Yea no worries! It also scales a little better as you index more data πŸ’ͺπŸ’ͺ
I was wondering, is there a catch to storing everything in memory? How is it persisted, and what if there's too much data..?
It's persisted by calling save/load from disk.

I've used it with index.json files that were up to 2GB, and tbh I didn't really notice any performance impacts πŸ€” at a certain point yea it's going to hit a scaling wall. But then that's where the vector store and doc store integrations come in
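
For reference, the save/load pattern mentioned above looks something like this (0.5-era API; the filename is arbitrary):

Plain Text
from llama_index import GPTSimpleVectorIndex

# Persist the in-memory index to a json file on disk
index.save_to_disk("index.json")

# Later, e.g. after a restart, load it back
index = GPTSimpleVectorIndex.load_from_disk("index.json")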
oh I see, thanks, that's great
i was wondering, can I just vectorize and index the content of a few html pages in a json file? Would that impact performance a lot? I know vector search is fast
Yea that should be fine!
that's great to know πŸ™‚
btw, i saw this part in the chroma documentation:
https://docs.trychroma.com/api-reference#run-chroma-just-as-a-client-to-talk-to-a-backend-service

I think my default chroma is in memory but somehow it is still recommended to use an instance of it in production? (and so won't be in memory) Any thoughts?
Yea they aren't wrong. For production use-cases you would want a dedicated server running. (but again, this is only when you are dealing with large amounts of documents/vectors)

Other vector clients like pinecone and weaviate provide dedicated servers for you, so you don't have to worry about deploying your own. Depends on what works best for you
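
The two chroma modes being discussed look roughly like this, following the linked docs of that era (host and port are placeholders):

Plain Text
import chromadb
from chromadb.config import Settings

# Purely in-memory, as in the simplified example
client = chromadb.Client()

# Thin client talking to a dedicated backend service (production setup)
client = chromadb.Client(Settings(
    chroma_api_impl="rest",
    chroma_server_host="localhost",
    chroma_server_http_port="8000",
))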
aw I see, I'm trying to ship a product to production, so I guess I will need weaviate then. I'm just wondering if there's a production vector database that's in memory, like how redis caches stuff in memory even in production
I think it can't really be in memory, or it has to be a combination of in memory/disk. It's similar to setting up a SQL database, but it just holds vector data instead πŸ™‚ But with enough data, the vectors will use a lot of RAM if all held in memory
If you aren't planning on inserting much data though and only supporting queries, anything running in memory would be fine (GPTSimpleVectorIndex, that chroma example that was in memory)
yup that's what I thought too, so I was a bit skeptical of chroma when I read it's completely in memory. I'm planning to build something and sell it as a service, so it's up to the user to decide how much data to insert ... I guess I'll go with something like weaviate then πŸ™‚
Sounds like a good choice then! πŸ‘
Btw, the chatbot tutorial uses a graph index as a top level index. I'm wondering, why aren't all the 10-K filings just vectorized into a single json file for faster lookup…?
Because each index contains financial information for a specific year. Don't want to mix that data up in one pile yanno?

Plus with separate indexes, you can use the query decomposition transform to compare different years
Ah I see, was concerned that it was to optimize performance πŸ™‚ thanks!
Btw, is there a place where people building these sort of things hangout? (Like conferences or something) we are building something and would like to find someone who’s fully focused on tech 😬
hmmm not really sure haha. You can ask in #πŸ¦„founders maybe? I know most in-person activity is focused in silicon valley (assuming you are in north america)
I'm in the middle of nowhere lol so I'm little removed from that scene
Ok! Thanks haha
How did you start contributing to llama_index, if I may ask
Not too familiar with open source world
Honestly, I think my github feed recommended the repo LOL and I just really liked what the repo was doing

made a few small contributions to help learn the codebase (minor features/bug fixes), and here I am lol
That’s very impressive! Extremely gifted engineer 😬
haha nah, years of experience before that πŸ™‚
hey logan, i'm wondering do you know anything about deploying models on the cloud?
do I need to use any special machines, or just how I would normally deploy a docker container? Because I'm looking at weaviate and it looks like they also offer pretrained models, not sure if additional setup is needed (https://weaviate.io/developers/weaviate/installation/docker-compose)
Deploying models on the cloud isn't tooo bad. Basically yea, you set up a docker container and deploy that

If you need a GPU, your options are basically google or aws. But the IT team at my job handles everything after we create the docker image πŸ˜… I've used sagemaker a bit in the past, and it wasn't the best experience (their docs are πŸ’© )
Oh I see, thanks πŸ˜‡
But yup trynna do that, what embedding model does llamaindex use…? When using weaviate there are tons of options, such as choosing which text2vec, NER, QnA transformers, not sure if these will override what llamaindex is using? Slightly confused πŸ₯²πŸ˜…
If you want to override what llama index is doing, you'll have to change embed_model in the service context

By default it's using text-embedding-ada-002 from openai, you can read more about it in this file:
https://github.com/jerryjliu/llama_index/blob/2e25909c7a40e564b3de4ffab1ef9146a96f4ca8/gpt_index/embeddings/openai.py#L175

Not sure how you would drop in weaviate embeddings. When you use the weaviate integration, it computes the embeddings with openai and pushes that to weaviate
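
Overriding embed_model looks roughly like this, following the custom-embedding pattern from the docs of that era (the huggingface model is just an example stand-in):

Plain Text
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

from llama_index import (
    GPTSimpleVectorIndex,
    LangchainEmbedding,
    ServiceContext,
    SimpleDirectoryReader,
)

# Wrap any langchain embedding model; this replaces the default
# openai text-embedding-ada-002
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
service_context = ServiceContext.from_defaults(embed_model=embed_model)

documents = SimpleDirectoryReader("./data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)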
okok ill take a moment to read up on this, thanks!!!
do you know how good ada-002 is? If I use the pre-trained weaviate transformers I get to use all: text2vec-transformers, QnA-transformers, and NER-transformers
Tbh I think Ada is pretty good, at least in my experience.

If you use a weaviate model, it sounds like text2vec is what you would want
It would be nice if they shared benchmark results of all these models on traditional benchmark datasets so people could compare them πŸ˜…
oh I see, I also looked into weaviate a bit more and it seems like I have to define a schema? I'm wondering how weaviate stores the embeddings llamaindex sends to it, since we never have to define a schema?
You'll have to look at the source code in llama index for that one lol I'm not sure how it works exactly

This example just kinda creates a client and off it goes (also it saves/loads something, not sure what gets saved or if it's necessary)
https://github.com/jerryjliu/llama_index/blob/main/examples/vector_indices/WeaviateIndexDemo.ipynb


Looks like the schema is in here
https://github.com/jerryjliu/llama_index/blob/main/gpt_index/readers/weaviate/client.py
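
For reference, the weaviate integration from that notebook boils down to something like this (URL hypothetical; llama_index computes the embeddings with openai by default and pushes them into weaviate under its own schema):

Plain Text
import weaviate

from llama_index import GPTWeaviateIndex, SimpleDirectoryReader

# Connect to an already-running weaviate instance
client = weaviate.Client("http://localhost:8080")

documents = SimpleDirectoryReader("./data").load_data()
index = GPTWeaviateIndex.from_documents(documents, weaviate_client=client)
response = index.query("my query")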
Ah I see, it's like a catch-all schema for documents
for these production vector DBs, which one would you recommend TBH
Tbh they all seem really similar. Although I've heard some sketchy stuff about pinecone so maybe stay away from that one lol
Seems like weaviate, qdrant, and chroma are the most popular
lol thanks for telling me this, I wasn't gonna choose pinecone anyway cuz I can't self-host it, but I'm surprised there's sketchy stuff about it..??
can u tell me more πŸ˜‚
Just a few weird comments people were making in the enterprise channel LOL seemed like they had some insider info on how their data is stored and managed. Some weird stuff with payment issues too
Then a few more people piped in after that haha
πŸ˜‚πŸ˜‚πŸ˜‚
Well I definitely thought pinecone looked solid at first 😬
hahaha it seems popular too! Who knows xD
Maybe I should try it so no one else has to 😬
Hey Logan, I'm wondering, do you know how well gpt or davinci handles structured data, and is this something relevant to llamaindex? For example, if I have a CSV file and I want to ask some questions on it, would vectorizing the content even make sense?
Like, it can do some text2sql for structured data. So given the schemas of a few tables and maybe some extra context description of the tables, it can convert user queries to sql commands

However, sometimes the models hallucinate things like column names, especially gpt3.5 lol

You could also convert each row of the CSV into a document and index that, but that doesn't make sense to do for every document type
oh I see, in what form should the schemas be provided? Is there a strict format?
For the struct indices, it either needs to be a database or a pandas file

Then from there, the code derives the schema automatically
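
A minimal sketch of that flow, based on the SQL struct index examples from the docs of that era (table name and database are hypothetical):

Plain Text
from sqlalchemy import create_engine

from llama_index import GPTSQLStructStoreIndex, SQLDatabase

# Point llama_index at an existing database; the schema is
# derived automatically from the table definitions
engine = create_engine("sqlite:///invoices.db")
sql_database = SQLDatabase(engine, include_tables=["invoices"])

index = GPTSQLStructStoreIndex.from_documents(
    [], sql_database=sql_database, table_name="invoices"
)

# The LLM translates this into a SQL query and returns its result
response = index.query("Which invoice has the highest total?")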
Hey Logan! I'm wondering how semantic search works under the hood in llamaindex?
Hey! Yea sure thing

The process goes something like this, with a vector index

  • ingest documents. These documents are broken up into smaller overlapping nodes, so that they can be used for embeddings and LLM calls
  • each node is then embedded using the embed_model (the default is text-embedding-ada-002 from openai, which uses 1536 dimensions)
  • then at query time, the query text is also embedded. Cosine similarity is calculated comparing the query embedding to all node embeddings (see the sketch after this list)
  • the top 2 (by default) nodes are retrieved.
  • if the text from the nodes is too big to fit into a single llm call, it gets broken into overlapping chunks again
  • the first call to the llm sends your query and the node text, inside a prompt template
  • the llm returns an answer to the query
  • if there is more text for the model to read, the next chunk of text is sent. This time, the text, query, template, and existing answer are all sent. The llm has to either update the existing answer using the new context, or repeat it
  • finally, the answer is returned to the user, along with the source nodes + similarities used to create that answer
I hope that's what you were looking for haha
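
The retrieval step in that list is just cosine similarity over the node embeddings; here's a toy numpy sketch (random vectors standing in for real ada-002 embeddings):

Plain Text
import numpy as np

# Placeholders for real embeddings: 100 nodes, 1536 dims each
node_embeddings = np.random.rand(100, 1536)
query_embedding = np.random.rand(1536)

# Cosine similarity between the query and every node
norms = np.linalg.norm(node_embeddings, axis=1) * np.linalg.norm(query_embedding)
similarities = node_embeddings @ query_embedding / norms

# Retrieve the top 2 (the default similarity_top_k)
top_k = 2
top_node_ids = np.argsort(similarities)[::-1][:top_k]
print(top_node_ids, similarities[top_node_ids])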
It is! So to my understanding llamaindex really shines when it comes to unstructured data like documents with lots of text. But for structured data, can I just use a vector database like weaviate instead for semantic search? Interested in hearing your thoughts
And thank you for being so helpful as always πŸ™‚
For structured data, embeddings don't make as much sense, especially for highly numeric spreadsheets and stuff like that

For that, what you can do is use an LLM to convert queries into something like sql commands (which llama index also does, but it will only return the result of that sql)

If you mean structured data as in JSON or something more textual, there are ways to make embeddings work for it
for structured data I mean something more like a vector database with a schema, like weaviate. Weaviate offers semantic search out of the box, so I'm guessing for this kind of structured data I don't need llamaindex?
Ahhh I see. I think llama index still provides some value there (since it integrates with weaviate, handles a lot of document ingestion, chunking, prompting), but definitely up to you πŸ‘
By "document", do you mean an actual document? As long as llamaindex will improve semantic search in weaviate i'm happy, just wanted to see what you think πŸ™‚
Yea, I meant full document file πŸ‘