
Updated 4 months ago

Node parser

Hello, when using the get_nodes_from_documents method from UnstructuredElementNodeParser together with OpenAI I get
Plain Text
Embeddings have been explicitly disabled. Using MockEmbedding. 0it [00:00, ?it/s]

And then whenever I try to get the node_mappings dictionary, it is always empty, no matter which html file I use.
Below is the full code and the output:
Plain Text
from llama_index.readers.file.flat_reader import FlatReader
from llama_index.node_parser import UnstructuredElementNodeParser
from llama_index.llms import OpenAI
from pathlib import Path

llm = OpenAI(model="gpt-3.5-turbo", api_key="sk-")

# !wget "https://www.dropbox.com/scl/fi/mlaymdy1ni1ovyeykhhuk/tesla_2021_10k.htm?rlkey=qf9k4zn0ejrbm716j0gg7r802&dl=1" -O tesla_2021_10k.htm

reader = FlatReader()
docs_2021 = reader.load_data(Path("tesla_2021_10k.htm"))

node_parser = UnstructuredElementNodeParser(llm=llm)
raw_nodes_2021 = node_parser.get_nodes_from_documents(docs_2021)
base_nodes_2021, node_mappings_2021 = node_parser.get_base_nodes_and_mappings(raw_nodes_2021)
print(len(node_mappings_2021))

Plain Text
Embeddings have been explicitly disabled. Using MockEmbedding. 0it [00:00, ?it/s]
0
12 comments
I don't think that block of code is creating that warning?

That block of code specifically happens when you set embed_model=None somewhere, like in the service context
That's the only block of code I run, and I get the warning from the get_nodes_from_documents method. I didn't set embed_model=None anywhere.
Also even if I explicitly set the embedding model I still get the warning:
Plain Text
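import os
from pathlib import Path

from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI
from llama_index.node_parser import UnstructuredElementNodeParser
from llama_index.readers.file.flat_reader import FlatReader

# Set the key before constructing any OpenAI-backed objects
os.environ['OPENAI_API_KEY'] = 'sk-'
embed_model = OpenAIEmbedding()
llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)

reader = FlatReader()
docs_2021 = reader.load_data(Path("tesla_2021_10k.htm"))

node_parser = UnstructuredElementNodeParser(llm=llm)
raw_nodes_2021 = node_parser.get_nodes_from_documents(docs_2021)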

Seems like get_nodes_from_documents doesn't recognize any embedding model, and maybe that is why node_mappings is always an empty dict.
@Chris
@Logan M
I also tried to reproduce the error, using this notebook.
https://docs.llamaindex.ai/en/stable/examples/query_engine/sec_tables/tesla_10q_table.html
I also face the same issue...
somehow, unstructured is not finding any tables... not sure if they updated their package or what. Trying to debug
unstructured is just failing hard I think. All the tables it finds are irregular, and we can't convert them into dataframes. I might try downgrading my unstructured version a bit and see if something changed.
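One way to check whether any tables made it through is to count the raw nodes by class. This is a minimal sketch, assuming (not confirmed in this thread) that parsed tables come back as IndexNode objects and plain text as TextNode. The helper count_node_types and the two stand-in classes are made up for illustration so the snippet runs without llama_index installed.

```python
from collections import Counter


def count_node_types(nodes):
    """Return a Counter keyed by the class name of each node."""
    return Counter(type(node).__name__ for node in nodes)


# Hypothetical stand-ins for the real node classes (assumption: the real
# parser output uses classes with these names).
class TextNode:
    pass


class IndexNode:
    pass


if __name__ == "__main__":
    sample = [TextNode(), TextNode(), IndexNode()]
    # With real data, pass raw_nodes_2021 instead of `sample`.
    print(count_node_types(sample))
```

If the count for the table-bearing class is zero, an empty node_mappings dict would be the expected result rather than a separate bug.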
aha figured it out
pip install unstructured==0.10.30 seems to work. Something after that changed their table parsing 🤔
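To confirm which version is actually installed in the environment before rerunning the parser, something like the following should do. It is a sketch using only the standard library; the names installed_version, matches_pin, and PINNED are illustrative and not part of llama_index or unstructured.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED = "0.10.30"  # the version reported to work in this thread


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(installed, pinned=PINNED):
    """True only when the installed version equals the pinned one exactly."""
    return installed == pinned


if __name__ == "__main__":
    found = installed_version("unstructured")
    if found is None:
        print("unstructured is not installed")
    elif matches_pin(found):
        print(f"unstructured {found} matches the pinned version")
    else:
        print(f"unstructured {found} differs from pinned {PINNED}")
```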
Nice, this version seems to be working fine, but it still doesn't recognize my embed models. Did you also get the warning "Embeddings have been explicitly disabled. Using MockEmbedding." when testing?
Yea that's expected actually
It's fine. It's using a summary index under the hood to create summaries of tables
And it's disabling the embed model specifically at that step
So it's all good