Hey all, I'm running into a very strange error relating to encoding during a call to get_nodes_from_documents(). The issue seems to stem from a UnicodeEncodeError in tiktoken (which is used by LlamaIndex for encoding).

I would greatly appreciate any thoughts~

I have a set of documents that I am parsing into nodes as follows:

documents = SimpleDirectoryReader('blog_posts').load_data()
service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm=llm)
service_contexts.append(service_context)
nodes = service_context.node_parser.get_nodes_from_documents(documents)


The last line of code never returns. Only when I interrupt the execution do I get a hint as to what's going on. get_nodes_from_documents calls the following method:

def split_text_metadata_aware(self, text, metadata_str):
    metadata_len = len(self.tokenizer(metadata_str))
    effective_chunk_size = self._chunk_size - metadata_len
    return self._split_text(text, chunk_size=effective_chunk_size)

This ultimately calls an encode function in tiktoken, which is where the problem lies.

Execution is hitting the following exception handler in tiktoken:

except UnicodeEncodeError:
    # BPE operates on bytes, but the regex operates on unicode. If we pass a str that is
    # invalid UTF-8 to Rust, it will rightfully complain. Here we do a quick and dirty
    # fixup for any surrogate pairs that may have sneaked their way into the text.
    # Technically, this introduces a place where encode + decode doesn't roundtrip a Python
    # string, but given that this is input we want to support, maybe that's okay.
    # Also we use errors="replace" to handle weird things like lone surrogates.
    text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
    return self._core_bpe.encode(text, allowed_special)

But the structure of their code is leading to an infinite loop.
What I'd really like to do is figure out what text is being passed in here that is causing the exception.

However, I'm not sure if there is a way to inspect the text being passed in, or even print it, while the node parsing is running.

There must be some special characters in my blog posts that are offending the encoder, but I've been using LlamaIndex for a while and have never had this issue.
hmm yea that's super weird, not sure how to fix that πŸ˜…

I'm sure you could narrow it down to at least the Document object causing this issue though
Interesting. Is the use of tiktoken for encoding at all new, @Logan M? I used this same code with no problem a few months ago. It could be that they changed their code too.

Any suggestions for how I might narrow down the document it's getting stuck on? I have like 1000 docs, so I can try to run the same code on batches worst case. But wondering if there's a way I might be able to print out the doc while the underlying loop goes on or something.
If you use python3.8, it might be new? For some reason, if you had python3.8, we used to use a tokenizer from huggingface instead of tiktoken (but there was no justification for this, so we removed it a few months back)

I would create an empty index and loop over documents and insert

Then you can try/catch and print out the problematic document (as well as not halt index construction)

Plain Text
index = VectorStoreIndex([], ....)

for doc in documents:
  try:
    index.insert(doc)
  except Exception as e:
    # print the problematic document instead of halting index construction
    print(doc.doc_id, e)
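
Another way to narrow it down, since the comment in the tiktoken excerpt above points at lone surrogates / text that isn't valid UTF-8: check each document's raw text directly before any parsing. A rough sketch (assuming each Document exposes its raw text via .text):

Plain Text
for i, doc in enumerate(documents):
    try:
        # a str containing lone surrogates will fail to encode as UTF-8
        doc.text.encode("utf-8")
    except UnicodeEncodeError as e:
        print(f"Document {i} ({doc.doc_id}): {e}")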
Great idea. Gonna do some debugging and will post back when I find a solution.
Ok @Logan M figured it out.

I'm following the Ensemble Query Engine guide
https://gpt-index.readthedocs.io/en/latest/examples/retrievers/ensemble_retrieval.html

llm = OpenAI(model="gpt-4")
chunk_sizes = [128, 256, 512, 1024]
service_contexts = []
nodes_list = []
vector_indices = []
query_engines = []
for chunk_size in chunk_sizes:
    print(f"Chunk Size: {chunk_size}")
    service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm=llm)
    service_contexts.append(service_context)
    nodes = service_context.node_parser.get_nodes_from_documents(documents)
    ....
This is the code provided in the guide, but it led to the issue above.

I changed it to define a node parser
node_parser = SimpleNodeParser()

And then add that into the service context
service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm=llm, node_parser=node_parser)

And that solved it. Looking at the source code, it looks like ServiceContext.from_defaults defaults node_parser to None.
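
For reference, applying the workaround to the guide's loop looks roughly like this (a sketch; if each index should still use its own chunk size, SimpleNodeParser.from_defaults(chunk_size=chunk_size) may be needed instead of the bare constructor, so double-check against your version):

from llama_index.node_parser import SimpleNodeParser

for chunk_size in chunk_sizes:
    # build the node parser explicitly instead of letting ServiceContext create it
    node_parser = SimpleNodeParser()
    service_context = ServiceContext.from_defaults(
        chunk_size=chunk_size, llm=llm, node_parser=node_parser
    )
    nodes = service_context.node_parser.get_nodes_from_documents(documents)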
Lol but now I'm running into another issue later in the guide 😒
When I go to actually try the retriever

nodes = await retriever.aretrieve("Who founded MyCase?")

I get
There are 0 selections, please use .inds.

The above exception was the direct cause of the following exception:

ValueError: Failed to select retriever

Searching this, I see some others have run into similar errors with selectors.
The node parser is None in the service context, but then it gets built in the from_defaults function. I'm surprised this fixes your issue haha
https://github.com/jerryjliu/llama_index/blob/3506143d5aedafa91437a4f4097bceb3a4c9ab6f/llama_index/indices/service_context.py#L147
Looks like the retriever didn't select any indexes πŸ˜† hmm
You can try changing the selector

Plain Text
retriever = RouterRetriever(
    selector=PydanticMultiSelector.from_defaults(llm=llm, max_outputs=4),
    retriever_tools=retriever_tools,
)


The guide uses a pydantic multi selector, which relies on OpenAI's function calling API

Alternatively, you could use the LLMMultiSelector, which relies on the LLM outputting structured JSON to parse

Plain Text
from llama_index.selectors import LLMMultiSelector

retriever = RouterRetriever(
    selector=LLMMultiSelector.from_defaults(service_context=service_context, max_outputs=4),
    retriever_tools=retriever_tools,
)
If that doesn't help, you might need to tweak the selector prompt template? Which tbh is a little annoying lol
Bumping an old thread, but how would one do this, Logan?
To make LLMSingleSelectors return JSON more reliably, would it be a good idea to modify these templates with something like LangChain's pydantic output parser parsing instructions?
Pydantic selectors don't seem to be async; any way around that?
So those prompts are also connected to an output parser, which further extends the instructions for outputting JSON. That's where I would actually make the modification: pass in the output parser and leave the prompt as the default.
https://github.com/jerryjliu/llama_index/blob/main/llama_index/output_parsers/selection.py#L50
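
A rough sketch of that (assuming LLMMultiSelector.from_defaults accepts an output_parser argument and that the class in the linked file is SelectionOutputParser with a format() hook; worth double-checking against your installed version):

Plain Text
from llama_index.output_parsers.selection import SelectionOutputParser
from llama_index.selectors import LLMMultiSelector

class StrictSelectionOutputParser(SelectionOutputParser):
    def format(self, prompt_template: str) -> str:
        # keep the default JSON format instructions and append extra guidance
        return super().format(prompt_template) + "\nReturn ONLY the JSON list, with no extra text."

retriever = RouterRetriever(
    selector=LLMMultiSelector.from_defaults(
        service_context=service_context,
        output_parser=StrictSelectionOutputParser(),
        max_outputs=4,
    ),
    retriever_tools=retriever_tools,
)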
huh, weird - was this with a recent update, or was this always the case?
That makes a lot of sense, thank you!
Also how do you even do this full time, you are just so helpful
Appreciate you so much ;_;
Well, I do work for llama index, so I get paid πŸ˜…
Actually I took a second look -- it's fake async haha, it's just calling the sync method

It should be updated to use async llm calls instead