Find answers from the community

matt_a
Joined September 25, 2024
Hey all, I'm running into a very strange encoding error during a call to get_nodes_from_documents(). The issue seems to stem from a UnicodeEncodeError in tiktoken (which LlamaIndex uses for tokenization).

I would greatly appreciate any thoughts~

I have a set of documents that I am parsing into nodes as follows:

from llama_index import SimpleDirectoryReader, ServiceContext

documents = SimpleDirectoryReader('blog_posts').load_data()
service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm=llm)
service_contexts.append(service_context)  # service_contexts: a list I build up elsewhere
nodes = service_context.node_parser.get_nodes_from_documents(documents)


The last line never returns; when I interrupt the execution, the traceback gives a hint as to what's going on. get_nodes_from_documents() calls the following method:

def split_text_metadata_aware(self, text, metadata_str):
    metadata_len = len(self.tokenizer(metadata_str))
    effective_chunk_size = self._chunk_size - metadata_len
    return self._split_text(text, chunk_size=effective_chunk_size)

This ultimately calls an encode function in tiktoken, and that's where the problem lies.

Execution ends up in the following exception handler:
except UnicodeEncodeError:
    # BPE operates on bytes, but the regex operates on unicode. If we pass a str that is
    # invalid UTF-8 to Rust, it will rightfully complain. Here we do a quick and dirty
    # fixup for any surrogate pairs that may have sneaked their way into the text.
    # Technically, this introduces a place where encode + decode doesn't roundtrip a Python
    # string, but given that this is input we want to support, maybe that's okay.
    # Also we use errors="replace" to handle weird things like lone surrogates.
    text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
    return self._core_bpe.encode(text, allowed_special)

But the structure of their code seems to lead to an infinite loop, which is why get_nodes_from_documents() never returns.
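
In the meantime, a possible workaround is to apply the same surrogate fixup to the document text up front, so the encode call never hits the UnicodeEncodeError path at all. A minimal sketch, assuming your LlamaIndex version exposes a mutable text attribute on Document (that part is my assumption):

# Hypothetical workaround: scrub lone surrogates / invalid UTF-8 from each
# document before node parsing, mirroring tiktoken's own fallback.
for doc in documents:
    doc.text = doc.text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")

nodes = service_context.node_parser.get_nodes_from_documents(documents)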
24 comments
Hey all, I published an article covering what I've learned about improving RAG performance, inspired by a lot of learning from the LlamaIndex community. It might be helpful if you're looking for ideas and strategies to improve your application. Let me know if you have any thoughts or if there's anything you think I missed.

https://towardsdatascience.com/10-ways-to-improve-the-performance-of-retrieval-augmented-generation-systems-5fa2cee7cd5c
2 comments
Hey all, recently spent a few weeks experimenting with the Recency Filtering feature to see if it would help improve the performance of my bot, which uses thousands of blog posts for context. Wrote up my results here, which might be useful for others.

https://www.mattambrogi.com/posts/recency-for-chatbots/
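
In case it saves anyone a read, the core of the setup I tested looks roughly like this. I'm assuming FixedRecencyPostprocessor with its default "date" metadata key, and import paths may differ across LlamaIndex versions:

from llama_index.indices.postprocessor import FixedRecencyPostprocessor

# Keeps only the most recent of the retrieved nodes, based on a "date"
# field in each node's metadata (default key assumed here).
recency_postprocessor = FixedRecencyPostprocessor(service_context=service_context)

query_engine = index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[recency_postprocessor],
)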
2 comments
Has anyone used the DatasetGenerator class or generate_questions() method from this notebook?

https://github.com/jerryjliu/llama_index/blob/main/examples/evaluation/QuestionGeneration.ipynb

Tried to use it today on a collection of ~1000 blog posts. It ran for an hour without returning anything. It never errored, but I eventually stopped it out of worry that I was using a crazy amount of tokens. I don't see any docs on it.
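
For reference, the invocation was roughly the following, per the notebook. If anyone else tries it, starting with a small slice of documents to bound token usage seems prudent (the slice is my own precaution, not from the notebook):

from llama_index.evaluation import DatasetGenerator

# Try a small subset first to gauge runtime and token usage before
# pointing it at all ~1000 posts.
data_generator = DatasetGenerator.from_documents(documents[:10])
eval_questions = data_generator.generate_questions_from_nodes()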
5 comments
Looking for thoughts on four common issues I see
Hey all! I'm working on a few chatbots built with LlamaIndex, a collection of (1000s of) blog posts as a data source, and GPT. Really impressed right out of the box, but as I continue to work I've found a few common ways in which responses are bad. I'm working through mitigating each issue, all of which I think are very solvable.

Issues
  1. Failing to account for recency. Can I somehow get my bot to prioritize more recent context when the same thing is mentioned many times? Maybe I can store the date in metadata? (A rough sketch of this is below.)
  2. Requiring very specifically worded questions. I.e., ask two questions that mean the same thing to a human; the bot will find the answer for one but not the other.
  3. Aggregating vs. non-aggregating indexes. I'm using a simple vector index. Some questions would benefit from an index that could aggregate information from across my blog posts; others wouldn't. How can I balance this?
  4. How to handle subjective questions for which there is nothing in the context. I think this comes down to prompt engineering.
If you have any thoughts on the above, please let me know, I'd love to hear them. I'm sure I'm missing some easy improvements.
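
For issue 1, here's a rough sketch of the metadata idea: stamp each post's publish date onto the document at load time so a recency postprocessor can order results. blog_posts and its fields are placeholders for however you load your data:

from llama_index import Document

# Attach a publish date to each document; recency postprocessors can
# key off this field. (Older LlamaIndex versions call this field extra_info.)
documents = [
    Document(text=post["body"], metadata={"date": post["published_at"]})
    for post in blog_posts
]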

More info
I wrote about this in depth on my website https://www.mattambrogi.com/posts/chat-bots/
12 comments
Question: I'm finding that I need to ask questions very precisely to get the proper answer. Is there anything I can do to change this?

Example: I've created an index using ~1000 blog posts about legal tech from the past 3ish years. MyCase is a legal tech company that was acquired last year by a company called Affinipay. This is covered in the document corpus.

However, you can see that the right answer isn't returned until I ask a pointed question. Is there anything I might be able to try to improve this?
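
For context, two things on my list to try: raising similarity_top_k so more candidate chunks come back, and a HyDE query transform so retrieval embeds a hypothetical answer instead of my exact wording. A sketch of both; the HyDE import paths vary across LlamaIndex versions:

from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

# Retrieve more candidates so a loosely worded question can still
# surface the right chunk.
query_engine = index.as_query_engine(similarity_top_k=5)

# HyDE: generate a hypothetical answer and embed that for retrieval,
# smoothing over differences in question phrasing.
hyde = HyDEQueryTransform(include_original=True)
hyde_engine = TransformQueryEngine(query_engine, hyde)

response = hyde_engine.query("Who acquired MyCase?")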
11 comments