The community member is hitting an IndexError: list index out of range when building a GPTSimpleVectorIndex from large PDF datasheets with a small chunk size limit using the gpt_index library. In the discussion, another community member notes that a pull request (https://github.com/jerryjliu/gpt_index/pull/306) should fix the problem and will be included in the next release, and suggests pulling the main branch to get the fix sooner.
I'm getting this error when parsing large PDF datasheets with small chunk sizes:

    [ERROR] IndexError: list index out of range
    Traceback (most recent call last):
      File "/var/task/app.py", line 85, in handler
        index = GPTSimpleVectorIndex(documents, chunk_size_limit=256)
      File "/var/lang/lib/python3.8/site-packages/gpt_index/indices/vector_store/simple.py", line 48, in __init__
        super().__init__(
      File "/var/lang/lib/python3.8/site-packages/gpt_index/indices/vector_store/base.py", line 43, in __init__
        super().__init__(
      File "/var/lang/lib/python3.8/site-packages/gpt_index/indices/base.py", line 96, in __init__
        self._index_struct = self.build_index_from_documents(
      File "/var/lang/lib/python3.8/site-packages/gpt_index/token_counter/token_counter.py", line 54, in wrapped_llm_predict
        f_return_val = f(_self, *args, **kwargs)
      File "/var/lang/lib/python3.8/site-packages/gpt_index/indices/base.py", line 231, in build_index_from_documents
        return self._build_index_from_documents(documents, verbose=verbose)
      File "/var/lang/lib/python3.8/site-packages/gpt_index/indices/vector_store/base.py", line 74, in _build_index_from_documents
        self._add_document_to_index(index_struct, d, text_splitter)
      File "/var/lang/lib/python3.8/site-packages/gpt_index/indices/vector_store/simple.py", line 64, in _add_document_to_index
        nodes = self._get_nodes_from_document(document, text_splitter)
      File "/var/lang/lib/python3.8/site-packages/gpt_index/indices/base.py", line 197, in _get_nodes_from_document
        text_chunks = text_splitter.split_text(document.get_text())
      File "/var/lang/lib/python3.8/site-packages/gpt_index/langchain_helpers/text_splitter.py", line 128, in split_text
        cur_num_tokens = max(len(self.tokenizer(splits[start_idx])), 1)
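For context, a minimal sketch of the kind of code that triggers this. The `GPTSimpleVectorIndex(documents, chunk_size_limit=256)` call is taken from the traceback above; loading the PDFs via SimpleDirectoryReader from a local `data/` directory is an assumption about how the documents were produced:

```python
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Assumed setup: the large PDF datasheets live in a local "data/" directory,
# and SimpleDirectoryReader parses them into Document objects.
documents = SimpleDirectoryReader("data").load_data()

# Building the index with a small chunk_size_limit is what reaches the
# token-counting code in text_splitter.split_text(), where affected versions
# raise "IndexError: list index out of range".
index = GPTSimpleVectorIndex(documents, chunk_size_limit=256)
```

Until the fix from the linked pull request lands in a release, the suggestion in the thread is to install gpt_index from the main branch, for example with pip's git support (something like `pip install --upgrade git+https://github.com/jerryjliu/gpt_index.git`, adjusted for your environment), so the patched text splitter is picked up sooner.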