
NLTK

At a glance

The post describes an error encountered while indexing nodes: the NLTK resource "punkt_tab" was not found. Community members suggested installing the missing resource. After that was solved, the community member hit a new error: the semantic double merging splitter emitted warnings about Doc.similarity being evaluated on empty vectors, followed by a "list index out of range" failure. The community members suspected the embeddings were not working properly, and one noted that the many try/except blocks in the code may be hiding the actual traceback for the error.

2025-01-01 09:24:14.745 | ERROR | indexing.indexing_hugging:indexing:140 - Error injecting nodes:
**
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt_tab')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt_tab/english/

Searched in:
  • '/root/nltk_data'
  • '/usr/nltk_data'
  • '/usr/share/nltk_data'
  • '/usr/lib/nltk_data'
  • '/usr/share/nltk_data'
  • '/usr/local/share/nltk_data'
  • '/usr/lib/nltk_data'
  • '/usr/local/lib/nltk_data'
  • '/usr/local/lib/python3.10/dist-packages/llama_index/core/_static/nltk_cache'
**


from llama_index.core.node_parser import (
    LanguageConfig,
    SemanticDoubleMergingSplitterNodeParser,
)

lang_config = LanguageConfig(language="english", spacy_model="en_core_web_md")

splitter = SemanticDoubleMergingSplitterNodeParser(
    language_config=lang_config,
    initial_threshold=0.4,
    appending_threshold=0.5,
    merging_threshold=0.5,
    max_chunk_size=5000,
)
nodes = splitter.get_nodes_from_documents(documents)
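
Note: the spacy_model passed to LanguageConfig has to be installed in the environment beforehand, or the splitter has nothing to load. A minimal sketch of one way to fetch it, assuming the standard spaCy download helper:

import spacy.cli

# one-time setup: fetch the model referenced by LanguageConfig above
spacy.cli.download("en_core_web_md")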

Error while using SemanticDoubleMerging
@WhiteFang_Jr @Logan M
You need to install the nltk resource as mentioned in the error
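A minimal sketch of that fix, lifted straight from the error message:

import nltk

# one-time download of the tokenizer data NLTK could not find
nltk.download("punkt_tab")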
solved it, but getting this error now:
20.73s/it]
/usr/local/lib/python3.10/dist-packages/llama_index/core/node_parser/text/semantic_double_merging_splitter.py:228: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
  ).similarity(
/usr/local/lib/python3.10/dist-packages/llama_index/core/node_parser/text/semantic_double_merging_splitter.py:297: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
  current_nlp.similarity(
/usr/local/lib/python3.10/dist-packages/llama_index/core/node_parser/text/semantic_double_merging_splitter.py:310: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
  and current_nlp.similarity(
/usr/local/lib/python3.10/dist-packages/llama_index/core/node_parser/text/semantic_double_merging_splitter.py:328: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
  and current_nlp.similarity(
/usr/local/lib/python3.10/dist-packages/llama_index/core/node_parser/text/semantic_double_merging_splitter.py:255: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
  ).similarity(
2025-01-01 09:35:42.895 | ERROR | indexing.indexing_hugging:indexing:143 - Error injecting nodes: list index out of range
2025-01-01 09:35:42.896 | DEBUG | indexing.utils:move_files:46 - Moved PF Testing Locations_Condos.pdf to /home/raaadmin/data/321/28ab04a7-09bd-4d58-af5e-4b8878e0209a/error-files/PF Testing Locations/attachments/PF Testing Locations_Condos.pdf
Seems like your embeddings aren't working properly?
any solution to mitigate this?
i am using openai embeddings, this is the file that's giving the error
@Logan M @WhiteFang_Jr
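For context, the [W008] warnings above come from spaCy rather than from the OpenAI embedding model: this splitter compares chunks with spaCy's Doc.similarity, which relies on the spaCy model's word vectors. A minimal sketch of checking that the configured model actually carries vectors, assuming en_core_web_md as in the snippet above:

import spacy

nlp = spacy.load("en_core_web_md")

# md/lg models ship with word vectors; sm models do not, and a
# vector-less model (or an empty/whitespace-only chunk) triggers [W008]
print(nlp.vocab.vectors.shape)                    # expect a non-zero row count
print(nlp("a short test sentence").vector_norm)   # expect a value > 0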
Hard to say without an actual traceback. Your code has so many try/excepts that it's hiding the traceback for the error
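A minimal sketch of an except block that keeps the traceback visible instead of swallowing it, assuming loguru (which the |-separated log lines above suggest):

from loguru import logger

try:
    nodes = splitter.get_nodes_from_documents(documents)
except Exception:
    # logger.exception records the full traceback, not just str(e)
    logger.exception("Error injecting nodes")
    raise  # re-raise so the real failure point is not hidden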