Seems like a bug with custom LLMs? But there also might be a workaround if you make llm_metadata return a dataclass instead of a dict
All we're doing is:
from llama_index import VectorStoreIndex, LLMPredictor, ServiceContext
from llama_index.vector_stores import MongoDBAtlasVectorSearch
store = MongoDBAtlasVectorSearch(mongodb_client, db_name=db_name, collection_name=collection_name, index_name=index_name)
index = VectorStoreIndex.from_vector_store(vector_store=store)
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.2, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, num_output=700)
might be a workaround if you make llm_metadata return a dataclass instead of a dict
I wouldn't know how to do that
Are you using the langchain OpenAI class or our OpenAI class? I suspect langchain (and then I guess it's a bug with langchain lol)
If you are using langchain llms, you should really be using the ChatOpenAI class to get the best performance
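Something roughly like this (an untested sketch; it assumes langchain's ChatOpenAI wrapper and mirrors your snippet above, once the num_output bug is patched):
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext
# same setup as before, just swapping the completion-style OpenAI for the chat wrapper
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, num_output=700)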
If you use our openai LLM class, there's only one, so less confusion there
I'll patch this bug for langchain llms in the next few minutes here
I am out picking up food rq but can check in a few when I get home
from langchain import OpenAI
so you're saying I should instead be using from llama_index.llms.openai import OpenAI?
And... fwiw... I switched to that, and got the same error:
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, num_output=700)
File "/opt/python/llama_index/indices/service_context.py", line 140, in from_defaults
prompt_helper = prompt_helper or _get_default_prompt_helper(
File "/opt/python/llama_index/indices/service_context.py", line 44, in _get_default_prompt_helper
llm_metadata = dataclasses.replace(llm_metadata, num_output=num_output)
File "/var/lang/lib/python3.10/dataclasses.py", line 1424, in replace
raise TypeError("replace() should be called on dataclass instances")
TypeError: replace() should be called on dataclass instances
from llama_index.llms import OpenAI
from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", max_tokens=700, temperature=0))
In any case, I'll patch that error soon when I get home!
Ok, found the bug (the error is caused by setting num_output in the service context; just some outdated code). But also, a usage tip:
In general, don't worry about LLMPredictor anymore, just pass the llm directly into the service context and it should just work
I think I tried this and it didn't work, but I will try again for sanity
plz confirm you have 0.7.3 or newer too
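e.g. a quick way to check (assuming the package exposes __version__):
import llama_index
print(llama_index.__version__)  # should print 0.7.3 or newer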
Uh, so I did a pip upgrade on it and got 0.7.4. I ran it and got past that error, but have a new one:
optimizer=SentenceEmbeddingOptimizer(threshold_cutoff=threshold_cutoff,percentile_cutoff=percentile_cutoff),
File "/opt/python/llama_index/indices/postprocessor/optimizer.py", line 51, in __init__
import nltk.data
ModuleNotFoundError: No module named 'nltk'
Am I using the optimizer wrong now?
from llama_index.indices.postprocessor.optimizer import SentenceEmbeddingOptimizer
Or are my dependencies busted? or...?
dependencies look busted
Try pip install nltk? Or a fresh venv tbh
shouldn't upgrade have caught and fixed that though? D:
like, is it missing from a requirements file or something? @_@;
python dependencies are hell sometimes
Fresh venv is usually a good start
wat:
File "/opt/python/llama_index/indices/postprocessor/optimizer.py", line 54, in __init__
nltk.data.find("tokenizers/punkt")
File "/opt/python/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
>_>
literally just installed it
no... it's definitely there... what is going on
I guess... I should be able to predownload this... so I guess that's what I'll do
It should be downloading it automatically... your env is cooked
Well no, the issue I believe is that I'm packaging up the environment for use in AWS Lambda, so it probably literally can't download it, and even if it could, it really shouldn't do that on every request anyway.
Anywho... I ran it on my local PC and just hit an error, so I'm seeing what's up with that. ty
ohhh on lambda, it will download to /tmp (which I think lambda doesn't like)
You can override the download location with an env var NLTK_DATA
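Something like this might work (paths are just examples; download locally, bundle the folder with your deployment package, and point NLTK_DATA at it):
# run locally before packaging: fetch the punkt models into a folder you ship with the function
import nltk
nltk.download("punkt", download_dir="./nltk_data")
# in the Lambda, before nltk / llama_index get imported (or via the function's env config),
# point nltk at the bundled copy
import os
os.environ["NLTK_DATA"] = "/opt/python/nltk_data"  # example path, match wherever you bundled it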
ah okay interesting... any thoughts on my next error? XD
pymongo.errors.OperationFailure: "knnBeta.k" must be a integer, full error: {'ok': 0.0, 'errmsg': '"knnBeta.k" must be a integer', 'code': 8, 'codeName': 'UnknownError', '$clusterTime': {'clusterTime': Timestamp(1689026539, 14), 'signature': {'hash': b'\x8eil\xff$\x87W\x9c\xc3rZ5\xec\xb3\xa0\xbd\xebw\xaeE', 'keyId': 7204928928916439059}}, 'operationTime': Timestamp(1689026539, 14)}
coming down from query_engine.query(prompt)
on my end :x
via File "/root/pytest/venv/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/retriever.py", line 85, in _retrieve
query_result = self._vector_store.query(query, **self._kwargs)
nvm, looks like an ENV casting issue
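i.e. roughly this (I'm assuming the k in knnBeta.k comes from my similarity_top_k setting, and the env var name here is just an example):
import os
# env vars come back as strings, so cast to int before handing the value to the query engine
similarity_top_k = int(os.environ.get("SIMILARITY_TOP_K", "3"))
query_engine = index.as_query_engine(similarity_top_k=similarity_top_k)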
thanks so much for your time!
Actually... I've got a followup question... I got it working locally, but Amazon is still "mad" with a:
OSError: [Errno 30] Read-only file system: '/home/sbx_user####'
-- But I'm not understanding what it's having to download, and I'd really like for it to not have to download the same thing every time it runs from scratch
So what exactly is this download for? And can/should I put it elsewhere?
File "/opt/python/llama_index/indices/postprocessor/optimizer.py", line 56, in __init__
nltk.download("punkt")
File "/opt/python/nltk/downloader.py", line 777, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "/opt/python/nltk/downloader.py", line 642, in incr_download
yield from self._download_package(info, download_dir, force)
File "/opt/python/nltk/downloader.py", line 699, in _download_package
os.makedirs(download_dir)
File "/var/lang/lib/python3.10/os.py", line 215, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/var/lang/lib/python3.10/os.py", line 225, in makedirs
mkdir(name, mode)
Also... it looks like Lambda already uses /tmp for stuff... but this error very clearly states it's trying to create a directory in home...
ah, yeah, I see on my own machine it stuck it in ~/nltk_data/tokenizers
Is it necessary for it to download every time for some reason?
like... does that data update really regularly or something?
lambda might be clearing it every time? Or is that locally?
It should be caching it, at least locally
sure but, what is it for?
tokenization (i.e. splitting text into sentences), so that we can chunk properly
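Roughly what that looks like under the hood (a tiny, untested sketch):
import nltk
nltk.download("punkt")  # the one-time download in question: the Punkt tokenizer models
from nltk.tokenize import sent_tokenize
print(sent_tokenize("The first sentence. And the second one."))
# ['The first sentence.', 'And the second one.']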