HELP, I updated packages and everything broke

HELP, I updated packages and everything broke - did anything change since version 0.6.14 that changed how prompt_helper is used?

I get an error from
Plain Text
prompt_helper = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor, chunk_size_limit=1024)


I returned to version 0.6.14 and kept langchain version 0.0.205, but then choosing the newer gpt-3.5-turbo-16k model gives this error:

Plain Text
[ERROR] ValueError: Unknown model: gpt-3.5-turbo-16k. Please provide a valid OpenAI model name. Known models are: gpt-4, gpt-4-0314, gpt-4-32k, gpt-4-32k-0314, gpt-3.5-turbo, gpt-3.5-turbo-0301, text-ada-001, ada, text-babbage-001, babbage, text-curie-001, curie, davinci, text-davinci-003, text-davinci-002, code-davinci-002, code-davinci-001, code-cushman-002, code-cushman-001


then returning to the regular gpt-3.5-turbo model suddenly gives me the error "nltk package not found, please run pip install nltk" - even though nltk is the first item in the requirements.txt file for my project. I'm confused
14 comments
I checked the logs as in the attached picture...

OpenAI says:
If the error comes from nltk_data, you may need to download the required data into a writable directory (e.g., /tmp) within the Lambda environment. First, you should make sure to import nltk in your code and then update the data path to use the /tmp directory.
Add the following lines after all your imports:
Plain Text
import nltk

nltk.data.path.append('/tmp/nltk_data')

Then, you'll need to download the required data or corpora before using them. For example, if you're using the 'punkt' tokenizer, add this line after updating the data path:
Plain Text
nltk.download('punkt', download_dir='/tmp/nltk_data')

This way, you'll ensure that the required data is downloaded and stored in a writable directory within the Lambda environment.
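Putting the two snippets above together into one top-of-file setup might look like this - a sketch, not verified on Lambda: /tmp/nltk_data and the punkt corpus are just the examples from the advice above, and the guard around the nltk import is only there so the file still loads where nltk isn't installed:

```python
import os

# Run this near the top of the Lambda handler module, after the stdlib
# imports and before llama_index is imported.
NLTK_DIR = "/tmp/nltk_data"          # the only writable location on Lambda
os.makedirs(NLTK_DIR, exist_ok=True)

try:
    import nltk
    nltk.data.path.append(NLTK_DIR)  # read corpora from /tmp as well
    try:
        # write downloads to /tmp instead of the read-only default paths
        nltk.download("punkt", download_dir=NLTK_DIR, quiet=True)
    except Exception:
        pass  # offline; the data can also be bundled in the deployment package
except ImportError:
    nltk = None  # not installed locally; Lambda gets it via requirements.txt
```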
---
But for now I have no idea where to put these lines - it's just confusing
Attachment: image.png
btw the error with the prompt helper was:
Plain Text
type object 'PromptHelper' has no attribute 'from_llm_predictor'
I would start with a fresh venv and the latest llama index and langchain versions

The nltk stuff should be handled automatically - something weird happened to your cache, I think? (The fresh venv should help)

For the prompt helper, it looks like that function got removed (tbh I didn't think anyone was using that lol)

The default chunk size is 1024 now, so there's no need to define the prompt helper anyway. If you want to change the chunk size, you can do it in the service context: ServiceContext.from_defaults(..., chunk_size=1024)
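A minimal sketch of that call, wrapped in a helper - the lazy import is only there so the snippet loads without llama_index installed, and kwargs may differ between 0.6.x releases:

```python
def make_service_context(chunk_size: int = 1024):
    """Return a ServiceContext with an explicit chunk size.

    Mirrors the 0.6.x-era API mentioned above; 1024 is the current default,
    so passing it explicitly only matters if you want a different value.
    """
    from llama_index import ServiceContext  # imported lazily on purpose
    return ServiceContext.from_defaults(chunk_size=chunk_size)
```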
@Logan M
tbh i wasn't even 100% sure what PromptHelper does/did - i thought it splits queries that are too long into multiple ones, but maybe i'm wrong and it's not relevant...

Regarding chunk size: there are chunk sizes in the service context and also in the text splitter. I get what it means for splitting, but what does the chunk size in the service context do?

I'm really confused, what are the recommended (or default) sizes for:
  • max tokens in LLMPredictor when using the new 3.5-16k model vs gpt-4?
  • chunk size in the service context - is that only for the final query to the LLM?
if doing vector k=3 and the docs were split with TokenTextSplitter(chunk_size=400, chunk_overlap=50)
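As a rough back-of-envelope for those numbers - the overhead and answer-budget figures below are guesses for illustration, not defaults from any library:

```python
# Rough token budget for one retrieval query (chunk_size and top_k are the
# numbers from the question above; the other two figures are assumptions):
chunk_size = 400        # TokenTextSplitter(chunk_size=400, ...)
top_k = 3               # vector k=3
prompt_overhead = 300   # guess: prompt template + the user's question
answer_budget = 256     # guess: tokens reserved for the model's answer

needed = top_k * chunk_size + prompt_overhead + answer_budget  # 1756

# Both context windows have room; 16k mainly buys headroom for bigger
# chunks, a larger k, or chat history - not a different mechanism.
fits_4k = needed <= 4096      # gpt-3.5-turbo
fits_16k = needed <= 16384    # gpt-3.5-turbo-16k
print(needed, fits_4k, fits_16k)  # → 1756 True True
```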
Hi @Logan M , I double-checked with our SRE, who explained the issue to me in more detail. We are using AWS Lambda, which only has one writable folder: /tmp
It looks like PromptHelper (i think) needs nltk to download stopwords into /home, but in AWS Lambda only /tmp is writable by default.
Is there some high-level configuration that would change the folder NLTK tries to write to and then read from?
Locally this isn't an issue, but on Lambda it is. I don't think that changing all /home entries into /tmp entries across all kinds of different files is a good solution to this.
It would be very helpful if we could find which dependency causes the problem.
Something in the chain tries to download the stopwords into the default /home folder
You can set an ENV variable to control NLTK downloads
export NLTK_DATA=$PWD worked for me locally, hopefully it works on lambda lol
Thanks, I'll try!
Unfortunately it didn't work. I set the variable with os.environ["NLTK_DATA"] = '/tmp/nltk_data' but the logs still show:
Plain Text
[nltk_data] Downloading package stopwords to
[nltk_data] /home/sbx_user1051/nltk_data...
Error generating response: [Errno 30] Read-only file system: '/home/sbx_user1051'

I tried to dig deeper under the hood, but as a non-dev I can only guess at the meaning of what I see:
The service context does from llama_index.indices.prompt_helper import PromptHelper, which in turn does from llama_index.utils import globals_helper.
In utils.py I see _stopwords: Optional[List[str]] = None and then later below:
Plain Text
    def stopwords(self) -> List[str]:
        """Get stopwords."""
        if self._stopwords is None:
            try:
                import nltk
                from nltk.corpus import stopwords
            except ImportError:
                raise ImportError(
                    "`nltk` package not found, please run `pip install nltk`"
                )
            try:
                nltk.data.find("corpora/stopwords")
            except LookupError:
                nltk.download("stopwords")
            self._stopwords = stopwords.words("english")
        return self._stopwords

then the one-liner globals_helper = GlobalsHelper(), followed by another def: get_new_id(d: Set) -> str: ...

not sure if the stopwords thing here is used or what the chain exactly is, but it looks like it always falls through to the except LookupError branch and calls nltk.download("stopwords")

I found a Stack Overflow question (https://stackoverflow.com/questions/44857382/change-nltk-download-path-directory-from-default-ntlk-data) saying that "...nltk seems to totally ignore its own environment variable NLTK_DATA and default its download directories to a standard set of five paths..."
It continues by saying I could change that line to nltk.download('stopwords', download_dir='/tmp'), but I don't have a way to edit utils.py on Lambda - only my bot's py file, all the rest are dependencies
I downloaded the stopwords and have the list, but even pre-defining
_stopwords = [list of stopwords]
or
self._stopwords = [list of stopwords]
won't work, because utils.py starts with _stopwords: Optional[List[str]] = None, so it will overwrite my list
i think (noob opinion) that changing line 84 in utils.py from nltk.download("stopwords") to nltk.download("stopwords", download_dir='{NLTK_DATA}') could do the trick
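A workaround that doesn't require touching utils.py at all is to make nltk.data.find("corpora/stopwords") succeed before llama_index ever gets there: pre-download the corpus into /tmp at cold start, in your own bot file, and put /tmp on nltk's search path. A sketch - the /tmp/nltk_data path and the import ordering are assumptions to verify on your Lambda:

```python
import os

# Make the stopwords corpus findable under /tmp BEFORE llama_index imports
# nltk, so the bare nltk.download("stopwords") fallback in utils.py (which
# writes under the read-only /home) is never reached.
NLTK_DIR = "/tmp/nltk_data"
os.environ["NLTK_DATA"] = NLTK_DIR   # must be set before nltk is first imported
os.makedirs(NLTK_DIR, exist_ok=True)

try:
    import nltk
    if NLTK_DIR not in nltk.data.path:
        nltk.data.path.insert(0, NLTK_DIR)
    try:
        nltk.data.find("corpora/stopwords")
    except LookupError:
        nltk.download("stopwords", download_dir=NLTK_DIR, quiet=True)
except ImportError:
    pass  # nltk missing here; Lambda installs it from requirements.txt

# import llama_index only after the lines above have run
```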