
Updated 2 years ago

HELP I updated packages and everything

HELP, I updated packages and everything broke - did anything change since version 0.6.14 that changed how prompt_helper is used?

I get an error from
Plain Text
prompt_helper = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor, chunk_size_limit=1024)


I returned to version 0.6.14 and kept langchain version 0.0.205, but then choosing the newer gpt-3.5-turbo-16k model gives this error:

Plain Text
[ERROR] ValueError: Unknown model: gpt-3.5-turbo-16k. Please provide a valid OpenAI model name.Known models are: gpt-4, gpt-4-0314, gpt-4-32k, gpt-4-32k-0314, gpt-3.5-turbo, gpt-3.5-turbo-0301, text-ada-001, ada, text-babbage-001, babbage, text-curie-001, curie, davinci, text-davinci-003, text-davinci-002, code-davinci-002, code-davinci-001, code-cushman-002, code-cushman-001


Then returning to the regular gpt-3.5-turbo model suddenly gives me the error "nltk package not found, please run pip install nltk", even though nltk is the first item in my project's requirements.txt file. I'm confused.
14 comments
I checked the logs as in picture...

OpenAI says:
If the error comes from nltk_data, you may need to download the required data into a writable directory (e.g., /tmp) within the Lambda environment. First, you should make sure to import nltk in your code and then update the data path to use the /tmp directory.
Add the following lines after all your imports:
Plain Text
import nltk

nltk.data.path.append('/tmp/nltk_data')

Then, you'll need to download the required data or corpora before using them. For example, if you're using the 'punkt' tokenizer, add this line after updating the data path:
Plain Text
nltk.download('punkt', download_dir='/tmp/nltk_data')

This way, you'll ensure that the required data is downloaded and stored in a writable directory within the Lambda environment.
---
But for now I have no idea where to put these lines; it's just confusing.
Attachment
image.png
btw the error with the prompt helper was:
Plain Text
type object 'PromptHelper' has no attribute 'from_llm_predictor'
I would start with a fresh venv and the latest llama index and langchain versions

The nltk stuff should be handled automatically; something weird happened to your cache, I think? (The fresh venv should help)

For the prompt helper, it looks like that function got removed (tbh I didn't think anyone was using that lol)

The default chunk size is 1024 now, so there's no need to define the prompt helper anyway. If you want to change the chunk size, you can do it in the service context: ServiceContext.from_defaults(..., chunk_size=1024)
@Logan M
tbh I wasn't even 100% sure what PromptHelper does/did. I thought it splits queries that are too long into multiple ones, but maybe I'm wrong and it's not relevant...

Regarding chunk size: there is a chunk size in the service context and also in the text splitter. I get what it means for splitting, but what does the chunk size in the service context do?

I'm really confused. What are the recommended (or default) sizes for:
  • max tokens in LLMPredictor when using the new 3.5-16k model vs gpt4?
  • chunk size in service context, is that only for the final query to the LLM?
This is when doing vector retrieval with k=3, where the docs were split with TokenTextSplitter(chunk_size=400, chunk_overlap=50)
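
A rough way to reason about the sizes asked about above is to add up the token budget for a single query. The sketch below is only arithmetic, not llama_index code: the context-window figures (4096, 16384, 8192), the 512-token output allowance, and the 200-token allowance for the prompt template and question are all illustrative assumptions, and `remaining_tokens` is a hypothetical helper name.

```python
def remaining_tokens(context_window, top_k, chunk_size, num_output, prompt_overhead=200):
    """Tokens left over after retrieved chunks, reserved output, and prompt text.

    All inputs are estimates; prompt_overhead is a guess at the size of the
    prompt template plus the user's question.
    """
    retrieved = top_k * chunk_size  # e.g. 3 chunks of up to 400 tokens each
    return context_window - retrieved - num_output - prompt_overhead


# Assumed context windows for the models mentioned in this thread.
ASSUMED_WINDOWS = {
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
    "gpt-4": 8192,
}

for model_name, window in ASSUMED_WINDOWS.items():
    spare = remaining_tokens(window, top_k=3, chunk_size=400, num_output=512)
    print(f"{model_name}: about {spare} tokens to spare")
```

With k=3 and 400-token chunks, the retrieved text is at most about 1200 tokens, so under these assumptions even the smallest window has room left; a negative result would mean the chunks, the output allowance, or k need to shrink.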
Hi @Logan M, I double-checked with our SRE, who explained the issue to me better. We are using AWS Lambda, which has only one writable folder: /tmp
It looks like PromptHelper (I think) needs nltk to download stopwords into /home, but AWS Lambda by default only has /tmp as a writable folder.
Is there some high-level configuration that would change the folder NLTK tries to write to and then read from?
Locally this isn't an issue, but on Lambda it is. I don't think that changing all /home entries into /tmp entries in all kinds of different files is a good solution to this.
It would be very helpful if we could find which dependency causes the problem.
Something in the chain tries to download the stopwords into the default /home folder.
You can set an ENV variable to control NLTK downloads
export NLTK_DATA=$PWD worked for me locally, hopefully it works on lambda lol
Thanks, I'll try!
Unfortunately it didn't work. I set the variable with os.environ["NLTK_DATA"] = '/tmp/nltk_data', but the logs still show:
Plain Text
[nltk_data] Downloading package stopwords to
[nltk_data] /home/sbx_user1051/nltk_data...
Error generating response: [Errno 30] Read-only file system: '/home/sbx_user1051'

I tried to dig deeper under the hood but as a non-dev I can only guess the meaning of what I see:
service_context imports PromptHelper (from llama_index.indices.prompt_helper import PromptHelper), which in turn imports globals_helper (from llama_index.utils import globals_helper).
In utils.py I see _stopwords: Optional[List[str]] = None and then later below:
Plain Text
    def stopwords(self) -> List[str]:
        """Get stopwords."""
        if self._stopwords is None:
            try:
                import nltk
                from nltk.corpus import stopwords
            except ImportError:
                raise ImportError(
                    "`nltk` package not found, please run `pip install nltk`"
                )
            try:
                nltk.data.find("corpora/stopwords")
            except LookupError:
                nltk.download("stopwords")
            self._stopwords = stopwords.words("english")
        return self._stopwords

then the one-liner of globals_helper = GlobalsHelper() followed by another def: get_new_id(d: Set) -> str: ....

Not sure whether the stopwords code here is actually used or what the exact chain is, but it looks like it always falls back to the except LookupError branch and calls nltk.download("stopwords")

I found a stackoverflow article (https://stackoverflow.com/questions/44857382/change-nltk-download-path-directory-from-default-ntlk-data) saying that "...nltk seems to totally ignore its own environment variable NLTK_DATA and default its download directories to a standard set of five paths..."
It continues by saying I could change that line to nltk.download('stopwords',download_dir='/tmp'), but I don't have a way to edit utils.py on Lambda, only my bot's .py file; all the rest are dependencies.
I downloaded the stopwords and have the list, but even so, pre-defining
_stopwords = [list of stopwords]
or
self._stopwords = [list of stopwords]
won't work, because utils.py starts with _stopwords: Optional[List[str]] = None, so it will overwrite my list
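
One thing that may be worth trying here: in Python, a class-level line like _stopwords: Optional[List[str]] = None only sets a default, and it does not overwrite a value assigned later on the instance. The sketch below uses a hypothetical stand-in class, `GlobalsHelperSketch`, that copies the caching pattern from the quoted code; it is not the real llama_index class, and it assumes stopwords is exposed as a property (the decorator is not visible in the quoted snippet).

```python
from typing import List, Optional


class GlobalsHelperSketch:
    """Hypothetical stand-in mirroring the caching pattern quoted from utils.py."""

    _stopwords: Optional[List[str]] = None  # class-level default only

    @property
    def stopwords(self) -> List[str]:
        if self._stopwords is None:
            # The real code would reach nltk.download("stopwords") here.
            raise RuntimeError("would try to download stopwords")
        return self._stopwords


helper = GlobalsHelperSketch()
# Assigning on the instance shadows the class default, so the
# download branch above is never reached.
helper._stopwords = ["a", "an", "the"]
print(helper.stopwords)
```

With the real library the equivalent would presumably be to run from llama_index.utils import globals_helper and then assign globals_helper._stopwords in the bot's own .py file before the first query. That is untested, and _stopwords is a private attribute, so it could change between versions.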
I think (noob opinion) that changing line 84 in utils.py from nltk.download("stopwords") to nltk.download("stopwords", download_dir='{NLTK_DATA}') could do the trick