
NLTK

You can set the cache dir it downloads to with the LLAMA_INDEX_CACHE_DIR env var
Then you could preload the Dockerfile with the NLTK files
Seems like they did something similar in that Stack Overflow thread
Are you sure that’s the one? Source looks like it’s NLTK_DATA
The fallback is Llama’s cache dir
I think that means it might work without modifying the Dockerfile… seems LlamaIndex already handles this even though NLTK doesn't :)
haha yea, NLTK_DATA isn't even used by their downloader, which is super annoying, so we made our own version xD
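For the Docker preload idea above, here's a minimal sketch that could run at image build time (the /app/nltk_data path is just an example; per the above, LlamaIndex checks NLTK_DATA first and only falls back to its own cache dir):

Plain Text
import os
import nltk

# Example path baked into the image; pair it with ENV NLTK_DATA=/app/nltk_data
# in the Dockerfile so the variable is also set at runtime
nltk_data_dir = "/app/nltk_data"
os.makedirs(nltk_data_dir, exist_ok=True)

# Pre-download the punkt sentence tokenizer so nothing is fetched at startup
nltk.download("punkt", download_dir=nltk_data_dir)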
@Logan M btw the type hint here is wrong
Plain Text
def split_by_sentence_tokenizer() -> Callable[[str], List[str]]:
It should be Callable[[str, str], List[str]], since the underlying function is:
Plain Text
def sent_tokenize(text, language="english"):
ah yea, that's fair
For others referencing this later:

Plain Text
import typing
from typing import Callable, List

# split_by_sentence_tokenizer comes from llama_index; the cast works around its one-arg type hint
TOKENIZER: Callable[[str, str], List[str]] = typing.cast(
    Callable[[str, str], List[str]], split_by_sentence_tokenizer()
)
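With the cast in place, callers can pass the language argument explicitly (assuming the returned function is NLTK's sent_tokenize, as in the signature above):

Plain Text
sentences = TOKENIZER("First sentence. Second sentence.", "english")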