
NLTK

You can set the cache dir it downloads to with the LLAMA_INDEX_CACHE_DIR env var
Then you could preload the Dockerfile with the NLTK files
Seems like they did something similar in that Stack Overflow thread
Are you sure that’s the one? Source looks like it’s NLTK_DATA
The fallback is Llama’s cache dir
I think that means it might work without modifying the Dockerfile… seems LlamaIndex already handles this even though NLTK doesn't :)
haha yea, NLTK_DATA isn't even used by their downloader, which is super annoying, so we made our own version xD
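For the Docker preload idea above, here's a minimal sketch that could run at image build time (the /app/nltk_data path is just an example; per the above, LlamaIndex checks NLTK_DATA first and only falls back to its own cache dir):

Plain Text
import os
import nltk

# Example path baked into the image; pair it with ENV NLTK_DATA=/app/nltk_data
# in the Dockerfile so the variable is also set at runtime
nltk_data_dir = "/app/nltk_data"
os.makedirs(nltk_data_dir, exist_ok=True)

# Pre-download the punkt sentence tokenizer so nothing is fetched at startup
nltk.download("punkt", download_dir=nltk_data_dir)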
@Logan M btw the type hint here is wrong
Plain Text
def split_by_sentence_tokenizer() -> Callable[[str], List[str]]:
It should be Callable[[str, str], List[str]], since the underlying function is:
Plain Text
def sent_tokenize(text, language="english"):
ah yea, that's fair
For others referencing this later:

Plain Text
import typing
from typing import Callable, List

# split_by_sentence_tokenizer comes from llama_index; the cast works around its one-arg type hint
TOKENIZER: Callable[[str, str], List[str]] = typing.cast(
    Callable[[str, str], List[str]], split_by_sentence_tokenizer()
)
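With the cast in place, callers can pass the language argument explicitly (assuming the returned function is NLTK's sent_tokenize, as in the signature above):

Plain Text
sentences = TOKENIZER("First sentence. Second sentence.", "english")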