Does anybody know how to use streamlit

Does anybody know how to use Streamlit with the latest "load_index_from_storage" function? The index returned is not "pickleable", and thus using st.cache_data does not work ...

Other question: any recommendation for a lib or a module that would split a large JSON file like a vector_store.json file? If the file is >100 MB, it cannot be uploaded to GitHub (unless we use GitHub Large File Storage -- whatever that is) ... I'd prefer to split the file, store it in parts on GitHub, then when it's time to read it back, simply "join" the JSON again before feeding it to a storage_context ... unless GitHub Large File Storage is really cool and the way to go ... 😉
What does the Streamlit function you currently have look like? There are ways around the cache issue
(Also not sure on the second part, you might have to craft your own script to do that lol)
I would probably put the index on S3 or a Google Cloud Storage bucket though, rather than GitHub
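For what it's worth, here is a minimal sketch of the "split then join" idea from the original question, assuming plain byte-level chunking of vector_store.json into parts that stay under GitHub's 100 MB limit. The helper names, chunk size, and paths are made up for illustration:
Plain Text
import os

CHUNK_SIZE = 90 * 1024 * 1024  # stay safely below GitHub's 100 MB per-file limit

def split_file(path, chunk_size=CHUNK_SIZE):
    """Split `path` into path.part000, path.part001, ... of at most chunk_size bytes each."""
    with open(path, "rb") as f:
        i = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            with open(f"{path}.part{i:03d}", "wb") as out:
                out.write(chunk)
            i += 1

def join_file(path):
    """Rebuild `path` from its .partNNN pieces before handing the folder to a StorageContext."""
    i = 0
    with open(path, "wb") as out:
        while os.path.exists(f"{path}.part{i:03d}"):
            with open(f"{path}.part{i:03d}", "rb") as part:
                out.write(part.read())
            i += 1

# e.g. join_file("./storage/vector_store.json") before calling
# StorageContext.from_defaults(persist_dir="./storage")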
This is the function:
Plain Text
from llama_index import StorageContext, load_index_from_storage

# TODO: cache index (only the index struct from storage_context)
def load_index_folder(idx_folder, service_context):
    print("Loading indices from [" + idx_folder + "]")

    # rebuild storage context from the persisted folder
    storage_context = StorageContext.from_defaults(persist_dir=idx_folder)

    # load index
    index = load_index_from_storage(storage_context, service_context=service_context)

    return index

I wrote that comment after several tests ... I noticed I can st.cache_data the index_struct before it's converted into a class in load_indices_from_storage() ... but it would be simpler if I could just cache the returned index ... The error I got is something like this: https://github.com/jerryjliu/llama_index/issues/886
does the st.cache_resource decorator work?
no but I must admit I don't remember the error then πŸ™‚ I'll investigate again tomorrow ...
Yea I've definitely cached the index in a streamlit before, I'm pretty sure st.cache_resource worked πŸ€”
yes, that's what you did in your repo, but I think the structure of the whole thing must have changed or something ... not sure ... I'm a Node.js guy, not a Python guy (and weirdly I'm beginning to like it) and I just discovered today what a "pickle" was 😉
I'll test it later today... maybe it's a streamlit version thing?

lol man i love python. JS/node is ok too, but I really only use it for complex frontends πŸ™‚
I've always had very strong biases against Python ... I was wrong ... there is some very impressive stuff syntactically ... the only thing I dislike is the indentation ... not that I'm a die-hard fan of curly brackets, but it's imho less prone to errors ...
definitely fair, the indenting takes some getting used to lol
Part of the problem is in the definition of the function and its parameters, as they are hashed for the cache too ...

Thus I added this one, which the main program calls just after calling the function above:
Plain Text
import streamlit as st

@st.cache_resource
def cache_index(index):
    return index
The error I get now is:
UnhashableParamError: Cannot hash argument 'index' (of type llama_index.indices.vector_store.base.VectorStoreIndex) in 'cache_index'.
If I try with @st.cache_data, the error is the same.
@Logan M I don't know what exactly is unhashable and why ...
You can prefix the function parameter with an underscore to avoid hashing it:

@st.cache_resource
def cache_index(_index):
    return _index
I think then it only caches the output? Tbh not 100% sure. Every time I've used the underscore trick I had more parameters, so it might be basing the cache off of the other parameters then?
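Putting the pieces of this thread together, here is a minimal sketch of one way it could look: decorate the loader itself with @st.cache_resource and prefix the unhashable service_context argument with an underscore so Streamlit skips hashing it, while the persist-directory string (which is hashable) keys the cache. This is only a sketch based on the discussion above, assuming the same llama_index imports used earlier in the thread, not a confirmed solution:
Plain Text
import streamlit as st
from llama_index import StorageContext, load_index_from_storage

@st.cache_resource
def load_index_folder(idx_folder, _service_context):
    # idx_folder (a plain string) is hashable and becomes the cache key;
    # _service_context is excluded from hashing because of the leading underscore
    storage_context = StorageContext.from_defaults(persist_dir=idx_folder)
    return load_index_from_storage(storage_context, service_context=_service_context)

# usage in the main script (positional, since the parameter is named _service_context):
# index = load_index_folder("./storage", service_context)

With this pattern the cached return value is the loaded index, keyed on the non-underscored arguments, so the expensive load should only rerun when idx_folder changes or the cache is cleared.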