```python
# https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b
llm = HuggingFaceLLM(
    model_name="stabilityai/stablelm-2-zephyr-1_6b",
    tokenizer_name="stabilityai/stablelm-2-zephyr-1_6b",
    query_wrapper_prompt=PromptTemplate(
        "<|system|>\n\n<|user|>\n{query_str}\n<|assistant|>\n"
    ),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"trust_remote_code": True},
    #tokenizer_kwargs={"max_length": 2048},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95, "do_sample": True},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
    # uncomment this if using CUDA to reduce memory usage
    #model_kwargs={"torch_dtype": torch.float16}
)
```
Even though `trust_remote_code` is set to `True`, I still get the question `Do you wish to run the custom code? [y/N]`:
```
.....
model.safetensors: 100% 3.29G/3.29G [00:39<00:00, 111MB/s]
generation_config.json: 100% 121/121 [00:00<00:00, 7.17kB/s]
tokenizer_config.json: 100% 825/825 [00:00<00:00, 37.4kB/s]
The repository for stabilityai/stablelm-2-zephyr-1_6b contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/stabilityai/stablelm-2-zephyr-1_6b.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
tokenization_arcade100k.py: 100% 9.89k/9.89k [00:00<00:00, 463kB/s]
....
```
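One possible explanation: `trust_remote_code` was only passed through `model_kwargs`, so it never reaches the tokenizer, and the prompt appears right before `tokenization_arcade100k.py` is downloaded. A minimal sketch, assuming `HuggingFaceLLM` forwards `tokenizer_kwargs` to `AutoTokenizer.from_pretrained`:

```python
# Sketch (assumption): pass trust_remote_code to the tokenizer as well,
# since model_kwargs only reaches the model loading step.
llm = HuggingFaceLLM(
    model_name="stabilityai/stablelm-2-zephyr-1_6b",
    tokenizer_name="stabilityai/stablelm-2-zephyr-1_6b",
    model_kwargs={"trust_remote_code": True},
    tokenizer_kwargs={"trust_remote_code": True},  # may silence the [y/N] prompt
    # ... other arguments as above ...
)
```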
I also get a `LangChainDeprecationWarning`:

```
/home/dev/.local/lib/python3.10/site-packages/langchain/chat_models/__init__.py:31: LangChainDeprecationWarning: Importing chat models from langchain is deprecated. Importing from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead: `from langchain_community.chat_models import ChatAnyscale`. To install langchain-community run `pip install -U langchain-community`.
  warnings.warn(
/home/dev/.local/lib/python3.10/site-packages/langchain/chat_models/__init__.py:31: LangChainDeprecationWarning: Importing chat models from langchain is deprecated. Importing from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead: `from langchain_community.chat_models import ChatOpenAI`.
```
I have installed `langchain-community`, but I mainly get this warning message when running `llm = LlamaCPP(...)`.
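If the warning is only noise (the deprecated import happens inside llama_index's langchain bridge, not in user code), one option is to filter it with the standard `warnings` module. A sketch that matches on the message text, so no private langchain class has to be imported:

```python
import warnings

# Sketch: silence the LangChainDeprecationWarning emitted by langchain's
# chat_models re-exports; matching on the message avoids importing the
# warning class from an internal langchain module.
warnings.filterwarnings(
    "ignore",
    message=".*Importing chat models from langchain is deprecated.*",
)
```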
Using an agent with tools:

```python
llm = OpenAI(model="gpt-3.5-turbo-0613")
agent = OpenAIAgent.from_tools([weather_tool], llm=llm, verbose=True)
response = agent.chat(
    "What's the weather like in San Francisco, Tokyo, and Paris?"
)
```
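If the goal is to run a similar tool-using agent on a local LLM (an assumption on my part), `OpenAIAgent` depends on OpenAI function calling, so `ReActAgent` is the usual alternative. A minimal sketch reusing the `weather_tool` and a local `llm` such as the `LlamaCPP` instance defined elsewhere in this post:

```python
from llama_index.agent import ReActAgent

# Sketch: ReActAgent drives tool use through prompting instead of OpenAI
# function calling, so it also works with a local LLM.
agent = ReActAgent.from_tools([weather_tool], llm=llm, verbose=True)
response = agent.chat(
    "What's the weather like in San Francisco, Tokyo, and Paris?"
)
```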
The response from `query_engine.query(QUERY)` is truncated.

```python
# define prompt viewing function
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))

prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)
```
**Prompt Key:** response_synthesizer:text_qa_template
**Text:**

```
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
```

**Prompt Key:** response_synthesizer:refine_template
**Text:**

```
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query.
If the context isn't useful, return the original answer.
Refined Answer:
```
The refine template uses the variables `context_msg` & `query_str`.
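If one of these prompts needs to be customized (for example to ask for shorter answers that fit within `max_new_tokens`, or to answer in French), `query_engine.update_prompts` accepts a dict keyed by the same prompt keys shown above. A sketch, where the replacement template text is a hypothetical example:

```python
from llama_index.prompts import PromptTemplate

# Sketch: override the QA template under its prompt key
# (response_synthesizer:text_qa_template).
custom_qa_tmpl = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query in French.\n"
    "Query: {query_str}\n"
    "Answer: "
)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": custom_qa_tmpl}
)
```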
The following `download_loader` call fails with `ValueError: Loader class name not found in library`:

```python
from llama_index import download_loader

TrafilaturaWebReader = download_loader("TrafilaturaWebReader")

loader = TrafilaturaWebReader()
documents = loader.load_data(urls=['https://google.com'])
```

```
/usr/local/lib/python3.10/dist-packages/llama_index/readers/download.py in download_loader(loader_class, loader_hub_url, refresh_cache, use_gpt_index_import, custom_path)
    138     library = json.loads(library_raw_content)
    139     if loader_class not in library:
--> 140         raise ValueError("Loader class name not found in library")
    141
    142     loader_id = library[loader_class]["id"]

ValueError: Loader class name not found in library
```
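Since the error comes from the loader-hub `library.json` not containing the class name, one thing worth trying (a sketch, based on the `refresh_cache` parameter visible in the traceback signature) is forcing a refresh of the cached library:

```python
from llama_index import download_loader

# Sketch: re-fetch the loader-hub library.json instead of using a
# possibly stale local cache.
TrafilaturaWebReader = download_loader("TrafilaturaWebReader", refresh_cache=True)
```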
```python
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)

doc_summary_index = DocumentSummaryIndex.from_documents(
    [data_document],
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)
doc_summary_index.storage_context.persist("index_summary")
```
The summary returned by `doc_summary_index.get_document_summary(DOC_ID)` is in English, not in French, even though `[data_document]` contains text in French.
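To get the summaries themselves in French, one option is to pass a French summary query when building the index. A sketch, where the `summary_query` argument and the French wording are assumptions:

```python
# Sketch (assumption): DocumentSummaryIndex accepts a summary_query string
# used when generating each per-document summary.
doc_summary_index = DocumentSummaryIndex.from_documents(
    [data_document],
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    summary_query="Décris en français le contenu du document suivant.",
    show_progress=True,
)
```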
Building a `VectorStoreIndex` with the French embedding model Solon:

```python
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
Settings.chunk_size = 512
Settings.chunk_overlap = 64

# https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1
embed_model_name = "OrdalieTech/Solon-embeddings-large-0.1"
embed_model = HuggingFaceEmbedding(model_name=embed_model_name)
Settings.embed_model = embed_model

# .....

vector_store_index = VectorStoreIndex.from_documents(documents=documents, show_progress=True)
```
This gives a CUDA out of memory error, since the previous model is still present in the GPU VRAM. With `langchain` it's possible to set the device to CPU for the embedding model:

```python
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "cpu"},  # Use CPU for embedding
)
```
Is there an equivalent when using `from llama_index.embeddings.huggingface import HuggingFaceEmbedding`?
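A minimal sketch, assuming the installed version of `HuggingFaceEmbedding` exposes a `device` argument:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Sketch (assumption): force the embedding model onto the CPU so it does
# not compete with the LLM for GPU VRAM.
embed_model = HuggingFaceEmbedding(
    model_name="OrdalieTech/Solon-embeddings-large-0.1",
    device="cpu",
)
```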
```python
from llama_index import set_global_handler

# general usage
set_global_handler("<handler_name>", **kwargs)
```
How can the global handler be removed?
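There does not seem to be a dedicated API for this; one possible approach, assuming `set_global_handler` simply stores the handler in `llama_index.global_handler` (an assumption about the library internals):

```python
import llama_index

# Sketch (assumption): set_global_handler assigns llama_index.global_handler,
# so resetting it to None should disable the handler again.
llama_index.global_handler = None
```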
`data_generator.generate_questions_from_nodes()` generates questions with the following prompt:
```
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
{query_str}
```
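To change this prompt, for example to generate the questions in French, `DatasetGenerator.from_documents` seems to accept a custom question-generation query. A sketch, where the `question_gen_query` argument and the French wording are assumptions:

```python
from llama_index.evaluation import DatasetGenerator

# Sketch (assumption): override the default question-generation query,
# here asking for questions in French.
data_generator = DatasetGenerator.from_documents(
    documents,
    service_context=service_context,
    question_gen_query=(
        "À partir du contexte fourni, génère uniquement des questions en français."
    ),
)
questions = data_generator.generate_questions_from_nodes()
```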
```
# https://github.com/abetlen/llama-cpp-python
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 --force-reinstall --upgrade --no-cache-dir --verbose

# https://github.com/run-llama/llama_index
!pip install llama-index
```
```python
import logging
import sys

from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt
from llama_index.llms import LlamaCPP

# Change INFO to DEBUG if you want more extensive logging
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

# https://gpt-index.readthedocs.io/en/stable/examples/llm/llama_2_llama_cpp.html
llm = LlamaCPP(
    model_url="https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf",
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    #model_path="mistral-7b-v0.1.Q4_K_M.gguf",
    temperature=0.0,
    max_new_tokens=1024,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    # note, this sets n_ctx in the model_kwargs below, so you don't need to pass it there
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
```
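To make the indexes and query engines above actually use this local LLM, it has to be registered in a service context (or in `Settings`, depending on the llama_index version). A minimal sketch using the pre-0.10 `ServiceContext` API together with the callback manager and embedding model defined earlier:

```python
from llama_index import ServiceContext

# Sketch: wire the local LlamaCPP model, the embedding model and the debug
# callback manager into the service_context used when building indexes.
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    callback_manager=callback_manager,
)
```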