Homosexual Toaster
Joined September 25, 2024
Plain Text
# Legacy LangChain / LlamaIndex (pre-0.10) imports for this snippet
from langchain.llms import LlamaCpp
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import (
    LangchainEmbedding,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
    VectorStoreIndex,
)

# Local llama.cpp model with partial GPU offload
llm = LlamaCpp(
    model_path=r'C:\Users\UserAdmin\Desktop\vicuna\Wizard-Vicuna-30B-Uncensored.ggmlv3.q2_K.bin',
    verbose=False,
    n_ctx=2048,
    n_gpu_layers=55,
    n_batch=512,
    n_threads=11,
    temperature=0.65)

# Local sentence-transformers embeddings on the GPU
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(
    model_name=r".\all-mpnet-base-v2",
    model_kwargs={'device': 'cuda'}))

llm_predictor = LLMPredictor(llm=llm)

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, chunk_size=200, embed_model=embed_model)

documents = SimpleDirectoryReader(r'.\data\pdfs').load_data()

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# QA_TEMPLATE is defined elsewhere (not shown in the post)
query_engine = index.as_query_engine(text_qa_template=QA_TEMPLATE)
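
QA_TEMPLATE isn't defined in the snippet above. For reference, a custom text-QA template in the legacy llama_index API is usually built from Prompt; the wording below is hypothetical, not the poster's actual template:
Plain Text
from llama_index import Prompt

# Hypothetical template text -- the poster's actual QA_TEMPLATE is not shown.
# {context_str} and {query_str} are the placeholders llama_index fills in.
QA_TEMPLATE = Prompt(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, answer the question: {query_str}\n"
)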
9 comments
Does PromptHelper change the number of LLM calls?
I currently have
Plain Text
from llama_index import PromptHelper  # legacy (pre-0.10) import

prompt_helper = PromptHelper(
    context_window=8192,
    num_output=1,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=300,
)

connected to Mistral 7B, but this is my output:
Plain Text
llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    1137.07 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    3449.42 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    3452.64 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    5615.29 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
llama_print_timings:      sample time =      32.55 ms /   251 runs   (    0.13 ms per token,  7712.40 tokens per second)
llama_print_timings: prompt eval time =     242.42 ms /   248 tokens (    0.98 ms per token,  1023.00 tokens per second)
llama_print_timings:        eval time =    7252.34 ms /   250 runs   (   29.01 ms per token,    34.47 tokens per second)
llama_print_timings:       total time =    7947.15 ms


Doesn't this mean the LLM is being called 5 times?
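
Each llama_print_timings block above is one llama.cpp generate call, so that run does look like five calls for a single query. In the legacy llama_index API the call count is driven mainly by how many retrieved chunks the response synthesizer packs per prompt (it uses PromptHelper's context_window when repacking), so PromptHelper only affects it indirectly; similarity_top_k, chunk size, and response_mode are the bigger levers. A minimal sketch of reducing the call count, with illustrative values and assuming index is the existing vector index:
Plain Text
# Fewer retrieved chunks plus "compact" packing usually means fewer generate calls.
# Values here are illustrative, not the poster's settings.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact",
)
response = query_engine.query("...")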
4 comments
Hi, can I just ask if anyone has found ways to improve the accuracy of the basic RAG system?

I know the three methods in the documentation are summarisation, sentence windows, and metadata search. However, summarising and creating metadata seem far too expensive in ChatGPT calls for me. I've tried adding a window, but it seems pretty weak (I'm using top_k=4 and a window of ±3 sentences). Has anyone used any methods that aren't expensive?
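
For reference, the sentence-window setup described above looks roughly like this in the legacy (pre-0.10) llama_index API; import paths and defaults vary by version, and the values mirror the top_k=4 / ±3-sentence setup mentioned, so treat it as a sketch rather than the poster's code:
Plain Text
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Parse documents into single-sentence nodes that carry a +/-3 sentence
# window in their metadata. `documents` is an already-loaded document list.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# At query time, swap each retrieved sentence for its surrounding window.
query_engine = index.as_query_engine(
    similarity_top_k=4,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)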
2 comments
So I'm currently using llama.cpp, but my GPU is a lot stronger than my CPU, and I understand that llama.cpp predominantly uses the CPU (even with CUDA acceleration, it's still using both, right?).

Does LlamaIndex/LangChain support other quantized LLaMA variants for GPU, like ExLlama or GPTQ? Also, if my intended use case is more complex semantic search over documents, would LLaMA be better, or would Alpaca be better? LLaMA seems like a text-generation model, and Alpaca might be better suited in this sense?
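
One option that keeps the rest of a LlamaIndex pipeline unchanged is swapping LlamaCpp for LangChain's HuggingFacePipeline wrapping a GPU-loaded quantized checkpoint (e.g. a GPTQ model loaded through transformers). A rough sketch under that assumption; the model id is a placeholder, not a specific recommendation:
Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from llama_index import LLMPredictor, ServiceContext

model_id = "some-org/some-gptq-llama"  # placeholder: any GPU-loadable quantized checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the transformers pipeline so it can be used wherever LlamaCpp was.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=pipe)

llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)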
1 comment
Plain Text
# Legacy LangChain / LlamaIndex (pre-0.10) imports; CallbackManager's path varies by LangChain version
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp
from llama_index import LLMPredictor, ServiceContext, SimpleDirectoryReader, KeywordTableIndex

# Stream tokens to stdout while the 7B ggml model runs through llama.cpp
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(model_path=r'7B\ggml-model-q4_0.bin', callback_manager=callback_manager, verbose=True, n_ctx=2048)
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

documents = SimpleDirectoryReader(r'------').load_data()

index = KeywordTableIndex.from_documents(documents, service_context=service_context)


So when creating the index using llama.cpp (7B) over a folder containing one PDF, I get output like this.
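
The log itself isn't included above, but KeywordTableIndex extracts keywords with one LLM call per chunk at build time, so llama.cpp prints a generation/timings block for every chunk in the PDF. If that is the unexpected output, a regex-based keyword index avoids those calls; a minimal sketch, assuming the same documents and service_context as in the snippet and the legacy llama_index API:
Plain Text
from llama_index import SimpleKeywordTableIndex

# Extracts keywords with a simple regex instead of the LLM,
# so no llama.cpp generation runs during index construction.
index = SimpleKeywordTableIndex.from_documents(documents, service_context=service_context)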
1 comment