Homosexual Toaster
Joined September 25, 2024
Plain Text
# Legacy LangChain / LlamaIndex (pre-0.10) imports for this snippet
from langchain.llms import LlamaCpp
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import (
    LangchainEmbedding,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
    VectorStoreIndex,
)

# Local llama.cpp model with partial GPU offload
llm = LlamaCpp(
    model_path=r'C:\Users\UserAdmin\Desktop\vicuna\Wizard-Vicuna-30B-Uncensored.ggmlv3.q2_K.bin',
    verbose=False,
    n_ctx=2048,
    n_gpu_layers=55,
    n_batch=512,
    n_threads=11,
    temperature=0.65)

# Local sentence-transformers embeddings on the GPU
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(
    model_name=r".\all-mpnet-base-v2",
    model_kwargs={'device': 'cuda'}))

llm_predictor = LLMPredictor(llm=llm)

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, chunk_size=200, embed_model=embed_model)

documents = SimpleDirectoryReader(r'.\data\pdfs').load_data()

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# QA_TEMPLATE is defined elsewhere (not shown in the post)
query_engine = index.as_query_engine(text_qa_template=QA_TEMPLATE)
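
QA_TEMPLATE isn't defined in the snippet above. For reference, a custom text-QA template in the legacy llama_index API is usually built from Prompt; the wording below is hypothetical, not the poster's actual template:
Plain Text
from llama_index import Prompt

# Hypothetical template text -- the poster's actual QA_TEMPLATE is not shown.
# {context_str} and {query_str} are the placeholders llama_index fills in.
QA_TEMPLATE = Prompt(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, answer the question: {query_str}\n"
)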
9 comments
Does PromptHelper change the number of LLM calls?
I currently have
Plain Text
from llama_index import PromptHelper  # legacy (pre-0.10) import

prompt_helper = PromptHelper(
    context_window=8192,
    num_output=1,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=300,
)

connected to Mistral 7B, but this is my output:
Plain Text
llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    1137.07 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    3449.42 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    3452.64 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
...
llama_print_timings:       total time =    5615.29 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =     521.48 ms
llama_print_timings:      sample time =      32.55 ms /   251 runs   (    0.13 ms per token,  7712.40 tokens per second)
llama_print_timings: prompt eval time =     242.42 ms /   248 tokens (    0.98 ms per token,  1023.00 tokens per second)
llama_print_timings:        eval time =    7252.34 ms /   250 runs   (   29.01 ms per token,    34.47 tokens per second)
llama_print_timings:       total time =    7947.15 ms


Doesn't this mean the LLM is being called 5 times?
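
Each llama_print_timings block above is one llama.cpp generate call, so that run does look like five calls for a single query. In the legacy llama_index API the call count is driven mainly by how many retrieved chunks the response synthesizer packs per prompt (it uses PromptHelper's context_window when repacking), so PromptHelper only affects it indirectly; similarity_top_k, chunk size, and response_mode are the bigger levers. A minimal sketch of reducing the call count, with illustrative values and assuming index is the existing vector index:
Plain Text
# Fewer retrieved chunks plus "compact" packing usually means fewer generate calls.
# Values here are illustrative, not the poster's settings.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact",
)
response = query_engine.query("...")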
4 comments
Hi, can I just ask if anyone has found ways to improve the accuracy of the basic RAG system?

I know the three methods in the documentation are summarisation, sentence windows, and metadata search. However, summarising and creating metadata seem far too expensive in ChatGPT calls for me. I've tried adding a window, but it seems pretty weak (I'm using top_k=4 and a window of ±3 sentences). Has anyone used any methods that aren't expensive?
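
For reference, the sentence-window setup described above looks roughly like this in the legacy (pre-0.10) llama_index API; import paths and defaults vary by version, and the values mirror the top_k=4 / ±3-sentence setup mentioned, so treat it as a sketch rather than the poster's code:
Plain Text
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Parse documents into single-sentence nodes that carry a +/-3 sentence
# window in their metadata. `documents` is an already-loaded document list.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# At query time, swap each retrieved sentence for its surrounding window.
query_engine = index.as_query_engine(
    similarity_top_k=4,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)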
2 comments
So I'm currently using llama.cpp, but my GPU is a lot stronger than my CPU, and I understand that llama.cpp predominantly uses the CPU (even with CUDA acceleration, it's still using both, right?).

Does LlamaIndex/LangChain support other quantized LLaMA variants for GPU, like ExLlama or GPTQ? Also, if my intended use case is more complex semantic search over documents, would LLaMA be better, or would Alpaca be better? LLaMA seems like a text-generation model, and Alpaca might be better suited in this sense?
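
One option that keeps the rest of a LlamaIndex pipeline unchanged is swapping LlamaCpp for LangChain's HuggingFacePipeline wrapping a GPU-loaded quantized checkpoint (e.g. a GPTQ model loaded through transformers). A rough sketch under that assumption; the model id is a placeholder, not a specific recommendation:
Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from llama_index import LLMPredictor, ServiceContext

model_id = "some-org/some-gptq-llama"  # placeholder: any GPU-loadable quantized checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the transformers pipeline so it can be used wherever LlamaCpp was.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=pipe)

llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)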
1 comment
Plain Text
# Legacy LangChain / LlamaIndex (pre-0.10) imports; CallbackManager's path varies by LangChain version
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp
from llama_index import LLMPredictor, ServiceContext, SimpleDirectoryReader, KeywordTableIndex

# Stream tokens to stdout while the 7B ggml model runs through llama.cpp
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(model_path=r'7B\ggml-model-q4_0.bin', callback_manager=callback_manager, verbose=True, n_ctx=2048)
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

documents = SimpleDirectoryReader(r'------').load_data()

index = KeywordTableIndex.from_documents(documents, service_context=service_context)


So when creating the index using llama.cpp (7B) over a folder containing one PDF, I get output like this.
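
The log itself isn't included above, but KeywordTableIndex extracts keywords with one LLM call per chunk at build time, so llama.cpp prints a generation/timings block for every chunk in the PDF. If that is the unexpected output, a regex-based keyword index avoids those calls; a minimal sketch, assuming the same documents and service_context as in the snippet and the legacy llama_index API:
Plain Text
from llama_index import SimpleKeywordTableIndex

# Extracts keywords with a simple regex instead of the LLM,
# so no llama.cpp generation runs during index construction.
index = SimpleKeywordTableIndex.from_documents(documents, service_context=service_context)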
1 comment