Max tokens

I have chunk_size=512 and max_tokens=512, with context_window=2048 on both the model and the PromptHelper.
Max tokens means that you've set the max input size on the model to 512 👀

Can you share the code? I can help correct it
I'm sorry, I wasn't clear. By max_tokens I meant the parameter that PromptHelper calls num_output.

Plain Text
from llama_index import (
    StorageContext,
    load_index_from_storage,
    Prompt,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
    VectorStoreIndex,
    LangchainEmbedding,
    Document,
    ListIndex,
    PromptHelper
)
from os import listdir
from os.path import isfile, join
import json
from llama_index.optimization.optimizer import SentenceEmbeddingOptimizer
from langchain.llms import LlamaCpp
from langchain.embeddings import HuggingFaceEmbeddings
import os
import sys
import logging

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


MODEL = LlamaCpp(
    model_path="models/wizardlm-30B-uncensored.ggmlv3.q4_0.bin",
    verbose=False,
    max_tokens=512,
    n_ctx=2048
)

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
EMBEDDINGS_MODEL = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    cache_folder="models/transformers/"
)

template = (
    "Context: \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    " ### Human: {query_str}\n"
    "### Assistant: "
)

QA_TEMPLATE = Prompt(template)

service_context = ServiceContext.from_defaults(
    llm_predictor=LLMPredictor(llm=MODEL),
    embed_model=LangchainEmbedding(EMBEDDINGS_MODEL),
    # With these values the prompt occasionally went beyond the 2048-token context;
    # context_window=1536 worked flawlessly.
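    # (PromptHelper reserves num_output tokens for the response and packs the retrieved
    # chunks into what remains; the overruns are likely because it estimates token counts
    # with a different tokenizer than llama.cpp uses, so a safety margin helps.)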
    prompt_helper=PromptHelper(context_window=2048, num_output=512),
    chunk_size=512
)

storage_context = StorageContext.from_defaults()

topic_indexes = []
topic_index_summaries = []

mypath = "data"

files = [...]
Plain Text
for i, filename in enumerate(files):
    with open(join(mypath, filename), "r") as file:
        print(f"### Processing file {filename} ({i + 1}/{len(files)}) ###")
        topic = json.load(file)
        docs = transform_dataset_topic(topic)
        index = ListIndex.from_documents(docs, service_context=service_context, storage_context=storage_context)
        topic_indexes.append(index)
        summary = index.as_query_engine(
#            optimizer=SentenceEmbeddingOptimizer(percentile_cutoff=0.5, embed_model=LangchainEmbedding(EMBEDDINGS_MODEL)),
            text_qa_template=QA_TEMPLATE,
            response_mode="refine"
        ).query(
            "Provide a detailed summary of the topic."
        )
        topic_index_summaries.append(str(summary))
        print(f"### Summary for {filename}: {str(summary)} ###")


...
I've shortened the code.
Right. Since you are using a list index, it will use the template no matter what, because a list index sends every node in the index to the LLM.

You can avoid this, though, by using index.as_query_engine(response_mode="tree_summarize"), which is the ideal mode for creating summaries.
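(For reference, a minimal sketch of that suggestion, reusing the index and the query string from the code shared above:)

Plain Text
# Sketch only: same index as in the snippet above, with the suggested response mode
summary_engine = index.as_query_engine(response_mode="tree_summarize")
summary = summary_engine.query("Provide a detailed summary of the topic.")
print(str(summary))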
It produces inaccurate summaries in that case.
I'm more interested in why the response gets lost during the refine process.
Yeah, the refine prompt is complex, like I think you mentioned earlier. Open-source models are not good at following it.

You could try customizing index.as_query_engine(refine_template=my_refine_template)

The default refine template is here
https://github.com/jerryjliu/llama_index/blob/0cf7f9983b6ec0528a327e8bc0e64bf0321b73fc/llama_index/prompts/default_prompts.py#L81
@Logan M Thank you very much for your help. Seems like the following template works:
Plain Text
HumanMessagePromptTemplate.from_template(
    "-----------\n"
    "Complete the answer to the question \"{query_str}\" based on the following context.\n"
    "Original answer:\n"
    "------------\n"
    "{existing_answer}\n"
    "------------\n"
    "Context provided:\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
),
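(For completeness, a hedged sketch of how that message might be wired into the query engine, following the chat-prompt pattern used by llama_index's own default chat prompts in that era. The import paths, the RefinePrompt.from_langchain_prompt call, and the refine_prompt_lc / MY_REFINE_TEMPLATE names are assumptions based on that version of the API, not something posted in the thread; index and QA_TEMPLATE are reused from the earlier snippet:)

Plain Text
from langchain.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate
from llama_index.prompts.prompts import RefinePrompt

# Wrap the single chat message above into a LangChain chat prompt, then into a
# llama_index RefinePrompt (pattern assumed from this version's chat_prompts module).
refine_prompt_lc = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate.from_template(
        "-----------\n"
        "Complete the answer to the question \"{query_str}\" based on the following context.\n"
        "Original answer:\n"
        "------------\n"
        "{existing_answer}\n"
        "------------\n"
        "Context provided:\n"
        "------------\n"
        "{context_msg}\n"
        "------------\n"
    ),
])
MY_REFINE_TEMPLATE = RefinePrompt.from_langchain_prompt(refine_prompt_lc)

# Pass it alongside the QA template, as suggested earlier in the thread
query_engine = index.as_query_engine(
    text_qa_template=QA_TEMPLATE,
    refine_template=MY_REFINE_TEMPLATE,
    response_mode="refine",
)
summary = query_engine.query("Provide a detailed summary of the topic.")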