Hi, a few questions

Hi, a few questions:
  1. I'm indexing a document using the standard usage pattern (https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#customizing-llm-s), but the resulting index is only one node. Therefore, each query uses about 6K tokens. How do I go about debugging this?
  2. Is there a good source (example notebook, article, etc.) on the effect of the parameters used with GPTSimpleVectorIndex (chunk size, max_input_size, etc.)?
9 comments
  1. You can set something like chunk_size_limit=512 in the service context so that your document is broken into more nodes.
  2. There isn't really a dedicated explanation of the parameters, but I'm happy to help if you have further questions. Basically:
max_input_size is the max input size to the LLM (usually 4096 for OpenAI models).

num_output is the number of expected output tokens (256 by default for OpenAI, https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-changing-the-number-of-output-tokens-for-openai-cohere-ai21). Every prompt sent to the LLM is set up so that there is room for these tokens, since GPT decoder-like models generate tokens one at a time while appending them to the original input.

chunk_size_limit sets the size of the chunks LlamaIndex breaks documents into. By default, it is 4000. However, at query time, chunks may be broken into even smaller pieces to make sure there is room for num_output in the prompt. (A short sketch of wiring these together follows.)
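For example, a minimal sketch of setting these up, using the pre-0.5 llama_index API seen elsewhere in this thread (the values are just the defaults and suggestions discussed above):

from llama_index import PromptHelper, ServiceContext

max_input_size = 4096   # max prompt size the LLM accepts
num_output = 256        # tokens reserved for the LLM's answer
max_chunk_overlap = 20  # token overlap between adjacent chunks

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
# chunk_size_limit belongs on the service context, not the prompt helper
service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    chunk_size_limit=512,
)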
Thanks for your help, Logan. I'm using the same parameters you mentioned and still have the same issues.
  1. The response is always 1 node, and response.source_nodes[0].node.text is exactly the same as the document I fed GPTSimpleVectorIndex when indexing. I feel like there might be something wrong with the way I index the document. In my use case, I just want to index one document and build a QA bot on top of it. I'd be happy to share the code with which I build my index if you want.
  2. Two side questions:
A. Why is total LLM token usage about 7-8K (and sometimes more) on my queries? How can a query exceed the 4096-token limit when that's the max GPT accepts?
B. How does index.query() remember past questions? I'd like to make each question independent of the previous one (unlike what happens in a chat model).
  1. The default behavior of the vector index is to fetch the single top node. You can change this with something like index.query(..., similarity_top_k=3), which will fetch the top 3 nodes that best match the query. You might also be interested in using compact mode to decrease how long the query takes: index.query(..., response_mode="compact") will stuff as much text as possible into each LLM call. (A combined sketch follows the links below.)
2A. If the retrieved node plus the prompt does not fit into a single LLM call, the text gets broken into multiple chunks, and the answer to the query is refined over a few LLM calls (this is a very powerful concept!). That refinement across several calls is why a single query can use more tokens than the 4096 limit of any one call.

2B. index.query() will not remember past questions. LlamaIndex is meant to be more of a search tool than a chatbot, although there are some super cool ways to integrate LlamaIndex as a tool inside LangChain:
https://github.com/jerryjliu/llama_index/blob/main/examples/langchain_demo/LangchainDemo.ipynb
https://gpt-index.readthedocs.io/en/latest/guides/tutorials/building_a_chatbot.html
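Here is a minimal sketch combining both options (the question string is a hypothetical placeholder, and index is assumed to be an already-built GPTSimpleVectorIndex):

# fetch the 3 best-matching nodes and pack as much text as possible per LLM call
response = index.query(
    "What is this document about?",  # hypothetical question
    similarity_top_k=3,
    response_mode="compact",
)
print(response)
print(len(response.source_nodes))  # up to 3 source nodes now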
I tried index.query(..., similarity_top_k=3), but I have the same issue. I think the problem is in the indexing. I'm not sure, but I feel like the indexed document is always 1 node (the whole doc); len(response.source_nodes) always returns 1 as well. Do you know what might cause this?
This is the code I use to index:
from io import BytesIO

from langchain.chat_models import ChatOpenAI
from llama_index import (
    Document,
    GPTSimpleVectorIndex,
    LLMPredictor,
    PromptHelper,
    ServiceContext,
)
from PyPDF2 import PdfReader  # or: from pypdf import PdfReader


def parse_pdf(file: BytesIO):
    pdf = PdfReader(file)
    text_list = []
    # Get the number of pages in the PDF document
    num_pages = len(pdf.pages)
    # Iterate over every page and extract its text
    for page in range(num_pages):
        page_text = pdf.pages[page].extract_text()
        text_list.append(page_text)
    text = "\n".join(text_list)
    return Document(text)


docs = []  # I only have one doc in docs
with open(pdf_file, 'rb') as f:
    fs = f.read()
    docs.append(parse_pdf(BytesIO(fs)))

# set maximum input size
max_input_size = 4096
# set number of output tokens
num_outputs = 256
# set maximum chunk overlap
max_chunk_overlap = 20
# set chunk size limit
chunk_size_limit = 512

prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index = GPTSimpleVectorIndex.from_documents(docs, service_context=service_context)
Aha! Try this instead: ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, chunk_size_limit=512)
chunk_size_limit should be a service context parameter, not a prompt helper one. I know it's a little confusing!
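Applied to the indexing code above, the end of the snippet would become (a sketch keeping the same model and sizes):

prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap)
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
    chunk_size_limit=512,  # moved here from PromptHelper
)
index = GPTSimpleVectorIndex.from_documents(docs, service_context=service_context)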
oh wow! This fixed it. Thank you so much!
Nice! πŸ’ͺ