
Defining models

At a glance

A discussion about embeddings and model selection in vector indexing reveals that even when gpt-3.5-turbo is selected for index creation, the default embedding model is actually text-embedding-ada-002. Community members clarify that GPT models don't handle embeddings, and that the max_input_size and chunk_size_limit parameters relate to LLM calls rather than embeddings.

The conversation then shifts to source tracking in document queries. Community members explain that response.source_nodes can show where answers come from, including similarity scores and positions. For PDF documents, while they're initially converted to one large string, it's possible to index by page by creating separate documents with unique page IDs. The library maintains chunk tracking even when text appears to be cut off.

A brief comparison notes that while both Langchain and llama-index can handle these tasks, llama-index performs faster with ChromaDB.

Wow, I didn't know that. So what is the use of the model I choose for the embedding, as well as max_input, chunk_size...
I'm not sure what you mean there πŸ€”

For embeddings, the default model is text-embedding-ada-002 (which is quite cheap thankfully)

max_input_size and chunk_size_limit are related to when we call the LLM (gpt-3.5-turbo in your case)
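A minimal sketch of where those two parameters live, assuming a legacy llama-index release where PromptHelper, LLMPredictor, and the index constructor accept them directly; the concrete values and the "data" folder are placeholders, not recommendations.

Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, PromptHelper, SimpleDirectoryReader

# these settings shape how prompts are packed when the LLM is called;
# they have nothing to do with how the embeddings are computed
prompt_helper = PromptHelper(
    max_input_size=4096,   # context window of the LLM
    num_output=256,        # tokens reserved for the answer
    max_chunk_overlap=20,  # overlap between chunks sent to the LLM
    chunk_size_limit=600,  # cap on the size of each chunk
)
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)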
When I create the index I choose gpt-3.5-turbo, really don't know why, since ada-002 is the best, as you say
And gpt doesn't do embeds
I don't understand why the index is generated with the same engine that the queries are made with
In your code, you've only set the llm_predictor.

But for GPTSimpleVectorIndex, it is not using that to create embeddings. There is a separate embed_model that defaults to text-embedding-ada-002
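To make the split explicit, a minimal sketch under the same legacy-API assumption: llm_predictor is only used when generating answers, while embed_model, which defaults to OpenAI's ada embeddings, is what actually embeds the chunks for the vector index.

Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo"))  # used at query/answer time
embed_model = OpenAIEmbedding()  # used to embed chunks and queries; this is the default anyway

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, embed_model=embed_model)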
And would you know how to answer what I asked before? The source...
Langchain does it well. But llama-index works faster than Langchain with ChromaDB
response = index.query("<my_query>")

In the response, you can check response.source_nodes to see where the answer came from. But it will only show the similarity, start/end positions, and doc_id

By default, the doc_id is a random string. But you can set the doc_id to the filename before constructing the index (assuming all filenames are unique). Something like this:

Plain Text
# tag each document with its source filename before building the index
for doc, fname in zip(documents, filenames):
  doc.doc_id = fname
index = GPTSimpleVectorIndex(documents, ...)
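For completeness, a minimal sketch of checking provenance once the doc_ids are set as above; it only relies on the index.query and response.source_nodes mentioned earlier, and prints each node whole since the exact attribute names vary between llama-index versions.

Plain Text
response = index.query("<my_query>")
print(response)  # the generated answer

for node in response.source_nodes:
    # each source node records the chunk the answer was drawn from,
    # with its similarity score, start/end positions, and doc_id
    print(node)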
I have only one doc, maybe I can index the page?
So I have page and source_node
is that possible?
Definitely! You can do something like:

Plain Text
from llama_index import Document, GPTSimpleVectorIndex

document_text = []  # a list of strings, one string per page
documents = []
for i, page in enumerate(document_text):
  documents.append(Document(page))
  documents[-1].doc_id = "my_doc_page_" + str(i)
index = GPTSimpleVectorIndex(documents, ...)


Just need to figure out how to get the text per page πŸ€”
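One way to get one string per page, sketched with pypdf, the successor to the PyPDF library mentioned in the next message; the file name is just a placeholder.

Plain Text
from pypdf import PdfReader

reader = PdfReader("my_doc.pdf")  # placeholder path
# one string per page, ready to feed the per-page Document loop above
document_text = [page.extract_text() or "" for page in reader.pages]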
Sure, the problem is that the whole PDF is transformed to text as "one big string", which is then cut off. Perhaps I could get the source by looking up the source_nodes via PyPDF after the query, only if the user wants to see the page. Also, I could underline the text
Looks interesting
You have been a great help. I wish I could help you with another topic, but I find it difficult πŸ₯²
Haha no worries!

The text won't get cut off quite like you think. The library will keep track of which documents the chunks came from

Try it out and see what it looks like πŸ’ͺ