Defining models

Wow, I didn't know that. So what is the use of the model I choose for the embedding, as well as max_input_size, chunk_size_limit...?
I'm not sure what you mean there πŸ€”

For embeddings, the default model is text-embedding-ada-002 (which is quite cheap, thankfully)

Max_input_size and chunk_size_limit are related to how we call the LLM (gpt-3.5-turbo in your case)
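For reference, those parameters live on llama-index's PromptHelper; a minimal sketch (the signature matches the v0.4-era API this thread is using, and the values are just examples):

Plain Text
from llama_index import PromptHelper

# shapes every LLM call: context window, answer budget, chunk overlap,
# and an optional cap on chunk size
prompt_helper = PromptHelper(
  max_input_size=4096,    # context window of the LLM
  num_output=256,         # tokens reserved for the answer
  max_chunk_overlap=20,   # token overlap between chunks
  chunk_size_limit=600,   # optional cap on chunk size
)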
When I create the index I choose gpt-3.5-turbo; I really don't know why, since ada-002 is the best, as you say
And gpt doesn't do embeddings
I don't understand why the index is built with the same model that the queries are made with
In your code, you've only set the llm_predictor.

But GPTSimpleVectorIndex is not using that to create embeddings. There is a separate embed_model that defaults to text-embedding-ada-002
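To make the split explicit, here's a sketch of passing both models to the index (the keyword arguments follow the v0.4-era GPTSimpleVectorIndex constructor; treat the exact names as assumptions on newer versions):

Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor
from llama_index.embeddings.openai import OpenAIEmbedding

# gpt-3.5-turbo only synthesizes answers at query time;
# embeddings come from the separate embed_model
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo"))
embed_model = OpenAIEmbedding()  # text-embedding-ada-002 by default

index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, embed_model=embed_model)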
And would you know how to answer what I asked before? The source...
Langchain does it well. But llama-index works faster than langchain with chromadb
Plain Text
response = index.query("<my_query>")

In the response, you can check response.source_nodes to see where the answer came from. But it will only show the similarity, start/end positions, and doc_id
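For example, a quick sketch for inspecting them (attribute names assume the v0.4-era SourceNode; node_info holds the start/end offsets):

Plain Text
for node in response.source_nodes:
  # which chunk the answer drew from, and how closely it matched
  print(node.doc_id, node.similarity, node.node_info)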

By default, the doc_id is a random string. But you can set the doc_id to the filename before constructing the index (assuming all filenames are unique). Something like this:

Plain Text
# give each document a human-readable ID before building the index
for doc, fname in zip(documents, filenames):
  doc.doc_id = fname
index = GPTSimpleVectorIndex(documents, ...)
I have only one doc, maybe I can index by page?
So I'd have the page and the source_node
Is that possible?
Definitely! You can do something like:

Plain Text
documents = []
document_text = [] # create a list of strings, one string per page
for i, page in enumerate(document_text):
  documents.append(Document(page))
  documents[-1].doc_id = "my_doc_page_" + str(i)
index = GPTSimpleVectorIndex(documents, ...)


Just need to figure out how to get the text per page πŸ€”
Sure, the problem is that the whole pdf is transformed to text as "one big string", which is then cut up. Perhaps I could get the source by doing a lookup of the source_nodes after the query via PyPDF, only if the user wants to see the page. Also, I could underline the text
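A minimal sketch of that per-page lookup with pypdf (the filename is hypothetical; pypdf is the maintained successor to PyPDF2 and uses the same reader API):

Plain Text
from pypdf import PdfReader

reader = PdfReader("my_doc.pdf")
# one string per page, ready for the per-page Document loop above
document_text = [page.extract_text() for page in reader.pages]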
Looks interesting
You have been a great help. I wish I could help you with another topic, but I find it difficult πŸ₯²
Haha no worries!

The text won't get cut off quite like you think. The library will keep track of which documents the chunks came from

Try it out and see what it looks like πŸ’ͺ