
Defining models

At a glance

A discussion about embeddings and model selection in vector indexing reveals that even when gpt-3.5-turbo is selected for index creation, the default embedding model is actually text-embedding-ada-002. Community members clarify that GPT models don't handle embeddings, and that the max_input_size and chunk_size_limit parameters relate to LLM calls rather than embeddings.

The conversation then shifts to source tracking in document queries. Community members explain that response.source_nodes can show where answers come from, including similarity scores and positions. For PDF documents, while they're initially converted to one large string, it's possible to index by page by creating separate documents with unique page IDs. The library maintains chunk tracking even when text appears to be cut off.

A brief comparison notes that while both Langchain and llama-index can handle these tasks, llama-index performs faster with ChromaDB.

Wow, I didn't know that. So what is the use of the model I choose for the embedding, as well as max_input, chunk_size...
I'm not sure what you mean there πŸ€”

For embeddings, the default model is text-embedding-ada-002 (which is quite cheap thankfully)

max_input_size and chunk_size_limit are related to when we call the LLM (gpt-3.5-turbo in your case)
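A minimal sketch of where those two parameters live, assuming a legacy llama-index release where PromptHelper, LLMPredictor, and the index constructor accept them directly; the concrete values and the "data" folder are placeholders, not recommendations.

Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, PromptHelper, SimpleDirectoryReader

# these settings shape how prompts are packed when the LLM is called;
# they have nothing to do with how the embeddings are computed
prompt_helper = PromptHelper(
    max_input_size=4096,   # context window of the LLM
    num_output=256,        # tokens reserved for the answer
    max_chunk_overlap=20,  # overlap between chunks sent to the LLM
    chunk_size_limit=600,  # cap on the size of each chunk
)
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)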
When I create the index I choose gpt-3.5-turbo, really don't know why, since ada-002 is the best, as you say
And gpt doesn't do embeds
I don't understand why the index is generated with the same engine that the queries are made with
In your code, you've only set the llm_predictor.

But for GPTSimpleVectorIndex, it is not using that to create embeddings. There is a separate embed_model that defaults to text-embedding-ada-002
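To make the split explicit, a minimal sketch under the same legacy-API assumption: llm_predictor is only used when generating answers, while embed_model, which defaults to OpenAI's ada embeddings, is what actually embeds the chunks for the vector index.

Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo"))  # used at query/answer time
embed_model = OpenAIEmbedding()  # used to embed chunks and queries; this is the default anyway

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, embed_model=embed_model)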
And would you know how to answer what I asked before? The source...
Langchain does it well. But llama-index works faster than Langchain with ChromaDB
response = index.query("<my_query>")

In the response, you can check response.source_nodes to see where the answer came from. But it will only show the similarity, start/end positions, and doc_id

By default, the doc_id is a random string. But you can set the doc_id to the filename before constructing the index (assuming all filenames are unique). Something like this:

Plain Text
# tag each document with its source filename before building the index
for doc, fname in zip(documents, filenames):
  doc.doc_id = fname
index = GPTSimpleVectorIndex(documents, ...)
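For completeness, a minimal sketch of checking provenance once the doc_ids are set as above; it only relies on the index.query and response.source_nodes mentioned earlier, and prints each node whole since the exact attribute names vary between llama-index versions.

Plain Text
response = index.query("<my_query>")
print(response)  # the generated answer

for node in response.source_nodes:
    # each source node records the chunk the answer was drawn from,
    # with its similarity score, start/end positions, and doc_id
    print(node)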
I have only one doc, maybe I can index the page?
So I have page and source_node
is that possible?
Definitely! You can do something like:

Plain Text
from llama_index import Document, GPTSimpleVectorIndex

document_text = []  # a list of strings, one string per page
documents = []
for i, page in enumerate(document_text):
  documents.append(Document(page))
  documents[-1].doc_id = "my_doc_page_" + str(i)
index = GPTSimpleVectorIndex(documents, ...)


Just need to figure out how to get the text per page πŸ€”
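One way to get one string per page, sketched with pypdf, the successor to the PyPDF library mentioned in the next message; the file name is just a placeholder.

Plain Text
from pypdf import PdfReader

reader = PdfReader("my_doc.pdf")  # placeholder path
# one string per page, ready to feed the per-page Document loop above
document_text = [page.extract_text() or "" for page in reader.pages]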
Sure, the problem is that the whole PDF is transformed to text as "one big string", which is then cut off. Perhaps I could get the source by looking up the source_nodes via PyPDF after the query, only if the user wants to see the page. Also, I could underline the text
Looks interesting
You have been a great help. I wish I could help you with another topic, but I find it difficult πŸ₯²
Haha no worries!

The text won't get cut off quite like you think. The library will keep track of which documents the chunks came from

Try it out and see what it looks like πŸ’ͺ