Hi, I am new to LlamaIndex and trying to build a basic document retrieval system. I am using Azure OpenAI embeddings. I have two problems and I am not sure how to resolve them:
  1. I have a long document with multiple paragraphs. I want to treat each paragraph as a separate document. How do I do this?
  2. I have many text files. I can create a Document object for each, but how do I generate embeddings? Every time I use more than 1 document, I get an error: “Too many inputs…..” It seems like a limitation of Azure embeddings. How do I resolve this?
Thanks in advance!!
Hi,

  1. If you want to create a document for each paragraph separately, you can either use the TextNode class directly or a wrapper over the TextNode class in the form of Document (see the paragraph-splitting sketch after this list)
Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

para_1 = TextNode(text="Your text here")

# To insert this into an index
index = VectorStoreIndex([para_1])


  2. How are you generating embeddings here?
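As a follow-up to point 1, a minimal sketch of per-paragraph nodes, assuming paragraphs are separated by blank lines (the file path is just an example):
Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

# Hypothetical input file; any long plain-text document works
long_text = open("./data/my_long_document.txt").read()

# Split on blank lines and build one TextNode per paragraph
nodes = [
    TextNode(text=para.strip())
    for para in long_text.split("\n\n")
    if para.strip()
]

index = VectorStoreIndex(nodes)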
I am using AzureOpenAIEmbedding. Now whenever I call VectorStoreIndex with multiple documents, it gives me an error.
Can you share your code for the same, if possible?
Plain Text
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    VectorStoreIndex,
    set_global_service_context,
)
from llama_index.embeddings import AzureOpenAIEmbedding

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="textembeddingada002",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

# llm is configured elsewhere
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)

set_global_service_context(service_context)

documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

index = VectorStoreIndex.from_documents(documents)
On the last line, I get

"Too many inputs. The max number of inputs is 1. We hope to increase the number of inputs per request soon...."
The code looks fine. Can you check what you get in the documents variable?
Plain Text
[Document(id_='84d29648-c597-407c-968a-443924ebf956', embedding=None, metadata={'file_path': 'data/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2023-12-11', 'last_modified_date': '2023-12-11', 'last_accessed_date': '2023-12-11'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, hash='5dfe27179663d7ae4c02bbb134d50a62143a55545a90f194a20454deb5df5901', text='\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, ...................Jessica Livingston, Robert Morris, and Harj Taggar for reading drafts of this.\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
I omitted the text here. It is the sample text from the LlamaIndex website, with multiple paragraphs.
It is a single Document element here. ^
When I keep just a single paragraph here, it works fine.
Do I have to define chunk_size or a similar value somewhere?
No, the default is fine. Can you share your full error? The code looks alright to me.

Also, what is the version that you are trying with?
This is classic Azure: set embed_batch_size=1 in the embedding model.
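For reference, a minimal sketch of that fix, reusing the deployment settings from the code above:
Plain Text
from llama_index.embeddings import AzureOpenAIEmbedding

# embed_batch_size=1 sends one text per request, matching the
# deployment's "max number of inputs is 1" limit
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="textembeddingada002",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
    embed_batch_size=1,
)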
Oh, is that so? I will try this out, thanks!
Anyway, is there any documentation available for the AzureOpenAIEmbedding class?
Not really documentation, but an example!
Yes, I saw this. But I am looking for documentation with all the parameters and explanations.
But tbh the API docs suck, reading the source code is more informative lol
Is it the case that when a document/text is large enough, the AzureOpenAIEmbedding object will break the text into multiple batches and create embeddings then?
Context: while using embed_batch_size=1 with a small document, it works fine. When I put in a large text document, embedding generation fails again with a "Too many requests" error.
Is creating nodes from the documents the right approach in this case?
Right, from_documents() will chunk documents into nodes, or you can do it manually.
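A minimal sketch of the manual route, using SentenceSplitter with example chunk settings (the sizes are illustrative, not required values):
Plain Text
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SentenceSplitter

documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

# Chunk documents into nodes manually; chunk_size is in tokens
parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)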