
Embeddings

At a glance

A community member hit a ValueError when using the new OpenAI embedding models with the LlamaIndex query engine. The comments explain that the new large model produces 3072-dimensional vectors, and that switching embedding models requires re-embedding all data with the new model. The community member re-created their embeddings with text-embedding-3-small, but the error persisted. After they shared their code, another community member pointed out that the service context must also be passed when loading the index from storage.

Are the new OpenAI embedding models supported in the LlamaIndex query engine?
Plain Text
ValueError: shapes (1536,) and (3072,) not aligned: 1536 (dim 0) != 3072 (dim 0)
15 comments
They are.

The new large model has 3072 dimensions.

You cannot switch embedding models, though, without first re-embedding all your data with the new model.
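For reference, re-embedding looks roughly like this (a minimal sketch; the import paths assume a pre-0.10 LlamaIndex release with the ServiceContext API used later in this thread, and the directory name is a placeholder):

Plain Text
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding

# Load the raw documents again; vectors persisted from the old model
# cannot be reused, because text-embedding-3-large outputs 3072 dimensions
# while ada-002 outputs 1536.
documents = SimpleDirectoryReader(input_dir="Text_Files").load_data()

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# Build a fresh index so every stored vector comes from the new model.
index = VectorStoreIndex.from_documents(documents, service_context=service_context)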
I have created new embeddings. I still face the issue.
I have used text-embedding-3-small for my documents. It still throws this error. I even passed the same embedding model in service_context to embed the query.
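Concretely, that configuration looks something like this (a sketch of what the community member describes, not their actual code):

Plain Text
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI

# The same model must embed both the documents and the incoming query;
# a mismatch between the two is exactly what produces the
# (1536,) vs (3072,) shape error above.
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    embed_model=embed_model,
)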
Can you share the code?
I'm not able to reproduce this
Sure. I'll get back to you
Here's the code.
index = indexgenerator(indexPath, documentsPath)
I'm not sure what this function does, but it should also be using a service context.
Plain Text
import os

# Imports assume a pre-0.10 LlamaIndex release.
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.extractors import EntityExtractor
from llama_index.ingestion import IngestionPipeline
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter


def indexgenerator(indexPath, documentsPath):

    # Check if storage already exists.
    if not os.path.exists(indexPath):
        print("Not existing")
        # Load the documents and create the index.
        entity_extractor = EntityExtractor(prediction_threshold=0.2, label_entities=False, device="cpu")

        node_parser = SentenceSplitter(chunk_overlap=200, chunk_size=2000)

        transformations = [node_parser, entity_extractor]

        documents = SimpleDirectoryReader(input_dir=documentsPath).load_data()

        pipeline = IngestionPipeline(transformations=transformations)

        nodes = pipeline.run(documents=documents)

        # embed_model is defined in the surrounding scope.
        service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0), embed_model=embed_model)

        index = VectorStoreIndex(nodes, service_context=service_context)

        # Store it for later.
        index.storage_context.persist(indexPath)
    else:
        # Load the existing index.
        print("Existing")
        storage_context = StorageContext.from_defaults(persist_dir=indexPath)
        index = load_index_from_storage(storage_context)

    return index
Did you mean it should use the service context even while loading from storage?
Yes, you need it even when loading:

Plain Text
load_index_from_storage(storage_context, service_context=service_context)
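Applied to the function above, the else branch becomes (a sketch reusing the same embed_model as the build path):

Plain Text
# The persisted index stores raw vectors, not the model that produced them,
# so the embed model must be supplied again at load time; otherwise the
# query is embedded with the default model and the dimensions differ.
service_context = ServiceContext.from_defaults(embed_model=embed_model)
storage_context = StorageContext.from_defaults(persist_dir=indexPath)
index = load_index_from_storage(storage_context, service_context=service_context)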
Thanks @Logan M