
Embeddings

At a glance

A community member hit a ValueError when using the new OpenAI embedding models with the LlamaIndex query engine. The comments explain that the new large model produces 3072-dimensional vectors, and that switching embedding models requires re-embedding all data with the new model. The community member re-created their embeddings with text-embedding-3-small, but the error persisted. After they shared their code, another community member pointed out that the service context must also be passed when loading the index from storage.

Are the new OpenAI embedding models supported in the LlamaIndex query engine?
Plain Text
ValueError: shapes (1536,) and (3072,) not aligned: 1536 (dim 0) != 3072 (dim 0)
15 comments
They are.

The new large model has 3072 dimensions.

You cannot switch embedding models, though, without first re-embedding all your data with the new model.
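For reference, re-embedding looks roughly like this (a minimal sketch; the import paths assume a pre-0.10 LlamaIndex release with the ServiceContext API used later in this thread, and the directory name is a placeholder):

Plain Text
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding

# Load the raw documents again; vectors persisted from the old model
# cannot be reused, because text-embedding-3-large outputs 3072 dimensions
# while ada-002 outputs 1536.
documents = SimpleDirectoryReader(input_dir="Text_Files").load_data()

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# Build a fresh index so every stored vector comes from the new model.
index = VectorStoreIndex.from_documents(documents, service_context=service_context)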
I have created new embeddings. I still face the issue.
I have used text-embedding-3-small for my documents. It still throws this error. I even passed the same embedding model in service_context to embed the query.
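Concretely, that configuration looks something like this (a sketch of what the community member describes, not their actual code):

Plain Text
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI

# The same model must embed both the documents and the incoming query;
# a mismatch between the two is exactly what produces the
# (1536,) vs (3072,) shape error above.
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    embed_model=embed_model,
)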
Can you share the code?
I'm not able to reproduce this
Sure. I'll get back to you
Here's the code.
index = indexgenerator(indexPath, documentsPath)
I'm not sure what this function does, but it should also be using a service context.
Plain Text
import os

# Imports assume a pre-0.10 LlamaIndex release.
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.extractors import EntityExtractor
from llama_index.ingestion import IngestionPipeline
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter


def indexgenerator(indexPath, documentsPath):

    # Check if storage already exists.
    if not os.path.exists(indexPath):
        print("Not existing")
        # Load the documents and create the index.
        entity_extractor = EntityExtractor(prediction_threshold=0.2, label_entities=False, device="cpu")

        node_parser = SentenceSplitter(chunk_overlap=200, chunk_size=2000)

        transformations = [node_parser, entity_extractor]

        documents = SimpleDirectoryReader(input_dir=documentsPath).load_data()

        pipeline = IngestionPipeline(transformations=transformations)

        nodes = pipeline.run(documents=documents)

        # embed_model is defined in the surrounding scope.
        service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0), embed_model=embed_model)

        index = VectorStoreIndex(nodes, service_context=service_context)

        # Store it for later.
        index.storage_context.persist(indexPath)
    else:
        # Load the existing index.
        print("Existing")
        storage_context = StorageContext.from_defaults(persist_dir=indexPath)
        index = load_index_from_storage(storage_context)

    return index
Did you mean it should use the service context even while loading from storage?
Yes, you need it even when loading:

Plain Text
load_index_from_storage(storage_context, service_context=service_context)
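Applied to the function above, the else branch becomes (a sketch reusing the same embed_model as the build path):

Plain Text
# The persisted index stores raw vectors, not the model that produced them,
# so the embed model must be supplied again at load time; otherwise the
# query is embedded with the default model and the dimensions differ.
service_context = ServiceContext.from_defaults(embed_model=embed_model)
storage_context = StorageContext.from_defaults(persist_dir=indexPath)
index = load_index_from_storage(storage_context, service_context=service_context)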
Thanks @Logan M