

Hi!
I have a problem. When indexing documents with Cyrillic text, it takes ages and creates an incredible number of documents. That would be fine (because of the nature of Unicode-encoded text), but there is probably some timeout error or similar. When ingesting just 130 not-too-big documents, it never finishes. Is there any solution to this problem? Thanks!
5 comments
Yeah, non-English languages really get shafted on the whole NLP thing. One word in Cyrillic is probably getting converted into a large number of tokens, whereas English has a compact token representation.
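A rough way to see why, using only the standard library: byte-level BPE tokenizers can't split below the UTF-8 byte level, and each Cyrillic letter already takes two bytes versus one for ASCII, so a Russian word starts from a much higher floor than an English one (the exact token counts depend on the tokenizer, so this is only an illustration):

```python
english = "hello"
cyrillic = "привет"  # "hello" in Russian

# UTF-8 encodes ASCII letters in 1 byte and Cyrillic letters in 2 bytes,
# which is the lower bound for a byte-level BPE tokenizer's input pieces.
print(len(english.encode("utf-8")))   # 5 bytes for 5 letters
print(len(cyrillic.encode("utf-8")))  # 12 bytes for 6 letters
```

On top of that, BPE vocabularies are trained mostly on English text, so common English words often collapse to a single token while Cyrillic words stay split into many small pieces.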

You can try increasing the batch size on embeddings maybe?

Python
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

# Larger batches mean fewer API round-trips when embedding many chunks
embed_model = OpenAIEmbedding(embed_batch_size=2000)
service_context = ServiceContext.from_defaults(embed_model=embed_model)
Okay, I will try doing it. Thank you!
Hi again, one more thing. I found this class has the following defaults:

Plain Text
        mode (str): Mode for embedding.
            Defaults to OpenAIEmbeddingMode.TEXT_SEARCH_MODE.
            Options are:

            - OpenAIEmbeddingMode.SIMILARITY_MODE
            - OpenAIEmbeddingMode.TEXT_SEARCH_MODE

Does that mean that if I don't specify the mode in the constructor, it will create embeddings in text-search mode instead of similarity mode? If so, how will it affect the results? Thank you.
The modes are deprecated, I think; they don't seem to do anything. Both modes resolve to the same ada endpoint. I believe the older embedding models had separate variants per mode.
Ah, okay, gotcha. Thanks!