

Hi!
I have a problem. When indexing documents with Cyrillic text, it takes ages and creates an incredible number of documents. That would be fine (because of the nature of Unicode-encoded text), but there is probably some timeout error or similar. When ingesting just 130 not-too-big documents, it never finishes. Is there any solution to this problem? Thanks!
5 comments
Yeah, non-English languages really get shafted on the whole NLP thing. One word in Cyrillic is probably getting converted into a large number of tokens, whereas English has a compact token representation.
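A rough way to see why, using only the standard library: byte-level BPE tokenizers can't split below the UTF-8 byte level, and each Cyrillic letter already takes two bytes versus one for ASCII, so a Russian word starts from a much higher floor than an English one (the exact token counts depend on the tokenizer, so this is only an illustration):

```python
english = "hello"
cyrillic = "привет"  # "hello" in Russian

# UTF-8 encodes ASCII letters in 1 byte and Cyrillic letters in 2 bytes,
# which is the lower bound for a byte-level BPE tokenizer's input pieces.
print(len(english.encode("utf-8")))   # 5 bytes for 5 letters
print(len(cyrillic.encode("utf-8")))  # 12 bytes for 6 letters
```

On top of that, BPE vocabularies are trained mostly on English text, so common English words often collapse to a single token while Cyrillic words stay split into many small pieces.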

You can try increasing the batch size on embeddings maybe?

Python
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

# Larger batches mean fewer API round-trips when embedding many chunks
embed_model = OpenAIEmbedding(embed_batch_size=2000)
service_context = ServiceContext.from_defaults(embed_model=embed_model)
Okay, I will try doing it. Thank you!
Hi again, one more thing. I found this class has the following defaults:

Plain Text
        mode (str): Mode for embedding.
            Defaults to OpenAIEmbeddingMode.TEXT_SEARCH_MODE.
            Options are:

            - OpenAIEmbeddingMode.SIMILARITY_MODE
            - OpenAIEmbeddingMode.TEXT_SEARCH_MODE

Does that mean that if I don't specify the mode in the constructor, it will create embeddings in text-search mode instead of similarity mode? If so, how will it affect the results? Thank you.
The modes are deprecated, I think; they don't seem to do anything. Both modes resolve to the same ada endpoint. I believe the older embedding models had separate variants per mode.
Ah, okay, gotcha. Thanks!