Find answers from the community

Updated 6 months ago

🆘 HELP!! Does anyone have experience

At a glance

The community member is seeking help with using llama-index for non-Latin documents, specifically Arabic, and is unsure whether llama-index's default tokenizer and embeddings are suitable for Arabic data. Another community member suggests that the default OpenAI embeddings should work for multilingual/non-English data, but the original poster still experiences issues when querying the indexed data, being unable to retrieve some expected context. The community members then discuss how to obtain the embedding values of a prompt and a document in llama-index.

🆘 HELP!! Does anyone have experience with non-Latin documents and data in llama-index, especially the Arabic alphabet? Are llama-index's default tokenizer and embeddings a good fit for Arabic documents? Any ideas or experience in this field would be appreciated!!!!
4 comments
The default OpenAI embeddings should be fine for multilingual/non-English data
Thanks Logan, I am asking because I have indexed a corpus of non-English data, but when querying the indexed data I don't get the desired results. It is unable to retrieve some context which I am certain exists in the indexed data. Maybe I have to tweak the retriever or some other settings of the prompt and the query.
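One common reason a retriever "misses" context that is definitely in the index is the top-k cutoff: only the k highest-scoring chunks are returned, and a relevant chunk ranked just below the cutoff is silently dropped. A minimal sketch of this effect, using made-up similarity scores (the chunk names and numbers are illustrative only, not output from a real embedding model):

```python
# Hypothetical similarity scores between a query and indexed chunks
# (illustrative values only, not from a real embedding model).
scores = {
    "chunk_a": 0.83,
    "chunk_b": 0.79,
    "chunk_c": 0.76,
    "chunk_relevant": 0.74,  # the context the poster expects to see
}

def top_k(scores, k):
    """Return the k chunk ids with the highest similarity scores."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [chunk_id for chunk_id, _ in ranked[:k]]

print(top_k(scores, 2))  # the relevant chunk is cut off at k=2
print(top_k(scores, 4))  # raising k brings it back
```

In llama-index terms, this corresponds to raising the retriever's `similarity_top_k` setting, which is often the first thing to try before changing the embedding model.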
@Logan M Any idea how I could get the embedding values of my prompt in llama-index?
Python
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()

# Documents and queries are embedded through separate methods,
# since some embedding models treat queries differently from text.
doc_embed = embed_model.get_text_embedding("my doc")
query_embeds = embed_model.get_query_embedding("My query")
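Both calls return a plain list of floats, so the two vectors can be compared directly to check how close a query actually lands to a document. A minimal sketch of cosine similarity in plain Python, using short toy vectors in place of the real embeddings (OpenAI's default embeddings are much higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for doc_embed / query_embeds.
doc_embed = [0.1, 0.2, 0.3]
query_embeds = [0.1, 0.25, 0.28]
print(cosine_similarity(doc_embed, query_embeds))
```

A low score between a query and a document you expect to match is a quick signal that the embedding model is not capturing the Arabic text well, independent of any retriever settings.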