The community member is seeking help with using llama-index for non-Latin documents, specifically Arabic. They are unsure whether the default tokenizer and embeddings in llama-index are suitable for Arabic data. Another community member suggests that the default OpenAI embeddings should work for multilingual/non-English data, but the original poster is still experiencing issues when querying the indexed data and is unable to retrieve some expected context. The community members also discuss how to get the embedding vector of a prompt in llama-index.
HELP!! Does anyone have experience with non-Latin documents and data in llama-index, especially the Arabic alphabet? Are llama-index's default tokenizer and embeddings suitable for Arabic documents? Any ideas or experience in this field!!!!
Thanks Logan. I am asking this because I have indexed a corpus of non-English data, but when querying the indexed data I don't get the desired results. It fails to pick up some context that I am certain exists in the indexed data. Maybe I have to tweak the retriever or some other settings of the prompt and the query.