how do I load a qa_dataset.json for evaluation?

@kapa.ai how do I load a qa_dataset.json for evaluation?
Please stand by @Ar1s as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
To load a qa_dataset.json file for evaluation in LlamaIndex, you can use the from_json method of the EmbeddingQAFinetuneDataset class. Here is an example of how you can do this:

Plain Text
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

qa_dataset = EmbeddingQAFinetuneDataset.from_json("qa_dataset.json")


This loads the dataset from the qa_dataset.json file into the qa_dataset variable.

Replace "qa_dataset.json" with the path to your actual JSON file if it's different.
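
Once loaded, the dataset exposes queries, corpus, and relevant_docs attributes. As a quick sanity check (a minimal sketch, assuming the load above succeeded):

Plain Text
# Inspect the loaded dataset; the attribute names come from EmbeddingQAFinetuneDataset
print(len(qa_dataset.queries))        # query id -> query text
print(len(qa_dataset.corpus))         # node id -> document text
print(len(qa_dataset.relevant_docs))  # query id -> list of relevant node ids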

You can find more details in the LlamaIndex documentation.
----------------------
Note: you can ask me a follow-up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai my code is as follows:

Plain Text
qa_dataset = EmbeddingQAFinetuneDataset.from_json("C:/Users/Ar1s/Desktop/eval_questions.json")  # load the questions from the JSON file


This code raised the following error:

Plain Text
Traceback (most recent call last):
  File "C:\Users\Ar1s\Desktop\finaldesign\eval_chunksize_baseline.py", line 64, in <module>
    qa_dataset = EmbeddingQAFinetuneDataset.from_json("C:/Users/Ar1s/Desktop/eval_questions.json")
  File "C:\Users\Ar1s\anaconda3\envs\finaldesign\lib\site-packages\llama_index\core\llama_dataset\legacy\embedding.py", line 46, in from_json
    return cls(**data)
  File "C:\Users\Ar1s\anaconda3\envs\finaldesign\lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 3 validation errors for EmbeddingQAFinetuneDataset
queries
  field required (type=value_error.missing)
corpus
  field required (type=value_error.missing)
relevant_docs
  field required (type=value_error.missing)
Let me check the knowledge base for you. One moment! :books:
----------------------
The error message you're seeing indicates that the EmbeddingQAFinetuneDataset.from_json method expects the JSON file to contain three top-level fields: queries, corpus, and relevant_docs. All three are missing from your eval_questions.json file.

The EmbeddingQAFinetuneDataset class in LlamaIndex expects a specific structure in the JSON file. It should contain:

  • queries: a mapping of query IDs to query strings.
  • corpus: a mapping of document (node) IDs to document text.
  • relevant_docs: a mapping of query IDs to lists of relevant document IDs.
Please ensure that your eval_questions.json file contains these fields. If you're unsure about the structure of your JSON file, open it and check its contents.

Without seeing the actual contents of your JSON file it's hard to give a more specific fix, but in general the file should look something like this:

Plain Text
{
  "queries": {"query_id_1": "query text 1", "query_id_2": "query text 2"},
  "corpus": {"node_id_1": "document text 1", "node_id_2": "document text 2"},
  "relevant_docs": {"query_id_1": ["node_id_1"], "query_id_2": ["node_id_2"]}
}


Please adjust your JSON file accordingly and try loading it again. If you continue to experience issues, you may want to refer to the LlamaIndex documentation for more information.
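
Alternatively, rather than writing this file by hand, you can generate one in the expected format. As a minimal sketch (assuming you already have a list of nodes and an llm configured in your script), generate_question_context_pairs returns an EmbeddingQAFinetuneDataset that can save itself to JSON:

Plain Text
from llama_index.core.evaluation import generate_question_context_pairs

# "nodes" and "llm" are assumed to already exist in your script
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
qa_dataset.save_json("qa_dataset.json")  # writes queries, corpus, relevant_docs


The saved file can then be loaded back with EmbeddingQAFinetuneDataset.from_json as shown above.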
----------------------
Note: you can ask me a follow-up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how do I generate questions from some files and save&load them?
Let me check the knowledge base for you. One moment! :books:
----------------------
To generate questions from some files using LlamaIndex, you can use the DatasetGenerator class. Here is a step-by-step guide on how to do this:

  1. Load the documents: You can use the SimpleDirectoryReader class to load the documents from the files.
Plain Text
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["your_file.pdf"]
).load_data()


  2. Generate the questions: You can use the DatasetGenerator class to generate the questions from the documents.
Plain Text
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI

gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)

dataset_generator = DatasetGenerator.from_documents(
    documents,
    question_gen_query="Your question generation prompt",
    llm=gpt_35_llm,
    num_questions_per_chunk=25,
)

questions = dataset_generator.generate_questions_from_nodes(num=40)


  3. Save the questions: You can save the generated questions to a file using the built-in open function in Python.
Plain Text
with open("questions.txt", "w") as f:
    for question in questions:
        f.write(question + "\n")


  4. Load the questions: You can load the questions from the file using the built-in open function in Python.
Plain Text
def load_questions(path):
    questions = []
    with open(path, "r") as f:
        for line in f:
            questions.append(line.strip())
    return questions

questions = load_questions("questions.txt")


Please replace "your_file.pdf", "Your question generation prompt", and "questions.txt" with your actual file names, question generation prompt, and desired output file name, respectively.
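
As a quick end-to-end check (a minimal sketch, assuming index is a VectorStoreIndex you have already built over the same documents), you can run the loaded questions through a query engine:

Plain Text
# "index" is assumed to be an existing VectorStoreIndex
query_engine = index.as_query_engine()
for question in questions[:3]:  # spot-check the first few questions
    response = query_engine.query(question)
    print(question, "->", str(response))
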
You can find more details in the LlamaIndex documentation.
----------------------
Note: you can ask me a follow-up question by @ mentioning me again :speech_balloon:
----------------------