Hi I have a problem with refreshing an


Hi, I have a problem with refreshing an index. I load the index from disk, then read the documents and call index.refresh_ref_docs. I pass in the service_context because I use QuestionsAnsweredExtractor for the document and node metadata. The problem is that questions are being generated for all the documents, not only the added ones. How can I update the index and generate questions only for new and updated documents?

4 comments

WhiteFang_Jr

Hi,
Just to be clear about your question: when you refresh some docs and then query the index, the new ones are not reflected in the index, right?

Jana

No, I'm not at the querying stage yet. I'm at the index creation part.

I load the index like this:
storage_context = StorageContext.from_defaults(persist_dir=f"./storage/cache/{articles.key}_vector")
article_hubs_index = load_index_from_storage(storage_context)

and here is where I want to refresh only the newly added documents:
documents = SimpleDirectoryReader(f"./assets/{articles.key}/", file_metadata=filename_fn).load_data()
refreshed_docs = article_hubs_index.refresh_ref_docs(documents, update_kwargs={"delete_kwargs": {'delete_from_docstore': True}}, service_context=service_context)

I added the metadata extractor to service_context:
metadata_extractor = MetadataExtractor(
    extractors=[
        QuestionsAnsweredExtractor(questions=3),
    ],
)

But the questions are generated for ALL the documents, not just the newly added ones.

The problem is in the indexing stage: I don't want to generate questions for nodes that already have questions in their metadata.

WhiteFang_Jr

I just checked the code for QuestionsAnsweredExtractor, and I think that, used directly, it creates questions for all the nodes it is given.
You'll have to call the QuestionsAnsweredExtractor class yourself if you want questions for specific nodes only.


# Get the nodes stored in the index (a dict mapping node ID to node)
nodes = index.docstore.docs
# Select the updated nodes to send to the QnA extractor, using their unique IDs
updated_nodes_list = ...

# Initialize the QnA extractor class
question_answer_extractor = QuestionsAnsweredExtractor(llm=llm)
extracted_qna = question_answer_extractor.extract(updated_nodes_list)

More can be found here: https://github.com/run-llama/llama_index/blob/main/llama_index/node_parser/extractors/metadata_extractors.py

Jana

I found the issue: I had only set the filename in the metadata but did not add the id 🙂
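
For anyone landing here later, a rough sketch of why the missing id matters. This is a simplified model written for illustration, not the llama_index implementation: it assumes a refresh looks up each document by its id and compares a content hash, so a document only counts as unchanged when the id is stable between runs. The names `content_hash` and `refresh` are hypothetical helpers, and the dict stands in for the docstore.

```python
import hashlib
import uuid


def content_hash(text: str) -> str:
    """Hash the document text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh(stored_hashes: dict, documents: list) -> list:
    """Return one flag per (doc_id, text) pair: True if new or changed.

    Hypothetical stand-in for a refresh step; the expensive work
    (e.g. question generation) would run only when the flag is True.
    """
    refreshed = []
    for doc_id, text in documents:
        new_hash = content_hash(text)
        changed = stored_hashes.get(doc_id) != new_hash
        if changed:
            stored_hashes[doc_id] = new_hash
        refreshed.append(changed)
    return refreshed


stored = {}

# First run: everything is new.
docs = [("a.md", "alpha"), ("b.md", "beta")]
first = refresh(stored, docs)

# Second run with stable ids: only the edited and the added file are flagged.
docs_v2 = [("a.md", "alpha"), ("b.md", "beta v2"), ("c.md", "gamma")]
second = refresh(stored, docs_v2)

# Same content but freshly generated ids: nothing matches, so all are flagged.
random_id_docs = [(str(uuid.uuid4()), text) for _, text in docs_v2]
third = refresh(stored, random_id_docs)
```

With stable ids the second run skips the untouched file, while random ids make every document look new, which matches the symptom described above. In llama_index, I believe SimpleDirectoryReader accepts `filename_as_id=True` for this purpose, but check the documentation for your version.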
