Metadata

I'm trying to get summaries + generated questions for an HTML file read using Unstructured and parsed with the node parser. It has been extracting summaries for 30 minutes now for just 1 HTML file, is this normal? I am using a T4 with Llama 2 in Colab.
Attachment: image.png
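For context, the extraction step in question looks roughly like this, a minimal sketch assuming the 0.9-era llama_index extractor API from the docs notebook I'm following (linked below); names and imports may differ in newer releases, and base_nodes is the list of nodes from the node parser:
Plain Text
# Sketch of the extraction setup (assumes llama_index 0.9-era imports;
# extractor names/locations may differ in newer versions)
from llama_index.extractors import SummaryExtractor, QuestionsAnsweredExtractor

extractors = [
    SummaryExtractor(summaries=["self"], show_progress=True),
    QuestionsAnsweredExtractor(questions=5, show_progress=True),
]

# Each extractor makes one LLM call per node, sequentially, so 17 nodes
# x 2 extractors = 34 Llama 2 generations on the T4 -- slow is expected.
metadata_dicts = []
for extractor in extractors:
    metadata_dicts.extend(extractor.extract(base_nodes))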
Maybe a little slow (I'm guessing your GPU memory is maxed out on a T4 though, and likely offloading to CPU even?)

I'm not sure which extractor you are using, but it's being applied sequentially to each node (17 in this case)
I'm using both the summary and questions extractors. Memory still has 5 GB free (out of 15), so it's not offloading yet. Llama produces about 1 long line per minute, so I guess it took roughly the expected time here. But the hallucination here... (it gives questions the passage is unlikely to answer).

Another question on extractors: if the LLM hallucinates, can it cause a KeyError? (This depends on the implementation of the function, I suppose.)
Attachment: image.png
I've noticed that when I print metadata_dicts, I have 17 section summaries, followed by 17 questions. Does this mean the code on the docs page is not up to date?
Plain Text
# all nodes consists of source nodes, along with metadata
import copy

all_nodes = copy.deepcopy(base_nodes)
for idx, d in enumerate(metadata_dicts):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"],
        index_id=base_nodes[idx].node_id,
    )
    inode_s = IndexNode(
        text=d["section_summary"], index_id=base_nodes[idx].node_id
    )
    all_nodes.extend([inode_q, inode_s])

https://docs.llamaindex.ai/en/stable/examples/retrievers/recursive_retriever_nodes.html#metadata-references-summaries-generated-questions-referring-to-a-bigger-chunk
Plain Text
[{'section_summary': "Based on ..."},
 {'section_summary': 'The section ...'},
 {'questions_this_excerpt_can_answer': "Certainly! Based ..."}]
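Which I think also answers my KeyError question above: since each dict in that flat list holds only one key, the loop from the docs hits a missing key on the summary entries regardless of hallucination. A minimal illustration (texts are placeholders):
Plain Text
# metadata_dicts is a flat list: 17 summary dicts followed by 17 question dicts
metadata_dicts = [
    {"section_summary": "Based on ..."},                     # first half
    {"questions_this_excerpt_can_answer": "Certainly! ..."}, # second half
]

d = metadata_dicts[0]
d["questions_this_excerpt_can_answer"]  # KeyError: key only exists in the second half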
Iterated through the dict twice and was able to run the retriever, if anyone finds this thread in the future! (The code on the docs didn't work.)
Wow, that is a pretty bad page on the docs

To me, it seems like the docs should probably just have something like this instead

Plain Text
import copy
import json


# cache metadata dicts
def save_metadata_dicts(path, data):
    with open(path, "w") as fp:
        json.dump(data, fp)


def load_metadata_dicts(path):
    with open(path, "r") as fp:
        data = json.load(fp)
    return data

node_to_metadata = {}
for extractor in extractors:
    metadata_dicts = extractor.extract(base_nodes)
    for node, metadata in zip(base_nodes, metadata_dicts):
        if node.node_id not in node_to_metadata:
            node_to_metadata[node.node_id] = metadata
        else:
            node_to_metadata[node.node_id].update(metadata)

save_metadata_dicts("data/llama2_metadata_dicts.json", node_to_metadata)
node_to_metadata = load_metadata_dicts("data/llama2_metadata_dicts.json")

all_nodes = copy.deepcopy(base_nodes)
for node_id, metadata in node_to_metadata.items():
    for val in metadata.values():
        inode = IndexNode(text=val, index_id=node_id)
        all_nodes.append(inode)
makes a little more sense, at least in my brain
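For illustration, after both extractors run, node_to_metadata would end up shaped roughly like this (node IDs and texts are placeholders), so each node carries both keys and nothing depends on list order:
Plain Text
# Hypothetical shape of node_to_metadata (IDs and texts are placeholders)
node_to_metadata = {
    "node-0": {
        "section_summary": "Based on ...",
        "questions_this_excerpt_can_answer": "Certainly! Based ...",
    },
    # ... one entry per base node
}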
I'm going to update that doc lol
I see. I modified my loop like this:
Plain Text
all_nodes = copy.deepcopy(base_nodes)

# Process summaries
for idx, d in enumerate(metadata_dicts[:len(metadata_dicts)//2]):
    inode_s = IndexNode(
        text=d["section_summary"], index_id=base_nodes[idx].node_id
    )
    all_nodes.append(inode_s)

# Process questions
for idx, d in enumerate(metadata_dicts[len(metadata_dicts)//2:]):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"],
        index_id=base_nodes[idx].node_id,
    )
    all_nodes.append(inode_q)
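For anyone following along, the rest of that docs page then builds the recursive retriever over all_nodes roughly like this, a sketch assuming the 0.9-era llama_index imports (newer versions move these around; the query string is a placeholder):
Plain Text
# Sketch of the retrieval side, following the linked docs page
# (assumes 0.9-era llama_index; uses whatever LLM/embedding model is configured)
from llama_index import VectorStoreIndex
from llama_index.retrievers import RecursiveRetriever

all_nodes_dict = {n.node_id: n for n in all_nodes}

# embeds both the base chunks and the summary/question index nodes
vector_index = VectorStoreIndex(all_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=2)

retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict=all_nodes_dict,
    verbose=True,
)

retrieved_nodes = retriever.retrieve("your query here")  # placeholder query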