Metadata

I'm trying to get summaries + generated questions for an HTML file read using Unstructured and parsed with the node parser. It has been extracting summaries for 30 minutes now for just 1 HTML file, is this normal? I am using a T4 with Llama 2 in Colab.
Attachment: image.png
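For context, the extraction step in question looks roughly like this, a minimal sketch assuming the 0.9-era llama_index extractor API from the docs notebook I'm following (linked below); names and imports may differ in newer releases, and base_nodes is the list of nodes from the node parser:
Plain Text
# Sketch of the extraction setup (assumes llama_index 0.9-era imports;
# extractor names/locations may differ in newer versions)
from llama_index.extractors import SummaryExtractor, QuestionsAnsweredExtractor

extractors = [
    SummaryExtractor(summaries=["self"], show_progress=True),
    QuestionsAnsweredExtractor(questions=5, show_progress=True),
]

# Each extractor makes one LLM call per node, sequentially, so 17 nodes
# x 2 extractors = 34 Llama 2 generations on the T4 -- slow is expected.
metadata_dicts = []
for extractor in extractors:
    metadata_dicts.extend(extractor.extract(base_nodes))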
Maybe a little slow (I'm guessing your GPU memory is maxed out on a T4 though, and likely offloading to CPU even?)

I'm not sure which extractor you are using, but it's being applied sequentially to each node (17 in this case)
I'm using both the summary and questions extractors. Memory still has 5 GB free (out of 15), so it's not offloading yet. Llama produces about 1 long line per minute, so I guess it took roughly the expected time here. But the hallucination here... (it gives questions the passage is unlikely to answer).

Another question on extractors: if the LLM hallucinates, can it cause a KeyError? (This depends on the implementation of the function, I suppose.)
Attachment: image.png
I've noticed that when I print metadata_dicts, I have 17 section summaries, followed by 17 questions. Does this mean the code on the docs page is not up to date?
Plain Text
# all nodes consists of source nodes, along with metadata
import copy

all_nodes = copy.deepcopy(base_nodes)
for idx, d in enumerate(metadata_dicts):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"],
        index_id=base_nodes[idx].node_id,
    )
    inode_s = IndexNode(
        text=d["section_summary"], index_id=base_nodes[idx].node_id
    )
    all_nodes.extend([inode_q, inode_s])

https://docs.llamaindex.ai/en/stable/examples/retrievers/recursive_retriever_nodes.html#metadata-references-summaries-generated-questions-referring-to-a-bigger-chunk
Plain Text
[{'section_summary': "Based on ..."},
 {'section_summary': 'The section ...'},
 {'questions_this_excerpt_can_answer': "Certainly! Based ..."}]
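Which I think also answers my KeyError question above: since each dict in that flat list holds only one key, the loop from the docs hits a missing key on the summary entries regardless of hallucination. A minimal illustration (texts are placeholders):
Plain Text
# metadata_dicts is a flat list: 17 summary dicts followed by 17 question dicts
metadata_dicts = [
    {"section_summary": "Based on ..."},                     # first half
    {"questions_this_excerpt_can_answer": "Certainly! ..."}, # second half
]

d = metadata_dicts[0]
d["questions_this_excerpt_can_answer"]  # KeyError: key only exists in the second half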
Iterated through the dict twice and was able to run the retriever, if anyone finds this thread in the future! (The code on the docs didn't work.)
Wow, that is a pretty bad page on the docs

To me, it seems like the docs should probably just have something like this instead

Plain Text
import copy
import json


# cache metadata dicts
def save_metadata_dicts(path, data):
    with open(path, "w") as fp:
        json.dump(data, fp)


def load_metadata_dicts(path):
    with open(path, "r") as fp:
        data = json.load(fp)
    return data

node_to_metadata = {}
for extractor in extractors:
    metadata_dicts = extractor.extract(base_nodes)
    for node, metadata in zip(base_nodes, metadata_dicts):
        if node.node_id not in node_to_metadata:
            node_to_metadata[node.node_id] = metadata
        else:
            node_to_metadata[node.node_id].update(metadata)

save_metadata_dicts("data/llama2_metadata_dicts.json", node_to_metadata)
node_to_metadata = load_metadata_dicts("data/llama2_metadata_dicts.json")

all_nodes = copy.deepcopy(base_nodes)
for node_id, metadata in node_to_metadata.items():
    for val in metadata.values():
        inode = IndexNode(text=val, index_id=node_id)
        all_nodes.append(inode)
makes a little more sense, at least in my brain
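For illustration, after both extractors run, node_to_metadata would end up shaped roughly like this (node IDs and texts are placeholders), so each node carries both keys and nothing depends on list order:
Plain Text
# Hypothetical shape of node_to_metadata (IDs and texts are placeholders)
node_to_metadata = {
    "node-0": {
        "section_summary": "Based on ...",
        "questions_this_excerpt_can_answer": "Certainly! Based ...",
    },
    # ... one entry per base node
}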
I'm going to update that doc lol
I see. I modified my loop like this:
Plain Text
all_nodes = copy.deepcopy(base_nodes)

# Process summaries
for idx, d in enumerate(metadata_dicts[:len(metadata_dicts)//2]):
    inode_s = IndexNode(
        text=d["section_summary"], index_id=base_nodes[idx].node_id
    )
    all_nodes.append(inode_s)

# Process questions
for idx, d in enumerate(metadata_dicts[len(metadata_dicts)//2:]):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"],
        index_id=base_nodes[idx].node_id,
    )
    all_nodes.append(inode_q)
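For anyone following along, the rest of that docs page then builds the recursive retriever over all_nodes roughly like this, a sketch assuming the 0.9-era llama_index imports (newer versions move these around; the query string is a placeholder):
Plain Text
# Sketch of the retrieval side, following the linked docs page
# (assumes 0.9-era llama_index; uses whatever LLM/embedding model is configured)
from llama_index import VectorStoreIndex
from llama_index.retrievers import RecursiveRetriever

all_nodes_dict = {n.node_id: n for n in all_nodes}

# embeds both the base chunks and the summary/question index nodes
vector_index = VectorStoreIndex(all_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=2)

retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict=all_nodes_dict,
    verbose=True,
)

retrieved_nodes = retriever.retrieve("your query here")  # placeholder query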