Hi When to use which approach

Hi! When to use which approach?

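```
# Option 1: build the index directly from documents
GPTSimpleVectorIndex.from_documents(documents)

# Option 2: parse the documents into nodes first, then build the index
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)
GPTSimpleVectorIndex(nodes)
```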

In short, what is the fundamental advantage/disadvantage of using nodes over documents?

Also, I'm a bit lost in the terminology. So if I have 3 PDF files, it means I have 3 Documents, right?
And each Document has N nodes, where N is determined by the `get_nodes_from_documents` method?


Logan M

Your terminology is correct!

Basically, the two methods are equivalent. But if you have your own method of splitting documents into nodes, you can do something similar to option 2 and create your own Node objects.
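To make "your own method of splitting" concrete, here is a minimal sketch in plain Python. It does not use the library: `SketchNode` and `split_by_paragraph` are hypothetical names invented for illustration, and the paragraph-based rule is just one possible splitting strategy. In real code you would build the library's own Node objects from the chunks instead.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchNode:
    """Hypothetical stand-in for a library Node: one chunk of text plus metadata."""
    text: str
    doc_id: str
    extra_info: dict = field(default_factory=dict)


def split_by_paragraph(text: str, doc_id: str) -> List[SketchNode]:
    """Custom splitting rule (an assumption): one node per non-empty paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        SketchNode(text=p, doc_id=doc_id, extra_info={"paragraph": i})
        for i, p in enumerate(paragraphs)
    ]


# Two paragraphs in, two nodes out; the trailing blank paragraph is dropped.
nodes = split_by_paragraph("Revenue grew.\n\nCosts fell.\n\n", doc_id="doc_0")
```

The point is only that you decide where the chunk boundaries fall, rather than leaving that to the default parser.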

pikachu8887867

Thanks Logan! Right now I'm doing it like this:

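```
import io

import pdfplumber
# Assumes the legacy llama_index release that exported these names.
from llama_index import Document, GPTSimpleVectorIndex

# `pdf` (raw bytes), `file_info`, `clean_text`, and `service_context` are defined elsewhere.
documents = []
with pdfplumber.open(io.BytesIO(pdf)) as pdf_file:
    pages = pdf_file.pages
    for i, page in enumerate(pages):
        text = page.extract_text()
        text = clean_text(text)
        # One Document per page, tagged with file-level metadata.
        documents.append(
            Document(
                text=text,
                doc_id=f"doc_{i}",
                extra_info={
                    "document": file_info["company_name"],
                    "company": file_info["company_name"],
                    "type": file_info["document_type"],
                    "year": file_info["document_year"],
                    "page": i,
                },
            )
        )

index = GPTSimpleVectorIndex.from_documents(documents=documents, service_context=service_context)
```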

So I'm splitting a PDF into a list of Documents. As I understand it, this is the same as splitting the PDF into a list of Nodes and creating an index like `GPTSimpleVectorIndex(nodes)`, right?

pikachu8887867

Or is there a difference?

Logan M

It's almost the same! When you put documents into an index, they can be split further before being embedded, according to `chunk_size_limit` (default is 4000 tokens).

Whereas if you create the nodes yourself, they won't be broken up when creating embeddings

I think the way you are doing it now will work just fine, unless you have a specific way to split text on each page
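Here is a rough sketch of that difference in plain Python. It is not the library's splitter: the function names are hypothetical, and whitespace-separated words stand in for tokens so the example stays self-contained.

```python
from typing import List


def split_to_limit(text: str, chunk_size_limit: int) -> List[str]:
    """Split text into chunks of at most chunk_size_limit words (words approximate tokens)."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size_limit])
        for i in range(0, len(words), chunk_size_limit)
    ]


def chunks_from_documents(documents: List[str], chunk_size_limit: int) -> List[str]:
    """Document path: each document may be split further before embedding."""
    chunks: List[str] = []
    for doc in documents:
        chunks.extend(split_to_limit(doc, chunk_size_limit))
    return chunks


def chunks_from_nodes(nodes: List[str]) -> List[str]:
    """Node path: nodes are embedded as given, with no further splitting."""
    return list(nodes)


page = "one two three four five six seven"
doc_chunks = chunks_from_documents([page], chunk_size_limit=3)
node_chunks = chunks_from_nodes([page])
```

With a limit of 3, the page passed as a document ends up as three chunks, while the same page passed as a node stays as one piece. If every page is already under the limit, both paths produce the same chunks.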

pikachu8887867

I see. Thanks, Logan!
