Updated 2 months ago

Hi When to use which approach

Hi! When to use which approach?

  1. `GPTSimpleVectorIndex.from_documents(documents)`
  2. ```
     parser = SimpleNodeParser()
     nodes = parser.get_nodes_from_documents(documents)
     GPTSimpleVectorIndex(nodes)
     ```

In short, what is the fundamental advantage/disadvantage of using nodes over documents? Also, I'm a bit lost in the terminology. If I have 3 PDF files, that means I have 3 Documents, right? And each Document has N nodes, where N is determined by the `get_nodes_from_documents` method?
Your terminology is correct!

Basically, the two methods are equivalent. BUT if you have your own method of splitting documents into nodes, you can do something similar to option 2 and create your own Node objects.
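
To make "your own method of splitting" concrete, here is a minimal sketch in plain Python. It does not use the library: `SketchNode` and `split_by_paragraph` are hypothetical names invented for illustration, and the paragraph-based rule is just one possible splitting strategy. With the real library you would wrap each chunk in its Node class instead and pass the list to the index as in option 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchNode:
    """Hypothetical stand-in for the library's Node class (not the real API)."""
    text: str
    ref_doc_id: str
    extra_info: Dict[str, object] = field(default_factory=dict)


def split_by_paragraph(text: str, doc_id: str) -> List[SketchNode]:
    """Custom rule: one node per non-empty, blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        SketchNode(text=p, ref_doc_id=doc_id, extra_info={"paragraph": i})
        for i, p in enumerate(paragraphs)
    ]


# Example: one page of text becomes two nodes that point back to their document.
page_text = "Revenue grew in 2022.\n\nCosts were flat.\n\n"
nodes = split_by_paragraph(page_text, doc_id="doc_0")
```

The point of going this route is control: you decide where the boundaries fall and which metadata each chunk carries.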
Thanks, Logan! Right now I'm doing it like this:

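    import io

    import pdfplumber
    from llama_index import Document, GPTSimpleVectorIndex

    # `pdf` (raw PDF bytes), `file_info`, `clean_text`, and `service_context`
    # are defined elsewhere.
    documents = []
    with pdfplumber.open(io.BytesIO(pdf)) as pdf_file:
        pages = pdf_file.pages
        # Build one Document per page, tagged with file-level metadata.
        for i, page in enumerate(pages):
            text = page.extract_text()
            text = clean_text(text)
            documents.append(
                Document(
                    text=text,
                    doc_id=f"doc_{i}",
                    extra_info={
                        "document": file_info["company_name"],
                        "company": file_info["company_name"],
                        "type": file_info["document_type"],
                        "year": file_info["document_year"],
                        "page": i,
                    },
                )
            )

    # Index the per-page Documents.
    index = GPTSimpleVectorIndex.from_documents(documents=documents, service_context=service_context)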


So I'm splitting a PDF into a list of Documents. As I understand it, this is the same as splitting the PDF into a list of Nodes and creating an index like GPTSimpleVectorIndex(nodes), right?
Or is there a difference?
It's almost the same! When you put documents into an index, they can be split further before being embedded, according to chunk_size_limit (default is 4000 tokens)

Whereas if you create the nodes yourself, they won't be broken up when creating embeddings

I think the way you are doing it now will work just fine, unless you have a specific way to split text on each page
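
The difference between the two paths can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, and whitespace-separated words stand in for tokenizer tokens.

```python
from typing import List


def chunk_by_limit(text: str, chunk_size_limit: int) -> List[str]:
    """Split text into chunks of at most chunk_size_limit words.

    Assumption: words stand in for tokens; a real splitter counts tokenizer tokens.
    """
    if chunk_size_limit <= 0:
        raise ValueError("chunk_size_limit must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size_limit])
        for start in range(0, len(words), chunk_size_limit)
    ]


def chunks_from_documents(documents: List[str], chunk_size_limit: int) -> List[str]:
    """Documents path: each document may be split further before embedding."""
    chunks: List[str] = []
    for doc in documents:
        chunks.extend(chunk_by_limit(doc, chunk_size_limit))
    return chunks


def chunks_from_nodes(nodes: List[str]) -> List[str]:
    """Nodes path: each node is embedded as-is, with no further splitting."""
    return list(nodes)


# A 10-word "page" with a limit of 4 words per chunk.
long_page = " ".join(f"w{i}" for i in range(10))
doc_chunks = chunks_from_documents([long_page], chunk_size_limit=4)
node_chunks = chunks_from_nodes([long_page])
```

Here the same page yields several embedded chunks when passed as a document but stays a single unit when passed as a node, so per-page Documents only behave differently from per-page Nodes when a page exceeds the limit.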
I see. Thanks, Logan!