how to convert Document to base nodes?

At a glance

The community members discuss how to convert a Document into base nodes and how to parse a Document into nodes using LlamaIndex. The main points are:

To convert a Document into base nodes, use the get_nodes_from_documents() method of a NodeParser, such as SimpleNodeParser or SentenceSplitter. The chunk_size parameter determines the size of the chunks the document is split into.

To parse a Document into nodes, use the get_nodes_from_documents() method of the SentenceSplitter class from the llama_index.core.node_parser module, replacing documents with your actual documents.

The community members also discuss how to define metadata for an IndexNode in LlamaIndex by passing a metadata dictionary when creating the node.

@kapa.ai how to convert Document to base nodes?
Please stand by @ispan3tz as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
To convert a Document into base nodes in LlamaIndex, you can use the get_nodes_from_documents() method of a NodeParser. Here's a general example:

Plain Text
from llama_index.core.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)  # example chunk size
base_nodes = node_parser.get_nodes_from_documents(docs)


In this example, SimpleNodeParser is a type of NodeParser and docs is a list of Document objects. The chunk_size parameter determines the size of the chunks that the document will be split into.

You can also use other types of NodeParsers, such as SentenceSplitter:

Plain Text
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter(chunk_size=1024)
base_nodes = node_parser.get_nodes_from_documents(docs)


In this case, the text is split at sentence boundaries, so chunks end on complete sentences rather than being cut mid-sentence at a fixed size.
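To make the idea concrete, here is a minimal, illustrative sketch of sentence-aware chunking in plain Python. This is NOT LlamaIndex's actual implementation (the real SentenceSplitter measures size in tokens and supports chunk overlap); it only shows the general strategy of packing whole sentences into chunks up to a size limit:

```python
import re


def sentence_chunks(text: str, chunk_size: int = 1024) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sent) > chunk_size:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Each chunk ends on a sentence boundary, which is the property that makes sentence-based splitting preferable to cutting text at an arbitrary character offset.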

Remember to replace docs with your actual documents. If your documents are stored in a directory, you can load them using SimpleDirectoryReader:

Plain Text
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham").load_data()


Then you can pass documents to get_nodes_from_documents().

Sources:
  • Param_optimizer
  • Recurisve_retriever_nodes_braintrust
  • BM25 Retriever
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how to parse Document to NodeType?
Let me check the knowledge base for you. One moment! :books:
----------------------
To parse a Document into nodes (BaseNode objects) in LlamaIndex, you can use the get_nodes_from_documents() method of the SentenceSplitter class from the llama_index.core.node_parser module. Here's an example:

Plain Text
from llama_index.core.node_parser import SentenceSplitter

nodes = SentenceSplitter().get_nodes_from_documents(documents)


In this example, documents is your input data. The get_nodes_from_documents() method will parse these documents into nodes.

Please note that you need to replace documents with your actual documents. If your documents are stored in a directory, you can load them using SimpleDirectoryReader:

Plain Text
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham").load_data()


Then you can pass documents to get_nodes_from_documents().
Sources:
  • Docstore Demo
  • Firestore Demo
  • Redis Docstore+Index Store Demo
  • Dynamo DB Docstore Demo
  • Firestore Demo on GitHub
  • MongoDB Demo
  • Azure Table Storage as a Docstore Demo
  • UpTrain Callback Handler on GitHub
  • Dynamo DB Docstore Demo on GitHub
----------------------
@kapa.ai how to define metadata for IndexNode?
Let me check the knowledge base for you. One moment! :books:
----------------------
To define metadata for an IndexNode in LlamaIndex, you can include a metadata dictionary when creating the IndexNode. Here's an example:

Plain Text
from llama_index.core.schema import IndexNode

index_node = IndexNode(
    text="The text of the node",
    index_id="node_id",
    metadata={
        "key1": "value1",
        "key2": "value2",
        # Add more key-value pairs as needed
    }
)


In this example, text is the text of the node, index_id is the unique identifier of the node, and metadata is a dictionary containing the metadata. You can include any key-value pairs you need in the metadata dictionary.
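When a node is embedded or sent to an LLM, its metadata is combined with its text (LlamaIndex exposes this via node.get_content(metadata_mode=...)). The following is a rough, illustrative sketch of that combination, assuming a simple default "key: value" template; the library's actual template and modes are configurable:

```python
def content_with_metadata(text: str, metadata: dict) -> str:
    """Prepend metadata as 'key: value' lines to the node text (illustrative only)."""
    header = "\n".join(f"{k}: {v}" for k, v in metadata.items())
    # With no metadata, the content is just the raw text.
    return f"{header}\n\n{text}" if header else text
```

This is why metadata choice matters: the keys and values you attach can directly influence what the embedding model and LLM see.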

Here's another example from the LlamaIndex documentation, where metadata is defined for each node in a list of nodes:

Plain Text
from llama_index.core.schema import TextNode
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.duckdb import DuckDBVectorStore

nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
            "year": 1994,
            "ref_doc_id": "doc_1",
        },
    ),
    # Add more nodes as needed
]

vector_store = DuckDBVectorStore()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)


In this example, each TextNode in the nodes list has a metadata dictionary.
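One reason to attach metadata like this is retrieval-time filtering (e.g. LlamaIndex's MetadataFilters). As a minimal, illustrative sketch of the exact-match case, independent of any particular vector store:

```python
def filter_by_metadata(nodes_meta: list[dict], filters: dict) -> list[dict]:
    """Keep only metadata dicts where every filter key matches exactly (illustrative only)."""
    return [
        meta
        for meta in nodes_meta
        if all(meta.get(key) == value for key, value in filters.items())
    ]
```

Real vector stores apply such filters before or alongside the similarity search, so only nodes whose metadata matches are ever scored.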
Sources:
  • Multi_doc_auto_retrieval
  • Metadata filter example
----------------------