How do I see what chunks of text were created and the associated vectors?
To see the nodes, you can explicitly parse Documents into Node objects before building an index: https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#parse-the-documents-into-nodes. This gives you control over each node.

The embedding storage is a bit different, since it depends on whether you're using a vector store, but I'd take a look at the `get` method in the VectorStore (see `gpt_index/vector_stores/types.py`).
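To make the idea of a `get`-style lookup concrete, here is a minimal, purely illustrative in-memory store. `TinyVectorStore` and its methods are hypothetical names, not the actual gpt_index API, whose `VectorStore` interface varies by backend:

```python
# Toy sketch of what a vector store holds: node IDs mapped to
# (text, embedding) pairs, with a get-style lookup.
# Hypothetical class; NOT the real gpt_index VectorStore interface.
class TinyVectorStore:
    def __init__(self):
        self._data = {}  # node_id -> (text, embedding)

    def add(self, node_id, text, embedding):
        self._data[node_id] = (text, embedding)

    def get(self, node_id):
        # Return the stored text and embedding for one node
        return self._data[node_id]

store = TinyVectorStore()
store.add("node-0", "Hello world", [0.1, 0.2, 0.3])
text, emb = store.get("node-0")
print(text, emb)  # Hello world [0.1, 0.2, 0.3]
```

The real implementations differ mainly in where `_data` lives (in memory, on disk, or in an external vector database), but the lookup concept is the same.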
@jerryjliu0 thanks, but how do I do this with the MarkdownReader from LlamaHub?
@Greg Tanaka As Jerry mentioned, seeing the embeddings is a bit trickier. But for the text in each node, you can do something like this:
```python
from pathlib import Path

from llama_index.node_parser import SimpleNodeParser

# `loader` is the MarkdownReader instance obtained from LlamaHub
documents = loader.load_data(file=Path('./README.md'))

parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)
```
Thanks. @Logan M how do we control the chunk size?
When using the node parser?

```python
from llama_index import ServiceContext, GPTListIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter

# Split text into 512-token chunks when parsing documents into nodes
splitter = TokenTextSplitter(chunk_size=512)
parser = SimpleNodeParser(text_splitter=splitter)
nodes = parser.get_nodes_from_documents(documents)

# Keep the service context's chunk size consistent with the splitter
index = GPTListIndex(nodes, service_context=ServiceContext.from_defaults(chunk_size_limit=512))
```


Without the node parser, just set `chunk_size_limit` in the service context alone 👍
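For intuition about what `chunk_size=512` does, here is a rough, hypothetical sketch of token-window chunking. Whitespace-separated words stand in for real tokenizer tokens (an actual splitter like `TokenTextSplitter` counts model tokens, not words):

```python
def chunk_by_tokens(text, chunk_size=512, overlap=0):
    # Naive whitespace "tokenization"; real splitters use a model tokenizer.
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        # Each chunk holds up to chunk_size consecutive tokens
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks

text = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_by_tokens(text, chunk_size=512)
print(len(chunks))  # 3 chunks: 512 + 512 + 176 tokens
```

Smaller chunks give more precise retrieval but less context per chunk; the 512 here matches the `chunk_size_limit` used above.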

You can also use any text splitter you want (from LangChain, or llama_index also has a sentence-based splitter).
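The difference with a sentence-based splitter is that chunks break on sentence boundaries rather than at an arbitrary token offset. A toy, hypothetical sketch of the idea (real splitters handle abbreviations, quotes, etc. far more carefully):

```python
import re

def chunk_by_sentences(text, max_words=50):
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # Start a new chunk if adding this sentence would exceed the limit
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First sentence here. " * 10
print(len(chunk_by_sentences(doc, max_words=12)))  # 3 chunks
```

Keeping sentences intact tends to produce more coherent chunks for embedding, at the cost of slightly uneven chunk sizes.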