@kapa.ai What is the best method to create, store, and subsequently filter metadata when generating and storing embeddings?
Please stand by @sriani as I search the knowledge base. We are experiencing especially high traffic right now and might take longer than usual. :wave:
----------------------
To create, store, and filter metadata when generating and storing embeddings with LlamaIndex, you can follow these steps:

  1. Create Nodes with Metadata: When creating your nodes, you can include metadata in the extra_info parameter of the Node object. This metadata can be any key-value pairs that you want to associate with the node.
```python
from llama_index.data_structs.node import Node

nodes = [
    Node('The Shawshank Redemption', extra_info={
        "author": "Stephen King",
        "theme": "Friendship",
    }),
    Node('The Godfather', extra_info={
        "director": "Francis Ford Coppola",
        "theme": "Mafia",
    }),
    Node("Inception", extra_info={
        "director": "Christopher Nolan",
    }),
]
```


  2. Store Nodes in a Vector Store: You can store these nodes in a vector store. The type of vector store you use can vary. For example, you can use a SupabaseVectorStore, PineconeVectorStore, DocArrayHnswVectorStore, or DocArrayInMemoryVectorStore.
```python
from llama_index import VectorStoreIndex
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import PineconeVectorStore

vector_store = PineconeVectorStore(index_name="your_index_name")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
```
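If you don't have a Pinecone index set up, the same nodes can also go into LlamaIndex's default in-memory vector store (a minimal sketch added for illustration; it is a stand-in for the external stores listed above, not part of the original answer):

```python
from llama_index import VectorStoreIndex

# Without an explicit storage_context, LlamaIndex falls back to its
# built-in in-memory vector store, which is enough for local experimentation.
index = VectorStoreIndex(nodes)
```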


  3. Filter Nodes Using Metadata: When querying the index, you can filter nodes based on their metadata using MetadataFilters and ExactMatchFilter. You can include one or more of your metadata keys in the filters.
```python
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter

filters = MetadataFilters(filters=[
    ExactMatchFilter(key='theme', value='Mafia'),
])

retriever = index.as_retriever(filters=filters)
retriever.retrieve('What is inception about?')
```
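To filter on more than one metadata key at once, you can pass several filters; in the LlamaIndex versions used in this thread, all filters must match. A sketch continuing from the snippet above, reusing the metadata keys from step 1 (the combination is hypothetical; `index` is the index built in step 2):

```python
# Both conditions must hold for a node to be retrieved.
filters = MetadataFilters(filters=[
    ExactMatchFilter(key='theme', value='Mafia'),
    ExactMatchFilter(key='director', value='Francis Ford Coppola'),
])
retriever = index.as_retriever(filters=filters)
```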
Source (<https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/SupabaseVectorIndexDemo.html>)

Remember to replace "your_index_name" with the name of your index, and "theme" and "Mafia" with your desired key and value for filtering.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai Where do I specify the data while creating the nodes?
Let me check the knowledge base for you. One moment! :books:
----------------------
When creating nodes in LlamaIndex, you specify the data as the first argument when creating a Node object. This data typically represents a chunk of text from a document. Here's an example:

```python
from llama_index.data_structs.node import Node, DocumentRelationship

node1 = Node(text="<text_chunk>", doc_id="<node_id>")
node2 = Node(text="<text_chunk>", doc_id="<node_id>")

# DocumentRelationship lets you record how nodes relate to one another:
node1.relationships[DocumentRelationship.NEXT] = node2.get_doc_id()
node2.relationships[DocumentRelationship.PREVIOUS] = node1.get_doc_id()
```

In this example, <text_chunk> is where you specify your data. Replace <text_chunk> with the actual text data you want to store in the node. Similarly, replace <node_id> with a unique identifier for each node.

You can also attach metadata to a document during its creation. Any metadata added to a document will be copied to the nodes that get created from that source document. Here's an example:

```python
from llama_index import Document

document = Document(
    'text',
    extra_info={
        'filename': '<doc_file_name>',
        'category': '<category>'
    }
)
```

In this example, replace 'text' with your actual text data, <doc_file_name> with the filename of the document, and <category> with the category of the document.

----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai Getting this error ModuleNotFoundError: No module named 'llama_index.data_structs.node'
Let me check the knowledge base for you. One moment! :books:
----------------------
The error ModuleNotFoundError: No module named 'llama_index.data_structs.node' typically occurs when the llama_index package is not installed in your Python environment, or your Python environment is not set up correctly.

To resolve this issue, you can try the following steps:

  1. Install LlamaIndex: You can install the llama_index package using pip. Open your terminal and run the following command:
```bash
pip install llama-index
```


  2. Check Your Python Environment: Make sure you're using the correct Python environment where llama_index is installed. If you're using a virtual environment, ensure that it's activated when you run your Python script.
  3. Check Your Import Statement: Make sure your import statement is correct. It should be:
```python
from llama_index.data_structs.node import Node
```


If you've done all of this and still get the error, note that the llama_index.data_structs.node module only exists in older releases of llama_index; later versions moved or renamed these classes, so a mismatch between your installed version and the examples you're following can also cause this import to fail. Otherwise, there may be an issue with your Python installation or environment setup, in which case you might consider reinstalling Python or setting up a new virtual environment.
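A quick way to confirm which interpreter you're actually running and whether the package is importable from it (a generic diagnostic sketch, not specific to LlamaIndex):

```python
import sys
print(sys.executable)        # path of the Python interpreter actually in use

import llama_index           # raises ModuleNotFoundError if not installed here
print(llama_index.__file__)  # where the installed package lives
```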
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
Where can I specify the LLM I want to use in this?
Can I pass in a PDF when creating nodes?
@kapa.ai Can I pass in a PDF when creating nodes?
Let me check the knowledge base for you. One moment! :books:
----------------------
Based on the provided extracts, LlamaIndex does not directly support the creation of nodes from PDF files. The typical workflow involves creating Document objects from text data, which are then parsed into Node objects.

However, you can manually convert your PDF files into text using a library like PyPDF2 or PDFMiner, and then create Document objects from the extracted text. Here's a basic example of how you might do this:

```python
import PyPDF2
from llama_index import Document

# Note: this uses the pre-3.0 PyPDF2 API. PyPDF2 3.x renamed these calls
# (PdfFileReader -> PdfReader, numPages -> len(reader.pages),
# getPage(i).extractText() -> pages[i].extract_text()).

# Open the PDF file in binary mode
with open('your_file.pdf', 'rb') as file:
    # Create a PDF file reader object
    reader = PyPDF2.PdfFileReader(file)

    # Get the text content from the PDF
    text = ''
    for page_num in range(reader.numPages):
        page = reader.getPage(page_num)
        text += page.extractText()

    # Create a Document object from the text
    document = Document(text)
```


Replace 'your_file.pdf' with the path to your PDF file. After creating the Document object, you can then parse it into Node objects as per the usual LlamaIndex workflow.

Please note that this is a simplified example. The quality of the text extraction can vary depending on the PDF file, and you might need to do additional processing on the extracted text to get it in a suitable format for your needs.
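For the final step mentioned above, parsing the Document into Node objects, here is a minimal sketch using the node parser from the same legacy LlamaIndex API as the rest of this thread:

```python
from llama_index.node_parser import SimpleNodeParser

# Split the document into text chunks; each chunk becomes a Node,
# and the document's extra_info is copied onto every node.
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents([document])
```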

----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai How to use metadata with Documents?
Let me check the knowledge base for you. One moment! :books:
----------------------
In LlamaIndex, you can include metadata when creating Document objects. This metadata can be any key-value pairs that you want to associate with the document. The metadata is included in the extra_info parameter of the Document object.

Here are a few ways to set up this dictionary:

  1. In the document constructor: You can include the metadata when creating the Document object.
```python
from llama_index import Document

document = Document(
    'text',
    extra_info={
        'filename': '<doc_file_name>',
        'category': '<category>'
    }
)
```

In this example, replace 'text' with your actual text data, <doc_file_name> with the filename of the document, and <category> with the category of the document.

  2. After the document is created: You can also add or modify the metadata after the Document object has been created.
```python
document.extra_info = {'filename': '<doc_file_name>'}
```

In this example, replace <doc_file_name> with the filename of the document.

  3. Set the filename automatically using the SimpleDirectoryReader and file_metadata hook: This will automatically run the hook on each document to set the extra_info field.
```python
from llama_index import SimpleDirectoryReader

filename_fn = lambda filename: {'file_name': filename}

# automatically sets the extra_info of each document according to filename_fn
documents = SimpleDirectoryReader('./data', file_metadata=filename_fn).load_data()
```

In this example, replace './data' with the path to your directory of documents.
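The file_metadata hook is not limited to the filename: it can return any dict, and every key it returns ends up in that document's extra_info. A sketch with a hypothetical extra key (the 'category' value here is made up for illustration):

```python
import os
from llama_index import SimpleDirectoryReader

def file_metadata(filename: str) -> dict:
    # Every key returned here is set on the document's extra_info.
    return {
        'file_name': os.path.basename(filename),
        'category': 'report',  # hypothetical extra key
    }

documents = SimpleDirectoryReader('./data', file_metadata=file_metadata).load_data()
```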

Any information set in the extra_info dictionary of each document will show up in the extra_info of each source node created from the document. This information is included in the nodes, enabling the index to utilize it on queries and responses.
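Putting this together with the filtering shown earlier in the thread: because document metadata propagates to the nodes, you can filter retrieval on document-level keys. A sketch, assuming the 'category' key from the constructor example above:

```python
from llama_index import VectorStoreIndex
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter

# Build an index directly from the documents; their extra_info is
# carried onto the nodes created from each document.
index = VectorStoreIndex.from_documents(documents)

filters = MetadataFilters(filters=[
    ExactMatchFilter(key='category', value='<category>'),
])
retriever = index.as_retriever(filters=filters)
```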
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai Can I add additional params along with the filename? Like how we do in nodes?
Let me check the knowledge base for you. One moment! :books:
@sriani That's a great question, but unfortunately I am only capable of responding to a maximum of three questions per thread. Don't hesitate to create a new thread for any further questions you have though. I'm here to help!