To implement an indexing pipeline that adds the name of JSON files as their metadata, you can use the `SimpleDirectoryReader` and the `file_metadata` hook provided by LlamaIndex. The reader automatically runs the hook on each document to set its `metadata` field. Here's an example of how you can do it:
```python
from llama_index.core import SimpleDirectoryReader

filename_fn = lambda filename: {"file_name": filename}

# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data", file_metadata=filename_fn
).load_data()
```
In this example, `filename_fn` is a function that takes a filename and returns a dictionary containing it. The `SimpleDirectoryReader` is initialized with the directory containing your JSON files and with the `file_metadata` hook set to `filename_fn`. It loads the data from the files in that directory and automatically sets the metadata of each document according to `filename_fn`.
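The `file_metadata` hook is just a callable from a file path to a metadata dictionary, so you can return richer metadata than the bare filename. Here is a minimal pure-Python sketch (the function name and the extra `file_ext` key are illustrative choices, not part of the LlamaIndex API):

```python
import os


def file_metadata(file_path: str) -> dict:
    # Derive metadata from the path alone; SimpleDirectoryReader attaches
    # the returned dict to every document loaded from that file.
    return {
        "file_name": os.path.basename(file_path),
        "file_ext": os.path.splitext(file_path)[1],
    }


print(file_metadata("./data/report.json"))
# → {'file_name': 'report.json', 'file_ext': '.json'}
```

You would pass this function as `file_metadata=file_metadata` in place of `filename_fn` above.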
After loading the documents, you can create an indexing pipeline using the `VectorStoreServer` and its `from_llamaindex_components` method. Here's an example:
```python
from pathway.xpacks.llm.vector_store import VectorStoreServer
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import TokenTextSplitter

embed_model = OpenAIEmbedding(embed_batch_size=10)

transformations_example = [
    TokenTextSplitter(
        chunk_size=150,
        chunk_overlap=10,
        separator=" ",
    ),
    embed_model,
]

# data_sources is assumed to be defined earlier,
# e.g. a list of Pathway input tables feeding the server.
processing_pipeline = VectorStoreServer.from_llamaindex_components(
    *data_sources,
    transformations=transformations_example,
)
```
In this example, the text is first split using `TokenTextSplitter` and then embedded with `OpenAIEmbedding`. The `VectorStoreServer` is initialized with the data sources and these transformations.
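To make the chunking parameters concrete, here is a simplified pure-Python sketch of what overlap-based splitting does (whitespace tokens only; the real `TokenTextSplitter` uses a tokenizer, so this is an illustration of the `chunk_size`/`chunk_overlap` semantics, not its implementation):

```python
def split_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each chunk holds up to `chunk_size` tokens, and consecutive chunks
    # share `chunk_overlap` tokens so context is not lost at boundaries.
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks


print(split_tokens("a b c d e f g", chunk_size=4, chunk_overlap=2))
# → ['a b c d', 'c d e f', 'e f g']
```

With `chunk_size=150` and `chunk_overlap=10` as above, each 150-token chunk would share its last 10 tokens with the start of the next one.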