You can't retrieve purely from metadata, unless you make nodes with only that metadata as text, which then link to the real nodes
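Roughly something like this (an untested sketch -- the metadata keys and values here are made up), using an IndexNode whose text is only the metadata and which points back at the real node:

from llama_index.core.schema import IndexNode, TextNode

# the "real" node holding the actual content
real_node = TextNode(
    text="...full chunk text...",
    metadata={"author": "Jane", "year": "2023"},  # hypothetical metadata
)

# a metadata-only node: its text is just the metadata, linked to the real node by id
meta_node = IndexNode(
    text="author: Jane\nyear: 2023",
    index_id=real_node.node_id,
)

You'd embed and retrieve over the metadata-only nodes, then resolve the link (e.g. with a RecursiveRetriever) to hand the real node to the LLM.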
Probably you could use some reranking or even auto-retrieval to help write metadata filters on the fly
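For the auto-retrieval route, the shape is roughly this (a sketch, not a drop-in: index is assumed to be an existing VectorStoreIndex over your nodes, and the metadata fields are invented for the example):

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

# describe the metadata so the LLM can write filters on the fly
vector_store_info = VectorStoreInfo(
    content_info="articles",
    metadata_info=[
        MetadataInfo(name="author", type="str", description="article author"),
        MetadataInfo(name="year", type="int", description="publication year"),
    ],
)

# index is assumed to be a VectorStoreIndex built over your nodes
retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
nodes = retriever.retrieve("articles by Jane after 2020")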
Thanks for the response. Can you send me a link?
I mean, how do I link metadata to nodes?
Awesome!
I will look at it and will be back if having issues.
Greatly appreciate it
@Logan M
In your provided code, and also in the Weaviate code, the nodes are text; however, mine are JSON files. Shall I convert them to text, or is there a way to use JSON directly instead of text?
You can't use JSON directly, it has to be text (LLMs and embedding models can only work with text)
@Logan M
Oh, ok!
So how can we index JSON files using LlamaIndex?
I know there is a JSON query engine, but I am looking for an indexing method
Depends on the JSON -- you can dump it all into one text blob in a single document, or you can break it into document objects that make sense
For example, if I were indexing an OpenAPI JSON spec, I would want each endpoint spec to be its own document
Since JSON can have any structure, there's no single way to do this -- it depends on your data
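For the OpenAPI case, a rough sketch (the file name and the paths/methods layout are assumptions about what your JSON looks like):

import json
from llama_index.core import Document

with open("openapi.json") as f:
    spec = json.load(f)

# one Document per endpoint, keeping the path and method as metadata
docs = []
for path, methods in spec.get("paths", {}).items():
    for method, details in methods.items():
        docs.append(
            Document(
                text=json.dumps(details, indent=2),
                metadata={"path": path, "method": method},
            )
        )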
@Logan M
Thanks for your clarification
@Logan M
I'm just curious: why does the LlamaIndex document reader support JSON, XML, YAML, etc. if they should be in text format for optimal indexing?
What's the use case of reading those documents?
Those readers put it into some (naive) text format
@Logan M
I mean, if it puts the JSON into a text format, why do we need extra work (compared with original txt/PDF formats) to retrieve the appropriate data with a RAG pipeline, as I asked above?
OK, let's take a step back and think about how retrieval works.
Text is embedded. This maps the text into some vector, so that at query time, we can embed the query text, and retrieve similar pieces of text.
With a PDF or txt file, we can split it into chunks (by sentences, or by a token limit) and then embed and retrieve each chunk.
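In code, that flow looks roughly like this (a minimal sketch: docs is a list of already-loaded Document objects, and the chunk sizes are arbitrary):

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# split documents into chunks, embed each chunk, then retrieve by similarity at query time
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)
retrieved = index.as_retriever().retrieve("some question")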
Now, think about JSON. It has some structure. The structure could be random, or it could follow some specific schema.
We can't just chunk at a token limit, because half of a JSON object loses the hierarchical structure of the original file (and this would also confuse the LLM).
So you need extra work to parse it into pieces that are meaningful, so that the embeddings retrieve the proper chunks per query, and so that the LLM can make sense of whatever chunk is retrieved.
TL;DR: JSON is not just a blob of text; it has structure, schema, and hierarchical information that you need to properly capture, both for retrieval and for sending to the LLM.
@Logan M
That's a great clarification.
Thanks
@Logan M
Hi again Logan,
Please consider the following code, which reads 20 PDF files in the my_pdf directory:
from llama_index.core import SimpleDirectoryReader

# only load PDF files
required_exts = [".pdf"]
reader = SimpleDirectoryReader(
    input_dir="./my_pdf",
    required_exts=required_exts,
    recursive=True,
)
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.text_splitter import TokenTextSplitter
from llama_index.core.ingestion import IngestionPipeline

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=20000, chunk_overlap=128
)
title_extractor = TitleExtractor(nodes=20)
# qa_extractor = QuestionsAnsweredExtractor(questions=5)

pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor,
        # qa_extractor,
    ]
)

nodes = pipeline.run(
    documents=docs,
    in_place=True,
    show_progress=True,
)
Now, if I want to index the nodes of these files using the in-memory vector store, in a way that all the nodes and their corresponding metadata are indexed, do I only need to use the following code?
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes=nodes, show_progress=True)
Or do I need to pass some other args to VectorStoreIndex?