You can't retrieve purely from metadata, unless you make nodes with only that metadata as text, which then link to the real nodes
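Roughly something like this (an untested sketch -- the metadata keys and values here are made up), using an IndexNode whose text is only the metadata and which points back at the real node:

from llama_index.core.schema import IndexNode, TextNode

# the "real" node holding the actual content
real_node = TextNode(
    text="...full chunk text...",
    metadata={"author": "Jane", "year": "2023"},  # hypothetical metadata
)

# a metadata-only node: its text is just the metadata, linked to the real node by id
meta_node = IndexNode(
    text="author: Jane\nyear: 2023",
    index_id=real_node.node_id,
)

You'd embed and retrieve over the metadata-only nodes, then resolve the link (e.g. with a RecursiveRetriever) to hand the real node to the LLM.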
Probably you could use some reranking or even auto-retrieval to help write metadata filters on the fly
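For the auto-retrieval route, the shape is roughly this (a sketch, not a drop-in: index is assumed to be an existing VectorStoreIndex over your nodes, and the metadata fields are invented for the example):

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

# describe the metadata so the LLM can write filters on the fly
vector_store_info = VectorStoreInfo(
    content_info="articles",
    metadata_info=[
        MetadataInfo(name="author", type="str", description="article author"),
        MetadataInfo(name="year", type="int", description="publication year"),
    ],
)

# index is assumed to be a VectorStoreIndex built over your nodes
retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
nodes = retriever.retrieve("articles by Jane after 2020")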
Thanks for the response. Can you send me a link?
I mean, how do I link metadata to nodes?
Awesome!
I will look at it and will be back if having issues.
Greatly appreciate it
@Logan M
In your provided code, and also in the Weaviate code, the nodes are text; however, mine are JSON files. Shall I convert them to text, or is there a way to use JSON directly instead of text?
You can't use JSON directly, it has to be text (LLMs and embedding models can only work with text)
@Logan M
Oh, ok!
So how can we index JSON files using LlamaIndex?
I know there is a JSON query engine, but I am looking for an indexing method
Depends on the JSON -- you can dump it all into one text blob in a single document, or you can break it into document objects that make sense
For example, if I were indexing an OpenAPI JSON spec, I would want each endpoint spec to be its own document
Since JSON can have any structure, there's no single way to do this -- it depends on your data
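For the OpenAPI case, a rough sketch (the file name and the paths/methods layout are assumptions about what your JSON looks like):

import json
from llama_index.core import Document

with open("openapi.json") as f:
    spec = json.load(f)

# one Document per endpoint, keeping the path and method as metadata
docs = []
for path, methods in spec.get("paths", {}).items():
    for method, details in methods.items():
        docs.append(
            Document(
                text=json.dumps(details, indent=2),
                metadata={"path": path, "method": method},
            )
        )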
@Logan M
Thanks for your clarification
@Logan M
I'm just curious: why does the LlamaIndex document reader support JSON, XML, YAML, etc. if they should be in text format for optimal indexing?
What's the use case of reading those documents?
Those readers put it into some (naive) text format
@Logan M
I mean, if it puts the JSON into a text format, why do we need extra work (compared with original txt/PDF formats) to retrieve the appropriate data with a RAG pipeline, as I asked above?
OK, let's take a step back and think about how retrieval works.
Text is embedded. This maps the text into some vector, so that at query time, we can embed the query text, and retrieve similar pieces of text.
With a PDF or txt file, we can split it into chunks (by sentences, or by a token limit) and then embed and retrieve each chunk.
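In code, that flow looks roughly like this (a minimal sketch: docs is a list of already-loaded Document objects, and the chunk sizes are arbitrary):

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# split documents into chunks, embed each chunk, then retrieve by similarity at query time
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)
retrieved = index.as_retriever().retrieve("some question")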
Now, think about JSON. It has some structure. The structure could be random, or it could follow some specific schema.
We can't just chunk at a token limit, because half of a JSON object loses the hierarchical structure of the original file (and this would also confuse the LLM).
So you need extra work to parse it into pieces that are meaningful, so that the embeddings retrieve the proper chunks per query, and so that the LLM can make sense of whatever chunk is retrieved.
TL;DR: JSON is not just a blob of text; it has structure, schema, and hierarchical information that you need to properly capture, both for retrieval and for sending to the LLM.
@Logan M
That's a great clarification.
Thanks
@Logan M
Hi again Logan,
Please consider the following code, which reads 20 PDF files in the my_pdf directory:
from llama_index.core import SimpleDirectoryReader

# only load PDF files
required_exts = [".pdf"]
reader = SimpleDirectoryReader(
    input_dir="./my_pdf",
    required_exts=required_exts,
    recursive=True,
)
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.text_splitter import TokenTextSplitter
from llama_index.core.ingestion import IngestionPipeline

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=20000, chunk_overlap=128
)
title_extractor = TitleExtractor(nodes=20)
# qa_extractor = QuestionsAnsweredExtractor(questions=5)

pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor,
        # qa_extractor,
    ]
)

nodes = pipeline.run(
    documents=docs,
    in_place=True,
    show_progress=True,
)
Now, if I want to index the nodes of these files using the in-memory vector store, in a way that all the nodes and their corresponding metadata are indexed, do I only need to use the following code?
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes=nodes, show_progress=True)
Or do I need to pass some other args to VectorStoreIndex?