Find answers from the community

Updated last year

hi, is there any way to load **json**

At a glance
The community member is trying to load JSON documents from a directory and only vectorize/index a specific field from each JSON document, while still being able to access the other fields for statistics. They have found the SimpleDirectoryReader and are trying to understand how to customize the document loading process. The comments discuss various approaches, including creating a custom Document subclass, using the extra_info parameter, and understanding the default metadata behavior. The community members also have questions about retrieving metadata fields that were not indexed, and the differences between document IDs during ingestion and retrieval. They also discuss issues with customizing the prompt and top-k parameters in the query engine.
hi, is there any way to load JSON documents from a directory with, for example:
documents = SimpleDirectoryReader("./transcriptions_test_json/").load_data()
and then say I only want to vectorize/create the index using one of the fields of each JSON?

for example a json is:

"c49c7a9b-6a12-5f1f-ba76-b81d986e5bc7": { "video_name": "videoplayback2.mp4", "video_path": "/mnt/nas/videos/0-ops/videoplayback2.mp4", "original_text": " Good evening and welcome to T...", "length_characters": 7585, "original_lang": "en", "video_section": "0-ops" }

and I'd only like to vectorize the original_text field, but when I retrieve with the query, before generating the final answer, I'd like to use the rest of the fields, potentially for statistics.

The SimpleDirectoryReader can ingest the JSON and I can access each of the ingested JSONs inside the documents read, but it's getting the whole JSON as a string...

Plain Text
print(documents)
print("\n")
for doc in documents:
    print(doc.text)
    print("\n")
78 comments
I've found this example: https://gpt-index.readthedocs.io/en/latest/examples/query_engine/json_query_engine.html but it's not what I was expecting.
I would like something that resembles more how Weaviate operates.
Ingesting the documents as JSON files but specifying which fields to vectorize, and then in the query being able to specify as well which of the vectorized JSON fields to use to retrieve the related documents.
Is this possible to do in LlamaIndex?
Hmm I think you'd have to write a function from scratch to do what you are describing

But really, it's not that hard either. You can do json.load() to load the json, iterate over it, and create document objects.

By default, text and metadata are embedded and also sent to the LLM, but you can control both
Plain Text
document = Document(
  text="...",
  metadata={"key1": "val",},
  excluded_llm_metadata_keys=["key1"],
  excluded_embed_metadata_keys=["key1"]
)
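Applied to the JSON shape from the original question, that could look roughly like this (a sketch, not code from the thread; the file name is hypothetical, and only original_text gets embedded while the other fields stay as metadata):

Plain Text
import json
from llama_index import Document

# hypothetical file from the ./transcriptions_test_json/ directory
with open("./transcriptions_test_json/videoplayback2.json") as f:
    data = json.load(f)

documents = []
for entry in data.values():
    # everything except original_text becomes metadata
    metadata = {k: v for k, v in entry.items() if k != "original_text"}
    documents.append(
        Document(
            text=entry["original_text"],
            metadata=metadata,
            # keep the extra fields out of the embedding, but still available later
            excluded_embed_metadata_keys=list(metadata.keys()),
        )
    )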
Does this mean I have to define a custom Document subclass? And how does documents = SimpleDirectoryReader( then know that it's ingesting this new kind of document?

also, the point is that if I have a JSON file for each text, each document would be formed from a JSON file, no?
from llama_index import Document
it's the same document class, just creating it manually πŸ™‚
^ guide for adding a custom loader to simple directory reader (i.e. for json)
I was doing the code:
Attachment
image.png
but when I got to defining the rest of the fields as metadata, and whether they should be indexed or not, I saw this:
there's a loader for JSONs but it doesn't say anything about how LlamaIndex then deals with each field
it's pretty basic -- most community contributed loaders, you have to read the code tbh lol
this looks right to me -- did it not work?
the code I started works well, but I was a bit lost about how to combine this part

metadata={"key1": "val",},
excluded_llm_metadata_keys=["key1"],
excluded_embed_metadata_keys=["key1"]
)...
with the default behaviour
Attachment
image.png
the file_path from the metadata is vectorized
so the vector associated with a file includes in its context not only the text of the file but also the path?
yea by default all attached metadata is embedded and also read by the LLM -- you can control this with those excluded attributes

Plain Text
from llama_index.schema import MetadataMode
print(document.get_content(metadata_mode=MetadataMode.EMBED))


That's what the embedding model ends up getting under the hood
not in my case. I've executed the code above and it confirms it takes the text of the file and only the file_path from the metadata. The rest of the metadata fields are not taken by default
but why is it taking the path of the file by default? and why should it take creation_date, file type, etc. by default? I think this puts irrelevant info into the vector space in most cases
I can exclude the file_path from the embeddings with:
Attachment
image.png
the rest of the metadata fields get excluded even if I don't specify them
the opposite happens with the metadata. If I do this:
Attachment
image.png
then I lose the predefined metadata fields of the documents
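The screenshots aren't reproduced here, but a minimal sketch of the kind of exclusion being discussed (assuming documents loaded with SimpleDirectoryReader) would be:

Plain Text
# keep the default file metadata, but stop file_path from being embedded
for doc in documents:
    doc.excluded_embed_metadata_keys.append("file_path")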
I'm a little lost, but this seems to work fine for me.

If you want to exclude filenames, feel free to exclude them, but in most cases it doesn't make any difference in embeddings (or in fact, it even helps, since most file names are either descriptive or a user might ask about a particular file)

Single test JSON
Plain Text
{"text": "Hello world!", "extra": "extra data"}


Test script
Plain Text
import json
from llama_index import SimpleDirectoryReader
from llama_index.readers.base import BaseReader
from llama_index.schema import Document


class JsonReader(BaseReader):
    def load_data(self, file, extra_info=None):
        with open(file, "r") as f:
            data = json.load(f)
        # load_data returns a list of Document objects
        extra_info = extra_info or {}
        extra_info["extra"] = data["extra"]

        return [Document(text=data["text"], extra_info=extra_info)]


reader = SimpleDirectoryReader(
    input_dir="./test_jsons", file_extractor={".json": JsonReader()}
)

documents = reader.load_data()

from llama_index.schema import MetadataMode

print("Embed data\n", documents[0].get_content(metadata_mode=MetadataMode.EMBED))
print("LLM data\n", documents[0].get_content(metadata_mode=MetadataMode.LLM))
print("All data\n", documents[0].get_content(metadata_mode=MetadataMode.ALL))


And output
Plain Text
Embed data
file_path: test_jsons/test.json
extra: extra data

Hello world!
LLM data
file_path: test_jsons/test.json
extra: extra data

Hello world!
All data
file_path: test_jsons/test.json
file_name: test.json
file_type: application/json
file_size: 48
creation_date: 2023-11-25
last_modified_date: 2023-11-25
last_accessed_date: 2023-11-25
extra: extra data

Hello world!
this is what I get when defining the metadata
So, when one defines new metadata and passes it to the Document inside the reader's load_data function, the default metadata is gone.
The default exclusions are preserved, and if one adds more, they are just appended.
Attachment
image.png
Attachment
image.png
Attachment
image.png
In your function, instead of defining new metadata, use the extra_info that is passed in

Plain Text
extra_info = extra_info or {}
extra_info["extra"] = data["extra"]

return [Document(text=data["text"], extra_info=extra_info)]


Then the default metadata is not lost
I'll try it and let you know. Thanks Logan!
yes, it works. Indeed I didn't have to specify the first line you mentioned
Attachment
image.png
the dictionary is defined and initially built in the Base.py class
So, new key/value pairs can be added just by using an assignment
I have a couple of questions @Logan M :
  1. when retrieving the documents from the index by making a query, is it possible to go through each of the documents and get some of their metadata fields even if they were not indexed? I'd like to display that info along with the LLM-generated answer.
  2. when I create the documents and show the doc ids, these are different from the doc ids of the retrieved documents. Is this normal? There are only 2 documents ingested and the ids are different when the retrieval is done...
Attachment
image.png
it's also strange that I only ingest the documents a single time, yet the results show the same documents repeated with different doc_ids
Attachment
image.png
get_formatted_sources is maybe being a little misleading
when you put Document objects into an index, they are broken into nodes, each with their own unique node_id. What is being reported in get_formatted_sources is actually the node_id

Each node has a ref_doc_id attribute pointing to the parent document

response.source_nodes[0].node.ref_doc_id

And metadata is accessed similarly

response.source_nodes[0].node.metadata
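Putting that together, a minimal sketch of reading those fields back at query time (the query string is just a placeholder, and query_engine is assumed to be built from the index above):

Plain Text
response = query_engine.query("What is discussed in the videos?")
print(response.response)

# each source node carries the metadata of the document it came from
for source in response.source_nodes:
    print(source.node.ref_doc_id, source.node.metadata, source.score)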
@Logan M I'm trying to create a query engine where one can specify the number of results to be taken into account for the generation step and also customize the prompt. With query_engine2 I can customize the prompt but the top_k doesn't have an effect, while with query_engine3 I can change the top_k but it seems a customized prompt cannot be passed as an argument...
Attachment
image.png
query engine 2, retriever is not actually an argument that gets used, you need to pass index.as_query_engine(text_qa_template=qa_template, similarity_top_k=4)

query engine 3, you need to pass the prompt in to the response synthesizer (this is where as_query_engine is also passing it)

response_synthesizer = get_response_synthesizer(text_qa_template=qa_template)
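A sketch of both variants side by side (qa_template and index are assumed from the notebook shown in the screenshots):

Plain Text
from llama_index import get_response_synthesizer
from llama_index.query_engine import RetrieverQueryEngine

# Variant A: let the index wire everything up
query_engine2 = index.as_query_engine(
    text_qa_template=qa_template, similarity_top_k=4
)

# Variant B: compose retriever + response synthesizer manually
retriever = index.as_retriever(similarity_top_k=4)
response_synthesizer = get_response_synthesizer(text_qa_template=qa_template)
query_engine3 = RetrieverQueryEngine(
    retriever=retriever, response_synthesizer=response_synthesizer
)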
ok @Logan M, now query_engine2 uses the prompt and the top_k as intended, but it makes use of a "refine_template". Do you know why?
query_engine3 is not returning anything in the answer...
Attachment
image.png
Attachment
image.png
are two calls to the LLM made then? one for the initial query and one to synthesize it? is this effective if we are passing the same retrieved documents/nodes again?
right, there's a text_qa_template and a refine_template

If the nodes retrieved do not fit into a single LLM call, then it will use the text_qa_template to get an initial answer, and then use the refine_template to refine that answer for the rest of the chunks

The refine_template shows the LLM the existing answer, user query, and new context, and asks the LLM to either update or repeat the existing answer given the new context

Since you set the top_k to 4 (and I'm guessing you are using the default chunk size), this retrieves 4x1024 tokens worth of nodes, which needs to be split across LLM calls
got it. So, there are 3 techniques at a theoretical level: map_reduce, refine and map_rerank. But in LlamaIndex there are even more options: https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/root.html
Right, but a few of those options are similar to each other

For example, compact is just refine, but making every LLM call use as much context as possible (rather than one call per node)
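For example, switching the mode is just another kwarg on the query engine (a sketch, reusing the qa_template from above):

Plain Text
# "compact" packs as many retrieved chunks as fit into each LLM call,
# instead of making one refine call per node
query_engine = index.as_query_engine(
    text_qa_template=qa_template,
    similarity_top_k=4,
    response_mode="compact",
)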
got it. However, query_engine3 is still not picking up the template
Attachment
image.png
where do you create the response synthesizer?
That's where the template has to go
yep, there was a typo. Thanks @Logan M
Hi @Logan M, I'm trying to tell the LLM to avoid giving its opinion and avoid saying that there's no evidence about the statements retrieved. I've changed the prompts for the qa_template and the refine_template. However, the answers continue to include sentences like: "there's no evidence about what the author says", etc.
Would it be possible to get rid of those warnings? I cannot do it just with prompt tuning.
Attachment
image.png
I could solve the problem by changing the LLM
llm = Ollama(model="wizard-vicuna-uncensored",base_url="http://192.168.1.xyz:11435")
apparently, many topics are biased towards "lack of evidence" responses in the default LLMs.
It's important to note that I had to restart the notebook kernel, apart from generating the index again. It seems Ollama keeps some state independently of re-executing the notebook cells.
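The thread doesn't show how the Ollama model was wired into the index; one way to do it in llama_index 0.9.x would be roughly the following (embedding model setup omitted):

Plain Text
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import Ollama

llm = Ollama(model="wizard-vicuna-uncensored", base_url="http://192.168.1.xyz:11435")
# the service context makes the query engine use the local Ollama model for synthesis
service_context = ServiceContext.from_defaults(llm=llm)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)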
The only option is prompt engineering I'm afraid, or using a different LLM πŸ€” I've run into the same issue as well, tuning prompts can take ages :PSadge: gpt-3.5 is especially bad for this. Maybe the new gpt-3.5-turbo-1106 is better, or even gpt-4-turbo
@Logan M I could solve the issue by using the wizard-vicuna-uncensored model.
I was thinking I would have to fine-tune a llama2 model myself, but it wasn't required.
Another option, in case that hadn't worked, would have been to use a few-shot technique in the prompt?
I have seen some stuff related to this in llamaindex docs:
https://docs.llamaindex.ai/en/stable/examples/prompts/prompts_rag.html
Yea few shot prompting is definitely a good idea -- but it can also use up valuable context window space. Tradeoffs I suppose πŸ™‚
By the way, what were you referring to by tuning the prompt? Just specifying a custom prompt and stressing that it should not doubt the "evidence" didn't work. And passing a very long prompt to "exploit" the model isn't working, I guess because the prompt is already long since it contains all the documents/nodes.
Yea that's what I meant, just playing around with how the prompt is worded

LLMs tend to pay more attention to the start (and sometimes end) of the prompt
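As an illustration of that wording advice, here is a sketch of a QA template with the behavioural instruction placed right at the start (the actual prompts used in the thread were only shown in screenshots):

Plain Text
from llama_index.prompts import PromptTemplate

qa_template = PromptTemplate(
    "Answer using only the context below. Do not comment on how much "
    "evidence there is; simply summarize what the context says.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(text_qa_template=qa_template)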
@kapa.ai how to include prompts in ReActAgent.from_tools()
Let me check the knowledge base for you. One moment! :books:
----------------------
The ReActAgent.from_tools() method in LlamaIndex does not directly accept prompts as an argument. Instead, it takes a set of tools (which are instances of BaseTool or its subclasses) and an instance of a language model (such as OpenAI with a specified model like "gpt-4").

However, you can view the prompts used by the ReActAgent after it has been created. The get_prompts() method returns a dictionary where the keys are the names of the prompts and the values are the prompt templates. Here's an example of how to do this:

Plain Text
from llama_index.agent import ReActAgent
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

prompt_dict = agent.get_prompts()
for k, v in prompt_dict.items():
    print(f"Prompt: {k}\n\nValue: {v.template}")


In this example, multiply_tool and add_tool are instances of BaseTool or its subclasses. You would replace these with the actual tools you want to use.

For more information, you can refer to the LlamaIndex documentation and the example notebook on GitHub.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how to load multiple JSON files using JSONReader?
Let me check the knowledge base for you. One moment! :books:
----------------------
To load multiple JSON files using JSONReader in LlamaIndex, you can create an instance of JSONReader and call the load_data() method for each file. Here's an example:

Plain Text
from llama_index.readers import JSONReader

# Create an instance of JSONReader
reader = JSONReader()

# List of JSON files
json_files = ["file1.json", "file2.json", "file3.json"]

# Load data from each file
documents = []
for file in json_files:
    documents.extend(reader.load_data(input_file=file))


In this example, file1.json, file2.json, and file3.json are the JSON files you want to load. Replace these with the actual paths to your JSON files.

The load_data() method returns a list of Document objects, which are then added to the documents list.

For more information, refer to the LlamaIndex documentation on JSONReader.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how to include few instructions to the default prompt of ReAct Agent?
Let me check the knowledge base for you. One moment! :books:
----------------------
To include additional instructions to the default prompt of a ReAct Agent in LlamaIndex, you can modify the react_system_header_str variable, which is a string that contains the default prompt. This string is used to create a PromptTemplate object, which is then used by the agent.

Here's an example of how to do this:

Plain Text
from llama_index.prompts import PromptTemplate

react_system_header_str = """\

You are designed to help with a variety of tasks, from answering questions \
    to providing summaries to other types of analyses.

## Tools
You have access to a wide variety of tools. You are responsible for using
the tools in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different tools
to complete each subtask.

You have access to the following tools:
{tool_desc}

## Output Format
To answer the question, please use the following format.

Thought: I need to use a tool to help me answer the question.
Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world", "num_beams": 5}})
Please ALWAYS start with a Thought.

Please use a valid JSON format for the Action Input. Do NOT do this {{'input': 'hello world', 'num_beams': 5}}.

If this format is used, the user will respond in the following format:

Observation: tool response

You should keep repeating the above format until you have enough information
to answer the question without using any more tools. At that point, you MUST respond
in one of the following two formats:

Thought: I can answer without using any more tools.
Answer: [your answer here]

Thought: I cannot answer the question with the provided tools.
Answer: Sorry, I cannot answer your query.

## Additional Rules

  • The answer MUST contain a sequence of bullet points that explain how you arrived at the answer. This can include aspects of the previous conversation history.
  • You MUST obey the function signature of each tool. Do NOT pass in no arguments if the function expects arguments.

## Your Instructions

  • [Add your instructions here]
"""

# Create a PromptTemplate object
prompt_template = PromptTemplate(react_system_header_str)


In this example, replace [Add your instructions here] with the instructions you want to add. You can add as many instructions as you want, each on a new line.

For more information, refer to the LlamaIndex documentation on PromptTemplate and the example notebook on GitHub.
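The answer above stops at building the template; to actually apply it to an existing agent, the pattern from the LlamaIndex prompt-customization docs is roughly the following (the prompt key should be checked against agent.get_prompts()):

Plain Text
react_system_prompt = PromptTemplate(react_system_header_str)

# swap in the customized system prompt and clear any prior reasoning state
agent.update_prompts({"agent_worker:system_prompt": react_system_prompt})
agent.reset()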
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------