Find answers from the community

Updated last year

hi, is there any way to load **json**

At a glance
The community member is trying to load JSON documents from a directory and only vectorize/index a specific field from each JSON document, while still being able to access the other fields for statistics. They have found the SimpleDirectoryReader and are trying to understand how to customize the document loading process. The comments discuss various approaches, including creating a custom Document subclass, using the extra_info parameter, and understanding the default metadata behavior. The community members also have questions about retrieving metadata fields that were not indexed, and the differences between document IDs during ingestion and retrieval. They also discuss issues with customizing the prompt and top-k parameters in the query engine.
hi, is there any way to load JSON documents from a directory with, for example:
documents = SimpleDirectoryReader("./transcriptions_test_json/").load_data()
and then say I only want to vectorize/create the index using one of the fields of each JSON?

for example a json is:

"c49c7a9b-6a12-5f1f-ba76-b81d986e5bc7": { "video_name": "videoplayback2.mp4", "video_path": "/mnt/nas/videos/0-ops/videoplayback2.mp4", "original_text": " Good evening and welcome to T...", "length_characters": 7585, "original_lang": "en", "video_section": "0-ops" }

and I'd only like to vectorize the original_text field, but when I retrieve with the query, before generating the final answer, I'd like to use the rest of the fields, potentially for statistics.

The SimpleDirectoryReader can ingest the JSON and I can access each of the ingested JSONs inside the documents read, but it's getting the whole JSON as a string...

Plain Text
print(documents)
print("\n")
for doc in documents:
    print(doc.text)
    print("\n")
78 comments
I've found this example: https://gpt-index.readthedocs.io/en/latest/examples/query_engine/json_query_engine.html but it's not what I was expecting.
I would like something that resembles more how Weaviate operates.
Ingesting the documents as JSON files but specifying which fields to vectorize, and then in the query being able to specify as well which of the vectorized JSON fields to use to retrieve the related documents.
Is this possible to do in LlamaIndex?
Hmm I think you'd have to write a function from scratch to do what you are describing

But really, it's not that hard either. You can do json.load() to load the json, iterate over it, and create document objects.

By default, text and metadata are embedded and also sent to the LLM, but you can control both
Plain Text
document = Document(
  text="...",
  metadata={"key1": "val",},
  excluded_llm_metadata_keys=["key1"],
  excluded_embed_metadata_keys=["key1"]
)
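Applied to the JSON shape from the original question, that could look roughly like this (a sketch, not code from the thread; the file name is hypothetical, and only original_text gets embedded while the other fields stay as metadata):

Plain Text
import json
from llama_index import Document

# hypothetical file from the ./transcriptions_test_json/ directory
with open("./transcriptions_test_json/videoplayback2.json") as f:
    data = json.load(f)

documents = []
for entry in data.values():
    # everything except original_text becomes metadata
    metadata = {k: v for k, v in entry.items() if k != "original_text"}
    documents.append(
        Document(
            text=entry["original_text"],
            metadata=metadata,
            # keep the extra fields out of the embedding, but still available later
            excluded_embed_metadata_keys=list(metadata.keys()),
        )
    )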
Does this mean I have to define a custom Document subclass? And how does documents = SimpleDirectoryReader( then know that it's ingesting this new kind of document?

also, the point is that if I have a JSON file for each text, each document would be formed from a JSON file, no?
from llama_index import Document
it's the same document class, just creating it manually πŸ™‚
^ guide for adding a custom loader to simple directory reader (i.e. for json)
I was doing the code:
Attachment
image.png
but when I got to defining the rest of the fields as metadata, and whether they should be indexed or not, I saw this:
there's a loader for JSONs but it doesn't say anything about how LlamaIndex then deals with each field
it's pretty basic -- most community contributed loaders, you have to read the code tbh lol
this looks right to me -- did it not work?
the code I started works well, but I was a bit lost about how to combine this part

metadata={"key1": "val",},
excluded_llm_metadata_keys=["key1"],
excluded_embed_metadata_keys=["key1"]
)...
with the default behaviour
Attachment
image.png
the file_path from the metadata is vectorized
so the vector associated with a file includes in its context not only the text of the file but also the path?
yea by default all attached metadata is embedded and also read by the LLM -- you can control this with those excluded attributes

Plain Text
from llama_index.schema import MetadataMode
print(document.get_content(metadata_mode=MetadataMode.EMBED))


That's what the embedding model ends up getting under the hood
not in my case. I've executed the code above and it confirms it takes the text of the file and only the file_path from the metadata. The rest of the metadata fields are not taken by default
but why is it taking the path of the file by default? and why should it take creation_date, file type, etc. by default? I think this puts irrelevant info into the vector space in most cases
I can exclude the file_path from the embeddings with:
Attachment
image.png
the rest of the metadata fields get excluded even if I don't specify them
the opposite happens with the metadata. If I do this:
Attachment
image.png
then I lose the predefined metadata fields of the documents
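The screenshots aren't reproduced here, but a minimal sketch of the kind of exclusion being discussed (assuming documents loaded with SimpleDirectoryReader) would be:

Plain Text
# keep the default file metadata, but stop file_path from being embedded
for doc in documents:
    doc.excluded_embed_metadata_keys.append("file_path")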
I'm a little lost, but this seems to work fine for me.

If you want to exclude filenames, feel free to exclude them, but in most cases it doesn't make any difference in embeddings (or in fact, it even helps, since most file names are either descriptive or a user might ask about a particular file)

Single test JSON
Plain Text
{"text": "Hello world!", "extra": "extra data"}


Test script
Plain Text
import json
from llama_index import SimpleDirectoryReader
from llama_index.readers.base import BaseReader
from llama_index.schema import Document


class JsonReader(BaseReader):
    def load_data(self, file, extra_info=None):
        with open(file, "r") as f:
            data = json.load(f)
        # load_data returns a list of Document objects
        extra_info = extra_info or {}
        extra_info["extra"] = data["extra"]

        return [Document(text=data["text"], extra_info=extra_info)]


reader = SimpleDirectoryReader(
    input_dir="./test_jsons", file_extractor={".json": JsonReader()}
)

documents = reader.load_data()

from llama_index.schema import MetadataMode

print("Embed data\n", documents[0].get_content(metadata_mode=MetadataMode.EMBED))
print("LLM data\n", documents[0].get_content(metadata_mode=MetadataMode.LLM))
print("All data\n", documents[0].get_content(metadata_mode=MetadataMode.ALL))


And output
Plain Text
Embed data
file_path: test_jsons/test.json
extra: extra data

Hello world!
LLM data
file_path: test_jsons/test.json
extra: extra data

Hello world!
All data
file_path: test_jsons/test.json
file_name: test.json
file_type: application/json
file_size: 48
creation_date: 2023-11-25
last_modified_date: 2023-11-25
last_accessed_date: 2023-11-25
extra: extra data

Hello world!
this is what I get when defining the metadata
So, when one defines new metadata and passes it to the Document inside the reader's load_data function, the default metadata is gone.
The default exclusions are preserved, and if one adds more, they are just appended.
Attachment
image.png
Attachment
image.png
Attachment
image.png
In your function, instead of defining new metadata, use the extra_info that is passed in

Plain Text
extra_info = extra_info or {}
extra_info["extra"] = data["extra"]

return [Document(text=data["text"], extra_info=extra_info)]


Then the default metadata is not lost
I'll try it and let you know. Thanks Logan!
yes, it works. Indeed I didn't have to specify the first line you mentioned
Attachment
image.png
the dictionary is defined and initially built in the Base.py class
So, new key/value pairs can be added just by using an assignment
I have a couple of questions @Logan M :
  1. when retrieving the documents from the index by making a query, is it possible to go through each of the documents and get some of their metadata fields even if they were not indexed? I'd like to display that info along with the LLM-generated answer.
  2. when I create the documents and show the doc ids, these are different from the doc ids of the retrieved documents. Is this normal? There are only 2 documents ingested and the ids are different when the retrieval is done...
Attachment
image.png
it's also strange that I only ingest the documents a single time, yet the results show the same documents repeated with different doc_ids
Attachment
image.png
get_formatted_sources is maybe being a little misleading
when you put Document objects into an index, they are broken into nodes, each with their own unique node_id. What is being reported in get_formatted_sources is actually the node_id

Each node has a ref_doc_id attribute pointing to the parent document

response.source_nodes[0].node.ref_doc_id

And metadata is accessed similarly

response.source_nodes[0].node.metadata
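Putting that together, a minimal sketch of reading those fields back at query time (the query string is just a placeholder, and query_engine is assumed to be built from the index above):

Plain Text
response = query_engine.query("What is discussed in the videos?")
print(response.response)

# each source node carries the metadata of the document it came from
for source in response.source_nodes:
    print(source.node.ref_doc_id, source.node.metadata, source.score)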
@Logan M I'm trying to create a query engine where one can specify the number of results to be taken into account for the generation step and also customize the prompt. With query_engine2 I can customize the prompt but the top_k doesn't have an effect, while with query_engine3 I can change the top_k but it seems a customized prompt cannot be passed as an argument...
Attachment
image.png
query engine 2, retriever is not actually an argument that gets used, you need to pass index.as_query_engine(text_qa_template=qa_template, similarity_top_k=4)

query engine 3, you need to pass the prompt in to the response synthesizer (this is where as_query_engine is also passing it)

response_synthesizer = get_response_synthesizer(text_qa_template=qa_template)
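A sketch of both variants side by side (qa_template and index are assumed from the notebook shown in the screenshots):

Plain Text
from llama_index import get_response_synthesizer
from llama_index.query_engine import RetrieverQueryEngine

# Variant A: let the index wire everything up
query_engine2 = index.as_query_engine(
    text_qa_template=qa_template, similarity_top_k=4
)

# Variant B: compose retriever + response synthesizer manually
retriever = index.as_retriever(similarity_top_k=4)
response_synthesizer = get_response_synthesizer(text_qa_template=qa_template)
query_engine3 = RetrieverQueryEngine(
    retriever=retriever, response_synthesizer=response_synthesizer
)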
ok @Logan M, now query_engine2 uses the prompt and the top_k as intended, but it makes use of a "refine_template". Do you know why?
query_engine3 is not returning anything in the answer...
Attachment
image.png
Attachment
image.png
are two calls to the LLM made then? one for the initial query and one to synthesize it? is this effective if we are passing the same retrieved documents/nodes again?
right, there's a text_qa_template and a refine_template

If the nodes retrieved do not fit into a single LLM call, then it will use the text_qa_template to get an initial answer, and then use the refine_template to refine that answer for the rest of the chunks

The refine_template shows the LLM the existing answer, user query, and new context, and asks the LLM to either update or repeat the existing answer given the new context

Since you set the top_k to 4 (and I'm guessing you are using the default chunk size), this retrieves 4x1024 tokens worth of nodes, which needs to be split across LLM calls
got it. So, there are 3 techniques at a theoretical level: map_reduce, refine and map_rerank. But in LlamaIndex there are even more options: https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/root.html
Right, but a few of those options are similar to each other

For example, compact is just refine, but making every LLM call use as much context as possible (rather than one call per node)
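For example, switching the mode is just another kwarg on the query engine (a sketch, reusing the qa_template from above):

Plain Text
# "compact" packs as many retrieved chunks as fit into each LLM call,
# instead of making one refine call per node
query_engine = index.as_query_engine(
    text_qa_template=qa_template,
    similarity_top_k=4,
    response_mode="compact",
)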
got it. However, query_engine3 is still not picking up the template
Attachment
image.png
where do you create the response synthesizer?
That's where the template has to go
yep, there was a typo. Thanks @Logan M
Hi @Logan M, I'm trying to tell the LLM to avoid giving its opinion and avoid saying that there's no evidence about the statements retrieved. I've changed the prompts for the qa_template and the refine_template. However, the answers continue to include sentences like: "there's no evidence about what the author says", etc.
Would it be possible to get rid of those warnings? I cannot do it just with prompt tuning.
Attachment
image.png
I could solve the problem by changing the LLM
llm = Ollama(model="wizard-vicuna-uncensored",base_url="http://192.168.1.xyz:11435")
apparently, many topics are biased towards "lack of evidence" responses in the default LLMs.
It's important to note that I had to restart the notebook kernel, apart from generating the index again. It seems Ollama keeps some state independently of re-executing the notebook cells.
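The thread doesn't show how the Ollama model was wired into the index; one way to do it in llama_index 0.9.x would be roughly the following (embedding model setup omitted):

Plain Text
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import Ollama

llm = Ollama(model="wizard-vicuna-uncensored", base_url="http://192.168.1.xyz:11435")
# the service context makes the query engine use the local Ollama model for synthesis
service_context = ServiceContext.from_defaults(llm=llm)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)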
The only option is prompt engineering I'm afraid, or using a different LLM πŸ€” I've run into the same issue as well, tuning prompts can take ages :PSadge: gpt-3.5 is especially bad for this. Maybe the new gpt-3.5-turbo-1106 is better, or even gpt-4-turbo
@Logan M I could solve the issue by using the wizard-vicuna-uncensored model.
I was thinking I would have to fine-tune a llama2 model myself, but it wasn't required.
Another option, in case that hadn't worked, would have been to use a few-shot technique in the prompt?
I have seen some stuff related to this in llamaindex docs:
https://docs.llamaindex.ai/en/stable/examples/prompts/prompts_rag.html
Yea few shot prompting is definitely a good idea -- but it can also use up valuable context window space. Tradeoffs I suppose πŸ™‚
By the way, what were you referring to by tuning the prompt? Just specifying a custom prompt and stressing that it should not doubt the "evidence" didn't work. And passing a very long prompt to "exploit" the model isn't working, I guess because the prompt is already long since it contains all the documents/nodes.
Yea that's what I meant, just playing around with how the prompt is worded

LLMs tend to pay more attention to the start (and sometimes end) of the prompt
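As an illustration of that wording advice, here is a sketch of a QA template with the behavioural instruction placed right at the start (the actual prompts used in the thread were only shown in screenshots):

Plain Text
from llama_index.prompts import PromptTemplate

qa_template = PromptTemplate(
    "Answer using only the context below. Do not comment on how much "
    "evidence there is; simply summarize what the context says.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(text_qa_template=qa_template)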
@kapa.ai how to include prompts in ReActAgent.from_tools()
Let me check the knowledge base for you. One moment! :books:
----------------------
The ReActAgent.from_tools() method in LlamaIndex does not directly accept prompts as an argument. Instead, it takes a set of tools (which are instances of BaseTool or its subclasses) and an instance of a language model (such as OpenAI with a specified model like "gpt-4").

However, you can view the prompts used by the ReActAgent after it has been created. The get_prompts() method returns a dictionary where the keys are the names of the prompts and the values are the prompt templates. Here's an example of how to do this:

Plain Text
from llama_index.agent import ReActAgent
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

prompt_dict = agent.get_prompts()
for k, v in prompt_dict.items():
    print(f"Prompt: {k}\n\nValue: {v.template}")


In this example, multiply_tool and add_tool are instances of BaseTool or its subclasses. You would replace these with the actual tools you want to use.

For more information, you can refer to the LlamaIndex documentation and the example notebook on GitHub.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how to load multiple JSON files using JSONReader?
Let me check the knowledge base for you. One moment! :books:
----------------------
To load multiple JSON files using JSONReader in LlamaIndex, you can create an instance of JSONReader and call the load_data() method for each file. Here's an example:

Plain Text
from llama_index.readers import JSONReader

# Create an instance of JSONReader
reader = JSONReader()

# List of JSON files
json_files = ["file1.json", "file2.json", "file3.json"]

# Load data from each file
documents = []
for file in json_files:
    documents.extend(reader.load_data(input_file=file))


In this example, file1.json, file2.json, and file3.json are the JSON files you want to load. Replace these with the actual paths to your JSON files.

The load_data() method returns a list of Document objects, which are then added to the documents list.

For more information, refer to the LlamaIndex documentation on JSONReader.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how to include few instructions to the default prompt of ReAct Agent?
Let me check the knowledge base for you. One moment! :books:
----------------------
To include additional instructions to the default prompt of a ReAct Agent in LlamaIndex, you can modify the react_system_header_str variable, which is a string that contains the default prompt. This string is used to create a PromptTemplate object, which is then used by the agent.

Here's an example of how to do this:

Plain Text
from llama_index.prompts import PromptTemplate

react_system_header_str = """\

You are designed to help with a variety of tasks, from answering questions \
    to providing summaries to other types of analyses.

## Tools
You have access to a wide variety of tools. You are responsible for using
the tools in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different tools
to complete each subtask.

You have access to the following tools:
{tool_desc}

## Output Format
To answer the question, please use the following format.

Thought: I need to use a tool to help me answer the question.
Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world", "num_beams": 5}})
Please ALWAYS start with a Thought.

Please use a valid JSON format for the Action Input. Do NOT do this {{'input': 'hello world', 'num_beams': 5}}.

If this format is used, the user will respond in the following format:

Observation: tool response

You should keep repeating the above format until you have enough information
to answer the question without using any more tools. At that point, you MUST respond
in one of the following two formats:

Thought: I can answer without using any more tools.
Answer: [your answer here]

Thought: I cannot answer the question with the provided tools.
Answer: Sorry, I cannot answer your query.

## Additional Rules

  • The answer MUST contain a sequence of bullet points that explain how you arrived at the answer. This can include aspects of the previous conversation history.
  • You MUST obey the function signature of each tool. Do NOT pass in no arguments if the function expects arguments.

## Your Instructions

  • [Add your instructions here]
"""

# Create a PromptTemplate object
prompt_template = PromptTemplate(react_system_header_str)


In this example, replace [Add your instructions here] with the instructions you want to add. You can add as many instructions as you want, each on a new line.

For more information, refer to the LlamaIndex documentation on PromptTemplate and the example notebook on GitHub.
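The answer above stops at building the template; to actually apply it to an existing agent, the pattern from the LlamaIndex prompt-customization docs is roughly the following (the prompt key should be checked against agent.get_prompts()):

Plain Text
react_system_prompt = PromptTemplate(react_system_header_str)

# swap in the customized system prompt and clear any prior reasoning state
agent.update_prompts({"agent_worker:system_prompt": react_system_prompt})
agent.reset()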
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------