I've found this example:
https://gpt-index.readthedocs.io/en/latest/examples/query_engine/json_query_engine.html but it's not what I was expecting.
I would like something that resembles more closely how Weaviate operates.
Ingesting the documents as JSON files, specifying which fields to vectorize, and then, at query time, being able to specify which of the vectorized JSON fields to use to retrieve the related documents.
Is this possible to do in LlamaIndex?
Hmm I think you'd have to write a function from scratch to do what you are describing
But really, it's not that hard either. You can do
json.load()
to load the json, iterate over it, and create document objects.
By default, text and metadata are embedded and also sent to the LLM, but you can control both
document = Document(
    text="...",
    metadata={"key1": "val"},
    excluded_llm_metadata_keys=["key1"],
    excluded_embed_metadata_keys=["key1"],
)
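Putting json.load() and the Document creation together, roughly something like this (a sketch -- the file name and the "internal_id" field are just made up for illustration):
import json
from llama_index import Document

with open("doc.json", "r") as f:
    data = json.load(f)

# keep every field as metadata, but keep "internal_id" out of the embedding and the LLM prompt
document = Document(
    text=data["text"],
    metadata={k: v for k, v in data.items() if k != "text"},
    excluded_embed_metadata_keys=["internal_id"],
    excluded_llm_metadata_keys=["internal_id"],
)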
Does this mean I have to define a custom Document subclass? And how does documents = SimpleDirectoryReader(...) then know that it's ingesting this new kind of document?
Also, the point is that if I have a JSON file for each text, each document would be formed from one JSON file, no?
from llama_index import Document
it's the same Document class, just created manually
^ guide for adding a custom loader to simple directory reader (i.e. for json)
but when I got to defining the rest of the fields as metadata, and whether or not each one should be indexed, I saw this:
there's a loader for JSONs, but it doesn't say anything about how LlamaIndex then deals with each field
it's pretty basic -- for most community-contributed loaders you have to read the code tbh lol
this looks right to me -- did it not work?
the code I started with works well, but I was a bit lost on how to combine this part
metadata={"key1": "val"},
excluded_llm_metadata_keys=["key1"],
excluded_embed_metadata_keys=["key1"]
)
with the default behaviour
the file_path of the metadata is vectorized
so the vector associated with a file includes in its context not only the text of the file but also the path?
yea by default all attached metadata is embedded and also read by the LLM -- you can control this with those excluded attributes
from llama_index.schema import MetadataMode
print(document.get_content(metadata_mode=MetadataMode.EMBED))
That's what the embedding model ends up getting under the hood
not in my case. I've executed the code above and it confirms it takes the text of the file and only the file_path from the metadata. The rest of the metadata fields are not taken by default
but why does it take the path of the file by default? And why should it take creation_date, file type, etc. by default? I think this puts irrelevant info into the vector space in most cases
I can exclude the file_path from the embeddings with excluded_embed_metadata_keys, but then the rest of the metadata fields get excluded even if I don't specify them
the contrary happens with the metadata: if I define my own metadata when creating the Document, then I lose the predefined metadata fields for the documents
I'm a little lost, but this seems to work fine for me.
If you want to exclude filenames, feel free to exclude them, but in most cases it doesn't make any difference in the embeddings (or in fact it even helps, since most file names are descriptive, or a user might ask about a particular file)
Single test JSON
{"text": "Hello world!", "extra": "extra data"}
Test script
import json

from llama_index import SimpleDirectoryReader
from llama_index.readers.base import BaseReader
from llama_index.schema import Document


class JsonReader(BaseReader):
    def load_data(self, file, extra_info=None):
        with open(file, "r") as f:
            data = json.load(f)

        # load_data returns a list of Document objects
        extra_info = extra_info or {}
        extra_info["extra"] = data["extra"]
        return [Document(text=data["text"], extra_info=extra_info)]


reader = SimpleDirectoryReader(
    input_dir="./test_jsons", file_extractor={".json": JsonReader()}
)
documents = reader.load_data()
from llama_index.schema import MetadataMode
print("Embed data\n", documents[0].get_content(metadata_mode=MetadataMode.EMBED))
print("LLM data\n", documents[0].get_content(metadata_mode=MetadataMode.LLM))
print("All data\n", documents[0].get_content(metadata_mode=MetadataMode.ALL))
And output
Embed data
file_path: test_jsons/test.json
extra: extra data
Hello world!
LLM data
file_path: test_jsons/test.json
extra: extra data
Hello world!
All data
file_path: test_jsons/test.json
file_name: test.json
file_type: application/json
file_size: 48
creation_date: 2023-11-25
last_modified_date: 2023-11-25
last_accessed_date: 2023-11-25
extra: extra data
Hello world!
this is what I get when defining the metadata
So, when one defines some metadata and passes it to the Document in load_data, then the default metadata is gone.
The default exclusions are preserved, and if one adds more, they are just added.
In your function, instead of defining new metadata, use the extra_info that is passed in
extra_info = extra_info or {}
extra_info["extra"] = data["extra"]
return [Document(text=data["text"], extra_info=extra_info)]
Then the default metadata is not lost
I'll try it and let you know. Thanks Logan!
yes it works. Indeed, I didn't even have to include the first line you mentioned
the dictionary is defined and initially built in the Base.py class
So, new key/value pairs can be added just by using an assignment
I have a couple of questions @Logan M :
- when retrieving the documents from the index by making a query, is it possible to go through each of the retrieved documents and get some of their metadata fields, even if they were not indexed? I'd like to display that info along with the LLM-generated answer.
- when I create the documents and print the doc ids, these are different from the doc ids of the retrieved documents. Is this normal? There are only 2 documents ingested, and the ids are different when the retrieval is done...
it's also strange that, although I only ingest the documents a single time, the results give the same documents repeated with different doc_ids
get_formatted_sources is maybe being a little misleading
when you put Document objects into an index, they are broken into nodes, each with their own unique node_id. What is being reported in get_formatted_sources is actually the node_id
Each node has a ref_doc_id attribute pointing to the parent document
response.source_nodes[0].node.ref_doc_id
And metadata is accessed similarly
response.source_nodes[0].node.metadata
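So roughly, something like this should work for displaying metadata next to the answer (a sketch -- query_engine is whatever engine you built, and "extra" is just the example field from earlier):
response = query_engine.query("your question")
print(response.response)  # the LLM-generated answer

for source in response.source_nodes:
    # metadata stays on the node even for fields excluded from the embedding or the LLM
    print(source.node.ref_doc_id, source.score, source.node.metadata.get("extra"))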
@Logan M I'm trying to create a query engine where one can specify the number of results to be taken into account for the generation step and also customize the prompt. With query_engine2 I can customize the prompt but the top_k doesn't have an effect, while with query_engine3 I can change the top_k but it seems a customized prompt cannot be passed as an argument...
for query_engine2, retriever is not actually an argument that gets used -- you need to pass index.as_query_engine(text_qa_template=qa_template, similarity_top_k=4)
for query_engine3, you need to pass the prompt into the response synthesizer (this is where as_query_engine is also passing it)
response_synthesizer = get_response_synthesizer(text_qa_template=qa_template)
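i.e. roughly like this (a sketch, assuming a vector index called index and your qa_template):
from llama_index import get_response_synthesizer
from llama_index.query_engine import RetrieverQueryEngine

# retrieve the top 4 nodes and synthesize the answer with the custom QA prompt
retriever = index.as_retriever(similarity_top_k=4)
response_synthesizer = get_response_synthesizer(text_qa_template=qa_template)
query_engine3 = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)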
ok @Logan M, now query_engine2 uses the prompt and the top_k as intended, but it makes use of a "refine_template". Do you know why?
query_engine3 is not returning anything in the answer...
are two calls to the LLM made then? One for the initial query and one to synthesize it? Is this efficient if we are passing the same retrieved documents/nodes again?
right, there's a text_qa_template and a refine_template
If the nodes retrieved do not fit into a single LLM call, then it will use the text_qa_template to get an initial answer, and then use the refine_template to refine that answer for the rest of the chunks
The refine_template shows the LLM the existing answer, user query, and new context, and asks the LLM to either update or repeat the existing answer given the new context
Since you set the top_k to 4 (and I'm guessing you are using the default chunk size), this retrieves 4x1024 tokens worth of nodes, which needs to be split across LLM calls
Right, but a few of those options are similar to each other
For example, compact is just refine, but making every LLM call use as much context as possible (rather than one call per node)
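For example, the response mode can be set when building the synthesizer or the query engine -- roughly like this (a sketch, reusing your qa_template):
from llama_index import get_response_synthesizer

# "compact" packs as many retrieved chunks as possible into each LLM call,
# while "refine" makes one call per chunk and refines the answer as it goes
response_synthesizer = get_response_synthesizer(
    response_mode="compact",
    text_qa_template=qa_template,
)

# or directly on the query engine
query_engine = index.as_query_engine(
    response_mode="compact",
    text_qa_template=qa_template,
    similarity_top_k=4,
)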
got it. However, query_engine3 is still not picking up the template
where do you create the response synthesizer?
That's where the template has to go
yep, there was a typo. Thanks @Logan M
Hi @Logan M, I'm trying to tell the LLM to avoid giving its opinion and avoid saying that there's no evidence about the statements retrieved. I've changed the prompts for the qa_template and the refine_template. However, the answers continue to include sentences like: "there's no evidence about what the author says", etc.
Would it be possible to get rid of those warnings? I can't do it just with prompt tuning.
I could solve the problem by changing the LLM
llm = Ollama(model="wizard-vicuna-uncensored",base_url="http://192.168.1.xyz:11435")
apparently, many topics are biased towards "lack of evidence" responses in the default LLMs.
It's important to note that I had to restart the notebook kernel, apart from generating the index again. It seems Ollama keeps some state independently of re-running the cells of a notebook.
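For reference, this is roughly how I'm wiring the model in (a sketch -- documents and qa_template come from the earlier code, and the embedding model is left at its default):
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import Ollama

# point base_url at wherever Ollama is running
llm = Ollama(model="wizard-vicuna-uncensored", base_url="http://192.168.1.xyz:11435")
service_context = ServiceContext.from_defaults(llm=llm)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(text_qa_template=qa_template, similarity_top_k=4)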
The only option is prompt engineering I'm afraid, or using a different LLM. I've run into the same issue as well, tuning prompts can take ages. gpt-3.5 is especially bad for this. Maybe the new gpt-3.5-turbo-1106 is better, or even gpt-4-turbo
@Logan M I could solve the issue by using the wizard-vicuna-uncensored model.
I was thinking of having to fine-tune myself a llama2 model, but it wasn't required.
Another option, in case that had not worked, would have been to use a few-shot technique in the prompt?
I have seen some stuff related to this in llamaindex docs:
https://docs.llamaindex.ai/en/stable/examples/prompts/prompts_rag.html
Yea few-shot prompting is definitely a good idea -- but it can also use up valuable context window space. Tradeoffs I suppose
By the way, what were you referring to by tuning the prompt? Just specifying a custom prompt and stressing that it should not doubt the "evidence" didn't work. And passing a very long prompt to "exploit" the model isn't working either, I guess because the prompt is already long since it contains all the retrieved documents/nodes.
Yea that's what I meant, just playing around with how the prompt is worded
LLMs tend to pay more attention to the start (and sometimes end) of the prompt
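If you do try few-shot prompting, a minimal sketch could look like this (the worked Q/A example in the template is made up; {context_str} and {query_str} get filled in by the query engine):
from llama_index.prompts import PromptTemplate

few_shot_qa_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the query using only the context, without commenting on the "
    "strength of the evidence.\n"
    "Example:\n"
    "Query: What did the author build first?\n"
    "Answer: The author first built a small game in BASIC.\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(
    text_qa_template=few_shot_qa_template, similarity_top_k=4
)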
@kapa.ai how to include prompts in ReActAgent.from_tools()
Let me check the knowledge base for you. One moment! :books:
----------------------
The ReActAgent.from_tools() method in LlamaIndex does not directly accept prompts as an argument. Instead, it takes a set of tools (which are instances of BaseTool or its subclasses) and an instance of a language model (such as OpenAI with a specified model like "gpt-4").
However, you can view the prompts used by the ReActAgent after it has been created. The get_prompts() method returns a dictionary where the keys are the names of the prompts and the values are the prompt templates. Here's an example of how to do this:
from llama_index.agent import ReActAgent
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

prompt_dict = agent.get_prompts()
for k, v in prompt_dict.items():
    print(f"Prompt: {k}\n\nValue: {v.template}")
In this example, multiply_tool and add_tool are instances of BaseTool or its subclasses. You would replace these with the actual tools you want to use.
For more information, you can refer to the LlamaIndex documentation and the example notebook on GitHub.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how to load multiple JSON files using JSONReader?
Let me check the knowledge base for you. One moment! :books:
----------------------
To load multiple JSON files using JSONReader in LlamaIndex, you can create an instance of JSONReader and call the load_data() method for each file. Here's an example:
from llama_index.readers import JSONReader

# Create an instance of JSONReader
reader = JSONReader()

# List of JSON files
json_files = ["file1.json", "file2.json", "file3.json"]

# Load data from each file
documents = []
for file in json_files:
    documents.extend(reader.load_data(input_file=file))
In this example, file1.json, file2.json, and file3.json are the JSON files you want to load. Replace these with the actual paths to your JSON files.
The load_data() method returns a list of Document objects, which are then added to the documents list.
For more information, refer to the LlamaIndex documentation on JSONReader.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how to include few instructions to the default prompt of ReAct Agent?
Let me check the knowledge base for you. One moment! :books:
----------------------
To include additional instructions in the default prompt of a ReAct Agent in LlamaIndex, you can modify the react_system_header_str variable, which is a string that contains the default prompt. This string is used to create a PromptTemplate object, which is then used by the agent.
Here's an example of how to do this:
from llama_index.prompts import PromptTemplate
react_system_header_str = """\
You are designed to help with a variety of tasks, from answering questions \
to providing summaries to other types of analyses.
## Tools
You have access to a wide variety of tools. You are responsible for using
the tools in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different tools
to complete each subtask.
You have access to the following tools:
{tool_desc}
## Output Format
To answer the question, please use the following format.
Thought: I need to use a tool to help me answer the question.
Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world", "num_beams": 5}})
Please ALWAYS start with a Thought.
Please use a valid JSON format for the Action Input. Do NOT do this {{'input': 'hello world', 'num_beams': 5}}.
If this format is used, the user will respond in the following format:
Observation: tool response
You should keep repeating the above format until you have enough information
to answer the question without using any more tools. At that point, you MUST respond
in one of the following two formats:
Thought: I can answer without using any more tools.
Answer: [your answer here]

Thought: I cannot answer the question with the provided tools.
Answer: Sorry, I cannot answer your query.
## Additional Rules
- The answer MUST contain a sequence of bullet points that explain how you arrived at the answer. This can include aspects of the previous conversation history.
- You MUST obey the function signature of each tool. Do NOT pass in no arguments if the function expects arguments.
## Your Instructions
- [Add your instructions here]
"""
# Create a PromptTemplate object
prompt_template = PromptTemplate(react_system_header_str)
In this example, replace [Add your instructions here] with the instructions you want to add. You can add as many instructions as you want, each on a new line.
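Assuming the standard prompt keys (this part is an assumption -- check agent.get_prompts() to confirm the exact key in your version), the modified template can then be applied to an existing agent with update_prompts():
# "agent_worker:system_prompt" is an assumed key; verify it via agent.get_prompts()
agent.update_prompts({"agent_worker:system_prompt": prompt_template})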
For more information, refer to the LlamaIndex documentation on PromptTemplate and the example notebook on GitHub.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------