
Hello, I am working on a test RAG system using LlamaParse, and I am having some issues with the output. I want to know whether the data I am parsing (test data I am using to create a template for my actual data) is good, or whether the errors I am getting are due to something else.

Attached is the CSV file with a little data that I am working with.

My code looks as follows:
Plain Text
def indexing_function():
  Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
  Settings.llm = OpenAI(model="gpt-4o")
  Settings.chunk_size = 250
  Settings.chunk_overlap = 50

  parser = LlamaParse(verbose = True, premium_mode=True, show_progress=True)
  csv_file_extractor = {".csv": parser}

  db = chromadb.PersistentClient(path=db_path)

  test_documentation = SimpleDirectoryReader("./CSV Data", file_extractor=csv_file_extractor).load_data()

  docs_collection = db.get_or_create_collection("docs")
  docs_vector_store = ChromaVectorStore(chroma_collection=docs_collection)
  docs_storage_context = StorageContext.from_defaults(vector_store=docs_vector_store)
  index = VectorStoreIndex.from_documents(
        test_documentation,
        storage_context=docs_storage_context,
        embed_model=OpenAIEmbedding(model="text-embedding-3-large"),
        transformations=[SentenceSplitter(chunk_size=250, chunk_overlap=30)],
        show_progress=True
  )
  return index

def citation_engine(llm_4, index):
    data_engine = CitationQueryEngine.from_args(
        index,
        llm=llm_4,
        metadata_mode=ToolMetadata(
            name="test_documentation",
            description=(
                "Information regarding FedRAMP Controls"
            ),
        ),
        similarity_top_k=2,
    )
    return data_engine

def query_engine(data_engine):
    query_tool = QueryEngineTool(
        query_engine=data_engine,
        metadata=ToolMetadata(
            name="docs",
            description="Information regarding FedRAMP Controls"
        )
    )
    query_engine_tools = [query_tool]
    return query_engine_tools

def rag_model(prompt, query_engine_tools, llm_4):
    llm_predictor = LLMPredictor(llm_4)
    decompose_transform = DecomposeQueryTransform(llm_predictor, verbose=True)

    react_agent = ReActAgent.from_tools(
        tools = query_engine_tools, 
        llm = llm_4,
        verbose=True
    )
    
    query_engine = MultiStepQueryEngine(
        query_engine = react_agent,
        query_transform=decompose_transform,
    )

    response = query_engine.query(prompt)
    return response

def main(prompt, query_engine_tools, llm_4):
    response = rag_model(prompt, query_engine_tools, llm_4)


Let's say I ask a basic question such as prompt = '''Give me information on the Control "AC-1".'''

In the observation stage, all I get is "Observation: None of the provided sources contain information about AC-1." and then it spits out information about searching for the documentation online.

I've tried different chunk sizes, overlap sizes, etc.
Thanks for your time.
15 comments
Don't use the decompose query engine or the multistep query engine to wrap an agent. The agent is already doing that for you through the ReAct algorithm (this is probably the biggest issue imo).

Your chunk size is very small. I would just use the defaults. Your top-k is also very small compared to your chunk size.

Your tool description is pretty short. You could try adding more details about what the tool is useful for.
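As a rough sketch of those three changes (reusing the index, llm_4, and prompt names from the post; the top-k value and tool description here are only illustrative), the simplified wiring could look something like this:
Plain Text
from llama_index.core.agent import ReActAgent
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Citation engine over the existing index; default chunking, larger top-k
data_engine = CitationQueryEngine.from_args(
    index,
    llm=llm_4,
    similarity_top_k=10,
)

query_tool = QueryEngineTool(
    query_engine=data_engine,
    metadata=ToolMetadata(
        name="docs",
        description=(
            "Search the parsed FedRAMP control catalog. Useful for questions "
            "about a specific control ID (e.g. AC-1), its description, "
            "discussion, and High/Moderate/Low baselines."
        ),
    ),
)

# The ReAct agent already decomposes multi-part questions, so query it directly
# instead of wrapping it in DecomposeQueryTransform / MultiStepQueryEngine
react_agent = ReActAgent.from_tools([query_tool], llm=llm_4, verbose=True)
response = react_agent.query(prompt)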
Will do, I am going to try that right now.
One more thing: I saw in another post a suggestion to use Llama3.# instead of OpenAI when using ReAct. Would this apply to the parsing as well?
I made the recommended changes: I increased the chunk size to 1024 with a chunk overlap of 100, removed the decompose query engine and the multistep query engine, increased the top-k to 10 instead of 2, and added a much longer description of the tool.

Prompt: "Provide detailed information about the control 'AC-1,' including its objectives, implementation requirements, and any associated parameters or baselines (e.g., High, Moderate, and Low Baselines)."

This is my output:
Plain Text
Thought: The current language of the user is English. I need to use a tool to help me answer the question.
Action: docs
Action Input: {'input': 'AC-1'}
Observation: None of the provided sources contain information about AC-1.
> Running step {Hidden Info}. Step input: None
Thought: I cannot answer the question with the provided tools.
So it ran your query engine with the prompt "AC-1", and none of the results came back with anything that mentioned AC-1.
That's odd, because in the CSV I shared (the one I am using), row 2 has AC-1. In the parsed content in LlamaCloud, under the JSON for the results, I have this:

Plain Text
"items": [
        {
          "type": "table",
          "rows": [
            [
              "Family",
              "Control ID",
              "Control Name",
              "NIST Control Description",
              "NIST Discussion",
              "Additional FedRAMP Requirements and Guidance",
              "High Baseline",
              "Moderate Baseline",
              "Low Baseline"
            ],
            [
              "ACCESS CONTROL",
              "AC-1",
              "Policy and Procedures",
              "a. Develop, document, and disseminate to [Assignment: organization-defined personnel or roles]:,  1. [Selection (one or more): Organization-level, Mission/business process-level, System-level] access control policy that:,  (a) Addresses purpose, scope, roles, responsibilities, management commitment, coordination among organizational entities, and compliance, and,  (b) Is consistent with applicable laws, executive orders, directives, regulations, policies, standards, and guidelines, and,  2. Procedures to facilitate the implementation of the access control policy and the associated access controls,   b. Designate an [Assignment: organization-defined official] to manage the development, documentation, and dissemination of the access control policy and procedures, and,  c. Review and update the current access control:,  1. Policy [Assignment: organization-defined frequency] and following [Assignment: organization-defined events], and,  2. Procedures [Assignment: organization-defined frequency] and following [Assignment: organization-defined events].",
              "Access control policy and procedures address the controls in the AC family that are implemented within systems and organizations. The risk management strategy is an important factor in establishing such policies and procedures. Policies and procedures contribute to security and privacy assurance. Therefore, it is important that security and privacy programs collaborate on the development of access control policy and procedures. Security and privacy program policies and procedures at the organization level are preferable, in general, and may obviate the need for mission- or system-specific policies and procedures. The policy can be included as part of the general security and privacy policy or be represented by multiple policies reflecting the complex nature of organizations. Procedures can be established for security and privacy programs, for mission or business processes, and for systems, if needed. Procedures describe how the policies or controls are implemented and can be directed at the individual or role that is the object of the procedure. Procedures can be documented in system security and privacy plans or in one or more separate documents. Events that may precipitate an update to access control policy and procedures include assessment or audit findings, security incidents or breaches, or changes in laws, executive orders, directives, regulations, policies, standards, and guidelines. Simply restating controls does not constitute an organizational policy or procedure.",
              "N/A",
              "AC-1 (c) (1) [at least annually], AC-1 (c) (2) [at least annually] [significant changes]",
              "AC-1 (c) (1) [at least every 3 years] , AC-1 (c) (2) [at least annually] [significant changes]",
              "AC-1 (c) (1) [at least every 3 years] , AC-1 (c) (2) [at least annually] [significant changes]"
            ],
I guess it wasn't retrieved as part of the top-k? You can test the query engine on its own to confirm this.
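A quick way to check this, bypassing the agent entirely, is to run the retriever (or the citation engine) directly and inspect what comes back; this sketch assumes the index and data_engine objects from the code above:
Plain Text
# Inspect raw retrieval for the literal query "AC-1"
retriever = index.as_retriever(similarity_top_k=10)
for node_with_score in retriever.retrieve("AC-1"):
    print(node_with_score.score, node_with_score.node.get_content()[:200])

# Or run the citation query engine on its own, without the ReAct agent
print(data_engine.query('Give me information on the Control "AC-1".'))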
Honestly, CSVs are a bad choice for RAG. In most cases, you want a more structured lookup (like running SQL queries or pandas commands).
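For example, a structured lookup over the same table could be as simple as the pandas sketch below (the file name is hypothetical and the column names are taken from the parsed table shown above, so adjust them to the real CSV headers). LlamaIndex's experimental PandasQueryEngine is another option if you want natural-language questions translated into pandas expressions.
Plain Text
import pandas as pd

# Exact-match lookup on the Control ID column instead of embedding retrieval
# (file name is hypothetical; column names follow the parsed table above)
df = pd.read_csv("CSV Data/fedramp_controls.csv")
row = df[df["Control ID"] == "AC-1"]
print(row[["Control Name", "NIST Control Description", "Moderate Baseline"]].to_string(index=False))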
Ahh I see. It makes sense.
If the CSV is small, I wouldn't even chunk it; you can retrieve the entire file.
I used different versions of the same data, for example JSON, text, CSV, and Excel.
I have approximately 18 different CSV files, so I used SimpleDirectoryReader to load them all from the folder and then chunk them.
I would create one node per CSV file and not do any chunking (but that's assuming the token size of each CSV is manageable).
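A minimal sketch of that approach, assuming each CSV's token count fits comfortably within the embedding model's limit and reusing the test_documentation and docs_storage_context objects from the post, is to turn each loaded document into a single node and build the index without any splitter:
Plain Text
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

# One node per CSV file: no SentenceSplitter, no chunk_size/chunk_overlap
nodes = [
    TextNode(text=doc.text, metadata=doc.metadata)
    for doc in test_documentation
]
index = VectorStoreIndex(
    nodes,
    storage_context=docs_storage_context,
    show_progress=True,
)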
Luckily it's doing that. Since the CSV files are pretty manageable, it parses only 18 nodes (1 per file) and then generates embeddings (after I removed the chunking from the code). I tested without the chunking code and increased the top-k to 20, but I am still getting the same output.
I might try with another file type tbh. Thank you very much for your help. I really appreciate it.