Hi! This is not related to LlamaIndex, but an overall question. I would appreciate your help.

I have one pdf (pdfA) on one side and multiple pdfs (pdfsN) on the other side.

I need to scan those pdfs and list all the contradictions between pdfA and pdfsN.

For example, if somewhere in pdfA it is written that the company made $30mil revenue in 2021, but somewhere in pdfsN it is written that the company made revenue of $40mil (or even made a loss), I need to output it.

Could you please help me with some ideas on how to develop it?
The only approach that comes to my mind is a pairwise comparison between page i of pdfA and page i of pdfsN
Maybe another approach is extracting structured statements/facts from each document, and then doing a pairwise comparison?

i.e. using a pydantic program to extract something like

Plain Text
from pydantic import BaseModel

class Fact(BaseModel):
    statement: str
    evidence: str


And then comparing facts between documents using an NLI model?
doing a naive pairwise comparison with an LLM will be slow and expensive, just trying to think of cheaper options 🙂
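For reference, a minimal sketch of the NLI step using an off-the-shelf MNLI model from Hugging Face (the specific checkpoint and its label order are assumptions -- check the model card of whatever model you pick):

Plain Text
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: any MNLI-trained checkpoint works here; label order varies by model.
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The company made $30mil revenue in 2021"        # fact from pdfA
hypothesis = "The company made revenue of $40mil in 2021"  # fact from pdfsN

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# roberta-large-mnli label order: [contradiction, neutral, entailment]
for label, p in zip(["contradiction", "neutral", "entailment"], probs):
    print(label, round(p.item(), 3))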
Hi @Logan M. Thanks for the feedback.

What do you mean by structured statements/facts, and how can I extract them?
Using the function calling API. The pydantic program in llama-index is a good example

https://gpt-index.readthedocs.io/en/stable/examples/output_parsing/openai_pydantic_program.html
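Roughly, wiring the `Fact` model into that program pattern would look like this (import paths match the llama-index version those docs are for, and the prompt here is just a sketch -- tune it for your documents):

Plain Text
from pydantic import BaseModel
from llama_index.program import OpenAIPydanticProgram

class Fact(BaseModel):
    statement: str
    evidence: str

# The program calls the OpenAI function-calling API and parses the result
# directly into the pydantic model.
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Fact,
    prompt_template_str="Extract one verifiable fact from this text:\n{text}",
    verbose=True,
)

fact = program(text="The company made $30mil revenue in 2021.")
print(fact.statement, "|", fact.evidence)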
@Logan M I'm still a bit confused about the approach you proposed.

1) Create a list of Fact objects for all text chunks inside each pdf using the OpenAI Pydantic Program. So let's say I split each pdf into chunks of ~512 tokens (since that is the context window of the NLI model). I need to create a Fact object for each of those chunks (since the number of tokens in one pdf is way beyond 4096).

2) Use an NLI model for pairwise comparison between Facts from pdfA and Facts from pdfsN

3) After I set a threshold on the NLI output and pick the flagged chunks, I can then use those as input for an LLM.

Did I understand your approach correctly?
1) I don't think each chunk needs to be 512 -- it could probably be anywhere up to about 3000 tokens. All that really matters is the length of the actual fact produced

2) Yes

3) Yea so the NLI model should give you a best guess at contradictions -- so you could identify the contradicting facts and pass those to an LLM to explain/summarize if needed
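Putting the three steps together, the overall flow would look something like this (`chunks_from`, `extract_facts`, and `nli_contradiction_prob` are hypothetical helpers wrapping your pdf parsing, the pydantic program, and the NLI model):

Plain Text
from itertools import product

# Hypothetical helpers -- not real library calls:
#   chunks_from(pdf)             -> list of text chunks
#   extract_facts(chunk)         -> list of Fact objects (pydantic program)
#   nli_contradiction_prob(a, b) -> contradiction probability from the NLI model
facts_a = [f for chunk in chunks_from(pdf_a) for f in extract_facts(chunk)]
facts_n = [f for pdf in pdfs_n for chunk in chunks_from(pdf) for f in extract_facts(chunk)]

contradictions = []
for fa, fn in product(facts_a, facts_n):
    score = nli_contradiction_prob(fa.statement, fn.statement)
    if score > 0.9:  # threshold is a tunable assumption
        contradictions.append((fa, fn, score))

# Then pass `contradictions` to an LLM to explain/summarize each pair if needed.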
@Logan M Great!

Just for clarification, by "contradictions" here I mean "content discrepancies" (maybe a better term for this)
Yea I thiiiink NLI should be ok for this -- NLI models are trained to predict neutral, agreement, contradiction

i.e. "I made $10" and "I made $20" should be a contradiction (I think lol)
e.g.:

text-1:
The company made $50mil in 2020

text-2:
The company's revenue in 2020 was $20mil

text-3:
I love cookies

text-1, text-2 -> contradiction

text-3 -> irrelevant (not contradiction)
@Logan M sorry, here:

Plain Text
class Fact(BaseModel):
  statement: str
  evidence: str


statement is a hypothesis and evidence is a premise?
pretty much -- not sure if both are actually needed, was just thinking of what info would be useful for the task lol
Hi @Logan M ! I'm trying to use the program pattern you sent me yesterday, but it keeps extracting super irrelevant information.

Plain Text
prompt_template_str = """\
This is a text from a page from a pdf: {page_text}. \
I need to extract some statements (if they exist) from that page, which will be used for fact checking (NLI application).

Generate a Page object, with: \
1) Whether there is any interesting information/statements in the page which can be used for fact checking.
2) A list of Statement objects derived from that page, each with a statement and a range (span of text) indicating where you took that statement from.

Note: If there is nothing interesting in a page (nothing to be used for fact checking), set the `contains_information_for_fact_checking` parameter to False and leave `statements` as an empty list.
\
"""

from typing import List

from pydantic import BaseModel

class Statement(BaseModel):
    """Data model for a statement retrieved from a pdf page"""
    statement: str
    range: List[int]

class Page(BaseModel):
    """Data model for a pdf page"""
    contains_information_for_fact_checking: bool
    statements: List[Statement]


I used a sample (dummy) pdf from the internet and there is definitely nothing interesting there for fact checking, but the program keeps outputting something.
Here's the output:

Plain Text
Function call: Page with args: {
  "contains_interesting_information": true,
  "statements": [
    {
      "statement": "This is a small demonstration .pdf file",
      "range": [19, 52]
    },
    {
      "statement": "Boring, zzzzz.",
      "range": [211, 225]
    },
    {
      "statement": "Continued on page 2 ....",
      "range": [286, 310]
    }
  ]
}
Function call: Page with args: {
  "contains_interesting_information": true,
  "statements": [
    {
      "statement": "Yet more text.",
      "range": [35, 47]
    },
    {
      "statement": "And more text.",
      "range": [52, 66]
    },
    {
      "statement": "Oh, how boring typing this stuff.",
      "range": [137, 172]
    },
    {
      "statement": "More, a little more text.",
      "range": [204, 229]
    }
  ]
}
am I doing something wrong?
hmmm, you can probably simplify and clarify the prompt/models a bit. Maybe something like this? Tbh I'm not sure how an LLM is supposed to figure out the range component, so I left it commented out lol. Seems like something you can manually track yourself, since you are feeding the program pages/chunks?

Plain Text
prompt_template_str = """\
This is a text from a page from a pdf: {page_text}. \
I need to extract some statements (if they exist) from that page, which will be used for fact checking (NLI application).

Using only the information from the page, and not prior knowledge, generate a `Page` object with a list of statements.

Note: If there is nothing interesting in a page (nothing to be used for fact checking), leave `statements` as an empty list.
\
"""

from typing import List

from pydantic import BaseModel, Field

class Statement(BaseModel):
    """Data model for a statement retrieved from a pdf page"""
    statement: str = Field(description="A statement from a PDF that can be fact checked.")
    # range: List[int] = Field(description="...")

class Page(BaseModel):
    """Data model for a pdf page"""
    statements: List[Statement] = Field(default_factory=list, description="List of statements for fact checking.")
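If you do end up needing the range, one option is recovering it yourself after the fact with a plain string search over the page text you fed in -- rough sketch:

Plain Text
def locate(statement: str, page_text: str):
    """Best-effort character span of an extracted statement within its page."""
    start = page_text.find(statement)
    if start == -1:
        return None  # the LLM paraphrased; would need fuzzy matching instead
    return [start, start + len(statement)]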
Wow that output sure is something ha πŸ˜…
Maybe try a more realistic PDF for testing
Yeah, I will try a real PDF, but I wanted to first check if the program can distinguish "relevant" pages (that contain info for fact checking) from irrelevant ones
The output is still trash lol. Let me try with a real pdf (Amazon's annual report)
With this prompt it seems to work a bit better:

Plain Text
"""\
This is a text from a page from a pdf: {page_text}. \

This pdf contains only facts.

I have another pdf. My goal is to check if there are any contradictions between this pdf and that pdf.
For that, I need to extract some facts from this pdf.
\
"""
Sorry, Logan. How can I use Azure OpenAI instead of the regular OpenAI in the program code above?
Good question! You need to make sure you are using the latest Azure OpenAI API version

Then, make the azure LLM and pass it in as a kwarg to the pydantic program

Quick example of setting up an azure llm

https://gpt-index.readthedocs.io/en/stable/examples/customization/llms/AzureOpenAI.html
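Roughly, per that page (parameter names differ a bit between llama-index/openai versions, so treat this as a sketch):

Plain Text
from llama_index.llms import AzureOpenAI
from llama_index.program import OpenAIPydanticProgram

llm = AzureOpenAI(
    engine="my-deployment-name",  # your Azure deployment name
    model="gpt-35-turbo",
    api_key="<azure-api-key>",
    azure_endpoint="https://<resource>.openai.azure.com/",  # api_base on older versions
    api_version="2023-07-01-preview",
)

# `Page` and `prompt_template_str` are the ones defined earlier in the thread
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Page,
    prompt_template_str=prompt_template_str,
    llm=llm,  # pass the azure LLM in as a kwarg
)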