I want to compare a page from pdfA and a page from pdfsN. For that I need structured statements/facts, something like:

```python
from pydantic import BaseModel


class Fact(BaseModel):
    statement: str
    evidence: str
```

How can I extract them?
My plan is to create `Fact` objects for all text chunks inside each pdf using the OpenAI Pydantic Program. So let's say I split each pdf into chunks of ~512 tokens (since that is the context window of the NLI model). I need to create a `Fact` object for each of those chunks (since the number of tokens in one pdf is way beyond 4096).
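A minimal sketch of that pipeline, assuming llama_index's `OpenAIPydanticProgram` and a plain tiktoken-based splitter. The import path, the `fact_prompt` wording, and the `{chunk}` template variable are only illustrative assumptions, not the exact setup discussed here:

```python
import tiktoken
from pydantic import BaseModel
# Import path varies by llama_index version; adjust to yours.
from llama_index.program import OpenAIPydanticProgram


class Fact(BaseModel):
    statement: str
    evidence: str


# Split pdf text into ~512-token chunks to match the NLI model's context window.
enc = tiktoken.get_encoding("cl100k_base")


def split_into_chunks(text: str, chunk_size: int = 512) -> list[str]:
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


# Illustrative prompt; {chunk} is filled in on each call.
fact_prompt = (
    "Here is a chunk of text from a pdf:\n{chunk}\n"
    "Extract a fact from it as a Fact object, with a short statement and the "
    "evidence (the span of text it was taken from)."
)

# Requires OPENAI_API_KEY in the environment.
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Fact,
    prompt_template_str=fact_prompt,
)

pdf_a_text = "..."  # full text of pdfA, e.g. read with pypdf
facts_a = [program(chunk=c) for c in split_into_chunks(pdf_a_text)]
```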
Then I compare the `Fact`s from pdfA vs the `Fact`s in pdfsN and find similar chunks. I can then use those as an input for the LLM.
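One way to pair up similar facts across the pdfs is embedding similarity. A sketch assuming sentence-transformers; the checkpoint name and the example statements are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: any sentence-embedding model works; this checkpoint is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

facts_a = ["I made $10 last quarter.", "The office is in Berlin."]  # statements from pdfA
facts_n = ["I made $20 last quarter.", "Revenue grew by 5%."]       # statements from pdfsN

emb_a = model.encode(facts_a, convert_to_tensor=True)
emb_n = model.encode(facts_n, convert_to_tensor=True)

# Cosine similarity matrix: rows = pdfA facts, columns = pdfsN facts.
sims = util.cos_sim(emb_a, emb_n)

# For each fact in pdfA, keep the most similar fact from pdfsN as a candidate pair
# to send to the NLI model / LLM.
for i, fact in enumerate(facts_a):
    j = int(sims[i].argmax())
    print(f"{fact!r}  <->  {facts_n[j]!r}  (sim={float(sims[i][j]):.2f})")
```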
"I made $10" and "I made $20" should be a contradiction (I think lol).

So in the `Fact` model, is `statement` the hypothesis and `evidence` the premise?
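If that mapping holds, an off-the-shelf MNLI model should flag the $10/$20 example as a contradiction. A sketch using Hugging Face transformers; the `roberta-large-mnli` checkpoint is just one possible choice:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: any MNLI-style checkpoint works; roberta-large-mnli is an example.
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "I made $10"     # evidence from pdfA
hypothesis = "I made $20"  # statement from pdfsN

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)[0]
label = model.config.id2label[int(probs.argmax())]
print(label)  # expected to be CONTRADICTION for this pair
```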
I used the `program` pattern you sent me yesterday, but it keeps extracting super irrelevant information.

```python
from typing import List

from pydantic import BaseModel

prompt_template_str = """\
This is a text from a page from a pdf: {page_text}. \
I need to extract some statements (if they exist) from that page, which will be used for fact checking (NLI application). \
Generate a Page object, with: \
1) whether there is some interesting information/statements in the page which can be used for fact checking, and \
2) a list of Statement objects derived from that page, each with a statement and a range (the span of text the statement was taken from). \
Note: If there is nothing interesting in a page (nothing to be used for fact checking), label the `contains_information_for_fact_checking` parameter as False and leave `statements` as an empty list. \
"""


class Statement(BaseModel):
    """Data model for a statement retrieved from a pdf page."""

    statement: str
    range: List[int]


class Page(BaseModel):
    """Data model for a pdf page."""

    contains_information_for_fact_checking: bool
    statements: List[Statement]
```
```
Function call: Page with args: {
  "contains_interesting_information": true,
  "statements": [
    {"statement": "This is a small demonstration .pdf file", "range": [19, 52]},
    {"statement": "Boring, zzzzz.", "range": [211, 225]},
    {"statement": "Continued on page 2 ....", "range": [286, 310]}
  ]
}
Function call: Page with args: {
  "contains_interesting_information": true,
  "statements": [
    {"statement": "Yet more text.", "range": [35, 47]},
    {"statement": "And more text.", "range": [52, 66]},
    {"statement": "Oh, how boring typing this stuff.", "range": [137, 172]},
    {"statement": "More, a little more text.", "range": [204, 229]}
  ]
}
```
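For context, a page-level extraction program like the one above is presumably wired up roughly like this (a sketch assuming llama_index's `OpenAIPydanticProgram`; the import path varies by version, and `pages` is a placeholder for the per-page text you already have):

```python
# Import path varies by llama_index version; adjust to yours.
from llama_index.program import OpenAIPydanticProgram

program = OpenAIPydanticProgram.from_defaults(
    output_cls=Page,                          # the Pydantic model defined above
    prompt_template_str=prompt_template_str,  # the prompt defined above
    verbose=True,                             # prints the function calls, as in the output above
)

# pages: list of page texts extracted from the pdf, e.g. via pypdf
pages = ["...page 1 text...", "...page 2 text..."]
page_objects = [program(page_text=text) for text in pages]
```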
A revised version of the prompt and data models:

```python
from typing import List

from pydantic import BaseModel, Field

prompt_template_str = """\
This is a text from a page from a pdf: {page_text}. \
I need to extract some statements (if they exist) from that page, which will be used for fact checking (NLI application). \
Using only the information from the page, and not prior knowledge, generate a `Page` object with a list of statements. \
Note: If there is nothing interesting in a page (nothing to be used for fact checking), leave `statements` as an empty list. \
"""


class Statement(BaseModel):
    """Data model for a statement retrieved from a pdf page."""

    statement: str = Field(description="A statement from a PDF that can be fact checked.")
    # range: List[int] = Field(description="...")


class Page(BaseModel):
    """Data model for a pdf page."""

    statements: List[Statement] = Field(
        default_factory=list, description="List of statements for fact checking."
    )
```
"""\ This is a text from a page from a pdf: {page_text}. \ This pdf contains only facts. I have another pdf. My goal is to check, If there are some contradictions between this pdf and that pdf. For that, I need to extract some facts from this pdf. \ """