
@Logan M I have this use case for extracting differences between 400-page legal contracts. The client asks for the most similar contract to one they specify, and the goal is to find differences at a clause level, since the contracts have broadly similar structure in sections. To first fetch the most similar documents, we would have to construct some structure around the chunked documents and establish a similarity comparison method - we can't just compare individual chunks, we would have to compare sets of chunked documents between 2 contract files, arranged in some structure hierarchically I suppose? Not sure how to do that comparison across large documents in one go. Have you encountered this before, and do you have any ideas from the documentation that you suggest I explore straight away?
Commenting to follow along, interesting problem.
Yea certainly an interesting problem!

My gut reaction, you would need some kind of normalization step -- get the contract into some kind of expected structure, so that comparisons can be done piece by piece
https://docs.llamaindex.ai/en/stable/examples/query_engine/pydantic_query_engine.html


We recently added pydantic outputs to query engines (thanks @bmax ❀️ )

So if you can think of some structure to normalize a contract to, this could work quite well. It could fill out the structure as it iterates over the contract
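For illustration, a minimal sketch of what that normalization could look like, assuming the pre-0.10 `llama_index` import layout used in the linked docs; the `ContractOutline`/`Clause` models and the file name are hypothetical placeholders, while `output_cls` on `as_query_engine` is the parameter the pydantic query engine example uses.

```python
from typing import List

from pydantic import BaseModel
from llama_index import SimpleDirectoryReader, VectorStoreIndex


# Hypothetical target structure to normalize every contract into.
class Clause(BaseModel):
    heading: str
    text: str


class ContractOutline(BaseModel):
    title: str
    parties: List[str]
    clauses: List[Clause]


# Index one contract's chunks.
documents = SimpleDirectoryReader(input_files=["contract_a.pdf"]).load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask the query engine to fill out the pydantic structure from the retrieved text.
query_engine = index.as_query_engine(
    output_cls=ContractOutline,
    response_mode="compact",
)
outline = query_engine.query(
    "Extract the title, the parties, and the list of clauses from this contract."
)
print(outline)
```

In practice a single 400-page contract probably won't fit in one pass, so the same call would likely be run section by section and the partial outlines merged.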
@jerryjliu0 if you do have some insights, would appreciate that too
Thanks a lot!

Right, I guess the main challenges come down to -
  1. Getting contracts broken down into chunks that fit into a structure
  2. Comparison of entire contracts with structure constructed in 1 for similarity
  3. Having the structure decomposable into granular pieces (sections and clauses) so differences can be surfaced at the clause level
I guess the task boils down to 2 levels of similarity comparisons:
  1. Document Level:
    • Using the query, we filter to find the corresponding contract document the user is referring to (metadata filters here?)
    • Once we find the user's candidate contract, we filter other "contract" documents that are similar to the candidate contract, and choose top 1
  2. Clause level (items within the 2 contracts):
    • Find similar sections, clauses and outline how the 2 contracts are different across clauses and sections
Just trying to wrap my head around how I should load these 400 page documents to enable these similarity comparisons on both levels
To then of course use RAG for querying similarities on Level 2
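As a rough sketch of both levels, following the metadata-filter idea above (not a confirmed recipe): tag each contract's chunks with an identifier when loading, use a filtered retriever to stay inside one contract at a time, and then, clause by clause, pull the nearest matches from the other contract to diff with an LLM. The `contract_id` key and file names are made up, the imports follow the pre-0.10 `llama_index` layout, and metadata filtering only works with vector stores that support it.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Load each 400-page contract separately and tag its chunks so they can be
# told apart later ("contract_id" is a hypothetical metadata key).
documents = []
for path in ["contract_a.pdf", "contract_b.pdf"]:
    docs = SimpleDirectoryReader(input_files=[path]).load_data()
    for doc in docs:
        doc.metadata["contract_id"] = path
    documents.extend(docs)

index = VectorStoreIndex.from_documents(documents)

# Restrict retrieval to contract B only.
retriever_b = index.as_retriever(
    similarity_top_k=3,
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="contract_id", value="contract_b.pdf")]
    ),
)

# For a clause taken from contract A, fetch the closest candidates in
# contract B; each pair can then be handed to an LLM to describe differences.
clause_from_a = "4.2 Termination for convenience ..."
matches_in_b = retriever_b.retrieve(clause_from_a)
```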
@here do you guys think knowledge graphs might be useful to explore here?
I think not quite knowledge graphs, but there might be something in the node relationships to exploit here πŸ€”

Just a quick refresher: nodes allow you to set parent, children, next, and prev relationships
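For reference, here is how those relationships can be set by hand when building section/clause nodes; the node IDs and text below are illustrative only, and the import path matches the pre-0.10 `llama_index.schema` module.

```python
from llama_index.schema import NodeRelationship, RelatedNodeInfo, TextNode

# A section node that knows its clause children (IDs are made up).
section = TextNode(text="Section 4: Termination", id_="a-section-4")
section.relationships[NodeRelationship.CHILD] = [
    RelatedNodeInfo(node_id="a-clause-4-1"),
    RelatedNodeInfo(node_id="a-clause-4-2"),
]

# A clause node that points back to its parent section and to its neighbours,
# so a retrieved clause can be walked up to its section or across to adjacent clauses.
clause = TextNode(text="4.2 Termination for convenience ...", id_="a-clause-4-2")
clause.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id="a-section-4")
clause.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id="a-clause-4-1")
clause.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id="a-clause-4-3")
```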
Seems like this is something where I would take 2-5 contracts and build a POC from there πŸ˜… No need to worry about 400+ yet
Well one contract is 400 pages long, that's really where this problem starts
But yeah I get the spirit of the message - I'll try hacking and see where I get stuck
@Logan M hello, as you suggested I have created a FastAPI app for my query engine, but it fails to handle multiple requests at the same time and I get weird responses. Please help!
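One common pattern worth trying, assuming the symptom comes from a single synchronous query call blocking concurrent requests (not confirmed from the message above): build the index and engine once at startup and use the async `aquery` method inside an async FastAPI endpoint. The paths and route name here are placeholders.

```python
from fastapi import FastAPI
from llama_index import SimpleDirectoryReader, VectorStoreIndex

app = FastAPI()

# Build the index and query engine once at startup, not per request.
documents = SimpleDirectoryReader("contracts/").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()


@app.get("/query")
async def query(q: str) -> dict:
    # aquery is the async counterpart of query, so one slow request
    # does not block the event loop for the others.
    response = await query_engine.aquery(q)
    return {"answer": str(response)}
```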