I think if you insert your data with page numbers in the metadata, then you can retrieve and filter by page number?
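A minimal sketch of what I mean, assuming llama_index's `VectorStoreIndex` API (`pages` here is a hypothetical list of per-page strings you'd extract from the book yourself):

```python
from llama_index import Document, VectorStoreIndex
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# pages: hypothetical list of per-page strings extracted from the book
docs = [
    Document(text=page_text, metadata={"page": page_num})
    for page_num, page_text in enumerate(pages, start=1)
]
index = VectorStoreIndex.from_documents(docs)

# At query time, restrict retrieval to a single page via a metadata filter
query_engine = index.as_query_engine(
    filters=MetadataFilters(filters=[ExactMatchFilter(key="page", value=108)])
)
response = query_engine.query("Why did character X do this?")
```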
I can filter, but the LLM still doesn't have the concept of page numbers or the sequencing of pages. That's assuming you're able to block prior knowledge too (for popular books the model may already have been trained on them). Not sure how to do this; maybe it's a context window issue and I should ultimately be able to summarize from page 1 up to page X?
For instance, I'm on page 108. I want to ask something about this page: "Why did character X do this?" and the LLM will answer, "back on page 50, they did Y and Z, and so on"
Assuming you're able to block prior knowledge
-- yea, this is mostly done with prompt engineering (i.e. our default prompts tell the LLM to only use knowledge from the provided context, but it doesn't always do that)
I can filter, but the LLM still doesn't have the concept of page numbers or the sequencing of pages.
-- actually if this information is in the metadata of nodes, the LLM will see this. You could even write a custom retriever or postprocessor to sort by page number.
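Something like this for the postprocessor idea (a sketch against the ~0.9-era llama_index interfaces, which may differ in your version; it reuses the `index` from the snippet above):

```python
from typing import List, Optional

from llama_index.postprocessor.types import BaseNodePostprocessor
from llama_index.schema import NodeWithScore, QueryBundle


class PageSortPostprocessor(BaseNodePostprocessor):
    """Re-order retrieved chunks by their 'page' metadata so the LLM
    sees them in narrative order rather than similarity order."""

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        return sorted(nodes, key=lambda n: n.node.metadata.get("page", 0))


query_engine = index.as_query_engine(
    node_postprocessors=[PageSortPostprocessor()]
)
```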
For instance, I'm on page 108. I want to ask something about this page: "Why did character X do this?" and the LLM will answer, "back on page 50, they did Y and Z, and so on"
-- I wonder if the tree index would be good at this? It certainly wouldn't be fast though
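For reference, standing one up is only a couple of lines (a sketch reusing `docs` from the earlier snippet); the slowness is in the LLM calls used to build each summary layer:

```python
from llama_index import TreeIndex

# Builds hierarchical summaries bottom-up, so queries can traverse from
# book-level summaries down to specific passages.
tree_index = TreeIndex.from_documents(docs)
response = tree_index.as_query_engine().query("Why did character X do this?")
```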
A good knowledge graph would probably handle this too. But the hard part is building one 🙂 Hoping to make some KG refactors soon in the library 🙂
The metadata is something I don't have a complete grasp on yet. Even if I assign metadata and filter on it, how would the LLM know about the metadata itself? From what I understand it's opaque to the LLM and is only used in the retrieval process. Is my understanding wrong? When you say the LLM can see the metadata, do you mean it's also part of the prompt?
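(For what it's worth, in llama_index the metadata is not opaque by default: it gets templated into the node text that goes into the prompt, unless you explicitly exclude it. A quick way to inspect this, assuming the `docs` from the earlier sketch:)

```python
from llama_index.schema import MetadataMode

node = docs[0]
# What the LLM sees: metadata rendered as "key: value" lines above the text
print(node.get_content(metadata_mode=MetadataMode.LLM))

# Keep a key available for filtering but hide it from the LLM prompt:
node.excluded_llm_metadata_keys = ["page"]
```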
From what I gather... given the context window limitations, if I want a summary from page 1 to page X that is still very detailed (fiction needs all the nuances particular to that specific story; the LLM cannot infer them from general knowledge), then the best way is to generate synthetic data for each paragraph, summarize, etc., and fine-tune on that first? How do you think about this approach? 🙂
Fine-tuning will probably lead to the fastest system (since you wouldn't need multiple LLM hops to answer a query). But it's also the hardest approach to debug when things go wrong. Maybe worth a shot for this though
Thank you, I'll check out the metadata customization. Maybe a combination of cleaning up the inputs more and a knowledge graph is the way. I tried using privateGPT, but the context ends up being bits and pieces of the content that (understandably) aren't precise, because the LLM doesn't have an understanding of the narrative flow
Re this: why would it be hard? You mean I should know in advance the connections between paragraphs that are far from one another? Or are the page connections enough?
I was under the impression it's the LLM's role to semantically link the chunks, but I guess without dumping the whole novel into each prompt, there's only so much the embeddings can do
Oh, I was thinking of a richer knowledge graph than just pages/paragraphs, i.e. events, places, people
Ohh I see, can I do this all in the vector db part?
Okay, so thinking of a graph DB per se... Another thing is that I want to discover the details of the book as I go (it's fiction), so I can't have a manual preprocessing step to build out the graph; also practically, so I can apply it to any other book I don't know about yet
But I appreciate the pointers so far; what I'm getting is that I can do this mostly with llamaindex, and it's not an entirely different set of tooling that I need
Yea happy to help brainstorm a bit 🙂
Oh, would love to confirm this too: say I got the knowledge graph done, whatever connections the entities have, they all ultimately end up in the LLM prompt, right? Like:
context below
---
{KNOWLEDGE_GRAPH_CONTEXT} <-- so via arrows or notation that the LLM should be trained to understand? natural language seems too verbose
{NARRATIVE_CONTEXT}
---
answer based on above
{QUERY}
Because if it implies that I have to establish this whole context for each prompt, then perhaps using a generic LLM untrained on the whole story is an inefficient way to go about it
Or is there a way in the pipeline to programmatically plug in the knowledge graph, even without fine-tuning, before I build the prompt? I'm not too familiar with the inner workings, but my gut tells me an LLM is really just tokens in, tokens out, i.e. the earliest point of contact is the prompt
Yea if you had the knowledge graph, basically you can retrieve nodes that match what the user is asking, as well as the surrounding sub-graph, and you can format all this into a prompt. It's pretty neat, and works surprisingly well.
The prompt you describe above is essentially what it might look like
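End-to-end, something like this (a sketch using `KnowledgeGraphIndex`, reusing `docs` from earlier; the LLM extracts the triplets automatically at build time, so there's no manual curation step, and the triplets land in the prompt in compact `subject -> relation -> object` notation rather than verbose natural language):

```python
from llama_index import KnowledgeGraphIndex

# The LLM extracts (subject, relation, object) triplets per chunk at
# build time -- no manual graph curation needed.
kg_index = KnowledgeGraphIndex.from_documents(
    docs,
    max_triplets_per_chunk=5,
)

# include_text=True also pulls the source chunks into the prompt,
# i.e. the {NARRATIVE_CONTEXT} part of the template above.
query_engine = kg_index.as_query_engine(include_text=True)
response = query_engine.query("Why did character X do this?")
```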
Switching from code to natural language at the end feels all wonky, but it's interesting how the increasing smartness of the LLM offsets that lack of structure
I suppose the connections can be derived with this, but there's not much to be done about the precise details of the story other than dumping text into the model with a big enough context window, with the hope of reducing that context size by filtering out irrelevant chunks
PrivateGPT is so bad at this
For context, I tried it because it's the most popular "chat with pdf" open-source tool atm, thinking I could build on top of it
I think private gpt is just doing top-k retrieval, but I could be wrong lol
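i.e. roughly this, with no re-ranking or graph structure on top (sketch, reusing the earlier `index`):

```python
# Embed the question, grab the k most similar chunks, stuff them into
# the prompt -- no notion of narrative order or entity connections.
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("Why did character X do this?")
```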
Ohh yeah looks like it, so that's probably helpful for general knowledge but not fictional work
I'm reading up on KG + LLM talks and this definitely looks like the way to go. Now to build it...
There could be a fun use case here where one could autogenerate a whole wiki based on their progress in the novel. There could be a slider for certain milestones, or even per page, and the content would transform