hi guys, quick question - is it possible to save documents to disk? I couldn't find any reference to that in the docs.
I have a pipeline that needs to read documents from disk and index them. The fetching and the indexing happen in 2 different microservices. I want the first service to first store the documents to disk (S3/GCS) and then have the second service read those documents and embed them.
like, the files themselves or the document objects? Both are pretty easy

files I think are obvious

documents are pydantic objects, you should be able to just dump things

Plain Text
import json
from llama_index.core.schema import Document

# Document is a pydantic model, so dump each one to a plain dict
doc_dicts = [doc.model_dump() for doc in documents]
doc_json_str = json.dumps(doc_dicts)

# Rebuild Document objects from the dicts
doc_dicts = json.loads(doc_json_str)
documents = [Document.model_validate(doc_dict) for doc_dict in doc_dicts]
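For the original two-service question, a minimal sketch of persisting that JSON to disk and loading it back (the docs.json path is made up for illustration; swap in your S3/GCS client for the read/write step):

Plain Text
import json
from pathlib import Path

from llama_index.core.schema import Document

# Service 1: fetch documents, then persist them as JSON (hypothetical local path)
Path("docs.json").write_text(json.dumps([doc.model_dump() for doc in documents]))

# Service 2: load the JSON back into Document objects, then embed/index them
loaded_docs = [
    Document.model_validate(d) for d in json.loads(Path("docs.json").read_text())
]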
@Logan M amazing, thanks. And will that dump everything in the document? text, metadata, ids, etc?
BTW - do you know what the difference is between document.metadata and document.extra_info? I've seen some readers sometimes using extra_info and sometimes metadata
it should 🙂 the powers of pydantic
extra_info is an alias for metadata
kept for backwards compat
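A quick way to see the alias in action (just a sketch, assuming the current llama_index.core.schema.Document):

Plain Text
from llama_index.core.schema import Document

doc = Document(text="hello", metadata={"source": "example"})
print(doc.metadata)    # {'source': 'example'}
print(doc.extra_info)  # same dict, exposed via the deprecated extra_info alias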
gotcha 🙂 thanks a lot!
@Logan M hi, I've just tried to do model_dump on my documents and it doesn't seem to have this method. I also tried model_dump_json but that doesn't exist either
Do you know why?
Did you do it on the list or a single document?

A list won't have it, because it's a list.

A single document object will
i'm trying it on a single doc.
(Also, that's a pydantic v2 specific method)
from llama_index.core.schema import Document
If you have pydantic v1 somehow, then it's .dict() or .json()
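If you don't control which pydantic version is installed, a tiny helper like this covers both (the name dump_doc is made up for illustration):

Plain Text
def dump_doc(doc):
    # pydantic v2 exposes model_dump(); v1 only has dict()
    if hasattr(doc, "model_dump"):
        return doc.model_dump()
    return doc.dict()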
gotcha, dict() worked
i just did an extract-data-from-a-webpage thing using Gemini's JSON output, very nice.

though, one headache is that I actually need to input a large chunk of data, larger than the max output tokens (8192) that Gemini, or any other advanced LLM nowadays, is capable of. How do I retrieve the full response in LlamaIndex recursively? I can't split the input into chunks because that could lose the structured data.
No LLM has an 8192 context window. Gemini and OpenAI and whatnot are all over 100k these days

The 8192 limit could be OpenAI embeddings, if you are embedding things
Do you need embeddings?
I didn't state it clearly in this thread; I think I did better in a new message I just posted, but I can't find it now. Doesn't matter anyway.

I mean, I am extracting data from webpages; the workflow is:
  1. firecrawl scrapes the page into markdown
  2. feed it into the LLM; the input could already be tens of KBs, let's say 50KB
  3. extract the data; the extracted output could be a bit smaller, let's say 30KB
so, the context window is not a problem for literally any LLM, even the latest Llama 3.2 3b, but the max output tokens is. Gemini/claude-3.5-sonnet are already the largest but allow only 8192 tokens. In a chat client like ChatGPT, we have to reply with "continue" again and again to get the full output.

what about in LlamaIndex or any other workflow framework? How do I handle large output?

I will do embedding/reranking later, but I have to extract the data first.
I don't think there's an easy way to "continue" the output

Probably the best idea is just adding the current output to the input, and asking the LLM to finish
but it would mix the data up with previous responses. What I need to achieve now is to extract thousands of JSON items (another scenario could be proofreading), so the output is actually continuous.
sorry my English isn't quite accurate, but I guess you get the picture.
I don't think it would mix it up

"Here's some text, extract this json object."

And then

"Here's some data, extract this json object. Here's what you have so far, continue."
I would play around with something like that
Feels like the only way imo
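Something like this loop is what that suggestion looks like in code (just a sketch, not a built-in LlamaIndex workflow; the Gemini setup, the prompts, the round cap, and the <DONE> marker are all assumptions):

Plain Text
from llama_index.llms.gemini import Gemini  # assumes llama-index-llms-gemini is installed

llm = Gemini()
page_text = "..."  # markdown scraped by firecrawl

output = ""
for _ in range(10):  # cap the number of "continue" rounds
    prompt = (
        "Here's some text, extract every item as a JSON array. "
        "When everything is extracted, end your reply with <DONE>.\n\n"
        f"TEXT:\n{page_text}\n\n"
        f"Here's what you have extracted so far, continue from where it stops:\n{output}"
    )
    chunk = llm.complete(prompt).text
    output += chunk
    if "<DONE>" in chunk:
        break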
oh, now I get what you were saying: like writing a book, but feeding the latest paragraph/context/data into the next conversation so that it can pick up from the exact spot. Well, another form of "continue".
well, I am still a bit confused about whether it is feasible to instruct Gemini (or another smaller LLM) to follow the instruction and not miss a single item of data...

honestly, I am quite new to LlamaIndex in the first place; I thought there would already exist a workflow to recursively "GET THE FULL OUTPUT IN CHUNKS"