I didn't state this clearly in this thread; I think I explained it better in a newer message I posted, but I can't find it now. Doesn't matter anyway.
What I mean is: I'm extracting data from webpages. The workflow is (rough sketch below):
- Firecrawl scrapes the page into markdown
- feed that into an LLM; the input can already be tens of KB, say 50 KB
- extract the data; the output can be a bit smaller, say 30 KB
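For concreteness, here's a minimal sketch of that pipeline (assuming firecrawl-py and the OpenAI Python client; the model name, prompt, and URL are placeholders, and the exact `scrape_url` signature varies between firecrawl-py versions):

```python
# Sketch only: scrape a page to markdown, then ask an LLM to extract data.
# Assumes FIRECRAWL_API_KEY and OPENAI_API_KEY are set in the environment.
from firecrawl import FirecrawlApp
from openai import OpenAI

firecrawl = FirecrawlApp()
llm = OpenAI()

# Step 1: scrape the page into markdown (input can be ~50 KB).
page = firecrawl.scrape_url("https://example.com/some-page", formats=["markdown"])

# Step 2: one-shot extraction -- this is where the output-token cap bites.
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any model with a big context window
    messages=[
        {"role": "system", "content": "Extract the structured data from this page as JSON."},
        {"role": "user", "content": page.markdown},
    ],
)
extracted = response.choices[0].message.content
print(extracted)
```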
So the context window isn't a problem for literally any LLM, even the latest Llama 3.2 3B, but the max output tokens is. Gemini and Claude 3.5 Sonnet are already among the largest, and they still cap output at 8192 tokens. In a chat client like ChatGPT, you have to reply "continue" again and again to get the full output.
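Via the API, the equivalent of that manual "continue" dance is a loop that re-prompts while the model stops on the token limit. A rough sketch with Anthropic's Python SDK (model id is a placeholder; the naive string-join stitching would need work, since the output can get cut mid-JSON):

```python
# Sketch: keep asking the model to continue until it stops on its own,
# instead of hitting the 8192 max-output-token ceiling once and giving up.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_with_continuation(markdown: str, max_rounds: int = 10) -> str:
    messages = [
        {"role": "user", "content": f"Extract all data from this page:\n\n{markdown}"}
    ]
    parts = []
    for _ in range(max_rounds):
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model id
            max_tokens=8192,
            messages=messages,
        )
        chunk = resp.content[0].text
        parts.append(chunk)
        if resp.stop_reason != "max_tokens":  # model finished on its own
            break
        # Model was cut off: feed its partial answer back and ask it to continue.
        messages.append({"role": "assistant", "content": chunk})
        messages.append({"role": "user", "content": "continue"})
    return "".join(parts)
```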
What about in LlamaIndex or any other workflow-management framework? How do I handle large output there?
I'll do embedding/reranking later, but I have to extract the data first.