The main motivation here is keeping track of sources (the page number is inserted into the metadata)... at least that's what I assume lol
Also easy enough to concat the documents into a single one if you prefer, or you could roll your own pdf loader (the one in llama index is pretty basic tbh)
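e.g. something like this (rough sketch -- assumes a recent llama_index where the PDF loader emits one Document per page with the page number in metadata (`page_label` in newer versions, `extra_info` in older ones); `report.pdf` is just a placeholder):
```python
from llama_index import SimpleDirectoryReader, Document

# one Document per PDF page, each carrying the page number in its metadata
page_docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()

# Option A: keep the per-page documents and let the page metadata flow through to nodes.
# Option B: concat everything into a single Document if you'd rather chunk evenly.
full_text = "\n\n".join(d.text for d in page_docs)
single_doc = Document(text=full_text, metadata={"file_name": "report.pdf"})
```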
Hmm, my preference would be to embed the page number into the document text (instead of using the metadata field), since this method of pre-chunking by pages causes uneven chunks.
I'm trying to build some tools that extract various document-level and subsection-level metadata. Hence, considering a document as a whole and chunking evenly is preferred...
Not sure if anyone makes use of page metadata... my sense is that it is less important than extracting the semantic content...
That said, yes, citing the source probably becomes easier with this method. Still, my sense is that a single chunk should be able to span two pages.
So it seems that the chunking and node metadata should somehow be aware of the pages.
So that a single chunk could span multiple pages.
"pages": 1, 2
Yea like right now, I think the page number metadata is fairly useful since it shows up on response.source_nodes
But you are right, it does introduce some uneven chunking
Not sure what a good solution is, without introducing too much complexity lol
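For context, this is the bit I mean (sketch assuming the default setup; the exact metadata key depends on the loader and version):
```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()
index = VectorStoreIndex.from_documents(docs)
response = index.as_query_engine().query("What were the key findings?")

for source in response.source_nodes:
    # key name varies by loader/version: often "page_label"
    print(source.node.metadata.get("page_label"), source.score)
```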
I think it's better to consider a Document as something canonical, and accept a list of subdocuments as the chunking input.
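Purely hypothetical interface sketch (not existing llama_index API) of what I mean:
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubDocument:
    text: str
    label: str  # e.g. "page 3" or "section 2.1" -- a segmentation hint, not a hard chunk boundary

@dataclass
class CanonicalDocument:
    doc_id: str
    subdocs: List[SubDocument] = field(default_factory=list)

    @property
    def text(self) -> str:
        # the canonical full text; the chunker sees this plus the subdoc boundaries
        return "\n".join(s.text for s in self.subdocs)
```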
It is very document-dependent: for certain documents (e.g. PDFs, brochures), the page is a good segmentation unit, but for research papers it is not.
I get the feeling we can detect what kind of document it is from the first few pages and apply the appropriate chunking strategy.
This will be part of metadata extraction.
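Very rough sketch of the "classify, then pick a chunking strategy" idea -- the labels, prompt, and strategy table are all made up, and the llama_index.llms import path is version-dependent:
```python
from llama_index.llms import OpenAI

# hypothetical mapping from document type to chunking settings
STRATEGIES = {
    "brochure": {"chunk_by": "page"},
    "research_paper": {"chunk_by": "section", "chunk_size": 1024},
    "unknown": {"chunk_by": "fixed", "chunk_size": 512},
}

def pick_chunking_strategy(first_pages_text: str) -> dict:
    prompt = (
        "Classify this document as one of: brochure, research_paper, unknown. "
        "Answer with the label only.\n\n" + first_pages_text[:3000]
    )
    label = OpenAI(model="gpt-3.5-turbo").complete(prompt).text.strip().lower()
    return STRATEGIES.get(label, STRATEGIES["unknown"])
```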
Anyway, it's quite iffy, and generally this is the data-cleaning part of data science, which is very hard. But I think that to become a mature framework, LlamaIndex should implement these kinds of auto-cleaning defaults that "just work". The case needs to be proved out against a selection of hard tasks that are common in real-world use and on which the naive approaches completely fail.
Personally, I think manually constructing index structures is too high a burden on the user. Custom indices are also very brittle and fail when you try to incorporate more data sources. What you really have is a problem of "multiway joins" on unstructured data: the maintenance burden is high, and the engineer/user needs a very good understanding of LlamaIndex concepts. In 90-99% of cases that means they will fall back to the most naive approach, which can be found almost anywhere else, defeating the purpose of LlamaIndex. So the problem shows up when going from prototype to production: the user realizes they have no use for LlamaIndex, finds manual configuration too hard, and just uses a naive vector database approach.
Which can just as well be found in LangChain or other integrations.
And in the production use case, high recall can become very necessary, so the naive approaches will not be satisfactory: the user will discover the limits of the system and yearn for a better solution. Of course, proving this to the user is hard, but that is the job of the system developer.
Hence, I believe the right approach is to be able to appropriately sort, auto-label, extract metadata from, and structure the knowledge base, without needing to construct custom indices and labelled tools.
The user should only have a single interface. Upload whatever data I want, in any structure and format that I want.
The user should only need to do some minimal fine-tuning when the system requests human feedback, and even that requirement should be reduced as far as possible.
For instance, when a user uploads a new dataset, the system should report back the auto-labelling steps it has taken and allow the user to correct them via a few high-leverage parameters.
I think the niche that LlamaIndex needs to fill is constructing a knowledge base connector that is autotuned to whatever input is chucked in, and provably better than all other naive approaches.
The approach that should be taken for auto data cleaning/labelling is a "wide net" approach with multiple fallbacks. The aim is to provide as much contextual information as possible while falling back on increasingly approximate methods.
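To illustrate the fallback idea for one piece of metadata (the title) -- the extractor functions here are just placeholders, ordered from most contextual to most approximate:
```python
from typing import Callable, Optional

def title_from_metadata(doc: dict) -> Optional[str]:
    # best case: the loader already gave us an explicit title field
    return doc.get("metadata", {}).get("title")

def title_from_first_heading(doc: dict) -> Optional[str]:
    # approximate fallback: take the first non-empty line of the text
    for line in doc.get("text", "").splitlines():
        if line.strip():
            return line.strip()[:120]
    return None

def extract_title(doc: dict) -> str:
    fallbacks: list[Callable[[dict], Optional[str]]] = [
        title_from_metadata,       # most precise, often missing
        title_from_first_heading,  # increasingly approximate
    ]
    for fn in fallbacks:
        try:
            result = fn(doc)
            if result:
                return result
        except Exception:
            continue
    return "untitled"
```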
One example of a best-in-class-defaults integrator is MosaicML's platform:
https://github.com/mosaicml/composer (MosaicML recently got acquired by Databricks)
The main issue with the more complex data-ingestion approaches is runtime -- each of those things takes time to run
But, it makes sense. Data pre-processing is hard, and is a decent moat to build out
Yeah, it takes a lot of time, so it should not be the first interaction with the system, i.e. the prototype stage.
But definitely, when the user returns wanting to solve some edge cases, this optimized but more costly/time-consuming pipeline should be available.
Well, ideally, more and more pre-defined knobs can be turned on (e.g. extract doc title, extract doc keywords, extract subsections, etc.), and the user can see from an objective metric that recall/accuracy is going up.
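e.g. roughly along the lines of the metadata extractors that newer llama_index versions ship -- exact import paths and class names vary by version, so treat this as approximate:
```python
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    TitleExtractor,
    KeywordExtractor,
)

metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5),        # knob: document title (from the first few nodes)
        KeywordExtractor(keywords=10),  # knob: per-chunk keywords
        # subsection extraction would be another knob here
    ]
)
node_parser = SimpleNodeParser.from_defaults(metadata_extractor=metadata_extractor)
# nodes = node_parser.get_nodes_from_documents(docs)
# ...then measure retrieval recall with and without each knob enabled.
```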