@Logan M PDFs are first chunked into "documents" page by page, then each page is chunked into "nodes" that comply with the max-tokens limit, no?
That means chapters in a PDF get split page by page, sentences are cut in half when they continue on the next page, etc. Am I right? (This is what I noticed with the PDF I tried.)
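If I've read the defaults correctly, the flow is roughly this (a minimal sketch assuming the llama_index.core imports; the file name and chunk_size are just placeholders):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# For PDFs the reader emits one Document per page, not per chapter/section
docs = SimpleDirectoryReader(input_files=["my.pdf"]).load_data()
print(len(docs))  # == number of pages

# The splitter then chunks each page-Document independently, so a
# sentence that straddles a page break ends up split across two nodes
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)
```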
Also, PDF-to-plain-text conversion is horrible (not llamaindex's fault): a PDF has almost no notion of headings, no notion of columns, etc. (unless it was specifically authored for accessibility), so everything is treated as (dumb) flat text. I've also noticed that many PDF-to-text conversions end up producing one word, then a newline, then one word, then a newline.
Thus a sentence like "I like my black dog" becomes
I
like
my
black
dog
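You can see this for yourself by dumping the raw extraction; here's a quick sketch assuming pypdf (which I believe the default PDF reader uses under the hood); repr() makes the stray newlines visible:

```python
from pypdf import PdfReader

reader = PdfReader("my.pdf")  # placeholder file name
for page in reader.pages:
    # repr() shows the embedded '\n' between words, e.g. 'I\nlike\nmy\nblack\ndog'
    print(repr(page.extract_text()))
```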
And then, for a reason that eludes me, the \n characters are stripped from the text sent to be vectorized, and we end up with "ilikemyblackdog" without any spaces...
It seems that OpenAI doesn't care that the spaces are missing, but it's hard to believe it has no impact...
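Here's a toy repro of what I think is happening, plus a quick tokenizer check (a sketch only: cl100k_base is my assumption about the encoding, and the variable names are mine):

```python
import tiktoken

raw = "I\nlike\nmy\nblack\ndog"
fused = raw.replace("\n", "")    # -> 'Ilikemyblackdog', what I'm seeing
spaced = raw.replace("\n", " ")  # -> 'I like my black dog', what I'd expect

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(fused)), len(enc.encode(spaced)))
# The fused string breaks into different subword pieces than the spaced
# one, so I'd expect the resulting embedding to shift as well
```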