Do I get it right that we can now upload all books with LLaVA, without worrying about chunking? Storage-wise, are images converted into vectors of similar dimensions to text embeddings, or will they require more space?
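For the storage question: in a typical multimodal setup (CLIP-style encoders, which LLaVA builds on), an image is encoded into a single fixed-size vector, so per-vector storage is the same as for a text embedding of the same dimensionality. A rough back-of-the-envelope sketch, assuming a 768-dim float32 embedding (the actual dimension is model-dependent):

```python
import numpy as np

# Assumed figures: a CLIP-style encoder emits one fixed-size vector per
# image, so an image embedding costs the same as a text embedding of
# the same dimensionality.
DIM = 768            # assumed embedding dimension (model-dependent)
BYTES_PER_FLOAT = 4  # float32

def embedding_storage_bytes(n_vectors: int, dim: int = DIM) -> int:
    """Raw storage for n_vectors embeddings, ignoring index overhead."""
    return n_vectors * dim * BYTES_PER_FLOAT

# A 300-page book embedded one vector per page:
pages = 300
print(embedding_storage_bytes(pages))        # 921600 bytes
print(embedding_storage_bytes(pages) / 1e6)  # ~0.92 MB
```

So the raw vectors are tiny; what grows is anything extra you keep alongside them (the page images themselves, thumbnails, or index overhead).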
It's a matter of time, probably weeks or days, before we'll be 'reading' books just by recording a video of ourselves turning the pages. Really, all it takes is figuring out the right time interval and ROI for the captured frames, and then feeding the images to GPT-4V or LLaVA. But LLaVA is not there yet. It can understand the images, but it does not always want to extract the text. They put in some strict guardrails instructing LLaVA not to do anything if it 'sees' what looks like a book. You need to be very creative in your prompt and in how you capture the image.
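The capture loop can be sketched roughly like this. A sketch only, under stated assumptions: the interval and ROI values are placeholders (tuning them is exactly the open problem above), and the vision-model call is stubbed out as a comment:

```python
import numpy as np

# Placeholder values -- the "right" interval and ROI are what has to be
# tuned per recording setup.
CAPTURE_INTERVAL_S = 2.0        # assumed seconds between page turns
ROI = (100, 980, 200, 1700)     # assumed (y0, y1, x0, x1) page region

def sample_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Timestamps at which to grab a frame from the video."""
    n = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(n)]

def crop_roi(frame: np.ndarray, roi: tuple[int, int, int, int]) -> np.ndarray:
    """Cut the page region out of a full video frame."""
    y0, y1, x0, x1 = roi
    return frame[y0:y1, x0:x1]

# Each cropped frame would then go to GPT-4V/LLaVA with an OCR-style
# prompt, e.g. "Transcribe all text visible on this page."
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # stand-in for a video frame
page = crop_roi(frame, ROI)
print(sample_timestamps(10.0, CAPTURE_INTERVAL_S))  # [0.0, 2.0, 4.0, ...]
print(page.shape)                                   # (880, 1500, 3)
```

In practice you'd pull the frames with something like OpenCV's `VideoCapture` at each timestamp; the cropping and prompting is where the creativity against the guardrails comes in.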