LlamaIndex

Updated 6 months ago

TS

At a glance

The community member has a minimal Dockerized API using embeddings with ChatGPT to answer questions. When trying to process a full batch of wiki content, they encountered a failure due to the maximum context length being exceeded. They are looking for a way to process documents in batches and have the entire collection available at the end.

The community members discuss potential solutions, such as splitting the context properly, using HTTP Toolkit to debug the issue, and building an HTML loader to handle the wiki content. They also discuss the challenges of working with TypeScript dependencies and the CJS/ESM ecosystem.

There is no explicitly marked answer, but the community members collaborate to troubleshoot the issue and explore potential solutions.

Hey all, I've got a minimal Dockerized API using embeddings with ChatGPT to answer questions. Everything works great with a single wiki page imported and the storage context saved to and read from disk. I went to run the full batch of wiki content and got this failure:

BadRequestError: 400 This model's maximum context length is 8192 tokens, however you requested 31061 tokens (31061 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

Is there a way to process documents in batches, and have the entire collection available at the end? This is the line I'd like to take apart:

const index = await VectorStoreIndex.fromDocuments(allDocs, { storageContext: ctx });
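One way to take that line apart is to feed the documents through in fixed-size batches and note which batch throws. The sketch below is library-agnostic: `toBatches` and `processInBatches` are hypothetical helper names, and `handler` stands in for whatever indexing call you use (nothing here is a LlamaIndexTS API). Batching alone may not avoid the context-length error if a single document cannot be split, but it should narrow down which input triggers it.

```typescript
// Hypothetical helper: split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical driver: run `handler` on each batch in order and return the
// indexes of the batches whose handler threw, so they can be inspected later.
async function processInBatches<T>(
  items: T[],
  size: number,
  handler: (batch: T[], batchIndex: number) => Promise<void>,
): Promise<number[]> {
  const failedBatches: number[] = [];
  let batchIndex = 0;
  for (const batch of toBatches(items, size)) {
    try {
      await handler(batch, batchIndex);
    } catch {
      // Keep going so that one bad file does not block the rest of the import.
      failedBatches.push(batchIndex);
    }
    batchIndex++;
  }
  return failedBatches;
}
```

With a batch size of 1, the returned indexes point directly at the individual documents that fail.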

36 comments

This is in the typescript package hey? Pinging @Yi Ding 🏓

Yup, thanks

Thanks!

I think there might be an issue where we're not splitting the context properly for a certain wiki. Happy to help debug.

I see, so this might be a problem with the internal chunking. Maybe I can isolate a specific file that fails.

Yeah I recommend using HTTP Toolkit to see what is actually being sent to OpenAI. There might be a corner case in the way we're doing splitting.
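Before inspecting the actual requests, a rough size check can help rank which files to look at first. This is only a triage sketch: the figure of about 4 characters per token is a common rule of thumb for English text, not the tokenizer's real count, and `NamedText`, `estimateTokens`, and `findOversized` are hypothetical names. A large file is not necessarily the one that fails, since a working splitter should chunk it.

```typescript
// Hypothetical shape for a loaded file: a name plus its raw text.
interface NamedText {
  name: string;
  text: string;
}

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// Return the files whose estimated token count exceeds `tokenLimit`,
// largest first, so the most suspicious candidates come up front.
function findOversized(
  docs: NamedText[],
  tokenLimit: number,
): { name: string; estimatedTokens: number }[] {
  return docs
    .map((d) => ({ name: d.name, estimatedTokens: estimateTokens(d.text) }))
    .filter((d) => d.estimatedTokens > tokenLimit)
    .sort((a, b) => b.estimatedTokens - a.estimatedTokens);
}
```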

@Yi Ding Well, this might not be fair to LITS...First off, it's an HTML export from our wiki. Shouldn't be a problem, right? But have a look at the file contents 🙂

Ahh.... yeah, we don't currently have an HTML loader, and loading arbitrary HTML is actually quite difficult (may even be one of the hardest problems out there).

So a couple of things we can do:
1. We can build a simple HTML loader.
2. You can build a specialized HTML loader for your particular data source.
3. You can try to convert your HTML to Markdown first. If the resulting Markdown looks OK, then you can try using the chunker. Once again, even this is not foolproof, but worth a shot.

@Yi Ding Well, I've got some pretty challenging test files for an HTML loader. I was going to try and pull the raw MD/wiki text as a next step, but I will look through the LITS code base and docs, and see what it would take to implement a basic/pessimistic HTML loader. Thanks for your help!

Thanks! Yeah a basic HTML parser might be a wrapper around html-to-text so you could give that a try.
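To give a feel for what a basic, pessimistic HTML-to-text step involves, here is a dependency-free sketch. It is not html-to-text and not the loader that ended up in the library: it is a crude regex-based stand-in, and `htmlToPlainText` is a hypothetical name. Regexes cannot handle arbitrary HTML, so for real wiki exports a proper converter is the safer choice.

```typescript
// Crude stand-in for a real HTML-to-text converter (assumption: input is
// reasonably well-formed HTML; malformed markup may leak through).
function htmlToPlainText(html: string): string {
  return html
    // Drop <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Turn common block boundaries and <br> into newlines.
    .replace(/<\/(p|div|li|tr|h[1-6])\s*>|<br\s*\/?>/gi, "\n")
    // Strip every remaining tag.
    .replace(/<[^>]+>/g, " ")
    // Decode a handful of common entities; &amp; goes last to avoid double decoding.
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&")
    // Collapse runs of spaces and tabs, then tidy whitespace around newlines.
    .replace(/[ \t]+/g, " ")
    .replace(/ *\n\s*/g, "\n")
    .trim();
}
```

The output keeps paragraph boundaries as newlines, which gives a text splitter something to break on instead of one long run of markup.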

@Yi Ding Quick question - I like the development environment and I've got the hang of pnpm, etc. But what's the right way to use the updated library from within the codespaces docker? The example/package.json references "llamaindex" but that won't be the updated code I'm hacking on.

Use the apps/simple. It’s the same scripts but will refer to your latest build. You can run pnpm run dev to have it watch for changes and rebuild.

It’s a bit clunky, I know. Was just talking to someone about this yesterday. Have some ideas but need time to try them out.

@Yi Ding Will try it. I have to say that overall I'm impressed how easy it was to get in and hack on this code base.

Thanks! Appreciate you saying so and helping!

@Yi Ding Added html.ts to apps/simple (hacked-up copy of pdf.ts). I had to add ts-node to the package.json and re-run pnpm i. Then running npx ts-node html.ts fails with this output:

Hm. Just tried pdf.ts and it worked okay. So this seems like a dependency issue I've created...I'll keep digging.

Yeah adding ts-node to the package.json is a good idea probably.

I have it in the README as pnpx I think? But might as well just have it.

That error looked like it was related to not being able to properly load tiktoken. I haven't seen that in the apps/simple, but maybe we need to change the tsconfig somehow.

@Yi Ding The pdf.ts sample works fine, but my html.ts sample fails with this output:

TSError: ⨯ Unable to compile TypeScript:
html.ts:2:10 - error TS2305: Module '"llamaindex"' has no exported member 'HTMLReader'.

import { HTMLReader } from "llamaindex";

Ahh...

You'll need to export it from the barrel file.

index.ts

@Yi Ding DUH

You'll see a bunch of re-exports in the core/src/index.ts.

Barrel files are out of fashion so I might try to replace that at some point, but for now...

Didn't think to look for it.

Honestly, I kept on making fun of how hard Python dependency management is, and then I tried making a TS npm package that supports Node, Next, and ESM, and I will never make fun of the Python folks again.

@Yi Ding https://github.com/run-llama/LlamaIndexTS/pull/154

OK, I published a snapshot: llamaindex@0.0.0-20231026231921. Can you give it a try?

So I upgraded the package because there was a security alert (turns out unrelated) but in order to do so I had to use a dynamic import for CJS.

This CJS/ESM stuff is some of the most confusing stuff I've worked on in my career.

Yeah, I started with the latest version but didn't want to mess with the CJS/ESM thing, so I went back to the author's recommended earlier version.
