This is in the TypeScript package hey? Pinging @Yi Ding
I think there might be an issue where we're not splitting the context properly for a certain wiki. Happy to help debug.
I see, so this might be a problem with the internal chunking. Maybe I can isolate a specific file that fails.
Yeah I recommend using HTTP Toolkit to see what is actually being sent to OpenAI. There might be a corner case in the way we're doing splitting.
@Yi Ding Well, this might not be fair to LITS... First off, it's an HTML export from our wiki. Shouldn't be a problem, right? But have a look at the file contents.
Ahh.... yeah, we don't currently have an HTML loader, and loading arbitrary HTML is actually quite difficult (may even be one of the hardest problems out there).
So a couple of things we can do: 1. We can build a simple HTML loader. 2. You can build a specialized HTML loader for your particular data source. 3. You can try to convert your HTML to Markdown first. If the resulting Markdown looks OK then you can try using the chunker. Once again, even this is not foolproof, but worth a shot.
@Yi Ding Well, I've got some pretty challenging test files for an HTML loader. I was going to try and pull the raw MD/wiki text as a next step, but I will look through the LITS code base and docs, and see what it would take to implement a basic/pessimistic HTML loader. Thanks for your help!
Thanks! Yeah a basic HTML parser might be a wrapper around html-to-text so you could give that a try.
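For anyone following along, here is a rough idea of what a basic/pessimistic HTML loader could look like. This is only a dependency-free sketch: it uses regular expressions as a stand-in for html-to-text, and `SimpleDocument`, `htmlToPlainText`, and `loadHtmlString` are hypothetical names made up for illustration, not part of the llamaindex API. Real-world HTML (like a wiki export) will hit cases this does not handle.

```typescript
// Hypothetical document shape for illustration; NOT the library's Document class.
interface SimpleDocument {
  text: string;
  sourceId: string;
}

// Only a handful of named entities are decoded; anything else is left as-is.
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  nbsp: " ",
};

function decodeEntities(input: string): string {
  return input.replace(
    /&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g,
    (match: string, body: string): string => {
      if (body.startsWith("#")) {
        const isHex = body.startsWith("#x");
        const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Leave out-of-range references untouched instead of throwing.
        return code <= 0x10ffff ? String.fromCodePoint(code) : match;
      }
      return NAMED_ENTITIES[body] ?? match;
    },
  );
}

// Pessimistic conversion: drop anything that is clearly not content,
// turn common block-level tags into line breaks, strip every other tag.
function htmlToPlainText(html: string): string {
  const withoutHidden = html
    .replace(/<!--[\s\S]*?-->/g, " ")
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ");
  const withBreaks = withoutHidden.replace(
    /<\/?(p|div|br|li|tr|h[1-6]|section|article|table|ul|ol|pre|blockquote)\b[^>]*>/gi,
    "\n",
  );
  const withoutTags = withBreaks.replace(/<[^>]+>/g, " ");
  return decodeEntities(withoutTags)
    .split("\n")
    .map((line) => line.replace(/\s+/g, " ").trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

// Wraps the extracted text in the hypothetical document shape above.
function loadHtmlString(html: string, sourceId: string): SimpleDocument {
  return { text: htmlToPlainText(html), sourceId };
}
```

The resulting text could then be handed to the chunker the same way the PDF sample's text is. Swapping the regex pass for html-to-text should give noticeably better results on messy input.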
@Yi Ding Quick question - I like the development environment and I've got the hang of pnpm, etc. But what's the right way to use the updated library from within the codespaces docker? The example/package.json references "llamaindex" but that won't be the updated code I'm hacking on.
Use the apps/simple. It's the same scripts but will refer to your latest build. You can run pnpm run dev to have it watch for changes and rebuild.
It's a bit clunky, I know. Was just talking to someone about this yesterday. Have some ideas but need time to try them out.
@Yi Ding Will try it. I have to say that overall I'm impressed how easy it was to get in and hack on this code base.
Thanks! Appreciate you saying so and helping!
@Yi Ding Added html.ts to apps/simple (hacked-up copy of pdf.ts). I had to add ts-node to the package.json and re-run pnpm i. Then running npx ts-node html.ts
fails with this output:
Hm. Just tried pdf.ts and it worked okay. So this seems like a dependency issue I've created...I'll keep digging.
Yeah adding ts-node to the package.json is a good idea probably.
I have it in the README as pnpx I think? But might as well just have it.
That error looked like it was related to not being able to properly load tiktoken. I haven't seen that in the apps/simple, but maybe we need to change the tsconfig somehow.
@Yi Ding The pdf.ts sample works fine, but my html.ts sample fails with this output:
TSError: ⨯ Unable to compile TypeScript:
html.ts:2:10 - error TS2305: Module '"llamaindex"' has no exported member 'HTMLReader'.
import { HTMLReader } from "llamaindex";
You'll need to export it from the barrel file.
You'll see a bunch of re-exports in the core/src/index.ts.
Barrel files are out of fashion so I might try to replace that at some point, but for now...
Didn't think to look for it.
Honestly, I kept on making fun of how hard python dependency management is, and then I tried making a TS npm package that supports Node, Next, and ESM, and I will never make fun of the Python folks again.
OK, I published a snapshot: llamaindex@0.0.0-20231026231921. Can you give it a try?
So I upgraded the package because there was a security alert (turns out unrelated) but in order to do so I had to use a dynamic import for CJS.
This CJS/ESM stuff is some of the most confusing stuff I've worked on in my career.
Yeah, I started with the latest version but didn't want to mess with the CJS/ESM thing, so I went back to the author's recommended earlier version.
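For context, the dynamic-import workaround mentioned above usually looks something like the sketch below: instead of a top-level `import`/`require` of an ESM-only package, the module is loaded lazily with `import()` and the promise is cached. `lazyImport` is a hypothetical helper written for illustration, not the code that went into the snapshot. One thing to watch for: with `"module": "commonjs"` TypeScript rewrites `import()` into a `require()` call, so the compiler settings likely need `"module": "node16"` (or similar) for this to actually help.

```typescript
// Hypothetical helper: defers loading a module until first use and caches
// the in-flight promise so the loader runs at most once.
function lazyImport<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      // Note: a rejected promise is cached too; a real version may want to
      // reset `pending` on failure so a later call can retry.
      pending = load();
    }
    return pending;
  };
}

// Intended usage (package name is a placeholder, not a real dependency):
//   const getConverter = lazyImport(() => import("some-esm-only-package"));
//   const mod = await getConverter();
```

The trade-off is that every caller becomes async, which is part of why the CJS/ESM split is so awkward to paper over.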