Would it be feasible to use your React frontend and integrate
https://github.com/xenova/transformers.js into it, so embeddings could be generated right on the client side with ONNX? Or do you think that the available models are still too small for this?
Hmmm, it's hard to say without trying it.

But personally, in my setup, it makes more sense to add embeddings to the backend index_server.py file. You can see how to customize embeddings here: https://gpt-index.readthedocs.io/en/latest/how_to/embeddings.html#custom-embeddings

And there are many embedding models on huggingface. I would try something from sentence transformers like https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
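Roughly, the custom-embedding route looks like this - a minimal sketch assuming the gpt_index/llama_index API of that era, where the documented path was wrapping LangChain's HuggingFaceEmbeddings in LangchainEmbedding (import paths may differ by version, and the "./data" folder and query string are just placeholders):

```python
# Sketch: swap the default OpenAI embeddings for a local sentence-transformers model.
# Assumes an early llama_index (gpt_index-era) release; import paths may differ in your version.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import GPTSimpleVectorIndex, LangchainEmbedding, ServiceContext, SimpleDirectoryReader

# Wrap the LangChain embedding class so llama_index can use it for indexing and queries
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

documents = SimpleDirectoryReader("./data").load_data()  # "./data" is a placeholder folder
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

response = index.query("What does the document say about pricing?")
print(response)
```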
Yep, that's one of the models also available client-side in the repo I mentioned.
I've seen LlamaIndex applied to semantic search tasks.
I wonder, could I somehow use it to retrieve structured information from OCR'd form documents in a few-shot learning way, where the examples of input-output pairs are stored in the index?
It could work! You'll need a custom prompt, and you will have to get pretty creative with it haha

For example, the knowledge-graph index uses a few-shot prompt like this: https://github.com/jerryjliu/llama_index/blob/main/gpt_index/prompts/default_prompts.py#L253

If you can give a few small examples in the prompt, then hopefully the model can figure out what you want.
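Something like this is what I mean - a rough sketch of a few-shot extraction prompt, assuming that era's QuestionAnswerPrompt class and an `index` already built over your OCR'd forms; the example form text and JSON fields here are made up:

```python
# Sketch: a few-shot extraction prompt passed to index.query().
# The template must keep the {context_str} and {query_str} placeholders;
# literal JSON braces are escaped as {{ }} because the template is format-string based.
from llama_index import QuestionAnswerPrompt

FEW_SHOT_EXTRACTION_PROMPT = QuestionAnswerPrompt(
    "Extract the requested fields from the OCR'd form text as JSON.\n"
    "Example form text: 'Invoice No: 1234  Date: 2023-01-05  Total: $99.00'\n"
    "Example output: {{\"invoice_no\": \"1234\", \"date\": \"2023-01-05\", \"total\": \"$99.00\"}}\n"
    "---------------------\n"
    "Form text is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the form text, answer the request: {query_str}\n"
)

# `index` is an already-built llama_index index over the OCR'd documents
response = index.query(
    "Extract invoice_no, date and total as JSON",
    text_qa_template=FEW_SHOT_EXTRACTION_PROMPT,
)
print(response)
```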

From my experience, LLMs aren't quite ready for document info extraction. I would look at something like LayoutLM, LiLT, or DONUT and use Document VQA. Here's a huggingface space as an example: https://huggingface.co/spaces/impira/docquery
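For reference, the extractive Document VQA route is only a few lines with the transformers document-question-answering pipeline - this sketch uses the impira LayoutLM checkpoint that (I believe) backs that space; it needs pytesseract/tesseract installed for OCR, and "invoice.png" is just a placeholder image path:

```python
# Sketch: extractive document QA with a LayoutLM checkpoint.
# Requires: pip install transformers pillow pytesseract (plus the tesseract binary on PATH).
from transformers import pipeline

doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

# "invoice.png" is a placeholder path to a scanned form/invoice image
result = doc_qa(image="invoice.png", question="What is the invoice number?")
print(result)  # list of {'score', 'answer', 'start', 'end'} candidates
```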
I tried DONUT and ERNIE-Layout (which uses LayoutXLM). DONUT showed its first good results with ~10k synthetic examples.
I agree about creative prompting; even Flan-T5 XL can deliver pretty good results (without any visual cues) in a few-shot setting.
The only problem is that the input token length is quite limited... so you end up with one-shot instead, and that's not great on raw OCR'd text.
I don't want to end up with a 1k+ token call to GPT-4 for every page 🙂 - even though that delivers the best results (with clever prompting) so far.
haha exactly! To give a decent few-shot, you need to use a lot of valuable context space.

Maybe fine-tuning could help? You could leverage the DocVQA dataset to automatically create a completion dataset for an LLM
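Roughly what I have in mind - a sketch of turning DocVQA-style (document text, question, answer) records into a prompt/completion JSONL for fine-tuning; the record field names here are hypothetical and would need to be mapped to the real dataset schema:

```python
# Sketch: build a fine-tuning completion dataset from DocVQA-style records.
# The fields "ocr_text", "question", and "answer" are hypothetical placeholders;
# map them to whatever the actual dataset export uses.
import json

def to_completion_example(record: dict) -> dict:
    prompt = (
        "Document text:\n"
        f"{record['ocr_text']}\n\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )
    return {"prompt": prompt, "completion": " " + record["answer"]}

def write_jsonl(records, path="docvqa_completions.jsonl"):
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(to_completion_example(record)) + "\n")
```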
And visually rich documents usually lack textual context 😦 which doesn't help - just caption-answer pairs...
So visual/layout cues are probably much needed.
I had hopes on UDOP, but could not try it yet.
haha I've been waiting so long for UDOP to get added to huggingface. There was a PR for like a month that got cancelled, but now someone else is working on it 🙏

The original UDOP codebase is not user friendly sadly 🙄
The problem with DONUT (and likely UDOP) is that they are generative models, so it can be hard to detect when they hallucinate an answer

LayoutLM v1/2/3, LiLT, and ERNIE-Layout are all extractive, which is a bit more reliable in my opinion
wow. Token usage is pretty high for one query...
[Attachment: image.png]
@Logan M Any plans to make an example with a locally running Llama or Alpaca? 🙂
well, at $0.002/1k tokens, 5K tokens is not bad 😅 Even with davinci at $0.02/1k, still could be worse

LlamaIndex supports any local LLM; it's just up to you to pass the text to the model and return the newly generated tokens

See this small example with FLAN -> https://github.com/jerryjliu/llama_index/issues/544

Your mileage may vary though. It seems like every LLM needs slightly tweaked prompts; the default prompts are optimized for davinci and ChatGPT
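To sketch what that looks like for a local Flan-T5 (roughly the pattern from that FLAN issue and the custom-LLM docs of the time - exact base classes and import paths may differ by version, and "./data" is a placeholder folder):

```python
# Sketch: plugging a local Flan-T5 into llama_index via a custom LangChain LLM wrapper.
# Assumes the era where llama_index used LLMPredictor + LangChain's LLM base class.
from langchain.llms.base import LLM
from llama_index import GPTSimpleVectorIndex, LLMPredictor, ServiceContext, SimpleDirectoryReader
from transformers import pipeline

model_name = "google/flan-t5-large"  # pick a size that fits your hardware
flan_pipeline = pipeline("text2text-generation", model=model_name)

class LocalFlanLLM(LLM):
    def _call(self, prompt: str, stop=None) -> str:
        # Return only the newly generated text (text2text models don't echo the prompt)
        return flan_pipeline(prompt, max_new_tokens=256)[0]["generated_text"]

    @property
    def _identifying_params(self):
        return {"model_name": model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"

llm_predictor = LLMPredictor(llm=LocalFlanLLM())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

documents = SimpleDirectoryReader("./data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
print(index.query("Summarize the documents"))
```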