Question for the NLP experts

Say I have thousands of academic articles (~50 pages per article on average) on a certain broad subject which I would like to index. The idea is to find paragraphs/articles related to a very specific use case given as free-text input.
Originally I thought about indexing each article using the GPTSimpleVectorIndex with a rather small chunk size (256) and then running the query (I use GPT-3 as the LLM), but I'm happy to hear your thoughts on more sophisticated indexing schemes (hierarchies?), as I'm afraid this doesn't work as well as I expected.
TIA for your valuable insights!
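For concreteness, roughly what I have now (a minimal sketch against the gpt_index API; the "articles" directory and the query string are placeholders):

```python
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load the articles from a local directory (placeholder path).
documents = SimpleDirectoryReader("articles").load_data()

# Flat vector index over all articles, with a small chunk size.
index = GPTSimpleVectorIndex(documents, chunk_size_limit=256)
index.save_to_disk("index.json")

# Free-text query describing the specific use case.
response = index.query("Find work related to <my specific use case>")
print(response)
```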
I would like to know this as well: what would be the most apt way to use gpt-index when the data is large?
Well for starters I would look outside the GPTSimpleVectorIndex and use an actual vector database, so that searching is more efficient.

That said, with a large dataset it is tougher to find the correct context for the model. Especially if you are an expert on the material, it may feel like the model misses relevant data. This can be remedied somewhat by just fetching a lot of vectors, but that leads to a) increasing costs and b) challenges for the LLM to compress the information into something meaningful. So results may vary πŸ˜„
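For example, "fetching a lot of vectors" is just a query-time knob (a sketch; the top-k value here is arbitrary):

```python
# Pull more chunks per query: a higher k surfaces more candidate
# context, at the cost of more tokens for the LLM to digest.
response = index.query(
    "Find work related to <my specific use case>",
    similarity_top_k=5,  # the default was 1 in this version
)
```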

I'd look at finetuning the models, then maybe building an index on top of that. And use a query template that doesn't instruct the model to forget its previous knowledge.
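Something along these lines (a sketch; the template wording is just an example, and the import path follows the gpt_index layout of the time):

```python
from gpt_index.prompts.prompts import QuestionAnswerPrompt

# Unlike the default prompt, this template does not tell the
# model to ignore its prior knowledge.
qa_template = QuestionAnswerPrompt(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using both the context information and your own knowledge, "
    "answer the question: {query_str}\n"
)

response = index.query("<my question>", text_qa_template=qa_template)
```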

Just some initial thoughts..
Interesting, thanks @Mikko. For now I'm trying to optimize without retraining any model.
I agree that on large datasets it loses context; that's why I thought it might be good to build several levels of indices on top of each other with varying chunk sizes.
Happy to hear your thoughts on that, also maybe I should use different indices than the SimpleVector. Note that for now I'm focusing on quality, we can put aside the efficiency aspect.
I do need to make sure though that I'm not calling the LLM too much.
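I was imagining something like gpt_index's composability feature, if I understand it right (a rough sketch; the per-article grouping and the summaries are placeholders):

```python
from gpt_index import GPTListIndex, GPTSimpleVectorIndex

# One small-chunk vector index per article
# (per_article_documents is a placeholder: documents grouped by article).
article_indices = []
for docs in per_article_documents:
    sub_index = GPTSimpleVectorIndex(docs, chunk_size_limit=256)
    # The summary text is what the parent index sees for this article.
    sub_index.set_text("<one-paragraph summary of this article>")
    article_indices.append(sub_index)

# Top-level index over the per-article indices; a recursive query
# descends from the summaries into the matching sub-index
# (the composability docs cover the query configs for that).
top_index = GPTListIndex(article_indices)
```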
"Build several levels of indices on top of each other..."

Have you tried the Tree Index? Each node in the tree is a summary of its children. Using the embeddings mode with that might be helpful
Hi @Logan M, I haven't tried the TreeIndex yet; as I understand it, it doesn't use any embeddings, right? I guess this would result in many calls to the LLM.
However, I'm still not sure I understand how TreeIndex works. If there's no similarity score between vectors (as there are no embeddings), how does it know which child nodes to go to next? I read through the docs at https://gpt-index.readthedocs.io/en/latest/guides/use_cases.html but couldn't really understand the flow in those examples.
The tree index supports embeddings (see the mode option during the query https://github.com/jerryjliu/gpt_index/blob/main/examples/test_wiki/TestNYC_Embeddings.ipynb

https://gpt-index.readthedocs.io/en/latest/reference/indices/tree_query.html)

I'm still learning how the index works, but it seems to work pretty well. It costs slightly more LLM calls to build the index, but queries will be much faster.
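Roughly like this (a sketch based on that notebook):

```python
from gpt_index import GPTTreeIndex

# Building the tree summarizes children into parent nodes;
# this is where the extra LLM calls are spent.
tree_index = GPTTreeIndex(documents)

# "embedding" mode picks which branch to descend by embedding
# similarity instead of asking the LLM to choose a child.
response = tree_index.query("<my question>", mode="embedding")
print(response)
```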
I would love to hear more of your thoughts on this. I assume Pinecone is better at finding the correct context information when searching through a large corpus of embedded documents? GPTSimpleVectorIndex seems to be bad at this (does gpt-index maybe offer some other, better option for this sort of task?)

Also just wondering how, or whether, fine-tuning leads to better context recognition? My experience has usually been model degradation, especially when dealing with questions the model hasn't been fine-tuned on. (I assume this is caused by the reversion to the standard davinci model.)

Also does AWS hosting provide some noticeable benefits?
GPTSimpleVectorIndex should actually be more accurate, since it naively compares against all vectors, assuming the default cosine similarity works for your application. Pinecone et al. may use approximate (ANN) lookups when there is a lot of data
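To be concrete, "naively comparing against all vectors" just means exhaustive cosine similarity, roughly this (a plain numpy sketch, not gpt-index's actual code):

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Score the query against every stored vector (no pruning)."""
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit
    return np.argsort(scores)[::-1][:k]  # indices of the k best chunks
```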
Hosting shouldn't matter. You can get more compute than locally though πŸ™‚
Fine-tuning is not really a replacement for context, but it should improve overall performance with the correct prompts. Probably not with gpt-index prompts, which tell the model to ignore its previous knowledge.
Hmm, that's weird. Even with about 40 text documents + GPTSimpleVectorIndex, it quite often responds with "not mentioned in the context information", despite the context information very clearly being contained in the files used to create the index.

Maybe there is another solution to this then?
My problem is merely that it fails to find any context information at all, not that the context information is irrelevant or wrong
Odd. How many context chunks are you pulling, how do their similarity scores look, and what are you using for embeddings?
That's with just using the standard query method:

```python
# load from disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')
print(index.query("<my question>"))
```

LLM token usage seems to fluctuate between 2-3k when it doesn't find the context information. I have played around with chunk size limits etc., but that hasn't helped. Not sure how to check the similarity scores.
I suspect the issue might also be caused by the .txt files used to create the embeddings. What is your preferred method of preprocessing or formatting the data you use? Or is this even mandatory?
Previously I managed without any preprocessing but I'm struggling with this current dataset
@Teemu you can see the similarity scores using something like print(response.source_nodes); each source node has a similarity attribute.
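For example (a sketch; attribute names per the gpt_index response object of that version):

```python
response = index.query("<my question>")
for node in response.source_nodes:
    # Each retrieved chunk carries its similarity score.
    print(node.similarity, node.source_text[:200])
```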