When setting up for indexing for my data

When setting up indexing for my data files, is it best to have one doc per topic? For example, "Page Build" and "Page Layout" would each go in their own document. Or does that not matter once it all goes into the JSON anyway?

What is the best way to use List Indexing for Keyword search?

11 comments

Logan M

I think separate documents will work best, at least in my experience.

On the list index, you can include a list of keywords at query time. If a text node doesn't contain one of the keywords, it will skip that node.
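To make the skipping behavior concrete, here is a minimal plain-Python sketch of keyword filtering over text nodes. It does not call llama_index at all; `filter_nodes_by_keywords` and the sample nodes are made up for illustration. It also assumes "one of the keywords" means a node is kept when it contains at least one keyword, so check whether the library wants any keyword or all of them.

```python
from typing import Iterable, List, Optional


def filter_nodes_by_keywords(
    nodes: Iterable[str],
    required_keywords: Optional[List[str]] = None,
) -> List[str]:
    """Keep only the text nodes that contain at least one keyword.

    Hypothetical helper: mimics the idea of skipping nodes at query time,
    not the library's actual implementation.
    """
    if not required_keywords:
        # No keywords given: nothing is skipped.
        return list(nodes)
    lowered = [kw.lower() for kw in required_keywords]
    # Case-insensitive substring match against each node's text.
    return [n for n in nodes if any(kw in n.lower() for kw in lowered)]


nodes = [
    "Page Build: how to assemble a page from components.",
    "Page Layout: grid and spacing rules.",
    "Release notes for the design system.",
]
kept = filter_nodes_by_keywords(nodes, ["layout", "grid"])
```

Only the nodes in `kept` would go on to the LLM, which is why a good keyword list can cut down on the number of calls for a list index.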

Meathead

OK cool, I was just starting to do that. So each doc will have its own title, for example: Pages, Structure, Components.

Does the title of the doc also help with queries?

Logan M

If the title is in the document text, yeah, it definitely helps.
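A small sketch of what "title in the document text" could look like when preparing one document per topic. The `build_documents` helper and the topic contents are hypothetical; the resulting strings would then be written to files or wrapped in whatever document object your index expects.

```python
from typing import Dict, List


def build_documents(topics: Dict[str, str]) -> List[str]:
    """Return one text document per topic, with the title as the first line."""
    return [f"{title}\n\n{body.strip()}" for title, body in topics.items()]


# Hypothetical topics, one per document.
topics = {
    "Page Build": "Steps for assembling a page from components.",
    "Page Layout": "Grid, spacing, and breakpoint rules.",
}
documents = build_documents(topics)
```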

Meathead

Also, when it comes to the text within the document, how is it being read?

Example:

Header 1 - Styling | Does a header like this carry more weight in the query?

Also, for code:

If I have code blocks, should I also add "Code" before the blocks? Again, does that help the indexing?

Logan M

Yeah, I think that helps for sure! Anything that will help the model make sense of what it is about to read 📚 👌

Logan M

Definitely try to test small and test first though. Although it sounds like you've got a good handle on things

Meathead

So I can see it reads a lot of different types of files. I wonder if some are better than others.

Like with a text doc: if you copy-paste directly from a site, it won't take the formatting with it, so everything is basically on the same level, headers and paragraphs alike. I wonder whether it would play better if I put it into markdown format and saved that as a text doc, because we know the OpenAI LLM knows what a header is, etc.

Logan M

Yeah, I think that might be helpful! You can parse the raw markdown yourself, and llama_index also has a markdown parser.

I personally haven't used it, so maybe double check what it actually does haha https://github.com/logan-markewich/gpt_index/blob/main/gpt_index/readers/file/markdown_parser.py

Logan M

Looks like it creates a list of tuples, containing the header and the text 🤷‍♂️
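For the "parse the raw markdown yourself" route, a rough sketch of splitting markdown into (header, text) tuples is below. This is a stand-in written for illustration, not the library's parser; `split_markdown` and the sample text are hypothetical, and it only handles `#`-style headers.

```python
import re
from typing import List, Optional, Tuple

# Matches ATX-style headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^#{1,6}\s+(.*)$")


def split_markdown(md: str) -> List[Tuple[Optional[str], str]]:
    """Split markdown into (header, text) tuples, one per section.

    Text before the first header gets a header of None.
    Caveat: a line starting with "# " inside a code block would be
    mistaken for a header by this simple version.
    """
    sections: List[Tuple[Optional[str], str]] = []
    header: Optional[str] = None
    body: List[str] = []

    def flush() -> None:
        # Emit the current section unless it is completely empty.
        if header is not None or any(line.strip() for line in body):
            sections.append((header, "\n".join(body).strip()))

    for line in md.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return sections


md = "# Styling\nUse the brand colors.\n\n## Buttons\nPrimary buttons are blue.\n"
sections = split_markdown(md)
```

Each tuple keeps the header next to the text it describes, so a chunk never loses the context of which section it came from.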

Meathead

Yeah, could add more value to the headers.

Meathead

I'll have to test it: raw boring text vs. md, lol.
