
Updated 2 years ago

When setting up for indexing for my data

When setting up indexing for my data files, is it best to have one doc per topic, e.g. "Page Build" and "Page Layout" each in their own document, or does that not matter once it all goes into the JSON anyway?

What is the best way to use List Indexing for Keyword search?
I think separate documents will work best, at least in my experience

On the list index, you can include a list of keywords at query time. If a text node doesn't contain one of the keywords, it will skip that node.
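
To make the keyword filtering above concrete, here is a minimal plain-Python sketch of the idea. It is not the library's own code: the function name `filter_nodes_by_keywords`, the parameter name, and the sample nodes are hypothetical, and it reads the comment as "a node that is missing any of the keywords is skipped". The case-insensitive substring match is also an assumption.

```python
from typing import List


def filter_nodes_by_keywords(nodes: List[str], required_keywords: List[str]) -> List[str]:
    """Keep only the text nodes that contain every required keyword.

    Assumption for illustration: matching is a case-insensitive
    substring check. The real index may match differently.
    """
    lowered = [kw.lower() for kw in required_keywords]
    # all() over an empty keyword list is True, so no keywords means no filtering.
    return [node for node in nodes if all(kw in node.lower() for kw in lowered)]


# Hypothetical text nodes, standing in for chunks of indexed documents.
nodes = [
    "Page Build: how to assemble a page from components.",
    "Page Layout: grid and spacing rules.",
    "Release notes for version 2.",
]
kept = filter_nodes_by_keywords(nodes, ["page", "layout"])
```

Here `kept` holds only the "Page Layout" node, since it is the only one containing both keywords.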
OK cool, I was just starting to do that. So each doc will have its own title, for example: Pages, Structure, Components.

Does the title of the doc also help with queries?
If the title is in the document text, yea it definitely helps
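
One way to make sure the title is part of the document text is to prepend it before indexing. A small sketch, assuming you hold each document as a plain string; the helper name `with_title` is hypothetical and not part of any library.

```python
def with_title(title: str, body: str) -> str:
    """Return the document text with the title as its first line."""
    if body.lstrip().startswith(title):
        # The title is already at the top of the text, so leave it alone.
        return body
    return f"{title}\n\n{body}"


doc_text = with_title("Page Layout", "Grid and spacing rules for pages.")
```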
Also when it comes to the text within the document, how is it being read?

Example

Header 1 - Styling | Does this give more strength for the query?

Also, for code:

If I have code blocks, should I also add "Code" before the blocks? Again, does that help the indexing?
Yea I think that helps for sure! Anything that will help the model make sense of what it is about to read 📚 👌
Definitely try to test small and test first though. Although it sounds like you've got a good handle on things
So I can see it reads a lot of different types of files; I wonder if some are better than others.

Like a text doc: if you copy-paste directly from a site, it won't take the formatting with it, so everything is basically on the same level, header and paragraph alike. I wonder if putting it into markdown format and saving that in a text doc would play better, because we know the OpenAI LLM knows what a header is, etc.
Yea I think that might be helpful! You can parse the raw markdown yourself, and llama_index also has a markdown parser.

I personally haven't used it, so maybe double check what it actually does haha https://github.com/logan-markewich/gpt_index/blob/main/gpt_index/readers/file/markdown_parser.py
Looks like it creates a list of tuples, containing the header and the text 🤷‍♂️
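
For anyone who wants to parse the raw markdown themselves, here is a rough sketch of the "list of (header, text) tuples" idea. It is a standalone illustration, not the linked parser's code: the function name `split_markdown_by_header` and the sample text are hypothetical, it only recognises `#`-style headers, and it does not special-case `#` lines inside fenced code blocks.

```python
import re
from typing import List, Optional, Tuple

# Matches ATX-style headers such as "# Title" or "## Section".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_header(markdown_text: str) -> List[Tuple[Optional[str], str]]:
    """Split markdown into (header, text) tuples, one per header section.

    Text that appears before the first header gets None as its header.
    """
    sections: List[Tuple[Optional[str], str]] = []
    header: Optional[str] = None
    lines: List[str] = []

    def flush() -> None:
        # Store the section collected so far, skipping an empty preamble.
        if header is not None or any(line.strip() for line in lines):
            sections.append((header, "\n".join(lines).strip()))

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return sections


sample = "# Styling\nUse the shared theme.\n\n## Code\nSet the class name on the root element."
sections = split_markdown_by_header(sample)
```

Each tuple can then be turned into its own text chunk, for example by joining the header and its text, so the header stays attached to the content it describes.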
Yeah, could add more value to the headers.
I will have to test it, raw boring text vs. md. lol