Does anybody have ideas for extracting

It looks like I can get an answer like this if I hard-code the dates and include enough nodes (4 in this example):

Starting query: what was my average readiness score for dec 30 2022, dec 31 2022 and jan 1 2023?

[query] Total token usage: 3233 tokens

The average readiness score for December 30, 2022, December 31, 2022, and January 1, 2023 can be calculated by adding the readiness scores for each day and dividing by three. The readiness scores for each day are as follows: December 30, 2022: 93, December 31, 2022: 86, and January 1, 2023: 84. Therefore, the average readiness score for these three days is 88.
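A quick check of that arithmetic, using only the three scores quoted in the response above. The exact mean is about 87.67, so the reported 88 is the rounded value.

```python
# Readiness scores quoted in the response above.
scores = {"2022-12-30": 93, "2022-12-31": 86, "2023-01-01": 84}

average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # 87.67
print(round(average))    # 88
```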

This is interesting. Off the top of my head you could try using GPTListIndex for this type of query, instead of GPTSimpleVectorIndex - GPTListIndex will iterate through all nodes in the list to synthesize the response.

More broadly, you raise interesting questions about being able to infer a schema from unstructured data and running queries on that. I think there's some work on 1) generating schemas from unstructured data, and 2) translating natural language queries to SQL, e.g. https://blog.seekwell.io/gpt3. I'll brainstorm some directions here for GPT Index!

Perhaps there could be a DbTable index type in GPT Index backed by SQLite, and queries on top of it could formulate a SQL query to execute on that db?
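To make that idea concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is not a GPT Index API: the table name `daily_notes` and its columns are assumptions made up for illustration, the rows are the scores quoted earlier in the thread, and the SQL string is hand-written as a stand-in for what a natural language -> SQL step would have to produce.

```python
import sqlite3

# Hypothetical table layout; rows are the scores quoted earlier in this thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_notes (date TEXT PRIMARY KEY, readiness INTEGER)")
conn.executemany(
    "INSERT INTO daily_notes VALUES (?, ?)",
    [("2022-12-30", 93), ("2022-12-31", 86), ("2023-01-01", 84)],
)

# Hand-written stand-in for the SQL an LLM would be asked to generate from
# "what was my average readiness score for dec 30 2022 ... jan 1 2023?"
generated_sql = """
    SELECT AVG(readiness) FROM daily_notes
    WHERE date BETWEEN '2022-12-30' AND '2023-01-01'
"""
(avg_readiness,) = conn.execute(generated_sql).fetchone()
print(round(avg_readiness, 2))
```

Storing dates as ISO strings keeps `BETWEEN` comparisons correct, because the strings sort in date order.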

I've had some success with learning a schema from unstructured data with GPT, including data type, and then using the schema to extract values for different fields.

Yeah @KKT that's a good idea! There's the natural language -> SQL interface that I can build

Re: learning a schema from unstructured data, this is also really interesting and a feature I want to include in GPT index. What prompts did you use? Were you able to use the schema to extract unstructured data into a db?

Just tried it on my app. The first time it kind of modified the original input, but a second fresh try gave this. This was zero-shot (besides some prompt examples), but after that we use existing matches for future entries, so it becomes more reliable.

Attachment

Let me try to find the constructed prompt

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

Human: We are extracting fields from text documents. Each field only appears exactly once in each document, pick the best one. The fields to extract are:
name, age, occupation.
1~ My neighbor Peter is a thirty-three year old journalist and he walks his dog everyday
2~ The article said Simon Banksy is a billionaire CEO who just turned 30.

AI: The field annotations and values are:
1~ My neighbor [[name|Peter]] is a [[age|thirty-three year old]] [[occupation|journalist]] and he walks his dog everyday
1~ Peter, 33, journalist
2~ The article said [[name|Simon Banksy]] is a billionaire [[occupation|CEO]] who just turned [[age|30]].
2~ Simon Banksy, 30, CEO

Human: We are extracting fields from text documents. Each field only appears exactly once in each document, pick the best one. The fields to extract are: .
1~ Date:: 2023-01-01, Weight:: 145, Readiness:: 83

AI: The field annotations and values are:

I guess this one kind of got lucky when there were no known fields, but it could be separated into a different prompt. I have a few different variants of this structure.
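For reading the model's answer back, the `[[field|value]]` annotations from the prompt above are easy to pick apart with a regex. This is only a sketch: the helper name `parse_annotations` is made up for illustration, and it assumes values never contain `]` and field names never contain `|`.

```python
import re

# Matches [[field|value]] spans in the style used by the prompt above.
ANNOTATION = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")

def parse_annotations(line):
    """Return {field: value} for every [[field|value]] span in one annotated line."""
    return {field.strip(): value.strip() for field, value in ANNOTATION.findall(line)}

annotated = (
    "1~ My neighbor [[name|Peter]] is a [[age|thirty-three year old]] "
    "[[occupation|journalist]] and he walks his dog everyday"
)
fields = parse_annotations(annotated)
print(fields)
```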

Related: it would be interesting to hook this into dataview which indexes metadata found in markdown files (both inline and frontmatter) for the purpose of querying. https://blacksmithgu.github.io/obsidian-dataview/

@KKT this is helpful! This gets me thinking: maybe GPT Index could provide a structure that takes in a prompt for parsing unstructured data into structured data with user-specified fields (we could also extract fields by default, but perhaps that is less immediately useful).
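As a rough sketch of the user-specified-fields variant, the "Human:" turn from the prompt shared above can be assembled from a field list and a list of documents. The function name `build_extraction_prompt` is hypothetical and is not part of GPT Index; the few-shot example turn from the original prompt is left out for brevity.

```python
def build_extraction_prompt(fields, documents):
    """Assemble one 'Human:' turn in the style of the prompt shared above."""
    # Number each document with the "N~ " prefix used in the original prompt.
    numbered = "\n".join(f"{i}~ {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Human: We are extracting fields from text documents. "
        "Each field only appears exactly once in each document, pick the best one. "
        "The fields to extract are:\n"
        f"{', '.join(fields)}.\n"
        f"{numbered}\n\n"
        "AI: The field annotations and values are:\n"
    )

prompt = build_extraction_prompt(
    ["date", "weight", "readiness"],
    ["Date:: 2023-01-01, Weight:: 145, Readiness:: 83"],
)
print(prompt)
```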

Honestly both would be incredibly useful. 🤤

@arminta7 are your data fields all currently annotated with the dataview kind of syntax? if not, i think variations of the prompt i pasted above can potentially identify the common kinds of concepts/fields across different notes that are worthwhile to extract, subject to prompt size limitations in scanning multiple notes in aggregate. in other words, it should be possible to surface fields automatically, though i haven't done a lot of testing on that.

having the user explicitly list some fields would definitely make it more deterministic/predictable in behavior. so perhaps automatic field identification is an optional experimental helper utility?

actually maybe that bit is similar to the keyword table index -- but not limited to exact keywords but concepts

Yes, I would say 99% of my data is using inline dataview syntax. The rest would be yaml front-matter.

I see, it might be easier to use a regex to extract those given that it is quite parseable. But inserting them into a db / table to query via natural language would still be nice to have.
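A minimal sketch of the regex route, run on the sample line from the prompt earlier in the thread. It assumes `Key:: value` pairs separated by commas or newlines, as in that sample; values that themselves contain commas, and YAML front-matter, would need separate handling. The helper name is made up for illustration.

```python
import re

# Matches "Key:: value" pairs; a value runs until the next comma or end of line.
INLINE_FIELD = re.compile(r"(\w[\w ]*?)::\s*([^,\n]+)")

def extract_inline_fields(text):
    """Return {lowercased key: value string} for each inline field found in text."""
    return {key.strip().lower(): value.strip() for key, value in INLINE_FIELD.findall(text)}

note = "Date:: 2023-01-01, Weight:: 145, Readiness:: 83"
row = extract_inline_fields(note)
print(row)
```

The resulting dict has one entry per field, which maps directly onto a row in a db table.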