I'm at a consultancy doing enterprise app dev and project management, and I'm trying to see what interesting stuff I can do with ~20 years' worth of our timesheet notes. The note-writing styles vary, but in general they are short, to-the-point summaries of the work we're doing for clients. I hope to be able to answer questions like "summarize the work employee X did on project Y", "did employee X ever work with technology Y, and for how long?", "how often does employee X do Y" and so on.
I'm currently working with our 2022 data, which has about 18 different people working on a few dozen projects over the year. There are 194 unique employee-project combinations. I'm new to gpt-index and not sure of the best way to go about structuring the data. It's coming from a SQL database and I can save/convert it to whatever format I want.
I've had the best results with dumping it out as a single CSV or writing each time entry to a line in a big text file, keeping the natural order from the database. I tried grouping each employee-project combination into a markdown text file, but keeping the rows "mixed-up" has yielded the best results so far.
I'm using GPTSimpleVectorIndex. I tried GPTListIndex but pulled the plug after a query ran for about 7 minutes without returning an answer. I think I need to do something with keywords, or otherwise somehow "tag" the ListIndex elements with employee or project, and then use keyword filters at query time to make sure it only takes the files about that particular employee. When I scale this up I'll probably have keywords for year/quarter/etc. too.
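A rough sketch of the tagging idea, in plain Python rather than any gpt-index API: each time entry carries a set of tags (employee, project, year, quarter), and entries are filtered by tag before anything is handed to an index. The names `Entry`, `tags`, and `filter_entries` are hypothetical and only illustrate the shape of the data; how the filtered entries are then indexed is left open.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    """One timesheet row (hypothetical structure, not a gpt-index type)."""
    employee: str
    project: str
    day: date
    hours: float
    note: str

    @property
    def tags(self):
        # Derive the calendar quarter (1-4) from the month.
        quarter = (self.day.month - 1) // 3 + 1
        return {
            f"employee:{self.employee}",
            f"project:{self.project}",
            f"year:{self.day.year}",
            f"quarter:Q{quarter}",
        }


def filter_entries(entries, required_tags):
    """Keep only the entries that carry every one of the required tags."""
    required = set(required_tags)
    return [e for e in entries if required <= e.tags]


def render(entry):
    """Format an entry as one self-describing line of text."""
    return (f"{entry.employee} on {entry.project}, {entry.day:%b %d %Y} "
            f"for {entry.hours:g} hours: {entry.note}")
```

With something like this, a question about one employee in Q3 would first narrow the rows with `filter_entries(entries, {"employee:Doe, John", "quarter:Q3"})` and only then build or query an index over the rendered lines, so rows from other employees never reach the model.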
The problem I'm seeing from just chucking the whole file at gpt-index with GPTSimpleVectorIndex is that it conflates one employee's activities with another's, probably because the entries are all mixed together. It will accurately describe what two employees worked on together in Q3, but then start appending a bunch of stuff only employee 1 did alone.
The problem with grouping each employee-project combination into a markdown file is that it misses a lot of things. I think this is because some of the logs are long enough (every workday for a year) that they get split into chunks, and only the first line of the file has the employee/project name. Subsequent chunks wouldn't contain it and would consist only of the date, hours, and note.
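If the chunk-splitting theory is right, one thing to try is splitting the log up front and repeating the header line at the top of every chunk. The sketch below is plain Python under assumed names: `chunk_with_header` and the `max_chars` budget are hypothetical, and a character budget is only a stand-in for whatever token-based splitting the library applies on its own.

```python
def chunk_with_header(header, entries, max_chars=1000):
    """Split entry lines into chunks that each start with the header line."""
    chunks = []
    current = []
    size = len(header)
    for line in entries:
        # Close the current chunk if adding this line would exceed the budget.
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join([header] + current))
            current = []
            size = len(header)
        current.append(line)
        size += len(line) + 1  # +1 for the newline joining the lines
    if current:
        chunks.append("\n".join([header] + current))
    return chunks
```

Each chunk could then be saved as its own small document, so every piece that gets embedded still names the employee and project. A single entry longer than `max_chars` ends up alone in an oversized chunk rather than being cut.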
The markdown files look like this:
Timesheet summary of employee Doe, John on project ACME - IT- Big Software Project Report Development:
- Mar 29 2022 for 8 hours: On-boarding / Kick-off Meeting w/ ACME Team (Jane Doe, Bob Smith, etc.); Access (ACRONYM) meeting w/ Jane Doe.; Analysis and "Report Log" review.; Big Software analysis and review (ERD, permissions, etc.)
- Apr 4 2022 for 8 hours: Weekly Meeting w/ Jane, Bob, Alice, and Accounting Team.; Developing: Balance Due Report w/ Alice.
The csv/text single-file version looked like that, but with the employee and project name on each line, and everything in date/natural order straight out of the database.
Any pointers on adding a little structure to my data to make it work better with gpt-index?