The goal may be as simple as retrieving segments (chunks) of the interview transcript by thematic code (Drivers, Solutions, Barriers), running summarization tasks, or doing additional Q&A.
The question revolves around the best document loading / chunking / embedding strategy given the structure of Whisper transcripts. If one wanted to maintain metadata at the document (segment) level, such as "speaker", "confidence", and "timestamps", how would one then structure the chunks and embeddings to maintain semantic cohesion?
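For concreteness, here's roughly how I'm pulling segment-level metadata out of the Whisper JSON today. This is just a sketch under assumptions: the file is a diarized Whisper output whose segments carry a "speaker" key (vanilla Whisper doesn't emit speakers), and "confidence" is just `avg_logprob` repurposed.

```python
import json

# Load a Whisper JSON file and pull out the per-segment fields I want
# to keep as metadata. Assumes diarized output with a "speaker" key on
# each segment; "confidence" derived from avg_logprob is an assumption
# about how you'd want to score it.
with open("interview.json") as f:
    transcript = json.load(f)

segments = [
    {
        "text": seg["text"].strip(),
        "metadata": {
            "speaker": seg.get("speaker", "unknown"),
            "start": seg["start"],
            "end": seg["end"],
            "confidence": seg.get("avg_logprob"),
        },
    }
    for seg in transcript["segments"]
]
```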
For example, we may have 15 lines from a 5,000-line interview (Whisper JSON file) that should be grouped together:

...
Speaker1: asks a question
Speaker1: continues the same question
Speaker1: filler word
Speaker2: asks a clarifying question
Speaker1: gives a quick answer
Speaker2: begins answering
Speaker2: continues
Speaker2: continues
Speaker1: interrupts with a quick clarifier
Speaker2: continues (end of answer)
...
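One heuristic I've been sketching for the grouping itself: treat Speaker1 as the interviewer and start a new chunk only when the interviewer speaks after a stretch of interviewee speech *and* after a real pause, so quick mid-answer clarifiers stay inside the exchange. The speaker role and the 2-second gap are placeholder assumptions; this builds on the `segments` list from the sketch above.

```python
# Placeholder assumptions: Speaker1 is the interviewer, and a new
# question tends to come after a pause, while mid-answer interruptions
# follow the previous segment almost immediately.
INTERVIEWER = "Speaker1"
MAX_GAP_SECONDS = 2.0

def group_exchanges(segments):
    chunks, current = [], []
    for seg in segments:
        is_new_question = (
            current
            and seg["metadata"]["speaker"] == INTERVIEWER
            and current[-1]["metadata"]["speaker"] != INTERVIEWER
            and seg["metadata"]["start"] - current[-1]["metadata"]["end"] > MAX_GAP_SECONDS
        )
        if is_new_question:
            chunks.append(current)
            current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks

exchanges = group_exchanges(segments)
```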
What are some methods for isolating these high-level question/answer pairs from a Whisper transcript? How can the JSON loader be employed here, or are there best practices around Whisper-transcript RAG in general?
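On the JSON loader question specifically: if that means LangChain's `JSONLoader`, the closest I've gotten is one Document per segment with metadata attached via `metadata_func`. The `jq_schema` and key names are assumptions about my file layout, and it needs the `jq` package installed.

```python
from langchain_community.document_loaders import JSONLoader

def segment_metadata(record: dict, metadata: dict) -> dict:
    # copy the per-segment fields I care about into Document metadata
    metadata["speaker"] = record.get("speaker", "unknown")
    metadata["start"] = record.get("start")
    metadata["end"] = record.get("end")
    metadata["confidence"] = record.get("avg_logprob")
    return metadata

loader = JSONLoader(
    file_path="interview.json",
    jq_schema=".segments[]",   # one record per Whisper segment
    content_key="text",
    metadata_func=segment_metadata,
)
docs = loader.load()
```

But that yields one flat Document per segment, which throws away exactly the exchange-level grouping I'm after, hence the question.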
Maybe I'm an idiot, but I'm having a hard-ass time figuring out how nodes/documents can be built without going through the abstracted "loaders" and "parsers", and the loaders and parsers are not working the way I'd like them to for my use case.
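What I'd like is to skip the loaders entirely and build nodes by hand from the grouped exchanges. A minimal sketch of what I think that looks like, assuming a recent LlamaIndex (llama-index >= 0.10, where `TextNode` lives in `llama_index.core.schema`) and an embedding model already configured:

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

nodes = []
for exchange in exchanges:
    # one node per question/answer exchange, text formatted as a dialog
    text = "\n".join(
        f"{seg['metadata']['speaker']}: {seg['text']}" for seg in exchange
    )
    nodes.append(
        TextNode(
            text=text,
            metadata={
                "speakers": ", ".join(
                    sorted({s["metadata"]["speaker"] for s in exchange})
                ),
                "start": exchange[0]["metadata"]["start"],
                "end": exchange[-1]["metadata"]["end"],
            },
            # keep timestamps out of the embedded text so the embedding
            # stays semantic; speakers can stay in if they help retrieval
            excluded_embed_metadata_keys=["start", "end"],
        )
    )

index = VectorStoreIndex(nodes)
```

Is building `TextNode`s directly like this the intended path, or am I fighting the framework?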