Trying to build out a document and index store for transcription files

So, trying to figure out something: I'm trying to build out a document and index store. I have various transcription files (personal journals, work meetings, tutorials), and I want to build a pipeline to get these docs into the proper indexes. I extended the Document class to add file metadata and hashes, and I want to summarize each document before it is indexed. First off, is this something I need to call the LLM for directly, or is there some functionality within the indexer I'm forgetting?
If you want a summary per document, you can first create a list index per document and query it with response_mode="tree_summarize": https://gpt-index.readthedocs.io/en/latest/guides/use_cases.html#use-case-summarization-over-documents.
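A minimal sketch of that pattern, assuming the gpt_index API of that era (indexes constructed directly from Document objects, query() taking a response_mode argument); the directory name and summary prompt are placeholders:
Python
# minimal sketch, assuming the older gpt_index API where an index is built
# directly from Document objects and queried with a response_mode argument
from gpt_index import GPTListIndex, SimpleDirectoryReader

# one Document per file by default
documents = SimpleDirectoryReader('transcripts').load_data()

summaries = []
for doc in documents:
    index = GPTListIndex([doc])
    # tree_summarize builds the answer bottom-up over every node in the index
    response = index.query("Summarize this document.", response_mode="tree_summarize")
    summaries.append(str(response))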

What are you using the summary for? If you want to build an index per Document and also a higher-order index across documents, you can consider using our composability feature https://gpt-index.readthedocs.io/en/latest/how_to/composability.html
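For reference, a hedged sketch of the composability pattern, continuing the snippet above: each per-document index gets a summary via set_text, and a higher-order list index is built over the sub-indexes. The exact calls changed across gpt_index versions (later releases use ComposableGraph), so check the linked docs for your version rather than treating this as the definitive API:
Python
# hedged sketch of the older composability pattern; verify against the docs
# for your installed gpt_index version
from gpt_index import GPTListIndex

doc_indexes = []
for doc, summary in zip(documents, summaries):
    sub_index = GPTListIndex([doc])
    # the summary tells the parent index what this sub-index covers
    sub_index.set_text(summary)
    doc_indexes.append(sub_index)

# higher-order index across all per-document indexes
top_index = GPTListIndex(doc_indexes)
response = top_index.query("Which transcripts mention the indexing pipeline?")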
Seems inefficient to have a multi-level index where the nodes have just one file in them. I was thinking I might use langchain to summarize the file, do transcript cleanup and other transformations, and then route the resulting doc into the proper list index.
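If you go the langchain route, a hedged sketch of that summarization step (the splitter settings, model choice, and the summarize_transcript name are all assumptions, not anything from the thread):
Python
# hedged sketch: summarize a transcript with langchain before indexing it
from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document as LCDocument
from langchain.text_splitter import CharacterTextSplitter

def summarize_transcript(text: str) -> str:
    # split long transcripts so each chunk fits the model's context window
    splitter = CharacterTextSplitter(chunk_size=3000, chunk_overlap=200)
    docs = [LCDocument(page_content=chunk) for chunk in splitter.split_text(text)]
    chain = load_summarize_chain(OpenAI(temperature=0), chain_type="map_reduce")
    return chain.run(docs)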
mm composability is just a feature in case you want to define some structure over your data.

you can either use gpt index or langchain for summarization (you don't need composability in gpt index to do this); just use a list index with tree_summarize as detailed above.

let me know if you do try these features out though! happy to help
Yep just wrapping my head around it. Thx.
I have a similar use case where I'd like to create a summary per document, and I had a couple of questions in trying to implement your suggestion:
1) I'd like to leverage the directory reader to read all the PDFs and then iterate over them to create the list indexes - but it seems like the Document object isn't iterable? And the directory reader can't accept single file paths?
2) What is the efficient way to store all of the individual document indexes? Is it to compose them together into a higher-order index and leverage save_to_disk on that index?
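On (1) and (2), a hedged sketch assuming the gpt_index API of that period: SimpleDirectoryReader's load_data() returns a list of Document objects (one per file by default), which you can iterate to build one index per document, and each index can be persisted individually with save_to_disk. The paths here are placeholders:
Python
# hedged sketch: one list index per PDF, each persisted to its own JSON file
import os
from gpt_index import GPTListIndex, SimpleDirectoryReader

# load_data() returns a list of Document objects, one per file by default
documents = SimpleDirectoryReader('pdfs').load_data()

os.makedirs('indexes', exist_ok=True)
for i, doc in enumerate(documents):
    index = GPTListIndex([doc])
    # save each per-document index separately; a composed higher-order index
    # could be saved the same way
    index.save_to_disk(os.path.join('indexes', f'doc_{i}.json'))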
So I have this function:
Plain Text
def file_to_index(file_path):
    """
    :param file_path: str path to the file to be indexed
    :return: GPTListIndex built over a single FileDocument
    """
    # convert the file path to a Document subclass, then wrap it in a list index
    document = FileDocument(file_path)
    index = GPTListIndex([document])
    return index
Plain Text
class FileDocument(gpt_index_document):
    """
    A document that is stored in a file.
    """
    def __init__(self, filepath: str, summary: str = None, *args, **kwargs):
        self.filepath = filepath
        self.name = filepath.split('/')[-1]
        self.text = utils.read_file(filepath)
        self.summary = summary
        # pass the raw text and file metadata through to the base Document
        kwargs['text'] = self.text
        kwargs['extra_info'] = {
            'filepath': filepath,
            'file_hash': generate_file_hash(filepath),
            'name': self.name,
            'summary': self.summary
        }
        super().__init__(*args, **kwargs)
I'm thinking I actually want to save the file docs to disk as part of the process, before indexing, or maybe to a Mongo DB, which is where I'm currently planning on keeping the actual indexes. We'll make the document objects, which I think can be piped through one langchain component that sets the document summary, and another that holds a list of indexes and pipes the document to the proper index.
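A hedged sketch of that pre-index step (generate_file_hash isn't shown in the thread, so this version of it is an assumption, as is reusing the summarize_transcript sketch from above to fill in the summary before routing):
Python
# hedged sketch: a possible generate_file_hash, plus filling in the summary
# on the document before it is routed to its final index
import hashlib

def generate_file_hash(filepath: str) -> str:
    """Hash the file contents so re-ingesting an unchanged file can be skipped."""
    h = hashlib.sha256()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

def build_document(filepath: str) -> FileDocument:
    doc = FileDocument(filepath)
    # summarize first (e.g. with summarize_transcript above, or a list index
    # queried with tree_summarize), then stash the summary on the document
    doc.summary = summarize_transcript(doc.text)
    doc.extra_info['summary'] = doc.summary
    return doc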
Is utils.read_file a function that you also wrote, or are you leveraging something else? I'm trying to avoid having to write my own parsers and want to leverage the gpt_index ones.
Plain Text
def read_file(filepath):
    """Read a text file and return its contents as a string."""
    with open(filepath, 'r') as f:
        data = f.read()
    return data
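If the goal is to lean on gpt_index's built-in parsers instead of a hand-rolled reader, SimpleDirectoryReader accepts a file_metadata callable that attaches per-file extra_info to each parsed Document; whether your installed version has that argument is worth checking, so treat this as a hedged sketch:
Python
# hedged sketch: let SimpleDirectoryReader do the parsing and attach the same
# metadata FileDocument was adding by hand (the file_metadata argument may not
# exist in every gpt_index version)
from gpt_index import SimpleDirectoryReader

def file_metadata(filepath: str) -> dict:
    return {
        'filepath': filepath,
        'name': filepath.split('/')[-1],
        'file_hash': generate_file_hash(filepath),
    }

documents = SimpleDirectoryReader('transcripts', file_metadata=file_metadata).load_data()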