Am I missing a concept? SimpleDirectoryReader.load_data() returns more Documents than the input_files list I send it. Can someone explain how this makes sense?
Or should it not do that, and I must be doing something wrong?
PDFs get split per page, helps with citations
feel free to pass in your own pdf reader, or postprocess the outputs
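A hedged sketch of the "pass in your own pdf reader" option: SimpleDirectoryReader takes a file_extractor dict that maps a file suffix to a reader instance, and MyPDFReader below is a hypothetical reader that returns one Document per PDF file instead of one per page (pypdf is just an illustrative choice of parser).

Plain Text
from pypdf import PdfReader as PyPdfReader

from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.readers.base import BaseReader


class MyPDFReader(BaseReader):
    """Hypothetical reader: one Document per PDF file instead of one per page."""

    def load_data(self, file, extra_info=None):
        pdf = PyPdfReader(file)
        text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)
        return [Document(text=text, metadata=extra_info or {})]


documents = SimpleDirectoryReader(
    input_files=["my.pdf"],
    file_extractor={".pdf": MyPDFReader()},  # overrides the default per-page PDF parsing
).load_data()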
How do you update the document id in a predictable way then?
I generate a guid for each doc, and then I want to make sure it matches in all my systems (to enable deletion later).
guid + page number? Or merge the document objects that come from the same pdf file so that you only have one to worry about
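For the first option, a minimal sketch of "guid + page number", assuming documents came straight from SimpleDirectoryReader.load_data() and still carry the default file_path and page_label metadata keys:

Plain Text
import uuid

# one guid per source file, shared by all page-level Documents from that file
file_guids = {}
for doc in documents:
    guid = file_guids.setdefault(doc.metadata["file_path"], str(uuid.uuid4()))
    page = doc.metadata.get("page_label", "0")
    doc.id_ = f"{guid}-{page}"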
Any example I could look at for "merge the document objects that come from the same pdf file so that you only have one to worry about"?
Alternatively, how might you delete_ref_doc() using a pattern? <my_doc_id>-*
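As far as I know delete_ref_doc() has no wildcard support, but here's a hedged sketch of the pattern idea, assuming the default local docstore so index.ref_doc_info is populated (my_doc_id- is a placeholder prefix):

Plain Text
prefix = "my_doc_id-"
for ref_doc_id in list(index.ref_doc_info.keys()):
    if ref_doc_id.startswith(prefix):
        index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)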
Not terribly complicated, just python stuff πŸ™‚

Plain Text
documents = SimpleDirectoryReader(...).load_data()

merged_docs = {}
for doc in documents:
  if doc.metadata['file_path'] not in merged_docs:
    doc.id_ = "some id you calculated"
    merged_docs[doc.metadata['file_path']] = doc
  else:
    merged_docs[doc.metadata['file_path']].text += "\n\n" + doc.text

documents = list(merged_docs.values())
Oh. I wouldn't have thought the indexer would accept a list built from a dictionary like that. I would have worried about all the values being correct afterwards.
I convert the dict to a list at the end
documents = list(merged_docs.values())
Ah yes. I see that now. I guess I assumed that all the values in the List[Document] would be more complicated.
Nope, pretty simple objects, mostly just .text and .metadata attributes that you need to worry about
hmm...the delete_ref_doc() didn't work (the chat engine is unable to answer a question before I add the file, but after it can, and it still can after I delete the doc by id).
you started with a fresh chat history?
some code to reproduce might help
I just looked in the docstore.json and it looks like there are numerous ids still there. So something isn't quite right yet. (Yes, all my chats are 1-shot.)
maybe double check you are inserting documents with the doc ids you expect
Works fine in a quick test

Plain Text
>>> from llama_index.core import Document, VectorStoreIndex
>>> documents = [Document(text="I like dogs", id_="12345")]

>>> index = VectorStoreIndex.from_documents(documents)
>>> index.as_retriever().retrieve("test")[0].text
'I like dogs'

>>> index.delete_ref_doc("12345", delete_from_docstore=True)

>>> index.as_retriever().retrieve("test")[0].text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> 
The docstore.json should have just 1 entry per file if I did what you suggested above, right?
I'm seeing numerous entries for just 1 file.
Actually, this looks right
Attachment: image.png
There's just 1 ref_doc_info for 1 file
πŸ˜‚ I think I found the problem
Attachment: image.png
@Logan M thanks for bearing with me
There! It all works now πŸ™‚
LOL good catch
glad it works!
I wonder if someone could help me with the following issue. I am using the GithubRepositoryReader to read markdown files from my repository. When I run the code multiple times, it creates another set of embeddings in my pgvector database. How can I get it to replace the existing embeddings for a particular file?

Here is the code

Plain Text
from llama_index.core.extractors import (
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.github import GithubClient, GithubRepositoryReader


def index_repository(org, repo, branch, use_wiki):
    # create a GH client
    gh_client = GithubClient()

    # create a GH reader
    reader = GithubRepositoryReader(
        github_client=gh_client,
        owner=org,
        repo=repo,
        verbose=False,
        retries=3,
        filter_file_extensions=(
            [".md"],
            GithubRepositoryReader.FilterType.INCLUDE
        )
    )

    # load the documents
    docs = reader.load_data(branch=branch)
    embed_model = OpenAIEmbedding()
    vector_store = get_vector_store(table_name="wiki_docs")
    extractors = [
        SemanticSplitterNodeParser(
            buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
        ),
        TitleExtractor(nodes=5),
        SummaryExtractor(summaries=["prev", "self", "next"]),
        QuestionsAnsweredExtractor(questions=15, metadata=MetadataMode.EMBED),
        KeywordExtractor(keywords=10),
        embed_model,
    ]

    pipeline = IngestionPipeline(transformations=extractors, vector_store=vector_store)
    nodes = pipeline.run(documents=docs)


Part 2 of the question: how do I use the same reader to process the wiki documents attached to the repo?
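On the duplicate-embeddings part, a hedged sketch of one possible approach (not confirmed in this thread): give each Document a stable id and attach a docstore to the IngestionPipeline so re-runs upsert instead of adding new rows. Here docs, extractors and vector_store come from the snippet above; using file_path as the id is an assumption about the metadata GithubRepositoryReader attaches, and the docstore has to be persisted and reloaded between runs for the deduplication to work.

Plain Text
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

# stable, repeatable ids so a re-run maps onto the same entries
for doc in docs:
    doc.id_ = doc.metadata.get("file_path", doc.id_)

pipeline = IngestionPipeline(
    transformations=extractors,
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),  # persist and reload between runs
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
nodes = pipeline.run(documents=docs)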