Am I missing a concept? SimpleDirectoryReader.load_data() returns more Documents than the input_files list I send it. Can someone explain how this makes sense?
Or should it not do that, and I must be doing something wrong?
PDFs get split per page, helps with citations
feel free to pass in your own pdf reader, or postprocess the outputs
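A hedged sketch of the "pass in your own pdf reader" option: SimpleDirectoryReader takes a file_extractor dict that maps a file suffix to a reader instance, and MyPDFReader below is a hypothetical reader that returns one Document per PDF file instead of one per page (pypdf is just an illustrative choice of parser).

Plain Text
from pypdf import PdfReader as PyPdfReader

from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.readers.base import BaseReader


class MyPDFReader(BaseReader):
    """Hypothetical reader: one Document per PDF file instead of one per page."""

    def load_data(self, file, extra_info=None):
        pdf = PyPdfReader(file)
        text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)
        return [Document(text=text, metadata=extra_info or {})]


documents = SimpleDirectoryReader(
    input_files=["my.pdf"],
    file_extractor={".pdf": MyPDFReader()},  # overrides the default per-page PDF parsing
).load_data()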
How do you update the document id in a predictable way then?
I generate a guid for each doc, and then I want to make sure it matches in all my systems (to enable deletion later).
guid + page number? Or merge the document objects that come from the same pdf file so that you only have one to worry about
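For the first option, a minimal sketch of "guid + page number", assuming documents came straight from SimpleDirectoryReader.load_data() and still carry the default file_path and page_label metadata keys:

Plain Text
import uuid

# one guid per source file, shared by all page-level Documents from that file
file_guids = {}
for doc in documents:
    guid = file_guids.setdefault(doc.metadata["file_path"], str(uuid.uuid4()))
    page = doc.metadata.get("page_label", "0")
    doc.id_ = f"{guid}-{page}"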
Any example I could look at for "merge the document objects that come from the same pdf file so that you only have one to worry about"?
Alternatively, how might you delete_ref_doc() using a pattern? <my_doc_id>-*
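As far as I know delete_ref_doc() has no wildcard support, but here's a hedged sketch of the pattern idea, assuming the default local docstore so index.ref_doc_info is populated (my_doc_id- is a placeholder prefix):

Plain Text
prefix = "my_doc_id-"
for ref_doc_id in list(index.ref_doc_info.keys()):
    if ref_doc_id.startswith(prefix):
        index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)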
Not terribly complicated, just python stuff πŸ™‚

Plain Text
documents = SimpleDirectoryReader(...).load_data()

merged_docs = {}
for doc in documents:
  if doc.metadata['file_path'] not in merged_docs:
    doc.id_ = "some id you calculated"
    merged_docs[doc.metadata['file_path']] = doc
  else:
    merged_docs[doc.metadata['file_path']].text += "\n\n" + doc.text

documents = list(merged_docs.values())
Oh. I wouldn't have thought the indexer would accept a list built from a dictionary like that. I would have worried about all the values being correct afterwards.
I convert the dict to a list at the end
documents = list(merged_docs.values())
Ah yes. I see that now. I guess I assumed that all the values in the List[Document] would be more complicated.
Nope, pretty simple objects, mostly just .text and .metadata attributes that you need to worry about
hmm...the delete_ref_doc() didn't work (the chat engine is unable to answer a question before I add the file, but after it can, and it still can after I delete the doc by id).
you started with a fresh chat history?
some code to reproduce might help
I just looked in the docstore.json and it looks like there are numerous ids still there. So something isn't quite right yet. (Yes, all my chats are 1-shot.)
maybe double check you are inserting documents with the doc ids you expect
Works fine in a quick test

Plain Text
>>> from llama_index.core import Document, VectorStoreIndex
>>> documents = [Document(text="I like dogs", id_="12345")]

>>> index = VectorStoreIndex.from_documents(documents)
>>> index.as_retriever().retrieve("test")[0].text
'I like dogs'

>>> index.delete_ref_doc("12345", delete_from_docstore=True)

>>> index.as_retriever().retrieve("test")[0].text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> 
The docstore.json should have just 1 entry per file if I did what you suggested above, right?
I'm seeing numerous entries for just 1 file.
Actually, this looks right
Attachment: image.png
There's just 1 ref_doc_info for 1 file
πŸ˜‚ I think I found the problem
Attachment: image.png
@Logan M thanks for bearing with me
There! It all works now πŸ™‚
LOL good catch
glad it works!
I wonder if someone could help me with the following issue. I am using the GithubRepositoryReader to read markdown files from my repository. When I run the code multiple times, it creates another set of embeddings in my pgvector database. How can I get it to replace the existing embeddings for a particular file?

Here is the code

Plain Text
from llama_index.core.extractors import (
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.github import GithubClient, GithubRepositoryReader


def index_repository(org, repo, branch, use_wiki):
    # create a GH client
    gh_client = GithubClient()

    # create a GH reader
    reader = GithubRepositoryReader(
        github_client=gh_client,
        owner=org,
        repo=repo,
        verbose=False,
        retries=3,
        filter_file_extensions=(
            [".md"],
            GithubRepositoryReader.FilterType.INCLUDE
        )
    )

    # load the documents
    docs = reader.load_data(branch=branch)
    embed_model = OpenAIEmbedding()
    vector_store = get_vector_store(table_name="wiki_docs")
    extractors = [
        SemanticSplitterNodeParser(
            buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
        ),
        TitleExtractor(nodes=5),
        SummaryExtractor(summaries=["prev", "self", "next"]),
        QuestionsAnsweredExtractor(questions=15, metadata=MetadataMode.EMBED),
        KeywordExtractor(keywords=10),
        embed_model,
    ]

    pipeline = IngestionPipeline(transformations=extractors, vector_store=vector_store)
    nodes = pipeline.run(documents=docs)


Part 2 of the question: how do I use the same reader to process the wiki documents attached to the repo?
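On the duplicate-embeddings part, a hedged sketch of one possible approach (not confirmed in this thread): give each Document a stable id and attach a docstore to the IngestionPipeline so re-runs upsert instead of adding new rows. Here docs, extractors and vector_store come from the snippet above; using file_path as the id is an assumption about the metadata GithubRepositoryReader attaches, and the docstore has to be persisted and reloaded between runs for the deduplication to work.

Plain Text
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

# stable, repeatable ids so a re-run maps onto the same entries
for doc in docs:
    doc.id_ = doc.metadata.get("file_path", doc.id_)

pipeline = IngestionPipeline(
    transformations=extractors,
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),  # persist and reload between runs
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
nodes = pipeline.run(documents=docs)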