
Updated 12 months ago

GithubRepositoryReader

At a glance

A community member is trying to use GithubRepositoryReader from the LlamaIndex library to read data from a GitHub repository, but is hitting KeyError: 'commit'. They tried several repositories and got the same error each time.

The comments suggest that the error comes from requesting the "main" branch on a repository whose default branch is "master". The thread also discusses a missing try-except block around the key lookup in the reader, and ways to speed up data ingestion, such as including only specific file extensions or excluding certain directories.

One community member suggests trying the following configuration:

github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner='run-llama',
    repo='llama_index',
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "docs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        ["*.py"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
).load_data(branch='main')
Hi, I am trying to use GithubRepositoryReader here: https://docs.llamaindex.ai/en/stable/examples/data_connectors/GithubRepositoryReaderDemo.html
like this:
github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner="jerryjliu",
    repo="llama_index",
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "logs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
).load_data(branch='main')
print(len(documents))

But getting this error:
  File "/Users/ig/.pyenv/versions/3.11.3/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/Users/ig/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 267, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/llama_index/readers/github/repository/github_client.py", line 361, in get_branch
    return GitBranchResponseModel.from_json(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/api.py", line 63, in from_json
    return cls.from_dict(kvs, infer_missing=infer_missing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/api.py", line 70, in from_dict
    return _decode_dataclass(cls, kvs, infer_missing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/core.py", line 172, in _decode_dataclass
    field_value = kvs[field.name]
                  ~~~^^^^^^^^^^^^
KeyError: 'commit'

I am getting the same error on other repos.
17 comments
For example, I tried it on this repo, https://github.com/345ishaan/DenseLidarNet, and got the same error.
Seems like the reason was that I was trying to fetch main instead of master.
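The main-versus-master mismatch can be avoided by asking GitHub for the repository's default branch before calling load_data. A minimal sketch, assuming the public GitHub REST endpoint GET /repos/{owner}/{repo} and its default_branch field; the two helper names are hypothetical and are not part of LlamaIndex:

```python
import json
import urllib.request
from typing import Optional


def default_branch_from_payload(payload: dict, fallback: str = "main") -> str:
    """Pick the default branch out of a GitHub repository payload.

    Falls back to `fallback` when the field is missing, e.g. when the
    payload is an error body rather than a repository description.
    """
    branch = payload.get("default_branch")
    return branch if isinstance(branch, str) and branch else fallback


def fetch_default_branch(owner: str, repo: str, token: Optional[str] = None) -> str:
    """Query the GitHub REST API for the repository's default branch."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return default_branch_from_payload(payload)
```

The result could then be passed on as load_data(branch=fetch_default_branch(owner, repo, GITHUB_TOKEN)).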
I think this should be fixed tbh
pip install -U llama-index-readers-github
if using v0.10.x anyways
There's a big try except to avoid key errors
Ah actually, missed a try except there, lame
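Until the reader guards that lookup itself, the caller can wrap the load in its own try-except and fall back to another branch name. A rough sketch, assuming the installed version still raises KeyError for a missing branch as in the traceback above; load_with_branch_fallback is a hypothetical helper, and the loader is passed in as a plain callable so the sketch does not depend on LlamaIndex:

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def load_with_branch_fallback(
    load_fn: Callable[[str], List[T]],
    branches: Iterable[str] = ("main", "master"),
) -> List[T]:
    """Try each branch name in order and return the first successful load."""
    candidates = tuple(branches)
    last_error = None
    for branch in candidates:
        try:
            return load_fn(branch)
        except KeyError as exc:
            # The branch lookup failed for this name; try the next candidate.
            last_error = exc
    raise ValueError(
        f"could not load any of the branches {candidates}"
    ) from last_error
```

With a reader instance this might be called as load_with_branch_fallback(lambda b: reader.load_data(branch=b)).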
Also, not related, but would it make sense to exclude directories wherever they appear in the tree_path? E.g. currently, if I want to exclude all folders named "examples", it only excludes the one at the top level and not something like "xyz/abc/examples": https://github.com/run-llama/llama_index/blob/9d9e10bd4c2ad4f4cacfc6dab5ff20cc31c515e4/llama-index-integrations/readers/llama-index-readers-github/llama_index/readers/github/repository/base.py#L161
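The behavior being asked for, excluding a directory name at any depth, amounts to checking every parent component of a path rather than only its prefix. A small illustration of that idea, not the reader's actual implementation; file_in_excluded_dir is a hypothetical name:

```python
from pathlib import PurePosixPath
from typing import Iterable


def file_in_excluded_dir(file_path: str, excluded_dirs: Iterable[str]) -> bool:
    """Return True if any parent directory of file_path has an excluded name.

    Only directory components are checked, so a file that merely shares a
    name with an excluded directory is not filtered out.
    """
    excluded = set(excluded_dirs)
    parent_parts = PurePosixPath(file_path).parent.parts
    return any(part in excluded for part in parent_parts)
```

With this check, "xyz/abc/examples/demo.py" is excluded by ["examples"] just like "examples/demo.py", while "src/examples.py" is kept.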
I was trying to ingest the llama-index repo, and with the configuration I am currently trying it takes a lot of time to load the documents.
If I clone the repo into a local dir, do a simple os.walk on the root dir and chunk it, it is very fast.
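The local approach described here can be sketched with the standard library alone. This assumes the repository has already been cloned to root_dir; iter_source_files and its default filters are hypothetical choices that mirror the filters discussed in this thread:

```python
import os
from typing import Iterable, Iterator


def iter_source_files(
    root_dir: str,
    include_exts: Iterable[str] = (".py",),
    exclude_dirs: Iterable[str] = ("examples", "tests", "docs"),
) -> Iterator[str]:
    """Yield paths under root_dir that match include_exts, skipping exclude_dirs."""
    exts = tuple(ext.lower() for ext in include_exts)
    skip = set(exclude_dirs)
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Prune in place so os.walk never descends into excluded directories,
        # at any depth of the tree.
        dirnames[:] = sorted(d for d in dirnames if d not in skip)
        for name in sorted(filenames):
            if name.lower().endswith(exts):
                yield os.path.join(dirpath, name)
```

Pruning dirnames in place is what keeps the walk cheap: excluded subtrees are never visited, rather than being visited and filtered afterwards.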
There might be some files in the docs that take some time to load. I usually specify required_exts=[...] in the loader to only pull files I care about (I hope I spelled that kwarg right lol)
this folder has only md files? https://github.com/run-llama/llama_index/tree/9d9e10bd4c2ad4f4cacfc6dab5ff20cc31c515e4/docs

I tried excluding them via file_extensions and still couldn't see it finish.
that can be a generalization but let me see how fast that yields
That folder has much more than md files 👀

Ohh you are using the github repo reader too
ok i think you are asking to try out:
github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner='run-llama',
    repo='llama_index',
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "docs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        ["*.py"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
).load_data(branch='main')