GithubRepositoryReader

At a glance

A community member is trying to use GithubRepositoryReader from the LlamaIndex library to read data from a GitHub repository, but hits a KeyError: 'commit'. They get the same error on multiple repositories.

The comments suggest the error comes from requesting the "main" branch on a repository whose branch is named "master". The thread also covers a missing try-except block around the failing call in the reader's GitHub client, and ways to speed up ingestion, such as including only specific file extensions or excluding certain directories.

One community member suggests trying the following configuration:

from llama_index.readers.github import GithubClient, GithubRepositoryReader

github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner='run-llama',
    repo='llama_index',
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "docs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(["*.py"], GithubRepositoryReader.FilterType.INCLUDE),
).load_data(branch='main')

345ishaan

Hi, I am trying to use GithubRepositoryReader here: https://docs.llamaindex.ai/en/stable/examples/data_connectors/GithubRepositoryReaderDemo.html
like this:

github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner="jerryjliu",
    repo="llama_index",
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "logs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
).load_data(branch='main')
print(len(documents))

But getting this error:

  File "/Users/ig/.pyenv/versions/3.11.3/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/Users/ig/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 267, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/llama_index/readers/github/repository/github_client.py", line 361, in get_branch
    return GitBranchResponseModel.from_json(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/api.py", line 63, in from_json
    return cls.from_dict(kvs, infer_missing=infer_missing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/api.py", line 70, in from_dict
    return _decode_dataclass(cls, kvs, infer_missing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/core.py", line 172, in _decode_dataclass
    field_value = kvs[field.name]
                  ~~~^^^^^^^^^^^^
KeyError: 'commit'

getting same error on other repos

17 comments

345ishaan

For example, I tried the following repo, https://github.com/345ishaan/DenseLidarNet, and it yields the same error.

345ishaan

Seems like the reason was that I was trying to fetch main instead of master.
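
Since the error appears to come from asking for a branch that does not exist, one option is to look up the repository's default branch first instead of hard-coding main. A minimal sketch, assuming the GitHub REST endpoint GET /repos/{owner}/{repo}, whose JSON response includes a default_branch field. The helper names are hypothetical and not part of LlamaIndex:

```python
import json
import urllib.request


def default_branch_from_payload(payload_text: str) -> str:
    """Extract the default branch name from a GET /repos/{owner}/{repo} response body."""
    payload = json.loads(payload_text)
    if "default_branch" not in payload:
        # Defensive: error bodies carry a "message" field instead.
        raise ValueError(f"unexpected response: {payload.get('message', payload)}")
    return payload["default_branch"]


def fetch_default_branch(owner: str, repo: str, token: str) -> str:
    """Hypothetical helper: ask the GitHub REST API for the repo's default branch."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return default_branch_from_payload(response.read().decode("utf-8"))
```

The returned name could then be passed along as load_data(branch=...) in place of a fixed 'main'.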

Logan M

I think this should be fixed tbh

Logan M

pip install -U llama-index-readers-github

Logan M

if using v0.10.x anyways

Logan M

There's a big try except to avoid key errors

Logan M

Ah actually, missed a try except there, lame

Logan M

https://github.com/run-llama/llama_index/blob/9d9e10bd4c2ad4f4cacfc6dab5ff20cc31c515e4/llama-index-integrations/readers/llama-index-readers-github/llama_index/readers/github/repository/github_client.py#L361
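
The traceback suggests the branch response is decoded straight into a dataclass, so a response body without a commit field (which is what GitHub's error body for a missing branch looks like) fails with a bare KeyError. A rough sketch of the kind of guard being discussed, using a plain dict check. ensure_branch_payload is a hypothetical name, not a function in llama-index-readers-github:

```python
import json


def ensure_branch_payload(payload_text: str, branch: str) -> dict:
    """Check that a branch response has a 'commit' field before decoding it further."""
    payload = json.loads(payload_text)
    if "commit" not in payload:
        # GitHub error bodies typically look like {"message": "Branch not found", ...}.
        detail = payload.get("message", "no 'commit' field in response")
        raise ValueError(f"could not load branch {branch!r}: {detail}")
    return payload
```

With a check like this, requesting a non-existent branch would surface the branch name and GitHub's own message rather than KeyError: 'commit'.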

345ishaan

Also, not related, but would it make sense to exclude directories wherever they appear in the tree_path? E.g. currently if I want to exclude all folders named "examples", it only excludes the one at the top level and not something like "xyz/abc/examples": https://github.com/run-llama/llama_index/blob/9d9e10bd4c2ad4f4cacfc6dab5ff20cc31c515e4/llama-index-integrations/readers/llama-index-readers-github/llama_index/readers/github/repository/base.py#L161
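
The behavior asked for here, matching an excluded name against any directory component of the path, could be sketched as follows. This is only an illustration of the idea with a hypothetical helper name; it is not how the reader's filter is currently written:

```python
from pathlib import PurePosixPath


def in_excluded_directory(tree_path: str, excluded_names: set[str]) -> bool:
    """Return True if any directory component of tree_path is in excluded_names."""
    # parts[:-1] drops the final component, treated here as the file name.
    directories = PurePosixPath(tree_path).parts[:-1]
    return any(part in excluded_names for part in directories)


print(in_excluded_directory("xyz/abc/examples/demo.py", {"examples"}))  # True
print(in_excluded_directory("src/examples.py", {"examples"}))           # False
```

Comparing whole path components avoids accidentally excluding files whose names merely contain the word, such as src/examples.py.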

345ishaan

I was trying to ingest the llama-index repo, and with the configuration I am currently using it is taking a lot of time to load documents.

345ishaan

If I clone the repo into a local dir, do a simple os.walk on the root dir, and chunk, it is very fast.
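
For comparison, the local approach described here might look roughly like the sketch below. It assumes the repository has already been cloned to root_dir; the function name and default filters are illustrative only:

```python
import os


def collect_files(root_dir, include_exts=(".py",), exclude_dirs=("examples", "tests", "docs", ".git")):
    """Walk a local clone and return relative paths of files with matching suffixes."""
    matches = []
    for current_dir, dir_names, file_names in os.walk(root_dir):
        # Prune excluded directories in place so os.walk never descends into them.
        dir_names[:] = [name for name in dir_names if name not in exclude_dirs]
        for file_name in file_names:
            if os.path.splitext(file_name)[1] in include_exts:
                full_path = os.path.join(current_dir, file_name)
                matches.append(os.path.relpath(full_path, root_dir))
    return sorted(matches)
```

Because everything is read from disk, this avoids per-file API requests, which may be part of why it feels much faster than reading through the GitHub API.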

Logan M

There might be some files in the docs that take some time to load. I usually specify required_exts=[...] in the loader to only pull files I care about (I hope I spelled that kwarg right lol)

345ishaan

this folder has only md files? https://github.com/run-llama/llama_index/tree/9d9e10bd4c2ad4f4cacfc6dab5ff20cc31c515e4/docs

I tried excluding them via file_extensions and still couldn't see it finishing.

345ishaan

Do you mean to only include e.g. .py in filter_file_extensions? https://github.com/run-llama/llama_index/blob/7276609e56704b25ab58b98cf1ab842d6937161b/llama-index-integrations/readers/llama-index-readers-github/llama_index/readers/github/repository/base.py#L81

345ishaan

that can be a generalization but let me see how fast that yields

Logan M

That folder has much more than md files 👀

Ohh you are using the github repo reader too

345ishaan

ok i think you are asking to try out:

github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner='run-llama',
    repo='llama_index',
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "docs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        ["*.py"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
).load_data(branch='main')
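
One thing worth checking in this configuration is the form of the extension entries. Assuming the reader compares each file's suffix (as returned by os.path.splitext) against the list, rather than treating entries as glob patterns, an entry like "*.py" would never match while ".py" would. That is an assumption about the reader's internals and should be verified against the linked base.py. A small stand-alone illustration of such a comparison, with a hypothetical helper name:

```python
import os


def suffix_is_included(path: str, extensions: list[str]) -> bool:
    """Plain suffix membership test; entries are NOT treated as glob patterns."""
    return os.path.splitext(path)[1] in extensions


print(suffix_is_included("llama_index/core/base.py", [".py"]))   # True
print(suffix_is_included("llama_index/core/base.py", ["*.py"]))  # False
```

If the reader does work this way, an INCLUDE filter of ["*.py"] would select no files at all, so [".py"] may be the form to try.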
