Updated 11 months ago

GithubRepositoryReader

Hi, I am trying to use GithubRepositoryReader, following the demo here: https://docs.llamaindex.ai/en/stable/examples/data_connectors/GithubRepositoryReaderDemo.html
like this:
Plain Text
from llama_index.readers.github import GithubClient, GithubRepositoryReader

github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner="jerryjliu",
    repo="llama_index",
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "logs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
).load_data(branch='main')
print(len(documents))

But getting this error:
Plain Text
  File "/Users/ig/.pyenv/versions/3.11.3/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/Users/ig/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 267, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/llama_index/readers/github/repository/github_client.py", line 361, in get_branch
    return GitBranchResponseModel.from_json(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/api.py", line 63, in from_json
    return cls.from_dict(kvs, infer_missing=infer_missing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/api.py", line 70, in from_dict
    return _decode_dataclass(cls, kvs, infer_missing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ig/Documents/llm_infra/llm_infra/lib/python3.11/site-packages/dataclasses_json/core.py", line 172, in _decode_dataclass
    field_value = kvs[field.name]
                  ~~~^^^^^^^^^^^^
KeyError: 'commit'

I'm getting the same error on other repos.
For example, I tried it on https://github.com/345ishaan/DenseLidarNet and got the same error.
Seems like the reason was that I was trying to fetch main instead of master.
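To avoid guessing between main and master, the default branch can be looked up before calling load_data. Here is a minimal sketch, assuming the GitHub REST API endpoint GET /repos/{owner}/{repo}, whose response includes a default_branch field. The helper names default_branch_from_repo_info and fetch_default_branch are made up for illustration and are not part of llama-index.

```python
import json
import urllib.request
from typing import Optional


def default_branch_from_repo_info(repo_info: dict, fallback: str = "main") -> str:
    """Return the default branch named in a repository payload, or `fallback`."""
    return repo_info.get("default_branch") or fallback


def fetch_default_branch(owner: str, repo: str, token: Optional[str] = None) -> str:
    """Hypothetical helper: query GET /repos/{owner}/{repo} for the default branch."""
    request = urllib.request.Request(f"https://api.github.com/repos/{owner}/{repo}")
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        # A token is optional for public repos but raises the rate limit.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return default_branch_from_repo_info(json.load(response))
```

The result can then be passed along, e.g. load_data(branch=fetch_default_branch("345ishaan", "DenseLidarNet")).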
I think this should be fixed tbh
pip install -U llama-index-readers-github
if using v0.10.x anyways
There's a big try except to avoid key errors
Ah actually, missed a try except there, lame
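For illustration, the kind of guard being discussed could look roughly like this. It is a sketch only, not the library's actual code: extract_commit_sha is a hypothetical helper, and it assumes that a successful branch response carries a commit object with a sha, while a failed lookup returns a payload with a message such as "Branch not found" and no commit key.

```python
def extract_commit_sha(branch_payload: dict, branch: str) -> str:
    """Return the head commit SHA, or raise a readable error instead of KeyError."""
    if "commit" not in branch_payload:
        # Assumed error shape: {"message": "..."} with no "commit" key.
        message = branch_payload.get("message", "no 'commit' field in response")
        raise ValueError(f"Could not load branch {branch!r}: {message}")
    return branch_payload["commit"]["sha"]
```

With a check like this, asking for a branch that does not exist reports the branch name and the API message rather than a bare KeyError: 'commit'.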
Also, unrelated: would it make sense to exclude a directory wherever it appears in the tree_path? Currently, if I want to exclude all folders named "examples", it only excludes the one at the repo root and not something like "xyz/abc/examples". https://github.com/run-llama/llama_index/blob/9d9e10bd4c2ad4f4cacfc6dab5ff20cc31c515e4/llama-index-integrations/readers/llama-index-readers-github/llama_index/readers/github/repository/base.py#L161
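The behavior being asked for can be sketched as a check against every directory component of a file's path, rather than only the leading one. This is an illustration of the idea, not the reader's current implementation, and in_excluded_directory is a hypothetical name.

```python
from pathlib import PurePosixPath


def in_excluded_directory(file_path: str, excluded_names: set) -> bool:
    """True if any directory component of `file_path` is in `excluded_names`."""
    directories = PurePosixPath(file_path).parts[:-1]  # drop the file name itself
    return any(part in excluded_names for part in directories)
```

Here "xyz/abc/examples/demo.py" would be excluded for {"examples"}, while a file merely named "src/examples.py" would not.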
I was trying to ingest the llama-index repo, and with my current configuration it is taking a long time to load the documents.
If I clone the repo locally, do a simple os.walk on the root dir, and chunk the files, it is very fast.
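A local walk of that kind might look like the sketch below, assuming the repo has already been cloned to disk. The function name iter_source_files and the default extension and directory lists are placeholders chosen for illustration.

```python
import os


def iter_source_files(root, extensions=(".py",),
                      excluded_dirs=("examples", "tests", "docs", ".git")):
    """Yield file paths under `root` with a wanted extension, skipping excluded dirs."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories,
        # at any depth of the tree.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            if os.path.splitext(name)[1] in extensions:
                yield os.path.join(dirpath, name)
```

Because this reads from the local filesystem, it avoids one API round trip per file, which is one plausible reason it feels much faster than fetching through the GitHub API.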
There might be some files in the docs that take some time to load. I usually specify required_exts=[...] in the loader to only pull files I care about (I hope I spelled that kwarg right lol)
Doesn't this folder have only md files? https://github.com/run-llama/llama_index/tree/9d9e10bd4c2ad4f4cacfc6dab5ff20cc31c515e4/docs

I tried excluding them via file_extensions and still couldn't see it finish.
That might be a generalization, but let me see how fast that runs.
That folder has much more than md files 👀

Ohh you are using the github repo reader too
OK, I think you are asking me to try this:
Plain Text
github_client = GithubClient(github_token=GITHUB_TOKEN, verbose=True)
documents = GithubRepositoryReader(
    github_client=github_client,
    owner='run-llama',
    repo='llama_index',
    use_parser=False,
    verbose=False,
    filter_directories=(["examples", "tests", "docs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        # Extensions are given with a leading dot (".py"), not as glob patterns ("*.py").
        [".py"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
).load_data(branch='main')