LlamaIndex

Updated 2 years ago

github_repo exploring

At a glance

A community member is having trouble getting the GitHub repository loader to work, encountering a "ModuleNotFoundError" error. Other community members provide suggestions, such as modifying the import statement and disabling the use_parser option. The issue appears to be related to the GitHub repository reader, and a community member named HAL 9000 is involved in helping to resolve it.

After some back-and-forth, HAL 9000 provides a working code example that resolves the issue. The community member confirms that the code works, but encounters a timeout error when trying to index multiple directories. HAL 9000 explains that the timeout is likely due to the GitHub rate limiting, and suggests adjusting the concurrent_requests parameter to avoid the issue.

The community members also discuss the possibility of making the llama-hub a subtree or submodule to make it easier to add and test new loaders.

hey guys 😄 has anyone tried the github_repo loader? https://llamahub.ai/l/github_repo
i cant seem to get it working, looks like i need to import something more from github than what it says in the example in the link above. anyone knows?

this is the error: from modules.github_repo import GithubClient, GithubRepositoryReader
ModuleNotFoundError: No module named 'modules'

rewriting line 7 to this worked: from llama_index.readers.llamahub_modules import GithubClient, GithubRepositoryReader

I have managed to index the files, but it seems that it is only indexing plain text files, not code.

is the filter_directories recursive? meaning that every folder inside the top folder I specify will be indexed? when trying to index this folder with only .cs files, the json result file is empty (using 0 tokens when indexing).

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens

maybe I am misunderstanding the filtering... this is my implementation:

Attachment

@HAL 9000 i read somewhere that you have knowledge of the github loader, any idea?😊 the umbraco.json file is empty when running the file in the image above

I just wrote this https://github.com/Softlandia-Ltd/metaflow-index with the github reader

It seems like you're not getting any files from the reader if token usage is zero

and it should recursively get files from directories

Hi @lars thanks for using LlamaIndex and GithubReader.
I’ll take a look at it when I get home. In the meantime, could you try disabling use_parser?

And could you also turn on verbose and post the output here if possible?

this is the output after running the code in the image above: (message.txt)

and this is the content of the umbraco.json file:

{"index_struct_id": "53347f08-c351-42a4-a6c3-8ebc46a95fee", "docstore": {"docs": {"53347f08-c351-42a4-a6c3-8ebc46a95fee": {"text": null, "doc_id": "53347f08-c351-42a4-a6c3-8ebc46a95fee", "embedding": null, "extra_info": null, "nodes_dict": {}, "id_map": {}, "embeddings_dict": {}, "__type__": "simple_dict"}}}, "vector_store": {"simple_vector_store_data_dict": {"embedding_dict": {}, "text_id_to_doc_id": {}}}}
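A quick way to confirm that a saved index like this one really is empty is to count its nodes. The sketch below is only an illustration: count_nodes is a hypothetical helper, not part of llama_index, and it assumes the JSON layout shown above (docstore, then docs, then nodes_dict), which may differ between versions.

```python
import json


def count_nodes(index_json: str) -> int:
    """Hypothetical helper: total number of entries across every doc's nodes_dict."""
    data = json.loads(index_json)
    docs = data.get("docstore", {}).get("docs", {})
    return sum(len(doc.get("nodes_dict") or {}) for doc in docs.values())


# Trimmed copy of the umbraco.json content posted above.
EMPTY_INDEX = (
    '{"index_struct_id": "53347f08-c351-42a4-a6c3-8ebc46a95fee", '
    '"docstore": {"docs": {"53347f08-c351-42a4-a6c3-8ebc46a95fee": '
    '{"text": null, "nodes_dict": {}, "id_map": {}, "embeddings_dict": {}}}}}'
)

print(count_nodes(EMPTY_INDEX))  # prints 0: no nodes were indexed
```

A count of zero matches the "0 tokens" log lines: the reader returned no documents, so there was nothing to embed.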

you added src/Umbraco.Web.Common to include it but looks like the reader somehow ignores it.

Checking src\Umbraco.Web.Common whether to FilterType.INCLUDE it based on the filter directories: ['src/Umbraco.Web.Common']
        ignoring directory Umbraco.Web.Common due to filter
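One plausible reading of the log above is a path-separator mismatch: the path being checked uses a backslash (src\Umbraco.Web.Common) while the filter uses a forward slash. The thread does not confirm that this is the actual cause or how the reader was fixed, so the sketch below only illustrates the general idea of normalizing separators before comparing. Both to_posix and is_included are hypothetical helpers, not part of the loader.

```python
from pathlib import PureWindowsPath
from typing import List


def to_posix(path: str) -> str:
    """Convert a path that may use Windows separators to forward slashes."""
    return PureWindowsPath(path).as_posix()


def is_included(path: str, include_dirs: List[str]) -> bool:
    """Hypothetical helper: True if `path` equals or sits under an included directory."""
    normalized = to_posix(path)
    return any(
        normalized == d.rstrip("/") or normalized.startswith(d.rstrip("/") + "/")
        for d in include_dirs
    )


filters = ["src/Umbraco.Web.Common"]

# A naive string comparison fails when the separators differ.
print("src\\Umbraco.Web.Common" == filters[0])        # False
# After normalization the directory matches the filter.
print(is_included("src\\Umbraco.Web.Common", filters))  # True
```

Comparing on a single separator style keeps the filter behaving the same way on Windows and on POSIX systems.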

I found the problem. I had this open PR: https://github.com/emptycrown/llama-hub/pull/73
that I think would solve it but I wasn't actually properly testing. I'll push my fixes and hopefully this PR should resolve the issue. Sorry for late replies. fyi @lars

@HAL 9000 no worries! awesome work dude!! thanks so much, i will update you 😁 so impressed of the work you guys do! 🔥

@jerryjliu0 @HAL 9000 is this github loader fix released? https://github.com/emptycrown/llama-hub/pull/109

i've been lurking in this PR for some days 😛 is there a way to follow which PRs/issues are released? couldn't find any link to issues/PRs in the release log

should be fixed in a followup PR! https://github.com/emptycrown/llama-hub/pull/104

@lars yeah sorry to clarify, llamahub loaders don't follow a "release" schedule - the moment the PR is merged, it's immediately available for use on llamahub.ai

ah, isee. good to know 😄 thats amazing, ty!

@jerryjliu0 sorry for asking a lot of questions, but how do i use the new version of the code that lives in llamahub? i have updated to llama_index 0.4.28, but the github loader still doesn't work. when i ctrl+click into the GithubRepositoryReader, i see that the code in the PR you referred to above is not applied.
https://github.com/emptycrown/llama-hub/pull/104/files#diff-f682320a7f5b1ac241fe19eb5bf61d8f583ab8655c3f7d2472bb14882690f6ea

this is how i load the loader modules: from llama_index.readers.llamahub_modules import GithubClient, GithubRepositoryReader

@lars by default we cache when you use download_loader, try download_loader(..., refresh_cache=True)

okay. it didn't work, but i managed to use the new code by changing the import from

from llama_index.readers.llamahub_modules import GithubClient, GithubRepositoryReader

to

from llama_index.readers.github_readers import github_api_client, github_repository_reader

and init:

loader = github_repository_reader.GithubRepositoryReader
github_client = github_api_client.GithubClient

The one in LlamaIndex is not actually up to date.
I have tested with your code example:

pip freeze > uninstall.txt && pip uninstall -y -r uninstall.txt && pip cache purge && pip install --upgrade  httpx llama-index && python main.py

from llama_index import download_loader

download_loader("GithubRepositoryReader", 
                refresh_cache=True, 
                loader_hub_url="https://raw.githubusercontent.com/ahmetkca/llama-hub/github-reader-test-and-fix/loader_hub")
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient

def main():
    github_client = GithubClient()
    github_repo_reader = GithubRepositoryReader(
            github_client,
            owner = "umbraco",
            repo = "Umbraco-CMS",
            use_parser = False,
            filter_directories = (["src/Umbraco.Web.Common"], GithubRepositoryReader.FilterType.INCLUDE),
            filter_file_extensions = ([".cs"], GithubRepositoryReader.FilterType.INCLUDE),
            verbose = True,
            concurrent_requests = 2,
    )

    docs = github_repo_reader.load_data(branch="v10/main")

    for doc in docs:
        print(doc.extra_info)
if __name__ == "__main__":
    main()

@lars Could you try the above code?

@HAL 9000 yes, your code worked!

import os
from llama_index import download_loader, GPTSimpleVectorIndex

os.environ["OPENAI_API_KEY"] = 'token'

# Download the loader first so the llamahub_modules import below can resolve.
download_loader("GithubRepositoryReader",
                refresh_cache=True,
                loader_hub_url="https://raw.githubusercontent.com/ahmetkca/llama-hub/github-reader-test-and-fix/loader_hub")
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient

def main():
    github_client = GithubClient("token")
    github_repo_reader = GithubRepositoryReader(
        github_client,
        owner = "umbraco",
        repo = "Umbraco-CMS",
        use_parser = False,
        # Forward slashes: "\U" is an invalid escape in a normal Python string.
        filter_directories = (["src/Umbraco.Web.Common/Extensions"], GithubRepositoryReader.FilterType.INCLUDE),
        filter_file_extensions = ([".cs"], GithubRepositoryReader.FilterType.INCLUDE),
        verbose = True,
        concurrent_requests = 2,
    )

    docs = github_repo_reader.load_data(branch="v10/main")

    # Build the vector index from the loaded documents and persist it.
    index = GPTSimpleVectorIndex(docs)
    index.save_to_disk('./indexes/umbraco.json')

    for doc in docs:
        print(doc.extra_info)

if __name__ == "__main__":
    main()

this worked

i had to start with only one folder, because execution timed out on multiple folders. this is the error message i got when it timed out:

do you have any idea why it times out during execution? anyway, it works 😄 so cool. will explore more with it tomorrow! so grateful for your work 🙂

Thank you for your patience @lars. This is my first time officially contributing to open source.
For your question, you can adjust concurrent_requests. It is 5 by default, which means the GithubRepoReader will retrieve 5 files concurrently.
I think the default rate limit set by GitHub is 5000 requests per hour. You can increase concurrent_requests, but that also means there is a higher chance you will run into a ConnectionTimeout because of GitHub's rate limiting. I suggest 5 or below.
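To make the idea of a concurrency cap concrete, here is a minimal sketch of how a limit like concurrent_requests can be enforced with an asyncio.Semaphore. It is not the reader's actual implementation: fetch_all and the simulated fetch are hypothetical, and asyncio.sleep stands in for a real HTTP request so the example runs without network access.

```python
import asyncio


async def fetch_all(paths, concurrent_requests=5):
    """Fetch every path, allowing at most `concurrent_requests` in flight at once."""
    semaphore = asyncio.Semaphore(concurrent_requests)
    in_flight = 0  # requests currently running
    peak = 0       # highest number of simultaneous requests observed

    async def fetch(path):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real HTTP request
            in_flight -= 1
            return f"contents of {path}"

    results = await asyncio.gather(*(fetch(p) for p in paths))
    return results, peak


if __name__ == "__main__":
    files = [f"file{i}.cs" for i in range(10)]
    results, peak = asyncio.run(fetch_all(files, concurrent_requests=2))
    print(len(results), "files fetched, peak concurrency:", peak)
```

A lower cap trades speed for fewer simultaneous connections, which is why starting with a small value such as 2 and raising it gradually is a reasonable way to test.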

that's so cool! hopefully i can contribute in the future as well. great, thanks for the explanation. i will test with different values

@HAL 9000 thanks for the help! out of curiosity, is this code not checked into llamahub yet?

it should be

oh wait I was pointing to my branch when testing
@lars You should actually use the below code that doesn't point to my branch.

from llama_index import download_loader

download_loader("GithubRepositoryReader", 
                refresh_cache=True)
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient

def main():
    github_client = GithubClient()
    github_repo_reader = GithubRepositoryReader(
            github_client,
            owner = "umbraco",
            repo = "Umbraco-CMS",
            use_parser = False,
            filter_directories = (["src/Umbraco.Web.Common"], GithubRepositoryReader.FilterType.INCLUDE),
            filter_file_extensions = ([".cs"], GithubRepositoryReader.FilterType.INCLUDE),
            verbose = True,
            concurrent_requests = 2,
    )

    docs = github_repo_reader.load_data(branch="v10/main")

    for doc in docs:
        print(doc.extra_info)
if __name__ == "__main__":
    main()

@jerryjliu0 What do you think about making llama-hub a subtree or submodule?

that's not a bad idea, i haven't done the scoping on the effort required. but seems like it would make it easier to add / test new loaders!

okey, will test later.
