Updated 2 years ago

github_repo exploring

hey guys 😄 has anyone tried the github_repo loader? https://llamahub.ai/l/github_repo
i can't seem to get it working. it looks like i need to import something more from github than what the example in the link above says. does anyone know?

this is the error: from modules.github_repo import GithubClient, GithubRepositoryReader
ModuleNotFoundError: No module named 'modules'
rewriting line 7 to this worked: `from llama_index.readers.llamahub_modules import GithubClient, GithubRepositoryReader`

I have managed to index the files, but it seems that it's only indexing plain text files, not code.

is filter_directories recursive? meaning, will every folder inside the top folder I specify be indexed? when I try to index this folder with only .cs files, the json result file is empty (using 0 tokens when indexing).

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens

maybe I am misunderstanding the filtering... this is my implementation: (image.png)
@HAL 9000 i read somewhere that you have knowledge of the github loader, any idea?😊 the umbraco.json file is empty when running the file in the image above
It seems like you're not getting any files from the reader if token usage is zero
and it should recursively get files from directories
Hi @lars thanks for using LlamaIndex and GithubReader.
I'll take a look at it when I get home. In the meantime, could you try disabling use_parser?
Could you also turn on verbose and post the output here if possible?
this is the output after running the code in the image above: (message.txt)

and this is the content of the umbraco.json file:
{"index_struct_id": "53347f08-c351-42a4-a6c3-8ebc46a95fee", "docstore": {"docs": {"53347f08-c351-42a4-a6c3-8ebc46a95fee": {"text": null, "doc_id": "53347f08-c351-42a4-a6c3-8ebc46a95fee", "embedding": null, "extra_info": null, "nodes_dict": {}, "id_map": {}, "embeddings_dict": {}, "__type__": "simple_dict"}}}, "vector_store": {"simple_vector_store_data_dict": {"embedding_dict": {}, "text_id_to_doc_id": {}}}}
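A quick way to confirm that a saved index like the one above is really empty is to inspect the JSON directly. The following is only a sketch: `index_is_empty` and `EMPTY_INDEX` are hypothetical names, the sample is a trimmed copy of the file posted above, and the key layout is assumed from that file alone (other llama_index versions may store indexes differently).

```python
import json

# Trimmed copy of the umbraco.json content posted in the thread.
EMPTY_INDEX = (
    '{"docstore": {"docs": {"53347f08": {"text": null, "nodes_dict": {}, '
    '"id_map": {}, "embeddings_dict": {}}}}, '
    '"vector_store": {"simple_vector_store_data_dict": '
    '{"embedding_dict": {}, "text_id_to_doc_id": {}}}}'
)


def index_is_empty(index_json: str) -> bool:
    """Return True if the saved index holds no nodes and no embeddings.

    Assumes the layout seen in the thread: docstore -> docs -> <id> -> nodes_dict
    and vector_store -> simple_vector_store_data_dict -> embedding_dict.
    """
    data = json.loads(index_json)
    docs = data.get("docstore", {}).get("docs", {})
    has_nodes = any(doc.get("nodes_dict") for doc in docs.values())
    embeddings = (
        data.get("vector_store", {})
        .get("simple_vector_store_data_dict", {})
        .get("embedding_dict", {})
    )
    return not has_nodes and not embeddings


if __name__ == "__main__":
    print(index_is_empty(EMPTY_INDEX))
```

An empty result here points at the reader returning no documents rather than at the indexing step, which matches the zero token usage reported earlier.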
you added src/Umbraco.Web.Common to include it, but it looks like the reader somehow ignores it.
Checking src\Umbraco.Web.Common whether to FilterType.INCLUDE it based on the filter directories: ['src/Umbraco.Web.Common']
        ignoring directory Umbraco.Web.Common due to filter
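The log compares a backslash path (`src\Umbraco.Web.Common`) against a forward-slash filter (`src/Umbraco.Web.Common`). The thread does not state the root cause, so treating this as a separator mismatch is an assumption. Below is a minimal sketch of a recursive INCLUDE filter that normalizes separators first. All function names are hypothetical and this is not the loader's actual implementation.

```python
from typing import Sequence


def normalize(path: str) -> str:
    """Use forward slashes and drop leading/trailing separators."""
    return path.replace("\\", "/").strip("/")


def is_included(path: str, include_dirs: Sequence[str]) -> bool:
    """True if `path` is one of `include_dirs` or lies anywhere below one."""
    p = normalize(path)
    for d in include_dirs:
        d = normalize(d)
        if p == d or p.startswith(d + "/"):
            return True
    return False


def should_descend(path: str, include_dirs: Sequence[str]) -> bool:
    """True if a tree walk needs to enter `path` to reach an included directory."""
    p = normalize(path)
    if p == "":
        return True  # always enter the repository root
    if is_included(p, include_dirs):
        return True  # inside an included directory: recurse into everything
    # Ancestors of an included directory must be entered as well.
    return any(normalize(d).startswith(p + "/") for d in include_dirs)


if __name__ == "__main__":
    dirs = ["src/Umbraco.Web.Common"]
    print(is_included("src\\Umbraco.Web.Common", dirs))
    print(should_descend("src", dirs))
```

Matching on `d + "/"` rather than on the bare prefix keeps a sibling such as `src/Umbraco.Web.CommonExtras` from being picked up by accident.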
I found the problem. I had this open PR, https://github.com/emptycrown/llama-hub/pull/73,
that I think would solve it, but I hadn't actually tested it properly. I'll push my fixes, and hopefully this PR will resolve the issue. Sorry for the late replies. fyi @lars
@HAL 9000 no worries! awesome work dude!! thanks so much, i will update you 😁 so impressed by the work you guys do! 🔥
@jerryjliu0 @HAL 9000 is this github loader fix released? https://github.com/emptycrown/llama-hub/pull/109

i've been lurking in this PR for some days 😛 is there a way to follow which PRs/issues are released? couldn't find any links to issues/PRs in the release log
@lars yeah sorry to clarify, llamahub loaders don't follow a "release" schedule - the moment the PR is merged, it's immediately available for use on llamahub.ai
ah, i see. good to know 😄 that's amazing, ty!
@jerryjliu0 sorry for asking a lot of questions, but how do i use the new version of the code that lives in llamahub? i have updated to llama_index 0.4.28, but the github loader still doesn't work. when i ctrl+click into GithubRepositoryReader, i see that the code in the PR you referred to above is not applied.
https://github.com/emptycrown/llama-hub/pull/104/files#diff-f682320a7f5b1ac241fe19eb5bf61d8f583ab8655c3f7d2472bb14882690f6ea

this is how i load the loader modules: from llama_index.readers.llamahub_modules import GithubClient, GithubRepositoryReader
@lars by default we cache when you use download_loader, try download_loader(..., refresh_cache=True)
okay. it didn't work, but i managed to use the new code by changing the import from `from llama_index.readers.llamahub_modules import GithubClient, GithubRepositoryReader`

to
from llama_index.readers.github_readers import github_api_client, github_repository_reader
and init: loader = github_repository_reader.GithubRepositoryReader
github_client = github_api_client.GithubClient
The one in LlamaIndex is not actually up to date.
I have tested with your code example:
pip freeze > uninstall.txt && pip uninstall -y -r uninstall.txt && pip cache purge && pip install --upgrade httpx llama-index && python main.py

from llama_index import download_loader

download_loader("GithubRepositoryReader", 
                refresh_cache=True, 
                loader_hub_url="https://raw.githubusercontent.com/ahmetkca/llama-hub/github-reader-test-and-fix/loader_hub")
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient

def main():
    github_client = GithubClient()
    github_repo_reader = GithubRepositoryReader(
            github_client,
            owner = "umbraco",
            repo = "Umbraco-CMS",
            use_parser = False,
            filter_directories = (["src/Umbraco.Web.Common"], GithubRepositoryReader.FilterType.INCLUDE),
            filter_file_extensions = ([".cs"], GithubRepositoryReader.FilterType.INCLUDE),
            verbose = True,
            concurrent_requests = 2,
    )

    docs = github_repo_reader.load_data(branch="v10/main")

    for doc in docs:
        print(doc.extra_info)
if __name__ == "__main__":
    main()
@lars Could you try the above code?
@HAL 9000 yes, your code worked!

import os
from llama_index import download_loader, GPTSimpleVectorIndex

os.environ["OPENAI_API_KEY"] = "token"

# Download (or refresh) the loader before importing it, as in the example above.
download_loader("GithubRepositoryReader",
                refresh_cache=True,
                loader_hub_url="https://raw.githubusercontent.com/ahmetkca/llama-hub/github-reader-test-and-fix/loader_hub")
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient


def main():
    github_client = GithubClient("token")
    github_repo_reader = GithubRepositoryReader(
            github_client,
            owner = "umbraco",
            repo = "Umbraco-CMS",
            use_parser = False,
            # Forward slashes here: "\U" inside a normal Python string literal is a syntax error.
            filter_directories = (["src/Umbraco.Web.Common/Extensions"], GithubRepositoryReader.FilterType.INCLUDE),
            filter_file_extensions = ([".cs"], GithubRepositoryReader.FilterType.INCLUDE),
            verbose = True,
            concurrent_requests = 2,
    )

    docs = github_repo_reader.load_data(branch="v10/main")

    # Build the vector index from the loaded documents and save it to disk.
    index = GPTSimpleVectorIndex(docs)
    index.save_to_disk('./indexes/umbraco.json')

    for doc in docs:
        print(doc.extra_info)

if __name__ == "__main__":
    main()

this worked
i had to start with only one folder, because execution timed out with multiple folders. this is the error message i got when it timed out:
do you have any idea why it times out during execution? anyway, it works 😄 so cool. will explore more with it tomorrow! so grateful for your work 🙂
Thank you for your patience @lars. This is my first time officially contributing to open source.
For your question, you can adjust concurrent_requests. It is 5 by default, which means the GithubRepositoryReader will retrieve 5 files concurrently.
I think the default rate limit set by GitHub is 5000 requests per hour. You can increase concurrent_requests, but that also means there is a higher chance you will run into a ConnectionTimeout because of GitHub's rate limiting. I suggest 5 or below.
that's so cool! hopefully i can contribute in the future as well. great, thanks for the explanation. i will test with different values
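To illustrate what a cap like concurrent_requests does, here is a minimal sketch of bounded concurrency with `asyncio.Semaphore`. It is not the reader's actual code: `fetch_all` and `FakeFetcher` are hypothetical, and the fake fetcher only sleeps instead of calling GitHub. For scale, 5000 requests per hour works out to roughly 1.4 requests per second sustained.

```python
import asyncio


async def fetch_all(paths, fetch, concurrent_requests=5):
    """Fetch every path, with at most `concurrent_requests` fetches in flight."""
    semaphore = asyncio.Semaphore(concurrent_requests)

    async def bounded(path):
        async with semaphore:  # waits here while the cap is reached
            return await fetch(path)

    # gather() returns results in the same order as `paths`.
    return await asyncio.gather(*(bounded(p) for p in paths))


class FakeFetcher:
    """Stand-in for a network call; records the peak number of parallel fetches."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def __call__(self, path):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        self.in_flight -= 1
        return f"contents of {path}"


if __name__ == "__main__":
    fetcher = FakeFetcher()
    files = [f"file{i}.cs" for i in range(10)]
    results = asyncio.run(fetch_all(files, fetcher, concurrent_requests=2))
    print(len(results), "files fetched, peak concurrency:", fetcher.peak)
```

A lower cap trades throughput for fewer simultaneous connections, which is the knob suggested above for avoiding timeouts.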
@HAL 9000 thanks for the help! out of curiosity, is this code not checked into llamahub yet?
it should be
oh wait I was pointing to my branch when testing
@lars You should actually use the below code that doesn't point to my branch.
from llama_index import download_loader

download_loader("GithubRepositoryReader", 
                refresh_cache=True)
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient

def main():
    github_client = GithubClient()
    github_repo_reader = GithubRepositoryReader(
            github_client,
            owner = "umbraco",
            repo = "Umbraco-CMS",
            use_parser = False,
            filter_directories = (["src/Umbraco.Web.Common"], GithubRepositoryReader.FilterType.INCLUDE),
            filter_file_extensions = ([".cs"], GithubRepositoryReader.FilterType.INCLUDE),
            verbose = True,
            concurrent_requests = 2,
    )

    docs = github_repo_reader.load_data(branch="v10/main")

    for doc in docs:
        print(doc.extra_info)
if __name__ == "__main__":
    main()
@jerryjliu0 What do you think about making llama-hub a subtree or submodule?
that's not a bad idea, i haven't done the scoping on the effort required. but it seems like it would make it easier to add / test new loaders!
okay, will test later.