Updated last year

how to extract data from github?

At a glance

The community member asked how to extract data from GitHub. Another community member provided a step-by-step guide on using the GithubRepositoryReader class from the LlamaIndex library to extract data from a GitHub repository. The guide includes installing the necessary packages, setting up the GitHub token and OpenAI API key, loading data from the repository, creating an index, and querying the index. The community member also mentioned that more details can be found in a GitHub notebook.

@kapa.ai how to extract data from github?
Please stand by @Matthews as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
To extract data from GitHub using LlamaIndex, you can use the GithubRepositoryReader class. Here's a step-by-step guide:

  1. Install the necessary packages. You'll need llama-index, nest_asyncio, and httpx.
Plain Text
!pip install llama-index
!pip install nest_asyncio httpx
# Allow nested event loops so the reader's async code can run inside a notebook.
import nest_asyncio
nest_asyncio.apply()


  2. Import the necessary modules.
Plain Text
from llama_index import VectorStoreIndex, GithubRepositoryReader
import os


  3. Set your GitHub token and OpenAI API key as environment variables.
Plain Text
%env OPENAI_API_KEY=your_openai_api_key
%env GITHUB_TOKEN=your_github_token


  4. Use the GithubRepositoryReader to load data from the GitHub repository.
Plain Text
github_token = os.environ.get("GITHUB_TOKEN")
owner = "jerryjliu"
repo = "llama_index"
branch = "main"

documents = GithubRepositoryReader(
    github_token=github_token,
    owner=owner,
    repo=repo,
    use_parser=False,
    verbose=False,
    ignore_directories=["examples"],
).load_data(branch=branch)


  5. Create an index from the documents.
Plain Text
index = VectorStoreIndex.from_documents(documents)


  6. You can now use the query_engine to query the index.
Plain Text
query_engine = index.as_query_engine()
# query() takes only the query itself; it does not accept a verbose argument.
response = query_engine.query("Your query here")


  7. Display the response.
Plain Text
from IPython.display import Markdown, display
display(Markdown(f"<b>{response}</b>"))


Please replace your_openai_api_key and your_github_token with your actual OpenAI API key and GitHub token.
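A missing or placeholder credential usually only shows up later as an authentication error, so it can help to check both variables before loading the repository. The sketch below is a minimal example and not part of LlamaIndex: `require_env` is a hypothetical helper name, and treating values that start with `your_` as placeholders is an assumption based on the placeholder names used above.

```python
import os


def require_env(name, environ=None):
    """Return an environment variable's value, or raise if it is unset or a placeholder."""
    # Allow a plain dict to be passed in for testing; default to the real environment.
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    # Assumption: the placeholder values from the steps above start with "your_".
    if not value or value.startswith("your_"):
        raise RuntimeError(f"{name} is not set to a real value")
    return value
```

With this in place, `github_token = require_env("GITHUB_TOKEN")` could stand in for the `os.environ.get("GITHUB_TOKEN")` call in step 4, and `require_env("OPENAI_API_KEY")` could be called once before building the index.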
You can find more details in the GithubRepositoryReaderDemo notebook on GitHub.
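Once the index is built, the same query engine can answer several questions without reloading the repository. The helper below is a hypothetical convenience wrapper, not a LlamaIndex function. It only assumes that the object passed in has a `query` method that accepts a string, as `query_engine` does in the steps above.

```python
def ask_all(query_engine, questions):
    """Run each question through the query engine and collect the answers as text."""
    answers = {}
    for question in questions:
        response = query_engine.query(question)
        # str() yields the response text, the same conversion the f-string in the display step performs.
        answers[question] = str(response)
    return answers
```

For example, `ask_all(index.as_query_engine(), ["Your first query", "Your second query"])` would return a dictionary that maps each question to its answer.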
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------