LlamaIndex

Updated last year

how to extract data from github?

At a glance

The community member asked how to extract data from GitHub. Another community member provided a step-by-step guide on using the GithubRepositoryReader class from the LlamaIndex library to extract data from a GitHub repository. The guide includes installing the necessary packages, setting up the GitHub token and OpenAI API key, loading data from the repository, creating an index, and querying the index. The community member also mentioned that more details can be found in a GitHub notebook.

@kapa.ai how to extract data from github?

Please stand by @Matthews as I search the knowledge base. I normally take about 20 seconds or so. :wave:

----------------------
To extract data from GitHub using LlamaIndex, you can use the GithubRepositoryReader class. Here's a step-by-step guide:

Install the necessary packages. You'll need llama-index, nest_asyncio, and httpx.

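# Notebook cell: the leading "!" runs pip as a shell command
!pip install llama-index nest_asyncio httpx

# Allow nested event loops inside the notebook's already-running loop
import nest_asyncio
nest_asyncio.apply()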
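
Some background on why `nest_asyncio.apply()` appears in the install step: a Jupyter kernel already runs an asyncio event loop, and plain asyncio refuses to start a second one from inside it. The reader presumably drives its GitHub requests through an event loop, which would explain the patch. The sketch below uses only the standard library to reproduce the restriction; `fetch` and `notebook_cell` are made-up names for illustration.

```python
import asyncio


async def fetch() -> str:
    """Stand-in for an async network call."""
    return "data"


async def notebook_cell() -> str:
    """Simulates code running inside an already-running event loop."""
    coro = fetch()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "nested loop allowed"


result = asyncio.run(notebook_cell())
print(result)
```

Calling `nest_asyncio.apply()` first patches asyncio so that this kind of nested call is permitted, which is what the notebook relies on.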

Import the necessary modules.

# Top-level import path as used by llama-index releases before 0.10
from llama_index import VectorStoreIndex, GithubRepositoryReader
import os

Set your GitHub token and OpenAI API key as environment variables.

%env OPENAI_API_KEY=your_openai_api_key
%env GITHUB_TOKEN=your_github_token
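
Outside a notebook the `%env` magic is not available, so it can help to check the variables in plain Python and fail early when one is missing. This is a minimal sketch; `require_env` is a hypothetical helper, not part of LlamaIndex, and treating values that start with `your_` as missing is an assumption based on the placeholders used in this guide.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    # Treat the guide's placeholder values (your_...) as missing too.
    if not value or value.startswith("your_"):
        raise RuntimeError(f"Set the {name} environment variable before continuing")
    return value
```

A later step could then use `github_token = require_env("GITHUB_TOKEN")` instead of `os.environ.get("GITHUB_TOKEN")`, which silently returns `None` when the variable is unset.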

Use the GithubRepositoryReader to load data from the GitHub repository.

github_token = os.environ.get("GITHUB_TOKEN")
owner = "jerryjliu"
repo = "llama_index"
branch = "main"

documents = GithubRepositoryReader(
    github_token=github_token,
    owner=owner,
    repo=repo,
    use_parser=False,
    verbose=False,
    ignore_directories=["examples"],
).load_data(branch=branch)
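
To make the effect of `ignore_directories=["examples"]` concrete, here is a rough illustration of directory-prefix filtering over repository paths. It is not the library's implementation, and the reader's real matching rule may differ; `is_ignored` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def is_ignored(path: str, ignore_directories: list[str]) -> bool:
    """Return True when `path` lies inside any of the ignored directories."""
    parts = PurePosixPath(path).parts
    for directory in ignore_directories:
        ignored = PurePosixPath(directory).parts
        # A file is skipped when the ignored directory is a prefix of its path.
        if ignored and parts[: len(ignored)] == ignored:
            return True
    return False


# Hypothetical file listing for a repository
repo_paths = ["README.md", "examples/demo.py", "llama_index/readers/base.py"]
kept = [p for p in repo_paths if not is_ignored(p, ["examples"])]
print(kept)
```

Comparing whole path components rather than raw string prefixes keeps a sibling directory such as `examples_old/` from being dropped by accident.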

Create an index from the documents.

index = VectorStoreIndex.from_documents(documents)

You can now use the query_engine to query the index.

query_engine = index.as_query_engine()
response = query_engine.query("Your query here")

Display the response.

from IPython.display import Markdown, display
display(Markdown(f"<b>{response}</b>"))

Please replace your_openai_api_key and your_github_token with your actual OpenAI API key and GitHub token.

You can find more details in the GithubRepositoryReaderDemo notebook on GitHub.
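
The imports above match the llama-index releases available when this answer was written. Releases from 0.10 onward moved the reader into a separate `llama-index-readers-github` package and changed its constructor to take a client object and filter tuples. The sketch below assumes that newer layout; `load_repo_documents` is a hypothetical wrapper, and the arguments should be checked against the documentation for the installed version.

```python
def load_repo_documents(github_token: str, owner: str, repo: str, branch: str = "main"):
    """Load a repository's files as documents using the split-package reader."""
    if not github_token:
        raise ValueError("A GitHub token is required")

    # Imported lazily so this module loads even without the package installed.
    # Assumes: pip install llama-index llama-index-readers-github
    from llama_index.readers.github import GithubClient, GithubRepositoryReader

    client = GithubClient(github_token=github_token, verbose=False)
    reader = GithubRepositoryReader(
        github_client=client,
        owner=owner,
        repo=repo,
        use_parser=False,
        verbose=False,
        # Counterpart of ignore_directories=["examples"] in the older API.
        filter_directories=(["examples"], GithubRepositoryReader.FilterType.EXCLUDE),
    )
    return reader.load_data(branch=branch)
```

In those releases `VectorStoreIndex` is imported from `llama_index.core`, and the indexing and querying steps otherwise stay the same.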
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
