I want to build a chatbot trained on a custom knowledge base. In the examples there's one using HTML data. I want it to use a folder full of PDF files instead. Tell me what I need to modify in the examples to do this.
Please stand by @adityasingh95 as I search the knowledge base. We are experiencing especially high traffic right now, so this might take longer than usual. :wave:
----------------------
To build a chatbot trained on a custom knowledge base using a folder full of PDF files, you'll need to modify the data loading part of the example. In the provided extracts, the example uses the UnstructuredReader to load HTML files. You'll need to use a different reader that can handle PDF files.

Unfortunately, the provided extracts do not include information on how to load PDF files using LlamaIndex. However, you can use an external library to convert your PDF files into plain text or another format that LlamaIndex can ingest, and then follow the rest of the example.

Here's the example code for loading HTML files:

Plain Text
from llama_index import download_loader, GPTVectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

years = [2022, 2021, 2020, 2019]
UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)
    # insert year metadata into each document
    for d in year_docs:
        d.extra_info = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)


You'll need to replace the UnstructuredReader with a PDF reader and modify the loop to load PDF files from your folder. Once you have the text data, you can continue with the rest of the example to build your chatbot.
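To make the swap concrete, here is a minimal sketch that keeps the per-year enumeration from the HTML example and changes only the paths and the reader. The `UBER_{year}.pdf` naming scheme is hypothetical (adjust to your own filenames), and the reader calls are shown as comments because they assume the llama-hub `PDFReader` loader obtained via `download_loader("PDFReader")`:

```python
from pathlib import Path

def pdf_paths_for_years(folder, years):
    # Mirror the UBER_{year}.html naming scheme from the example above,
    # but with .pdf files (hypothetical layout -- adjust to your files).
    return {year: Path(folder) / f"UBER_{year}.pdf" for year in years}

paths = pdf_paths_for_years("./data/UBER", [2022, 2021, 2020, 2019])

# Each path would then go through a PDF-capable reader instead of
# UnstructuredReader, e.g. (assuming the llama-hub PDFReader loader):
#
#   PDFReader = download_loader("PDFReader")
#   loader = PDFReader()
#   for year, path in paths.items():
#       year_docs = loader.load_data(file=path)
#       for d in year_docs:
#           d.extra_info = {"year": year}
```

The metadata loop is unchanged from the HTML version; only the file extension and the reader differ.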

For more information on LlamaIndex, you can refer to the documentation ().
----------------------
Note: you can ask me a follow-up question by @-mentioning me again :speech_balloon:
----------------------
@kapa.ai Can you provide me with a modified version of the sec10k example that uses SimpleDirectoryReader instead?
Let me check the knowledge base for you. One moment! :books:
----------------------
Sure, I can provide you with a modified version of the sec10k example that uses SimpleDirectoryReader instead. Here's the modified code:

Plain Text
from llama_index import download_loader, GPTVectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

# Replace the UnstructuredReader with SimpleDirectoryReader
from llama_index import SimpleDirectoryReader

# Replace the loader with SimpleDirectoryReader and specify the folder containing your files
loader = SimpleDirectoryReader('./data/your_folder')

# Load the documents
all_docs = loader.load_data()

# Continue with the rest of the example


Make sure to replace './data/your_folder' with the path to the folder containing your files. The SimpleDirectoryReader will load all the files in the specified folder.

For more information on LlamaIndex, you can refer to the documentation ().
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai can you show me where I'll put it in the sec 10k example?
Let me check the knowledge base for you. One moment! :books:
----------------------
I apologize for the confusion earlier. Here's the modified version of the sec 10k example using SimpleDirectoryReader:

Plain Text
import nest_asyncio
nest_asyncio.apply()

from llama_index import SimpleDirectoryReader, LLMPredictor, ServiceContext, GPTVectorStoreIndex
from llama_index.response.pprint_utils import pprint_response
from langchain import OpenAI

from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

# Configure LLM service
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=-1, streaming=True))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

# Load data using SimpleDirectoryReader
lyft_docs = SimpleDirectoryReader(input_files=["../data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["../data/10k/uber_2021.pdf"]).load_data()

# Build indices
lyft_index = GPTVectorStoreIndex.from_documents(lyft_docs)
uber_index = GPTVectorStoreIndex.from_documents(uber_docs)

# Continue with the rest of the example


In this modified version, I replaced the original data loading code with SimpleDirectoryReader to load PDF files for Lyft and Uber. Make sure to replace the file paths in input_files with the correct paths to your PDF files.
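The unused imports in the snippet (QueryEngineTool, ToolMetadata, SubQuestionQueryEngine) hint at the rest of the 10-K example: each index's query engine gets wrapped in a named tool, and a sub-question engine composes them. A hedged sketch of those remaining steps follows; the llama_index calls are kept as comments since they assume the same 0.6-era API as the snippet, and the tool names and descriptions are illustrative, not prescribed by the example:

```python
# Plain data showing the shape of the two tools; names and
# descriptions are illustrative.
tool_specs = [
    {"name": "lyft_10k", "description": "Lyft 2021 10-K filing"},
    {"name": "uber_10k", "description": "Uber 2021 10-K filing"},
]

# With the indices built above, the corresponding calls would look
# roughly like:
#
#   lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)
#   uber_engine = uber_index.as_query_engine(similarity_top_k=3)
#   query_engine_tools = [
#       QueryEngineTool(query_engine=lyft_engine,
#                       metadata=ToolMetadata(**tool_specs[0])),
#       QueryEngineTool(query_engine=uber_engine,
#                       metadata=ToolMetadata(**tool_specs[1])),
#   ]
#   s_engine = SubQuestionQueryEngine.from_defaults(
#       query_engine_tools=query_engine_tools)
#   response = s_engine.query(
#       "Compare revenue growth of Uber and Lyft in 2021")
```

The sub-question engine uses the tool descriptions to decide which index each sub-question should go to, which is why descriptive names matter here.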

For more information on LlamaIndex, you can refer to the documentation ().
----------------------
@kapa.ai can you give me the sec10k example?
Let me check the knowledge base for you. One moment! :books:
----------------------
I apologize, but the provided extracts do not include the complete sec 10k example. However, I can provide you with a link to the full tutorial notebook () mentioned in one of the extracts. You can refer to this notebook for the complete sec 10k example.

For more information on LlamaIndex, you can refer to the documentation ().
----------------------
@kapa.ai can you modify the example to read only one year of data instead of 3?
Let me check the knowledge base for you. One moment! :books:
----------------------
Sure, I can modify the example to read only one year of data instead of multiple years. Here's the modified code snippet from the provided extracts:

Plain Text
from llama_index import download_loader, GPTVectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

year = 2021  # Choose the year you want to load
UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

loader = UnstructuredReader()
year_docs = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)

# Insert year metadata into each document
for d in year_docs:
    d.extra_info = {"year": year}

all_docs = year_docs


In this modified version, I removed the loop over multiple years and set the year variable to the desired year (e.g., 2021). The rest of the code remains the same.

Please note that this example is for loading HTML files. If you need to load data from a different file format, you'll need to use an appropriate reader.
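For instance, to load the same single year from a PDF instead of HTML, only the path and the reader change. A sketch, with a hypothetical `UBER_2021.pdf` filename and the llama-hub `PDFReader` loader shown in comments, since it would need to be downloaded first:

```python
from pathlib import Path

year = 2021
# Hypothetical filename -- adjust to wherever your PDF actually lives.
pdf_path = Path(f"./data/UBER/UBER_{year}.pdf")

# Then swap UnstructuredReader for a PDF-capable reader, e.g.:
#
#   PDFReader = download_loader("PDFReader")
#   year_docs = PDFReader().load_data(file=pdf_path)
#   for d in year_docs:
#       d.extra_info = {"year": year}
```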

For more information on LlamaIndex, you can refer to the documentation ().
----------------------