The example uses `UnstructuredReader` to load HTML files. You'll need to use a different reader that can handle PDF files.

```python
from llama_index import download_loader, GPTVectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

years = [2022, 2021, 2020, 2019]

UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)
loader = UnstructuredReader()

doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)
    # insert year metadata into each document
    for d in year_docs:
        d.extra_info = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)
```
You can replace the `UnstructuredReader` with a PDF reader and modify the loop to load PDF files from your folder. Once you have the text data, you can continue with the rest of the example to build your chatbot.

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
You could use `SimpleDirectoryReader` instead. Here's the modified code:

```python
from llama_index import GPTVectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage

# Replace the UnstructuredReader with SimpleDirectoryReader
from llama_index import SimpleDirectoryReader

# Replace the loader with SimpleDirectoryReader and specify the folder containing your files
loader = SimpleDirectoryReader('./data/your_folder')

# Load the documents
all_docs = loader.load_data()

# Continue with the rest of the example
```
Replace `'./data/your_folder'` with the path to the folder containing your files. The `SimpleDirectoryReader` will load all the files in the specified folder.
Here's an example using `SimpleDirectoryReader`:

```python
import nest_asyncio
nest_asyncio.apply()

from llama_index import SimpleDirectoryReader, LLMPredictor, ServiceContext, GPTVectorStoreIndex
from llama_index.response.pprint_utils import pprint_response
from langchain import OpenAI
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

# Configure LLM service
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=-1, streaming=True))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

# Load data using SimpleDirectoryReader
lyft_docs = SimpleDirectoryReader(input_files=["../data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["../data/10k/uber_2021.pdf"]).load_data()

# Build indices
lyft_index = GPTVectorStoreIndex.from_documents(lyft_docs)
uber_index = GPTVectorStoreIndex.from_documents(uber_docs)

# Continue with the rest of the example
```
This example uses `SimpleDirectoryReader` to load PDF files for Lyft and Uber. Make sure to replace the file paths in `input_files` with the correct paths to your PDF files.
To load only one year, you can simplify the code:

```python
from llama_index import download_loader, GPTVectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

year = 2021  # Choose the year you want to load

UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)
loader = UnstructuredReader()

year_docs = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)

# Insert year metadata into each document
for d in year_docs:
    d.extra_info = {"year": year}

all_docs = year_docs
```
Set the `year` variable to the desired year (e.g., 2021). The rest of the code remains the same.