Based on the information provided in the extracts, you can use LlamaIndex to achieve your goal. Here's a general approach:
- Data Connectors: Use data connectors to ingest your existing data from their native source and format. In your case, these would be the PDF reports.
- Data Indexes: Structure your data in intermediate representations that are easy and performant for LLMs to consume. This will help in extracting the required information from the reports.
- Engines: Use engines to provide natural language access to your data. Query engines can be used as powerful retrieval interfaces for knowledge-augmented output.
- Data Agents: These are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more. They can help in extracting the required information.
- Application Integrations: Tie LlamaIndex back into the rest of your ecosystem. This could be LangChain, Flask, Docker, ChatGPT, or anything else.
For extracting information from your PDF reports, you can follow the tutorial provided in the extracts. Here's a simplified version of the code that loads a PDF, builds an index, and runs a query with page citations:
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

# Configure the LLM used by query engines globally
Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")

# Load the PDF and build a vector index over its contents
reader = SimpleDirectoryReader(input_files=["./data/10k/your_pdf_file.pdf"])
data = reader.load_data()
index = VectorStoreIndex.from_documents(data)

# Create a streaming query engine that retrieves the top 3 most similar chunks
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)

# Stream the response with page citations
response = query_engine.query(
    "What was the impact of COVID? Show statements in bullet form and show"
    " page reference after each statement."
)
response.print_response_stream()
```
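If your goal is extracting terms and definitions, you can prompt the query engine to answer in a predictable shape (e.g. one `term: definition` per line) and then post-process the text. Here's a minimal sketch of that post-processing step; the `parse_terms` helper is not part of LlamaIndex, and the sample string stands in for a real LLM response:

```python
def parse_terms(text: str) -> dict[str, str]:
    """Parse 'term: definition' lines into a dict, skipping malformed lines."""
    terms = {}
    for line in text.splitlines():
        term, sep, definition = line.partition(":")
        if sep and term.strip():
            terms[term.strip()] = definition.strip()
    return terms

# Example with a mocked response string in place of a real query
sample = "RAG: Retrieval-augmented generation\nLLM: Large language model"
print(parse_terms(sample))
# → {'RAG': 'Retrieval-augmented generation', 'LLM': 'Large language model'}
```

LLM output is not guaranteed to follow the requested format exactly, so skipping malformed lines (as above) is safer than assuming every line parses.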