I'm guessing this one
from llama_index.text_splitter import SentenceSplitter
is now
from llama_index.core.node_parser import SentenceSplitter
but then
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
produces an error:
ValidationError: 1 validation error for ConfigurableTransformation
component_type
subclass of BaseComponent expected (type=type_error.subclass; expected_class=BaseComponent)
import pdfplumber

# Initialize variables to keep track of page numbers and line numbers
page_number = 1
chunks = []

# Open the PDF file
with pdfplumber.open("./1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf") as pdf:
    print(f"Debug: Total Pages in PDF: {len(pdf.pages)}")  # New debugging line
    # Loop through each page in the PDF
    for page in pdf.pages:
        print(f"Debug: Reading Page {page_number}")  # Debugging
        # Extract text from the current page
        page_text = page.extract_text()
        # Split the page text into lines
        lines = page_text.split('\n')
        # Apply the existing chunking logic to the lines from this page
        for i in range(0, len(lines), 10):  # Chunk size is 10 lines
            chunk = '\n'.join(lines[i:i+10])
            chunks.append((chunk, page_number))  # Include the actual page number
            print(f"Debug: Creating Chunk {len(chunks)} from Page {page_number}")  # Debugging
            # Debugging: Print the first few chunks to see if they contain more lines
            print(f"Debug: Chunk {len(chunks)}, Page {page_number}, Content: {chunk[:100]}")  # First 100 characters of each chunk
        # Increment the page number for the next page
        page_number += 1
        print(f"Debug: Incremented Page Number to {page_number}")  # New debugging line

You're the man, whitefang
Hmm, just tried fresh on Colab and it is working
thank you for being up this late
Haha, no, it's daytime on my side
that works but this right after it...
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
I get this:
ValidationError: 1 validation error for ConfigurableTransformation
component_type
subclass of BaseComponent expected (type=type_error.subclass; expected_class=BaseComponent)
Then you are up very late. lol
Is this a fresh env you are working with? Fresh with the latest installation (v0.10.x) of LlamaIndex, or was there a previous version in here earlier?
is there some package that has to run?
oh, but I ran this:
from llama_index.legacy import VectorStoreIndex
It has to be either legacy or new; combining them is not going to work.
from llama_index.core import VectorStoreIndex
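In general, once you're on v0.10, keep every import in the new core namespaces. A minimal sketch of the consistent set, assuming a clean v0.10 install with no legacy imports mixed in:

# sketch: all-new-style (v0.10) imports; do not mix with llama_index.legacy
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# with a clean install, the earlier snippet works as-is
text_splitter = SentenceSplitter(
    chunk_size=1024,
)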
is this dead, too now?
from llama_index.schema import TextNode
yes, I think it is
from llama_index.core.schema import TextNode
@WhiteFang_Jr where did this go?
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
ModuleNotFoundError: No module named 'llama_index.node_parser'
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
what about MetadataExtractor?
@WhiteFang_Jr is this no longer going to work:
Which doc are you using for this code? It must be updated with the correct imports. Can you share the doc link?
you mean the link to my notebook?
or the link to the llama index documentation
you're always on at the same time as me
Haha, yeah, this is the second time this is happening
hm, so what do you suggest for this:
# Import the necessary modules for metadata extraction
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
#llm = OpenAI(model="gpt-4-Turbo")
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)

# Process nodes to add additional metadata
nodes = metadata_extractor.process_nodes(nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
I need to get some sleep. Back at this tomorrow. But I can say a good amount of code that I had written is now defunct with these latest changes.
I understand. If you provide the code, I'll be happy to debug and correct it with you in the morning.
There was a migration guide and an auto-migrate tool. I wonder if you saw that? Happy to help migrate code as well.
@Logan M I would love to take you up on that. How do I do that? Just continue to post here?
Hey!
I have updated the imports in the two files that you sent me.
Do check that!
If anything fails, you can let us know here.
So, the first one fails with RuntimeError: asyncio.run() cannot be called from a running event loop
you need to add
import nest_asyncio
nest_asyncio.apply()
to your code (if this is in fastapi, you need to set the loop type to asyncio as well)
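For the FastAPI case, a sketch of what that might look like under uvicorn ("main:app" here is a placeholder for your application):

# sketch: forcing the plain asyncio loop type under uvicorn/FastAPI
# (nest_asyncio cannot patch uvloop, so use the plain asyncio loop)
import uvicorn

uvicorn.run("main:app", loop="asyncio")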
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)

# Process nodes to add additional metadata
nodes = metadata_extractor.process_nodes(nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
from llama_index.core.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI
oh wait, MetadataExtractor isn't a thing either
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
pipeline = IngestionPipeline(
    transformations=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ]
)

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
OK, that looks to have fixed it. Interesting...
I guess I'll just go through the full list of cells, in case someone else is interested down the road...
Next cell. I found this needs to be changed:
from llama_index.embeddings import OpenAIEmbedding

# Initialize the OpenAI embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for each node and store them in the node
for node in nodes:
    print(f"Debug: Adding node with Metadata: {node.metadata}")  # Debugging
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
to this:
from llama_index.embeddings.openai import OpenAIEmbedding

# Initialize the OpenAI embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for each node and store them in the node
for node in nodes:
    print(f"Debug: Adding node with Metadata: {node.metadata}")  # Debugging
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
(You really could use the upgrade CLI tool for this, unless it barfs on your notebook, which is known to happen)
llamaindex-cli upgrade-file <file>
Hm, maybe it would. I'm finding some things work and others don't. Some of the things that don't work seem Pinecone-related too; they have also updated their API.
so, like this: vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
vector_store.add(nodes) #Add to pinecone
AttributeError: 'Pinecone' object has no attribute 'upsert'
Did you run pip install llama-index-vector-stores-pinecone? It should install the proper version of pinecone (they made some changes with their serverless stuff).
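A quick way to confirm which client version actually ended up in the env (a sketch using the standard library; the distribution name pinecone-client is an assumption):

# sketch: confirm the installed pinecone client version
from importlib.metadata import version
print(version("pinecone-client"))  # the v3.x client is what the new API needs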
Sorry, I had to hop on a zoom yesterday mid-discussion. Let me verify that....
Ok, so I ran that:
I still get AttributeError: 'Pinecone' object has no attribute 'upsert'
when I run this:
print(f"Debug: Number of nodes added to vector_store: {len(nodes)}") # Debugging
# New Debug Statements
for node in nodes:
print(f"Pre-Vector Store Node Metadata: {node.metadata}")
# 1. Debugging before adding nodes to vector_store
print("Debug: Metadata before adding nodes to vector_store")
for idx, node in enumerate(nodes):
print(f"Node {idx+1} Metadata: {node.metadata}")
print(f"Debug: Number of nodes added to vector_store: {len(nodes)}") # Debugging
vector_store.add(nodes) #Add to pinecone
specifically, the issue is this line: vector_store.add(nodes) #Add to pinecone
which is defined as: from llama_index.vector_stores.pinecone import PineconeVectorStore
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
having run this:
from llama_index.core import VectorStoreIndex
and before that, this: !pip install llama-index-vector-stores-pinecone
How did you set up the pinecone_index, though?
oh, I was on version: llama-index-vector-stores-pinecone-0.1.4, maybe I have to upgrade this
That shouldn't matter too much, I think. It feels like pinecone_index=pinecone_index is not passing in the correct thing, imo.
pinecone_index.create_index(
    name="eng",
    dimension=1536,
    metric="euclidean",
    spec=PodSpec(
        environment="gcp-starter"
    )
)
So that's not actually the pinecone_index object; it's really the pc object.
pinecone_index = pc.Index("quickstart")
That will get the correct pinecone_index object.
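Putting the pieces together, the setup would look roughly like this (a sketch; the API key is a placeholder, and the index name "eng" comes from your create_index call):

# sketch: v3-style pinecone client setup feeding PineconeVectorStore
from pinecone import Pinecone, PodSpec
from llama_index.vector_stores.pinecone import PineconeVectorStore

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key
pc.create_index(                       # create once; skip if the index already exists
    name="eng",
    dimension=1536,
    metric="euclidean",
    spec=PodSpec(environment="gcp-starter"),
)

pinecone_index = pc.Index("eng")  # an Index handle, not the Pinecone client itself
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
vector_store.add(nodes)  # upsert now resolves on the Index object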
not sure how you made the connection with such limited context
So, wow, I'm up and running on the vector store!
Now for the KG. Stand by...
So, I think the imports that @WhiteFang_Jr provided were enough to get the KG working...
doing some additional tests now to check that everything is back to normal.
OK, looks like everything is working as before, save one issue that I can see (so far):
The document title is not quite right
Here is the code that produces the title:
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage
)

# Define the llm and embed model and add them to Settings; this replaces ServiceContext
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model

from llama_index.core.schema import Document as LlamaDocument  # Import the llama_index Document class

# Inherit from the llama_index Document class to make it compatible
class Document(LlamaDocument):
    def __init__(self, text, metadata):
        super().__init__(text=text, metadata=metadata)

    def get_doc_id(self):
        return f"{self.metadata['filename']}-{self.metadata['starting_line_number']}"

import os
import networkx as nx
# from llama_index.llms import OpenAI  (deprecated)
from llama_index.llms.openai import OpenAI
# from llama_index.query_engine import CitationQueryEngine  (deprecated)
from llama_index.core.query_engine import CitationQueryEngine
# from llama_index import (  (deprecated)
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    ServiceContext,
)

# Initialize service context
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
)

# Explicitly set doc_id in metadata
for doc in documents:
    doc.metadata['doc_id'] = doc.get_doc_id()

# Initialize VectorStoreIndex
try:
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
except Exception as e:
    print(f"Error: {e}")

# Initialize the CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=512,
)

# Query and retrieve information
response = query_engine.query("what is the title of this document?")
print("Query Response:", response)

# Create knowledge graph nodes
G = nx.Graph()

# Add nodes to the graph
for i, source_node in enumerate(response.source_nodes):
    node_content = source_node.node.get_text()
    citation = source_node.node.metadata.get('page_number', 'Unknown')
    file_name = source_node.node.metadata.get('filename', 'Unknown')
    title = source_node.node.metadata.get('document_title', 'Unknown')
    G.add_node(citation, content=node_content, title=title)
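(Side note: the snippet above sets Settings but then still builds a ServiceContext; in v0.10 the Settings-only form should be enough. A sketch:)

# sketch: Settings-only equivalent of the ServiceContext block above
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
index = VectorStoreIndex.from_documents(documents)  # picks up Settings.llm automatically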
Is document_title actually in your metadata for each document?
yes, and I can see it's wrong earlier in the code
like before the code that I just pasted
It seems to get it wrong starting here:
# Updated imports for document processing from WhiteFang + ChatGPT + Logan
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
pipeline = IngestionPipeline(
    transformations=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ]
)

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
my guess is that the wrong title is being propagated throughout the code
Odd, I'm re-running the notebook and I no longer see the problem in all the nodes, but I still see it in some of them.
I'm not sure what you mean by wrong title -- the title extractor is just using the LLM to predict a title. It could be anything
response = query_engine.query(query_str)

# Debugging: Extract and print metadata from source nodes
source_nodes = response.source_nodes
for idx, node in enumerate(source_nodes):
    print(f"--- Metadata for Source Node {idx + 1} ---")
    for key, value in node.metadata.items():
        print(f"{key}: {value}")
    print("\n")  # For better readability
--- Metadata for Source Node 1 ---
source_doc_idx: 6
filename: 1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf
page_number: 3
document_title: Utility Service Agreement and Installation for Energy Efficiency Upgrade at Bob Butrico's Solar System Installation
line_count: 10
starting_line_number: 56
questions_this_excerpt_can_answer: 1. What is the unique identifier for the Design Envelope ID in this document?
- How are original signatures transmitted and received in this document, and what is stated about their validity?
- Which two parties have executed this Order as of the Effective Date mentioned in the document?
--- Metadata for Source Node 2 ---
source_doc_idx: 6
filename: 1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf
page_number: 3
document_title: Utility Service Agreement and Energy Efficiency Upgrade for AT&T Corp.'s Solar System Installation with Eco Engineering Subcontractor
line_count: 10
starting_line_number: 56
questions_this_excerpt_can_answer: 1. What is the unique identifier for the Design Envelope in this document?
- How are original signatures transmitted and received in this agreement?
- Which two parties have executed this Order as of the Effective Date?
--- Metadata for Source Node 1 ---
this one is wrong
--- Metadata for Source Node 2 ---
this one is correct
the title extractor is just using the LLM to predict a title -- it's not guaranteed to be "right" or "wrong" -- it's just looking at the text and predicting a title
unless I'm misunderstanding
It seems to me you ingested the same document/node twice into your index? But since you ran it through the title extractor each time, the generated title and questions are different?
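If that's the cause, one guard is to attach a docstore to the pipeline so repeated runs upsert instead of duplicating. A sketch, assuming the v0.10 IngestionPipeline docstore support and documents with stable ids:

# sketch: de-duplicating repeat ingestion runs with a docstore
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import TitleExtractor
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[TitleExtractor(nodes=5, llm=llm)],  # same extractors as above
    docstore=SimpleDocumentStore(),  # de-dups on document id across runs
)
nodes = pipeline.run(documents=documents)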
this is what I'm doing
# Updated imports for document processing from WhiteFang + Logan
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
pipeline = IngestionPipeline(
    transformations=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ]
)

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")