what happened to this:

from llama_index.vector_stores import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

ModuleNotFoundError: No module named 'openai.openai_object'
I'm guessing this one

from llama_index.text_splitter import SentenceSplitter

is now

from llama_index.core.node_parser import SentenceSplitter

but then

text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

produces an error:


ValidationError: 1 validation error for ConfigurableTransformation
component_type
subclass of BaseComponent expected (type=type_error.subclass; expected_class=BaseComponent)
import pdfplumber

# Initialize variables to keep track of page numbers and line numbers
page_number = 1
chunks = []

# Open the PDF file
with pdfplumber.open("./1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf") as pdf:
    print(f"Debug: Total Pages in PDF: {len(pdf.pages)}")  # New Debugging Line
    # Loop through each page in the PDF
    for page in pdf.pages:
        print(f"Debug: Reading Page {page_number}")  # Debugging
        # Extract text from the current page
        page_text = page.extract_text()

        # Split the page text into lines
        lines = page_text.split('\n')

        # Apply your existing chunking logic to the lines from this page
        for i in range(0, len(lines), 10):  # Chunk size is 10 lines
            chunk = '\n'.join(lines[i:i+10])
            chunks.append((chunk, page_number))  # Include the actual page number
            print(f"Debug: Creating Chunk {len(chunks)} from Page {page_number}")  # Debugging

            # Debugging: Print the first few chunks to see if they contain more lines
            print(f"Debug: Chunk {len(chunks)}, Page {page_number}, Content: {chunk[:100]}")  # Print the first 100 characters of each chunk

        # Increment the page number for the next page
        page_number += 1
        print(f"Debug: Incremented Page Number to {page_number}")  # New Debugging Line
You're the man, whitefang
hmm, just tried fresh on Colab and it is working
[Attachment: image.png]
thank you for being up this late
Haha, no, it's daytime on my side πŸ˜†
that works but this right after it...
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
oh, ya? where are you?
This works too
[Attachment: image.png]
I get this:

ValidationError: 1 validation error for ConfigurableTransformation
component_type
subclass of BaseComponent expected (type=type_error.subclass; expected_class=BaseComponent)
Then you are up very late. lol πŸ˜…
haha, yes. but I love it
Is this a fresh env you are working with? Fresh with the latest installation (v0.10.x) of LlamaIndex, or was there a previous version in here earlier?
is there some package that has to run?
oh, but I ran this:

from llama_index.legacy import VectorStoreIndex
It has to be either legacy or new; combining them is not going to work
from llama_index.core import VectorStoreIndex
ok that solved it
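For reference, a minimal core-only sketch of the piece that was failing; mixing llama_index.legacy and llama_index.core is what triggers the BaseComponent ValidationError (the transformations usage here assumes a fresh v0.10 install):

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Both the splitter and the index come from llama_index.core;
# pulling either one from llama_index.legacy breaks validation
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
index = VectorStoreIndex.from_documents(
    documents,  # assumes documents were loaded earlier in the notebook
    transformations=[text_splitter],
)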
is this dead, too now?

from llama_index.schema import TextNode
yes, i think it is

from llama_index.core.schema import TextNode
@WhiteFang_Jr where did this go?

from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)


ModuleNotFoundError: No module named 'llama_index.node_parser'
Plain Text
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
what about MetadataExtractor?
@WhiteFang_Jr is this no longer going to work:
[Attachment: image.png]
Which doc are you using for this code? It must be updated with the correct imports. Can you share the doc link?
you mean the link to my notebook?
or the link to the llama index documentation
this was from the summer
lol, but I think this got updated. Earlier you had to combine all the extractors in one place and then use it; now you can directly pass them to transformations. No need to have one more extra layer. For instance, check this example:
https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor.html
you're always at the same time as me πŸ™‚
Haha yeah this is second time this is happening
hm, so what do you suggest for this:

# Import the necessary modules for metadata extraction
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
#llm = OpenAI(model="gpt-4-Turbo")
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)

# Process nodes to add additional metadata
nodes = metadata_extractor.process_nodes(nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
I need to get some sleep. Back at this tomorrow. But, I can say a good amount of code that I had written is now defunct with these latest changes 😦
I can understand, but if you provide the code, I'll be happy to debug and correct it with you next morning.
Get some sleep now πŸ’ͺ
Wow. Really?!!!
There was a migration guide and auto-migrate tool. I wonder if you saw that? Happy to help migrate code as well
@Logan M I would love to take you up on that. How do I do that? Just continue to post here?
Post away 🚒
Hey!
I have updated the imports of the two files that you sent me.
Do check that!
If anything fails you can let us know here
@WhiteFang_Jr hi! Ok
So, the first one fails: RuntimeError: asyncio.run() cannot be called from a running event loop
you need to add

Plain Text
import nest_asyncio
nest_asyncio.apply()


to your code (if this is in fastapi, you need to set the loop type to asyncio as well)
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)

# Process nodes to add additional metadata
nodes = metadata_extractor.process_nodes(nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
Plain Text
from llama_index.core.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)

from llama_index.llms.openai import OpenAI
oh wait, MetadataExtractor isn't a thing either
Plain Text
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)

from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")

pipeline = IngestionPipeline(transformations=[
      TitleExtractor(nodes=5, llm=llm),
      QuestionsAnsweredExtractor(questions=3, llm=llm),
  ]
)

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging


for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
ok that looked to have fixed it. interesting.....
thank you, sirs
I guess, I'll just go through the full list of cells in case down the road someone else is interested...
next cell. I found this needs to be changed

from llama_index.embeddings import OpenAIEmbedding

# Initialize the OpenAI embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for each node and store them in the node
for node in nodes:
    print(f"Debug: Adding node with Metadata: {node.metadata}")  # Debugging
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding



to this:

from llama_index.embeddings.openai import OpenAIEmbedding

# Initialize the OpenAI embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for each node and store them in the node
for node in nodes:
    print(f"Debug: Adding node with Metadata: {node.metadata}")  # Debugging
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
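As a side note, a batched variant may be faster when there are many nodes; a minimal sketch, assuming the embedding model's get_text_embedding_batch method:

# Batched alternative: embed all node texts in one call
texts = [node.get_content(metadata_mode="all") for node in nodes]
embeddings = embed_model.get_text_embedding_batch(texts, show_progress=True)
for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding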
(You really could use the upgrade cli tool for this πŸ˜‰ unless it barfs on your notebook, which is known to happen)

llamaindex-cli upgrade-file <file>
hm, maybe it would. I'm finding some things work, and then others don't. Some of the things that do not work seem Pinecone-related too. Like, they have also updated their API
so, like this:

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

vector_store.add(nodes)  # Add to pinecone

AttributeError: 'Pinecone' object has no attribute 'upsert'
Did you run pip install llama-index-vector-stores-pinecone ? It should install the proper version of pinecone (they made some changes with their serverless stuff)
Sorry, I had to hop on a zoom yesterday mid-discussion. Let me verify that....
Ok, so I ran that:

i still get AttributeError: 'Pinecone' object has no attribute 'upsert' when I run this:

print(f"Debug: Number of nodes added to vector_store: {len(nodes)}") # Debugging # New Debug Statements for node in nodes: print(f"Pre-Vector Store Node Metadata: {node.metadata}") # 1. Debugging before adding nodes to vector_store print("Debug: Metadata before adding nodes to vector_store") for idx, node in enumerate(nodes): print(f"Node {idx+1} Metadata: {node.metadata}") print(f"Debug: Number of nodes added to vector_store: {len(nodes)}") # Debugging vector_store.add(nodes) #Add to pinecone
specifically, the issue is this line:

vector_store.add(nodes)  # Add to pinecone

which is defined as:

from llama_index.vector_stores.pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

having run this:

from llama_index.core import VectorStoreIndex

and before that this:

!pip install llama-index-vector-stores-pinecone
How did you setup the pinecone_index tho?
oh, I was on version: llama-index-vector-stores-pinecone-0.1.4, maybe I have to upgrade this
that shouldn't matter too much I think πŸ€” It feels like pinecone_index=pinecone_index is not passing in the correct thing imo
pinecone_index.create_index(
    name="eng",
    dimension=1536,
    metric="euclidean",
    spec=PodSpec(
        environment="gcp-starter"
    )
)
so that's not actually the pinecone_index object, it's really the pc object

pinecone_index = pc.Index("quickstart")

That will get the correct pinecone_index object
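For context, a minimal end-to-end sketch with the v3 Pinecone client (the API key placeholder and the index name "eng" mirror the snippet above; adjust as needed):

from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="...")  # placeholder; use your real API key

# The client object (pc) creates indexes; pc.Index(...) returns the
# handle that PineconeVectorStore actually expects
pc.create_index(
    name="eng",
    dimension=1536,
    metric="euclidean",
    spec=PodSpec(environment="gcp-starter"),
)
pinecone_index = pc.Index("eng")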
not sure how you made the connection with such limited context
So, wow, I'm up and running on the vector store!
Now for the KG πŸ™‚ stand by
So, I think the imports that @WhiteFang_Jr provided were enough to get the KG working ....
doing some additional tests now to check that everything is back to normal.
Ok looks like everything is working as per before, save one issue that I can see (so far)
The document title is not quite right
Here is the code that produces the title:

from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage
)

# Define llm and embed model and add them to Settings; this replaces ServiceContext
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
from llama_index.core.schema import Document as LlamaDocument  # Import the llama_index Document class

# Inherit from the llama_index Document class to make it compatible
class Document(LlamaDocument):
    def __init__(self, text, metadata):
        super().__init__(text=text, metadata=metadata)

    def get_doc_id(self):
        return f"{self.metadata['filename']}-{self.metadata['starting_line_number']}"

import os
import networkx as nx

# from llama_index.llms import OpenAI  (deprecated)
from llama_index.llms.openai import OpenAI
# from llama_index.query_engine import CitationQueryEngine  (deprecated)
from llama_index.core.query_engine import CitationQueryEngine
# from llama_index import (  (deprecated)
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    ServiceContext,
)

# Initialize service context
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
)
# Explicitly set doc_id in metadata
for doc in documents:
    doc.metadata['doc_id'] = doc.get_doc_id()

# Initialize VectorStoreIndex
try:
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
except Exception as e:
    print(f"Error: {e}")

# Initialize the CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=512,
)

# Query and Retrieve Information
response = query_engine.query("what is the title of this document?")
print("Query Response:", response)

# Create Knowledge Graph Nodes
G = nx.Graph()

# Add nodes to the graph
for i, source_node in enumerate(response.source_nodes):
    node_content = source_node.node.get_text()
    citation = source_node.node.metadata.get('page_number', 'Unknown')
    file_name = source_node.node.metadata.get('filename', 'Unknown')
    title = source_node.node.metadata.get('document_title', 'Unknown')
    G.add_node(citation, content=node_content, title=title)
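As an aside, the cell above mixes the new Settings globals with the deprecated ServiceContext; a minimal Settings-only sketch of the same setup, assuming llm and embed_model are already defined:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.query_engine import CitationQueryEngine

# In v0.10, Settings replaces ServiceContext entirely
Settings.llm = llm
Settings.embed_model = embed_model

index = VectorStoreIndex.from_documents(documents)  # no service_context kwarg needed
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=512,
)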
Is document_title actually in your metadata for each document?
yes, and I can see it's wrong earlier in the code
like before the code that I just pasted
seem to get it wrong starting here:

# Updated imports for document processing from whiteFang + Chatgpt + Logan
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")

pipeline = IngestionPipeline(transformations=[
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
])

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
my guess is that the wrong title is being propagated throughout the code
odd, I'm re-running the notebook and I don't see the problem in all the nodes any longer. But I see it in some of them
I'm not sure what you mean by wrong title -- the title extractor is just using the LLM to predict a title. It could be anything πŸ‘€
response = query_engine.query(query_str)

# Debugging: Extract and print metadata from source nodes
source_nodes = response.source_nodes
for idx, node in enumerate(source_nodes):
    print(f"--- Metadata for Source Node {idx + 1} ---")
    for key, value in node.metadata.items():
        print(f"{key}: {value}")
    print("\n")  # For better readability


--- Metadata for Source Node 1 ---
source_doc_idx: 6
filename: 1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf
page_number: 3
document_title: Utility Service Agreement and Installation for Energy Efficiency Upgrade at Bob Butrico's Solar System Installation
line_count: 10
starting_line_number: 56
questions_this_excerpt_can_answer: 1. What is the unique identifier for the Design Envelope ID in this document?
  2. How are original signatures transmitted and received in this document, and what is stated about their validity?
  3. Which two parties have executed this Order as of the Effective Date mentioned in the document?
--- Metadata for Source Node 2 ---
source_doc_idx: 6
filename: 1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf
page_number: 3
document_title: Utility Service Agreement and Energy Efficiency Upgrade for AT&T Corp.'s Solar System Installation with Eco Engineering Subcontractor
line_count: 10
starting_line_number: 56
questions_this_excerpt_can_answer: 1. What is the unique identifier for the Design Envelope in this document?
  2. How are original signatures transmitted and received in this agreement?
  3. Which two parties have executed this Order as of the Effective Date?
--- Metadata for Source Node 1 ---
this one is wrong

--- Metadata for Source Node 2 ---
this one is correct
the title extractor is just using the LLM to predict a title -- it's not guaranteed to be "right" or "wrong" -- it's just looking at the text and predicting a title
unless I'm misunderstanding πŸ˜…
seems to me like you ingested the same document/node twice into your index? But since you ran it through the title extractor each time, the generated title and questions are different?
this is what I'm doing
# Updated imports for document processing from whiteFang + Logan
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")

pipeline = IngestionPipeline(transformations=[
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
])

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
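If the duplicate-ingestion theory above is right, one guard worth trying is the pipeline's docstore-based de-duplication; a minimal sketch, assuming the default upsert behavior and that llm and documents are defined as earlier:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)

# Attaching a docstore lets the pipeline skip inputs it has already seen
# (matched by doc id/hash), so re-running doesn't re-extract titles
pipeline = IngestionPipeline(
    transformations=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    docstore=SimpleDocumentStore(),
)
nodes = pipeline.run(documents=documents)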