I'm guessing this one
from llama_index.text_splitter import SentenceSplitter
is now
from llama_index.core.node_parser import SentenceSplitter
but then
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
produces an error:
ValidationError: 1 validation error for ConfigurableTransformation
component_type
subclass of BaseComponent expected (type=type_error.subclass; expected_class=BaseComponent)
import pdfplumber

# Initialize variables to keep track of page numbers and line numbers
page_number = 1
chunks = []

# Open the PDF file
with pdfplumber.open("./1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf") as pdf:
    print(f"Debug: Total Pages in PDF: {len(pdf.pages)}")  # New debugging line
    # Loop through each page in the PDF
    for page in pdf.pages:
        print(f"Debug: Reading Page {page_number}")  # Debugging
        # Extract text from the current page
        page_text = page.extract_text()
        # Split the page text into lines
        lines = page_text.split('\n')
        # Apply the existing chunking logic to the lines from this page
        for i in range(0, len(lines), 10):  # Chunk size is 10 lines
            chunk = '\n'.join(lines[i:i+10])
            chunks.append((chunk, page_number))  # Include the actual page number
            print(f"Debug: Creating Chunk {len(chunks)} from Page {page_number}")  # Debugging
            # Debugging: Print the first few chunks to see if they contain more lines
            print(f"Debug: Chunk {len(chunks)}, Page {page_number}, Content: {chunk[:100]}")  # First 100 characters of each chunk
        # Increment the page number for the next page
        page_number += 1
        print(f"Debug: Incremented Page Number to {page_number}")  # New debugging line

You're the man, whitefang
Hmm, just tried fresh on Colab and it is working
thank you for being up this late
Haha, no, it's daytime on my side
that works but this right after it...
text_splitter = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
I get this:
ValidationError: 1 validation error for ConfigurableTransformation
component_type
subclass of BaseComponent expected (type=type_error.subclass; expected_class=BaseComponent)
Then you are up very late. lol
Is this a fresh env you are working with? Fresh with the latest installation (v0.10.x) of LlamaIndex, or was there a previous version in here earlier?
is there some package that has to run?
oh, but I ran this:
from llama_index.legacy import VectorStoreIndex
It has to be either legacy or new; combining them is not going to work.
from llama_index.core import VectorStoreIndex
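In general, once you're on v0.10, keep every import in the new core namespaces. A minimal sketch of the consistent set, assuming a clean v0.10 install with no legacy imports mixed in:

# sketch: all-new-style (v0.10) imports; do not mix with llama_index.legacy
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# with a clean install, the earlier snippet works as-is
text_splitter = SentenceSplitter(
    chunk_size=1024,
)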
is this dead, too now?
from llama_index.schema import TextNode
yes, I think it is
from llama_index.core.schema import TextNode
@WhiteFang_Jr where did this go?
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
ModuleNotFoundError: No module named 'llama_index.node_parser'
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
what about MetadataExtractor?
@WhiteFang_Jr is this no longer going to work:
Which doc are you using for this code? It must be updated with the correct imports. Can you share the doc link?
you mean the link to my notebook?
or the link to the llama index documentation
you're always on at the same time as me
Haha, yeah, this is the second time this is happening
hm, so what do you suggest for this:
# Import the necessary modules for metadata extraction
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
#llm = OpenAI(model="gpt-4-Turbo")
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)

# Process nodes to add additional metadata
nodes = metadata_extractor.process_nodes(nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
I need to get some sleep. Back at this tomorrow. But I can say a good amount of code that I had written is now defunct with these latest changes.
I understand. If you provide the code, I'll be happy to debug and correct it with you in the morning.
There was a migration guide and an auto-migrate tool. I wonder if you saw that? Happy to help migrate code as well.
@Logan M I would love to take you up on that. How do I do that? Just continue to post here?
Hey!
I have updated the imports in the two files that you sent me.
Do check that!
If anything fails, you can let us know here.
So, the first one fails with RuntimeError: asyncio.run() cannot be called from a running event loop
you need to add
import nest_asyncio
nest_asyncio.apply()
to your code (if this is in fastapi, you need to set the loop type to asyncio as well)
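For the FastAPI case, a sketch of what that might look like under uvicorn ("main:app" here is a placeholder for your application):

# sketch: forcing the plain asyncio loop type under uvicorn/FastAPI
# (nest_asyncio cannot patch uvloop, so use the plain asyncio loop)
import uvicorn

uvicorn.run("main:app", loop="asyncio")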
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False,
)

# Process nodes to add additional metadata
nodes = metadata_extractor.process_nodes(nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
from llama_index.core.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI
oh wait, MetadataExtractor isn't a thing either
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
pipeline = IngestionPipeline(
    transformations=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ]
)

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
OK, that looks to have fixed it. Interesting...
I guess I'll just go through the full list of cells, in case someone else is interested down the road...
Next cell. I found this needs to be changed:
from llama_index.embeddings import OpenAIEmbedding

# Initialize the OpenAI embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for each node and store them in the node
for node in nodes:
    print(f"Debug: Adding node with Metadata: {node.metadata}")  # Debugging
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
to this:
from llama_index.embeddings.openai import OpenAIEmbedding

# Initialize the OpenAI embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for each node and store them in the node
for node in nodes:
    print(f"Debug: Adding node with Metadata: {node.metadata}")  # Debugging
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
(You really could use the upgrade CLI tool for this, unless it barfs on your notebook, which is known to happen)
llamaindex-cli upgrade-file <file>
Hm, maybe it would. I'm finding some things work and others don't. Some of the things that don't work seem Pinecone-related too; they have also updated their API.
so, like this: vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
vector_store.add(nodes) #Add to pinecone
AttributeError: 'Pinecone' object has no attribute 'upsert'
Did you run pip install llama-index-vector-stores-pinecone? It should install the proper version of pinecone (they made some changes with their serverless stuff).
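A quick way to confirm which client version actually ended up in the env (a sketch using the standard library; the distribution name pinecone-client is an assumption):

# sketch: confirm the installed pinecone client version
from importlib.metadata import version
print(version("pinecone-client"))  # the v3.x client is what the new API needs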
Sorry, I had to hop on a zoom yesterday mid-discussion. Let me verify that....
Ok, so I ran that:
I still get AttributeError: 'Pinecone' object has no attribute 'upsert'
when I run this:
print(f"Debug: Number of nodes added to vector_store: {len(nodes)}") # Debugging
# New Debug Statements
for node in nodes:
print(f"Pre-Vector Store Node Metadata: {node.metadata}")
# 1. Debugging before adding nodes to vector_store
print("Debug: Metadata before adding nodes to vector_store")
for idx, node in enumerate(nodes):
print(f"Node {idx+1} Metadata: {node.metadata}")
print(f"Debug: Number of nodes added to vector_store: {len(nodes)}") # Debugging
vector_store.add(nodes) #Add to pinecone
specifically, the issue is this line: vector_store.add(nodes) #Add to pinecone
which is defined as: from llama_index.vector_stores.pinecone import PineconeVectorStore
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
having run this:
from llama_index.core import VectorStoreIndex
and before that, this: !pip install llama-index-vector-stores-pinecone
How did you set up the pinecone_index, though?
oh, I was on version: llama-index-vector-stores-pinecone-0.1.4, maybe I have to upgrade this
That shouldn't matter too much, I think. It feels like pinecone_index=pinecone_index is not passing in the correct thing, imo.
pinecone_index.create_index(
    name="eng",
    dimension=1536,
    metric="euclidean",
    spec=PodSpec(
        environment="gcp-starter"
    )
)
So that's not actually the pinecone_index object; it's really the pc object.
pinecone_index = pc.Index("quickstart")
That will get the correct pinecone_index object.
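Putting the pieces together, the setup would look roughly like this (a sketch; the API key is a placeholder, and the index name "eng" comes from your create_index call):

# sketch: v3-style pinecone client setup feeding PineconeVectorStore
from pinecone import Pinecone, PodSpec
from llama_index.vector_stores.pinecone import PineconeVectorStore

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key
pc.create_index(                       # create once; skip if the index already exists
    name="eng",
    dimension=1536,
    metric="euclidean",
    spec=PodSpec(environment="gcp-starter"),
)

pinecone_index = pc.Index("eng")  # an Index handle, not the Pinecone client itself
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
vector_store.add(nodes)  # upsert now resolves on the Index object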
not sure how you made the connection with such limited context
So, wow, I'm up and running on the vector store!
Now for the KG. Stand by...
So, I think the imports that @WhiteFang_Jr provided were enough to get the KG working...
doing some additional tests now to check that everything is back to normal.
OK, looks like everything is working as before, save one issue that I can see (so far):
The document title is not quite right
Here is the code that produces the title:
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage
)

# Define the llm and embed model and add them to Settings; this replaces ServiceContext
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model

from llama_index.core.schema import Document as LlamaDocument  # Import the llama_index Document class

# Inherit from the llama_index Document class to make it compatible
class Document(LlamaDocument):
    def __init__(self, text, metadata):
        super().__init__(text=text, metadata=metadata)

    def get_doc_id(self):
        return f"{self.metadata['filename']}-{self.metadata['starting_line_number']}"

import os
import networkx as nx
# from llama_index.llms import OpenAI  (deprecated)
from llama_index.llms.openai import OpenAI
# from llama_index.query_engine import CitationQueryEngine  (deprecated)
from llama_index.core.query_engine import CitationQueryEngine
# from llama_index import (  (deprecated)
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    ServiceContext,
)

# Initialize service context
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
)

# Explicitly set doc_id in metadata
for doc in documents:
    doc.metadata['doc_id'] = doc.get_doc_id()

# Initialize VectorStoreIndex
try:
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
except Exception as e:
    print(f"Error: {e}")

# Initialize the CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=512,
)

# Query and retrieve information
response = query_engine.query("what is the title of this document?")
print("Query Response:", response)

# Create knowledge graph nodes
G = nx.Graph()

# Add nodes to the graph
for i, source_node in enumerate(response.source_nodes):
    node_content = source_node.node.get_text()
    citation = source_node.node.metadata.get('page_number', 'Unknown')
    file_name = source_node.node.metadata.get('filename', 'Unknown')
    title = source_node.node.metadata.get('document_title', 'Unknown')
    G.add_node(citation, content=node_content, title=title)
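(Side note: the snippet above sets Settings but then still builds a ServiceContext; in v0.10 the Settings-only form should be enough. A sketch:)

# sketch: Settings-only equivalent of the ServiceContext block above
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
index = VectorStoreIndex.from_documents(documents)  # picks up Settings.llm automatically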
Is document_title actually in your metadata for each document?
yes, and I can see it's wrong earlier in the code
like before the code that I just pasted
It seems to get it wrong starting here:
# Updated imports for document processing from WhiteFang + ChatGPT + Logan
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
pipeline = IngestionPipeline(
    transformations=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ]
)

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")
my guess is that the wrong title is being propagated throughout the code
Odd, I'm re-running the notebook and I no longer see the problem in all the nodes, but I still see it in some of them.
I'm not sure what you mean by wrong title -- the title extractor is just using the LLM to predict a title. It could be anything
response = query_engine.query(query_str)

# Debugging: Extract and print metadata from source nodes
source_nodes = response.source_nodes
for idx, node in enumerate(source_nodes):
    print(f"--- Metadata for Source Node {idx + 1} ---")
    for key, value in node.metadata.items():
        print(f"{key}: {value}")
    print("\n")  # For better readability
--- Metadata for Source Node 1 ---
source_doc_idx: 6
filename: 1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf
page_number: 3
document_title: Utility Service Agreement and Installation for Energy Efficiency Upgrade at Bob Butrico's Solar System Installation
line_count: 10
starting_line_number: 56
questions_this_excerpt_can_answer: 1. What is the unique identifier for the Design Envelope ID in this document?
- How are original signatures transmitted and received in this document, and what is stated about their validity?
- Which two parties have executed this Order as of the Effective Date mentioned in the document?
--- Metadata for Source Node 2 ---
source_doc_idx: 6
filename: 1.2.2.2 Customer Contract - Stockwood Dr - Woodstock - GA.pdf
page_number: 3
document_title: Utility Service Agreement and Energy Efficiency Upgrade for AT&T Corp.'s Solar System Installation with Eco Engineering Subcontractor
line_count: 10
starting_line_number: 56
questions_this_excerpt_can_answer: 1. What is the unique identifier for the Design Envelope in this document?
- How are original signatures transmitted and received in this agreement?
- Which two parties have executed this Order as of the Effective Date?
--- Metadata for Source Node 1 ---
this one is wrong
--- Metadata for Source Node 2 ---
this one is correct
the title extractor is just using the LLM to predict a title -- it's not guaranteed to be "right" or "wrong" -- it's just looking at the text and predicting a title
unless I'm misunderstanding
It seems to me you ingested the same document/node twice into your index? But since you ran it through the title extractor each time, the generated title and questions are different?
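If that's the cause, one guard is to attach a docstore to the pipeline so repeated runs upsert instead of duplicating. A sketch, assuming the v0.10 IngestionPipeline docstore support and documents with stable ids:

# sketch: de-duplicating repeat ingestion runs with a docstore
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import TitleExtractor
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[TitleExtractor(nodes=5, llm=llm)],  # same extractors as above
    docstore=SimpleDocumentStore(),  # de-dups on document id across runs
)
nodes = pipeline.run(documents=documents)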
this is what I'm doing
# Updated imports for document processing from WhiteFang + Logan
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms.openai import OpenAI

# Initialize the LLM and metadata extractor
llm = OpenAI(model="gpt-3.5-turbo")
pipeline = IngestionPipeline(
    transformations=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ]
)

# Process nodes to add additional metadata
nodes = pipeline.run(nodes=nodes)
print(f"Debug: Processed {len(nodes)} nodes.")  # Debugging
for node in nodes:
    print(f"Processed Node Metadata: {node.metadata}")