Here is the data I'm attempting to insert:
test_data = [
    {
        "text": '''
<<Patrick>>
How far from the Sun is the Earth?
<<PF>>
The average distance from the Earth to the Sun is about 93 million miles (150 million kilometers).
This distance is often referred to as one Astronomical Unit (AU), which is a standard unit of measurement
used in astronomy to describe distances within our solar system.
''',
        "metadata": generate_metadata("pm"),
    },
    {
        "text": '''
<<Patrick>>
How many people live in Africa?
<<PF>>
As of 2023, the estimated population of Africa is approximately 1.44 billion people.
This estimate is based on an annual growth rate of 2.5% from the last known accurate figure of 1.37 billion in 2021
''',
        "metadata": generate_metadata("pm"),
    },
    {
        "text": '''
<<Patrick>>
What is Joe Biden's middle name?
<<PF>>
Joe Biden's middle name is Robinette. His full name is Joseph Robinette Biden Jr.
''',
        "metadata": generate_metadata("pm"),
    },
]
@kapa.ai
Based on my code above, can you help me find out what's wrong?
My code is not throwing errors, but after it completes nothing has been inserted into my database.
Let me check the knowledge base for you. One moment! :books:
----------------------
I'm sorry, but I can't provide the help you're looking for because the extracts you've provided don't include any code related to your issue. However, if you're working with LlamaIndex and SQLDatabase, make sure you're correctly establishing a connection to your database and that your insert statements are being executed properly.
Here's an example of how to insert data into a SQLDatabase using LlamaIndex, based on the extracts you've provided:
from sqlalchemy import insert

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Chicago", "population": 2679000, "country": "United States"},
    {"city_name": "Seoul", "population": 9776000, "country": "South Korea"},
]

for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        cursor = connection.execute(stmt)
(Source: LlamaIndex SQLDatabase Example)
If you're still having trouble, I recommend checking your database connection and ensuring your SQL statements are correct.
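To verify the inserts actually landed, you can count the rows afterwards. A minimal check, assuming the engine and city_stats table from the example above:

from sqlalchemy import text

# Count the rows to confirm the inserts were committed
with engine.connect() as connection:
    count = connection.execute(text("SELECT COUNT(*) FROM city_stats")).scalar()
    print(f"city_stats now has {count} rows")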
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
I hate to bump this, but kapa isn't helping and I think the timing of my question allowed it to get buried. Any idea what might be the issue here? @Logan M
Curious what process_chunks() is doing :slightly_smiling_face:
Are there nodes returned when you call run()?
from llama_index import Document
from pf.cbcore.utils.llm import pf_llm
import asyncio
import datetime

async def async_ensure_chunk(text: str, chunk_limit: int = 150) -> str:
    '''Summarizes large chunks into smaller chunks if the chunk is over the chunk limit.'''
    chunk = text
    if len(text.split()) > (chunk_limit + int(chunk_limit * 0.05)):
        prompt = f'Please summarize the following text to be at or within a chunk/word limit of {chunk_limit}:\n\n' + text
        chunk = await pf_llm.apredict(prompt)
    return chunk

def tag_chunk(chunk: str, tags: dict) -> Document:
    '''Tags a given chunk with given tags'''
    doc = Document(text=chunk)
    doc.metadata.update(tags)
    return doc

def generate_metadata(user: str, env: str = "react") -> dict:
    '''Returns relevant metadata'''
    local_timezone = datetime.datetime.now().astimezone().tzinfo
    timestamp = datetime.datetime.now(local_timezone)
    metadata = {
        "user": user,
        "ctimestamp": str(timestamp),
        "environment": env,
    }
    return metadata

async def process_chunks(texts: list[dict], chunk_limit: int = 150) -> list[Document]:
    '''Processes a list of texts into a list of tagged and potentially summarized chunks'''
    async def process_chunk(raw):
        text = raw.get("text")
        md = raw.get("metadata")
        chunk = await async_ensure_chunk(text, chunk_limit)
        return tag_chunk(chunk, md)

    return await asyncio.gather(*(process_chunk(raw) for raw in texts))
process_chunks is an async way for me to summarize (if the chunk is over 150 words) and tag with relevant metadata before sending to the vector store. Let me see what nodes are returned...
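For context, this is roughly how those documents get fed into the pipeline on my end (a sketch; the embedding model and connection parameters are stand-ins for my real config):

import asyncio
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.vector_stores import PGVectorStore

# Stand-in connection params; the real values live in our config
vector_store = PGVectorStore.from_params(
    database="vectordb",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    table_name="pf_chunks",
    embed_dim=1536,  # matches OpenAI ada-002 embeddings
)

# With the vector store attached, run() should insert the embedded nodes itself
pipeline = IngestionPipeline(
    transformations=[OpenAIEmbedding()],
    vector_store=vector_store,
)

documents = asyncio.run(process_chunks(test_data))
nodes = pipeline.run(documents=documents)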
Yea that process chunks function seems alright, it's returning document objects
They look like document objects with the embedding element filled w/ embeddings, whereas before, non-embedded documents I think have that element as None or False.
The nodes look as I would expect them to
So, since you attached the vector store to the pipeline, they should be in your vector db now
There have been 25 rows since before I ran any ingestion pipelines
Those last ones are ones I ran from a notebook, not with the scripts I pasted above.
if you run vector_store.add([nodes[0]]), does that end up inserting a node?
I'm also positive the database, schema, and table name in my VectorStore object match what I'm querying here with my database tool.
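This is roughly how I'm checking, for reference (connection string and table name are placeholders):

from sqlalchemy import create_engine, text

# Placeholder DSN; same database/schema the PGVectorStore points at
engine = create_engine("postgresql://postgres:password@localhost:5432/vectordb")
with engine.connect() as connection:
    print(connection.execute(text("SELECT COUNT(*) FROM pf_chunks")).scalar())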
nodes as in the object I index after running the ingest function, or can I just index the processed_chunks list for a node?
It will have to be a node with an embedding attached, so after ingest; maybe just try on a single document chunk
Looking at the code for ingestion_pipeline.run(), there is this conditional JUST before returning the nodes:
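Roughly this (paraphrased from the llama_index source at the time; exact wording may differ):

# tail end of IngestionPipeline.run(), paraphrased
if self.vector_store is not None:
    self.vector_store.add([n for n in nodes if n.embedding is not None])
return nodes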
So like... it's most certainly calling add(), unless one of these conditions is false
No dice, even when I add one in particular it doesn't load.
I was able to load some yesterday using dummy data and the guides above, but never with an ingestion pipeline; just by running insert() on a Vector Index.
I'm trying to use ingestion pipelines because they're easier to read and are async.
I'm going to paste what nodes[0] looks like. That may be a factor here. I'm positive it's embedded, but we'll see...
self.vector_store.add([nodes[0]])
print('################ adding')
print(nodes[0])
print(type(nodes[0]))
print(hasattr(nodes[0], 'embedding'))
Doc ID: 3e248a52-448c-4a5b-b03b-12c6217602c6
Text: <<Patrick>> How far from the Sun is the Earth?
<<PF>> The average distance from the Earth to
the Sun is about 93 million miles (150 million kilometers).
This distance is often referred to as one Astronomical Unit (AU),
which is a standard unit of measurement ...
<class 'llama_index.schema.Document'>
True
So it is indeed a document, and it indeed has an embedding element.
Actually, that would be true even if it wasn't embedded... let me check what the type of the embedding value is.
Ohhh, maybe one thing that might be causing issues (not 100% sure though): you should probably be running a SentenceSplitter or TokenTextSplitter in your ingestion pipeline, no?
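e.g. something like this (a sketch, reusing your pipeline setup from above):

from llama_index.node_parser import SentenceSplitter

# Split first, then embed; the splitter outputs TextNode objects
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)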
I'm not splitting anything, just summarizing them.
Yea, curious if the embedding is a List[float]
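Quick check on my end (sketch):

emb = nodes[0].embedding
# Expect <class 'list'>, <class 'float'>, and a dimension like 1536
print(type(emb), type(emb[0]) if emb else None, len(emb or []))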
Looks like embeddings to me.
The fact that vector_store.add() doesn't work is pretty concerning lol
Do I need this? The examples I'm providing are not big, and I would rather summarize than split because I want to keep Prompts & Responses in a single chunk.
Otherwise I worry our RAG will perform worse.
So I'm opting for summarizing, basically.
Maybe I need to write a Custom Transformer to adhere to the pattern above? I considered that but didn't think it was necessary (I also only thought of it after I had already written my methods outside of the framework).
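If I did, it would look something like this, I guess (a rough, untested sketch; assumes pf_llm exposes a sync predict() mirroring the apredict() I use above):

from llama_index.schema import TransformComponent
from pf.cbcore.utils.llm import pf_llm

class SummarizeTransform(TransformComponent):
    '''Sketch: summarize oversized nodes inside the pipeline instead of beforehand.'''

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            # 150-word limit plus 5% slack, same as async_ensure_chunk
            if len(node.text.split()) > 157:
                node.text = pf_llm.predict(
                    'Please summarize the following text to be at or within '
                    'a chunk/word limit of 150:\n\n' + node.text
                )
        return nodes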
Yea, I totally feel you. I just wonder if them being a Document object instead of a TextNode object is problematic
It needs to be a TextNode? I was able to load a Document in a simple example with insert earlier yesterday. But yeah, never tried it with an ingestion pipeline.
I don't THINK so (Document and TextNode are nearly the same), just reading the source code trying to explain this haha
Maybe try this -- this is the most simple sanity test I can think of haha
from llama_index.schema import TextNode
node = TextNode(text="hello world")
node.embedding = embed_model.get_text_embedding(node.text)
vector_store.add([node])
Can I extract my embed_model from my service context?
Like, is that something I can just do with service_context.embed_model.get_text_embedding()?
node = TextNode(text="Testing with Logan")
node.embedding = self.service_context.embed_model.get_text_embedding(node.text)
self.vector_store.add([nodes[0]])
Oh, forgot to update the last line
node = TextNode(text="Testing with Logan")
node.embedding = self.service_context.embed_model.get_text_embedding(node.text)
self.vector_store.add([node])
The embeddings are there too
It runs, but nothing is in the database
So something is up with either the vector store or the tool you are using to view the data :thinking_face:
I am going to try and insert a different way that worked for me
Omg, that's it... I think I figured it out.
Yeah, for some reason data_ is prepended to the table name, which is fine.
So I added that prefix in my new pgvector store instantiation, when really I should have just passed the original name and let pgvector prepend data_ itself.
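So the fix on my side is roughly this (stand-in params again):

# Pass the bare name and let PGVectorStore prepend data_ itself
vector_store = PGVectorStore.from_params(
    database="vectordb",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    table_name="pf_chunks",  # stored in Postgres as data_pf_chunks
    embed_dim=1536,
)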
Jk it worked. And it even loaded the originals we were stuck on
I can't believe it was something as simple as that.
You mentioning this jogged my memory.
yeaaaa I always forget about that too haha
But now we can't change it without breaking everyones vector stores :cryingskull:
Look for f'data_{vs_name}' as well as vs_name, right?
Still might be confusing in case anyone's table name already starts with data_ though
yeaaaa... probably a proper fix is adding a deprecation warning for a month or two, to give people notice. But just hasn't been a priority yet haha
Yeah, it's not really a big deal, I don't blame you.
It broke again, but like, because I was running the final method async (in the example above I wasn't, because I was avoiding another error lol)
See my most recent thread..