from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import os

documents = SimpleWebPageReader(html_to_text=True).load_data(["https://www.xyz.com"])
documents
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-41-79eec0778777> in <cell line: 7>()
5 storage_context = StorageContext.from_defaults(vector_store=vector_store)
6
----> 7 raw_index = VectorStoreIndex.from_documents(
8 parsed_docs,
9 storage_context=storage_context,
6 frames
/usr/local/lib/python3.10/dist-packages/llama_index/vector_stores/chroma/base.py in add(self, nodes, **add_kwargs)
263 documents.append(node.get_content(metadata_mode=MetadataMode.NONE))
264
--> 265 self._collection.add(
266 embeddings=embeddings,
267 ids=ids,
AttributeError: 'str' object has no attribute 'add'
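The last frame of the traceback shows that `self._collection` is a plain string at the moment `.add` is called on it. One plausible cause is that a collection name was passed where a collection object was expected. This is an assumption, because the cell that builds `vector_store` is not shown. The sketch below reproduces that failure mode with standard-library stand-ins only. `FakeCollection`, `FakeVectorStore`, and `make_store` are hypothetical names, not the real chromadb or llama_index classes.

```python
# Hypothetical stand-ins: these are NOT the real chromadb / llama_index classes.
class FakeCollection:
    """Mimics only the `add` method of a collection object."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def add(self, embeddings, ids):
        self.rows.extend(zip(ids, embeddings))


class FakeVectorStore:
    """Keeps whatever it is given and calls `.add` on it later."""

    def __init__(self, collection):
        self._collection = collection

    def add(self, embeddings, ids):
        self._collection.add(embeddings=embeddings, ids=ids)


def make_store(collection):
    """Fail early if a name was passed instead of a collection object."""
    if isinstance(collection, str):
        raise TypeError(
            f"expected a collection object, got the name {collection!r}"
        )
    return FakeVectorStore(collection)


# Passing a name reproduces the same kind of error as in the traceback.
try:
    FakeVectorStore("my_collection").add(embeddings=[[0.1, 0.2]], ids=["a"])
except AttributeError as err:
    message = str(err)

# Passing the object itself works.
store = make_store(FakeCollection("my_collection"))
store.add(embeddings=[[0.1, 0.2]], ids=["a"])
```

If this is indeed the cause, the analogous check in the real setup would be to confirm that the value handed to the vector store is the collection object returned by the Chroma client rather than its name.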
SimpleDirectoryReader, does it encode the image? If so, what encoding type does it use?

img = SimpleDirectoryReader("/content/drive/images").load_data()
workflows for RAG and I'd like to include approximate metadata filtering for better retrieval accuracy.

custom_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

class RAGWorkflow(Workflow):
    @step
    async def ingest(self, ctx: Context, ev: StartEvent) -> StopEvent | None:
        """Entry point - ingest documents"""
        index = custom_index
        return StopEvent(result=index)

    @step
    async def retrieve(self, ctx: Context, ev: StartEvent) -> RetrieverEvent | None:
        "Entry point for RAG, triggered by a StartEvent with `query`."
        query = ev.get("query")
        index = ev.get("index")

        if not query:
            return None

        # store the query in the global context
        await ctx.set("query", query)
        await ctx.set("index", index)

        # get the index from the global context
        if index is None:
            print("Index is empty, load some documents before querying!")
            return None

        retriever = index.as_retriever(similarity_top_k=10)
        nodes = await retriever.aretrieve(query)
        return RetrieverEvent(nodes=nodes)
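One library-agnostic way to approximate metadata filtering is to post-filter the retrieved results with a fuzzy string comparison on a metadata field. The sketch below is a minimal illustration under stated assumptions. Nodes are modeled as plain dicts rather than LlamaIndex node objects, and `similarity`, `fuzzy_filter`, the `tenant` key, and the threshold are all hypothetical choices. It uses only `difflib` from the standard library.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_filter(nodes, key, wanted, threshold=0.7):
    """Keep nodes whose metadata[key] is close enough to `wanted`."""
    kept = []
    for node in nodes:
        # Missing metadata is treated as an empty string, which never matches.
        value = str(node.get("metadata", {}).get(key, ""))
        if similarity(value, wanted) >= threshold:
            kept.append(node)
    return kept


# Hypothetical retrieved nodes, modeled as plain dicts for illustration.
nodes = [
    {"text": "lease summary", "metadata": {"tenant": "Acme Corp"}},
    {"text": "lease summary", "metadata": {"tenant": "Acme Crop"}},
    {"text": "lease summary", "metadata": {"tenant": "ACME Corporation"}},
    {"text": "lease summary", "metadata": {"tenant": "Globex"}},
    {"text": "no metadata", "metadata": {}},
]

matches = fuzzy_filter(nodes, "tenant", "Acme Corp", threshold=0.7)
```

In a workflow, a filter like this could run in a step that consumes the retriever's output. Character-level similarity suits free-text fields with typos or naming variants. It is a poor fit for short labels that differ by one character, such as "Property A" and "Property B", which score as near-identical.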
llama-parse and implemented the RAG on a set of financial documents. Similar to the example in this notebook[1], we build a raw index and a recursive index. To my surprise, the results from raw_index.as_query_engine are more accurate than those from the recursive one. I am trying to get an intuition for why this might be. For context, we have tables with financial data, and a sample query might look like: what was the total rent for Property A in 2023? What is the key difference between the two indices, and how do they work under the hood?