gamecode8
Offline, last seen 2 months ago
Joined September 25, 2024
Hello, what is the proper way to handle exceptions in a Workflow? In the example generator function below, even though the exception is caught, the workflow task's exception still seems to bubble up to asyncio's default exception handler. Is this expected behavior?

Plain Text
async def event_generator():
    try:
        wf = MyWorkflow(timeout=30, verbose=True)
        handler = wf.run(user_query=topic["query"])

        async for ev in handler.stream_events():
            yield {"event": "progress", "data": ev.msg}

        final_result = await handler

        # Send final result message
        yield {"event": "workflow_complete", "data": final_result}

    except Exception as e:
        error_message = f"Error in workflow: {str(e)}"
        logger.error(error_message)
        yield {"event": "error", "data": error_message}
12 comments
Hello, I just started implementing workflows and have a quick question.

In the context of a web server, are workflows meant to be created on each request, or should we create one workflow instance and call workflow.run(…) on each request?
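For context, this is the per-request pattern I am considering — a minimal sketch assuming FastAPI (the endpoint and module path are illustrative, not from any docs):

Plain Text
from fastapi import FastAPI

from my_app.workflows import MyWorkflow  # hypothetical module path

app = FastAPI()

@app.post("/run")
async def run_workflow(query: str):
    # A fresh instance per request keeps each run's state isolated;
    # whether a single shared instance is safe presumably depends on
    # whether the workflow's steps store anything on self.
    wf = MyWorkflow(timeout=30)
    result = await wf.run(user_query=query)
    return {"result": str(result)}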
2 comments

Docstore

Hello, can I have some clarification on what the docstore is intended to store? Chunks, or the full document?

The documentation here states chunks: https://docs.llamaindex.ai/en/stable/module_guides/storing/docstores/

However, I have found that the ingestion pipeline stores the full document text before chunking when using document management.

https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/#document-management

Thank you!
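To make the question concrete, here is roughly what I am observing with a SimpleDocumentStore (a sketch; the doc ID and text are made up):

Plain Text
from llama_index.core import Document
from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
doc = Document(text="full document text, before any chunking", id_="doc-1")
docstore.add_documents([doc])

# docstore.docs maps doc_id -> stored entry. With the ingestion
# pipeline's document management, what lands here appears to be the
# input document (plus its hash, for dedup), not the post-split chunks.
print(list(docstore.docs.keys()))           # ['doc-1']
print(docstore.get_document("doc-1").text)  # the full text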
7 comments
Hello, is there a recommended way to rerun the ingestion pipeline after a failure? 10K documents were inserted into the docstore, but embedding failed partway through; rerunning now skips them because they are considered duplicates.

Is the solution to delete everything from the docstore, or is there a better way?
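One approach I am considering — a sketch, assuming the docstore was persisted as a SimpleDocumentStore (adjust for your backend): delete the stored entries so the rerun no longer treats them as duplicates.

Plain Text
from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore.from_persist_path("./docstore.json")

# Deleting a document also drops its stored hash, so the next
# pipeline.run() should re-ingest (and re-embed) it instead of
# skipping it as a duplicate.
for doc_id in list(docstore.docs.keys()):
    docstore.delete_document(doc_id, raise_error=False)

docstore.persist("./docstore.json")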
4 comments
Yes, I am doing that, and it works great when I only use a single parser. But if I apply a second transformation, such as a text splitter, the deduping breaks in the vector store because the ref_doc_id changes after the first transformation in my code snippet.
21 comments
Hello everyone! Could someone please provide some clarification on IngestionPipeline? I am noticing that when I apply multiple transformations, the original document's ID is lost after the SentenceSplitter transformation, which ends up inserting new rows into the vector store, since each embedding's doc ID is the ID of a node from the MarkdownNodeParser transformation instead of the original document.

Is this not the intended usage? My goal is to split the markdown sections into chunks after parsing, to break down long sections in my document while preserving the original document's ID.

TIA!

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
        OpenAIEmbedding(),
    ],
    vector_store=pg_vector_store,
    docstore=docstore
)
pipeline.run(documents=documents)
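The workaround I am experimenting with: run the parsers by hand and re-point each chunk's SOURCE relationship at the original document. This is only a sketch, under the assumption that each chunk's source_node currently references the intermediate markdown node produced by the first parser:

Plain Text
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.core.schema import NodeRelationship

md_nodes = MarkdownNodeParser()(documents)
chunks = SentenceSplitter(chunk_size=200, chunk_overlap=0)(md_nodes)

# Re-point each chunk's SOURCE at the original document so ref_doc_id
# (and docstore-based dedup) survives the second transformation.
md_by_id = {n.node_id: n for n in md_nodes}
for chunk in chunks:
    parent = md_by_id.get(chunk.source_node.node_id) if chunk.source_node else None
    if parent is not None and parent.source_node is not None:
        chunk.relationships[NodeRelationship.SOURCE] = parent.source_node

# The re-sourced chunks could then be embedded and inserted, e.g. via a
# pipeline whose only transformation is OpenAIEmbedding():
# pipeline.run(nodes=chunks)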
1 comment
Hello, is there a LlamaIndex equivalent of LangChain's HTMLHeaderTextSplitter?

I have tried HTMLNodeParser, but the output I'm getting for my HTML is not great.

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/HTML_header_metadata/

I have also tried wrapping it with LangchainNodeParser, but that fails because LlamaIndex expects a list of strings while LangChain returns a list of Document objects.
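The manual bridge I ended up trying instead — a sketch, assuming the langchain_text_splitters package is installed: call the LangChain splitter directly and map its Document objects onto LlamaIndex TextNodes by hand.

Plain Text
from langchain_text_splitters import HTMLHeaderTextSplitter
from llama_index.core.schema import TextNode

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
)

def html_to_nodes(html: str) -> list[TextNode]:
    # split_text returns LangChain Documents; copy page_content and
    # metadata onto LlamaIndex TextNodes.
    return [
        TextNode(text=d.page_content, metadata=dict(d.metadata))
        for d in splitter.split_text(html)
    ]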
5 comments
Can someone provide some tips for working with 100+ documents? I have used SentenceWindowNodeParser and stored the nodes in my vector DB.

However, retrieval is performing poorly: when I ask a question, the sentences I expect are not being retrieved.

TIA!
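For reference, this is the retrieval setup I am testing — a sketch assuming an existing VectorStoreIndex (index) built from the sentence-window nodes. MetadataReplacementPostProcessor swaps each retrieved sentence for its surrounding window before synthesis:

Plain Text
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# "window" is the metadata key SentenceWindowNodeParser writes by default.
query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
response = query_engine.query("your question here")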
10 comments