How can I do parallel processing on IngestionPipelines?
My conversations have as many as 200 documents with as many as 800 pages, so I need to preprocess data before my customers can start a conversation.
I’ve scoured the docs and code, but haven’t found a way to run multiple pipeline calls at once. I’m currently using asyncio.gather over documents and then over pages, calling pipeline.arun for each page, but my results still appear to be sequential…
Processed 6 documents in 130.94 seconds
Total number of pages processed: 6
Average time per document: 21.82 seconds
Average time per page: 21.50 seconds
Doc 4 took 16.62 seconds
  Page 1 took 14.89 seconds
Doc 2 took 39.05 seconds
  Page 1 took 38.35 seconds
Doc 6 took 38.55 seconds
  Page 1 took 37.80 seconds
Doc 5 took 93.89 seconds
  Page 1 took 85.75 seconds
Doc 1 took 129.99 seconds
  Page 1 took 128.76 seconds
Doc 3 took 130.94 seconds
  Page 1 took 129.01 seconds
If this test conversation of 6 docs / 6 pages (all small text) took ~20 seconds per page, then the entire job should take ~20 seconds, right? Any recs on how to make this work?
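For reference, here is a minimal sketch of the fan-out pattern I'm describing and the timing I expected from it. It is not my real code: fake_arun is a hypothetical stand-in for awaiting pipeline.arun, the 0.2-second delay is made up, and the doc/page names are placeholders. With a stand-in that actually yields to the event loop, six one-page documents finish in roughly the time of one page rather than six.

```python
import asyncio
import time


async def fake_arun(page: str, delay: float = 0.2) -> str:
    # Hypothetical stand-in for `await pipeline.arun(...)`.
    # asyncio.sleep yields to the event loop, so other tasks can run meanwhile.
    await asyncio.sleep(delay)
    return f"processed {page}"


async def process_document(pages: list[str]) -> list[str]:
    # Fan out over the pages of a single document.
    return await asyncio.gather(*(fake_arun(page) for page in pages))


async def process_all(docs: dict[str, list[str]]) -> tuple[list[list[str]], float]:
    # Fan out over documents; each document fans out over its own pages.
    start = time.perf_counter()
    results = await asyncio.gather(
        *(process_document(pages) for pages in docs.values())
    )
    return results, time.perf_counter() - start


def run_demo() -> tuple[list[list[str]], float]:
    # Six documents with one page each, mirroring the test conversation above.
    docs = {f"doc{i}": [f"doc{i}-page1"] for i in range(1, 7)}
    return asyncio.run(process_all(docs))


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"Processed {len(results)} documents in {elapsed:.2f} seconds")
```

Since asyncio.gather only overlaps work while each coroutine is awaiting something, my guess is that something inside the real call is blocking the event loop or being serialized elsewhere, but I haven't confirmed that.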