question about the new pipeline

Question about the new pipeline structure: in the example code, you create a custom class to perform a preprocessing step. What happens when we have 10 steps? Do we add all those steps to the custom class, or do we make 10 custom classes? What is the recommended approach?
It's up to you! Both will work 🙂

(Personally, I might combine custom transforms where it makes sense, to reduce the cache size)
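For example, something like this (a rough sketch using the custom TransformComponent pattern; the cleanup steps and names are just made up for illustration):

```python
import re

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TransformComponent


class TextCleaner(TransformComponent):
    """One custom component that bundles several preprocessing steps."""

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = node.text.lower()                        # step 1: lowercase
            node.text = re.sub(r"\s+", " ", node.text)           # step 2: collapse whitespace
            node.text = re.sub(r"[^0-9a-z .,]", "", node.text)   # step 3: drop stray characters
        return nodes


# Option A: one combined component -> one cache entry covers all three steps
pipeline = IngestionPipeline(transformations=[TextCleaner(), SentenceSplitter()])

# Option B: split each step into its own tiny TransformComponent and list them
# all in `transformations` -- also fine, each one just gets its own cache slot.
```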
Still trying to wrap my brain around what this means. Does this mean that each pipeline entry's transformations are cached to disk? Does this caching apply when you run the app multiple times, or only during a single run? I'm not clear how this works between runs... like if my app runs and throws an error somewhere while processing the pipeline, when I run the app again, does processing start from step 1 or from the step before it failed?
So, I can explain:
The transforms in the pipeline are run one at a time

On each transform step, a hash of the transform component + its input nodes is calculated. This hash is used as the lookup key for the output of that component

So at the start of each transform, this hash is used, and if there is a cache hit, the cached result is used instead of running the transform step
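Conceptually it's something like this (not the actual library code, just the idea of what each step does; the function name is made up):

```python
import hashlib


def run_transform_with_cache(transform, input_nodes, cache):
    # cache key = hash of the transform's definition + the nodes it receives
    key = hashlib.sha256(
        (repr(transform) + "".join(n.get_content() for n in input_nodes)).encode()
    ).hexdigest()
    if key in cache:                        # cache hit: reuse the stored output
        return cache[key]
    output_nodes = transform(input_nodes)   # cache miss: actually run the step
    cache[key] = output_nodes
    return output_nodes
```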

By default, this cache is in-memory (and needs to be explicitly saved/reloaded)

There are also integrations with Redis, MongoDB, etc. for remote caches that do not need to be explicitly saved/reloaded
If your pipeline crashed, assuming you saved the cache, on the next re-run with the same data it would use the cached results for the steps that did not crash
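In code, the save/reload flow looks roughly like this (paths and document contents are just examples; with a remote cache like the Redis integration you can skip the explicit persist/load):

```python
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

documents = [Document(text="some scraped page text")]  # placeholder input

pipeline = IngestionPipeline(transformations=[SentenceSplitter()])
try:
    nodes = pipeline.run(documents=documents)
finally:
    # save whatever made it into the in-memory cache, even if a later step crashed
    pipeline.persist("./pipeline_storage")

# next run: restore the cache, so steps that already succeeded become cache hits
new_pipeline = IngestionPipeline(transformations=[SentenceSplitter()])
new_pipeline.load("./pipeline_storage")
nodes = new_pipeline.run(documents=documents)
```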
I can try drawing this out in figma if it's still a little murky
That's making more sense... thank you kindly! I'm curious how this integrates with framework pipelines like kedro... in kedro, we build DAGs of nodes, where each node should handle a single processing step, and I'm not clear how to integrate the llama pipeline across a framework pipeline. Like, do I put the entire llama pipeline processing step into a single framework node? In kedro you can only pass datasets or objects between nodes... what are your thoughts about that context?
hmmm, so like, you'd want to try using kedro to process transformation steps?
Well, I'm trying to use the kedro framework to help me structure the entire application. I've built a web scraping app with a dozen pipelines to scrape, transform, and load data into Mongo, and then I'm using llama index to process the content. So I want to keep using kedro to finish the project, and I'm unclear how the llama index pipeline fits into a larger pipeline context
hmmm, it sounds like you already have kedro working, so I don't think the IngestionPipeline will add much value or integrate easily in this case
like, maybe custom transform components could be pipelines on kedro? Not sure (Never used kedro lol)
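Totally untested guess at what the "one kedro node wraps the whole llama pipeline" option could look like (I haven't used kedro, so the dataset/function names here are just placeholders):

```python
from kedro.pipeline import Pipeline, node
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter


def run_llama_ingestion(documents):
    """Single kedro node that runs the whole IngestionPipeline as one step."""
    ingest = IngestionPipeline(transformations=[SentenceSplitter()])
    return ingest.run(documents=documents)


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=run_llama_ingestion,
                inputs="scraped_documents",  # hypothetical kedro dataset name
                outputs="llama_nodes",       # hypothetical output dataset/object
                name="llama_ingestion",
            )
        ]
    )
```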
hmm, ok, thanks for sharing your thoughts /0_o