Depending on other jobs
Chaining jobs so that one job can use the output of a previous job run is very powerful. Say you want to update a dashboard after analytical tables have been updated, or run an ML job once source datasets have been produced. STOIX makes it trivial to add dependencies between jobs.
When creating or updating a job, you can connect jobs under “Dependencies”. Search for another job by job name or cluster name and click on it to add the dependency. For each dependency, you choose one of two triggers that determine what your job requires from it.
If you select “Any”, your job only requires that the last execution of the dependency was successful. This is useful when, for example, you depend on a job that updates a lookup table: you want to know that the table was produced correctly and is up to date, but you do not need the dependency to follow your schedule.
The other choice is “Same time”, which is used when working with date-partitioned data, for example when your job needs the output of a dependency that has the same schedule. With this option, STOIX schedules your job after the dependency has finished successfully for the same scheduled time.
For example, if both your job and its dependency should run daily after midnight and the dependency finishes successfully at 00:05, your job will start at 00:05 so that the required data is available. Until then, the job overview shows that your job is missing its dependency. If the dependency fails, your job will not run until the issue has been resolved and the dependency has re-run successfully. This also applies to the concept described in the section “Backfilling historical data”.
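The difference between the two triggers can be illustrated with a small model. This is only a sketch of the behaviour described above, not how STOIX is implemented: the `Run` type and the `ready_any` and `ready_same_time` functions are hypothetical names, and the run history is assumed to be a list ordered by execution time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Run:
    """One execution of a dependency (hypothetical model, not a STOIX type)."""
    scheduled_for: datetime  # the scheduled time this execution belongs to
    succeeded: bool


def ready_any(history: List[Run]) -> bool:
    """'Any': the most recent execution of the dependency must have succeeded."""
    if not history:
        return False
    return history[-1].succeeded


def ready_same_time(history: List[Run], scheduled_for: datetime) -> bool:
    """'Same time': the dependency must have succeeded for the same scheduled time.

    A failed run followed by a successful re-run for that time counts as ready.
    """
    matching = [run for run in history if run.scheduled_for == scheduled_for]
    if not matching:
        return False
    return matching[-1].succeeded
```

In this model, a job scheduled for a given day stays blocked under “Same time” while the history contains only a failed run for that day, and becomes ready once a successful re-run for the same day is appended.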
With the power of Docker container images, you can also easily schedule workflows and code that run on external services such as Dataflow, Dataform, or dbt. Anything with an API or CLI tool can be run in an image, and you can add access tokens and similar credentials to the job as secrets. This gives you a powerful way to schedule your data jobs across multiple tools and to collect any issues that occur in one place.
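As a rough illustration, an image could use a small Python entrypoint that wraps the external CLI tool. This is a sketch under assumptions: the `run_tool` function is a hypothetical helper, the secret is assumed to reach the container as an environment variable called `API_TOKEN`, and a non-zero exit code is assumed to mark the job run as failed.

```python
import os
import subprocess
import sys
from typing import List


def run_tool(command: List[str], token_var: str = "API_TOKEN") -> int:
    """Run an external CLI tool and return its exit code.

    token_var is a hypothetical environment variable name; how a secret is
    exposed to the container is an assumption made for this sketch.
    """
    if not command:
        print("no command given", file=sys.stderr)
        return 2
    if token_var not in os.environ:
        print(f"missing secret: {token_var}", file=sys.stderr)
        return 1
    # The child process inherits the environment, including the token,
    # so the CLI tool can authenticate without the token appearing in argv.
    result = subprocess.run(command)
    return result.returncode
```

A container entrypoint could then end with something like `sys.exit(run_tool(["dbt", "run"]))`, so that a failure in the external tool surfaces as a failed job and blocks any jobs that depend on it.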