Dom's Bytes

How to repair Databricks jobs starting from a specific task

Is it possible to repair a Databricks job starting from a specific task?

This question was asked in one of the company’s Slack channels a couple of days ago. Heck! I must have asked myself this question a dozen times over the last year.

Let’s get right to it: as of 2023, the Databricks UI does not allow you to select which tasks to re-run when you repair a multi-task job. Either you trigger a new job run, or you repair the failed one by re-running the failed and skipped tasks.

The catch is that the Databricks Jobs API 2.1 actually implements a field for exactly this. The rerun_tasks field takes the names of the tasks we want to re-run, regardless of whether those tasks succeeded or failed during the last job run.
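
For reference, here is a minimal sketch of the underlying REST call. It assumes your workspace URL and a personal access token live in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and 123456 is a placeholder run id:

curl -X POST "$DATABRICKS_HOST/api/2.1/jobs/runs/repair" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"run_id": 123456, "rerun_tasks": ["task_B", "task_C"]}'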

We can use the Databricks CLI to trigger a job repair that uses rerun_tasks. There are two Databricks CLIs: the legacy one, written in Python, and the newer one, written in Go. This article assumes you’re using the latter (version 0.200 or higher).

databricks jobs repair-run --json '{"run_id": JOB_RUN_ID, "rerun_tasks": ["task_B", "task_C"]}'
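
If you don’t have the run id at hand, you can look it up from the job’s recent runs. A quick sketch, assuming a job id of 123 and that the new CLI exposes the Jobs list-runs endpoint with a --job-id flag:

databricks jobs list-runs --job-id 123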

We can also trigger the same repair procedure that the Databricks UI exposes, re-running all failed tasks:

databricks jobs repair-run --json '{"run_id": JOB_RUN_ID, "rerun_all_failed_tasks": true}'

You cannot pass both rerun_tasks and rerun_all_failed_tasks in the same request, or the Databricks API will complain.