Deploy to Databricks
Deploys a streaming or batch job to Databricks. If an old version of the job is already running, this task shuts it down and deploys the new version.
It is most often used in combination with Deploy artifacts to Azure Blob.
Deployment
Add the following task to `deployment.yaml`:
```yaml
- task: deploy_to_databricks
  jobs:
    - main_name: "main.py"
      config_file: databricks.json.j2
      lang: python
      name: foo
      run_stream_job_immediately: False
      is_batch: False
      arguments:
        - eventhubs.consumer_group: "my-consumer-group"
```
If used together, this task should come after the `upload_to_blob` task, as in the sketch below.
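A minimal sketch of that ordering (the `upload_to_blob` entry is shown here without any configuration of its own, which your setup may require):

```yaml
# Hypothetical ordering sketch: upload_to_blob runs before deploy_to_databricks.
- task: upload_to_blob
- task: deploy_to_databricks
  jobs:
    - main_name: "main.py"
      config_file: databricks.json.j2
```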
field | description | value
---|---|---
`jobs` | A list of job configurations | Must contain at least one job
`jobs[].main_name` | When `lang` is `python`, the path to the Python main file; when `lang` is `scala`, a class name | For Python: `main/main.py`; for Scala: `com.databricks.ComputeModels`
`jobs[].config_file` | The path to a Jinja-templated JSON Databricks job config | Defaults to `databricks.json.j2`
`jobs[].lang` (optional) | The language identifier of your project | One of `python`, `scala`; defaults to `python`
`jobs[].name` (optional) | A postfix to identify your job on Databricks | A postfix of `foo` names your job `application-name_foo-version`. Defaults to no postfix, which names all jobs (if you have multiple) the same.
`jobs[].run_stream_job_immediately` (optional) | Whether or not to run a stream job immediately | `True` or `False`. Defaults to `True`.
`jobs[].is_batch` (optional) | Designate the job as an unscheduled batch job | `True` or `False`. Defaults to `False`.
`jobs[].arguments` (optional) | Key-value pairs to be passed into your project | Defaults to no arguments
The JSON file can use any of the supported keys. During deployment, the presence of the key `schedule` in the JSON file determines whether the job is streaming or batch: when `schedule` is present or `is_batch` is set to `True`, the job is considered a batch job; otherwise it is a streaming job. A streaming job is kicked off immediately upon deployment (unless `run_stream_job_immediately` is set to `False`).
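As an illustration, a `schedule` block uses the Databricks Jobs API format; the cron expression and timezone below are placeholder values to adapt to your own job:

```json
"schedule": {
  "quartz_cron_expression": "0 0 3 * * ?",
  "timezone_id": "UTC"
}
```

Adding such a block to the job config (alongside the other keys shown in the examples below) turns the job into a scheduled batch job.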
An example of `databricks.json.pyspark.j2`:
```json
{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    {
      "egg": "{{ egg_file }}"
    }
  ],
  "spark_python_task": {
    "python_file": "{{ python_file }}",
    "parameters": {{ parameters | tojson }}
  }
}
```
An explanation of the Jinja-templated values. These values are resolved automatically during deployment.
field | description
---|---
`application_name` | `your-git-repo-version` (e.g. `flights-prediction-SNAPSHOT`)
`log_destination` | `your-git-repo` (e.g. `flights-prediction`)
`egg_file` | The location of the egg file uploaded by the task `upload_to_blob`
`python_file` | The location of the python main file uploaded by the task `upload_to_blob`
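As a minimal sketch of what this resolution amounts to (this is not Takeoff's actual implementation), the template can be rendered with jinja2 directly; the `dbfs:` paths and parameter values below are hypothetical:

```python
# Illustrative only: Takeoff resolves these values internally during deployment.
import json
from jinja2 import Template

with open("databricks.json.pyspark.j2") as f:
    template = Template(f.read())

rendered = template.render(
    application_name="flights-prediction-SNAPSHOT",  # your-git-repo-version
    log_destination="flights-prediction",            # your-git-repo
    # Hypothetical blob locations; the real ones are produced by upload_to_blob.
    egg_file="dbfs:/mnt/libraries/flights-prediction/flights-prediction-SNAPSHOT.egg",
    python_file="dbfs:/mnt/libraries/flights-prediction/main-SNAPSHOT.py",
    parameters=["--eventhubs.consumer_group", "my-consumer-group"],
)

job_config = json.loads(rendered)
print(job_config["name"])  # flights-prediction-SNAPSHOT
print(job_config["spark_python_task"]["parameters"])
```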
An example of `databricks.json.scalaspark.j2`:
```json
{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    {
      "jar": "{{ jar_file }}"
    }
  ],
  "spark_jar_task": {
    "main_class_name": "{{ class_name }}",
    "parameters": {{ parameters | tojson }}
  }
}
```
An explanation of the Jinja-templated values. These values are resolved automatically during deployment.
field | description
---|---
`application_name` | `your-git-repo-version` (e.g. `flights-prediction-SNAPSHOT`)
`log_destination` | `your-git-repo` (e.g. `flights-prediction`)
`jar_file` | The location of the jar file uploaded by the task `upload_to_blob`
`class_name` | The class in the jar that should be run
Takeoff config
Make sure `takeoff_config.yaml` contains the following `azure_keyvault_keys`:

```yaml
azure_storage_account:
  account_name: "azure-shared-blob-username"
  account_key: "azure-shared-blob-password"
```

and these `takeoff_common` keys:

```yaml
artifacts_shared_blob_container_name: libraries
```
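Putting both together, the relevant part of `takeoff_config.yaml` would look roughly like this (a sketch assuming `azure_keyvault_keys` and `takeoff_common` are top-level sections, as referenced above):

```yaml
azure_keyvault_keys:
  azure_storage_account:
    account_name: "azure-shared-blob-username"
    account_key: "azure-shared-blob-password"

takeoff_common:
  artifacts_shared_blob_container_name: libraries
```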