Deploy to Databricks
Deploys a streaming or batch job to Databricks. If an old version of the job is already running, this task shuts it down and deploys the new version.
It is most often used in combination with Deploy artifacts to Azure Blob.
Deployment
Add the following task to `deployment.yaml`:
```yaml
- task: deploy_to_databricks
  jobs:
    - main_name: "main.py"
      config_file: databricks.json.j2
      lang: python
      name: foo
      run_stream_job_immediately: False
      is_batch: False
      arguments:
        - eventhubs.consumer_group: "my-consumer-group"
```
If used together, this task should come after the `upload_to_blob` task, as in the sketch below.
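A minimal sketch of that ordering (the `upload_to_blob` entry is shown here without any configuration of its own, which your setup may require):

```yaml
# Hypothetical ordering sketch: upload_to_blob runs before deploy_to_databricks.
- task: upload_to_blob
- task: deploy_to_databricks
  jobs:
    - main_name: "main.py"
      config_file: databricks.json.j2
```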
field | description | value
---|---|---
`jobs` | A list of job configurations | Must contain at least one job
`jobs[].main_name` | When `lang` is `python`, the path to the Python main file; when `lang` is `scala`, a class name | For Python: `main/main.py`; for Scala: `com.databricks.ComputeModels`
`jobs[].config_file` | The path to a Jinja-templated JSON Databricks job config | Defaults to `databricks.json.j2`
`jobs[].lang` (optional) | The language identifier of your project | One of `python`, `scala`; defaults to `python`
`jobs[].name` (optional) | A postfix to identify your job on Databricks | A postfix of `foo` names your job `application-name_foo-version`. Defaults to no postfix, which names all jobs (if you have multiple) the same.
`jobs[].run_stream_job_immediately` (optional) | Whether or not to run a stream job immediately | `True` or `False`. Defaults to `True`.
`jobs[].is_batch` (optional) | Designate the job as an unscheduled batch job | `True` or `False`. Defaults to `False`.
`jobs[].arguments` (optional) | Key-value pairs to be passed into your project | Defaults to no arguments
The JSON file can use any of the supported keys. During deployment, the presence of the key `schedule` in the JSON file determines whether the job is streaming or batch: when `schedule` is present or `is_batch` is set to `True`, the job is considered a batch job; otherwise it is a streaming job. A streaming job is kicked off immediately upon deployment (unless `run_stream_job_immediately` is set to `False`).
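As an illustration, a `schedule` block uses the Databricks Jobs API format; the cron expression and timezone below are placeholder values to adapt to your own job:

```json
"schedule": {
  "quartz_cron_expression": "0 0 3 * * ?",
  "timezone_id": "UTC"
}
```

Adding such a block to the job config (alongside the other keys shown in the examples below) turns the job into a scheduled batch job.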
An example of `databricks.json.pyspark.j2`:
```json
{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    {
      "egg": "{{ egg_file }}"
    }
  ],
  "spark_python_task": {
    "python_file": "{{ python_file }}",
    "parameters": {{ parameters | tojson }}
  }
}
```
An explanation of the Jinja-templated values. These values are resolved automatically during deployment.
field | description
---|---
`application_name` | `your-git-repo-version` (e.g. `flights-prediction-SNAPSHOT`)
`log_destination` | `your-git-repo` (e.g. `flights-prediction`)
`egg_file` | The location of the egg file uploaded by the task `upload_to_blob`
`python_file` | The location of the python main file uploaded by the task `upload_to_blob`
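As a minimal sketch of what this resolution amounts to (this is not Takeoff's actual implementation), the template can be rendered with jinja2 directly; the `dbfs:` paths and parameter values below are hypothetical:

```python
# Illustrative only: Takeoff resolves these values internally during deployment.
import json
from jinja2 import Template

with open("databricks.json.pyspark.j2") as f:
    template = Template(f.read())

rendered = template.render(
    application_name="flights-prediction-SNAPSHOT",  # your-git-repo-version
    log_destination="flights-prediction",            # your-git-repo
    # Hypothetical blob locations; the real ones are produced by upload_to_blob.
    egg_file="dbfs:/mnt/libraries/flights-prediction/flights-prediction-SNAPSHOT.egg",
    python_file="dbfs:/mnt/libraries/flights-prediction/main-SNAPSHOT.py",
    parameters=["--eventhubs.consumer_group", "my-consumer-group"],
)

job_config = json.loads(rendered)
print(job_config["name"])  # flights-prediction-SNAPSHOT
print(job_config["spark_python_task"]["parameters"])
```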
An example of `databricks.json.scalaspark.j2`:
```json
{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    {
      "jar": "{{ jar_file }}"
    }
  ],
  "spark_jar_task": {
    "main_class_name": "{{ class_name }}",
    "parameters": {{ parameters | tojson }}
  }
}
```
An explanation of the Jinja-templated values. These values are resolved automatically during deployment.
field | description
---|---
`application_name` | `your-git-repo-version` (e.g. `flights-prediction-SNAPSHOT`)
`log_destination` | `your-git-repo` (e.g. `flights-prediction`)
`jar_file` | The location of the jar file uploaded by the task `upload_to_blob`
`class_name` | The class in the jar that should be run
Takeoff config
Make sure `takeoff_config.yaml` contains the following `azure_keyvault_keys`:

```yaml
azure_storage_account:
  account_name: "azure-shared-blob-username"
  account_key: "azure-shared-blob-password"
```

and these `takeoff_common` keys:

```yaml
artifacts_shared_blob_container_name: libraries
```
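Putting both together, the relevant part of `takeoff_config.yaml` would look roughly like this (a sketch assuming `azure_keyvault_keys` and `takeoff_common` are top-level sections, as referenced above):

```yaml
azure_keyvault_keys:
  azure_storage_account:
    account_name: "azure-shared-blob-username"
    account_key: "azure-shared-blob-password"

takeoff_common:
  artifacts_shared_blob_container_name: libraries
```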