Databricks
Databricks is a company founded by the creators of Apache Spark. The name may also refer to its Unified Analytics Platform, a web-based Spark platform that automates cluster management and provides IPython-style notebooks. Its main features include:
Interactive user interface
Cluster sharing
Security features
Collaboration in the same notebooks, revision control, IDE / GitHub integration
Data management, with the ability to connect to different data sources
It improves collaboration between Data Scientists, Data Engineers, and Business Analysts, and it offers some performance improvements over plain Apache Spark.
Scaling and managing computation resources is straightforward. There are two cluster modes:
Standard Clusters
Supports Scala notebook cells
Has some ML-ready runtimes
High Concurrency Clusters
Does not support Scala notebook cells
Has some ML-ready runtimes
Allows fair sharing of resources between users through "task preemption"
Fault isolation, creating an isolated environment for each notebook
The Databricks runtime defines the Spark and Scala versions.
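As an illustration, here is a minimal sketch of creating a Standard cluster programmatically through the Clusters REST API. The workspace URL, token, runtime string, and node type below are placeholders you would replace with your own values:

    import requests

    # Placeholder workspace URL and personal access token
    HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"

    # Minimal cluster spec: the spark_version string selects the
    # Databricks runtime, which fixes the Spark and Scala versions.
    cluster_spec = {
        "cluster_name": "demo-standard-cluster",
        "spark_version": "7.3.x-scala2.12",   # example runtime version
        "node_type_id": "i3.xlarge",          # example AWS node type
        "num_workers": 2,
    }

    resp = requests.post(
        f"{HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    print(resp.json())  # contains the new cluster_id on success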
Data can be uploaded through the UI, imported from a range of data sources into DBFS (Databricks File System), or processed in memory and stored back into a data source. In the Data menu, you can generate a notebook for each option that demonstrates how to import the data and convert it to a table. Options may vary between the Azure and AWS platforms.
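For example, a generated notebook for a CSV upload typically boils down to something like the following sketch; the file path and table name here are hypothetical, and spark is the SparkSession that Databricks predefines in every notebook:

    # Read a CSV file that was uploaded through the UI into DBFS
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/FileStore/tables/my_upload.csv"))  # hypothetical path

    # Register it as a table so it can be queried with SQL
    df.write.saveAsTable("my_table")  # hypothetical table name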
After generating a notebook, you may have to specify sensitive information such as passwords or secrets. Since you can collaborate in Databricks notebooks, other users may see pasted passwords before you remove them. You can instead create a secret scope to store secrets in a hidden way, and access them using dbutils:
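For instance, assuming a secret scope named jdbc with a password key already exists, a notebook can read the value without ever displaying it:

    # Retrieve a secret; scope and key names here are assumptions.
    # Databricks redacts the value if you try to print it in a cell.
    password = dbutils.secrets.get(scope="jdbc", key="password")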
The dbutils tool allows you to:
dbutils.notebook: execute notebooks as scripts, passing arguments and receiving a return value
dbutils.secrets: use secret scopes and keys, without being able to show or modify them
dbutils.library: install Python libraries
dbutils.widgets: create and read HTML inputs placed above the notebook
dbutils.fs: access DBFS and create, read, update, and delete folders and files (see the sketch below)
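A couple of these in action, as a minimal sketch in a Python notebook cell:

    # Create an input widget above the notebook and read its value
    dbutils.widgets.text("country", "US")
    country = dbutils.widgets.get("country")

    # Create a DBFS folder and list the DBFS root
    dbutils.fs.mkdirs("/tmp/demo")
    files = dbutils.fs.ls("/")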
You can also access DBFS through the magic command %fs, placed at the top of a notebook cell. For example, to list the DBFS root:
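    %fs ls /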
This command achieves the same as:
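    display(dbutils.fs.ls("/"))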
Magic commands can also execute shell commands, as the list below shows. You can only specify one magic command per cell:
%python: Python, may be set as the default language
%scala: Scala, may be set as the default language
%sql: SQL
%r: R
%run: run another notebook
%md: Markdown
%sh: execute shell commands
%bash: execute bash commands
%fs: access the file system (DBFS) through dbutils
Besides the dbutils tool that is accessible in notebook cells, there is also a display function that renders DataFrames as tables and lets you create visualizations.
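As a quick sketch, any Spark DataFrame can be passed to display; the sample data here is made up:

    # Build a small DataFrame and render it as an interactive table;
    # the plot options above the result let you switch to a chart.
    df = spark.createDataFrame(
        [("2020", 10), ("2021", 25)],
        ["year", "count"],
    )
    display(df)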
Finally, there are jobs, which allow you to run a notebook or JAR either immediately or on a schedule. You can create and run jobs using the UI, the CLI, or by invoking the Jobs API. Similarly, you can monitor job run results in the UI, using the CLI, by querying the API, or through email alerts.
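As a rough sketch, creating a scheduled notebook job through the Jobs API could look like this; the workspace URL, token, job name, notebook path, cluster ID, and cron schedule are all placeholders:

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"

    job_spec = {
        "name": "nightly-report",                     # hypothetical job name
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",  # every night at 02:00
            "timezone_id": "UTC",
        },
        "tasks": [{
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Users/me/report"},
            "existing_cluster_id": "<cluster-id>",
        }],
    }

    resp = requests.post(
        f"{HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=job_spec,
    )
    print(resp.json())  # contains the new job_id on success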