1. The Core Idea: Test Your Data 🧪
Soda Core lets you run tests on your data tables, much like a developer runs unit tests on their code. You define rules in a YAML file, and Soda runs queries against your Databricks tables to see if they pass. The process involves three key parts:
- configuration.yml: Tells Soda how to connect to your Databricks SQL Warehouse.
- checks.yml: Contains your data quality rules, telling Soda what to test.
- soda scan: The command you run to execute the tests.
2. How to Write a Check
Anatomy of a checks.yml File
The file is organized into blocks, where each block targets a specific table (which Soda calls a "dataset").
# checks.yml
# Use a "checks for" block to target a table.
checks for customers:
  # Indent your list of checks for this table here.
  - row_count > 0
  - missing_count(email) = 0

checks for orders:
  # A separate list of checks for the 'orders' table.
  - duplicate_count(order_id) = 0
Anatomy of a Single Check
A check has two main parts: the metric (what you measure) and the threshold (the condition for success).
Example: missing_count(email) = 0
- missing_count(email): This is the metric. Soda calculates the number of NULL values in the email column.
- = 0: This is the threshold. The check passes only if the result of the metric is 0.
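The same metric-plus-threshold shape covers comparison operators, percentages, and ranges. A few illustrative combinations (the column names here are assumptions for the example):
- duplicate_count(order_id) = 0      # threshold: exactly zero
- missing_percent(email) < 5%        # thresholds can be percentages
- row_count between 100 and 10000    # or ranges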
Alert Levels: warn vs. fail
For any check, you can set two levels of alerts: warning 🟡 and failure 🔴. This lets you distinguish between minor issues and critical, pipeline-blocking problems.
- warn: The check result is a "warning". The scan is considered "successful with warnings". This is useful for notifications.
- fail: The check result is a "failure". The scan itself is considered to have failed. This is used to stop a data pipeline or trigger an urgent alert.
checks for customers:
  - missing_count(email):
      # Issue a warning if more than 5 emails are missing.
      warn: when > 5
      # Fail the scan if more than 50 emails are missing.
      fail: when > 50
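The warn/fail pattern works with any metric. For instance, a shrinking table can warn before it blocks the pipeline; the orders table below is illustrative:
checks for orders:
  - row_count:
      warn: when < 1000
      fail: when < 100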
3. SodaCL Syntax Cheatsheet 📝
Row Count & Freshness
# Checks for table size and recent updates
- row_count > 1000
- row_count between 5000 and 10000
- freshness(event_timestamp) < 2d 12h # Fails if latest timestamp is > 2.5 days old
Missing, Invalid & Duplicate Values
# Checks for completeness, validity, and uniqueness
- missing_count(column_name) = 0
- missing_percent(column_name) < 5%
- duplicate_count(id) = 0
- invalid_count(status) = 0:
    valid values: ["active", "inactive", "pending"]
- invalid_count(notes) = 0:
    invalid values: ["-", "N/A", "none"]
- invalid_count(email_column) = 0:
    valid format: email
Numeric Distribution & Range
# Validate the distribution and range of numeric data
- avg(temperature) between -10 and 40
- min(price) >= 0
- percentile(latency_ms, 0.99) < 2000   # 99th percentile (warehouse support varies)
- stddev(score) < 5.5
- anomaly score for avg(order_value) < default  # requires Soda Cloud scan history
- invalid_count(ratings) = 0:
    valid min: 1
    valid max: 5
Text & String Patterns
# Validate text-based columns using length, format, or regex
- min_length(name) > 1
- max_length(comment) < 500
- invalid_count(product_sku) = 0:
    valid regex: "[A-Z]{2,4}-\d{5}"
# There are no "starts with"/"ends with"/"contains" keys; express these with valid regex
- invalid_count(user_id) = 0:
    valid regex: "^user_"
- invalid_count(log_message) = 0:
    valid regex: "\[END\]$"
- invalid_count(hostname) = 0:
    valid regex: "\.com"
Date & Time Formats
# valid format matches named formats on text columns (not native date/timestamp types)
- invalid_count(created_at) = 0:
    valid format: date iso 8601
Schema
# Verify column names, types, and order
- schema:
    fail:
      when required column missing: [id, created_at]
      when forbidden column present: [password_hash]
      when wrong column type:
        id: int
        revenue: decimal(10, 2)
      when wrong column index:
        id: 0
        email: 1
Column Comparisons & Row-Level Checks
# Write custom rules that apply to each row
- failed rows:
    name: "End date must be after start date"
    fail condition: end_date <= start_date
- failed rows:
    name: "Discount cannot exceed price"
    fail condition: discount_amount > price
Cross-Table & Referential Integrity
# Ensure consistency between two tables
- values in (customer_id) must exist in customers (id)
- row_count same as customers # Compare row counts of two tables
Change-Over-Time
# Note: these checks compare against historic measurements, which requires Soda Cloud.
- change for row_count < 50                # absolute change vs. the previous scan
- change avg last 7 for row_count < 50     # change vs. the average of the last 7 scans
User-Defined SQL Checks
# For complex business logic, write your own SQL query
# 🟩 User-defined metric check: the query must return a single numeric value
# (write the target table name directly in the query)
- max_price < 1000:
    name: "Max price should be under 1000"
    max_price query: |
      SELECT MAX(price) FROM products
# 🟩 Failed rows check: the query must return the rows that violate a rule
- failed rows:
    name: "Shipped orders must have a shipped_date"
    fail query: |
      SELECT * FROM orders
      WHERE status = 'shipped' AND shipped_date IS NULL
Applying Checks to Multiple Columns
# SodaCL has no per-column loop; repeat the check once per column
- missing_count(email) = 0
- missing_count(phone_number) = 0
- missing_count(address) = 0
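What SodaCL does provide is a dataset-level loop, which applies one list of checks to several tables at once:
for each dataset T:
  datasets:
    - customers
    - orders
  checks:
    - row_count > 0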
Grouped Checks
# Run checks on segments of your data (the group by check is experimental)
- group by:
    query: |
      SELECT country_code, COUNT(*) AS customer_count, AVG(order_value) AS avg_order_value
      FROM orders
      GROUP BY country_code
    fields:
      - country_code
    checks:
      - customer_count > 100                 # fails for any country with <= 100 rows
      - avg_order_value between 10 and 500   # checks average order value per country
4. Setup & Execution
Step 1: Create configuration.yml
This file connects Soda to your Databricks SQL Warehouse. Get the Host, HTTP Path, and a Personal Access Token from your Warehouse's Connection Details.
# configuration.yml
# Databricks SQL connections use the Spark data source type
data_source my_databricks_wh:
  type: spark
  method: databricks
  host: "dbc-xxxxxxxx-xxxx.cloud.databricks.com"
  http_path: "/sql/1.0/warehouses/xxxxxxxxxxxxxxxx"
  token: "dapixxxxxxxxxxxxxxxxxxxxxxxx"
  catalog: "main"   # Unity Catalog name, if applicable
  schema: "default"
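To keep the token out of version control, Soda can resolve environment variables in configuration.yml; a minimal sketch, assuming you export DATABRICKS_TOKEN in the shell that runs the scan:
# export DATABRICKS_TOKEN="dapi..."   (run in your shell first)
token: ${DATABRICKS_TOKEN}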
Step 2: Run the Scan (CLI)
Execute your checks from the command line.
# Install the Spark package with the Databricks extra
pip install "soda-core-spark[databricks]"
# Run the scan
soda scan -d my_databricks_wh -c configuration.yml checks.yml
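The command's exit code is non-zero when checks fail, so the same invocation doubles as a CI gate with no output parsing; a minimal shell sketch:
# Abort the pipeline step if the scan reports failures
soda scan -d my_databricks_wh -c configuration.yml checks.yml || exit 1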
5. Programmatic Scans (Databricks Notebook)
Integrate Soda directly into your data pipelines by running it from a notebook.
# In a Databricks notebook cell
# Install the library if it's not already on the cluster
%pip install soda-core-spark[databricks]
from soda.scan import Scan
# 1. Configure the scan
scan = Scan()
scan.set_data_source_name("my_databricks_wh")
scan.add_configuration_yaml_str(
    f"""
data_source my_databricks_wh:
  type: spark
  method: databricks
  host: "..."
  http_path: "..."
  # Pull the token from a secret scope so it never appears in plain text
  token: "{dbutils.secrets.get('soda', 'databricks_token')}"
"""
)
# 2. Add your SodaCL checks file
scan.add_sodacl_yaml_file("checks.yml")
# 3. Execute the scan
scan.execute()
# 4. Check the results and take action
if scan.has_check_fails():
    print("🚨 Soda Scan FAILED with critical issues!")
    scan.assert_no_checks_fail()  # raises an exception, failing the notebook cell
else:
    # Warnings do not fail a scan; see below for how to surface them
    print("✅ All data quality checks passed.")
# You can also get a detailed report as a dictionary
print(scan.get_scan_results())
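get_scan_results() returns a plain dictionary. Below is a minimal sketch for surfacing warnings from it; the "checks" and "outcome" keys reflect the scan-results format as commonly seen, so verify against your Soda Core version:
results = scan.get_scan_results()
for check in results.get("checks", []):
    # outcome is typically "pass", "warn", or "fail"
    if check.get("outcome") == "warn":
        print(f"🟡 Warning: {check.get('name')}")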
6. Complete Example: checks.yml
Here is a sample file with a variety of checks across several related tables to show how they work together.
#=============== CHECKS FOR CUSTOMER DATA ===============
checks for customers:
  # --- Table-level checks ---
  - row_count > 0:
      name: "Table is not empty"
  - freshness(updated_at) < 24h:
      name: "Customer data is fresh"
  - duplicate_count(id) = 0:
      name: "Customer ID is unique"
  # --- Schema validation ---
  - schema:
      name: "Customer schema is correct"
      fail:
        when required column missing: [id, email, country_code, created_at]
        when wrong column type: {id: string, age: integer, created_at: timestamp}
  # --- Column-level data quality checks ---
  - missing_percent(email) < 1%:
      name: "Email address is mostly present"
  - invalid_count(email) = 0:
      name: "Email format is valid"
      valid format: email
  - invalid_count(country_code) = 0:
      name: "Country code is 2 characters"
      valid length: 2
  - failed rows:
      name: "Customers must be 18 or older"
      fail condition: age < 18
#=============== CHECKS FOR PRODUCT CATALOG ===============
checks for products:
  # --- Table-level checks ---
  - duplicate_count(sku) = 0:
      name: "SKU is unique"
  - missing_count(name) = 0:
      name: "All products have a name"
  # --- Column-level data quality checks ---
  - invalid_count(category) = 0:
      name: "Category is from an approved list"
      valid values: ["electronics", "books", "home goods", "apparel"]
  - negative_price_count = 0:
      name: "Price must be positive"
      negative_price_count query: |
        SELECT COUNT(*) FROM products WHERE price < 0
  - anomaly score for avg(price) < default:
      name: "Average product price is not anomalous"
#=============== CHECKS FOR ORDERS FACT TABLE ===============
checks for orders:
  # --- Table-level checks ---
  - row_count > 0
  - freshness(order_date) < 2h
  # --- Referential integrity ---
  - values in (customer_id) must exist in customers (id):
      name: "Order links to a valid customer"
  - values in (product_sku) must exist in products (sku):
      name: "Order links to a valid product"
  # --- Row-level business logic ---
  - failed rows:
      name: "Shipped date cannot be before order date"
      fail condition: shipped_date < order_date
  # --- Missing-value checks on key columns ---
  - missing_count(order_id) = 0
  - missing_count(customer_id) = 0
  - missing_count(product_sku) = 0
  - missing_count(order_value) = 0
  # --- Segmented data quality checks (experimental group by) ---
  - group by:
      query: |
        SELECT status, COUNT(*) AS rows_per_status, AVG(order_value) AS avg_value_per_status
        FROM orders
        GROUP BY status
      fields:
        - status
      checks:
        - rows_per_status > 0
        - avg_value_per_status > 0:
            name: "Average order value is positive for each status"