1. The Core Idea: Test Your Data 🧪
Soda Core lets you run tests on your data tables, much like a developer runs unit tests on their code. You define rules in a YAML file, and Soda runs queries against your Databricks tables to see if they pass. The process involves three key parts:
- configuration.yml: Tells Soda how to connect to your Databricks SQL Warehouse.
- checks.yml: Contains your data quality rules, telling Soda what to test.
- soda scan: The command you run to execute the tests.
2. How to Write a Check
Anatomy of a checks.yml File
The file is organized into blocks, where each block targets a specific table (which Soda calls a "dataset").
# checks.yml
# Use a "checks for" block to target a table.
checks for customers:
  # Indent your list of checks for this table here.
  - row_count > 0
  - missing_count(email) = 0

checks for orders:
  # A separate list of checks for the 'orders' table.
  - duplicate_count(order_id) = 0
Anatomy of a Single Check
A check has two main parts: the metric (what you measure) and the threshold (the condition for success).
Example: missing_count(email) = 0
- missing_count(email): This is the metric. Soda calculates the number of NULL values in the email column.
- = 0: This is the threshold. The check passes only if the result of the metric is 0.
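The same metric-plus-threshold shape covers comparison operators, percentages, and ranges. A few illustrative combinations (the column names here are assumptions for the example):
- duplicate_count(order_id) = 0      # threshold: exactly zero
- missing_percent(email) < 5%        # thresholds can be percentages
- row_count between 100 and 10000    # or ranges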
Alert Levels: warn vs. fail
For any check, you can set two levels of alerts: warning 🟡 and failure 🔴. This lets you distinguish between minor issues and critical, pipeline-blocking problems.
- warn: The check result is a "warning". The scan is considered "successful with warnings". This is useful for notifications.
- fail: The check result is a "failure". The scan itself is considered to have failed. This is used to stop a data pipeline or trigger an urgent alert.
checks for customers:
  - missing_count(email):
      # Issue a warning if more than 5 emails are missing.
      warn: when > 5
      # Fail the scan if more than 50 emails are missing.
      fail: when > 50
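The warn/fail pattern works with any metric. For instance, a shrinking table can warn before it blocks the pipeline; the orders table below is illustrative:
checks for orders:
  - row_count:
      warn: when < 1000
      fail: when < 100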
3. SodaCL Syntax Cheatsheet 📝
Row Count & Freshness
# Checks for table size and recent updates
- row_count > 1000
- row_count between 5000 and 10000
- freshness(event_timestamp) < 2d 12h # Fails if latest timestamp is > 2.5 days old
Missing, Invalid & Duplicate Values
# Checks for completeness, validity, and uniqueness
- missing_count(column_name) = 0
- missing_percent(column_name) < 5%
- duplicate_count(id) = 0
- invalid_count(status) = 0:
    valid values: ["active", "inactive", "pending"]
- invalid_count(notes) = 0:
    invalid values: ["-", "N/A", "none"]
- invalid_count(email_column) = 0:
    valid format: email
Numeric Distribution & Range
# Validate the distribution and range of numeric data
- avg(temperature) between -10 and 40
- min(price) >= 0
- percentile(latency_ms, 0.99) < 2000   # 99th percentile (warehouse support varies)
- stddev(score) < 5.5
- anomaly score for avg(order_value) < default  # requires Soda Cloud scan history
- invalid_count(ratings) = 0:
    valid min: 1
    valid max: 5
Text & String Patterns
# Validate text-based columns using length, format, or regex
- min_length(name) > 1
- max_length(comment) < 500
- invalid_count(product_sku) = 0:
    valid regex: "[A-Z]{2,4}-\d{5}"
# There are no "starts with"/"ends with"/"contains" keys; express these with valid regex
- invalid_count(user_id) = 0:
    valid regex: "^user_"
- invalid_count(log_message) = 0:
    valid regex: "\[END\]$"
- invalid_count(hostname) = 0:
    valid regex: "\.com"
Date & Time Formats
# valid format matches named formats on text columns (not native date/timestamp types)
- invalid_count(created_at) = 0:
    valid format: date iso 8601
Schema
# Verify column names, types, and order
- schema:
    fail:
      when required column missing: [id, created_at]
      when forbidden column present: [password_hash]
      when wrong column type:
        id: int
        revenue: decimal(10, 2)
      when wrong column index:
        id: 0
        email: 1
Column Comparisons & Row-Level Checks
# Write custom rules that apply to each row
- failed rows:
    name: "End date must be after start date"
    fail condition: end_date <= start_date
- failed rows:
    name: "Discount cannot exceed price"
    fail condition: discount_amount > price
Cross-Table & Referential Integrity
# Ensure consistency between two tables
- values in (customer_id) must exist in customers (id)
- row_count same as customers # Compare row counts of two tables
Change-Over-Time
# Note: these checks compare against historic measurements, which requires Soda Cloud.
- change for row_count < 50                # absolute change vs. the previous scan
- change avg last 7 for row_count < 50     # change vs. the average of the last 7 scans
User-Defined SQL Checks
# For complex business logic, write your own SQL query
# 🟩 User-defined metric check: the query must return a single numeric value
# (write the target table name directly in the query)
- max_price < 1000:
    name: "Max price should be under 1000"
    max_price query: |
      SELECT MAX(price) FROM products
# 🟩 Failed rows check: the query must return the rows that violate a rule
- failed rows:
    name: "Shipped orders must have a shipped_date"
    fail query: |
      SELECT * FROM orders
      WHERE status = 'shipped' AND shipped_date IS NULL
Applying Checks to Multiple Columns
# SodaCL has no per-column loop; repeat the check once per column
- missing_count(email) = 0
- missing_count(phone_number) = 0
- missing_count(address) = 0
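What SodaCL does provide is a dataset-level loop, which applies one list of checks to several tables at once:
for each dataset T:
  datasets:
    - customers
    - orders
  checks:
    - row_count > 0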
Grouped Checks
# Run checks on segments of your data (the group by check is experimental)
- group by:
    query: |
      SELECT country_code, COUNT(*) AS customer_count, AVG(order_value) AS avg_order_value
      FROM orders
      GROUP BY country_code
    fields:
      - country_code
    checks:
      - customer_count > 100                 # fails for any country with <= 100 rows
      - avg_order_value between 10 and 500   # checks average order value per country
4. Setup & Execution
Step 1: Create configuration.yml
This file connects Soda to your Databricks SQL Warehouse. Get the Host, HTTP Path, and a Personal Access Token from your Warehouse's Connection Details.
# configuration.yml
# Databricks SQL connections use the Spark data source type
data_source my_databricks_wh:
  type: spark
  method: databricks
  host: "dbc-xxxxxxxx-xxxx.cloud.databricks.com"
  http_path: "/sql/1.0/warehouses/xxxxxxxxxxxxxxxx"
  token: "dapixxxxxxxxxxxxxxxxxxxxxxxx"
  catalog: "main"   # Unity Catalog name, if applicable
  schema: "default"
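To keep the token out of version control, Soda can resolve environment variables in configuration.yml; a minimal sketch, assuming you export DATABRICKS_TOKEN in the shell that runs the scan:
# export DATABRICKS_TOKEN="dapi..."   (run in your shell first)
token: ${DATABRICKS_TOKEN}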
Step 2: Run the Scan (CLI)
Execute your checks from the command line.
# Install the Spark package with the Databricks extra
pip install "soda-core-spark[databricks]"
# Run the scan
soda scan -d my_databricks_wh -c configuration.yml checks.yml
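The command's exit code is non-zero when checks fail, so the same invocation doubles as a CI gate with no output parsing; a minimal shell sketch:
# Abort the pipeline step if the scan reports failures
soda scan -d my_databricks_wh -c configuration.yml checks.yml || exit 1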
5. Programmatic Scans (Databricks Notebook)
Integrate Soda directly into your data pipelines by running it from a notebook.
# In a Databricks notebook cell
# Install the library if it's not already on the cluster
%pip install soda-core-spark[databricks]
from soda.scan import Scan
# 1. Configure the scan
scan = Scan()
scan.set_data_source_name("my_databricks_wh")
scan.add_configuration_yaml_str(
    f"""
data_source my_databricks_wh:
  type: spark
  method: databricks
  host: "..."
  http_path: "..."
  # Pull the token from a secret scope so it never appears in plain text
  token: "{dbutils.secrets.get('soda', 'databricks_token')}"
"""
)
# 2. Add your SodaCL checks file
scan.add_sodacl_yaml_file("checks.yml")
# 3. Execute the scan
scan.execute()
# 4. Check the results and take action
if scan.has_check_fails():
    print("🚨 Soda Scan FAILED with critical issues!")
    scan.assert_no_checks_fail()  # raises an exception, failing the notebook cell
else:
    # Warnings do not fail a scan; see below for how to surface them
    print("✅ All data quality checks passed.")
# You can also get a detailed report as a dictionary
print(scan.get_scan_results())
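get_scan_results() returns a plain dictionary. Below is a minimal sketch for surfacing warnings from it; the "checks" and "outcome" keys reflect the scan-results format as commonly seen, so verify against your Soda Core version:
results = scan.get_scan_results()
for check in results.get("checks", []):
    # outcome is typically "pass", "warn", or "fail"
    if check.get("outcome") == "warn":
        print(f"🟡 Warning: {check.get('name')}")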
6. Complete Example: checks.yml
Here is a sample file with a variety of checks across several related tables to show how they work together.
#=============== CHECKS FOR CUSTOMER DATA ===============
checks for customers:
  # --- Table-level checks ---
  - row_count > 0:
      name: "Table is not empty"
  - freshness(updated_at) < 24h:
      name: "Customer data is fresh"
  - duplicate_count(id) = 0:
      name: "Customer ID is unique"
  # --- Schema validation ---
  - schema:
      name: "Customer schema is correct"
      fail:
        when required column missing: [id, email, country_code, created_at]
        when wrong column type: {id: string, age: integer, created_at: timestamp}
  # --- Column-level data quality checks ---
  - missing_percent(email) < 1%:
      name: "Email address is mostly present"
  - invalid_count(email) = 0:
      name: "Email format is valid"
      valid format: email
  - invalid_count(country_code) = 0:
      name: "Country code is 2 characters"
      valid length: 2
  - failed rows:
      name: "Customers must be 18 or older"
      fail condition: age < 18
#=============== CHECKS FOR PRODUCT CATALOG ===============
checks for products:
  # --- Table-level checks ---
  - duplicate_count(sku) = 0:
      name: "SKU is unique"
  - missing_count(name) = 0:
      name: "All products have a name"
  # --- Column-level data quality checks ---
  - invalid_count(category) = 0:
      name: "Category is from an approved list"
      valid values: ["electronics", "books", "home goods", "apparel"]
  - negative_price_count = 0:
      name: "Price must be positive"
      negative_price_count query: |
        SELECT COUNT(*) FROM products WHERE price < 0
  - anomaly score for avg(price) < default:
      name: "Average product price is not anomalous"
#=============== CHECKS FOR ORDERS FACT TABLE ===============
checks for orders:
  # --- Table-level checks ---
  - row_count > 0
  - freshness(order_date) < 2h
  # --- Referential integrity ---
  - values in (customer_id) must exist in customers (id):
      name: "Order links to a valid customer"
  - values in (product_sku) must exist in products (sku):
      name: "Order links to a valid product"
  # --- Row-level business logic ---
  - failed rows:
      name: "Shipped date cannot be before order date"
      fail condition: shipped_date < order_date
  # --- Missing-value checks on key columns ---
  - missing_count(order_id) = 0
  - missing_count(customer_id) = 0
  - missing_count(product_sku) = 0
  - missing_count(order_value) = 0
  # --- Segmented data quality checks (experimental group by) ---
  - group by:
      query: |
        SELECT status, COUNT(*) AS rows_per_status, AVG(order_value) AS avg_value_per_status
        FROM orders
        GROUP BY status
      fields:
        - status
      checks:
        - rows_per_status > 0
        - avg_value_per_status > 0:
            name: "Average order value is positive for each status"