Skip to main content

Configuration

Weiser uses YAML configuration files to define data quality checks, datasources, and connections. This page provides a comprehensive guide to configuring your Weiser setup.

Configuration Structure

version: 1
datasources:
- name: default
type: postgresql
host: localhost
port: 5432
db_name: mydb
user: postgres
password: secret

connections:
- name: metricstore
type: metricstore
db_type: duckdb
db_name: metricstore.db

checks:
- name: orders_row_count
dataset: orders
type: row_count
condition: gt
threshold: 0

includes:
- path/to/additional/config.yaml

Your configuration can use environment variables for sensitive information, weiser will read your .env file and replace the variables in the configuration. Environment variables already defined in your system can also be used. They are available in the configuration as {{VARIABLE_NAME}}.

Datasources

Configure connections to your data sources:

PostgreSQL

datasources:
- name: postgres_prod
type: postgresql
host: prod-db.company.com
port: 5432
db_name: production
user: weiser_user
password: mysecretpassword

MySQL

datasources:
- name: mysql_analytics
type: mysql
host: mysql.example.com
port: 3306
db_name: analytics
user: analytics_user
password: {{MYSQL_PASSWORD}}

Cube.js

datasources:
- name: cube_api
type: cube
uri: http://localhost:4000/cubejs-api/v1

Connection URI

datasources:
- name: warehouse
type: postgresql
uri: postgresql://user:pass@host:5432/database

Metric Store Connections

Configure where check results are stored:

DuckDB (Default)

connections:
- name: metricstore
type: metricstore
db_type: duckdb
db_name: metricstore.db

PostgreSQL Metric Store

connections:
- name: metricstore
type: metricstore
db_type: postgresql
host: metrics-db.company.com
port: 5432
db_name: metrics
user: metrics_user
password: {{METRICS_PASSWORD}}

S3 Storage (DuckDB with S3)

connections:
- name: metricstore
type: metricstore
db_type: duckdb
s3_access_key: {{S3_ACCESS_KEY}}
s3_secret_access_key: {{S3_SECRET_KEY}}
s3_endpoint: s3.amazonaws.com
s3_bucket: weiser-metrics
s3_region: us-east-1

Check Configuration

Basic Check Structure

checks:
- name: unique_check_name
dataset: table_name_or_sql
type: check_type
condition: comparison_operator
threshold: value_or_range
# Optional parameters
measure: column_or_expression
dimensions: [column1, column2]
filter: where_clause
time_dimension:
name: date_column
granularity: day

Required Parameters

ParameterDescriptionExample
nameUnique identifierorders_row_count
datasetTable or SQL queryorders
typeCheck typerow_count
conditionComparison operatorgt
thresholdExpected value1000

Optional Parameters

ParameterDescriptionExample
datasourceDatasource namepostgres_prod
measureColumn/expressionorder_amount
dimensionsGroup by columns[region, status]
filterWHERE clausestatus = 'active'
descriptionCheck descriptionValidates order count
failAllow check failuresfalse

Conditions

ConditionDescriptionExample
gtGreater thanactual > threshold
geGreater than or equalactual >= threshold
ltLess thanactual < threshold
leLess than or equalactual <= threshold
eqEqual toactual == threshold
neqNot equal toactual != threshold
betweenBetween rangethreshold[0] <= actual <= threshold[1]

Threshold Types

Single Value

threshold: 1000

Range (for between condition)

condition: between
threshold: [100, 500]

Decimal Values

threshold: 0.05 # 5% for percentage checks

Dimensions

Group checks by specific columns:

# Single dimension
dimensions: [region]

# Multiple dimensions
dimensions: [region, product_category, quarter]

# No dimensions (aggregate entire dataset)
dimensions: []

Time Dimensions

Add time-based grouping:

time_dimension:
name: created_at
granularity:
day # millennium, century, decade, year, quarter,
# month, week, day, hour, minute, second

Filters

Add WHERE clause conditions:

# Simple filter
filter:
- status = 'active'

# Complex filter
filter:
- status IN ('active', 'pending') AND created_at >= '2024-01-01'

# Multiple filters, will be combined with AND
filter:
- status = 'active'
- amount > 0

Dataset Options

Table Name

dataset: orders

Multiple Tables

dataset: [orders, returns, refunds]

Custom SQL

dataset: >
SELECT o.*, c.customer_tier
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'

Environment Variables

Use environment variables for sensitive data:

datasources:
- name: prod
type: postgresql
host: {{DB_HOST}}
user: {{DB_USER}}
password: {{DB_PASSWORD}}
db_name: {{DB_NAME}}

Set environment variables:

export DB_HOST=prod-db.company.com
export DB_USER=weiser_user
export DB_PASSWORD=secret123
export DB_NAME=production

Configuration Includes

Split configuration across multiple files:

# main.yaml
version: 1
includes:
- datasources.yaml
- checks/row_counts.yaml
- checks/revenue_checks.yaml
# datasources.yaml
datasources:
- name: postgres_prod
type: postgresql
# ... connection details

Templating

Use template variables for reusable configurations:

# Template variables
variables:
min_row_count: 1000
revenue_threshold: 50000

checks:
- name: orders_count
dataset: orders
type: row_count
condition: gt
threshold: 0

Slack Integration

Configure Slack notifications:

slack_url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX

Validation

Weiser validates your configuration on startup. Common validation errors:

Missing Required Fields

Error: Check 'my_check' missing required field 'condition'

Invalid Check Type

Error: Unknown check type 'invalid_type'

Invalid Condition

Error: Invalid condition 'invalid_condition'

Missing Threshold for Between

Error: 'between' condition requires array threshold [min, max]

Best Practices

  1. Environment Variables: Use for sensitive data
  2. Descriptive Names: Use clear, descriptive check names
  3. Documentation: Add descriptions to complex checks
  4. Modular Files: Split large configurations into multiple files
  5. Version Control: Store configurations in version control
  6. Testing: Test configurations in non-production environments
  7. Monitoring: Set up alerts for check failures

Example Complete Configuration

version: 1

datasources:
- name: warehouse
type: postgresql
host: localhost
port: 5432
db_name: analytics
user: MyUser
password: MyPassword

connections:
- name: metricstore
type: metricstore
db_type: duckdb
db_name: weiser_metrics.db

checks:
# Row count checks
- name: orders_daily_count
dataset: orders
type: row_count
condition: gt
threshold: 100
time_dimension:
name: created_at
granularity: day

# Revenue validation
- name: daily_revenue
dataset: orders
type: sum
measure: order_amount
condition: ge
threshold: 10000
filter: status = 'completed'
time_dimension:
name: created_at
granularity: day

# Data completeness
- name: customer_data_completeness
dataset: customers
type: not_empty_pct
dimensions: [email, phone]
condition: le
# Max 5% NULL values
threshold: 0.05