Supported Datasources
Weiser supports multiple database systems and data platforms for running data quality checks. This page provides detailed configuration and setup instructions for each supported datasource.
Overview
Currently supported datasources:
Datasource | Type | Status | Use Cases |
---|---|---|---|
PostgreSQL | postgresql | ✅ Fully Supported | OLTP, Analytics, Data Warehouses |
MySQL | mysql | ✅ Fully Supported | OLTP, Web Applications |
Snowflake | snowflake | ✅ Fully Supported | Cloud Data Warehouse |
Databricks | databricks | ✅ Fully Supported | Lakehouse, Unified Analytics |
BigQuery | bigquery | ✅ Fully Supported | Cloud Data Warehouse |
Cube.js | cube | ✅ Fully Supported | Semantic Layer, Business Metrics |
Redshift | redshift | 📋 Planned | Cloud Data Warehouse |
Athena | athena | 📋 Planned | Cloud Data Warehouse |
Trino | trino | 📋 Planned | Distributed SQL Query Engine |
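Since datasources is a list, multiple systems can be declared side by side in a single configuration file. A minimal sketch combining two of the connectors documented below (all hostnames and credentials are placeholders):
datasources:
  - name: postgres_prod
    type: postgresql
    host: {{POSTGRES_HOST}}
    port: 5432
    db_name: production
    user: {{POSTGRES_USER}}
    password: {{POSTGRES_PASSWORD}}
  - name: snowflake_prod
    type: snowflake
    account: {{SNOWFLAKE_ACCOUNT}}
    db_name: PRODUCTION_DB
    user: {{SNOWFLAKE_USER}}
    password: {{SNOWFLAKE_PASSWORD}}
    warehouse: COMPUTE_WH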
PostgreSQL
PostgreSQL is the primary supported datasource with full feature compatibility for all check types.
Configuration
Basic Connection
datasources:
- name: postgres_prod
type: postgresql
host: localhost
port: 5432
db_name: production
user: weiser_user
password: secure_password
Connection URI
datasources:
- name: postgres_warehouse
type: postgresql
uri: postgresql://user:password@host:5432/database
Environment Variables (Recommended)
datasources:
- name: postgres_prod
type: postgresql
host: {{POSTGRES_HOST}}
port: {{POSTGRES_PORT}}
db_name: {{POSTGRES_DB}}
user: {{POSTGRES_USER}}
password: {{POSTGRES_PASSWORD}}
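As the heading suggests, the {{...}} placeholders are resolved from environment variables at runtime, so credentials never need to be committed to the config file. Export them in your shell (or inject them via your scheduler's secret store) before running Weiser; a sketch with placeholder values:
# Provide the connection details via the environment
export POSTGRES_HOST=db.internal.example.com
export POSTGRES_PORT=5432
export POSTGRES_DB=production
export POSTGRES_USER=weiser_user
export POSTGRES_PASSWORD='secure_password'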
Connection Parameters
Parameter | Required | Default | Description |
---|---|---|---|
name | Yes | - | Unique datasource identifier |
type | Yes | - | Must be postgresql |
host | Yes* | - | Database server hostname |
port | No | 5432 | Database server port |
db_name | Yes* | - | Database name |
user | Yes* | - | Database username |
password | Yes* | - | Database password |
uri | Yes* | - | Complete connection URI (alternative to individual params) |
*Either individual parameters OR uri is required
Setup Requirements
1. Database User Permissions
-- Create dedicated user for Weiser
CREATE USER weiser_user WITH PASSWORD 'secure_password';
-- Grant read permissions
GRANT CONNECT ON DATABASE your_database TO weiser_user;
GRANT USAGE ON SCHEMA public TO weiser_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO weiser_user;
-- Grant permissions for future tables
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO weiser_user;
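To confirm the grants took effect, the standard information_schema views can be queried; a quick verification sketch:
-- List the table-level privileges granted to weiser_user
SELECT table_schema, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE grantee = 'weiser_user'
ORDER BY table_schema, table_name;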
2. Network Access
Ensure your PostgreSQL server allows connections from Weiser:
# Test connectivity
psql -h your-host -U weiser_user -d your_database -c "SELECT 1;"
3. SSL Configuration (Recommended)
datasources:
- name: postgres_prod
type: postgresql
uri: postgresql://user:password@host:5432/database?sslmode=require
Supported Features
Feature | PostgreSQL Support |
---|---|
Row Count Checks | ✅ Full Support |
Numeric Checks | ✅ Full Support |
Sum/Min/Max Checks | ✅ Full Support |
Not Empty Checks | ✅ Full Support |
Anomaly Detection | ✅ Full Support |
Custom SQL | ✅ Full Support |
Time Dimensions | ✅ Full Support |
Dimensions/Grouping | ✅ Full Support |
Complex Filters | ✅ Full Support |
Window Functions | ✅ Full Support |
Statistical Functions | ✅ Full Support |
PostgreSQL-Specific Examples
Using PostgreSQL Date Functions
- name: recent_data_check
dataset: orders
type: row_count
condition: gt
threshold: 100
filter: created_at >= CURRENT_DATE - INTERVAL '7 days'
Performance Optimization
Indexes
Ensure proper indexes for check performance:
-- Index for time-based checks
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Index for filtered checks
CREATE INDEX idx_orders_status ON orders(status);
-- Composite index for grouped checks
CREATE INDEX idx_sales_region_date ON sales(region, sale_date);
Common Issues & Solutions
Connection Timeout
# Increase the connection timeout for slow or distant servers
datasources:
- name: postgres_prod
type: postgresql
uri: postgresql://user:password@host:5432/database?connect_timeout=30
SSL Certificate Issues
# Disable SSL entirely (not recommended for production)
datasources:
- name: postgres_dev
type: postgresql
uri: postgresql://user:password@host:5432/database?sslmode=disable
Large Result Sets
# Use LIMIT in checks for large tables
- name: sample_data_check
dataset: |
SELECT * FROM large_table
ORDER BY created_at DESC
LIMIT 100000
type: row_count
condition: gt
threshold: 50000
MySQL
MySQL is a popular open-source relational database that supports all Weiser check types. The MySQL connector uses the PyMySQL driver for reliable connectivity and full feature compatibility.
Configuration
Basic Connection
datasources:
- name: mysql_prod
type: mysql
host: localhost
port: 3306
db_name: production
user: weiser_user
password: secure_password
Connection URI
datasources:
- name: mysql_warehouse
type: mysql
uri: mysql+pymysql://user:password@host:3306/database
Environment Variables (Recommended)
datasources:
- name: mysql_prod
type: mysql
host: {{MYSQL_HOST}}
port: {{MYSQL_PORT}}
db_name: {{MYSQL_DB}}
user: {{MYSQL_USER}}
password: {{MYSQL_PASSWORD}}
Connection Parameters
Parameter | Required | Default | Description |
---|---|---|---|
name | Yes | - | Unique datasource identifier |
type | Yes | - | Must be mysql |
host | Yes* | - | Database server hostname |
port | No | 3306 | Database server port |
db_name | Yes* | - | Database name |
user | Yes* | - | Database username |
password | No | - | Database password |
uri | Yes* | - | Complete connection URI (alternative to individual params) |
*Either individual parameters OR uri is required
Setup Requirements
1. Database User Permissions
-- Create dedicated user for Weiser
CREATE USER 'weiser_user'@'%' IDENTIFIED BY 'secure_password';
-- Grant read permissions
GRANT SELECT ON production.* TO 'weiser_user'@'%';
-- Grant connection permissions
GRANT USAGE ON *.* TO 'weiser_user'@'%';
-- Apply changes
FLUSH PRIVILEGES;
2. Network Access
Ensure your MySQL server allows connections from Weiser:
# Test connectivity
mysql -h your-host -u weiser_user -p production -e "SELECT 1;"
3. SSL Configuration (Recommended)
datasources:
- name: mysql_prod
type: mysql
uri: mysql+pymysql://user:password@host:3306/database?ssl_ca=/path/to/ca.pem&ssl_verify_identity=true
Supported Features
Feature | MySQL Support |
---|---|
Row Count Checks | ✅ Full Support |
Numeric Checks | ✅ Full Support |
Sum/Min/Max Checks | ✅ Full Support |
Not Empty Checks | ✅ Full Support |
Anomaly Detection | ✅ Full Support |
Custom SQL | ✅ Full Support |
Time Dimensions | ✅ Full Support |
Dimensions/Grouping | ✅ Full Support |
Complex Filters | ✅ Full Support |
Window Functions | ✅ Full Support |
Statistical Functions | ✅ Full Support |
MySQL-Specific Examples
Using MySQL Date Functions
- name: recent_data_check
dataset: orders
type: row_count
condition: gt
threshold: 100
filter: created_at >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
Using MySQL String Functions
- name: email_format_check
dataset: customers
type: row_count
condition: eq
threshold: 0
filter: email NOT REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
JSON Data Validation
- name: json_data_check
dataset: user_preferences
type: row_count
condition: gt
threshold: 1000
filter: JSON_VALID(preferences_json) = 1
Performance Optimization
Indexes
Ensure proper indexes for check performance:
-- Index for time-based checks
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Index for filtered checks
CREATE INDEX idx_orders_status ON orders(status);
-- Composite index for grouped checks
CREATE INDEX idx_sales_region_date ON sales(region, sale_date);
Query Optimization
-- Use appropriate storage engines
ALTER TABLE large_table ENGINE=InnoDB;
-- Enable query cache for repeated checks (MySQL 5.7 and earlier only; the query cache was removed in MySQL 8.0)
SET GLOBAL query_cache_type = ON;
SET GLOBAL query_cache_size = 268435456; -- 256MB
Common Issues & Solutions
Connection Timeout
# Increase timeout for large queries
datasources:
- name: mysql_prod
type: mysql
uri: mysql+pymysql://user:password@host:3306/database?connect_timeout=30&read_timeout=60
SSL Certificate Issues
# Disable SSL (not recommended for production)
datasources:
- name: mysql_dev
type: mysql
uri: mysql+pymysql://user:password@host:3306/database?ssl_disabled=true
Character Set Issues
# Specify character set
datasources:
- name: mysql_prod
type: mysql
uri: mysql+pymysql://user:password@host:3306/database?charset=utf8mb4
Large Result Sets
# Use LIMIT in checks for large tables
- name: sample_data_check
dataset: |
SELECT * FROM large_table
ORDER BY created_at DESC
LIMIT 100000
type: row_count
condition: gt
threshold: 50000
MySQL Version Compatibility
Weiser supports the following MySQL-compatible versions:
- MySQL 5.7+: Full support for all features
- MySQL 8.0+: Enhanced support with JSON functions and window functions (see the window-function sketch below)
- MariaDB 10.3+: Compatible with MySQL features
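On MySQL 8.0+ (or MariaDB 10.3+), window functions can be used inside a custom SQL dataset. A sketch of a duplicate-detection check, assuming an orders table with an order_id column:
- name: duplicate_orders_check
  dataset: |
    SELECT order_id
    FROM (
      SELECT order_id,
             ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY created_at) AS rn
      FROM orders
    ) ranked
    WHERE rn > 1
  type: row_count
  condition: eq
  threshold: 0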
Cloud MySQL Services
Amazon RDS for MySQL
datasources:
- name: mysql_rds
type: mysql
host: myinstance.123456789012.us-east-1.rds.amazonaws.com
port: 3306
db_name: production
user: {{RDS_USERNAME}}
password: {{RDS_PASSWORD}}
Google Cloud SQL for MySQL
datasources:
- name: mysql_cloudsql
type: mysql
host: 10.1.2.3 # Private IP or public IP
port: 3306
db_name: production
user: {{CLOUDSQL_USERNAME}}
password: {{CLOUDSQL_PASSWORD}}
Azure Database for MySQL
datasources:
- name: mysql_azure
type: mysql
host: myserver.mysql.database.azure.com
port: 3306
db_name: production
user: {{AZURE_MYSQL_USERNAME}}
password: {{AZURE_MYSQL_PASSWORD}}
Snowflake
Snowflake is a fully-managed cloud data warehouse that supports all Weiser check types. The Snowflake connector provides native support for Snowflake-specific features like warehouses, roles, and schemas.
Configuration
Basic Connection
datasources:
- name: snowflake_prod
type: snowflake
account: your-account-id
db_name: PRODUCTION_DB
user: WEISER_USER
password: secure_password
warehouse: COMPUTE_WH
role: ANALYST_ROLE
schema_name: PUBLIC
Connection URI
datasources:
- name: snowflake_warehouse
type: snowflake
uri: snowflake://user:password@account.snowflakecomputing.com/database?warehouse=WH&role=ROLE&schema=SCHEMA
Environment Variables (Recommended)
datasources:
- name: snowflake_prod
type: snowflake
account: {{SNOWFLAKE_ACCOUNT}}
db_name: {{SNOWFLAKE_DATABASE}}
user: {{SNOWFLAKE_USER}}
password: {{SNOWFLAKE_PASSWORD}}
warehouse: {{SNOWFLAKE_WAREHOUSE}}
role: {{SNOWFLAKE_ROLE}}
schema_name: {{SNOWFLAKE_SCHEMA}}
Connection Parameters
Parameter | Required | Default | Description |
---|---|---|---|
name | Yes | - | Unique datasource identifier |
type | Yes | - | Must be snowflake |
host | Yes* | - | Snowflake account URL (e.g., account.snowflakecomputing.com) |
db_name | Yes* | - | Database name |
user | Yes* | - | Snowflake username |
password | Yes* | - | Snowflake password |
warehouse | No | - | Warehouse to use for compute |
role | No | - | Role to assume for permissions |
schema_name | No | - | Default schema to use |
uri | Yes* | - | Complete connection URI (alternative to individual params) |
*Either individual parameters OR uri is required
Setup Requirements
1. Snowflake User and Role Setup
-- Create role for Weiser
CREATE ROLE weiser_role;
-- Create user for Weiser
CREATE USER weiser_user
PASSWORD = 'secure_password'
DEFAULT_ROLE = weiser_role
DEFAULT_WAREHOUSE = COMPUTE_WH
DEFAULT_NAMESPACE = 'PRODUCTION_DB.PUBLIC';
-- Grant role to user
GRANT ROLE weiser_role TO USER weiser_user;
-- Grant database and schema permissions
GRANT USAGE ON DATABASE PRODUCTION_DB TO ROLE weiser_role;
GRANT USAGE ON SCHEMA PRODUCTION_DB.PUBLIC TO ROLE weiser_role;
GRANT SELECT ON ALL TABLES IN SCHEMA PRODUCTION_DB.PUBLIC TO ROLE weiser_role;
-- Grant warehouse usage
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE weiser_role;
-- Grant permissions for future tables
GRANT SELECT ON FUTURE TABLES IN SCHEMA PRODUCTION_DB.PUBLIC TO ROLE weiser_role;
2. Network Access
Snowflake is cloud-hosted and accessible via HTTPS. Ensure your network allows outbound HTTPS connections.
3. Account Identifier
Find your Snowflake account identifier:
-- Run this in Snowflake to get your account identifier
SELECT CURRENT_ACCOUNT();
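With the user, role, and warehouse in place, connectivity can be verified outside Weiser using the SnowSQL CLI (if installed); the flags mirror the connection parameters above:
# Test connectivity (account identifier, user, database, role, and warehouse are placeholders)
snowsql -a your-account-id -u WEISER_USER -d PRODUCTION_DB -r ANALYST_ROLE -w COMPUTE_WH -q "SELECT 1;"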
Supported Features
Feature | Snowflake Support |
---|---|
Row Count Checks | ✅ Full Support |
Numeric Checks | ✅ Full Support |
Sum/Min/Max Checks | ✅ Full Support |
Not Empty Checks | ✅ Full Support |
Anomaly Detection | ✅ Full Support |
Custom SQL | ✅ Full Support |
Time Dimensions | ✅ Full Support |
Dimensions/Grouping | ✅ Full Support |
Complex Filters | ✅ Full Support |
Window Functions | ✅ Full Support |
Statistical Functions | ✅ Full Support |
Snowflake-Specific Examples
Using Snowflake Date Functions
- name: recent_data_check
dataset: orders
type: row_count
condition: gt
threshold: 100
filter: created_at >= DATEADD(day, -7, CURRENT_DATE())
Time Travel Queries
- name: historical_comparison
dataset: |
SELECT COUNT(*) as historical_count
FROM orders AT(OFFSET => -60*60*24)  -- Time Travel: state of the table 24 hours ago
type: numeric
measure: historical_count
condition: gt
threshold: 1000
Using Snowflake Warehouse Scaling
# Use larger warehouse for intensive checks
datasources:
- name: snowflake_large
type: snowflake
host: account.snowflakecomputing.com
db_name: ANALYTICS_DB
user: weiser_user
password: secure_password
warehouse: LARGE_WH # Use larger warehouse for heavy workloads
role: ANALYST_ROLE
Performance Optimization
Warehouse Sizing
Choose appropriate warehouse size based on check complexity:
- XS/S: Simple row counts and basic aggregations
- M/L: Complex joins and window functions
- XL/2XL: Large-scale statistical analysis
Query Optimization
-- Use clustering keys for large tables
ALTER TABLE orders CLUSTER BY (created_at);
-- Use materialized views for complex aggregations
CREATE MATERIALIZED VIEW daily_order_metrics AS
SELECT
DATE_TRUNC('day', created_at) as order_date,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM orders
GROUP BY 1;
Common Issues & Solutions
Authentication Errors
# Ensure correct role and warehouse
datasources:
- name: snowflake_prod
type: snowflake
host: account.snowflakecomputing.com
db_name: PROD_DB
user: WEISER_USER
password: secure_password
warehouse: COMPUTE_WH # Must have USAGE privilege
role: ANALYST_ROLE # Must be granted to user
Warehouse Suspension
# Auto-resume warehouses (default behavior)
# Warehouses automatically resume when queries are executed
# Consider using auto-suspend settings in Snowflake
Large Result Sets
# Use Snowflake's SAMPLE clause to limit rows read from large tables
- name: sample_data_check
dataset: |
SELECT * FROM large_table
SAMPLE (1000 ROWS) -- Snowflake sampling
type: row_count
condition: eq
threshold: 1000
Databricks
Databricks is a unified data analytics platform that supports all Weiser check types. The Databricks connector provides native support for both SQL warehouses and compute clusters with Unity Catalog integration.
Configuration
Basic Connection (SQL Warehouse)
datasources:
- name: databricks_prod
type: databricks
host: workspace-123.cloud.databricks.com
access_token: dapi123456789abcdef
http_path: /sql/1.0/warehouses/abc123def456
catalog: main
schema_name: default
Cluster Connection
datasources:
- name: databricks_cluster
type: databricks
host: workspace-456.cloud.databricks.com
access_token: dapi987654321fedcba
http_path: /sql/protocolv1/o/123456789/clusters/1234-567890-abc123
catalog: hive_metastore
schema_name: public
Environment Variables (Recommended)
datasources:
- name: databricks_prod
type: databricks
host: {{DATABRICKS_HOST}}
access_token: {{DATABRICKS_ACCESS_TOKEN}}
http_path: {{DATABRICKS_HTTP_PATH}}
catalog: {{DATABRICKS_CATALOG}}
schema_name: {{DATABRICKS_SCHEMA}}
Connection Parameters
Parameter | Required | Default | Description |
---|---|---|---|
name | Yes | - | Unique datasource identifier |
type | Yes | - | Must be databricks |
host | Yes* | - | Databricks workspace hostname (e.g., workspace-123.cloud.databricks.com) |
access_token | Yes* | - | Databricks personal access token |
http_path | Yes* | - | HTTP path to SQL warehouse or cluster endpoint |
catalog | No | - | Unity Catalog name (e.g., main, hive_metastore) |
schema_name | No | - | Default schema to use |
uri | Yes* | - | Complete connection URI (alternative to individual params) |
*Either individual parameters OR uri is required
Setup Requirements
1. Databricks Access Token
Create a personal access token in your Databricks workspace:
- Go to User Settings > Developer > Access tokens
- Generate new token with appropriate permissions
- Copy the token value (starts with dapi)
2. SQL Warehouse or Cluster Setup
For SQL Warehouses (Recommended):
-- Get HTTP path from SQL Warehouses page
-- Format: /sql/1.0/warehouses/{warehouse-id}
For Compute Clusters:
-- Get HTTP path from Compute page
-- Format: /sql/protocolv1/o/{org-id}/clusters/{cluster-id}
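Before wiring the datasource into Weiser, the hostname and access token can be sanity-checked against the workspace REST API (this lists SQL warehouses; it does not exercise the http_path itself). A sketch, assuming the token is exported as DATABRICKS_ACCESS_TOKEN:
# Should return a JSON list of SQL warehouses if the host and token are valid
curl -s -H "Authorization: Bearer $DATABRICKS_ACCESS_TOKEN" \
  https://workspace-123.cloud.databricks.com/api/2.0/sql/warehouses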
3. Unity Catalog Setup (Optional)
-- Create catalog for data quality
CREATE CATALOG data_quality;
-- Grant permissions
GRANT USE CATALOG ON CATALOG data_quality TO `weiser@company.com`;
GRANT USE SCHEMA ON SCHEMA data_quality.default TO `weiser@company.com`;
GRANT SELECT ON SCHEMA data_quality.default TO `weiser@company.com`;
Supported Features
Feature | Databricks Support |
---|---|
Row Count Checks | ✅ Full Support |
Numeric Checks | ✅ Full Support |
Sum/Min/Max Checks | ✅ Full Support |
Not Empty Checks | ✅ Full Support |
Anomaly Detection | ✅ Full Support |
Custom SQL | ✅ Full Support |
Time Dimensions | ✅ Full Support |
Dimensions/Grouping | ✅ Full Support |
Complex Filters | ✅ Full Support |
Window Functions | ✅ Full Support |
Statistical Functions | ✅ Full Support |
Unity Catalog | ✅ Full Support |
Databricks-Specific Examples
Using Unity Catalog
- name: unity_catalog_check
dataset: main.sales.orders
type: row_count
condition: gt
threshold: 1000
filter: order_date >= current_date() - interval 7 days
Delta Lake Time Travel
- name: historical_comparison
dataset: |
SELECT COUNT(*) as current_count
FROM delta.`/mnt/delta/orders`
VERSION AS OF 123
type: numeric
measure: current_count
condition: gt
threshold: 5000
Using Databricks SQL Functions
- name: data_freshness_check
dataset: events
type: numeric
measure: datediff(current_timestamp(), max(event_timestamp))
condition: lt
threshold: 24
filter: event_date = current_date()
Performance Optimization
Warehouse Sizing
Choose appropriate warehouse size based on check complexity:
- 2X-Small/X-Small: Simple row counts and basic aggregations
- Small/Medium: Complex joins and window functions
- Large/X-Large: Large-scale statistical analysis and heavy workloads (see the datasource sketch below)
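Because the warehouse is selected by the http_path rather than a separate parameter, routing heavy checks to a larger SQL warehouse means defining an additional datasource that points at that warehouse's endpoint. A sketch with a placeholder warehouse ID:
datasources:
  - name: databricks_heavy
    type: databricks
    host: workspace-123.cloud.databricks.com
    access_token: {{DATABRICKS_ACCESS_TOKEN}}
    http_path: /sql/1.0/warehouses/large-warehouse-id  # endpoint of the larger SQL warehouse
    catalog: main
    schema_name: default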
Query Optimization
-- Use Delta Lake optimization features
OPTIMIZE delta.`/mnt/delta/orders` ZORDER BY (order_date);
-- Create materialized views for complex aggregations
CREATE MATERIALIZED VIEW daily_order_metrics AS
SELECT
date_trunc('day', order_date) as order_date,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM orders
GROUP BY 1;
Common Issues & Solutions
Authentication Errors
# Ensure access token has proper permissions
datasources:
- name: databricks_prod
type: databricks
host: workspace-123.cloud.databricks.com
access_token: dapi123456789abcdef # Must start with 'dapi'
http_path: /sql/1.0/warehouses/abc123def456
Warehouse Auto-Suspend
# Warehouses auto-suspend when idle
# They automatically resume when queries are executed
# Consider warehouse settings in Databricks UI for cost optimization
Unity Catalog Access
# Specify full three-part names for Unity Catalog tables
- name: unity_catalog_check
dataset: catalog_name.schema_name.table_name
type: row_count
condition: gt
threshold: 100
Large Result Sets
# Use TABLESAMPLE to limit rows read from large tables
- name: sample_data_check
dataset: |
SELECT * FROM large_table
TABLESAMPLE (1000 ROWS) -- Databricks sampling
type: row_count
condition: eq
threshold: 1000
BigQuery
BigQuery is Google Cloud's fully-managed data warehouse that supports all Weiser check types. The BigQuery connector provides native support for Google Cloud authentication, regional datasets, and BigQuery-specific SQL features.
Configuration
Basic Connection
datasources:
- name: bigquery_prod
type: bigquery
project_id: my-gcp-project
dataset_id: production_data
credentials_path: /path/to/service-account.json
Connection with Location
datasources:
- name: bigquery_eu
type: bigquery
project_id: my-gcp-project
dataset_id: eu_dataset
location: europe-west1
credentials_path: /path/to/service-account.json
Connection URI
datasources:
- name: bigquery_warehouse
type: bigquery
uri: bigquery://my-project/my-dataset?credentials_path=/path/to/creds.json&location=us-central1
Environment Variables (Recommended)
datasources:
- name: bigquery_prod
type: bigquery
project_id: {{GCP_PROJECT_ID}}
dataset_id: {{BIGQUERY_DATASET}}
credentials_path: {{GOOGLE_APPLICATION_CREDENTIALS}}
location: {{BIGQUERY_LOCATION}}
Connection Parameters
Parameter | Required | Default | Description |
---|---|---|---|
name | Yes | - | Unique datasource identifier |
type | Yes | - | Must be bigquery |
project_id | Yes* | - | Google Cloud project ID |
dataset_id | No | - | Default dataset ID to use |
db_name | No | - | Alternative to dataset_id (fallback) |
credentials_path | No | - | Path to service account JSON file |
location | No | - | Dataset location (e.g., us-central1, europe-west1) |
uri | Yes* | - | Complete connection URI (alternative to individual params) |
*Either individual parameters OR uri is required
Setup Requirements
1. Google Cloud Project Setup
- Create or select a Google Cloud project
- Enable the BigQuery API
- Create a service account with BigQuery permissions (see the gcloud sketch below)
- Download the service account JSON key file
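These steps can also be done from the gcloud CLI; a sketch using the project and service-account names from the examples on this page:
# Enable the BigQuery API
gcloud services enable bigquery.googleapis.com --project=my-gcp-project
# Create the service account used by Weiser
gcloud iam service-accounts create weiser --project=my-gcp-project
# Download a JSON key for it
gcloud iam service-accounts keys create service-account.json \
  --iam-account=weiser@my-gcp-project.iam.gserviceaccount.com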
2. Service Account Permissions
# Grant BigQuery Data Viewer role
gcloud projects add-iam-policy-binding my-gcp-project \
--member="serviceAccount:weiser@my-gcp-project.iam.gserviceaccount.com" \
--role="roles/bigquery.dataViewer"
# Grant BigQuery Job User role (for running queries)
gcloud projects add-iam-policy-binding my-gcp-project \
--member="serviceAccount:weiser@my-gcp-project.iam.gserviceaccount.com" \
--role="roles/bigquery.jobUser"
3. Authentication Methods
Option A: Service Account Key File
datasources:
- name: bigquery_prod
type: bigquery
project_id: my-gcp-project
credentials_path: /path/to/service-account.json
Option B: Environment Variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
Option C: Application Default Credentials (GCP environments)
# No credentials_path needed when running on GCP
datasources:
- name: bigquery_prod
type: bigquery
project_id: my-gcp-project
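Whichever authentication option is used, a quick end-to-end test with the bq CLI confirms that the credentials can run queries in the project:
# Runs a trivial query; fails fast if authentication or permissions are missing
bq query --project_id=my-gcp-project --use_legacy_sql=false 'SELECT 1 AS ok'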
Supported Features
Feature | BigQuery Support |
---|---|
Row Count Checks | ✅ Full Support |
Numeric Checks | ✅ Full Support |
Sum/Min/Max Checks | ✅ Full Support |
Not Empty Checks | ✅ Full Support |
Anomaly Detection | ✅ Full Support |
Custom SQL | ✅ Full Support |
Time Dimensions | ✅ Full Support |
Dimensions/Grouping | ✅ Full Support |
Complex Filters | ✅ Full Support |
Window Functions | ✅ Full Support |
Statistical Functions | ✅ Full Support |
Standard SQL | ✅ Full Support |
BigQuery-Specific Examples
Using BigQuery Date Functions
- name: recent_data_check
dataset: orders
type: row_count
condition: gt
threshold: 100
filter: created_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
Cross-Dataset Queries
- name: cross_dataset_check
dataset: "`my-project.dataset1.orders` o JOIN `my-project.dataset2.customers` c ON o.customer_id = c.id"
type: row_count
condition: gt
threshold: 1000
Using BigQuery Array Functions
- name: array_data_check
dataset: events
type: numeric
measure: ARRAY_LENGTH(event_tags)
condition: between
threshold: [1, 10]
filter: event_date = CURRENT_DATE()
Partitioned Table Queries
- name: partition_efficiency_check
dataset: "`my-project.analytics.events`"
type: row_count
condition: gt
threshold: 10000
filter: _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
Performance Optimization
Query Optimization
-- Use clustering and partitioning for large tables
CREATE TABLE `my-project.analytics.events`
(
event_timestamp TIMESTAMP,
user_id STRING,
event_name STRING,
event_data JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_name;
Cost Management
# Use specific datasets to limit query scope
- name: cost_efficient_check
dataset: "`my-project.production.orders`" # Specific table reference
type: row_count
condition: gt
threshold: 1000
filter: DATE(created_at) = CURRENT_DATE() # Partition pruning
Regional Optimization
# Use location parameter for regional datasets
datasources:
- name: bigquery_eu
type: bigquery
project_id: my-gcp-project
dataset_id: eu_dataset
location: europe-west1 # Reduces latency for EU data
Common Issues & Solutions
Authentication Errors
# Ensure service account has proper permissions
datasources:
- name: bigquery_prod
type: bigquery
project_id: my-gcp-project
credentials_path: /path/to/service-account.json # Must have BigQuery permissions
Dataset Access Issues
# Use fully qualified table names for cross-project access
- name: cross_project_check
dataset: "`other-project.public_data.table_name`"
type: row_count
condition: gt
threshold: 100
Query Quotas and Limits
# Use TABLESAMPLE to sample large tables without a full scan
- name: sample_data_check
dataset: |
SELECT * FROM `my-project.large_dataset.big_table`
TABLESAMPLE SYSTEM (1 PERCENT) -- BigQuery sampling
type: row_count
condition: gt
threshold: 1000
Slot Availability
# Consider query priority and complexity
# BigQuery automatically manages slot allocation
# For consistent performance, consider reservations for production workloads
Cost Considerations
Query Costs
- BigQuery charges based on data processed
- Use partition pruning and clustering for cost efficiency
- Consider query caching for repeated checks
Best Practices
# Efficient date filtering
- name: cost_efficient_check
dataset: orders
type: row_count
condition: gt
threshold: 100
filter: DATE(created_at) = CURRENT_DATE() # Uses partition pruning
# Avoid SELECT * on large tables
- name: specific_column_check
dataset: |
SELECT order_id, status FROM orders
WHERE DATE(created_at) = CURRENT_DATE()
type: row_count
condition: gt
threshold: 50
Cube
Cube is a semantic layer that lets you define business metrics and dimensions, and Weiser integrates with it to run data quality checks against those metrics. Cube's SQL API is PostgreSQL-compatible, so a Cube datasource uses the same connection fields as PostgreSQL. Visit the Cube documentation for more details on setting up Cube and its SQL API.
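Because the configuration mirrors PostgreSQL, a Cube datasource only needs the host and port of Cube's SQL API plus the credentials configured on the Cube side. A minimal sketch, assuming the SQL API is exposed on its commonly used port 15432 (the actual port, database name, and credentials depend on your Cube deployment):
datasources:
  - name: cube_metrics
    type: cube
    host: cube.internal.example.com
    port: 15432            # Cube SQL API port (deployment-specific)
    db_name: cube          # database name expected by your Cube deployment
    user: {{CUBE_SQL_USER}}
    password: {{CUBE_SQL_PASSWORD}}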
Contributing
Help us expand datasource support! We welcome contributions for:
- New database connectors
- Performance optimizations
- Documentation improvements
- Testing and validation
See our GitHub repository for contribution guidelines.