Standardizing Security Data with Data Contracts
TL;DR: Security tools produce implicit schemas that break silently downstream. Data contracts give your security data a formal definition — and let you export that directly to dbt, Iceberg, and more from a single YAML file. And with AI agents entering security pipelines, that formal definition is becoming load-bearing infrastructure.
The recurring data schema problem
Security tools are great at producing data. They are terrible at agreeing on what that data looks like.
Every SIEM integration, every data lake pipeline, every threat intelligence feed I have worked with shares the same underlying problem: the producer has an implicit schema that lives in someone’s head or in a README, and consumers build silent assumptions on top of it. A field gets renamed, a new module adds an optional key, a null shows up where a string was expected — and three dashboards break quietly before anyone notices. At scale across a large security program, this is not an edge case. It is the default state.
It gets more pressing as AI enters the picture. Security teams are increasingly building agentic workflows that query and reason over this data at runtime — enrichment pipelines, automated triage, threat correlation. An AI agent querying malware analysis output needs to know what yara_matches means, when packer is null, and what distinguishes an email record from a PE record. Without a formal definition, you are prompt-engineering around an undocumented schema and hoping it holds.
The engineering world solved this for APIs with OpenAPI specs, and for event streams with Avro and Schema Registry. For security data pipelines, the answer is starting to crystallise around data contracts.
What is a data contract
A data contract is a formal, version-controlled agreement between a data producer and its consumers. It specifies the schema, field semantics, quality rules, ownership, SLAs, and target systems — in a single YAML file that both humans and tooling can read.
The Open Data Contract Standard (ODCS) is the open-source spec, backed by the Linux Foundation’s Bitol project. The datacontract-cli is the toolchain that lints, tests, and exports them. Think of the contract as the interface definition between your security tool and everything downstream that depends on its output.
MalZoo as the example
MalZoo is a mass static malware analysis tool I built years ago. It analyses PE files, Office documents, emails, and ZIP archives — and writes results as JSON to MongoDB, Elasticsearch, or Splunk. The output schema has always been implicit: the wiki describes it, the code enforces it, and any consumer just has to trust that those two things agree.
I recently pushed a data contract for it to the repository. Here are a few excerpts that show what a contract actually captures.
The discriminator pattern. MalZoo writes five different record types into one collection, distinguished by filetype. The contract makes this explicit and machine-readable:
- name: filetype
  logicalType: string
  required: true
  description: >
    MIME type or magic-derived file type string. Acts as discriminator
    for the record schema. Known values: PE, Office, ZIP, Email, Other.
  examples:
    - "PE32 executable (GUI) Intel 80386"
    - "Zip archive data"
    - "MIME entity"
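To make the discriminator idea concrete, here is a minimal sketch of how a consumer might branch on filetype. The prefix table and the record_kind function are my own illustration, not MalZoo code. The prefixes are taken from the three example strings in the excerpt. Office is left out because the excerpt gives no example string for it, so in this sketch such records would fall through to Other.

```python
# Hypothetical mapping from magic-string prefixes to record kinds.
# Prefixes are taken from the contract's example values; Office is omitted
# because the excerpt shows no example string for it (an assumption gap).
KIND_BY_PREFIX = {
    "PE32": "PE",
    "Zip archive": "ZIP",
    "MIME entity": "Email",
}


def record_kind(record: dict) -> str:
    """Return the record kind implied by the `filetype` discriminator."""
    filetype = record.get("filetype")
    if not isinstance(filetype, str) or not filetype:
        # The contract marks filetype as required, so absence is an error.
        raise ValueError("record is missing the required 'filetype' field")
    for prefix, kind in KIND_BY_PREFIX.items():
        if filetype.startswith(prefix):
            return kind
    # Anything unrecognised is treated as the catch-all kind.
    return "Other"
```

The point of the sketch is that the branching logic is driven by values the contract documents, rather than by whatever the consumer happened to observe in production data.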
Quality rules. Not just field definitions, but assertions that tooling can execute against real data:
quality:
  - type: not_null
    property: md5
  - type: regex
    property: md5
    pattern: "^[a-f0-9]{32}$"
  - type: custom
    engine: sql
    implementation: "filesize > 0"
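For readers who want to see what those three assertions mean in practice, here is a rough Python equivalent applied to a single record. This is not how datacontract-cli evaluates rules. It is a hand-written sketch, and check_quality is a hypothetical helper. It also assumes that a missing or non-integer filesize counts as a failure, which the SQL rule does not state explicitly.

```python
import re

# Same pattern as the contract's regex rule on md5.
MD5_PATTERN = re.compile(r"^[a-f0-9]{32}$")


def check_quality(record: dict) -> list[str]:
    """Return rule violations for one record; an empty list means it passes."""
    violations = []

    md5 = record.get("md5")
    if md5 is None:
        violations.append("md5: not_null failed")
    elif not isinstance(md5, str) or not MD5_PATTERN.fullmatch(md5):
        violations.append("md5: regex failed")

    filesize = record.get("filesize")
    # Mirrors the custom rule "filesize > 0". Assumption: a missing or
    # non-integer value is treated as a failure (bool is excluded on purpose).
    if not isinstance(filesize, int) or isinstance(filesize, bool) or filesize <= 0:
        violations.append("filesize: must be > 0")

    return violations
```

A record such as {"md5": "d41d8cd98f00b204e9800998ecf8427e", "filesize": 1024} returns an empty list, while an uppercase hash or a zero filesize each produce one violation.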
Server declarations. The contract knows where the data goes, which is what enables the targeted exports in the next sections:
servers:
  - name: json-log
    type: local
    format: json
  - name: mongodb
    type: mongodb
  - name: splunk-hec
    type: http
    format: json
The full contract is in the repository at datacontract.yaml.
Export to dbt
Once the contract exists, datacontract-cli generates a dbt schema file directly from it:
datacontract export datacontract.yaml --format dbt
This produces a schema.yml with column definitions, descriptions, and — the useful part — dbt tests generated from the quality rules. The not_null on md5 becomes a dbt not_null test automatically. The regex becomes a custom test. Your analytics engineers get contract enforcement inside their existing dbt workflow without writing a single test by hand.
The contract is the single source of truth. Change a quality rule in the YAML, re-export, and the dbt test updates automatically. No separate test maintenance, no drift between what the contract says and what dbt actually checks.
Export to Iceberg
For the data lakehouse path — Databricks, Snowflake, Athena — Iceberg is typically where security data lands for long-term retention and for feeding ML pipelines. The export command:
datacontract export datacontract.yaml --format iceberg --model analysis_record
Gotcha worth knowing: Running this without --model throws:
Exception: Can only output one model at a time, found 7 models
The Iceberg exporter maps one contract model to one Iceberg table by design. The --model flag is required and the value is the model name from your contract. In the MalZoo case that is analysis_record.
The output is an Iceberg table JSON schema with column types, nullability constraints, and descriptions mapped directly from the contract. Drop this into your table creation workflow and the schema is sourced from the same YAML that governs your dbt tests and your Splunk field extractions.
CI/CD: making it enforceable
A contract that is not enforced is documentation. The datacontract-cli has lint and test commands that run in seconds, which makes GitHub Actions the natural place to enforce it:
- name: Lint the data contract
  run: datacontract lint datacontract.yaml
- name: Test contract against sample output
  run: |
    datacontract test datacontract.yaml \
      --server json-log \
      --data tests/fixtures/sample_records.jsonl
Any PR that touches a worker module and silently breaks the contract fails the pipeline before it merges. Schema drift caught at review time, not during an incident at 2am.
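To illustrate what "schema drift" looks like at the record level, here is a small sketch that compares the keys in a JSONL fixture against the fields a contract declares. It is independent of datacontract-cli and only shows the idea. DECLARED_FIELDS and REQUIRED_FIELDS are hypothetical hard-coded subsets using field names mentioned in this post. A real check would read them from datacontract.yaml.

```python
import json

# Hypothetical subset of contract fields, hard-coded for the sketch.
DECLARED_FIELDS = {"md5", "filetype", "filesize", "packer", "yara_matches"}
# Assumption: md5 and filetype are the fields every record must carry.
REQUIRED_FIELDS = {"md5", "filetype"}


def find_drift(jsonl_text: str) -> dict:
    """Report undeclared fields and absent required fields across all records."""
    unknown, missing = set(), set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the fixture
        keys = set(json.loads(line))
        unknown |= keys - DECLARED_FIELDS   # fields the contract never declared
        missing |= REQUIRED_FIELDS - keys   # required fields this record lacks
    return {"unknown": sorted(unknown), "missing": sorted(missing)}
```

A worker change that renames filetype to file_type would show up twice here: once as an unknown field and once as a missing required one, which is the kind of signal you want at review time.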
Back to the AI agent angle
At the start of this post the problem was: producers have implicit schemas, consumers build fragile assumptions on top of them. AI agents make this significantly worse before they make it better, because an agent that misunderstands a field does not throw an exception — it reasons confidently with wrong data.
Data contracts are to AI agents what OpenAPI specs are to REST APIs. The description fields in the contract — the ones that explain when packer is null, or what filetype values to expect — are exactly what you inject into an agent’s system prompt or tool definition to ground its reasoning. The JSON Schema export from datacontract-cli maps directly onto the function schemas that LLM tool-calling expects.
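As a sketch of that idea, the snippet below turns contract field descriptions into a tool definition of the general shape LLM tool-calling APIs use: a name, a description, and JSON Schema parameters. Everything here is illustrative. The tool name search_malzoo_records and the build_tool_schema helper are hypothetical, the packer and yara_matches descriptions are placeholders rather than text from the real contract, and the exact envelope differs between model providers.

```python
import json

# Assumed mapping from contract logical types to JSON Schema types.
JSON_TYPES = {"string": "string", "integer": "integer", "number": "number",
              "boolean": "boolean", "array": "array", "object": "object"}

# Hypothetical field metadata shaped like the contract excerpt in this post.
CONTRACT_FIELDS = [
    {"name": "filetype", "logicalType": "string",
     "description": "MIME type or magic-derived file type string. "
                    "Acts as discriminator for the record schema."},
    {"name": "packer", "logicalType": "string",
     "description": "Placeholder: copy the real description from the contract."},
    {"name": "yara_matches", "logicalType": "array",
     "description": "Placeholder: copy the real description from the contract."},
]


def build_tool_schema(fields: list[dict], tool_name: str = "search_malzoo_records") -> dict:
    """Build a tool definition whose optional filters mirror the contract fields."""
    properties = {
        f["name"]: {
            # Fall back to string for logical types the mapping does not cover.
            "type": JSON_TYPES.get(f["logicalType"], "string"),
            "description": f["description"],
        }
        for f in fields
    }
    return {
        "name": tool_name,
        "description": "Search MalZoo analysis records by field value.",
        "parameters": {"type": "object", "properties": properties, "required": []},
    }


if __name__ == "__main__":
    print(json.dumps(build_tool_schema(CONTRACT_FIELDS), indent=2))
```

The design choice worth noting is that the descriptions are copied, not rewritten: the agent sees the same wording your human consumers do, so there is one place to fix a misleading definition.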
The contracts you write for your data pipelines today become the interface definitions your agent workflows rely on tomorrow. Worth starting now.
Final thoughts
The investment is small: one YAML file, a lint step in CI, and a discipline of updating the contract when the producer changes. The return is a data pipeline where schema drift is caught early, downstream consumers have a formal reference, and the same definition generates your dbt tests, your Iceberg schema, and eventually your agent tool specs — from a single source of truth.
Security data deserves the same engineering rigour we apply to APIs. Data contracts are how you get there.
Happy contracting :)