DMS — Data Meta Syntax

Version: 0.14 (draft)

file extension: .dms

A data syntax with YAML's clean look and a small, strict spec. Structure is indent-based (no repeated section headers). Types are distinct and never inferred from context. Heredocs and multi-line comments are first-class.

Design principles

  1. Indent-based structure, but a tiny indent rule (no YAML-style complexity).
  2. Strict, distinct types — never infer from context.
  3. No anchors, aliases, tags, merge keys, schemas, or references.
  4. Every value has exactly one canonical representation.
  5. UTF-8 only, NFC-normalized. LF or CRLF line endings, both accepted, LF canonical.

Lexical

! @ $ % ^ & * | ~ ` . , > < ? ; =

are reserved as decorator sigils at line-start position. A body line whose first non-whitespace character is one of these is a parse error in tier 0. The reservation is fixed in this spec — it is not derived from any per-document declaration. Tier 1 (defined in TIER1.md) binds these sigils to dialect-published decorator families via the _dms_imports front-matter field; tier-0-only decoders reject tier-1 documents at front-matter decode (_dms_tier: 1 triggers rejection on a tier-0-only decoder by the tier-marker rule below). The underscore (_) is not a reserved decorator sigil here — it has its own category, reserved for core / built-in decorators (e.g. heredoc modifiers _trim, _fold_paragraphs).

Three other punctuation characters that might look like candidates for this set are explicitly not reserved, because they already carry tier-0 grammar:

The reservation cost in tier 0 is zero: none of the seventeen reserved-sigil characters are members of the bare-key character set (§Keys and scalars), and none can appear as the first non-whitespace character of any other valid tier-0 construct, so no pre-existing valid document is invalidated. Decoders that previously accepted such lines by oversight must reject them after this spec revision.
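A minimal sketch of the line-start check, assuming the sigil list above is complete and that the caller has already excluded heredoc content. The name reserved_sigil_at_line_start is hypothetical and not part of any DMS port.

```python
# The seventeen reserved decorator sigils listed above.
RESERVED_SIGILS = frozenset("!@$%^&*|~`.,><?;=")


def reserved_sigil_at_line_start(line: str):
    """Return the reserved sigil that opens a body line, or None.

    Hypothetical helper: a real decoder would run this only on body lines,
    never on heredoc content.
    """
    stripped = line.lstrip(" ")
    if stripped and stripped[0] in RESERVED_SIGILS:
        return stripped[0]
    return None


print(reserved_sigil_at_line_start("  @cache"))    # the '@' sigil is reserved
print(reserved_sigil_at_line_start('host: "db"'))  # None: an ordinary key line
```

A tier-0 decoder would turn a non-None result into a parse error; the exact diagnostic wording is left to the error rules elsewhere in this spec.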

All four sub-ranges are frozen at Unicode 15.1.0 alongside the bare-key and NFC tables. The set is closed under Unicode's monotonic-additive stability guarantees: future SPEC bumps can only add codepoints, never invalidate documents that decode cleanly under the current floor.

This selection covers every emoji renderable as a single visual glyph in current systems — single-codepoint emoji (🚀), ZWJ families (👨‍👩‍👧), skin-tone variants (👍🏽), flags (🇺🇸), and keycaps (1️⃣).

ASCII overlap is naturally excluded. Characters with Emoji=Yes in UTS #51 but not Extended_Pictographic=Yes — digits 0-9, #, * — are not in the Reserved Emoji Set and continue to carry their tier-0 grammar meaning (numeric scalars, comment marker, decorator-sigil candidate). The Extended_Pictographic property was specifically designed to exclude these ASCII overlaps; that's why we use it as the primary base.

Copyright and trademark symbols are included. © (U+00A9), ® (U+00AE), and ™ (U+2122) are classified as Extended_Pictographic=Yes by Unicode and are therefore reserved. To use them as plain-text prose at the start of a line or as a bare-key character, write them inside quotes (note: "© 2026 Acme") or a string scalar. This is a deliberate consequence of taking Unicode's classification at face value rather than carving out exceptions per codepoint: carve-outs would require ports to track their own pictographic-vs-text taste, which would diverge over time.

Concretely, reservation applies as follows:

The four sub-ranges are sourced from UCD 15.1 (emoji-data.txt for Extended_Pictographic and Emoji_Modifier; the Unicode core database for Regional Indicators and U+20E3) and shipped per port as a single frozen table, in lockstep with the bare-key and NFC tables. A port must not delegate the check to its host runtime's Unicode library — host tables track the runtime's Unicode version and would silently diverge as new emoji are assigned. Grapheme-cluster segmentation is likewise performed against a UAX #29 algorithm frozen at 15.1.0 (the algorithm itself is stable; only the property tables it consults need pinning).

Decoder error messages must name the offending codepoint by hex value and category, e.g. "U+1F680 (🚀, Extended_Pictographic) reserved as emoji", since terminals and editors render some Reserved-Emoji codepoints as monochrome text glyphs that authors may not recognize as emoji.
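The message shape above can be produced with a one-line formatter. This is only a sketch: the function name is hypothetical, and the category string is assumed to come from the port's frozen 15.1 table rather than from the host runtime.

```python
def reserved_emoji_error(ch: str, category: str) -> str:
    """Format the diagnostic for a single reserved codepoint.

    `ch` must be exactly one codepoint; `category` is whatever label the
    port's frozen table assigns (for example "Extended_Pictographic").
    """
    return f"U+{ord(ch):04X} ({ch}, {category}) reserved as emoji"


print(reserved_emoji_error("\U0001F680", "Extended_Pictographic"))
```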

Unicode normalization

To prevent visually-identical strings from comparing unequal — e.g. é written as the precomposed scalar U+00E9 vs. as U+0065 U+0301 (e + combining acute) — every string the decoder produces is normalized to Normalization Form C (NFC) as defined by Unicode Standard Annex #15.

This applies uniformly to:

Normalization is applied to the source after UTF-8 decoding and before tokenization, so even structural elements like bare keys see normalized input — a bare key written as decomposed café (e + U+0301) becomes precomposed café before the bare-key category check, and is accepted. Strings produced by escape sequences (\uXXXX, \UXXXXXXXX) are additionally NFC-normalized at the point the string is constructed, since escape-decoded scalars don't pass through the source-level NFC pass. NFC does not salvage non-scalar escapes — a surrogate escape is still a parse error.

NFC is stable under the Unicode Stability Policy: for any character already assigned in Unicode 4.1 (2005) or later, its NFC form does not change in newer Unicode versions. New characters assigned in later Unicode releases get new NFC mappings, however — so a port that delegates NFC to its host runtime would normalize a Unicode 16 codepoint differently from a port frozen at 15.1.

To keep documents byte-identical across ports and across time, NFC tables are frozen at Unicode 15.1.0 and shipped with each port, exactly like the bare-key set. A port must not delegate NFC to its host runtime's Unicode library (Python's unicodedata.normalize, ICU's unorm2, etc.) — those track whichever Unicode version the runtime was built against and would silently diverge once new codepoints are assigned. The two table generations (NFC + bare-key set) are kept in lockstep on a single SPEC-controlled Unicode floor; a future SPEC bump moves both together.

Bump policy. A future SPEC version may move the Unicode floor from 15.1 to a higher release. When that happens, ports ship only the new tables — the prior floor's tables are not retained, and ports do not carry the cumulative union of historical UCD snapshots. This is safe because both properties involved are monotonic under Unicode's own stability guarantees: XID_Continue only grows (characters are never removed from the identifier sets), and NFC mappings of already-assigned characters never change. Consequently every document valid under floor N decodes byte-identically under floor N+k — a bump is purely additive (new accepted characters, new NFC entries for newly assigned codepoints) and never breaks pre-bump documents.

The duplicate-key check operates on NFC-normalized keys, so writing café twice in the same table — once precomposed, once decomposed — is a parse error rather than two distinct keys.

Round-trip. A decoder-emitter pair preserves the NFC value of every string, not the original source bytes. Non-NFC input becomes NFC on re-emit; emitters do not (and should not) reconstruct the original encoding.
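The effect of NFC on key comparison can be illustrated in a few lines. This sketch uses Python's unicodedata purely for demonstration; as stated above, a conforming port must ship its own tables frozen at Unicode 15.1.0 instead of calling the host library. The helper names are hypothetical.

```python
import unicodedata


def nfc(s: str) -> str:
    # Illustration only: a conforming port uses frozen 15.1 tables here.
    return unicodedata.normalize("NFC", s)


def reject_duplicate_keys(keys):
    """Return the NFC-normalized keys, raising on the first duplicate."""
    seen = set()
    out = []
    for key in keys:
        normalized = nfc(key)
        if normalized in seen:
            raise ValueError(f"duplicate key: {normalized!r}")
        seen.add(normalized)
        out.append(normalized)
    return out


precomposed = "caf\u00e9"   # é as U+00E9
decomposed = "cafe\u0301"   # e + combining acute U+0301
assert precomposed != decomposed        # different codepoints in the source
assert nfc(decomposed) == precomposed   # identical after NFC
```

Passing both spellings to reject_duplicate_keys raises, which mirrors the duplicate-key rule described above.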

The indent rule

Nesting is expressed by indentation. One rule:

Inside a single parent, all direct children must be indented by the same number of spaces. The first child sets the width; every subsequent sibling must match it exactly.

Different parents can pick different widths. A block ends when a line is encountered at an indent strictly less than its children's width.

a:
    b: 1        # a's children are 4-wide (first child set the width)
    c: 2        # must be 4
    d:
      e: 1      # d's children are 2-wide; independent of a's choice
      f: 2      # must be 2
g: 3            # back at the root level

Inconsistent sibling indent (e.g. b at 4 spaces then c at 3) is a decode error that reports the offending column.
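The rule can be checked with a stack of indent columns, one per open block. The following is a minimal sketch under stated assumptions: it looks only at leading spaces, skips blank lines and line comments, and does not handle heredocs, block comments, or the requirement that a deeper line follow a block opener. The function name is hypothetical.

```python
def check_sibling_indent(source: str) -> None:
    """Raise ValueError on the first line that breaks the sibling-indent rule."""
    stack = [0]  # absolute indent column of each open block's children
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "//")):
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent > stack[-1]:
            # First child of a new block: it sets the width for its siblings.
            stack.append(indent)
            continue
        # Close every block whose children sit deeper than this line.
        while stack[-1] > indent:
            stack.pop()
        if stack[-1] != indent:
            raise ValueError(
                f"line {lineno}, column {indent + 1}: inconsistent sibling indent"
            )


good = "a:\n    b: 1\n    c: 2\n    d:\n      e: 1\n      f: 2\ng: 3\n"
check_sibling_indent(good)  # passes silently
```

Feeding it b at 4 spaces followed by c at 3 raises, with the column of c in the message.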

Front matter

A DMS document may begin with an optional front matter block delimited by +++ lines. If present, the block must precede any other content (blank lines, line comments, and block comments may appear before +++).

+++
app_name:     "myservice"
doc_version:  "1.2.3"
updated:      2026-04-23
+++

# the actual document body starts here
database:
  host: "db.internal"
  port: 5432

Rules:

Currently-defined reserved keys

Key | Type | Meaning
_dms_tier | non-negative int | Declares the minimum decoder tier required. Absent ⇒ tier 0 implied. _dms_tier: 0 is the explicit tier-0 form. _dms_tier: 1 opts into tier 1 (see TIER1.md): a tier-0-only decoder rejects with the tier-1-pointing error described below; a tier-1-capable decoder accepts. _dms_tier: N for N ≥ 2 is a parse error in this revision (no tier ≥ 2 is currently defined). A value of any other type, whether string ("0"), float (0.0), bool, list, table, or datetime, is a parse error: "_dms_tier must be a non-negative integer".

Any other _-prefixed key inside the front matter is currently reserved but undefined; a decoder encountering one must refuse with "unknown reserved key: <name>". Reserved keys exist as a forward-compatibility hook: future versions of DMS may add new reserved keys, and old decoders will give a clean error message rather than silently misinterpreting them.
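A sketch of how a tier-0-only decoder might validate reserved keys once the front matter has been decoded into a mapping. Assumptions: the front matter is a plain dict, the function name is hypothetical, and the wording of the tier-1 and tier ≥ 2 rejections is invented for illustration (only the two quoted messages above come from this spec).

```python
def check_reserved_keys(front_matter: dict) -> int:
    """Validate '_'-prefixed keys and return the declared tier (0 if absent)."""
    tier = 0
    for key, value in front_matter.items():
        if not key.startswith("_"):
            continue  # user metadata: no reserved-key rules apply
        if key != "_dms_tier":
            raise ValueError(f"unknown reserved key: {key}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("_dms_tier must be a non-negative integer")
        if value >= 2:
            # Hypothetical wording: no tier >= 2 is defined in this revision.
            raise ValueError(f"_dms_tier: {value} is not defined in this revision")
        if value == 1:
            # Hypothetical wording for the tier-1-pointing error.
            raise ValueError("document requires tier 1; see TIER1.md")
        tier = value
    return tier


print(check_reserved_keys({"title": "Production config", "_dms_tier": 0}))
```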

Tier semantics

Decoding modes — full and lite

Independent of the tier axis, every conforming DMS decoder exposes two decoding modes: a full mode (default) and a lite mode (opt-in). Both modes share the same grammar, the same error diagnostics, and the same data tree — they differ only in how much round-trip metadata the decoder keeps.

Aspect | Full mode (default) | Lite mode (opt-in)
Data tree (tables, lists, scalars) | produced | produced
Front matter (_meta) | produced | produced
Comment AST (leading / inner / trailing / floating) | produced | not produced (comments are lexed and discarded)
original_forms (integer base, string form, heredoc form) | produced | not produced
Full-mode encode() (preserving round-trip) | supported | not supported (needs comments + original_forms)
Lite-mode encode() (canonical-form emit) | supported | supported

The grammar is identical in both modes. Lite mode does not relax error checking, does not skip front-matter validation, does not loosen Unicode normalization, does not change which inputs are accepted. It is the same decoder with two output channels turned off.

What lite mode is for. Read-only consumers (application configs, CI pipelines, deploy scripts, sysctl-style readers) that decode, extract values, and never re-emit the document. The comment-AST and original_forms machinery exists to support encode(); if you don't call encode(), those structures are dead weight. Lite mode lets read-only callers skip the bookkeeping and recover wall-clock time (reference benchmarks show a roughly 1.5–2× decode speedup on flat-table workloads; the figure varies by port).

What lite mode is not. Lite mode is not a "permissive" mode, not a non-conforming subset, and not an alternative conformance level. A document that decodes in full mode decodes in lite mode and vice versa; a document that errors in full mode errors at the same character in lite mode. The two modes produce the same data tree.

encode() itself has two modes — full and lite — orthogonal to the decode-side modes. Same name pattern, different concern. The decode-side mode controls how much round-trip metadata the decoder captures. The emit-side mode controls how much of that metadata encode() re-emits.

encode mode (input → output) | Comments | original_forms | Use case
Full (default): preserving emit | re-emitted | re-emitted (hex/oct/bin/literal-string forms) | Round-trip a decoded file, hand-edited config writer
Lite: canonical emit | dropped | dropped (decimal ints, basic-quoted strings) | Generate DMS from in-memory data; bench/strip

Lite-mode encode accepts any Document — full or lite. It ignores comments and original_forms even when present, and emits canonical form: decimal integers, basic-quoted strings, no comments, emitter-default whitespace. The output is always valid DMS that re-decodes to a data-equivalent Document.

Full-mode encode (the existing default) requires a full-mode-decoded Document (or one constructed in code with the metadata fields populated). A decoder that ships both encode modes MUST refuse encode(lite_doc, mode=full) with a clean error ("full-mode emit requires comments + original_forms; got a lite-mode Document"). encode(lite_doc, mode=lite) is always valid.
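The refusal rule can be sketched as a guard that runs before any emit. The Document shape here is an assumption made for illustration (a lite-mode Document is modelled as one whose metadata fields are None); real ports may represent the distinction differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    # Hypothetical minimal shape; None metadata marks a lite-mode Document.
    body: dict
    comments: Optional[dict] = None
    original_forms: Optional[dict] = None


def check_encode_request(doc: Document, mode: str = "full") -> None:
    """Raise if the requested encode mode cannot be served for this Document."""
    if mode not in ("full", "lite"):
        raise ValueError(f"unknown encode mode: {mode!r}")
    if mode == "full" and (doc.comments is None or doc.original_forms is None):
        raise ValueError(
            "full-mode emit requires comments + original_forms; "
            "got a lite-mode Document"
        )


lite_doc = Document(body={"port": 5432})
check_encode_request(lite_doc, mode="lite")  # always valid
```

Calling check_encode_request(lite_doc, mode="full") raises with the message quoted above.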

Round-trip stability (under §encode) is required only for full-mode emit of a full-mode-decoded Document. Lite-mode emit is canonical-form lossy by design — encode_lite(decode(src)) may strip comments and re-render hex integers as decimal; that is the intended behaviour, not a violation.

Conformance. Every conforming decoder MUST ship full-mode decode AND full-mode encode. Lite-mode decode and lite-mode encode are optional to ship; decoders that ship them MUST do so under the contract above. A decoder that exposes only lite mode (either side) is non-conforming.

Capability reporting. A decoder that ships lite mode advertises it via a supports_lite_mode boolean on its capability surface. Callers can probe before opting in.

Unordered tables — optional opt-in (orthogonal axis)

A third, independent axis exists alongside full/lite: the table ordering guarantee. Tier 0 makes insertion-order preservation a default invariant — every conforming decoder ships an ordered mode and the conformance corpus is checked against it. Some consumers don't care: kubectl-style read-only loaders, monitoring agents, batch processors that consume DMS, project to a few keys, and never re-emit. For those callers, an unordered mode is allowed as an optional opt-in.

Aspect | Ordered (default) | Unordered (opt-in)
Iteration order over a decoded table | insertion-order | arbitrary (decoder may use a hash-only backing)
Conformance corpus expected output | byte-stable | best-effort; equality compares structurally, not order
Full-mode encode() (round-trip) | supported | not supported (round-trip needs stable order)
Lite-mode encode() (canonical) | supported | supported (emits in iteration order, no stability promise)

API shape. A decoder that ships unordered mode exposes it via a parallel entry point or a flag — decode_document_unordered(src), decode(src, ignore_order=true), etc. The exact name is language-specific. The CLI convention used by the conformance and bench harnesses is --ignore-order.

Capability reporting. A decoder that ships unordered mode advertises it via a supports_ignore_order boolean on its capability surface. Callers probe before opting in.

Combinations. Unordered is orthogonal to full/lite — all four combinations are conforming if the decoder ships them: (ordered, full), (ordered, lite), (unordered, full), (unordered, lite). The most useful pairing for read-only callers is (unordered, lite): fastest decode, no comment AST, hash-only table backing.
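A read-only caller can probe both capability flags and fall back to the defaults when a port does not ship the optional modes. This sketch assumes the capability surface is exposed as a dict and that the decode options are named mode and ignore_order, as in the illustrative spellings used in this section; the function name is hypothetical.

```python
def fastest_read_only_options(capabilities: dict) -> dict:
    """Pick decode options for a caller that never re-emits the document."""
    options = {}
    if capabilities.get("supports_lite_mode", False):
        options["mode"] = "lite"          # skip comment AST + original_forms
    if capabilities.get("supports_ignore_order", False):
        options["ignore_order"] = True    # allow a hash-only table backing
    return options


print(fastest_read_only_options({"supports_lite_mode": True}))
```

An empty result means the caller should use the default (ordered, full) decode.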

Reference implementation note (informational, not normative). The DMS Rust reference ships --ignore-order on the CLI surface with spec-correct semantics, but at the time of writing the runtime backing is still IndexMap-based (no measurable decode-speed win). The flag is plumbed end-to-end so other ports can implement the HashMap-backed fast path without API churn. Ports that DO swap to a hash-only backing should advertise supports_ignore_order = true.

API shape (full/lite modes). Language-specific. The general pattern is a construction-time option or a parallel entry point, e.g. decode(source, mode="lite") versus decode(source), or decode_lite(source) versus decode(source). The spec does not mandate the exact API name; it mandates the full/lite contract described under Decoding modes.

Examples

Full-mode decode (default), with comments preserved:

doc = dms.decode(source)              # full mode by default
doc.comments[("db", "port")]         # leading + trailing AttachedComments
out = dms.encode(doc)                # round-trips the source

Lite-mode decode (opt-in), no comment AST:

doc = dms.decode(source, mode="lite") # comments lexed and discarded
doc.body["db"]["port"]               # data is identical to full mode
doc.comments                         # empty / absent
dms.encode(doc)                      # ERROR: round-trip requires full mode
dms.encode(doc, mode="lite")         # OK — canonical emit, no metadata needed

Lite-mode emit on a full-mode Document — strip comments + canonicalise:

doc = dms.decode(source)              # full-mode decode, comments captured
canonical = dms.encode(doc, mode="lite")
# `canonical` has no comments, decimal integers (even if source used 0xFF),
# basic-quoted strings (even if source used '...' literal). re-decodes to a
# data-equivalent Document.

API shape

A decoder must expose the decoded document as both a body value (what the rest of the spec already defines) and a front matter table (the decoded +++ block, or an empty/absent value if the document had no front matter). Exactly how is language-specific; for this implementation's conformance encoder (JSON output), the shape is:

This means every existing conformance test — none of which declare front matter — keeps its expected output unchanged. New tests that use front matter produce the wrapped form.

Examples

Explicit tier 0 declaration:

+++
_dms_tier: 0
+++
host: "db.internal"
port: 5432

User metadata:

+++
title:   "Production config"
author:  "ada@example.com"
updated: 2026-04-23
+++
database:
  host: "db.internal"

(Front matter surfaces as metadata.)

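Tier 1 on a tier-0-only decoder (rejected):

+++
_dms_tier: 1
+++
host: "db.internal"

(A tier-0-only decoder refuses with the tier-1-pointing error; a tier-1-capable decoder accepts. _dms_tier: N for N ≥ 2 is a parse error in this revision.)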

User tries a reserved key (parse error):

+++
_my_app_version: "1.0"   # error: '_'-prefixed keys are reserved
+++

Front-matter-only decode

Every conforming decoder must expose a separate entry point that decodes only the front-matter block and stops, skipping the body. This exists for callers that need only the document's metadata — config loaders checking _dms_tier, indexers harvesting user keys, dispatchers choosing a downstream decoder — and would otherwise pay the full decode cost for a few header lines.

Contract:

API shape. Language-specific. Reference ports use decode_front_matter(source) / decodeFrontMatter(source) per host idiom. The CLI convention used by the conformance and bench harnesses is --front-matter-only.

Conformance. Required at tier 0. Absence of this entry point is non-conformance; there is no capability flag.
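The scanning half of such an entry point can be sketched as follows. Assumptions: the +++ delimiter sits on its own line, only blank lines and line comments precede it (block comments are not handled), and an unterminated block is treated as an error. The function name is hypothetical and the returned lines still need to be decoded as a table.

```python
def front_matter_lines(source: str):
    """Return the raw lines between the +++ delimiters, or None if absent."""
    lines = source.splitlines()
    i = 0
    # Skip blank lines and line comments that may precede the opener.
    while i < len(lines) and (
        not lines[i].strip() or lines[i].lstrip().startswith(("#", "//"))
    ):
        i += 1
    if i == len(lines) or lines[i].strip() != "+++":
        return None  # no front matter: the body starts here
    block = []
    for line in lines[i + 1:]:
        if line.strip() == "+++":
            return block  # stop here: the body is never scanned
        block.append(line)
    raise ValueError("unterminated front matter: missing closing +++")


src = '+++\napp_name: "myservice"\n+++\ndatabase:\n  host: "db.internal"\n'
print(front_matter_lines(src))
```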

Forward compatibility

DMS evolves by reserving syntactic and lexical real estate today so future versions can extend it without breaking existing documents. Reservations fall into two groups: declared (the spec explicitly names them) and implicit (the tier-0 grammar rejects them today, leaving the slot free for a future tier to define).

Declared reservations

Implicit reservations

The tier-0 grammar is strict: any token sequence not matched by an explicit production is a parse error. The following positions are currently rejected and are candidates for tier ≥ 1 extension. Tier-0 decoders must continue to reject them; a tier ≥ 1 document opts in via _dms_tier.

These reservations do not commit DMS to ever populating these slots — they document where future extensions could land without breaking existing tier-0 documents.

Decoders SHOULD emit a tier-1-pointing error when they encounter a reserved decorator sigil in any reserved slot ("decorator sigil '<sigil>' requires tier 1; set _dms_tier: 1 and declare the dialect in _dms_imports — see TIER1.md"), rather than a generic "unexpected token."

Document root

A DMS document's root value is polymorphic — it can be a table, a list, or a scalar. The root type is determined by the first significant line (significant = not blank, not a line comment, not a block comment):

First significant line begins with… | Root is a …
a key followed by : | table
+ (list item marker) | list
any other value token (string, number, """, …) | scalar
nothing (document is empty or only comments) | empty table ({})

Once the root type is committed, every subsequent top-level (column 0) line must match it:

A top-level line that violates the committed root type is a parse error.
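The root-type decision can be sketched as a scan for the first significant line. This is an approximation under stated assumptions: the bare-key pattern stands in for the real XID_Continue-based set, block comments and front matter are not handled, and the function name is hypothetical.

```python
import re

_QUOTED_DQ = r'"(?:[^"\\]|\\.)*"'
_QUOTED_SQ = r"'[^']*'"
_BARE = r"""[^\s:#{}\[\]"'.]+"""  # rough stand-in for the bare-key set
_KEY_LINE = re.compile(rf"(?:{_QUOTED_DQ}|{_QUOTED_SQ}|{_BARE}):(?:\s|$)")


def root_kind(source: str) -> str:
    """Return 'table', 'list', or 'scalar' for a document body."""
    for line in source.splitlines():
        s = line.strip()
        if not s or s.startswith(("#", "//")):
            continue  # not significant
        if s.startswith("+ "):
            return "list"
        if _KEY_LINE.match(s):
            return "table"
        return "scalar"
    return "table"  # empty or comments only: the empty table


print(root_kind('title: "production"\n'))
print(root_kind('+ name: "web1"\n'))
print(root_kind("42\n"))
```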

Examples

Table root (the common case for config):

title: "production"
database:
  host: "db.internal"

List root:

+ name: "web1"
  ipv4: "10.0.0.1"
+ name: "web2"
  ipv4: "10.0.0.2"

Scalar root:

42

or, as a separate document:

"""
A document whose entire value is a multi-line string.
"""

Keys and scalars

bare_key:     1             # bare key: letters, digits, underscore, dash
"quoted key": 2             # double-quoted: escapes processed
'quoted key': 3             # single-quoted: literal, every character as-is
résumé:       4             # Unicode letters are allowed in bare keys
42:           5             # numeric-looking bare keys are fine
"":           6             # empty string key — must be quoted

What counts as a bare key

A bare key is one or more characters, each drawn from:

XID_Continue is the Unicode-standard "identifier continuation" set — letters, digits, combining marks, and a curated handful of joiners — and is what Python, Rust, and most modern languages use for identifiers. Subtracting Default_Ignorable_Code_Point removes invisibles such as zero-width joiners and variation selectors that would otherwise let two visually-identical bare keys differ in their byte content.

The accepted set is frozen, not host-derived. Each port ships its own table generated once from the UCD 15.1 data files. A port must not delegate this check to its host runtime's Unicode library (Python's str.isidentifier(), ICU, etc.), because those track whichever Unicode version the runtime was built against — meaning the set of accepted bare-key characters would silently grow over time as the host platform updates. Freezing the set at 15.1 guarantees that a document written today decodes identically a decade from now, on every port, regardless of what new code points Unicode introduces. A future SPEC version may bump the floor; until then, ports re-emit their tables only on explicit SPEC instruction.

Keys that look like numbers (42), booleans (true), or other reserved identifiers (inf, nan) are valid bare keys — the trailing : disambiguates context. Every decoded key is a string, regardless of whether it was written bare or quoted: 42: x produces the string key "42".

A bare key may consist entirely of _ and/or - characters (_:, -:, _-_: are all valid keys). The character-set rule is positional, not compositional — there is no "must contain at least one letter or digit" requirement.

Quoting

An empty string key must be quoted ("" or '') — a bare key requires at least one character. Any key containing whitespace, :, #, {}, [], ", ', or . must also be quoted.

Separator whitespace

A : that terminates a key must be followed by a space (or end-of-line, if the value is a child block). host:localhost is a parse error; host: localhost is fine.

Duplicate keys

Two keys in the same table that, after decoding, produce the same string are a parse error. This rule compares the final key strings — which are NFC-normalized (see Unicode normalization) — so "42" and 42 collide, "hello" and 'hello' collide, and a key written as precomposed é collides with one written as e + U+0301.

Key order

Key insertion order is preserved. A DMS decoder must expose each table as an insertion-ordered structure so that doc-to-doc diffs are stable and round-tripping a document emits keys in the same order they were written.

Block vs scalar values

A key can take its value in one of three shapes:

  1. Inline scalar: the value fits on the same line.

       port: 5432
       name: "web1"

  2. Child block: the key ends with a bare : and the next non-blank line is indented further.

       database:
         host: "db.internal"
         port: 5432

  3. Heredoc: a triple-quote opener (""" or ''', with an optional label) follows the :; content starts on the next line and runs until the terminator. See Heredocs below.

A key with bare : and no indented block beneath it is a parse error. Use [] or {} flow form for empty collections.

The three shapes are mutually exclusive: a key with an inline scalar (or a heredoc) cannot also have an indented child block beneath it. Source like

port: 5432
  child: 1   # parse error: inline value already given for `port`

is rejected — the decoder commits to the inline value on the port line and then sees the deeper-indented child: as illegal indent (no parent block was opened). To get a child block, drop the inline value:

port:
  child: 1

Lists

A list item is a line whose first non-whitespace character is +, followed by a space and the item's content.

Mnemonic: read + as "push this onto the list." Each + line appends one item to the enclosing list, the same way list.push(x) (JS), list.append(x) (Python), or vec.push(x) (Rust) appends to an array — and the visual column of the + plays the role of "which list" when lists are nested.

tags:
  + "web"                   # scalar item
  + "frontend"
  + "public"

servers:
  + name: "web1"            # table item: first key sits on the + line
    ipv4: "10.0.0.1"
    disks:
      + mount: "/"
        size_gb: 100
      + mount: "/var"
        size_gb: 500
  + name: "web2"
    ipv4: "10.0.0.2"
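For orientation, the data tree described by the example above can be written as a Python literal. This only illustrates the resulting shape; it is not the output format of any particular port.

```python
# Data tree for the `tags` / `servers` example, as plain Python values.
doc = {
    "tags": ["web", "frontend", "public"],
    "servers": [
        {
            "name": "web1",
            "ipv4": "10.0.0.1",
            "disks": [
                {"mount": "/", "size_gb": 100},
                {"mount": "/var", "size_gb": 500},
            ],
        },
        {"name": "web2", "ipv4": "10.0.0.2"},
    ],
}

# Each + line appended one item to its enclosing list.
assert len(doc["servers"]) == 2
assert doc["servers"][0]["disks"][1]["mount"] == "/var"
```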

Rules:

Comments

Required of every conforming DMS decoder. This is not a "preserving variant" of full mode — comment-AST attachment and round-trip preservation are part of the format definition. A decoder that drops comments in full mode, or that decodes them but can't reproduce them via encode (see §encode), does not conform. Lite mode (see §Decoding modes — full and lite) is an explicit, opt-in alternative that discards comments by design; that is a documented mode, not a preservation failure.

DMS preserves comments through decoding. Every comment in the source — line comment (# or //), C-style block comment (/* … */, nestable), or hash-block comment (###LABEL … LABEL, ### … ###) — is captured during decoding and attached to the nearest neighbor in the value tree as a first-class AST node. This is what makes a decode → modify → re-encode round-trip keep comments at the right places in the output.

The contrast with most config formats is the load-bearing piece of DMS's design:

DMS bakes preservation into the format itself. Every reference decoder (Rust, Go, C, Zig, Python, Node, Perl pure, Perl XS) returns a Document whose comments are tracked at the same level as the data, by spec.

The rest of this section defines the attachment rules and round-trip contract. See §encode for the emitter side of the round-trip.

Attachment rules

When the decoder encounters a comment, it attaches it to a value / kvpair / list-item / container node according to these rules.

Leading comment. A line whose only significant content is a comment, immediately preceding a kvpair, list item, or block opener, attaches to that following node as a leading comment — provided there is no blank line between the comment and the node. Multiple leading comments stack on the same node, in source order:

# server pool                    ← leading
# updated 2026-04-22             ← leading (also)
servers:
  + name: "web1"

Trailing comments. Comments on the same line as a value, after that value, attach to the value as trailing comments. Multiple trailing comments stack in source order — possible because a /* ... */ block comment doesn't terminate the line. A # or // line comment, if present, consumes the rest of the line and must therefore come last:

port:  8080   # default                              ← trailing on `port`
retry: 3      /* aggressive */ /* see SLO */         ← two trailing block
token: "x"    /* see vault */ # never log this       ← block, then line

Inner comments. A /* ... */ block comment appearing between a key's : and its value, or between a + and a list item's content, attaches to that kvpair or list item as an inner comment. Multiple inner comments stack in source order:

secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"
+ /* see runbook */ name: "web1"

The rule is positional — inner attaches whenever a /* ... */ appears between the opening sigil (: or +) and the value, regardless of whether the value is an inline scalar, a heredoc opener, or the newline that opens a child block:

db: /* connection cluster */
  host: "db.internal"
  port: 5432

Inner comments are /* ... */ only. Line comments (#, //) consume to end-of-line and ### ... ### block comments require the opener on its own line, so neither can appear mid-token.

Floating comment. A comment that does not attach to a following sibling — either because a blank line separates it from the next sibling, or because the block closes before any sibling appears — attaches to the enclosing container (table or list) as a floating comment, in source order:

servers:
  + name: "web1"
  # the following block is currently disabled

  # restore by uncommenting       ← floating on `servers`

The blank-line rule is what disambiguates "section header" comments (meant to apply broadly) from "this-key" comments (meant to apply to the next key). Authors who want a comment to attach to the immediately following key just omit the blank line; authors who want a section header keep the blank line.

Block comments in leading / trailing / floating positions. The block-comment forms (### ... ### / ### LABEL ... LABEL, and /* ... */) follow the same attachment rules as line comments when they occupy any of the leading / trailing / floating positions. A block comment that occupies its own line-range and is followed by a kvpair (no blank line) attaches as a leading comment; a block comment on the same line as a value, after the value, attaches as trailing; otherwise the floating logic applies. The inner position (above) is the only position restricted to a single comment form (/* ... */).

For example, a trailing /* ... */:

x: 1 /* trailing C-block */

attaches as trailing on the kvpair for x and round-trips through encode next to its value, exactly like a trailing # foo line comment would.

Front matter comments. Comments inside the +++ ... +++ block follow the same attachment rules, scoped to the front matter table. Their leading / inner / trailing / floating attachments live alongside the front matter user keys; they do not leak into the body.
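The leading-versus-floating decision for own-line comments reduces to the blank-line rule. A minimal sketch, assuming the input is the list of lines of a single block at one indent level, that only line comments are present, and that the end of the list means the block has closed; the function name is hypothetical.

```python
def classify_own_line_comments(block_lines):
    """Label each own-line comment in one block as 'leading' or 'floating'."""

    def is_comment(line: str) -> bool:
        return line.lstrip().startswith(("#", "//"))

    result = []
    for i, line in enumerate(block_lines):
        if not is_comment(line):
            continue
        # Skip over any further comments stacked directly below this one.
        j = i + 1
        while j < len(block_lines) and is_comment(block_lines[j]):
            j += 1
        # Leading only if a node follows with no blank line in between.
        attached = j < len(block_lines) and block_lines[j].strip() != ""
        result.append((line.strip(), "leading" if attached else "floating"))
    return result


block = [
    'host: "localhost"',
    "# raised from 80 after the LB change in 2024-Q4",
    "port: 8080      # default for staging",
    "",
    "# restore by uncommenting",
    "# debug: true",
]
for text, position in classify_own_line_comments(block):
    print(position, text)
```

Trailing and inner comments share a line with a key or value, so they never reach this check.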

Paths

The comment-attachment metadata and the round-trip original_forms records both key into the document by path. A path is an ordered sequence of segments, each one of:

Conventions:

Examples:

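Each source fragment below is followed by the path of its value "web1".

Source:
host: "web1"
Path: ["host"]

Source:
db:
  host: "web1"
Path: ["db", "host"]

Source (list root):
+ "web1"
Path: [0]

Source:
servers:
  + name: "web1"
Path: ["servers", 0, "name"]

Source:
+++
app: "web1"
+++
x: 1
Path: ["__fm__", "app"]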
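Resolving a path against decoded data is a short loop. This sketch assumes string segments index tables, integer segments index lists, and a leading "__fm__" segment redirects the lookup into the front matter table, as the examples above suggest; the function name and argument layout are hypothetical.

```python
def resolve(body, front_matter, path):
    """Follow `path` into the body, or into the front matter for "__fm__"."""
    segments = list(path)
    node = body
    if segments and segments[0] == "__fm__":
        node = front_matter
        segments = segments[1:]
    for segment in segments:
        node = node[segment]  # str key for a table, int index for a list
    return node


body = {"servers": [{"name": "web1"}]}
front_matter = {"app": "web1"}
assert resolve(body, front_matter, ["servers", 0, "name"]) == "web1"
assert resolve(body, front_matter, ["__fm__", "app"]) == "web1"
```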

What's stored

Each comment is captured as { content, kind }:

The decoder does not assign stable IDs, breadcrumbs, or position metadata beyond the leading / inner / trailing / floating classification. Comments are identified solely by their attachment — there is no cross-document identity. (Earlier drafts of the spec experimented with content-hash IDs as part of a richer modifier system; that mechanism has been removed and is not part of DMS today.)

Every position is a list of comments stored in source order. Leading and floating may contain any mix of line and block comments; trailing accepts any mix but a # / // line comment must come last (it consumes the rest of the line); inner accepts only /* ... */ block comments.

Round-trip semantics

Worked example

Source:

# the database section
db:
  host: "localhost"
  # raised from 80 after the LB change in 2024-Q4
  port: 8080      # default for staging
  secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"

  # restore by uncommenting
  # debug: true

After decoding, Document.comments contains the entries below (paths are breadcrumbs into the value tree). Every entry's value is a list of {content, kind} records, in source order:

Path               Position   Contents
["db"]             leading    # the database section
["db", "port"]     leading    # raised from 80 after the LB change in 2024-Q4
["db", "port"]     trailing   # default for staging
["db", "secret"]   inner      /* see vault */, /* rotated 2026-04-01 */
["db"]             floating   # restore by uncommenting, # debug: true

Now mutate the decoded tree — say, change port to 5432:

doc.body["db"]["port"] = 5432

encode(doc) emits:

# the database section
db:
  host: "localhost"
  # raised from 80 after the LB change in 2024-Q4
  port: 5432      # default for staging
  secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"

  # restore by uncommenting
  # debug: true

The data changed; the comments stayed at the same nodes.

encode — DMS-output emitter contract

Every conforming decoder must ship encode(document). It has two modes — full (default) and lite — orthogonal to the decode-side mode (see §Decoding modes — full and lite). The mode controls how much round-trip metadata the emitter re-emits.

Full mode (default) re-emits valid DMS source from a decoded Document produced in full-mode decode. The contract is data + comments + literal-form preservation under round-trip:

decode(encode(decode(source))) produces a Document that is data-equivalent to decode(source), has the same comments at the same attached paths, and uses the same literal forms for the values it can preserve (integer base, string form).

Lite mode emits the same data tree in canonical form: comments are dropped, integers are emitted in decimal regardless of source base, strings are emitted in basic-quoted form regardless of source flavour, and original_forms is not consulted. Lite-mode emit accepts both full-mode and lite-mode decoded Documents; the metadata is simply ignored. Lite-mode emit is lossy by design for comments and source forms; it preserves only the data tree.

The rest of this section specifies full-mode emit (where the preservation contract has teeth). Lite-mode emit follows the same data-tree rules but skips every "Required preservation" point below.

Byte-for-byte source preservation is not required (whitespace and indentation choices are emitter-determined; original column alignment is not preserved); semantic round-trip is.

Required preservations:

  1. Integer base. A decoder must record the source-form of every integer literal (e.g. 0x1F40, 0o755, 0b1010_0110, 1_000_000, +42, -7). On emit, the integer is re-rendered in its original base, preserving sign and underscore separators where present. Stored as a side-channel original_forms keyed by the value's path in the tree (analogous to comment attachment); see API shape below.
  2. String form. A decoder must record which of the four string forms produced each decoded string: basic "...", literal '...', heredoc-basic """[LABEL]...LABEL (or """..."""), heredoc-literal '''[LABEL]...LABEL (or '''...'''). Heredoc records additionally include the label text (or its absence) and any heredoc modifiers applied (_trim(...), _fold_paragraphs(), in source order). On emit, the string is re-rendered in the same form.
  3. Comments. Every AttachedComment is re-emitted at its classified position (Leading / Inner / Trailing / Floating) on the still-present node at its path. Within a position, comments emit in their stored source order. Inner comments are emitted between the kvpair's : and its value (or between a list item's + and its content), separated from each other and from the surrounding tokens by single spaces, as in key: /* a */ /* b */ value, so that the second round-trip stays byte-stable.
  4. Key insertion order. Already a tier-0 invariant; carries through encode.
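As an illustration of point 1, here is a Python sketch of how an emitter might consult a recorded integer literal. The `render_int` helper is hypothetical. What to do with underscores and hex digit case after the value has been mutated is not stated here, so the sketch assumes the base is kept, underscores are dropped, and hex digits are emitted in upper case.

```python
def render_int(value, original_literal=None):
    """Render `value`, reusing the recorded source form where possible."""
    if original_literal is None:
        return str(value)  # no record: plain decimal
    # int(..., 0) understands 0x / 0o / 0b prefixes, signs, and underscores.
    if int(original_literal, 0) == value:
        return original_literal  # unchanged value: emit the literal verbatim
    # Value was mutated: keep only the base (assumed behaviour, see lead-in).
    unsigned = original_literal.lstrip("+-")
    prefix = unsigned[:2] if unsigned[:2] in ("0x", "0o", "0b") else ""
    digits = format(abs(value), {"0x": "X", "0o": "o", "0b": "b", "": "d"}[prefix])
    return ("-" if value < 0 else "") + prefix + digits


print(render_int(8000, "0x1F40"))  # unchanged, so the source form is reused
print(render_int(255, "0x1F40"))   # mutated, so only the hex base is kept
```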

Emitter-determined (not preserved):

Front matter: an emitter omits the +++ ... +++ block entirely when meta is empty (or absent) and no comments are attached to the front matter. Otherwise the block is emitted with its keys and any front-matter comments at their attached paths.

API shape (language-specific naming):

encode(document)            -> str    // full mode (default)
encode(document, mode=lite) -> str    // lite mode (canonical-form emit)
encode_lite(document)       -> str    // alternative lite-mode entry point

Document {
    meta, body, comments,
    original_forms              // sparse map: path -> {integer-lit | string-form-record}
}

The exact API name is language-specific — a parallel entry point (encode_lite) and a mode parameter (encode(doc, mode="lite")) are both conforming. The contract is what matters.

comments and original_forms are populated during full-mode decoding and consulted only by full-mode encode. Lite-mode encode ignores both fields. The conformance JSON encoder ignores them too.

Round-trip stability (full mode only): encode(decode(encode(decode(source)))) must produce the same string as encode(decode(source)) — i.e. the second round-trip is byte-for-byte stable. This is the test condition for the round-trip-comments fixture corpus. Lite-mode emit has no round-trip stability requirement (it's canonical-form, lossy on comments / source forms by design); it does have a data-stability requirement: decode(encode_lite(doc)) must be data-equivalent to doc.
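The stability condition can be written as a small test helper. The Python sketch below uses a toy `key: integer` codec purely as a stand-in for a real DMS decoder and emitter; `toy_decode`, `toy_encode`, and `check_round_trip_stable` are hypothetical names.

```python
def check_round_trip_stable(decode, encode, source):
    """True if the second round-trip reproduces the first one byte for byte."""
    first = encode(decode(source))
    second = encode(decode(first))
    return first == second


def toy_decode(source):
    """Stand-in decoder: parses `key: integer` lines into a dict."""
    doc = {}
    for line in source.splitlines():
        if line.strip():
            key, _, value = line.partition(":")
            doc[key.strip()] = int(value)
    return doc


def toy_encode(doc):
    """Stand-in emitter: one `key: value` line per entry, in insertion order."""
    return "".join(f"{key}: {value}\n" for key, value in doc.items())


source = "a:   1\nb: 2\n"
print(check_round_trip_stable(toy_decode, toy_encode, source))
```

Note that the first emit need not equal the source (the toy emitter normalizes the spacing after `a:`); only the first and second emits are compared.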

Line comments

Two interchangeable forms, both running from their opener to the end of the line:

Either form must be preceded by whitespace or start the line; key:5#x and key:5//x are parse errors (the 5 runs into the sigil). Mixing styles in one document is allowed — this is a spec-level convenience, not a style rule. Linters may pick a canonical form.

No leading space is required after the sigil. #foo and # foo are both valid line comments; //foo and // foo are both valid. Style preferences (e.g., requiring a space) are a linter concern, not a decoder one.

# full-line comment
// also a full-line comment
port: 5432   # trailing comment
port: 5432   // same
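A rough Python sketch of the "preceded by whitespace or start of line" rule. It is deliberately simplified: it assumes a single line with no heredoc opener and no block comment, and it only skips over basic and literal strings. The `split_line_comment` name is hypothetical.

```python
def split_line_comment(line):
    """Return (code, comment); comment is None when the line has none.

    Simplifying assumptions: no heredocs and no block comments on the line.
    """
    quote = None
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if quote == '"' and ch == "\\":
                i += 2  # skip the escaped character inside a basic string
                continue
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == "#" or line.startswith("//", i):
            if i > 0 and not line[i - 1].isspace():
                raise ValueError("comment sigil must follow whitespace")
            return line[:i].rstrip(), line[i:]
        i += 1
    return line.rstrip(), None


print(split_line_comment("port: 5432   # trailing comment"))
print(split_line_comment('url: "http://example.com/#top"'))
```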

Block comments

Three forms, all equivalent in semantics (contents are discarded; no decoding happens inside):

Form                  Terminator
/* ... */             next unmatched */ (nested /* */ allowed)
### ... ###           line whose trimmed content equals ###
###LABEL ... LABEL    line whose trimmed content equals LABEL

/* ... */ (C-style). Nesting is supported: every /* opens a new level and every */ closes the innermost. The decoder only returns to code mode when the nesting count reaches zero. Can appear inline or span multiple lines:

port: /* inline */ 8080
/* this /* nested */ is fine */
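The nesting-count rule can be sketched in Python as follows. The `skip_c_block` name is hypothetical; the function returns the index just past the closing `*/` that brings the depth back to zero.

```python
def skip_c_block(text, start):
    """Return the index just after the /* ... */ comment opening at `start`."""
    if not text.startswith("/*", start):
        raise ValueError("no block comment opener at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1  # every opener adds a nesting level
            i += 2
        elif text.startswith("*/", i):
            depth -= 1  # every closer ends the innermost level
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")


line = "/* this /* nested */ is fine */ x"
print(repr(line[skip_c_block(line, 0):]))
```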

###LABEL and ### (heredoc-shaped). Opener and terminator must each be on their own line (trimmed to match). Useful when content contains unbalanced */ — the labeled form lets you pick any terminator word. The label follows heredoc label rules ([A-Za-z_][A-Za-z0-9_]*) and sits directly after ### with no whitespace.

###NOTE
  The alerts below are owned by the SRE team.
  Escalation policy: see runbook.md.
  Any text, any syntax, no escaping — even raw */ survives.
  NOTE

###
  Short block comment, no label. Closed by another ### line.
  ###

/*
  Same idea, C-shape. Nests with /* ... */.
*/

alerts:
  primary: "pager"

A block comment may appear anywhere a blank line or a line comment would be valid.
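For the heredoc-shaped forms, terminator matching is a trimmed-line comparison. A minimal Python sketch, assuming the input has already been split into lines; `skip_hash_block` is a hypothetical helper name.

```python
import re

# "###" optionally followed directly by a label, with nothing else on the line.
_OPENER = re.compile(r"###([A-Za-z_][A-Za-z0-9_]*)?\Z")


def skip_hash_block(lines, start):
    """Return the index of the first line after the block opened at `start`."""
    match = _OPENER.match(lines[start].strip())
    if match is None:
        raise ValueError("not a ### block comment opener")
    terminator = match.group(1) or "###"
    for i in range(start + 1, len(lines)):
        if lines[i].strip() == terminator:
            return i + 1
    raise ValueError("unterminated block comment")


lines = ["###NOTE", "  even raw */ survives", "  NOTE", "alerts:"]
print(lines[skip_hash_block(lines, 0)])
```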

Types

Strings

basic:   "hello\tworld"     # escapes processed
literal: 'C:\Users\ada'     # no escapes, every character taken literally

Escape sequences (basic strings)

Inside "..." basic strings, the following backslash escapes are recognized. Any other backslash sequence is a parse error.

Escape       Decoded character                           Unicode scalar
\"           quotation mark                              U+0022
\\           reverse solidus (backslash)                 U+005C
\b           backspace                                   U+0008
\f           form feed                                   U+000C
\n           line feed (LF)                              U+000A
\r           carriage return                             U+000D
\t           character tabulation (tab)                  U+0009
\uXXXX       Unicode scalar U+XXXX (BMP scalars only)    U+0000..U+FFFF, excluding U+D800..U+DFFF
\UXXXXXXXX   Unicode scalar U+XXXXXXXX (full range)      U+0000..U+10FFFF, excluding U+D800..U+DFFF
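The table translates fairly directly into a decoder loop. The Python sketch below works on the text between the quotes of a basic string. It assumes hex digits are accepted in either case and leaves out NFC normalization and detection of unescaped quotes; `decode_basic` is a hypothetical name.

```python
import string

_SIMPLE = {'"': '"', "\\": "\\", "b": "\b", "f": "\f",
           "n": "\n", "r": "\r", "t": "\t"}
_HEX_WIDTH = {"u": 4, "U": 8}


def decode_basic(body):
    """Decode the escapes in the body of a basic string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i + 1]
        if esc in _SIMPLE:
            out.append(_SIMPLE[esc])
            i += 2
        elif esc in _HEX_WIDTH:
            width = _HEX_WIDTH[esc]
            digits = body[i + 2:i + 2 + width]
            if len(digits) != width or any(d not in string.hexdigits for d in digits):
                raise ValueError("malformed Unicode escape")
            codepoint = int(digits, 16)
            # Surrogates and values above U+10FFFF are not Unicode scalars.
            if 0xD800 <= codepoint <= 0xDFFF or codepoint > 0x10FFFF:
                raise ValueError("escape is not a Unicode scalar value")
            out.append(chr(codepoint))
            i += 2 + width
        else:
            raise ValueError("unknown escape sequence")
    return "".join(out)


print(repr(decode_basic(r"hello\tworld")))
```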

Rules and edge cases:

Literal strings ('...') process no escapes — every character between the delimiters is taken as-is, then the resulting scalar sequence is NFC-normalized like any other DMS string (see Unicode normalization). Literal strings cannot contain their own delimiter ('); use a heredoc for content that mixes both quote kinds.

Strings never span lines. For multi-line text, use a heredoc.

Heredocs

Two quote flavors, optional label. Four forms total:

Form        Escapes   Terminator (trimmed line content equals)
"""LABEL    yes       LABEL
"""         yes       """
'''LABEL    no        LABEL
'''         no        '''

config:
  long_text: """EOF
    line one
    line two
EOF                 # ← terminator at column 0, but the parent
  next_key: 1       #   block at indent 2 continues uninterrupted

This pattern is the canonical way to embed left-aligned multi-line text from inside an indented context: place the terminator at the outermost column you want stripped (often column 0), and the body's indentation is preserved verbatim.

Examples (each row is a heredoc body after indent-strip):

Content lines                            Value
["line 0"]                               "line 0"
["line 0", ""]                           "line 0\n"
["line 0", "line 1"]                     "line 0\nline 1"
["line 0", "", "line 1"]                 "line 0\n\nline 1"
["line 0", "line 1", "", "", ""]         "line 0\nline 1\n\n\n"
[] (heredoc with no content lines)       ""
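Every row above is consistent with joining the content lines with a single LF and appending nothing after the last line. A short Python check of that reading (the helper name `heredoc_value` is hypothetical):

```python
def heredoc_value(content_lines):
    """Join indent-stripped content lines with LF, with no added terminator."""
    return "\n".join(content_lines)


rows = [
    (["line 0"], "line 0"),
    (["line 0", ""], "line 0\n"),
    (["line 0", "line 1"], "line 0\nline 1"),
    (["line 0", "", "line 1"], "line 0\n\nline 1"),
    (["line 0", "line 1", "", "", ""], "line 0\nline 1\n\n\n"),
    ([], ""),
]
for content_lines, expected in rows:
    assert heredoc_value(content_lines) == expected
```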

Labels allow content to contain the triple-quote sequence. Without a label, the first line whose trimmed content is the triple-quote itself closes the heredoc — a simpler form for content that can't contain """ or '''.

Line continuation (escape-on only)

In a """ heredoc, a \ that is the last non-whitespace character on a line is a line continuation. The backslash, any trailing whitespace on its line, the line terminator, and any leading whitespace on the next non-blank line are all consumed — the two lines splice into one. Applies before modifiers run.

prose: """EOF
    The quick brown \
        fox jumps over \
        the lazy dog.
    EOF
# → "The quick brown fox jumps over the lazy dog."

This gives per-line author control over where a newline is kept versus chomped, which the global modifiers (_fold_paragraphs, _trim) can't do at the character level.
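The splice rule can be sketched in Python on the indent-stripped lines. This is an illustration of the rule as stated in this section only: how a line ending in an escaped backslash (`\\`) is treated is not covered here, and the sketch does not handle it. `splice_continuations` is a hypothetical name.

```python
def splice_continuations(lines):
    """Join each line ending in a backslash with the next non-blank line."""
    out = []
    i = 0
    while i < len(lines):
        current = lines[i]
        i += 1
        while current.rstrip().endswith("\\"):
            # Drop trailing whitespace and the backslash itself.
            current = current.rstrip()[:-1]
            # Skip blank lines, then consume the next line's leading whitespace.
            while i < len(lines) and lines[i].strip() == "":
                i += 1
            if i == len(lines):
                break
            current += lines[i].lstrip()
            i += 1
        out.append(current)
    return out


body = [
    "The quick brown \\",
    "    fox jumps over \\",
    "    the lazy dog.",
]
print(splice_continuations(body))
```

Whitespace before the backslash is kept, which is what preserves the single space between "brown" and "fox" in the example.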

Details and edge cases:

Modifiers

One or more modifiers may follow the opener (and label, if present), separated by whitespace. Every modifier is written in function-call form: an identifier followed by a parenthesized argument list. The parentheses are required even when the argument list is empty.

Modifiers work with both labeled and unlabeled heredocs. Whitespace rules (consistent with the rest of the language):

Example:

sql: """EOF _trim("\n", ">")
    SELECT id, name
    FROM users
    EOF
# → "SELECT id, name\nFROM users"  (trailing newline stripped)

Disambiguation follows from those two rules combined with the "modifiers always have parens" rule: a bare identifier touching the opener is a label; an identifier with (...) preceded by whitespace is a modifier. """foo() (no whitespace, has parens) is a parse error; write """ foo() if you mean modifier-with-no-label, or """foo bar() if you mean label-plus-modifier.

Modifiers run left-to-right, after indent-strip, before the value is returned. Each modifier operates on whatever the previous step produced.

The two standard modifiers

Modifier Effect
_trim(chars, where, replacement = "") Find runs of characters matching chars, at positions selected by where, and replace each run with replacement. The Swiss-army content shaper — covers strip, chomp, and interior replace.
_fold_paragraphs() Collapse runs of non-blank lines within a paragraph into space-joined single lines; blank-line paragraph breaks stay as a single \n.

Unknown modifiers are a parse error. Argument types must match the signatures; wrong types error at decode time (e.g. _trim(42, "*")).

_trim(chars, where, replacement = "")

Run collapse. A consecutive run of matching characters becomes one replacement, not one per char. So _trim("\n", "*", ", ") applied to "a\n\nb" produces "a, b", not "a, , b".

Common recipes:

Intent                                             Call
Strip trailing newlines                            _trim("\n", ">")
Ensure exactly one trailing newline                _trim("\n", ">", "\n")
Ensure exactly three trailing newlines             _trim("\n", ">", "\n\n\n")
Join all lines with a comma-space                  _trim("\n", "*", ", ")
Concatenate lines (no separator)                   _trim("\n", "*")
Full whitespace-trim (both ends)                   _trim(" \t\n\r", "<>")
Trim leading whitespace only                       _trim(" \t\n\r", "<")
Strip per-line indentation remnants                _trim(" \t", "|")
Tabs → spaces everywhere                           _trim("\t", "*", " ")
Collapse all runs of whitespace to single space    _trim(" \t\n\r", "*", " ")

The third argument is required only when you want to replace with something other than empty. All of the strip-style uses can omit it.
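A Python sketch of the run-replacement behaviour for four of the `where` selectors used in the recipes: `"<"` (leading), `">"` (trailing), `"<>"` (both ends), and `"*"` (everywhere). The reading of each selector is inferred from the recipe names, so treat it as an assumption. The per-line selector from the "Strip per-line indentation remnants" recipe is left out because its exact semantics are not given here. `trim` is a hypothetical name.

```python
def trim(text, chars, where, replacement=""):
    """Replace runs of characters from `chars` at the selected positions."""
    if where not in ("<", ">", "<>", "*"):
        raise ValueError("selector not covered by this sketch")
    charset = set(chars)
    if where == "*":
        out = []
        in_run = False
        for ch in text:
            if ch in charset:
                if not in_run:
                    out.append(replacement)  # one replacement per run
                    in_run = True
            else:
                out.append(ch)
                in_run = False
        return "".join(out)
    start, end = 0, len(text)
    prefix = suffix = ""
    if "<" in where:
        while start < end and text[start] in charset:
            start += 1
        if start > 0:
            prefix = replacement
    if ">" in where:
        while end > start and text[end - 1] in charset:
            end -= 1
        if end < len(text):
            suffix = replacement
    return prefix + text[start:end] + suffix


print(repr(trim("a\n\nb", "\n", "*", ", ")))
print(repr(trim("SELECT id\n\n", "\n", ">")))
```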

_fold_paragraphs()

Collapses non-blank-line runs into space-joined single lines, preserving blank-line paragraph breaks as single \ns. Not expressible as trim because it's a structural transform on paragraphs (two-level: lines within paragraphs, paragraphs within a body), not character-level replacement. Takes no arguments. Fine to combine with _trim(...).

Mapping to YAML block scalars — every YAML block scalar mode is expressible as a modifier combination:

DMS                                          YAML
(default: no modifier needed)                |+
_trim("\n", ">", "\n")                       |
_trim("\n", ">")                             |-
_fold_paragraphs()                           >+
_fold_paragraphs() _trim("\n", ">", "\n")    >
_fold_paragraphs() _trim("\n", ">")          >-

Modifiers stack, applied left-to-right:

csv: """EOF _trim("\n", "*", ", ") _trim(" \t", "<>")
    alpha
    beta
    gamma
    EOF
# → "alpha, beta, gamma"

summary: """EOF _fold_paragraphs() _trim("\n", ">", "\n")
    First paragraph line one
    first paragraph line two.

    Second paragraph line one
    second paragraph line two.
    EOF
# → "First paragraph line one first paragraph line two.\nSecond paragraph line one second paragraph line two.\n"

Escape hatch for triple-quote content

If the body may contain a line whose trimmed content is the triple-quote opener, the unlabeled form will close early. Use the labeled form instead — labels exist precisely for this case:

doc: """END
    my_string = """
    """
    END

The two forms differ in their fallback when labels aren't used:

sql: """END
    SELECT id, name
    FROM users
    WHERE active = true
    END

regex: '''
    ^\d{4}-\d{2}-\d{2}$
    '''

note: """
    First paragraph.

    Second paragraph after a blank line.
    """

ascii_art: '''
 /\_/\
( o.o )
 > ^ <
'''

In the last example, the terminator sits at column 0, so strip depth is zero and every leading space in the art is preserved.

Integers

dec: 1_000_000          # underscores allowed between digits
hex: 0xDEAD_BEEF
oct: 0o755
bin: 0b1010_0110
neg: -42

64-bit signed; values outside [-2^63, 2^63 - 1] are a parse error. Leading zeros on decimal literals are a parse error (reserved for future use). Underscores must be between two digits (never at the start, end, or adjacent to the base prefix or sign).
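The integer rules can be sketched as a strict parser in Python. A few details are assumptions because this paragraph does not settle them: a sign is accepted before any base prefix, hex digits are accepted in either case, and only lowercase prefixes are recognized. `parse_integer` is a hypothetical name.

```python
import re

_BASES = {"": (10, "0-9"), "0x": (16, "0-9A-Fa-f"), "0o": (8, "0-7"), "0b": (2, "01")}
_SHAPE = re.compile(r"([+-]?)(0x|0o|0b)?(.*)", re.DOTALL)


def parse_integer(literal):
    """Parse a DMS-style integer literal or raise ValueError."""
    sign, prefix, digits = _SHAPE.fullmatch(literal).groups()
    prefix = prefix or ""
    base, cls = _BASES[prefix]
    # Underscores may only sit between two digits.
    if not re.fullmatch(rf"[{cls}]+(?:_[{cls}]+)*", digits):
        raise ValueError("malformed integer literal")
    plain = digits.replace("_", "")
    if prefix == "" and len(plain) > 1 and plain[0] == "0":
        raise ValueError("leading zeros are not allowed on decimal literals")
    value = int(plain, base)
    if sign == "-":
        value = -value
    if not -2**63 <= value <= 2**63 - 1:
        raise ValueError("integer out of 64-bit signed range")
    return value


print(parse_integer("1_000_000"), parse_integer("0o755"), parse_integer("-42"))
```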

Floats

pi:    3.14159
avog:  6.022e23
small: 1.5e-10
inf_p: +inf
inf_n: -inf
nan:   nan

IEEE 754 binary64. inf and nan are keywords, not identifiers.

Decimal floats require at least one digit on each side of the decimal point. 1. and .5 are parse errors; write 1.0 and 0.5. The exponent (e/E) is optional; if present, it is a decimal integer with an optional sign (6.022e23 and 1.5e-10 are both valid).

Non-decimal floats use 0x / 0o / 0b prefixes and a binary exponent marker p (mandatory — it's what distinguishes a non-decimal float from a non-decimal integer). The value is mantissa × 2^exponent, where the mantissa is the base-N number's value.

hex_f:   0x1.8p3          # 1.5 × 2^3 = 12.0
hex_int: 0xFp0            # 15.0  (no dot, but p makes it a float)
oct_f:   0o1.4p3          # (1 + 4/8) × 2^3 = 12.0
bin_f:   0b1.1p3          # 1.5 × 2^3 = 12.0
neg_e:   0x1p-3           # 0.125

The digit-on-both-sides rule applies to every form: 0x1.p3 and 0x.8p3 are parse errors. Underscore separators are allowed in the mantissa but not in the exponent.
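The mantissa × 2^exponent rule can be checked with a small Python sketch. It assumes lowercase prefixes and a lowercase `p`, accepts hex digits in either case, and accumulates the fraction digit by digit, which is exact for short mantissas like the examples but is not guaranteed to be correctly rounded for very long ones. `parse_nondecimal_float` is a hypothetical name.

```python
import math
import re

_BASES = {"0x": (16, "0-9A-Fa-f"), "0o": (8, "0-7"), "0b": (2, "01")}
_SHAPE = re.compile(r"([+-]?)(0x|0o|0b)([^p]*)p([+-]?[0-9]+)")


def parse_nondecimal_float(literal):
    """Parse literals such as 0x1.8p3, 0o1.4p3, or 0b1.1p3."""
    match = _SHAPE.fullmatch(literal)
    if match is None:
        raise ValueError("not a non-decimal float literal")
    sign, prefix, mantissa, exponent = match.groups()
    base, cls = _BASES[prefix]
    group = rf"[{cls}]+(?:_[{cls}]+)*"
    # Digits are required on both sides of the point when a point is present.
    if not re.fullmatch(rf"{group}(?:\.{group})?", mantissa):
        raise ValueError("malformed mantissa")
    int_part, _, frac_part = mantissa.replace("_", "").partition(".")
    value = float(int(int_part, base))
    for position, digit in enumerate(frac_part, start=1):
        value += int(digit, base) / base**position
    value = math.ldexp(value, int(exponent))  # value * 2**exponent
    return -value if sign == "-" else value


for literal in ("0x1.8p3", "0xFp0", "0o1.4p3", "0b1.1p3", "0x1p-3"):
    print(literal, parse_nondecimal_float(literal))
```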

Round-trip. A decoder-emitter pair is required to preserve the value of +inf, -inf, and nan across a decode-then-emit cycle, though an emitter may normalize the spelling (e.g. always emit nan, never NaN). Finite values round-trip to the shortest decimal literal that produces the same binary64.

Booleans

enabled: true
debug:   false

Date & time

RFC 3339 / ISO 8601 subset, four distinct types:

offset_dt: 1979-05-27T07:32:00-08:00   # offset datetime
local_dt:  1979-05-27T07:32:00         # local datetime
local_d:   1979-05-27                  # local date
local_t:   07:32:00.999                # local time

The date/time separator is uppercase T only. RFC 3339 section 5.6 allows a lowercase t and a single space as alternates; DMS rejects both to keep emitted output canonical. Lowercase t is a parse error; a space between date and time is also a parse error (since a date followed by whitespace is itself a complete local_d value).

Fractional seconds are optional and limited to nanosecond precision (up to 9 digits after the decimal point). More digits are a parse error rather than silently truncated, so the written precision always matches the stored value.
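A Python sketch that classifies a literal into one of the four types using only the lexical rules stated above (uppercase T, at most 9 fractional digits). It does not validate calendar ranges such as month 13, and it accepts `Z` as an offset in addition to `±HH:MM`, which is an assumption carried over from RFC 3339 rather than something this section states. `classify` is a hypothetical name.

```python
import re

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]{1,9})?"
_OFFSET = r"(?:Z|[+-][0-9]{2}:[0-9]{2})"  # "Z" is an assumption, see above

_FORMS = [
    ("offset datetime", re.compile(_DATE + "T" + _TIME + _OFFSET)),
    ("local datetime", re.compile(_DATE + "T" + _TIME)),
    ("local date", re.compile(_DATE)),
    ("local time", re.compile(_TIME)),
]


def classify(literal):
    """Return the name of the date/time type, or raise ValueError."""
    for name, pattern in _FORMS:
        if pattern.fullmatch(literal):
            return name
    raise ValueError("not a valid date/time literal")


for literal in ("1979-05-27T07:32:00-08:00", "1979-05-27T07:32:00",
                "1979-05-27", "07:32:00.999"):
    print(literal, "is a", classify(literal))
```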

Arrays (flow form)

For inline or compact use. Block form uses + items (see Lists).

empty:  []                              # the empty list
ints:   [1, 2, 3]
mixed:  [1, "two", 3.0, true]           # heterogeneous allowed
nested: [[1, 2], [3, 4]]                # arrays of arrays
tables: [{x: 1}, {x: 2}, {x: 3}]        # arrays of tables
# multi-line form; a trailing comma is ok
multi: [
    "first",
    "second",
    "third",
]

Flow arrays may span multiple lines. Between [ and the matching ], whitespace (including newlines and any indentation) is insignificant — the outer indent rule is suspended inside a flow form.

Flow tables

empty:    {}                                       # the empty table
point:    { x: 1, y: 2 }
user:     { name: "ada", email: "ada@example.com" }
quoted:   { "with space": 1, plain: 2 }            # quoted and bare keys mix
matrix:   { rows: [[1, 2], [3, 4]], cols: 2 }      # tables containing arrays
nested:   { outer: { inner: { deep: true } } }     # tables of tables
trailing: { a: 1, b: 2, }                          # trailing comma ok
multi: {
    name:  "ada",
    email: "ada@example.com",
    role:  "admin",
}

Flow tables may span multiple lines. Between { and the matching }, whitespace (including newlines and any indentation) is insignificant — the outer indent rule is suspended inside a flow form.

Keys in a flow table must be unique within that table — the same rule that applies to block tables. Repeating a key is a parse error.

Flow forms — canonical multi-line layout

Multi-line flow is permissive on the decode side: any whitespace between the brackets is accepted. On the encode side, every conforming port emits multi-line flow forms in one canonical layout, so two ports producing the same value tree always emit byte-identical output.

The close-bracket anchors the indent.

# array, scalar root
xs: [
  "first",
  "second",
  "third",
]

# table, scalar root
user: {
  name:  "ada",
  email: "ada@example.com",
  role:  "admin",
}

# nested: closing bracket anchors at each level's opener
config: {
  servers: [
    {
      name: "web1",
      port: 8080,
    },
    {
      name: "web2",
      port: 8081,
    },
  ],
}

When to break to multi-line. Encoders SHOULD emit multi-line form when:

  1. The single-line rendering would exceed the configured line-width threshold (port default: 80 chars; user-configurable), OR
  2. The form contains a heredoc or another multi-line member whose own rendering spans multiple lines.

Encoders MAY emit multi-line for other reasons (e.g., user-set always_multiline_above: 3 for tables of size ≥ 3) — the rule only specifies layout when multi-line is chosen, not the break threshold.

Encoders that canonicalize non-empty flow values to block form (tier-0's typical strategy) never emit multi-line flow for non-empty cases and so do not exercise the layout rule. The rule binds when an encoder genuinely emits a multi-line [...] / {...} — most often for tier-1 decorator-call parens (which have no block-form alternative; see TIER1.md).

What flow forms cannot contain

Flow forms are restricted to inline values. The following are decode errors inside [ ... ] or { ... }:

Nested flow values

Flow arrays and flow tables compose freely — any value position accepts another flow array, flow table, or any scalar. There is no depth limit.

Grammar (sketch)

The grammar is indentation-sensitive; the token stream includes synthetic INDENT and DEDENT tokens produced by the lexer according to the indent rule above.

document     = { trivia } ( table_root | list_root | scalar_root | empty ) ;
trivia       = blank_line | line_comment | block_comment ;
table_root   = kvpair { trivia | kvpair } ;
list_root    = list_item { trivia | list_item } ;
scalar_root  = ( inline_value | heredoc_ref ) { trivia } ;
empty        = { trivia } ;

block_item   = kvpair | list_item | line_comment | block_comment ;
kvpair       = key ":" ( inline_value | heredoc_ref | child_block ) ;
list_item    = "+" ( inline_value | heredoc_ref | kvpair { INDENT kvpair DEDENT } | child_block ) ;
child_block  = NEWLINE INDENT block_item { block_item } DEDENT ;

key          = bare_key | basic_string | literal_string ;
bare_key     = ( xid_cont_safe | "_" | "-" )+ ;
             (* xid_cont_safe = any character in                    *)
             (*   XID_Continue \ Default_Ignorable_Code_Point       *)
             (*   per Unicode 15.1+ derived properties              *)

inline_value = string | integer | float | boolean | datetime
             | flow_array | flow_table ;

heredoc_ref  = ( '"""' | "'''" ) [ label ] { WS modifier } ;
modifier     = ident "(" [ inline_value { "," inline_value } ] ")" ;
ident        = ( letter | "_" ) { letter | digit | "_" } ;
label        = ident ;

flow_array   = "[" [ inline_value { "," inline_value } [ "," ] ] "]" ;
flow_table   = "{" [ flow_kv { "," flow_kv } [ "," ] ] "}" ;
flow_kv      = key ":" inline_value ;
             (* Inside flow_array / flow_table, all whitespace including *)
             (* newlines is insignificant; the outer indent rule is      *)
             (* suspended between the opening and closing bracket/brace. *)
             (* inline_value excludes heredoc_ref, and no comment        *)
             (* production appears inside flow forms — both are decode    *)
             (* errors. See "What flow forms cannot contain".            *)

line_comment  = ( "#" | "//" ) { any_char_except_newline } ;
block_comment = hash_block | c_block ;
hash_block    = "###" [ label ] NEWLINE { any_line } terminator_line ;
c_block       = "/*" { any_char | c_block } "*/" ;   (* nested *)

(String/integer/float/datetime productions follow the rules described above.)

Example document

# Server config
title:   "production"
updated: 2026-04-22T10:00:00-04:00

database:
  host: "db.internal"
  port: 5432
  pool:
    size: 10
  dsn: """END
      host=db.internal
      port=5432
      sslmode=require
      END

servers:
  + name: "web1"
    ipv4: "10.0.0.1"
    disks:
      + mount: "/"
        size_gb: 100
      + mount: "/var"
        size_gb: 500
  + name: "web2"
    ipv4: "10.0.0.2"

###NOTE
  The feature flags below are owned by the growth team.
  See team-growth/flags.md before changing.
  NOTE
features:
  enabled: ["auth", "billing", "search"]
  limits:  { rpm: 1000, burst: 50 }

Design non-goals

Decisions taken that intentionally don't appear in the spec.