DMS — Data Meta Syntax

Version: 0.14 (draft)

file extension: .dms

A data syntax with YAML's clean look and a small, strict spec. Structure is indent-based (no repeated section headers). Types are distinct and never inferred from context. Heredocs and multi-line comments are first-class.

Design principles

Indent-based structure, but a tiny indent rule (no YAML-style complexity).
Strict, distinct types — never infer from context.
No anchors, aliases, tags, merge keys, schemas, or references.
Every value has exactly one canonical representation.
UTF-8 only, NFC-normalized. LF or CRLF line endings, both accepted, LF canonical.

Lexical

Indentation: spaces only. Tabs are banned in structural indent (they may still appear inside string values). A hard tab at the start of any non-heredoc-body line is a parse error.
Line terminator: LF or CRLF.
Whitespace inside a line: space (U+0020) or tab (U+0009), not significant except where explicitly required (e.g. after :, after +).
Case sensitivity: keys, keywords (true, false, inf, nan), and heredoc labels are case-sensitive.
Reserved decorator sigils: the characters

! @ $ % ^ & * | ~ ` . , > < ? ; =

are reserved as decorator sigils at line-start position. A body line whose first non-whitespace character is one of these is a parse error in tier 0. The reservation is fixed in this spec — it is not derived from any per-document declaration. Tier 1 (defined in TIER1.md) binds these sigils to dialect-published decorator families via the _dms_imports front-matter field; tier-0-only decoders reject tier-1 documents at front-matter decode (_dms_tier: 1 triggers rejection on a tier-0-only decoder by the tier-marker rule below). The underscore (_) is not a reserved decorator sigil here — it has its own category, reserved for core / built-in decorators (e.g. heredoc modifiers _trim, _fold_paragraphs).

Three other punctuation characters that might look like candidates for this set are explicitly not reserved, because they already carry tier-0 grammar:

/ opens line comments (//) and block comments (/* */).
- is a member of the bare-key character set (-key: 1 is a valid kvpair) and prefixes negative numeric scalar roots (-5).
_ is the core-decorator prefix (see above).

The reservation cost in tier 0 is zero: none of the seventeen reserved-sigil characters are members of the bare-key character set (§Keys and scalars), and none can appear as the first non-whitespace character of any other valid tier-0 construct, so no pre-existing valid document is invalidated. Decoders that previously accepted such lines by oversight must reject them after this spec revision.
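The line-start check described above can be sketched in a few lines of Python. This is an illustration only, not part of the spec: the name first_reserved_sigil is hypothetical, and the sketch assumes the caller has already excluded heredoc body lines and the interior of block comments.

```python
from typing import Optional

# The seventeen reserved decorator sigils listed in this section.
RESERVED_SIGILS = frozenset("!@$%^&*|~`.,><?;=")


def first_reserved_sigil(line: str) -> Optional[str]:
    """Return the reserved sigil that starts this body line, or None.

    Assumption: `line` is a body line outside heredocs and block comments.
    """
    stripped = line.lstrip(" \t")
    if stripped and stripped[0] in RESERVED_SIGILS:
        return stripped[0]
    return None
```

A tier-0 decoder would turn a non-None result into a parse error. The characters /, - and _ fall through because they are not in the set.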

Reserved emoji characters: any extended grapheme cluster (per UAX #29, frozen at Unicode 15.1.0) that contains at least one codepoint from the Reserved Emoji Set is reserved in tier 0. The Reserved Emoji Set is the union of:
Extended_Pictographic=Yes (per UTS #51) — covers all pictographic emoji bases, including ZWJ-sequence components.
Regional Indicators, U+1F1E6..U+1F1FF — flag pairs.
Emoji modifiers (skin-tone), U+1F3FB..U+1F3FF.
Combining Enclosing Keycap, U+20E3 — keycap sequences.

All four sub-ranges are frozen at Unicode 15.1.0 alongside the bare-key and NFC tables. The set is closed under Unicode's monotonic-additive stability guarantees: future SPEC bumps can only add codepoints, never invalidate documents that decode cleanly under the current floor.

This selection covers every emoji renderable as a single visual glyph in current systems — single-codepoint emoji (🚀), ZWJ families (👨‍👩‍👧), skin-tone variants (👍🏽), flags (🇺🇸), and keycaps (1️⃣).

ASCII overlap is naturally excluded. Characters with Emoji=Yes in UTS #51 but not Extended_Pictographic=Yes — digits 0-9, #, * — are not in the Reserved Emoji Set and continue to carry their tier-0 grammar meaning (numeric scalars, comment marker, decorator-sigil candidate). The Extended_Pictographic property was specifically designed to exclude these ASCII overlaps; that's why we use it as the primary base.

Copyright and trademark symbols are included. © (U+00A9), ® (U+00AE), and ™ (U+2122) are classified as Extended_Pictographic=Yes by Unicode and are therefore reserved. To use them as plain-text prose at the start of a line or as a bare-key character, write them inside quotes (note: "© 2026 Acme") or a string scalar. This is a deliberate consequence of taking Unicode's classification at face value rather than carving out exceptions per-codepoint — carve-outs would require ports to track their own pictographic-vs-text taste, which would diverge over time.

Concretely, reservation applies as follows:

Banned from bare keys. No bare-key character may belong to the Reserved Emoji Set. Emoji-bearing keys must be quoted ("🚀": 1, "🇺🇸": "United States").
Banned at line-start position. A body line whose first non-whitespace extended grapheme cluster contains a Reserved-Emoji codepoint is a parse error in tier 0, on the same footing as a reserved decorator sigil. Tier 1 dialects may bind such clusters to decorator families via _dms_imports.
Banned as the first grapheme cluster of any unquoted value position — flow scalar, array element, or inline scalar value following :. Emoji in value position must be inside a quoted string or heredoc.
Allowed verbatim inside strings and heredocs. Reservation applies only to unquoted positions; name: "🚀 launch", literal strings, and heredoc bodies are unaffected.

The four sub-ranges are sourced from UCD 15.1 (emoji-data.txt for Extended_Pictographic and Emoji_Modifier; the Unicode core database for Regional Indicators and U+20E3) and shipped per port as a single frozen table, in lockstep with the bare-key and NFC tables. A port must not delegate the check to its host runtime's Unicode library — host tables track the runtime's Unicode version and would silently diverge as new emoji are assigned. Grapheme-cluster segmentation is likewise performed against a UAX #29 algorithm frozen at 15.1.0 (the algorithm itself is stable; only the property tables it consults need pinning).

Decoder error messages must name the offending codepoint by hex value and category, e.g. "U+1F680 (🚀, Extended_Pictographic) reserved as emoji", since terminals and editors render some Reserved-Emoji codepoints as monochrome text glyphs that authors may not recognize as emoji.
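A minimal Python sketch of the diagnostic format follows. It is illustrative only and rests on several assumptions: the function names are hypothetical, the caller has already segmented the offending extended grapheme cluster, the Extended_Pictographic table is a four-entry sample rather than the full frozen UCD 15.1 table a real port ships, and the category label used for U+20E3 is invented for this sketch.

```python
from typing import Optional

# Fixed ranges named directly in this section.
REGIONAL_INDICATORS = range(0x1F1E6, 0x1F1FF + 1)
EMOJI_MODIFIERS = range(0x1F3FB, 0x1F3FF + 1)
KEYCAP = 0x20E3

# ILLUSTRATIVE SAMPLE ONLY. A real port ships the complete
# Extended_Pictographic table generated from UCD 15.1 emoji-data.txt.
EXTENDED_PICTOGRAPHIC_SAMPLE = frozenset({0x00A9, 0x00AE, 0x2122, 0x1F680})


def reserved_emoji_category(cp: int) -> Optional[str]:
    """Name the Reserved Emoji Set sub-range a codepoint belongs to, if any."""
    if cp in EXTENDED_PICTOGRAPHIC_SAMPLE:
        return "Extended_Pictographic"
    if cp in REGIONAL_INDICATORS:
        return "Regional_Indicator"
    if cp in EMOJI_MODIFIERS:
        return "Emoji_Modifier"
    if cp == KEYCAP:
        return "Combining_Enclosing_Keycap"  # label is an assumption
    return None


def reserved_emoji_error(cluster: str) -> Optional[str]:
    """Build the diagnostic for the first reserved codepoint in a cluster."""
    for ch in cluster:
        category = reserved_emoji_category(ord(ch))
        if category is not None:
            return f"U+{ord(ch):04X} ({cluster}, {category}) reserved as emoji"
    return None
```

Naming the codepoint in hex keeps the message useful even when the terminal renders the cluster as a plain text glyph.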

Unicode normalization

To prevent visually-identical strings from comparing unequal — e.g. é written as the precomposed scalar U+00E9 vs. as U+0065 U+0301 (e + combining acute) — every string the decoder produces is normalized to Normalization Form C (NFC) as defined by Unicode Standard Annex #15.

This applies uniformly to:

bare keys and the inner content of quoted keys ("..." and '...');
basic strings ("..."), after escape decoding;
literal strings ('...'), even though no escapes are processed;
heredoc bodies of every form, after any escape decoding.

Normalization is applied to the source after UTF-8 decoding and before tokenization, so even structural elements like bare keys see normalized input — a bare key written as decomposed café (e + U+0301) becomes precomposed café before the bare-key category check, and is accepted. Strings produced by escape sequences (\uXXXX, \UXXXXXXXX) are additionally NFC-normalized at the point the string is constructed, since escape-decoded scalars don't pass through the source-level NFC pass. NFC does not salvage non-scalar escapes — a surrogate escape is still a parse error.

NFC is stable under the Unicode Stability Policy: for any character already assigned in Unicode 4.1 (2005) or later, its NFC form does not change in newer Unicode versions. New characters assigned in later Unicode releases get new NFC mappings, however — so a port that delegates NFC to its host runtime would normalize a Unicode 16 codepoint differently from a port frozen at 15.1.

To keep documents byte-identical across ports and across time, NFC tables are frozen at Unicode 15.1.0 and shipped with each port, exactly like the bare-key set. A port must not delegate NFC to its host runtime's Unicode library (Python's unicodedata.normalize, ICU's unorm2, etc.) — those track whichever Unicode version the runtime was built against and would silently diverge once new codepoints are assigned. The two table generations (NFC + bare-key set) are kept in lockstep on a single SPEC-controlled Unicode floor; a future SPEC bump moves both together.

Bump policy. A future SPEC version may move the Unicode floor from 15.1 to a higher release. When that happens, ports ship only the new tables — the prior floor's tables are not retained, and ports do not carry the cumulative union of historical UCD snapshots. This is safe because both properties involved are monotonic under Unicode's own stability guarantees: XID_Continue only grows (characters are never removed from the identifier sets), and NFC mappings of already-assigned characters never change. Consequently every document valid under floor N decodes byte-identically under floor N+k — a bump is purely additive (new accepted characters, new NFC entries for newly assigned codepoints) and never breaks pre-bump documents.

The duplicate-key check operates on NFC-normalized keys, so writing café twice in the same table — once precomposed, once decomposed — is a parse error rather than two distinct keys.

Round-trip. A decoder-emitter pair preserves the NFC value of every string, not the original source bytes. Non-NFC input becomes NFC on re-emit; emitters do not (and should not) reconstruct the original encoding.
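The effect of NFC on key comparison can be sketched in Python. Note the loud caveat: this sketch calls the host's unicodedata.normalize purely for illustration, which is exactly what a conforming port must not do (it must ship its own tables frozen at Unicode 15.1.0). The helper names nfc and find_duplicate_key are hypothetical.

```python
import unicodedata
from typing import Iterable, Optional


def nfc(text: str) -> str:
    # ILLUSTRATION ONLY: delegates to the host runtime's Unicode tables.
    # A conforming port uses its own frozen 15.1 NFC tables instead.
    return unicodedata.normalize("NFC", text)


def find_duplicate_key(keys: Iterable[str]) -> Optional[str]:
    """Return the first key whose NFC form repeats an earlier key, else None."""
    seen = set()
    for key in keys:
        normalized = nfc(key)
        if normalized in seen:
            return normalized
        seen.add(normalized)
    return None
```

With this check, a table that spells café once precomposed and once decomposed reports a duplicate, because both spellings normalize to the same string.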

The indent rule

Nesting is expressed by indentation. One rule:

Inside a single parent, all direct children must be indented by the same number of spaces. The first child sets the width; every subsequent sibling must match it exactly.

Different parents can pick different widths. A block ends when a line is encountered at an indent strictly less than its children's width.

a:
    b: 1        # a's children are 4-wide (first child set the width)
    c: 2        # must be 4
    d:
      e: 1      # d's children are 2-wide; independent of a's choice
      f: 2      # must be 2
g: 3            # back at the root level

Inconsistent sibling indent (e.g. b at 4 spaces then c at 3) is a decode error, and the diagnostic points at the offending column.
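The indent rule can be sketched as a stack of block widths. The Python below is a simplified illustration under stated assumptions: it handles table lines only (no list items, heredocs, or comments), it treats a line ending in : as opening a child block, and the names DmsIndentError and check_indent are hypothetical.

```python
class DmsIndentError(ValueError):
    """Indent violation carrying a 1-based line and column."""

    def __init__(self, line: int, column: int, message: str):
        super().__init__(f"line {line}, column {column}: {message}")
        self.line = line
        self.column = column


def check_indent(source: str) -> None:
    widths = [0]              # indent of each open block; the root is column 0
    prev_opens_block = False  # did the previous significant line end with ':'?
    for lineno, raw in enumerate(source.splitlines(), start=1):
        if not raw.strip():
            continue          # blank lines carry no structure
        indent = len(raw) - len(raw.lstrip(" "))
        if raw[indent] == "\t":
            raise DmsIndentError(lineno, indent + 1, "tab in structural indent")
        if indent > widths[-1]:
            if not prev_opens_block:
                raise DmsIndentError(lineno, indent + 1, "unexpected deeper indent")
            widths.append(indent)   # the first child sets the width
        else:
            while widths[-1] > indent:
                widths.pop()        # a shallower line closes inner blocks
            if widths[-1] != indent:
                raise DmsIndentError(lineno, indent + 1, "indent matches no open block")
        prev_opens_block = raw.rstrip().endswith(":")
```

Each parent gets its own entry on the stack, which is why different parents may pick different widths while siblings must agree.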

Front matter

A DMS document may begin with an optional front matter block delimited by +++ lines. If present, the block must precede any other content (blank lines, line comments, and block comments may appear before +++).

+++
app_name:     "myservice"
doc_version:  "1.2.3"
updated:      2026-04-23
+++

# the actual document body starts here
database:
  host: "db.internal"
  port: 5432

Rules:

Open/close delimiters: each +++ must appear on its own line and start at column 1 — no leading whitespace. Trailing whitespace after the +++ is permitted and ignored. Any other content on the line (a comment, a key, anything beyond whitespace) is a parse error.
Unterminated front matter is a parse error. If an opening +++ appears but no closing +++ is found before end-of-file, the decoder must reject with a clear error pointing at the opener line.
Empty front matter is allowed. A +++ \n +++ block with no content between the delimiters decodes as a present-but-empty front-matter table (encoder shape: { "_meta": {}, "_body": <body> }). This is distinct from "no front matter at all," which omits the _meta wrapper entirely.
Position: the opening +++ must be the first significant line (blank lines and comments before it are fine). A +++ line appearing after any non-comment content is a plain syntax error — it is not recognized as a front matter opener.
Contents: inside the block, content decodes as an ordinary DMS table (arbitrary nesting, all scalar types, flow/block forms, heredocs, comments — everything tier 0 supports).
Reserved prefix: every key inside the front matter that starts with _ (U+005F LOW LINE) is reserved for DMS's use. Users must not introduce their own keys with a leading _; doing so is a parse error. Unknown reserved keys (a leading-underscore key the decoder doesn't recognize) are also a parse error — this lets future DMS versions add new reserved keys and still give old decoders a clean error message.
User keys: any key not starting with _ belongs to the author. The decoder surfaces these as document metadata (see API shape below) and otherwise does not interpret them.
Front matter itself is tier 0. Every tier-0 decoder must be able to recognize the +++ delimiters, decode the contents as a table, enforce the reserved-prefix rule, and act on the reserved keys defined below.
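The delimiter rules above can be sketched as a small splitter. This Python is a minimal illustration, not a reference implementation: split_front_matter is a hypothetical name, only blank lines are treated as leading trivia (comment handling is omitted), and heredocs inside the front matter are not considered.

```python
from typing import Optional, Tuple


def _is_delimiter(line: str, lineno: int) -> bool:
    """True for a +++ line at column 1 followed only by whitespace."""
    if not line.startswith("+++"):
        return False
    if line[3:].strip(" \t"):
        raise ValueError(f"line {lineno}: content after +++ delimiter")
    return True


def split_front_matter(source: str) -> Tuple[Optional[str], str]:
    """Return (front_matter_text, body_text).

    front_matter_text is None when there is no front matter at all and ""
    when the block is present but empty, so callers can tell them apart.
    """
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            continue                     # leading blank lines are trivia
        if not _is_delimiter(line, i + 1):
            return None, source          # first significant line is not an opener
        for j in range(i + 1, len(lines)):
            if _is_delimiter(lines[j], j + 1):
                return "\n".join(lines[i + 1:j]), "\n".join(lines[j + 1:])
        raise ValueError(f"unterminated front matter opened on line {i + 1}")
    return None, source
```

The returned front-matter text would then be decoded as an ordinary table and checked against the reserved-prefix rule.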

Currently-defined reserved keys

Key	Type	Meaning
`_dms_tier`	non-negative int	Declares the minimum decoder tier required. Absent ⇒ tier 0 implied. `_dms_tier: 0` is the explicit tier-0 form. `_dms_tier: 1` opts into tier 1 (see TIER1.md) — a tier-0-only decoder rejects with the tier-1-pointing error described below; a tier-1-capable decoder accepts. `_dms_tier: N` for N ≥ 2 is a parse error in this revision (no tier ≥ 2 is currently defined). A value of any other type — string (`"0"`), float (`0.0`), bool, list, table, datetime — is a parse error: `"_dms_tier must be a non-negative integer"`.

Any other _-prefixed key inside the front matter is currently reserved but undefined; a decoder encountering one must refuse with "unknown reserved key: <name>". Reserved keys exist as a forward-compatibility hook: future versions of DMS may add new reserved keys, and old decoders will give a clean error message rather than silently misinterpreting them.

Tier semantics

A document with no front matter, or a front matter with _dms_tier absent or equal to 0, is a tier 0 document. Every conforming decoder can read it.
A document with _dms_tier: 1 is a tier 1 document (see TIER1.md). Tier-1-capable decoders accept it; tier-0-only decoders must reject with the tier-1-pointing error described in §Forward compatibility below.
A document with _dms_tier: N for N ≥ 2 is currently a parse error. The _dms_tier key remains a forward-compatibility hook for future tiers; no tier ≥ 2 is defined today.
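The tier rules, as seen by a tier-0-only decoder, can be sketched in Python. This is an illustration under assumptions: the front matter is already decoded into a dict, resolve_tier is a hypothetical name, and only the two error strings quoted in this spec are normative. The wording of the other messages is invented here.

```python
KNOWN_RESERVED_KEYS = frozenset({"_dms_tier"})  # tier-0 view only


def resolve_tier(front_matter: dict) -> int:
    """Validate _dms_tier as a tier-0-only decoder and return the tier."""
    tier = front_matter.get("_dms_tier", 0)   # absent implies tier 0
    # bool is a subclass of int in Python, so it must be excluded explicitly.
    if isinstance(tier, bool) or not isinstance(tier, int) or tier < 0:
        raise ValueError("_dms_tier must be a non-negative integer")
    if tier >= 2:
        # Message wording is an assumption; the spec only says "parse error".
        raise ValueError(f"_dms_tier: {tier} is not defined in this revision")
    if tier == 1:
        # Message wording is an assumption standing in for the tier-1 error.
        raise ValueError("document requires tier 1; this decoder is tier-0-only (see TIER1.md)")
    for key in front_matter:
        if key.startswith("_") and key not in KNOWN_RESERVED_KEYS:
            raise ValueError(f"unknown reserved key: {key}")
    return tier
```

Checking the tier before the unknown-key scan means a tier-1 document is rejected for its tier rather than for a tier-1 key this decoder does not know.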

Decoding modes — full and lite

Independent of the tier axis, DMS defines two decoding modes: a full mode (the default, shipped by every conforming decoder) and a lite mode (opt-in, and optional to ship; see Conformance below). Both modes share the same grammar, the same error diagnostics, and the same data tree — they differ only in how much round-trip metadata the decoder keeps.

Aspect	Full mode (default)	Lite mode (opt-in)
Data tree (tables, lists, scalars)	produced	produced
Front matter (`_meta`)	produced	produced
Comment AST (leading / inner / trailing / floating)	produced	not produced — comments are lexed and discarded
`original_forms` (integer base, string form, heredoc form)	produced	not produced
Full-mode `encode()` (preserving round-trip)	supported	not supported — needs comments + original_forms
Lite-mode `encode()` (canonical-form emit)	supported	supported

The grammar is identical in both modes. Lite mode does not relax error checking, does not skip front-matter validation, does not loosen Unicode normalization, does not change which inputs are accepted. It is the same decoder with two output channels turned off.

What lite mode is for. Read-only consumers — application configs, CI pipelines, deploy scripts, sysctl-style readers — that decode, extract values, and never re-emit the document. The comment-AST and original_forms machinery exists to support encode(); if you don't call encode(), those structures are dead weight. Lite mode lets read-only callers skip the bookkeeping and recover wall-clock time (reference benchmarks show roughly 1.5–2× on flat-table workloads; varies by port).

What lite mode is not. Lite mode is not a "permissive" mode, not a non-conforming subset, and not an alternative conformance level. A document that decodes in full mode decodes in lite mode and vice versa; a document that errors in full mode errors at the same character in lite mode. The two modes produce the same data tree.

encode() itself has two modes — full and lite — orthogonal to the decode-side modes. Same name pattern, different concern. The decode-side mode controls how much round-trip metadata the decoder captures. The emit-side mode controls how much of that metadata encode() re-emits.

`encode` mode (input → output)	Comments	`original_forms`	Use case
Full (default) — preserving emit	re-emitted	re-emitted (hex/oct/bin/literal-string forms)	Round-trip a decoded file, hand-edited config writer
Lite — canonical emit	dropped	dropped (decimal ints, basic-quoted strings)	Generate DMS from in-memory data; bench/strip

Lite-mode encode accepts any Document — full or lite. It ignores comments and original_forms even when present, and emits canonical form: decimal integers, basic-quoted strings, no comments, emitter-default whitespace. The output is always valid DMS that re-decodes to a data-equivalent Document.

Full-mode encode (the existing default) requires a full-mode-decoded Document (or one constructed in code with the metadata fields populated). A decoder that ships both encode modes MUST refuse encode(lite_doc, mode=full) with a clean error ("full-mode emit requires comments + original_forms; got a lite-mode Document"). encode(lite_doc, mode=lite) is always valid.
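The refusal rule can be sketched as a guard that runs before emit. The Python below is illustrative only: the Document dataclass and check_encode_allowed are hypothetical shapes, and representing "metadata not captured" as None is an assumption of this sketch rather than something the spec mandates.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Document:
    body: Any
    comments: Optional[dict] = None        # None: not captured (lite decode)
    original_forms: Optional[dict] = None  # None: not captured (lite decode)


def check_encode_allowed(doc: Document, mode: str = "full") -> None:
    """Raise if the requested encode mode cannot be served for this Document."""
    if mode == "lite":
        return  # canonical emit needs no round-trip metadata
    if mode != "full":
        raise ValueError(f"unknown encode mode: {mode!r}")
    if doc.comments is None or doc.original_forms is None:
        raise ValueError(
            "full-mode emit requires comments + original_forms; "
            "got a lite-mode Document"
        )
```

A full-mode Document with no comments in the source would carry empty containers rather than None, so it still passes the guard.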

Round-trip stability (under §encode) is required only for full-mode emit of a full-mode-decoded Document. Lite-mode emit is canonical-form lossy by design — encode_lite(decode(src)) may strip comments and re-render hex integers as decimal; that is the intended behaviour, not a violation.

Conformance. Every conforming decoder MUST ship full-mode decode AND full-mode encode. Lite-mode decode and lite-mode encode are optional to ship; decoders that ship them MUST do so under the contract above. A decoder that exposes only lite mode (either side) is non-conforming.

Capability reporting. A decoder that ships lite mode advertises it via a supports_lite_mode boolean on its capability surface. Callers can probe before opting in.

Unordered tables — optional opt-in (orthogonal axis)

A third, independent axis exists alongside full/lite: the table ordering guarantee. Tier 0 makes insertion-order preservation a default invariant — every conforming decoder ships an ordered mode and the conformance corpus is checked against it. Some consumers don't care: kubectl-style read-only loaders, monitoring agents, batch processors that consume DMS, project to a few keys, and never re-emit. For those callers, an unordered mode is allowed as an optional opt-in.

Aspect	Ordered (default)	Unordered (opt-in)
Iteration order over a decoded table	insertion-order	arbitrary — decoder may use a hash-only backing
Conformance corpus expected output	byte-stable	best-effort; equality compares structurally, not order
Full-mode `encode()` (round-trip)	supported	not supported — round-trip needs stable order
Lite-mode `encode()` (canonical)	supported	supported (emits in iteration order, no stability promise)

API shape. A decoder that ships unordered mode exposes it via a parallel entry point or a flag — decode_document_unordered(src), decode(src, ignore_order=true), etc. The exact name is language-specific. The CLI convention used by the conformance and bench harnesses is --ignore-order.

Capability reporting. A decoder that ships unordered mode advertises it via a supports_ignore_order boolean on its capability surface. Callers probe before opting in.

Combinations. Unordered is orthogonal to full/lite — all four combinations are conforming if the decoder ships them: (ordered, full), (ordered, lite), (unordered, full), (unordered, lite). The most useful pairing for read-only callers is (unordered, lite): fastest decode, no comment AST, hash-only table backing.

Reference implementation note (informational, not normative). The DMS Rust reference ships --ignore-order on the CLI surface with spec-correct semantics, but at the time of writing the runtime backing is still IndexMap-based (no measurable decode-speed win). The flag is plumbed end-to-end so other ports can implement the HashMap-backed fast path without API churn. Ports that DO swap to a hash-only backing should advertise supports_ignore_order = true.

API shape (full/lite). Language-specific. The general pattern is a construction-time option or a parallel entry point — e.g. decode(source, mode="lite") versus decode(source), or decode_lite(source) versus decode(source). The spec does not mandate the exact API name; it mandates the contract above.

Examples

Full-mode decode (default), with comments preserved:

doc = dms.decode(source)              # full mode by default
doc.comments[("db", "port")]         # leading + trailing AttachedComments
out = dms.encode(doc)                # round-trips the source

Lite-mode decode (opt-in), no comment AST:

doc = dms.decode(source, mode="lite") # comments lexed and discarded
doc.body["db"]["port"]               # data is identical to full mode
doc.comments                         # empty / absent
dms.encode(doc)                      # ERROR: round-trip requires full mode
dms.encode(doc, mode="lite")         # OK — canonical emit, no metadata needed

Lite-mode emit on a full-mode Document — strip comments + canonicalise:

doc = dms.decode(source)              # full-mode decode, comments captured
canonical = dms.encode(doc, mode="lite")
# `canonical` has no comments, decimal integers (even if source used 0xFF),
# basic-quoted strings (even if source used '...' literal). re-decodes to a
# data-equivalent Document.

API shape

A decoder must expose the decoded document as both a body value (what the rest of the spec already defines) and a front matter table (the decoded +++ block, or an empty/absent value if the document had no front matter). Exactly how is language-specific; for this implementation's conformance encoder (JSON output), the shape is:

No front matter present: encoder output is the body as tagged JSON, identical to pre-front-matter behavior.
Front matter present: encoder output is a JSON object { "_meta": <front-matter tagged>, "_body": <body tagged> }. Both subtrees use the standard tagged-JSON encoding.

This means every existing conformance test — none of which declare front matter — keeps its expected output unchanged. New tests that use front matter produce the wrapped form.
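The wrapping rule is small enough to sketch directly. In the Python below, conformance_shape is a hypothetical name, and both arguments are assumed to be already in the tagged-JSON encoding (whose details are outside this section). None stands for "no front matter at all".

```python
from typing import Any, Optional


def conformance_shape(body_tagged: Any, meta_tagged: Optional[dict]) -> Any:
    """Wrap the body only when the document had a front-matter block."""
    if meta_tagged is None:
        return body_tagged  # identical to pre-front-matter behavior
    return {"_meta": meta_tagged, "_body": body_tagged}
```

An empty dict for meta_tagged still produces the wrapped form, which keeps present-but-empty front matter distinguishable from its absence.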

Examples

Explicit tier 0 declaration:

+++
_dms_tier: 0
+++
host: "db.internal"
port: 5432

User metadata:

+++
title:   "Production config"
author:  "ada@example.com"
updated: 2026-04-23
+++
database:
  host: "db.internal"

(Front matter surfaces as metadata.)

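Tier 1 declaration (rejected by a tier-0-only decoder):

+++
_dms_tier: 1
+++
host: "db.internal"

(Tier 1 is defined in TIER1.md. A tier-1-capable decoder accepts this document; a tier-0-only decoder refuses with the tier-1-pointing error.)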

User tries a reserved key (parse error):

+++
_my_app_version: "1.0"   # error: '_'-prefixed keys are reserved
+++

Front-matter-only decode

Every conforming decoder must expose a separate entry point that decodes only the front-matter block and stops, skipping the body. This exists for callers that need only the document's metadata — config loaders checking _dms_tier, indexers harvesting user keys, dispatchers choosing a downstream decoder — and would otherwise pay the full decode cost for a few header lines.

Contract:

Input: a DMS source (string or byte buffer).
Output: the decoded front-matter table, or a language-specific empty/absent value when the document has no front matter at all (no opening +++ after trivia). Present-but-empty front matter (+++\n+++) returns an empty table — distinguishable by the caller from "no front matter".
Scope: the decoder scans leading trivia (blank lines, line and block comments), the opening +++, the front-matter contents, and the closing +++, then returns. Body bytes after the closer are not tokenized.
Validation: every front-matter rule from §Front matter still applies — open/close on their own lines, unterminated front matter is a parse error, the _-prefix namespace is enforced, _dms_tier is type-checked, unknown reserved keys are rejected. Front-matter-only decode is not a permissive mode; it is the same grammar with an early stop.
Mode: front-matter-only decode runs in lite mode — no comment AST, no original_forms inside the front matter. (Front-matter preservation through encode is a full-decode concern.)
Errors: diagnostics inside the +++ ... +++ block are byte-identical to a full decode. Errors that only manifest in the body (duplicate body keys, unterminated body heredoc) are not surfaced by this API; callers needing whole-document validation must call the full decoder.

API shape. Language-specific. Reference ports use decode_front_matter(source) / decodeFrontMatter(source) per host idiom. The CLI convention used by the conformance and bench harnesses is --front-matter-only.

Conformance. Required at tier 0. Absence of this entry point is non-conformance; there is no capability flag.

Forward compatibility

DMS evolves by reserving syntactic and lexical real estate today so future versions can extend it without breaking existing documents. Reservations fall into two groups: declared (the spec explicitly names them) and implicit (the tier-0 grammar rejects them today, leaving the slot free for a future tier to define).

Declared reservations

_dms_tier — _dms_tier: 1 opts into tier 1 (defined in TIER1.md); tier-0-only decoders refuse cleanly with the tier-1-pointing error described below. _dms_tier: N for N ≥ 2 remains reserved for future tiers (see Tier semantics).
Front-matter _-prefix keys — the entire _-prefix namespace inside +++ ... +++ is reserved for DMS. Unknown reserved keys are a parse error, so future versions can introduce directives (a merge policy, a schema reference, etc.) and old decoders surface a clean error rather than silently misreading.
Heredoc modifier names — unknown modifier identifiers are a decode error; new modifiers can land in later versions without ambiguity.
_trim where flags — unknown flag characters are silently ignored, so new flags are forward-compatible. (Inverse policy from modifier names — chosen because flags are a bag-of-chars, not identifiers.)
Lexical reservations — leading zeros on decimal integer literals, octal escape sequences (\012), and unknown backslash escapes are parse errors today, reserved against future definition.

Implicit reservations

The tier-0 grammar is strict: any token sequence not matched by an explicit production is a parse error. The following positions are currently rejected and are candidates for tier ≥ 1 extension. Tier-0 decoders must continue to reject them; a tier ≥ 1 document opts in via _dms_tier.

Post-inline-value, same line. The slot following an inline_value on a kvpair line, before the newline or trailing comment. Today port: 5432 _example() is a tier-0 parse error. Reserved as the natural attachment point for future post-value annotations (e.g. modifier-style transforms applied to non-heredoc values, of the same ident(args) shape used by heredoc modifiers).
Post-root trailing content. Non-comment, non-blank tokens following a scalar-root value, or following the final value of a table/list root. Reserved as the attachment point for future whole-document annotations.
Sigil tokens in value-positions. A reserved decorator sigil (! @ $ % ^ & * | ~ and the rest of the set listed in §Lexical) appearing where an inline_value would be expected — after key:, after +, in flow-array or flow-table element positions, at scalar root — is a tier-0 parse error. Reserved as the future location for tier-≥1 decoration prefixes.

These reservations do not commit DMS to ever populating these slots — they document where future extensions could land without breaking existing tier-0 documents.

Decoders SHOULD emit a tier-1-pointing error when they encounter a reserved decorator sigil in any reserved slot ("decorator sigil '<sigil>' requires tier 1; set _dms_tier: 1 and declare the dialect in _dms_imports — see TIER1.md"), rather than a generic "unexpected token."

Document root

A DMS document's root value is polymorphic — it can be a table, a list, or a scalar. The root type is determined by the first significant line (significant = not blank, not a line comment, not a block comment):

First significant line begins with…	Root is a …
a key followed by `:`	table
`+` (list item marker)	list
any other value token (string, number, `"""`, …)	scalar
nothing (document is empty or only comments)	empty table (`{}`)

Once the root type is committed, every subsequent top-level (column 0) line must match it:

Under a table root, every column-0 line must be a kvpair.
Under a list root, every column-0 line must be a + item.
Under a scalar root, there must be no further significant lines.

A top-level line that violates the committed root type is a parse error.
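Root-type commitment can be sketched as a classifier over the first significant line. This Python is a rough illustration under assumptions: detect_root_kind is a hypothetical name, the input is the body only (front matter and comments already removed), and the bare-key pattern is a loose stand-in for the frozen bare-key set.

```python
import re

_KEY = "|".join([
    r'"(?:[^"\\]|\\.)*"',  # double-quoted key
    r"'[^']*'",            # single-quoted key
    r"[\w-]+",             # rough stand-in for the bare-key character set
])
# A key, then ':', then a space or end of line.
_KEY_LINE = re.compile(rf"(?:{_KEY}):(?: |$)")


def detect_root_kind(body: str) -> str:
    """Classify the root as 'table', 'list', or 'scalar'."""
    for line in body.splitlines():
        text = line.strip()
        if not text:
            continue                 # blank lines are not significant
        if text == "+" or text.startswith("+ "):
            return "list"
        if _KEY_LINE.match(text):
            return "table"
        return "scalar"
    return "table"                   # an empty document is an empty table
```

Once the kind is known, a decoder would check every later column-0 line against it and reject any mismatch.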

Examples

Table root (the common case for config):

title: "production"
database:
  host: "db.internal"

List root:

+ name: "web1"
  ipv4: "10.0.0.1"
+ name: "web2"
  ipv4: "10.0.0.2"

Scalar root:

"""
A document whose entire value is a multi-line string.
"""

Keys and scalars

bare_key:     1             # bare key: letters, digits, underscore, dash
"quoted key": 2             # double-quoted: escapes processed
'quoted key': 3             # single-quoted: literal, every character as-is
résumé:       4             # Unicode letters are allowed in bare keys
42:           5             # numeric-looking bare keys are fine
"":           6             # empty string key — must be quoted

What counts as a bare key

A bare key is one or more characters, each drawn from:

ASCII letters and digits (A-Z, a-z, 0-9)
ASCII underscore (_) and dash (-)
Any character in the set XID_Continue ∖ Default_Ignorable_Code_Point ∖ Reserved-Emoji-Set (see §Lexical → Reserved emoji characters), as defined by the Unicode derived properties frozen at Unicode 15.1.0. Document encoding is UTF-8.

XID_Continue is the Unicode-standard "identifier continuation" set — letters, digits, combining marks, and a curated handful of joiners — and is what Python, Rust, and most modern languages use for identifiers. Subtracting Default_Ignorable_Code_Point removes invisibles such as zero-width joiners and variation selectors that would otherwise let two visually-identical bare keys differ in their byte content.

The accepted set is frozen, not host-derived. Each port ships its own table generated once from the UCD 15.1 data files. A port must not delegate this check to its host runtime's Unicode library (Python's str.isidentifier(), ICU, etc.), because those track whichever Unicode version the runtime was built against — meaning the set of accepted bare-key characters would silently grow over time as the host platform updates. Freezing the set at 15.1 guarantees that a document written today decodes identically a decade from now, on every port, regardless of what new code points Unicode introduces. A future SPEC version may bump the floor; until then, ports re-emit their tables only on explicit SPEC instruction.

Keys that look like numbers (42), booleans (true), or other reserved identifiers (inf, nan) are valid bare keys — the trailing : disambiguates context. Every decoded key is a string, regardless of whether it was written bare or quoted: 42: x produces the string key "42".

A bare key may consist entirely of _ and/or - characters (_:, -:, _-_: are all valid keys). The character-set rule is positional, not compositional — there is no "must contain at least one letter or digit" requirement.
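The positional nature of the rule can be sketched in Python. This is a minimal illustration with assumptions: is_bare_key is a hypothetical name, the input is assumed to be NFC-normalized already, and the non-ASCII portion of the accepted set is passed in as frozen_table (empty by default) because the real table is generated from UCD 15.1 and is not reproduced here.

```python
import string
from typing import AbstractSet

ASCII_BARE = frozenset(string.ascii_letters + string.digits + "_-")


def is_bare_key(key: str, frozen_table: AbstractSet[str] = frozenset()) -> bool:
    """True if every character is individually allowed in a bare key.

    frozen_table stands in for the port's frozen non-ASCII table
    (XID_Continue minus Default_Ignorable_Code_Point minus Reserved-Emoji-Set).
    """
    if not key:
        return False  # a bare key needs at least one character
    return all(ch in ASCII_BARE or ch in frozen_table for ch in key)
```

Because the test is per character, keys such as _-_ or 42 pass without any requirement on the overall composition.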

Quoting

An empty string key must be quoted ("" or '') — a bare key requires at least one character. Any key containing whitespace, :, #, {}, [], ", ', or . must also be quoted.

Separator whitespace

A : that terminates a key must be followed by a space (or end-of-line, if the value is a child block). host:"localhost" is a parse error; host: "localhost" is fine.

Duplicate keys

Two keys in the same table that, after decoding, produce the same string are a parse error. This rule compares the final key strings — which are NFC-normalized (see Unicode normalization) — so "42" and 42 collide, "hello" and 'hello' collide, and a key written as precomposed é collides with one written as e + U+0301.

Key order

Key insertion order is preserved. A DMS decoder must expose each table as an insertion-ordered structure so that doc-to-doc diffs are stable and round-tripping a document emits keys in the same order they were written.

Block vs scalar values

A key can take its value in one of three shapes:

Inline scalar (value fits on the same line):

port: 5432
name: "web1"

Child block (key ends with bare : and the next non-blank line is indented further):

database:
  host: "db.internal"
  port: 5432

Heredoc — a triple-quote opener (""" or ''', with an optional label) follows the :; content starts on the next line and runs until the terminator. See Heredocs below.

A key with bare : and no indented block beneath it is a parse error. Use [] or {} flow form for empty collections.

The three shapes are mutually exclusive: a key with an inline scalar (or a heredoc) cannot also have an indented child block beneath it. Source like

port: 5432
  child: 1   # parse error: inline value already given for `port`

is rejected — the decoder commits to the inline value on the port line and then sees the deeper-indented child: as illegal indent (no parent block was opened). To get a child block, drop the inline value:

port:
  child: 1

Lists

A list item is a line whose first non-whitespace character is +, followed by a space and the item's content.

Mnemonic: read + as "push this onto the list." Each + line appends one item to the enclosing list, the same way list.push(x) (JS), list.append(x) (Python), or vec.push(x) (Rust) appends to an array — and the visual column of the + plays the role of "which list" when lists are nested.

tags:
  + "web"                   # scalar item
  + "frontend"
  + "public"

servers:
  + name: "web1"            # table item: first key sits on the + line
    ipv4: "10.0.0.1"
    disks:
      + mount: "/"
        size_gb: 100
      + mount: "/var"
        size_gb: 500
  + name: "web2"
    ipv4: "10.0.0.2"

Rules:

Sibling + markers must be at the same column (they are direct children of the same parent; the indent rule applies).
A table item's first key sits after + on the same line; sibling keys of that item must align with that first key's column, not with the +.
A list item may also be empty-on-same-line and open a nested block on the next line:

matrix:
  +
    + 1
    + 2
  +
    + 3
    + 4

Comments

Required of every conforming DMS decoder. This is not a "preserving variant" of full mode — comment-AST attachment and round-trip preservation are part of the format definition. A decoder that drops comments in full mode, or that decodes them but can't reproduce them via encode (see §encode), does not conform. Lite mode (see §Decoding modes — full and lite) is an explicit, opt-in alternative that discards comments by design; that is a documented mode, not a preservation failure.

DMS preserves comments through decoding. Every comment in the source — line comment (# or //), C-style block comment (/* … */, nestable), or hash-block comment (###LABEL … LABEL, ### … ###) — is captured during decoding and attached to the nearest neighbor in the value tree as a first-class AST node. This is what makes a decode → modify → re-encode round-trip keep comments at the right places in the output.

The contrast with most config formats is the load-bearing piece of DMS's design:

JSON has no comments at all.
TOML and YAML specs allow comments, but every mainstream decoder (tomli, tomlc99, toml-rs, @iarna/toml, BurntSushi/toml, PyYAML/CSafeLoader, libyaml, yaml.v3, js-yaml, YAML::XS) treats them as lexer trivia and discards them during decoding. The data tree exposed to your application has no record of where comments lived. Re-emitting that tree therefore can't reproduce the comments — they're gone the moment you decoded.
Some libraries preserve comments via a separate value type (ruamel.yaml for Python YAML, the toml-edit crate for Rust). These are language-specific add-ons that sit alongside the "normal" decoder; they're slower and incompatible with the ecosystem's main toolchain.

DMS bakes preservation into the format itself. Every reference decoder (Rust, Go, C, Zig, Python, Node, Perl pure, Perl XS) returns a Document whose comments are tracked at the same level as the data, by spec.

The rest of this section defines the attachment rules and round-trip contract. See §encode for the emitter side of the round-trip.

Attachment rules

When the decoder encounters a comment, it attaches it to a value / kvpair / list-item / container node according to these rules.

Leading comment. A line whose only significant content is a comment, immediately preceding a kvpair, list item, or block opener, attaches to that following node as a leading comment — provided there is no blank line between the comment and the node. Multiple leading comments stack on the same node, in source order:

# server pool                    ← leading
# updated 2026-04-22             ← leading (also)
servers:
  + name: "web1"

Trailing comments. Comments on the same line as a value, after that value, attach to the value as trailing comments. Multiple trailing comments stack in source order — possible because a /* ... */ block comment doesn't terminate the line. A # or // line comment, if present, consumes the rest of the line and must therefore come last:

port:  8080   # default                              ← trailing on `port`
retry: 3      /* aggressive */ /* see SLO */         ← two trailing block
token: "x"    /* see vault */ # never log this       ← block, then line

Inner comments. A /* ... */ block comment appearing between a key's : and its value, or between a + and a list item's content, attaches to that kvpair or list item as an inner comment. Multiple inner comments stack in source order:

secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"
+ /* see runbook */ name: "web1"

The rule is positional — inner attaches whenever a /* ... */ appears between the opening sigil (: or +) and the value, regardless of whether the value is an inline scalar, a heredoc opener, or the newline that opens a child block:

db: /* connection cluster */
  host: "db.internal"
  port: 5432

Inner comments are /* ... */ only. Line comments (#, //) consume to end-of-line and ### ... ### block comments require the opener on its own line, so neither can appear mid-token.

Floating comment. A comment that does not attach to a following sibling — either because a blank line separates it from the next sibling, or because the block closes before any sibling appears — attaches to the enclosing container (table or list) as a floating comment, in source order:

servers:
  + name: "web1"
  # the following block is currently disabled

  # restore by uncommenting       ← floating on `servers`

The blank-line rule is what disambiguates "section header" comments (meant to apply broadly) from "this-key" comments (meant to apply to the next key). Authors who want a comment to attach to the immediately following key just omit the blank line; authors who want a section header keep the blank line.
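
The blank-line rule can be sketched as a small classifier. This is a simplified illustration under stated assumptions: every input line is a sibling at the same indent level, only full-line `#` and `//` comments are considered, and `classify_comments` is a hypothetical name rather than part of any DMS API.

```python
def classify_comments(lines):
    """Classify full-line comments as leading or floating.

    Returns (comment_text, position, target) tuples in source order.
    `target` is the following sibling line for leading comments and
    None for floating ones. Assumes all lines share one indent level.
    """
    results = []
    pending = []  # comments seen since the last blank line or node
    for line in lines:
        text = line.strip()
        if text == "":
            # A blank line cuts pending comments off from the next sibling.
            results += [(c, "floating", None) for c in pending]
            pending = []
        elif text.startswith("#") or text.startswith("//"):
            pending.append(text)
        else:
            # A node directly follows: pending comments stack as leading.
            results += [(c, "leading", text) for c in pending]
            pending = []
    # The block closed before any sibling appeared.
    results += [(c, "floating", None) for c in pending]
    return results


example = classify_comments([
    "# server pool",
    "# updated 2026-04-22",
    "servers:",
    "# the following block is currently disabled",
    "",
    "# restore by uncommenting",
])
```

A real decoder would also track nesting, so that a comment followed by a dedent floats on the block being closed; that part is left out here.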

Block comments in leading / trailing / floating positions. The block-comment forms (### ... ### / ###LABEL ... LABEL, and /* ... */) follow the same attachment rules as line comments when they occupy any of the leading / trailing / floating positions. A block comment that occupies its own line-range and is followed by a kvpair (no blank line) attaches as a leading comment; a block comment on the same line as a value, after the value, attaches as trailing; otherwise the floating logic applies. The inner position (above) is the only position restricted to a single comment form (/* ... */).

For example, a trailing /* ... */:

x: 1 /* trailing C-block */

attaches as trailing on the kvpair for x and round-trips through encode next to its value, exactly like a trailing # foo line comment would.

Front matter comments. Comments inside the +++ ... +++ block follow the same attachment rules, scoped to the front matter table. Their leading / inner / trailing / floating attachments live alongside the front matter user keys; they do not leak into the body.

Paths

The comment-attachment metadata and the round-trip original_forms records both key into the document by path. A path is an ordered sequence of segments, each one of:

a table key — the decoded key string (always a string, see §Keys), which selects a value inside an enclosing table.
a list index — a non-negative integer, which selects a value inside an enclosing list.

Conventions:

The empty path [] denotes the document root.
A path with first segment "__fm__" (a sentinel string key) denotes a node inside the front-matter table; the rest of the path is then resolved against the FM table the same way as the body.
The encoder's tagged-JSON output uses the same path convention implicitly: a JSON object's keys are table-key segments, a JSON array's positions are list-index segments.
Paths are not strings — they are typed sequences. Implementations may serialize them however they like internally (Rust's Vec<BreadcrumbSegment>, Python's tuple of str | int, JS's array of string | number, etc.), but the segment types are mandatory so a string key "1" and an index 1 never collide.

Examples:

Source	Path of value `"web1"`
`host: "web1"`	`["host"]`
`db:` `host: "web1"`	`["db", "host"]`
`+ "web1"` (list root)	`[0]`
`servers:` `+ name: "web1"`	`["servers", 0, "name"]`
`+++` `app: "web1"` `+++` `x: 1`	`["__fm__", "app"]`
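
A typed path can be resolved against a decoded tree along these lines. The sketch assumes tables decode to `dict` and lists to `list`, and `resolve` is a hypothetical helper; it is meant only to show why the segment type matters.

```python
def resolve(root, path):
    """Walk `path` (a tuple of str | int segments) from `root`.

    String segments select table keys, integer segments select list
    positions. A segment of the wrong type for the container is an error,
    so the key "1" and the index 1 can never be confused.
    """
    node = root
    for seg in path:
        if isinstance(seg, str):
            if not isinstance(node, dict):
                raise TypeError(f"key segment {seg!r} applied to a non-table")
            node = node[seg]
        elif isinstance(seg, int) and not isinstance(seg, bool):
            if not isinstance(node, list):
                raise TypeError(f"index segment {seg!r} applied to a non-list")
            if seg < 0:
                raise IndexError("list index segments are non-negative")
            node = node[seg]
        else:
            raise TypeError(f"invalid path segment: {seg!r}")
    return node


body = {"servers": [{"name": "web1"}], "1": "string key"}
name = resolve(body, ("servers", 0, "name"))   # table key, list index, table key
root = resolve(body, ())                       # the empty path is the root
by_key = resolve(body, ("1",))                 # string segment "1"
```

Using tuples as paths also makes them usable directly as dictionary keys for a comment or original_forms side table, since `("1",)` and `(1,)` compare unequal.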

What's stored

Each comment is captured as { content, kind }:

content: the raw comment text including delimiters (# foo, // foo, /* foo */, ###NOTE\n...\nNOTE).
kind: "line" or "block".

The decoder does not assign stable IDs, breadcrumbs, or position metadata beyond the leading / inner / trailing / floating classification. Comments are identified solely by their attachment — there is no cross-document identity. (Earlier drafts of the spec experimented with content-hash IDs as part of a richer modifier system; that mechanism has been removed and is not part of DMS today.)

Every position is a list of comments stored in source order. Leading and floating may contain any mix of line and block comments; trailing accepts any mix but a # / // line comment must come last (it consumes the rest of the line); inner accepts only /* ... */ block comments.

Round-trip semantics

Decode → no modification → re-encode preserves every comment at its attached node. (Whitespace within comment runs may be normalized by the emitter — e.g. spacing around # — but the content text is preserved.)
Decode → modify → re-encode preserves comments on still-present nodes: if the node a comment is attached to remains in the tree after modification, the comment travels with it. Newly inserted nodes carry no comments; deleted nodes drop their attached comments along with the node itself.
The JSON conformance encoder described elsewhere in this spec does not emit comments — it reflects decoded values only. Comment preservation matters when re-emitting DMS output (see §encode).

Worked example

Source:

# the database section
db:
  host: "localhost"
  # raised from 80 after the LB change in 2024-Q4
  port: 8080      # default for staging
  secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"

  # restore by uncommenting
  # debug: true

After decoding, Document.comments contains the entries below (paths are breadcrumbs into the value tree). Every entry's value is a list of {content, kind} records, in source order:

Path	Position	Contents
`["db"]`	leading	`# the database section`
`["db", "port"]`	leading	`# raised from 80 after the LB change in 2024-Q4`
`["db", "port"]`	trailing	`# default for staging`
`["db", "secret"]`	inner	`/* see vault */`, `/* rotated 2026-04-01 */`
`["db"]`	floating	`# restore by uncommenting`, `# debug: true`

Now mutate the decoded tree — say, change port to 5432:

doc.body["db"]["port"] = 5432

encode(doc) emits:

# the database section
db:
  host: "localhost"
  # raised from 80 after the LB change in 2024-Q4
  port: 5432      # default for staging
  secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"

  # restore by uncommenting
  # debug: true

The data changed; the comments stayed at the same nodes.

`encode` — DMS-output emitter contract

Every conforming decoder must ship encode(document). It has two modes — full (default) and lite — orthogonal to the decode-side mode (see §Decoding modes — full and lite). The mode controls how much round-trip metadata the emitter re-emits.

Full mode (default) re-emits valid DMS source from a decoded Document produced in full-mode decode. The contract is data + comments + literal-form preservation under round-trip:

decode(encode(decode(source))) produces a Document that is data-equivalent to decode(source), has the same comments at the same attached paths, and uses the same literal forms for the values it can preserve (integer base, string form).

Lite mode emits the same data tree in canonical form: comments are dropped, integers are emitted in decimal regardless of source base, strings are emitted in basic-quoted form regardless of source flavour, no original_forms consultation. Lite-mode emit accepts both full-mode and lite-mode decoded Documents — the metadata is simply ignored. Lite-mode emit is lossy by design for comments / source forms; it preserves only the data tree.

The rest of this section specifies full-mode emit (where the preservation contract has teeth). Lite-mode emit follows the same data-tree rules but skips every "Required preservation" point below.

Byte-for-byte source preservation is not required (whitespace and indentation choices are emitter-determined; original column alignment is not preserved); semantic round-trip is.

Required preservations:

Integer base. A decoder must record the source-form of every integer literal (e.g. 0x1F40, 0o755, 0b1010_0110, 1_000_000, +42, -7). On emit, the integer is re-rendered in its original base, preserving sign and underscore separators where present. Stored as a side-channel original_forms keyed by the value's path in the tree (analogous to comment attachment); see API shape below.
String form. A decoder must record which of the four string forms produced each decoded string: basic "...", literal '...', heredoc-basic """[LABEL]...LABEL (or """..."""), heredoc-literal '''[LABEL]...LABEL (or '''...'''). Heredoc records additionally include the label text (or its absence) and any heredoc modifiers applied (_trim(...), _fold_paragraphs(), in source order). On emit, the string is re-rendered in the same form.
Comments. Every AttachedComment is re-emitted at its classified position (Leading / Inner / Trailing / Floating) on the still-present node at its path. Within a position, comments emit in their stored source order. Inner comments are emitted between the kvpair's : and its value (or between a list item's + and its content), separated from each other and from the surrounding tokens by single spaces — key: /* a */ /* b */ value — to keep round 2 byte-stable.
Key insertion order. Already a tier-0 invariant; carries through encode.

Emitter-determined (not preserved):

Float formatting: the shortest-decimal ryu shape is used uniformly. pi: 3.14 round-trips as 3.14; a longer literal such as e: 2.7182818284 is emitted as the shortest digit string that decodes to the same binary64 value, which may or may not match the digits written in the source. (Round-trip stability follows from ryu being canonical for any f64.)
Indentation and whitespace. The emitter picks a consistent indentation (2 spaces is the recommended default; the contract doesn't pin a specific width) and consistent separator whitespace. Original column alignment is lost.
Block vs flow form for containers. A list decoded as [1, 2, 3] may emit as the same flow form OR as block-form + 1 / + 2 / + 3; similarly for tables. Emitters MAY track and preserve the original form (recommended for tight round-trip), but the contract requires only data+comments+literal-form equivalence.

Front matter: an emitter omits the +++ ... +++ block entirely when both meta is empty (or absent) AND no comments are attached to the front matter. Otherwise the block is emitted with its keys and any front-matter comments at their attached paths.

API shape (language-specific naming):

encode(document)            -> str    // full mode (default)
encode(document, mode=lite) -> str    // lite mode (canonical-form emit)
encode_lite(document)       -> str    // alternative lite-mode entry point

Document {
    meta, body, comments,
    original_forms              // sparse map: path -> {integer-lit | string-form-record}
}

The exact API name is language-specific — a parallel entry point (encode_lite) and a mode parameter (encode(doc, mode="lite")) are both conforming. The contract is what matters.

comments and original_forms are populated during full-mode decoding and consulted only by full-mode encode. Lite-mode encode ignores both fields. The conformance JSON encoder ignores them too.

Round-trip stability (full mode only): encode(decode(encode(decode(source)))) must produce the same string as encode(decode(source)) — i.e. the second round-trip is byte-for-byte stable. This is the test condition for the round-trip-comments fixture corpus. Lite-mode emit has no round-trip stability requirement (it's canonical-form, lossy on comments / source forms by design); it does have a data-stability requirement: decode(encode_lite(doc)) must be data-equivalent to doc.
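
The stability condition lends itself to a small test harness. The sketch below takes `decode` and `encode` as parameters because no concrete binding is assumed here; `toy_decode` and `toy_encode` are stand-ins invented for the example and handle only flat `key: value` lines, not DMS.

```python
def is_round_trip_stable(source, decode, encode):
    """Check encode(decode(encode(decode(source)))) == encode(decode(source))."""
    first = encode(decode(source))
    second = encode(decode(first))
    return first == second


# Toy stand-ins (NOT a DMS decoder): flat "key: value" lines only.
def toy_decode(text):
    pairs = []
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(":")
            pairs.append((key.strip(), value.strip()))
    return pairs


def toy_encode(pairs):
    return "".join(f"{key}: {value}\n" for key, value in pairs)


source = "port:   8080\nhost:  localhost\n"
first_pass = toy_encode(toy_decode(source))
stable = is_round_trip_stable(source, toy_decode, toy_encode)
```

The first pass is allowed to differ from the source (column alignment is lost here), which mirrors the contract: only the second round-trip has to be byte-for-byte identical to the first.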

Line comments

Two interchangeable forms, both running from their opener to the end of the line:

# — shell / YAML / TOML style.
// — C / JS / Rust style.

Either form must be preceded by whitespace or start the line; key:5#x and key:5//x are parse errors (the 5 runs into the sigil). Mixing styles in one document is allowed — this is a spec-level convenience, not a style rule. Linters may pick a canonical form.

No leading space is required after the sigil. #foo and # foo are both valid line comments; //foo and // foo are both valid. Style preferences (e.g., requiring a space) are a linter concern, not a decoder one.

# full-line comment
// also a full-line comment
port: 5432   # trailing comment
port: 5432   // same

Block comments

Three forms, all equivalent in semantics (contents are discarded; no decoding happens inside):

Form	Terminator
`/* ... */`	next unmatched `*/` (nested `/* */` allowed)
`### ... ###`	line whose trimmed content equals `###`
`###LABEL ... LABEL`	line whose trimmed content equals `LABEL`

/* ... */ (C-style). Nesting is supported: every /* opens a new level and every */ closes the innermost. The decoder only returns to code mode when the nesting count reaches zero. Can appear inline or span multiple lines:

port: /* inline */ 8080
/* this /* nested */ is fine */
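
The nesting count described above can be sketched as a scanner. `skip_block_comment` is a hypothetical helper name; it assumes the caller has already found a `/*` at `start` and wants the index just past the matching close.

```python
def skip_block_comment(text: str, start: int) -> int:
    """Return the index just past the `*/` that closes the `/*` at `start`.

    Every `/*` opens a level and every `*/` closes the innermost one;
    scanning stops when the depth returns to zero.
    """
    if not text.startswith("/*", start):
        raise ValueError("no block comment opener at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1
            i += 2
        elif text.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")


line = "/* this /* nested */ is fine */ port: 8080"
end = skip_block_comment(line, 0)
rest = line[end:]
```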

###LABEL and ### (heredoc-shaped). Opener and terminator must each be on their own line (trimmed to match). Useful when content contains unbalanced */ — the labeled form lets you pick any terminator word. The label follows heredoc label rules ([A-Za-z_][A-Za-z0-9_]*) and sits directly after ### with no whitespace.

###NOTE
  The alerts below are owned by the SRE team.
  Escalation policy: see runbook.md.
  Any text, any syntax, no escaping — even raw */ survives.
  NOTE

###
  Short block comment, no label. Closed by another ### line.
  ###

/*
  Same idea, C-shape. Nests with /* ... */.
*/

alerts:
  primary: "pager"

A block comment may appear anywhere a blank line or a line comment would be valid.

Types

Strings

basic:   "hello\tworld"     # escapes processed
literal: 'C:\Users\ada'     # no escapes, every character taken literally

Escape sequences (basic strings)

Inside "..." basic strings, the following backslash escapes are recognized. Any other backslash sequence is a parse error.

Escape	Decoded character	Unicode scalar
`\"`	quotation mark	U+0022
`\\`	reverse solidus (backslash)	U+005C
`\b`	backspace	U+0008
`\f`	form feed	U+000C
`\n`	line feed (LF)	U+000A
`\r`	carriage return	U+000D
`\t`	character tabulation (tab)	U+0009
`\uXXXX`	Unicode scalar U+XXXX (BMP scalars only)	U+0000..U+FFFF, excluding U+D800..U+DFFF
`\UXXXXXXXX`	Unicode scalar U+XXXXXXXX (full range)	U+0000..U+10FFFF, excluding U+D800..U+DFFF

Rules and edge cases:

Hex digits in \uXXXX and \UXXXXXXXX are case-insensitive.
\uXXXX must consume exactly four hex digits; fewer is a parse error. \UXXXXXXXX must consume exactly eight.
Surrogates are not scalars. Any escape in U+D800..U+DFFF is a parse error (the decoder reports "unicode escape is not a scalar value"). DMS does not recognize UTF-16 surrogate pairs; to encode a character above the BMP via an escape, use \UXXXXXXXX directly. So 😀 may be written literally as 😀 (UTF-8 source) or as \U0001F600, but not as the surrogate-pair escape \uD83D\uDE00.
\UXXXXXXXX values must be ≤ U+10FFFF (the Unicode maximum); a larger value is a parse error.
No \xXX byte escape: DMS strings are Unicode scalar sequences, not byte sequences. To embed a non-ASCII character, write it literally (UTF-8 source) or use \uXXXX / \UXXXXXXXX.
No \0 null escape, and no raw NUL in source. U+0000 is not expressible — neither via an escape (no \0) nor as a raw byte anywhere in the source (the decoder rejects U+0000 in input with a parse error before lexing begins). Use binary-safe encodings (e.g. base64) outside DMS if you need raw bytes.
No \' single-quote escape: single quotes don't need escaping in basic strings, and basic strings don't terminate on them.
No octal escapes (e.g. \012) — for forward compatibility.
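
The escape rules above can be sketched as a decoder for the text between the quotes of a basic string. This is an illustration under assumptions: `decode_basic` is a hypothetical name, NFC normalization is left out, and the sketch takes no position on whether `\u0000` is rejected.

```python
SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t",
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_basic(body: str) -> str:
    """Decode backslash escapes in the body of a "..." basic string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        kind = body[i + 1] if i + 1 < len(body) else ""
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
        elif kind == "u" or kind == "U":
            width = 4 if kind == "u" else 8
            digits = body[i + 2 : i + 2 + width]
            if len(digits) != width or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"\\{kind} needs exactly {width} hex digits")
            cp = int(digits, 16)
            if 0xD800 <= cp <= 0xDFFF:
                raise ValueError("unicode escape is not a scalar value")
            if cp > 0x10FFFF:
                raise ValueError("unicode escape exceeds U+10FFFF")
            out.append(chr(cp))
            i += 2 + width
        else:
            # Covers \x, \0, \', octal forms, and a dangling backslash.
            raise ValueError(f"unknown escape: \\{kind}")
    return "".join(out)


decoded = decode_basic(r"hello\tworld")
emoji = decode_basic(r"\U0001F600")
```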

Literal strings ('...') process no escapes — every character between the delimiters is taken as-is, then the resulting scalar sequence is NFC-normalized like any other DMS string (see Unicode normalization). Literal strings cannot contain their own delimiter ('); use a heredoc for content that mixes both quote kinds.

Strings never span lines. For multi-line text, use a heredoc.

Heredocs

Two quote flavors, optional label. Four forms total:

Form	Escapes	Terminator (trimmed line content equals)
`"""LABEL`	yes	`LABEL`
`"""`	yes	`"""`
`'''LABEL`	no	`LABEL`
`'''`	no	`'''`

A label is [A-Za-z_][A-Za-z0-9_]* and sits directly after the opening triple quote with no whitespace between them. Labels are case-sensitive and have no case requirement — EOF, eof, and End_of_File are all valid. Uppercase-by-convention is a style choice for linters, not a decoder rule.
Content begins on the line after the opener and runs until the first line whose trimmed content equals the terminator above.
Heredoc bodies do not participate in the outer indent rule; the decoder suspends indent checking until the terminator.
Indent strip is always on. The column of the terminator's first non-whitespace character sets the strip depth; that many leading whitespace characters are removed from every non-blank content line.
Body construction. Every content line — blank or non-blank — contributes its (indent-stripped) text to the value. Lines are joined by a single \n; there is no implicit terminator after the final line. A body of N content lines therefore contains exactly N − 1 newlines from the join.
Line endings normalize to \n. Source CRLF and bare LF both produce \n in the value, regardless of how the surrounding document is line-terminated. Heredoc bodies never emit \r from raw source bytes; to embed a literal CR, use \r in a basic heredoc.
Blank lines (lines containing only whitespace, including empty lines) are exempt from the strip-depth check; their contributed text is the empty string. Trailing newlines in the value come from blank lines acquiring \n separators on either side via the join — not from a blank line emitting a \n of its own.
Non-blank content lines whose indent is less than the strip depth are a parse error.
Trailing newlines: user-controlled. Each trailing blank line adds exactly one \n to the value (it contributes "" and gains a separator from the join). To strip trailing newlines, use _trim("\n", ">") (see Modifiers).
To preserve leading whitespace verbatim, place the terminator at column 0 so strip depth is zero.
Terminator column is independent of the surrounding indent. The terminator may appear at any column ≤ the smallest non-blank body indent, regardless of how deeply the heredoc's enclosing block is nested. The heredoc closes when its trimmed line content equals the terminator string; the column where that happens does not have to align with anything in the outer block. After the terminator line, the next non-blank line is interpreted under the surrounding block's indent rule, not the terminator's column.

config:
  long_text: """EOF
    line one
    line two
EOF                 # ← terminator at column 0, but the parent
  next_key: 1       # block at indent 2 continues uninterrupted

This pattern is the canonical way to embed left-aligned multi-line text from inside an indented context: place the terminator at the outermost column you want stripped (often column 0), and the body's indentation is preserved verbatim.

Examples (each row is a heredoc body after indent-strip):

Content lines	Value
`["line 0"]`	`"line 0"`
`["line 0", ""]`	`"line 0\n"`
`["line 0", "line 1"]`	`"line 0\nline 1"`
`["line 0", "", "line 1"]`	`"line 0\n\nline 1"`
`["line 0", "line 1", "", "", ""]`	`"line 0\nline 1\n\n\n"`
`[]` (heredoc with no content lines)	`""`
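
The body-construction rules reduce to a strip followed by a join. The sketch below assumes the content lines have already been split on their line terminators and that the strip depth was read from the terminator's column; `build_heredoc_body` is a hypothetical name.

```python
def build_heredoc_body(content_lines, strip_depth):
    """Apply indent strip to each content line and join with a single LF.

    Blank (whitespace-only) lines contribute the empty string and are
    exempt from the strip-depth check. No newline is added after the
    final line, so N lines yield exactly N - 1 newlines from the join.
    """
    parts = []
    for line in content_lines:
        if line.strip() == "":
            parts.append("")
            continue
        indent = len(line) - len(line.lstrip(" \t"))
        if indent < strip_depth:
            raise ValueError("content line indented less than the strip depth")
        parts.append(line[strip_depth:])
    return "\n".join(parts)


value = build_heredoc_body(["    SELECT id, name", "      FROM users", ""], 4)
```

Here the trailing blank line contributes "" and picks up one separator from the join, which is how the single trailing newline in `value` arises.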

Labels allow content to contain the triple-quote sequence. Without a label, the first line whose trimmed content is the triple-quote itself closes the heredoc — a simpler form for content that can't contain """ or '''.

Line continuation (escape-on only)

In a """ heredoc, a \ that is the last non-whitespace character on a line is a line continuation. The backslash, any trailing whitespace on its line, the line terminator, and any leading whitespace on the next non-blank line are all consumed — the two lines splice into one. Applies before modifiers run.

prose: """EOF
    The quick brown \
        fox jumps over \
        the lazy dog.
    EOF
# → "The quick brown fox jumps over the lazy dog."

This gives per-line author control over where a newline is kept versus chomped, which the global modifiers (_fold_paragraphs, _trim) can't do at the character level.

Details and edge cases:

Only """ (escapes on) supports line continuation. Inside ''', \ is a literal backslash; no line splicing happens.
\\ at end of line is a literal backslash followed by a newline, not a continuation — the first \ escapes the second. The rule fires only on an unescaped trailing \.
Trailing whitespace after the \ on the same line is allowed and consumed.
Continuation consumes blank lines too: foo \ followed by a blank line and then bar produces foo bar.
A line that is only \ (followed by a newline) splices to the next line.
\ as the very last character before the terminator line (with no following content) is a parse error — there is nothing to splice to.
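
The splice rules can be sketched over a list of body lines. This is a simplified illustration: `splice_continuations` is a hypothetical helper, it runs before escape decoding, and it treats a trailing backslash as unescaped when the run of backslashes at the end of the line has odd length.

```python
def ends_with_continuation(line: str) -> bool:
    """True if the last non-whitespace character is an unescaped backslash."""
    stripped = line.rstrip(" \t")
    trailing = len(stripped) - len(stripped.rstrip("\\"))
    return trailing % 2 == 1


def splice_continuations(lines):
    """Join each continued line with the next non-blank line."""
    out = []
    i = 0
    while i < len(lines):
        current = lines[i]
        i += 1
        while ends_with_continuation(current):
            # Drop trailing whitespace and the backslash itself.
            current = current.rstrip(" \t")[:-1]
            # Continuation consumes intervening blank lines too.
            while i < len(lines) and lines[i].strip() == "":
                i += 1
            if i == len(lines):
                raise ValueError("line continuation with nothing to splice to")
            current += lines[i].lstrip(" \t")
            i += 1
        out.append(current)
    return out


prose = splice_continuations([
    "The quick brown \\",
    "    fox jumps over \\",
    "    the lazy dog.",
])
```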

Modifiers

One or more modifiers may follow the opener (and label, if present), separated by whitespace. Every modifier is written in function-call form: an identifier followed by a parenthesized argument list. The parentheses are required even when the argument list is empty.

Modifiers work with both labeled and unlabeled heredocs. Whitespace rules (consistent with the rest of the language):

A label, if present, attaches directly to the opener with no whitespace between them: """EOF, '''END.
A modifier requires whitespace before it: """ foo(), """EOF foo().

Example:

sql: """EOF _trim("\n", ">")
    SELECT id, name
    FROM users
    EOF
# → "SELECT id, name\nFROM users"  (trailing newline stripped)

Disambiguation follows from those two rules combined with the "modifiers always have parens" rule: a bare identifier touching the opener is a label; an identifier with (...) preceded by whitespace is a modifier. """foo() (no whitespace, has parens) is a parse error; write """ foo() if you mean modifier-with-no-label, or """foo bar() if you mean label-plus-modifier.

Modifiers run left-to-right, after indent-strip, before the value is returned. Each modifier operates on whatever the previous step produced.

The two standard modifiers

Modifier	Effect
`_trim(chars, where, replacement = "")`	Find runs of characters matching `chars`, at positions selected by `where`, and replace each run with `replacement`. The Swiss-army content shaper — covers strip, chomp, and interior replace.
`_fold_paragraphs()`	Collapse runs of non-blank lines within a paragraph into space-joined single lines; blank-line paragraph breaks stay as a single `\n`.

Unknown modifiers are a parse error. Argument types must match the signatures; wrong types error at decode time (e.g. _trim(42, "*")).

`_trim(chars, where, replacement = "")`

chars (string) — a bag of characters to match. Matching is per-character; chars is not interpreted as a regex or a substring. "\n" matches newlines only; " \t" matches either spaces or tabs; " \t\n\r" matches any standard whitespace. An empty chars ("") is a no-op: nothing matches, so the body is returned unchanged.

where (string) — a DSL of position flags. Unknown characters in the flags string are silently ignored, so future flags will be forward-compatible.

Flag	Meaning
`<`	Leading edge of the whole string (the first run of matching chars, if any).
`>`	Trailing edge of the whole string (the last run of matching chars, if any).
`|`	Per-line edges: leading + trailing runs on every line, considered independently.
`*`	Every occurrence, anywhere (interior runs included). Subsumes `<`, `>`, `|` when present.

Flags combine ("<|>" = leading + per-line + trailing). * in combination with other flags still means "everywhere"; the others become redundant.

replacement (string, optional, default "") — what each matched run becomes. "" means strip; ", " means join-with-comma; "\n" means collapse-run-to-single-newline.

Run collapse. A consecutive run of matching characters becomes one replacement, not one per char. So _trim("\n", "*", ", ") applied to "a\n\nb" produces "a, b", not "a, , b".

Common recipes:

Intent	Call
Strip trailing newlines	`_trim("\n", ">")`
Ensure exactly one trailing newline	`_trim("\n", ">", "\n")`
Ensure exactly three trailing newlines	`_trim("\n", ">", "\n\n\n")`
Join all lines with a comma-space	`_trim("\n", "*", ", ")`
Concatenate lines (no separator)	`_trim("\n", "*")`
Full whitespace-trim (both ends)	`_trim(" \t\n\r", "<>")`
Trim leading whitespace only	`_trim(" \t\n\r", "<")`
Strip per-line indentation remnants	`_trim(" \t", "|")`
Tabs → spaces everywhere	`_trim("\t", "*", " ")`
Collapse all runs of whitespace to single space	`_trim(" \t\n\r", "*", " ")`

The third argument is required only when you want to replace with something other than empty. All of the strip-style uses can omit it.
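
The semantics of _trim can be approximated in a few lines. This is a sketch under assumptions, not the reference behavior: the function name `trim` is hypothetical, the order in which combined flags are applied (`|`, then `<`, then `>`) is a choice made here, and a position with no matching run is left untouched.

```python
import re


def trim(body: str, chars: str, where: str, replacement: str = "") -> str:
    """Replace runs of `chars` at the positions selected by `where`."""
    if chars == "":
        return body  # nothing can match: no-op
    # A character class matching one maximal run of any char in `chars`.
    run = "[" + "".join(re.escape(c) for c in chars) + "]+"

    def repl(_match):
        return replacement  # a function avoids backslash handling in re.sub

    if "*" in where:
        return re.sub(run, repl, body)  # every run, interior included
    if "|" in where:
        lines = body.split("\n")
        lines = [
            re.sub(run + r"\Z", repl, re.sub(r"\A" + run, repl, line))
            for line in lines
        ]
        body = "\n".join(lines)
    if "<" in where:
        body = re.sub(r"\A" + run, repl, body)
    if ">" in where:
        body = re.sub(run + r"\Z", repl, body)
    return body  # unknown flag characters were simply never consulted


joined = trim("a\n\nb", "\n", "*", ", ")
chomped = trim("SELECT 1\n\n", "\n", ">")
```

Each maximal run becomes a single replacement because the pattern uses `+`, which is the run-collapse behavior described earlier.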

`_fold_paragraphs()`

Collapses non-blank-line runs into space-joined single lines, preserving blank-line paragraph breaks as single \ns. Not expressible as trim because it's a structural transform on paragraphs (two-level: lines within paragraphs, paragraphs within a body), not character-level replacement. Takes no arguments. Fine to combine with _trim(...).

Mapping to YAML block scalars — every YAML block scalar mode is expressible as a modifier combination:

DMS	YAML
(default; no modifier needed)	`|+`
`_trim("\n", ">", "\n")`	`|`
`_trim("\n", ">")`	`|-`
`_fold_paragraphs()`	`>+`
`_fold_paragraphs() _trim("\n", ">", "\n")`	`>`
`_fold_paragraphs() _trim("\n", ">")`	`>-`

Modifiers stack, applied left-to-right:

csv: """EOF _trim("\n", "*", ", ") _trim(" \t", "<>")
    alpha
    beta
    gamma
    EOF
# → "alpha, beta, gamma"

summary: """EOF _fold_paragraphs() _trim("\n", ">", "\n")
    First paragraph line one
    first paragraph line two.

    Second paragraph line one
    second paragraph line two.
    EOF
# → "First paragraph line one first paragraph line two.\nSecond paragraph line one second paragraph line two.\n"

Escape hatch for triple-quote content

If the body may contain a line whose trimmed content is the triple-quote opener, the unlabeled form will close early. Use the labeled form instead — labels exist precisely for this case:

doc: """END
    my_string = """
    """
    END

The two forms differ in their fallback when labels aren't used:

""" (escapes on) — you can write \"\"\" on a content line to smuggle in a literal """ without tripping the terminator. Ugly; a label is almost always cleaner.
''' (literal, no escapes) — there is no way to include a line whose trimmed content is ''' in an unlabeled body. Use a label.

sql: """END
    SELECT id, name
    FROM users
    WHERE active = true
    END

regex: '''
    ^\d{4}-\d{2}-\d{2}$
    '''

note: """
    First paragraph.

    Second paragraph after a blank line.
    """

ascii_art: '''
/\_/\
( o.o )
 > ^ <
'''

In the last example, the terminator sits at column 0, so strip depth is zero and every leading space in the art is preserved.

Integers

dec: 1_000_000          # underscores allowed between digits
hex: 0xDEAD_BEEF
oct: 0o755
bin: 0b1010_0110
neg: -42

64-bit signed; values outside [-2^63, 2^63 - 1] are a parse error. Leading zeros on decimal literals are a parse error (reserved for future use). Underscores must be between two digits (never at the start, end, or adjacent to the base prefix or sign).
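The integer rules above can be sketched as a small Python validator. This is an illustration under stated assumptions, not a reference decoder: the name `parse_int` is hypothetical, only a leading `-` is handled (the examples show no `+` on integers), the sign is allowed in front of any base prefix, and lowercase hex digits are accepted.

```python
import re

# Underscores may only sit between two digits, so every "_" must be
# followed by at least one digit and preceded by one.
_INT_RE = re.compile(
    r"(?P<sign>-?)(?:"
    r"0x(?P<hex>[0-9A-Fa-f]+(?:_[0-9A-Fa-f]+)*)"
    r"|0o(?P<oct>[0-7]+(?:_[0-7]+)*)"
    r"|0b(?P<bin>[01]+(?:_[01]+)*)"
    r"|(?P<dec>0|[1-9][0-9]*(?:_[0-9]+)*)"  # no leading zeros on decimals
    r")"
)

INT_MIN = -2**63
INT_MAX = 2**63 - 1


def parse_int(text: str) -> int:
    """Parse an integer literal, raising ValueError on any rule violation."""
    m = _INT_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid integer literal: {text!r}")
    value = 0
    for group, base in (("hex", 16), ("oct", 8), ("bin", 2), ("dec", 10)):
        digits = m.group(group)
        if digits is not None:
            value = int(digits.replace("_", ""), base)
            break
    if m.group("sign") == "-":
        value = -value
    if not INT_MIN <= value <= INT_MAX:
        raise ValueError(f"integer out of 64-bit signed range: {text!r}")
    return value


print(parse_int("1_000_000"), parse_int("0o755"), parse_int("-42"))
```

Checking the range after applying the sign is what lets `-9223372036854775808` through while rejecting `9223372036854775808`.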

Floats

pi:    3.14159
avog:  6.022e23
small: 1.5e-10
inf_p: +inf
inf_n: -inf
nan:   nan

IEEE 754 binary64. inf and nan are keywords, not identifiers.

Decimal floats require at least one digit on each side of the decimal point. 1. and .5 are parse errors; write 1.0 and 0.5. The exponent (e/E) is optional; if present, it's a signed decimal integer.

Non-decimal floats use 0x / 0o / 0b prefixes and a binary exponent marker p (mandatory — it's what distinguishes a non-decimal float from a non-decimal integer). The value is mantissa × 2^exponent, where the mantissa is the base-N number's value.

hex_f:   0x1.8p3          # 1.5 × 2^3 = 12.0
hex_int: 0xFp0            # 15.0  (no dot, but p makes it a float)
oct_f:   0o1.4p3          # (1 + 4/8) × 2^3 = 12.0
bin_f:   0b1.1p3          # 1.5 × 2^3 = 12.0
neg_e:   0x1p-3           # 0.125

The digit-on-both-sides rule applies to every form: 0x1.p3 and 0x.8p3 are parse errors. Underscore separators are allowed in the mantissa but not in the exponent.
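As a worked illustration of mantissa × 2^exponent, here is a hedged Python sketch for the non-decimal float forms. The name `parse_nondecimal_float` is hypothetical, and the handling of a leading `-` and of lowercase hex digits is an assumption. Because each base is a power of two, a fractional digit simply shifts the binary exponent by a fixed number of bits (4 for hex, 3 for octal, 1 for binary).

```python
import math
import re

_BITS_PER_DIGIT = {"x": 4, "o": 3, "b": 1}

# The digit class is deliberately wide; per-base validity is checked by int().
_FLOAT_RE = re.compile(
    r"(-?)0([xob])([0-9A-Fa-f_]+)(?:\.([0-9A-Fa-f_]+))?p([+-]?[0-9]+)"
)


def _check_underscores(part: str) -> None:
    # Underscores must sit between two digits.
    if part.startswith("_") or part.endswith("_") or "__" in part:
        raise ValueError(f"misplaced underscore in {part!r}")


def parse_nondecimal_float(text: str) -> float:
    m = _FLOAT_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a non-decimal float literal: {text!r}")
    sign, base_ch, int_part, frac_part, exponent = m.groups()
    bits = _BITS_PER_DIGIT[base_ch]
    _check_underscores(int_part)
    if frac_part is None:
        frac_part = ""
    else:
        _check_underscores(frac_part)
    frac_digits = frac_part.replace("_", "")
    digits = int_part.replace("_", "") + frac_digits
    # int() raises ValueError for digits that do not belong to the base.
    mantissa = int(digits, 1 << bits)
    # Each fractional digit divides the mantissa by the base, i.e. 2**bits.
    value = math.ldexp(mantissa, int(exponent) - bits * len(frac_digits))
    return -value if sign else value


print(parse_nondecimal_float("0x1.8p3"), parse_nondecimal_float("0x1p-3"))
```

Folding the fractional digits into the exponent keeps the arithmetic in integers until the final `math.ldexp` call, so no intermediate division can introduce rounding.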

Round-trip. A decoder-emitter pair is required to preserve the value of +inf, -inf, and nan across a decode-then-emit cycle, though an emitter may normalize the spelling (e.g. always emit nan, never NaN). Finite values round-trip to the shortest decimal literal that produces the same binary64.
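One way an emitter might meet the round-trip requirement is sketched below in Python, whose `repr` for floats already yields the shortest digit string that reads back as the same binary64. The function name `emit_float` is hypothetical. The spelling choices are assumptions: positive infinity is written `+inf` as in the example above, a `.0` is added when the mantissa has no decimal point so the digit-on-both-sides rule holds, and the exponent keeps the explicit sign that `repr` produces.

```python
import math


def emit_float(x: float) -> str:
    """Emit a float so that decoding the text yields the same binary64."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"
    text = repr(x)  # shortest round-tripping digits
    mantissa, sep, exponent = text.partition("e")
    if "." not in mantissa:
        # repr(1e22) is "1e+22"; add the fractional digit the syntax requires.
        mantissa += ".0"
    return mantissa + sep + exponent


for sample in (3.14159, 6.022e23, 1.5e-10, 12.0, 1e22):
    print(emit_float(sample))
```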

Booleans

enabled: true
debug:   false

Date & time

RFC 3339 / ISO 8601 subset, four distinct types:

offset_dt: 1979-05-27T07:32:00-08:00   # offset datetime
local_dt:  1979-05-27T07:32:00         # local datetime
local_d:   1979-05-27                  # local date
local_t:   07:32:00.999                # local time

The date/time separator is uppercase T only. RFC 3339 section 5.6 allows a lowercase t and a single space as alternates; DMS rejects both to keep emitted output canonical. Lowercase t is a parse error; a space between date and time is also a parse error (since a date followed by whitespace is itself a complete local_d value).

Fractional seconds are optional and limited to nanosecond precision (up to 9 digits after the decimal point). More digits are a parse error rather than silently truncated, so the written precision always matches the stored value.
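The lexical side of these rules can be sketched in Python as follows. The function name `classify_datetime` and the returned type names are hypothetical. Accepting `Z` as an offset is an assumption carried over from RFC 3339, since the examples above show only a numeric offset. The sketch checks shape only and does not validate field ranges such as month 13.

```python
import re

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
# At most nine fractional digits: a tenth digit makes the match fail.
_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]{1,9})?"
# Assumption: "Z" is accepted alongside numeric offsets.
_OFFSET = r"(?:Z|[+-][0-9]{2}:[0-9]{2})"

# Only an uppercase "T" joins date and time; "t" or a space never matches.
_PATTERNS = [
    ("offset_datetime", re.compile(_DATE + "T" + _TIME + _OFFSET)),
    ("local_datetime", re.compile(_DATE + "T" + _TIME)),
    ("local_date", re.compile(_DATE)),
    ("local_time", re.compile(_TIME)),
]


def classify_datetime(text: str) -> str:
    """Return which of the four date/time types the literal belongs to."""
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(text):
            return name
    raise ValueError(f"not a valid date/time literal: {text!r}")


print(classify_datetime("1979-05-27T07:32:00-08:00"))
print(classify_datetime("07:32:00.999"))
```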

Arrays (flow form)

For inline or compact use. Block form uses + items (see Lists).

empty:  []                              # the empty list
ints:   [1, 2, 3]
mixed:  [1, "two", 3.0, true]           # heterogeneous allowed
nested: [[1, 2], [3, 4]]                # arrays of arrays
tables: [{x: 1}, {x: 2}, {x: 3}]        # arrays of tables
multi: [
    "first",
    "second",
    "third",
]                                       # trailing comma ok; comments must sit outside the brackets

Flow arrays may span multiple lines. Between [ and the matching ], whitespace (including newlines and any indentation) is insignificant — the outer indent rule is suspended inside a flow form.

Flow tables

empty:    {}                                       # the empty table
point:    { x: 1, y: 2 }
user:     { name: "ada", email: "ada@example.com" }
quoted:   { "with space": 1, plain: 2 }            # quoted and bare keys mix
matrix:   { rows: [[1, 2], [3, 4]], cols: 2 }      # tables containing arrays
nested:   { outer: { inner: { deep: true } } }     # tables of tables
trailing: { a: 1, b: 2, }                          # trailing comma ok
multi: {
    name:  "ada",
    email: "ada@example.com",
    role:  "admin",
}

Flow tables may span multiple lines. Between { and the matching }, whitespace (including newlines and any indentation) is insignificant — the outer indent rule is suspended inside a flow form.

Keys in a flow table must be unique within that table — the same rule that applies to block tables. Repeating a key is a parse error.

Flow forms — canonical multi-line layout

Multi-line flow is permissive on the decode side: any whitespace between the brackets is accepted. On the encode side, every conforming port emits multi-line flow forms in one canonical layout, so two ports producing the same value tree always emit byte-identical output.

The close-bracket anchors the indent.

The closing bracket (] for arrays, } for tables) sits at the indent level of the line that opened the form.
Members are indented exactly one level deeper than the closing bracket.
The opening bracket stays on the line that opened the form.
A trailing comma after the last member is required in canonical form, for diff-friendliness and to match the decode-side permissiveness already documented.

# array, scalar root
xs: [
  "first",
  "second",
  "third",
]

# table, scalar root
user: {
  name:  "ada",
  email: "ada@example.com",
  role:  "admin",
}

# nested: closing bracket anchors at each level's opener
config: {
  servers: [
    {
      name: "web1",
      port: 8080,
    },
    {
      name: "web2",
      port: 8081,
    },
  ],
}
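The layout rules above can be sketched as a short recursive emitter in Python. This is an illustration only: the name `emit_flow` is hypothetical, it always chooses multi-line form for non-empty containers, it assumes every key is a valid bare key, and it uses JSON string escaping as a stand-in for the string rules. The two-space indent step and the column alignment of values after the colon are inferred from the examples, not stated as requirements.

```python
import json


def emit_flow(value, indent: int = 0, step: int = 2) -> str:
    """Render nested lists/dicts in the multi-line flow layout.

    `indent` is the column of the line that opens the form; the closing
    bracket returns to that column and members sit one step deeper.
    """
    member_pad = " " * (indent + step)
    close_pad = " " * indent
    if isinstance(value, dict):
        if not value:
            return "{}"
        # Assumption: values are aligned one space past the longest "key:".
        width = max(len(key) for key in value) + 1
        lines = [
            f"{member_pad}{(key + ':').ljust(width)} "
            f"{emit_flow(item, indent + step, step)},"
            for key, item in value.items()
        ]
        return "{\n" + "\n".join(lines) + "\n" + close_pad + "}"
    if isinstance(value, list):
        if not value:
            return "[]"
        lines = [
            f"{member_pad}{emit_flow(item, indent + step, step)},"
            for item in value
        ]
        return "[\n" + "\n".join(lines) + "\n" + close_pad + "]"
    if isinstance(value, bool):  # must precede int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return json.dumps(value, ensure_ascii=False)
    raise TypeError(f"unsupported value type: {type(value).__name__}")


print("config: " + emit_flow({"servers": [{"name": "web1", "port": 8080}]}))
```

Passing the opener's column down as `indent` is what makes the closing bracket line up with the line that opened the form at every nesting level.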

When to break to multi-line. Encoders SHOULD emit multi-line form when:

The single-line rendering would exceed the configured line-width threshold (port default: 80 chars; user-configurable), OR
The form contains a heredoc or another multi-line member whose own rendering spans multiple lines.

Encoders MAY emit multi-line for other reasons (e.g., user-set always_multiline_above: 3 for tables of size ≥ 3) — the rule only specifies layout when multi-line is chosen, not the break threshold.

Encoders that canonicalize non-empty flow values to block form (tier-0's typical strategy) never emit multi-line flow for non-empty cases and so do not exercise the layout rule. The rule binds when an encoder genuinely emits a multi-line [...] / {...} — most often for tier-1 decorator-call parens (which have no block-form alternative; see TIER1.md).

What flow forms cannot contain

Flow forms are restricted to inline values. The following are decode errors inside [ ... ] or { ... }:

Heredocs (""" / '''). Heredoc bodies start on the next line, which conflicts with flow's whitespace-insignificant rule. Use a single-line quoted string, or switch the container to block form.
Comments of any kind — #, //, or block comments (/* ... */, ### ... ###). Place the comment outside the brackets, or switch to block form where comments attach as leading / inner / trailing / floating.
Block-form children — + list items or indented child blocks. A flow value position accepts only scalars and other flow forms.

Nested flow values

Flow arrays and flow tables compose freely — any value position accepts another flow array, flow table, or any scalar. There is no depth limit.

Grammar (sketch)

The grammar is indentation-sensitive; the token stream includes synthetic INDENT and DEDENT tokens produced by the lexer according to the indent rule above.

document     = { trivia } ( table_root | list_root | scalar_root | empty ) ;
trivia       = blank_line | line_comment | block_comment ;
table_root   = kvpair { trivia | kvpair } ;
list_root    = list_item { trivia | list_item } ;
scalar_root  = ( inline_value | heredoc_ref ) { trivia } ;
empty        = { trivia } ;

block_item   = kvpair | list_item | line_comment | block_comment ;
kvpair       = key ":" ( inline_value | heredoc_ref | child_block ) ;
list_item    = "+" ( inline_value | heredoc_ref | kvpair { INDENT kvpair DEDENT } | child_block ) ;
child_block  = NEWLINE INDENT block_item { block_item } DEDENT ;

key          = bare_key | basic_string | literal_string ;
bare_key     = ( xid_cont_safe | "_" | "-" )+ ;
             (* xid_cont_safe = any character in                    *)
             (*   XID_Continue \ Default_Ignorable_Code_Point       *)
             (*   per Unicode 15.1+ derived properties              *)

inline_value = string | integer | float | boolean | datetime
             | flow_array | flow_table ;

heredoc_ref  = ( '"""' | "'''" ) [ label ] { WS modifier } ;
modifier     = ident "(" [ inline_value { "," inline_value } ] ")" ;
ident        = ( letter | "_" ) { letter | digit | "_" } ;
label        = ident ;

flow_array   = "[" [ inline_value { "," inline_value } [ "," ] ] "]" ;
flow_table   = "{" [ flow_kv { "," flow_kv } [ "," ] ] "}" ;
flow_kv      = key ":" inline_value ;
             (* Inside flow_array / flow_table, all whitespace including *)
             (* newlines is insignificant; the outer indent rule is      *)
             (* suspended between the opening and closing bracket/brace. *)
             (* inline_value excludes heredoc_ref, and no comment        *)
             (* production appears inside flow forms — both are decode    *)
             (* errors. See "What flow forms cannot contain".            *)

line_comment  = ( "#" | "//" ) { any_char_except_newline } ;
block_comment = hash_block | c_block ;
hash_block    = "###" [ label ] NEWLINE { any_line } terminator_line ;
c_block       = "/*" { any_char | c_block } "*/" ;   (* nested *)

(String/integer/float/datetime productions follow the rules described above.)

Example document

# Server config
title:   "production"
updated: 2026-04-22T10:00:00-04:00

database:
  host: "db.internal"
  port: 5432
  pool:
    size: 10
  dsn: """END
      host=db.internal
      port=5432
      sslmode=require
      END

servers:
  + name: "web1"
    ipv4: "10.0.0.1"
    disks:
      + mount: "/"
        size_gb: 100
      + mount: "/var"
        size_gb: 500
  + name: "web2"
    ipv4: "10.0.0.2"

###NOTE
  The feature flags below are owned by the growth team.
  See team-growth/flags.md before changing.
  NOTE
features:
  enabled: ["auth", "billing", "search"]
  limits:  { rpm: 1000, burst: 50 }

Design non-goals

Features deliberately left out of the spec:

No null / none. Missing values are expressed by key absence.
No string concatenation or line continuation for single-line strings. Heredocs are the only multi-line string mechanism.
No unit suffixes (30s, 5MB). Consumers layer that on top.
No references / interpolation / schemas.
No anchors / aliases / type tags.
