DMS — Data Meta Syntax
Version: 0.14 (draft)
file extension:
A data syntax with YAML's clean look and a small, strict spec. Structure is indent-based (no repeated section headers). Types are distinct and never inferred from context. Heredocs and multi-line comments are first-class.
Design principles
- Indent-based structure, but a tiny indent rule (no YAML-style complexity).
- Strict, distinct types — never infer from context.
- No anchors, aliases, tags, merge keys, schemas, or references.
- Every value has exactly one canonical representation.
- UTF-8 only, NFC-normalized. LF or CRLF line endings, both accepted, LF canonical.
Lexical
- Indentation: spaces only. Tabs are banned in structural indent (they may still appear inside string values). A hard tab at the start of any non-heredoc-body line is a parse error.
- Line terminator: LF or CRLF.
- Whitespace inside a line: space (U+0020) or tab (U+0009), not
  significant except where explicitly required (e.g. after `:`, after `+`).
- Case sensitivity: keys, keywords (true, false, inf, nan), and heredoc labels are case-sensitive.
- Reserved decorator sigils: the characters
  ! @ $ % ^ & * | ~ ` . , > < ? ; =
  are reserved as decorator sigils at line-start position. A body
  line whose first non-whitespace character is one of these is a parse
  error in tier 0. The reservation is fixed in this spec — it is not
  derived from any per-document declaration. Tier 1 (defined in
  TIER1.md) binds these sigils to dialect-published
  decorator families via the _dms_imports front-matter field;
  tier-0-only decoders reject tier-1 documents at front-matter decode
  (_dms_tier: 1 triggers rejection on a tier-0-only decoder by the
  tier-marker rule below). The underscore (_) is not a reserved
  decorator sigil here — it has its own category, reserved for core /
  built-in decorators (e.g. heredoc modifiers _trim,
  _fold_paragraphs).
Three other punctuation characters that might look like candidates for this set are explicitly not reserved, because they already carry tier-0 grammar:
- `/` opens line comments (`//`) and block comments (`/* */`).
- `-` is a member of the bare-key character set (`-key: 1` is a valid kvpair) and prefixes negative numeric scalar roots (`-5`).
- `_` is the core-decorator prefix (see above).
The reservation cost in tier 0 is zero: none of the seventeen reserved-sigil characters are members of the bare-key character set (§Keys and scalars), and none can appear as the first non-whitespace character of any other valid tier-0 construct, so no pre-existing valid document is invalidated. Decoders that previously accepted such lines by oversight must reject them after this spec revision.
- Reserved emoji characters: any extended grapheme cluster (per UAX #29, frozen at Unicode 15.1.0) that contains at least one codepoint from the Reserved Emoji Set is reserved in tier 0. The Reserved Emoji Set is the union of:
  - Extended_Pictographic=Yes (per UTS #51) — covers all pictographic emoji bases, including ZWJ-sequence components.
  - Regional Indicators, U+1F1E6..U+1F1FF — flag pairs.
  - Emoji modifiers (skin-tone), U+1F3FB..U+1F3FF.
  - Combining Enclosing Keycap, U+20E3 — keycap sequences.
All four sub-ranges are frozen at Unicode 15.1.0 alongside the bare-key and NFC tables. The set is closed under Unicode's monotonic-additive stability guarantees: future SPEC bumps can only add codepoints, never invalidate documents that decode cleanly under the current floor.
This selection covers every emoji renderable as a single visual
glyph in current systems — single-codepoint emoji (🚀),
ZWJ families (👨👩👧), skin-tone variants (👍🏽), flags
(🇺🇸), and keycaps (1️⃣).
ASCII overlap is naturally excluded. Characters with
Emoji=Yes in UTS #51 but not Extended_Pictographic=Yes —
digits 0-9, #, * — are not in the Reserved Emoji Set
and continue to carry their tier-0 grammar meaning (numeric
scalars, comment marker, decorator-sigil candidate). The
Extended_Pictographic property was specifically designed to
exclude these ASCII overlaps; that's why we use it as the
primary base.
Latin-1 trademark symbols are included. © (U+00A9), ®
(U+00AE), and ™ (U+2122) are classified as
Extended_Pictographic=Yes by Unicode and are therefore
reserved. To use them as plain-text prose at the start of a
line or as a bare-key character, write them inside quotes
(note: "© 2026 Acme") or a string scalar. This is a
deliberate consequence of taking Unicode's classification at
face value rather than carving out exceptions per-codepoint —
carve-outs would require ports to track their own
pictographic-vs-text taste, which would diverge over time.
Concretely, reservation applies as follows:
- Banned from bare keys. No bare-key character may belong
  to the Reserved Emoji Set. Emoji-bearing keys must be quoted
  ("🚀": 1, "🇺🇸": "United States").
- Banned at line-start position. A body line whose first
  non-whitespace extended grapheme cluster contains a
  Reserved-Emoji codepoint is a parse error in tier 0, on the
  same footing as a reserved decorator sigil. Tier 1 dialects
  may bind such clusters to decorator families via
  _dms_imports.
- Banned as the first grapheme cluster of any unquoted value
  position — flow scalar, array element, or inline scalar
  value following `:`. Emoji in value position must be inside a
  quoted string or heredoc.
- Allowed verbatim inside strings and heredocs. Reservation
  applies only to unquoted positions; name: "🚀 launch", literal
  strings, and heredoc bodies are unaffected.
The four sub-ranges are sourced from UCD 15.1 (emoji-data.txt
for Extended_Pictographic and Emoji_Modifier; the Unicode
core database for Regional Indicators and U+20E3) and shipped
per port as a single frozen table, in lockstep with the
bare-key and NFC tables. A port must not delegate the check to
its host runtime's Unicode library — host tables track the
runtime's Unicode version and would silently diverge as new
emoji are assigned. Grapheme-cluster segmentation is likewise
performed against a UAX #29 algorithm frozen at 15.1.0 (the
algorithm itself is stable; only the property tables it
consults need pinning).
Decoder error messages must name the offending codepoint by
hex value and category, e.g. "U+1F680 (🚀, Extended_Pictographic)
reserved as emoji", since terminals and editors render some
Reserved-Emoji codepoints as monochrome text glyphs that authors
may not recognize as emoji.
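The two line-start reservations above (decorator sigils and reserved emoji) can be sketched as a single check. This is a minimal illustration, not a conforming implementation: the function names are hypothetical, the Extended_Pictographic table is reduced to the four codepoints this section mentions by name, and only the first codepoint is inspected rather than the first full UAX #29 grapheme cluster. A real port would ship the complete frozen Unicode 15.1.0 tables.

```python
# The seventeen reserved decorator sigils listed in this section.
RESERVED_SIGILS = set("!@$%^&*|~`.,><?;=")

# ASSUMPTION: tiny illustrative subset. A real port ships the full frozen
# Extended_Pictographic table generated from UCD 15.1.
EXTENDED_PICTOGRAPHIC_SAMPLE = {0x1F680, 0x00A9, 0x00AE, 0x2122}


def emoji_category(cp):
    """Return the Reserved Emoji Set sub-range name for a codepoint, or None."""
    if 0x1F1E6 <= cp <= 0x1F1FF:
        return "Regional_Indicator"
    if 0x1F3FB <= cp <= 0x1F3FF:
        return "Emoji_Modifier"
    if cp == 0x20E3:
        return "Combining_Enclosing_Keycap"
    if cp in EXTENDED_PICTOGRAPHIC_SAMPLE:
        return "Extended_Pictographic"
    return None


def check_line_start(line):
    """Return an error message for a reserved line start, or None if allowed."""
    body = line.lstrip(" ")
    if not body:
        return None
    first = body[0]  # simplification: first codepoint, not first grapheme cluster
    if first in RESERVED_SIGILS:
        return f"reserved decorator sigil {first!r} at line start"
    category = emoji_category(ord(first))
    if category is not None:
        # Name the codepoint by hex value and category, as the spec requires.
        return f"U+{ord(first):04X} ({first}, {category}) reserved as emoji"
    return None
```

Because `-` and `_` are not in the sigil set, a line such as `-key: 1` passes this check, matching the carve-outs described earlier in this section.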
Unicode normalization
To prevent visually-identical strings from comparing unequal — e.g.
é written as the precomposed scalar U+00E9 vs. as U+0065 U+0301
(e + combining acute) — every string the decoder produces is
normalized to Normalization Form C (NFC) as defined by Unicode
Standard Annex #15.
This applies uniformly to:
- bare keys and the inner content of quoted keys ("..." and '...');
- basic strings ("..."), after escape decoding;
- literal strings ('...'), even though no escapes are processed;
- heredoc bodies of every form, after any escape decoding.
Normalization is applied to the source after UTF-8 decoding and
before tokenization, so even structural elements like bare keys
see normalized input — a bare key written as decomposed café
(e + U+0301) becomes precomposed café before the bare-key
category check, and is accepted. Strings produced by escape
sequences (\uXXXX, \UXXXXXXXX) are additionally NFC-normalized
at the point the string is constructed, since escape-decoded scalars
don't pass through the source-level NFC pass. NFC does not salvage
non-scalar escapes — a surrogate escape is still a parse error.
NFC is stable under the Unicode Stability Policy: for any character already assigned in Unicode 4.1 (2005) or later, its NFC form does not change in newer Unicode versions. New characters assigned in later Unicode releases get new NFC mappings, however — so a port that delegates NFC to its host runtime would normalize a Unicode 16 codepoint differently from a port frozen at 15.1.
To keep documents byte-identical across ports and across time, NFC
tables are frozen at Unicode 15.1.0 and shipped with each port,
exactly like the bare-key set. A port must not delegate NFC to its
host runtime's Unicode library (Python's unicodedata.normalize,
ICU's unorm2, etc.) — those track whichever Unicode version the
runtime was built against and would silently diverge once new
codepoints are assigned. The two table generations (NFC + bare-key
set) are kept in lockstep on a single SPEC-controlled Unicode floor;
a future SPEC bump moves both together.
Bump policy. A future SPEC version may move the Unicode floor
from 15.1 to a higher release. When that happens, ports ship only
the new tables — the prior floor's tables are not retained, and
ports do not carry the cumulative union of historical UCD snapshots.
This is safe because both properties involved are monotonic under
Unicode's own stability guarantees: XID_Continue only grows
(characters are never removed from the identifier sets), and NFC
mappings of already-assigned characters never change. Consequently
every document valid under floor N decodes byte-identically under
floor N+k — a bump is purely additive (new accepted characters,
new NFC entries for newly assigned codepoints) and never breaks
pre-bump documents.
The duplicate-key check operates on NFC-normalized keys, so writing
café twice in the same table — once precomposed, once decomposed —
is a parse error rather than two distinct keys.
Round-trip. A decoder-emitter pair preserves the NFC value of every string, not the original source bytes. Non-NFC input becomes NFC on re-emit; emitters do not (and should not) reconstruct the original encoding.
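The effect of NFC on key comparison can be shown in a few lines. This sketch uses Python's `unicodedata.normalize` purely for illustration; as stated above, a conforming port must not delegate to the host library and would use its own tables frozen at Unicode 15.1.0. The helper names are hypothetical.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC. Illustration only: a real port uses frozen 15.1 tables."""
    return unicodedata.normalize("NFC", text)


def find_duplicate_key(keys):
    """Return the first NFC-normalized key that appears twice, or None."""
    seen = set()
    for key in keys:
        normalized = nfc(key)
        if normalized in seen:
            return normalized
        seen.add(normalized)
    return None


precomposed = "caf\u00e9"   # é as a single scalar U+00E9
decomposed = "cafe\u0301"   # e followed by combining acute U+0301
collision = find_duplicate_key([precomposed, decomposed])
```

Here `collision` is the precomposed form, since both spellings normalize to the same string and the second occurrence is therefore a duplicate key.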
The indent rule
Nesting is expressed by indentation. One rule:
Inside a single parent, all direct children must be indented by the same number of spaces. The first child sets the width; every subsequent sibling must match it exactly.
Different parents can pick different widths. A block ends when a line is encountered at an indent strictly less than its children's width.
a:
    b: 1      # a's children are 4-wide (first child set the width)
    c: 2      # must be 4
    d:
      e: 1    # d's children are 2-wide; independent of a's choice
      f: 2    # must be 2
g: 3          # back at the root level
Inconsistent sibling indent (e.g. b at 4 spaces then c at 3) is a decode
error with the column pointed at.
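One way to enforce the sibling-width rule is a stack of open indent levels. The sketch below is an assumption-laden simplification: `check_indent` is a hypothetical name, and it handles only plain `key:` / `key: value` lines with no comments, list markers, or heredocs. It returns the 1-based line and column of the first violation.

```python
def check_indent(source):
    """Return None if indentation is consistent, else (line, column) of the error."""
    stack = [0]               # absolute indent of each currently open block
    prev_opens_block = False  # did the previous line end with a bare ':'?
    for line_no, raw in enumerate(source.splitlines(), start=1):
        if not raw.strip():
            continue  # blank lines carry no structure
        indent = len(raw) - len(raw.lstrip(" "))
        if raw[indent] == "\t":
            return (line_no, indent + 1)  # tabs are banned in structural indent
        if indent > stack[-1]:
            # Deeper indent is only legal as the first child of a new block;
            # that first child sets the width for its siblings.
            if not prev_opens_block:
                return (line_no, indent + 1)
            stack.append(indent)
        else:
            # Close blocks until we reach a level at or below this indent.
            while stack[-1] > indent:
                stack.pop()
            if stack[-1] != indent:
                return (line_no, indent + 1)  # matches no open sibling width
        prev_opens_block = raw.rstrip().endswith(":")
    return None
```

With this sketch, a parent whose first child sits at 4 spaces rejects a second child at 3 spaces, while a nested parent remains free to choose a different width for its own children.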
Front matter
A DMS document may begin with an optional front matter block delimited
by +++ lines. If present, the block must precede any other content
(blank lines, line comments, and block comments may appear before +++).
+++
app_name: "myservice"
doc_version: "1.2.3"
updated: 2026-04-23
+++
# the actual document body starts here
database:
  host: "db.internal"
  port: 5432
Rules:
- Open/close delimiters: each `+++` must appear on its own line and start at column 1 — no leading whitespace. Trailing whitespace after the `+++` is permitted and ignored. Any other content on the line (a comment, a key, anything beyond whitespace) is a parse error.
- Unterminated front matter is a parse error. If an opening `+++` appears but no closing `+++` is found before end-of-file, the decoder must reject with a clear error pointing at the opener line.
- Empty front matter is allowed. A `+++ \n +++` block with no content between the delimiters decodes as a present-but-empty front-matter table (encoder shape: `{ "_meta": {}, "_body": <body> }`). This is distinct from "no front matter at all," which omits the `_meta` wrapper entirely.
- Position: the opening `+++` must be the first significant line (blank lines and comments before it are fine). A `+++` line appearing after any non-comment content is a plain syntax error — it is not recognized as a front matter opener.
- Contents: inside the block, content decodes as an ordinary DMS table (arbitrary nesting, all scalar types, flow/block forms, heredocs, comments — everything tier 0 supports).
- Reserved prefix: every key inside the front matter that starts with `_` (U+005F LOW LINE) is reserved for DMS's use. Users must not introduce their own keys with a leading `_`; doing so is a parse error. Unknown reserved keys (a leading-underscore key the decoder doesn't recognize) are also a parse error — this lets future DMS versions add new reserved keys and still give old decoders a clean error message.
- User keys: any key not starting with `_` belongs to the author. The decoder surfaces these as document metadata (see API shape below) and otherwise does not interpret them.
- Front matter itself is tier 0. Every tier-0 decoder must be able to recognize the `+++` delimiters, decode the contents as a table, enforce the reserved-prefix rule, and act on the reserved keys defined below.
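The delimiter rules above can be sketched as a small scanner that separates the raw front-matter text from the rest of the source. This is a hedged illustration: `split_front_matter` and `is_delimiter` are hypothetical names, leading trivia is reduced to blank lines (comments are not handled), and a `+++` line inside a front-matter heredoc is not considered.

```python
def is_delimiter(line):
    """True for a '+++' line at column 1 with only trailing whitespace."""
    if not line.startswith("+++"):
        return False  # leading whitespace or other content: not a delimiter
    if line[3:].strip():
        raise ValueError("content after +++ delimiter is a parse error")
    return True


def split_front_matter(source):
    """Return the raw front-matter text, '' if present but empty, None if absent."""
    lines = source.splitlines()
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1  # skip leading blank lines (simplified trivia handling)
    if i == len(lines) or not is_delimiter(lines[i]):
        return None  # no front matter at all
    opener_line = i + 1  # 1-based, for the error message
    for j in range(i + 1, len(lines)):
        if is_delimiter(lines[j]):
            return "\n".join(lines[i + 1:j])
    raise ValueError(f"unterminated front matter opened at line {opener_line}")
```

Returning `None` versus an empty string keeps "no front matter" distinguishable from "present but empty", which is the distinction the rules above require.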
Currently-defined reserved keys
| Key | Type | Meaning |
|---|---|---|
| _dms_tier | non-negative int | Declares the minimum decoder tier required. Absent ⇒ tier 0 implied. _dms_tier: 0 is the explicit tier-0 form. _dms_tier: 1 opts into tier 1 (see TIER1.md) — a tier-0-only decoder rejects with the tier-1-pointing error described below; a tier-1-capable decoder accepts. _dms_tier: N for N ≥ 2 is a parse error in this revision (no tier ≥ 2 is currently defined). A value of any other type — string ("0"), float (0.0), bool, list, table, datetime — is a parse error: "_dms_tier must be a non-negative integer". |
Any other _-prefixed key inside the front matter is currently
reserved but undefined; a decoder encountering one must refuse
with "unknown reserved key: <name>". Reserved keys exist as a
forward-compatibility hook: future versions of DMS may add new
reserved keys, and old decoders will give a clean error message
rather than silently misinterpreting them.
Tier semantics
- A document with no front matter, or a front matter with `_dms_tier` absent or equal to `0`, is a tier 0 document. Every conforming decoder can read it.
- A document with `_dms_tier: 1` is a tier 1 document (see TIER1.md). Tier-1-capable decoders accept it; tier-0-only decoders must reject with the tier-1-pointing error described in §Reservations below.
- A document with `_dms_tier: N` for N ≥ 2 is currently a parse error. The `_dms_tier` key remains a forward-compatibility hook for future tiers; no tier ≥ 2 is defined today.
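The reserved-key and tier rules can be sketched as a validation pass over an already-decoded front-matter table. The names `resolve_tier`, `DmsParseError`, and `decoder_max_tier` are hypothetical, and the wording of the last two error messages is an assumption; only "unknown reserved key: <name>" and "_dms_tier must be a non-negative integer" come from the text above.

```python
KNOWN_RESERVED_KEYS = {"_dms_tier"}  # the only reserved key defined in this section


class DmsParseError(ValueError):
    """Hypothetical error type standing in for a decoder's parse error."""


def resolve_tier(front_matter, decoder_max_tier=0):
    """Validate reserved keys and return the document's tier (0 or 1)."""
    for key in front_matter:
        if key.startswith("_") and key not in KNOWN_RESERVED_KEYS:
            raise DmsParseError(f"unknown reserved key: {key}")
    if "_dms_tier" not in front_matter:
        return 0  # absent means tier 0
    tier = front_matter["_dms_tier"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(tier, bool) or not isinstance(tier, int) or tier < 0:
        raise DmsParseError("_dms_tier must be a non-negative integer")
    if tier >= 2:
        # Message wording is an assumption; the spec only says "parse error".
        raise DmsParseError(f"_dms_tier: {tier} is not defined in this revision")
    if tier > decoder_max_tier:
        # Message wording is an assumption; the spec calls for a tier-1-pointing error.
        raise DmsParseError("document requires tier 1; see TIER1.md")
    return tier
```

A tier-1-capable decoder would call the same function with `decoder_max_tier=1`, so the acceptance boundary is a property of the decoder rather than of the document.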
Decoding modes — full and lite
Independent of the tier axis, every conforming DMS decoder exposes two decoding modes: a full mode (default) and a lite mode (opt-in). Both modes share the same grammar, the same error diagnostics, and the same data tree — they differ only in how much round-trip metadata the decoder keeps.
| Aspect | Full mode (default) | Lite mode (opt-in) |
|---|---|---|
| Data tree (tables, lists, scalars) | produced | produced |
| Front matter (`_meta`) | produced | produced |
| Comment AST (leading / inner / trailing / floating) | produced | not produced — comments are lexed and discarded |
| `original_forms` (integer base, string form, heredoc form) | produced | not produced |
| Full-mode `encode()` (preserving round-trip) | supported | not supported — needs comments + `original_forms` |
| Lite-mode `encode()` (canonical-form emit) | supported | supported |
The grammar is identical in both modes. Lite mode does not relax error checking, does not skip front-matter validation, does not loosen Unicode normalization, does not change which inputs are accepted. It is the same decoder with two output channels turned off.
What lite mode is for. Read-only consumers — application
configs, CI pipelines, deploy scripts, sysctl-style readers — that
decode, extract values, and never re-emit the document. The
comment-AST and original_forms machinery exists to support
encode(); if you don't call encode(), those structures are
dead weight. Lite mode lets read-only callers skip the bookkeeping
and recover wall-clock time (reference benchmarks show roughly
1.5–2× on flat-table workloads; varies by port).
What lite mode is not. Lite mode is not a "permissive" mode, not a non-conforming subset, and not an alternative conformance level. A document that decodes in full mode decodes in lite mode and vice versa; a document that errors in full mode errors at the same character in lite mode. The two modes produce the same data tree.
encode() itself has two modes — full and lite — orthogonal to
the decode-side modes. Same name pattern, different concern. The
decode-side mode controls how much round-trip metadata the decoder
captures. The emit-side mode controls how much of that metadata
encode() re-emits.
| `encode` mode (input → output) | Comments | `original_forms` | Use case |
|---|---|---|---|
| Full (default) — preserving emit | re-emitted | re-emitted (hex/oct/bin/literal-string forms) | Round-trip a decoded file, hand-edited config writer |
| Lite — canonical emit | dropped | dropped (decimal ints, basic-quoted strings) | Generate DMS from in-memory data; bench/strip |
Lite-mode encode accepts any Document — full or lite. It
ignores comments and original_forms even when present, and emits
canonical form: decimal integers, basic-quoted strings, no comments,
emitter-default whitespace. The output is always valid DMS that
re-decodes to a data-equivalent Document.
Full-mode encode (the existing default) requires a full-mode-decoded
Document (or one constructed in code with the metadata fields
populated). A decoder that ships both encode modes MUST refuse
encode(lite_doc, mode=full) with a clean error ("full-mode emit
requires comments + original_forms; got a lite-mode Document").
encode(lite_doc, mode=lite) is always valid.
Round-trip stability (under §encode) is required only for
full-mode emit of a full-mode-decoded Document. Lite-mode emit
is canonical-form lossy by design — encode_lite(decode(src)) may
strip comments and re-render hex integers as decimal; that is the
intended behaviour, not a violation.
Conformance. Every conforming decoder MUST ship full-mode decode
AND full-mode encode. Lite-mode decode and lite-mode encode are
optional to ship; decoders that ship them MUST do so under the
contract above. A decoder that exposes only lite mode (either side)
is non-conforming.
Capability reporting. A decoder that ships lite mode advertises
it via a supports_lite_mode boolean on its capability surface.
Callers can probe before opting in.
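The encode-side contract (full-mode emit refuses a lite-mode Document, lite-mode emit accepts anything) can be sketched as a guard that runs before emission. The `Document` shape below is hypothetical: it assumes a lite-mode decode leaves `comments` and `original_forms` as `None`, which the spec does not mandate. Only the error text is taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Document:
    """Hypothetical decoded-document shape used only for this sketch."""
    body: Any
    comments: Optional[dict] = None        # None marks a lite-mode decode
    original_forms: Optional[dict] = None  # None marks a lite-mode decode


def check_encode_allowed(doc, mode="full"):
    """Raise if the requested encode mode cannot be served for this Document."""
    if mode == "lite":
        return  # canonical emit needs no round-trip metadata
    if mode != "full":
        raise ValueError(f"unknown encode mode: {mode!r}")
    if doc.comments is None or doc.original_forms is None:
        raise ValueError(
            "full-mode emit requires comments + original_forms; "
            "got a lite-mode Document"
        )


full_doc = Document(body={"port": 5432}, comments={}, original_forms={})
lite_doc = Document(body={"port": 5432})
```

An encoder built this way would call the guard first and only then walk the data tree, so the refusal happens before any output is produced.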
Unordered tables — optional opt-in (orthogonal axis)
A third, independent axis exists alongside full/lite: the table
ordering guarantee. Tier 0 makes insertion-order preservation a
default invariant — every conforming decoder ships an ordered mode and
the conformance corpus is checked against it. Some consumers don't
care: kubectl-style read-only loaders, monitoring agents, batch
processors that consume DMS, project to a few keys, and never
re-emit. For those callers, an unordered mode is allowed as an
optional opt-in.
| Aspect | Ordered (default) | Unordered (opt-in) |
|---|---|---|
| Iteration order over a decoded table | insertion-order | arbitrary — decoder may use a hash-only backing |
| Conformance corpus expected output | byte-stable | best-effort; equality compares structurally, not order |
| Full-mode `encode()` (round-trip) | supported | not supported — round-trip needs stable order |
| Lite-mode `encode()` (canonical) | supported | supported (emits in iteration order, no stability promise) |
API shape. A decoder that ships unordered mode exposes it via a
parallel entry point or a flag — decode_document_unordered(src),
decode(src, ignore_order=true), etc. The exact name is
language-specific. The CLI convention used by the conformance and
bench harnesses is --ignore-order.
Capability reporting. A decoder that ships unordered mode
advertises it via a supports_ignore_order boolean on its capability
surface. Callers probe before opting in.
Combinations. Unordered is orthogonal to full/lite — all four
combinations are conforming if the decoder ships them: (ordered,
full), (ordered, lite), (unordered, full), (unordered, lite).
The most useful pairing for read-only callers is (unordered, lite):
fastest decode, no comment AST, hash-only table backing.
Reference implementation note (informational, not normative). The
DMS Rust reference ships --ignore-order on the CLI surface with
spec-correct semantics, but at the time of writing the runtime backing
is still IndexMap-based (no measurable decode-speed win). The flag
is plumbed end-to-end so other ports can implement the
HashMap-backed fast path without API churn. Ports that DO swap to a
hash-only backing should advertise supports_ignore_order = true.
API shape. Language-specific. The general pattern is a
construction-time option or a parallel entry point — e.g.
decode(source, mode="lite") versus decode(source), or
decode_lite(source) versus decode(source). The spec does not
mandate the exact API name; it mandates the contract above.
Examples
Full-mode decode (default), with comments preserved:
doc = dms.decode(source) # full mode by default
doc.comments[("db", "port")] # leading + trailing AttachedComments
out = dms.encode(doc) # round-trips the source
Lite-mode decode (opt-in), no comment AST:
doc = dms.decode(source, mode="lite") # comments lexed and discarded
doc.body["db"]["port"] # data is identical to full mode
doc.comments # empty / absent
dms.encode(doc) # ERROR: round-trip requires full mode
dms.encode(doc, mode="lite") # OK — canonical emit, no metadata needed
Lite-mode emit on a full-mode Document — strip comments + canonicalise:
doc = dms.decode(source) # full-mode decode, comments captured
canonical = dms.encode(doc, mode="lite")
# `canonical` has no comments, decimal integers (even if source used 0xFF),
# basic-quoted strings (even if source used '...' literal). re-decodes to a
# data-equivalent Document.
API shape
A decoder must expose the decoded document as both a body value
(what the rest of the spec already defines) and a front matter
table (the decoded +++ block, or an empty/absent value if the
document had no front matter). Exactly how is language-specific; for
this implementation's conformance encoder (JSON output), the shape is:
- No front matter present: encoder output is the body as tagged JSON, identical to pre-front-matter behavior.
- Front matter present: encoder output is a JSON object
{ "_meta": <front-matter tagged>, "_body": <body tagged> }. Both subtrees use the standard tagged-JSON encoding.
This means every existing conformance test — none of which declare front matter — keeps its expected output unchanged. New tests that use front matter produce the wrapped form.
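The wrapping rule for the conformance encoder can be sketched in a few lines. `conformance_output` is a hypothetical name, and both arguments are assumed to already be in the tagged-JSON encoding, which this section does not define. `None` stands for "no front matter at all"; an empty dict stands for present-but-empty front matter.

```python
def conformance_output(body, meta=None):
    """Return the conformance-encoder shape for a decoded document."""
    if meta is None:
        # No front matter: emit the body unchanged, so existing tests are unaffected.
        return body
    # Front matter present (even if empty): emit the wrapped form.
    return {"_meta": meta, "_body": body}


plain = conformance_output({"port": 5432})
empty_meta = conformance_output({"port": 5432}, meta={})
```

Testing `meta is None` rather than truthiness matters here: an empty front-matter table must still produce the `_meta` wrapper.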
Examples
Explicit tier 0 declaration:
+++
_dms_tier: 0
+++
host: "db.internal"
port: 5432
User metadata:
+++
title: "Production config"
author: "ada@example.com"
updated: 2026-04-23
+++
database:
  host: "db.internal"
(Front matter surfaces as metadata.)
Tier 1 (rejected by a tier-0-only decoder):
+++
_dms_tier: 1
+++
host: "db.internal"
(Tier 1 is defined in TIER1.md. A tier-0-only decoder refuses with the tier-1-pointing error; a tier-1-capable decoder accepts.)
User tries a reserved key (parse error):
+++
_my_app_version: "1.0" # error: '_'-prefixed keys are reserved
+++
Front-matter-only decode
Every conforming decoder must expose a separate entry point that
decodes only the front-matter block and stops, skipping the body.
This exists for callers that need only the document's metadata —
config loaders checking _dms_tier, indexers harvesting user keys,
dispatchers choosing a downstream decoder — and would otherwise pay
the full decode cost for a few header lines.
Contract:
- Input: a DMS source (string or byte buffer).
- Output: the decoded front-matter table, or a language-specific empty/absent value when the document has no front matter at all (no opening `+++` after trivia). Present-but-empty front matter (`+++\n+++`) returns an empty table — distinguishable by the caller from "no front matter".
- Scope: the decoder scans leading trivia (blank lines, line and block comments), the opening `+++`, the front-matter contents, and the closing `+++`, then returns. Body bytes after the closer are not tokenized.
- Validation: every front-matter rule from §Front matter still applies — open/close on their own lines, unterminated front matter is a parse error, the `_`-prefix namespace is enforced, `_dms_tier` is type-checked, unknown reserved keys are rejected. Front-matter-only decode is not a permissive mode; it is the same grammar with an early stop.
- Mode: front-matter-only decode runs in lite mode — no comment AST, no `original_forms` inside the front matter. (Front-matter preservation through `encode` is a full-decode concern.)
- Errors: diagnostics inside the `+++ ... +++` block are byte-identical to a full decode. Errors that only manifest in the body (duplicate body keys, unterminated body heredoc) are not surfaced by this API; callers needing whole-document validation must call the full decoder.
API shape. Language-specific. Reference ports use
decode_front_matter(source) / decodeFrontMatter(source) per host
idiom. The CLI convention used by the conformance and bench harnesses
is --front-matter-only.
Conformance. Required at tier 0. Absence of this entry point is non-conformance; there is no capability flag.
Forward compatibility
DMS evolves by reserving syntactic and lexical real estate today so future versions can extend it without breaking existing documents. Reservations fall into two groups: declared (the spec explicitly names them) and implicit (the tier-0 grammar rejects them today, leaving the slot free for a future tier to define).
Declared reservations
- `_dms_tier` — `_dms_tier: 1` opts into tier 1 (defined in TIER1.md); tier-0-only decoders refuse cleanly with the tier-1-pointing error described below. `_dms_tier: N` for N ≥ 2 remains reserved for future tiers (see Tier semantics).
- Front-matter `_`-prefix keys — the entire `_`-prefix namespace inside `+++ ... +++` is reserved for DMS. Unknown reserved keys are a parse error, so future versions can introduce directives (a merge policy, a schema reference, etc.) and old decoders surface a clean error rather than silently misreading.
- Heredoc modifier names — unknown modifier identifiers are a decode error; new modifiers can land in later versions without ambiguity.
- `_trim` `where` flags — unknown flag characters are silently ignored, so new flags are forward-compatible. (Inverse policy from modifier names — chosen because flags are a bag-of-chars, not identifiers.)
- Lexical reservations — leading zeros on decimal integer literals, octal escape sequences (`\012`), and unknown backslash escapes are parse errors today, reserved against future definition.
Implicit reservations
The tier-0 grammar is strict: any token sequence not matched by an
explicit production is a parse error. The following positions are
currently rejected and are candidates for tier ≥ 1 extension. Tier-0
decoders must continue to reject them; a tier ≥ 1 document opts in
via _dms_tier.
- Post-inline-value, same line. The slot following an `inline_value` on a kvpair line, before the newline or trailing comment. Today `port: 5432 _example()` is a tier-0 parse error. Reserved as the natural attachment point for future post-value annotations (e.g. modifier-style transforms applied to non-heredoc values, of the same `ident(args)` shape used by heredoc modifiers).
- Post-root trailing content. Non-comment, non-blank tokens following a scalar-root value, or following the final value of a table/list root. Reserved as the attachment point for future whole-document annotations.
- Sigil tokens in value-positions. A reserved decorator sigil (`! @ $ % ^ & * | ~`) appearing where an `inline_value` would be expected — after `key:`, after `+`, in flow-array or flow-table element positions, at scalar root — is a tier-0 parse error. Reserved as the future location for tier-≥1 decoration prefixes.
These reservations do not commit DMS to ever populating these slots — they document where future extensions could land without breaking existing tier-0 documents.
Decoders SHOULD emit a tier-1-pointing error when they encounter a
reserved decorator sigil in any reserved slot ("decorator sigil
'_dms_tier: 1 and declare the dialect
in _dms_imports — see TIER1.md"), rather than a generic
"unexpected token."
Document root
A DMS document's root value is polymorphic — it can be a table, a list, or a scalar. The root type is determined by the first significant line (significant = not blank, not a line comment, not a block comment):
| First significant line begins with… | Root is a … |
|---|---|
| a key followed by `:` | table |
| `+` (list item marker) | list |
| any other value token (string, number, `"""`, …) | scalar |
| nothing (document is empty or only comments) | empty table (`{}`) |
Once the root type is committed, every subsequent top-level (column 0) line must match it:
- Under a table root, every column-0 line must be a `kvpair`.
- Under a list root, every column-0 line must be a `+` item.
- Under a scalar root, there must be no further significant lines.
A top-level line that violates the committed root type is a parse error.
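Root-type commitment can be sketched as a classifier over the first significant line. This is an approximation under stated assumptions: `root_kind` is a hypothetical name, the caller is assumed to have already skipped blank lines and comments, and the bare-key pattern uses Python's `\w` as a stand-in for the frozen bare-key character set defined in §Keys and scalars.

```python
import re

# A key (double-quoted, single-quoted, or bare) followed by ':' and then a
# space or end of line. '\w' only approximates the real bare-key set.
KEY_LINE = re.compile(r'''^(?:"(?:[^"\\]|\\.)*"|'[^']*'|[\w-]+):(?: |$)''')


def root_kind(first_significant_line):
    """Classify the document root from its first significant line (or None)."""
    if first_significant_line is None:
        return "empty table"  # empty document, or comments only
    line = first_significant_line.rstrip()
    if line == "+" or line.startswith("+ "):
        return "list"
    if KEY_LINE.match(line):
        return "table"
    return "scalar"
```

Once the kind is known, a decoder would check every later column-0 line against it and report a parse error on the first mismatch.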
Examples
Table root (the common case for config):
title: "production"
database:
  host: "db.internal"
List root:
+ name: "web1"
  ipv4: "10.0.0.1"
+ name: "web2"
  ipv4: "10.0.0.2"
Scalar root:
42
"""
A document whose entire value is a multi-line string.
"""
Keys and scalars
bare_key: 1 # bare key: letters, digits, underscore, dash
"quoted key": 2 # double-quoted: escapes processed
'quoted key': 3 # single-quoted: literal, every character as-is
résumé: 4 # Unicode letters are allowed in bare keys
42: 5 # numeric-looking bare keys are fine
"": 6 # empty string key — must be quoted
What counts as a bare key
A bare key is one or more characters, each drawn from:
- ASCII letters and digits (`A-Z`, `a-z`, `0-9`)
- ASCII underscore (`_`) and dash (`-`)
- Any character in the set XID_Continue ∖ Default_Ignorable_Code_Point ∖ Reserved-Emoji-Set (see §Lexical → Reserved emoji characters), as defined by the Unicode derived properties frozen at Unicode 15.1.0. Document encoding is UTF-8.
XID_Continue is the Unicode-standard "identifier continuation"
set — letters, digits, combining marks, and a curated handful of
joiners — and is what Python, Rust, and most modern languages use
for identifiers. Subtracting Default_Ignorable_Code_Point removes
invisibles such as zero-width joiners and variation selectors that
would otherwise let two visually-identical bare keys differ in
their byte content.
The accepted set is frozen, not host-derived. Each port ships
its own table generated once from the UCD 15.1 data files. A port
must not delegate this check to its host runtime's Unicode library
(Python's str.isidentifier(), ICU, etc.), because those track
whichever Unicode version the runtime was built against — meaning
the set of accepted bare-key characters would silently grow over
time as the host platform updates. Freezing the set at 15.1
guarantees that a document written today decodes identically a
decade from now, on every port, regardless of what new code points
Unicode introduces. A future SPEC version may bump the floor; until
then, ports re-emit their tables only on explicit SPEC instruction.
Keys that look like numbers (42), booleans (true), or other reserved
identifiers (inf, nan) are valid bare keys — the trailing :
disambiguates context. Every decoded key is a string, regardless of
whether it was written bare or quoted: 42: x produces the string key
"42".
A bare key may consist entirely of _ and/or - characters
(_:, -:, _-_: are all valid keys). The character-set rule is
positional, not compositional — there is no "must contain at least one
letter or digit" requirement.
Quoting
An empty string key must be quoted ("" or '') — a bare key requires
at least one character. Any key containing whitespace, :, #, {},
[], ", ', or . must also be quoted.
Separator whitespace
A : that terminates a key must be followed by a space (or end-of-line,
if the value is a child block). host:localhost is a parse error;
host: localhost is fine.
Duplicate keys
Two keys in the same table that, after decoding, produce the same string
are a parse error. This rule compares the final key strings — which
are NFC-normalized (see Unicode normalization)
— so "42" and 42 collide, "hello" and 'hello' collide, and a
key written as precomposed é collides with one written as e +
U+0301.
Key order
Key insertion order is preserved. A DMS decoder must expose each table as an insertion-ordered structure so that doc-to-doc diffs are stable and round-tripping a document emits keys in the same order they were written.
Block vs scalar values
A key can take its value in one of three shapes:
- Inline scalar: the value fits on the same line as the key.
    port: 5432
    name: "web1"
- Child block: the key ends with a bare : and the next non-blank line is indented further.
    database:
      host: "db.internal"
      port: 5432
- Heredoc: a triple-quote opener (""" or ''', with an optional label) follows the :; content starts on the next line and runs until the terminator. See Heredocs below.
A key with bare : and no indented block beneath it is a parse error. Use
[] or {} flow form for empty collections.
The three shapes are mutually exclusive: a key with an inline scalar (or a heredoc) cannot also have an indented child block beneath it. Source like
port: 5432
  child: 1     # parse error: inline value already given for `port`
is rejected — the decoder commits to the inline value on the port line
and then sees the deeper-indented child: as illegal indent (no parent
block was opened). To get a child block, drop the inline value:
port:
  child: 1
Lists
A list item is a line whose first non-whitespace character is +, followed
by a space and the item's content.
Mnemonic: read + as "push this onto the list." Each + line appends one
item to the enclosing list, the same way list.push(x) (JS), list.append(x)
(Python), or vec.push(x) (Rust) appends to an array — and the visual
column of the + plays the role of "which list" when lists are nested.
tags:
  + "web"        # scalar item
  + "frontend"
  + "public"
servers:
  + name: "web1"           # table item: first key sits on the + line
    ipv4: "10.0.0.1"
    disks:
      + mount: "/"
        size_gb: 100
      + mount: "/var"
        size_gb: 500
  + name: "web2"
    ipv4: "10.0.0.2"
Rules:
- Sibling + markers must be at the same column (they are direct children of the same parent; the indent rule applies).
- A table item's first key sits after + on the same line; sibling keys of that item must align with that first key's column, not with the +.
- A list item may also be empty-on-same-line and open a nested block on the next line:
```
matrix:
  +
    + 1
    + 2
  +
    + 3
    + 4
```
Comments
Required of every conforming DMS decoder. This is not a "preserving variant" of full mode: comment-AST attachment and round-trip preservation are part of the format definition. A decoder that drops comments in full mode, or that decodes them but can't reproduce them via encode (see §encode), does not conform. Lite mode (see §Decoding modes — full and lite) is an explicit, opt-in alternative that discards comments by design; that is a documented mode, not a preservation failure.
DMS preserves comments through decoding. Every comment in the
source — line comment (# or //), C-style block comment
(/* … */, nestable), or hash-block comment (###LABEL … LABEL,
### … ###) — is captured during decoding and attached to the
nearest neighbor in the value tree as a first-class AST node. This
is what makes a decode → modify → re-encode round-trip keep
comments at the right places in the output.
The contrast with most config formats is the load-bearing piece of DMS's design:
- JSON has no comments at all.
- TOML and YAML specs allow comments, but the default decode path of every mainstream decoder (tomli, tomlc99, toml-rs, @iarna/toml, BurntSushi/toml, PyYAML/CSafeLoader, libyaml, yaml.v3, js-yaml, YAML::XS) treats them as lexer trivia and discards them. The data tree exposed to your application has no record of where comments lived, so re-emitting that tree can't reproduce them: they're gone the moment you decoded.
- Some libraries preserve comments via a separate value type (ruamel.yaml for Python YAML, the toml-edit crate for Rust). These are language-specific add-ons that sit alongside the "normal" decoder; they're slower and incompatible with the ecosystem's main toolchain.
DMS bakes preservation into the format itself. Every reference
decoder (Rust, Go, C, Zig, Python, Node, Perl pure, Perl XS) returns
a Document whose comments are tracked at the same level as the
data, by spec.
The rest of this section defines the attachment rules and
round-trip contract. See §encode for the emitter side of the
round-trip.
Attachment rules
When the decoder encounters a comment, it attaches it to a value / kvpair / list-item / container node according to these rules.
Leading comment. A line whose only significant content is a
comment, immediately preceding a kvpair, list item, or block opener,
attaches to that following node as a leading comment — provided
there is no blank line between the comment and the node. Multiple
leading comments stack on the same node, in source order:
# server pool ← leading
# updated 2026-04-22 ← leading (also)
servers:
  + name: "web1"
Trailing comments. Comments on the same line as a value,
after that value, attach to the value as trailing comments.
Multiple trailing comments stack in source order — possible because a
/* ... */ block comment doesn't terminate the line. A # or //
line comment, if present, consumes the rest of the line and must
therefore come last:
port: 8080 # default ← trailing on `port`
retry: 3 /* aggressive */ /* see SLO */ ← two trailing block
token: "x" /* see vault */ # never log this ← block, then line
Inner comments. A /* ... */ block comment appearing between a
key's : and its value, or between a + and a list item's
content, attaches to that kvpair or list item as an inner comment.
Multiple inner comments stack in source order:
secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"
+ /* see runbook */ name: "web1"
The rule is positional — inner attaches whenever a /* ... */ appears
between the opening sigil (: or +) and the value, regardless of
whether the value is an inline scalar, a heredoc opener, or the
newline that opens a child block:
db: /* connection cluster */
  host: "db.internal"
  port: 5432
Inner comments are /* ... */ only. Line comments (#, //)
consume to end-of-line and ### ... ### block comments require the
opener on its own line, so neither can appear mid-token.
Floating comment. A comment that does not attach to a
following sibling — either because a blank line separates it from
the next sibling, or because the block closes before any sibling
appears — attaches to the enclosing container (table or list) as
a floating comment, in source order:
servers:
  + name: "web1"
  # the following block is currently disabled
  # restore by uncommenting   ← floating on `servers`
The blank-line rule is what disambiguates "section header" comments (meant to apply broadly) from "this-key" comments (meant to apply to the next key). Authors who want a comment to attach to the immediately following key just omit the blank line; authors who want a section header keep the blank line.
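A minimal sketch of the leading-versus-floating decision for own-line comments. It assumes a single flat block, only # and // line comments, and no indentation tracking; classify_own_line_comments is a hypothetical helper, not part of any DMS API.

```python
def classify_own_line_comments(lines):
    """Label each own-line comment 'leading' or 'floating' (flat block only)."""
    result = []
    pending = []  # comments still waiting for a following node
    for line in lines:
        text = line.strip()
        if text.startswith("#") or text.startswith("//"):
            pending.append(text)
        elif text == "":
            # A blank line cuts pending comments off from the next node.
            result.extend((comment, "floating") for comment in pending)
            pending = []
        else:
            # A kvpair or list item directly follows: the comments lead it.
            result.extend((comment, "leading") for comment in pending)
            pending = []
    # The block closed before any sibling appeared.
    result.extend((comment, "floating") for comment in pending)
    return result


source = [
    "# section header",
    "",
    "# about port",
    "port: 8080",
    "# orphan at end of block",
]
for comment, position in classify_own_line_comments(source):
    print(position, comment)
```

A real decoder would also record which node or container each comment attaches to; this sketch only shows how the blank line changes the classification.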
Block comments in leading / trailing / floating positions. The
block-comment forms (### ... ### / ###LABEL ... LABEL, and
/* ... */) follow the same attachment rules as line comments
when they occupy any of the leading / trailing / floating positions.
A block comment that occupies its own line-range and is followed by a
kvpair (no blank line) attaches as a leading comment; a block comment
on the same line as a value, after the value, attaches as trailing;
otherwise the floating logic applies. The inner position (above) is
the only position restricted to a single comment form (/* ... */).
For example, a trailing /* ... */:
x: 1 /* trailing C-block */
attaches as trailing on the kvpair for x and round-trips through
encode next to its value, exactly like a trailing # foo line
comment would.
Front matter comments. Comments inside the +++ ... +++ block
follow the same attachment rules, scoped to the front matter table.
Their leading / inner / trailing / floating attachments live alongside
the front matter user keys; they do not leak into the body.
Paths
The comment-attachment metadata and the round-trip original_forms
records both key into the document by path. A path is an ordered
sequence of segments, each one of:
- a table key — the decoded key string (always a string, see §Keys), which selects a value inside an enclosing table.
- a list index — a non-negative integer, which selects a value inside an enclosing list.
Conventions:
- The empty path [] denotes the document root.
- A path with first segment "__fm__" (a sentinel string key) denotes a node inside the front-matter table; the rest of the path is then resolved against the FM table the same way as the body.
- The encoder's tagged-JSON output uses the same path convention implicitly: a JSON object's keys are table-key segments, a JSON array's positions are list-index segments.
- Paths are not strings; they are typed sequences. Implementations may serialize them however they like internally (Rust's Vec<BreadcrumbSegment>, Python's tuple of str | int, JS's array of string | number, etc.), but the segment types are mandatory so a string key "1" and an index 1 never collide.
Examples:
| Source | Path of value "web1" |
|---|---|
| host: "web1" | ["host"] |
| db:<br>  host: "web1" | ["db", "host"] |
| + "web1" (list root) | [0] |
| servers:<br>  + name: "web1" | ["servers", 0, "name"] |
| +++<br>app: "web1"<br>+++<br>x: 1 | ["__fm__", "app"] |
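A minimal sketch of resolving a typed path against a decoded body, assuming tables are represented as dict and lists as list. The function name resolve is hypothetical; the point is that a str segment only ever indexes a table and an int segment only ever indexes a list, so "1" and 1 cannot collide.

```python
def resolve(root, path):
    """Follow a sequence of str (table key) and int (list index) segments."""
    node = root
    for segment in path:
        if isinstance(segment, bool):
            # bool is a subclass of int in Python; reject it explicitly.
            raise TypeError("path segments must be str or int")
        if isinstance(segment, str):
            if not isinstance(node, dict):
                raise TypeError(f"key segment {segment!r} used on a non-table")
            node = node[segment]
        elif isinstance(segment, int):
            if not isinstance(node, list):
                raise TypeError(f"index segment {segment!r} used on a non-list")
            if segment < 0:
                raise IndexError("list-index segments are non-negative")
            node = node[segment]
        else:
            raise TypeError("path segments must be str or int")
    return node


body = {"servers": [{"name": "web1"}], "1": "a string key"}
print(resolve(body, ("servers", 0, "name")))
print(resolve(body, ("1",)))
```

Resolving `(1,)` against the same body raises TypeError, because the root is a table and an index segment cannot select into it.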
What's stored
Each comment is captured as { content, kind }:
- content: the raw comment text including delimiters (# foo, // foo, /* foo */, ###NOTE\n...\nNOTE).
- kind: "line" or "block".
The decoder does not assign stable IDs, breadcrumbs, or position metadata beyond the leading / inner / trailing / floating classification. Comments are identified solely by their attachment — there is no cross-document identity. (Earlier drafts of the spec experimented with content-hash IDs as part of a richer modifier system; that mechanism has been removed and is not part of DMS today.)
Every position is a list of comments stored in source order.
Leading and floating may contain any mix of line and block comments;
trailing accepts any mix but a # / // line comment must come last
(it consumes the rest of the line); inner accepts only /* ... */
block comments.
Round-trip semantics
- Decode → no modification → re-encode preserves every comment at its attached node. (Whitespace within comment runs may be normalized by the emitter, e.g. spacing around #, but the content text is preserved.)
- Decode → modify → re-encode preserves comments on still-present nodes: if the node a comment is attached to remains in the tree after modification, the comment travels with it. Newly inserted nodes carry no comments; deleted nodes drop their attached comments along with the node itself.
- The JSON conformance encoder described elsewhere in this spec does not emit comments; it reflects decoded values only. Comment preservation matters when re-emitting DMS output (see §encode).
Worked example
Source:
# the database section
db:
  host: "localhost"
  # raised from 80 after the LB change in 2024-Q4
  port: 8080  # default for staging
  secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"
  # restore by uncommenting
  # debug: true
After decoding, Document.comments contains the entries below (paths
are breadcrumbs into the value tree). Every entry's value is a list
of {content, kind} records, in source order:
| Path | Position | Contents |
|---|---|---|
| ["db"] | leading | # the database section |
| ["db", "port"] | leading | # raised from 80 after the LB change in 2024-Q4 |
| ["db", "port"] | trailing | # default for staging |
| ["db", "secret"] | inner | /* see vault */, /* rotated 2026-04-01 */ |
| ["db"] | floating | # restore by uncommenting, # debug: true |
Now mutate the decoded tree — say, change port to 5432:
doc.body["db"]["port"] = 5432
encode(doc) emits:
# the database section
db:
  host: "localhost"
  # raised from 80 after the LB change in 2024-Q4
  port: 5432  # default for staging
  secret: /* see vault */ /* rotated 2026-04-01 */ "REDACTED"
  # restore by uncommenting
  # debug: true
The data changed; the comments stayed at the same nodes.
encode — DMS-output emitter contract
Every conforming decoder must ship encode(document). It has two
modes — full (default) and lite — orthogonal to the
decode-side mode (see §Decoding modes — full and lite). The mode
controls how much round-trip metadata the emitter re-emits.
Full mode (default) re-emits valid DMS source from a decoded
Document produced in full-mode decode. The contract is
data + comments + literal-form preservation under round-trip:
decode(encode(decode(source))) produces a Document that is data-equivalent to decode(source), has the same comments at the same attached paths, and uses the same literal forms for the values it can preserve (integer base, string form).
Lite mode emits the same data tree in canonical form:
comments are dropped, integers are emitted in decimal regardless of
source base, strings are emitted in basic-quoted form regardless of
source flavour, no original_forms consultation. Lite-mode emit
accepts both full-mode and lite-mode decoded Documents — the
metadata is simply ignored. Lite-mode emit is lossy by design for
comments / source forms; it preserves only the data tree.
The rest of this section specifies full-mode emit (where the preservation contract has teeth). Lite-mode emit follows the same data-tree rules but skips every "Required preservation" point below.
Byte-for-byte source preservation is not required (whitespace and indentation choices are emitter-determined; original column alignment is not preserved); semantic round-trip is.
Required preservations:
- Integer base. A decoder must record the source form of every integer literal (e.g. 0x1F40, 0o755, 0b1010_0110, 1_000_000, +42, -7). On emit, the integer is re-rendered in its original base, preserving sign and underscore separators where present. Stored as a side-channel original_forms keyed by the value's path in the tree (analogous to comment attachment); see API shape below.
- String form. A decoder must record which of the four string forms produced each decoded string: basic "...", literal '...', heredoc-basic """[LABEL]...LABEL (or """..."""), heredoc-literal '''[LABEL]...LABEL (or '''...'''). Heredoc records additionally include the label text (or its absence) and any heredoc modifiers applied (_trim(...), _fold_paragraphs(), in source order). On emit, the string is re-rendered in the same form.
- Comments. Every AttachedComment is re-emitted at its classified position (Leading / Inner / Trailing / Floating) on the still-present node at its path. Within a position, comments emit in their stored source order. Inner comments are emitted between the kvpair's : and its value (or between a list item's + and its content), separated from each other and from the surrounding tokens by single spaces (key: /* a */ /* b */ value) to keep round 2 byte-stable.
- Key insertion order. Already a tier-0 invariant; carries through encode.
Emitter-determined (not preserved):
- Float formatting: the shortest-decimal ryu shape is used uniformly. pi: 3.14 round-trips as 3.14, and e: 2.7182818284 re-emits the same digits; a literal carrying more digits than its binary64 value needs may re-emit as a shorter digit string. (Round-trip stability follows from ryu being canonical for any f64.)
- Indentation and whitespace. The emitter picks a consistent indentation (2 spaces is the recommended default; the contract doesn't pin a specific width) and consistent separator whitespace. Original column alignment is lost.
- Block vs flow form for containers. A list decoded as [1, 2, 3] may emit as the same flow form OR as block-form + 1 / + 2 / + 3; similarly for tables. Emitters MAY track and preserve the original form (recommended for tight round-trip), but the contract requires only data+comments+literal-form equivalence.
Front matter: an emitter omits the +++ ... +++ block entirely
when meta is empty (or absent) and no comments are attached
to the front matter. Otherwise the block is emitted with its keys
and any front-matter comments at their attached paths.
API shape (language-specific naming):
encode(document) -> str // full mode (default)
encode(document, mode=lite) -> str // lite mode (canonical-form emit)
encode_lite(document) -> str // alternative lite-mode entry point
Document {
meta, body, comments,
original_forms // sparse map: path -> {integer-lit | string-form-record}
}
The exact API name is language-specific — a parallel entry point
(encode_lite) and a mode parameter (encode(doc, mode="lite"))
are both conforming. The contract is what matters.
comments and original_forms are populated during full-mode
decoding and consulted only by full-mode encode. Lite-mode encode
ignores both fields. The conformance JSON encoder ignores them too.
Round-trip stability (full mode only):
encode(decode(encode(decode(source)))) must produce the same
string as encode(decode(source)) — i.e. the second round-trip is
byte-for-byte stable. This is the test condition for the
round-trip-comments fixture corpus. Lite-mode emit has no
round-trip stability requirement (it's canonical-form, lossy on
comments / source forms by design); it does have a data-stability
requirement: decode(encode_lite(doc)) must be data-equivalent to
doc.
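A minimal sketch of the second-round stability check, written against any decode/encode pair passed in as callables. The toy codec below is purely illustrative (it handles only flat `key: value` lines and is not DMS); all function names are hypothetical.

```python
def is_round_trip_stable(decode, encode, source):
    """True when encode(decode(encode(decode(source)))) == encode(decode(source))."""
    first = encode(decode(source))
    second = encode(decode(first))
    return first == second


def toy_decode(source):
    # Illustrative stand-in for a decoder: flat "key: value" lines only.
    pairs = []
    for line in source.splitlines():
        if line.strip():
            key, _, value = line.partition(":")
            pairs.append((key.strip(), value.strip()))
    return pairs


def toy_encode(pairs):
    # Emits one normalized "key: value" line per pair.
    return "".join(f"{key}: {value}\n" for key, value in pairs)


def drifting_encode(pairs):
    # Deliberately broken emitter: appends an extra key on every emit.
    return toy_encode(pairs + [("n", str(len(pairs)))])


print(is_round_trip_stable(toy_decode, toy_encode, "a:   1\nb:2"))
print(is_round_trip_stable(toy_decode, drifting_encode, "a: 1"))
```

The first call normalizes whitespace on round one and then stays fixed; the second keeps growing, which is the kind of drift the fixture corpus is meant to catch.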
Line comments
Two interchangeable forms, both running from their opener to the end of the line:
- # (shell / YAML / TOML style).
- // (C / JS / Rust style).
Either form must be preceded by whitespace or start the line; key: 5#x
and key: 5//x are parse errors (the 5 runs into the sigil). Mixing
styles in one document is allowed — this is a spec-level convenience,
not a style rule. Linters may pick a canonical form.
No leading space is required after the sigil. #foo and # foo
are both valid line comments; //foo and // foo are both valid.
Style preferences (e.g., requiring a space) are a linter concern, not
a decoder one.
# full-line comment
// also a full-line comment
port: 5432 # trailing comment
port: 5432 // same
Block comments
Three forms, all equivalent in semantics (the contents are never decoded as DMS; the raw text is captured as a comment, as described under Comments):
| Form | Terminator |
|---|---|
| /* ... */ | next unmatched */ (nested /* */ allowed) |
| ### ... ### | line whose trimmed content equals ### |
| ###LABEL ... LABEL | line whose trimmed content equals LABEL |
/* ... */ (C-style). Nesting is supported: every /* opens a new
level and every */ closes the innermost. The decoder only returns to
code mode when the nesting count reaches zero. Can appear inline or
span multiple lines:
port: /* inline */ 8080
/* this /* nested */ is fine */
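A minimal sketch of the nesting count for C-style block comments, assuming the caller already knows where the opening /* sits. find_block_comment_end is a hypothetical helper name.

```python
def find_block_comment_end(text, start):
    """Return the index just past the */ that closes the /* at text[start]."""
    if not text.startswith("/*", start):
        raise ValueError("no block comment opener at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # every /* opens a new level
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # every */ closes the innermost level
            i += 2
            if depth == 0:
                return i        # back in code mode
        else:
            i += 1
    raise ValueError("unterminated block comment")


line = "/* this /* nested */ is fine */ tail"
end = find_block_comment_end(line, 0)
print(repr(line[:end]))
```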
###LABEL and ### (heredoc-shaped). Opener and terminator must
each be on their own line (trimmed to match). Useful when content
contains unbalanced */ — the labeled form lets you pick any
terminator word. The label follows heredoc label rules
([A-Za-z_][A-Za-z0-9_]*) and sits directly after ### with no
whitespace.
###NOTE
The alerts below are owned by the SRE team.
Escalation policy: see runbook.md.
Any text, any syntax, no escaping — even raw */ survives.
NOTE
###
Short block comment, no label. Closed by another ### line.
###
/*
Same idea, C-shape. Nests with /* ... */.
*/
alerts:
  primary: "pager"
A block comment may appear anywhere a blank line or a line comment would be valid.
Types
Strings
basic: "hello\tworld" # escapes processed
literal: 'C:\Users\ada' # no escapes, every character taken literally
Escape sequences (basic strings)
Inside "..." basic strings, the following backslash escapes are
recognized. Any other backslash sequence is a parse error.
| Escape | Decoded character | Unicode scalar |
|---|---|---|
| \" | quotation mark | U+0022 |
| \\ | reverse solidus (backslash) | U+005C |
| \b | backspace | U+0008 |
| \f | form feed | U+000C |
| \n | line feed (LF) | U+000A |
| \r | carriage return | U+000D |
| \t | character tabulation (tab) | U+0009 |
| \uXXXX | Unicode scalar U+XXXX (BMP scalars only) | U+0000..U+FFFF, excluding U+D800..U+DFFF |
| \UXXXXXXXX | Unicode scalar U+XXXXXXXX (full range) | U+0000..U+10FFFF, excluding U+D800..U+DFFF |
Rules and edge cases:
- Hex digits in \uXXXX and \UXXXXXXXX are case-insensitive.
- \uXXXX must consume exactly four hex digits; fewer is a parse error. \UXXXXXXXX must consume exactly eight.
- Surrogates are not scalars. Any escape in U+D800..U+DFFF is a parse error (the decoder reports "unicode escape is not a scalar value"). DMS does not recognize UTF-16 surrogate pairs; to encode a character above the BMP via an escape, use \UXXXXXXXX directly. So 😀 may be written literally as 😀 (UTF-8 source) or as \U0001F600, but not as the surrogate-pair escape \uD83D\uDE00.
- \UXXXXXXXX values must be ≤ U+10FFFF (the Unicode maximum); a larger value is a parse error.
- No \xXX byte escape: DMS strings are Unicode scalar sequences, not byte sequences. To embed a non-ASCII character, write it literally (UTF-8 source) or use \uXXXX / \UXXXXXXXX.
- No \0 null escape, and no raw NUL in source. U+0000 is not expressible, neither via an escape (no \0) nor as a raw byte anywhere in the source (the decoder rejects U+0000 in input with a parse error before lexing begins). Use binary-safe encodings (e.g. base64) outside DMS if you need raw bytes.
- No \' single-quote escape: single quotes don't need escaping in basic strings, and basic strings don't terminate on them.
- No octal escapes (e.g. \012), for forward compatibility.
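A minimal sketch of the escape rules above, applied to the text between the quotes of a basic string. decode_basic_escapes is a hypothetical name; the sketch skips NFC normalization, and rejecting \u0000 is this sketch's reading of "U+0000 is not expressible", not something the escape table states.

```python
SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t",
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_basic_escapes(body):
    """Decode backslash escapes in a basic-string body; ValueError on bad input."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[i + 1]
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
        elif kind in ("u", "U"):
            width = 4 if kind == "u" else 8
            digits = body[i + 2 : i + 2 + width]
            if len(digits) != width or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"\\{kind} needs exactly {width} hex digits")
            code = int(digits, 16)
            if 0xD800 <= code <= 0xDFFF:
                raise ValueError("unicode escape is not a scalar value")
            if code > 0x10FFFF:
                raise ValueError("unicode escape exceeds U+10FFFF")
            if code == 0:
                # Assumption: U+0000 is rejected here as well.
                raise ValueError("U+0000 is not expressible")
            out.append(chr(code))
            i += 2 + width
        else:
            # Covers \x, \0, \', octal, and anything else.
            raise ValueError(f"unknown escape \\{kind}")
    return "".join(out)


print(repr(decode_basic_escapes(r"hello\tworld")))
print(decode_basic_escapes(r"\U0001F600") == "\U0001F600")
```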
Literal strings ('...') process no escapes — every character
between the delimiters is taken as-is, then the resulting scalar
sequence is NFC-normalized like any other DMS string (see
Unicode normalization). Literal strings
cannot contain their own delimiter ('); use a heredoc for content
that mixes both quote kinds.
Strings never span lines. For multi-line text, use a heredoc.
Heredocs
Two quote flavors, optional label. Four forms total:
| Form | Escapes | Terminator (trimmed line content equals) |
|---|---|---|
| """LABEL | yes | LABEL |
| """ | yes | """ |
| '''LABEL | no | LABEL |
| ''' | no | ''' |
- A label is [A-Za-z_][A-Za-z0-9_]* and sits directly after the opening triple quote with no whitespace between them. Labels are case-sensitive and have no case requirement: EOF, eof, and End_of_File are all valid. Uppercase-by-convention is a style choice for linters, not a decoder rule.
- Content begins on the line after the opener and runs until the first line whose trimmed content equals the terminator above.
- Heredoc bodies do not participate in the outer indent rule; the decoder suspends indent checking until the terminator.
- Indent strip is always on. The column of the terminator's first non-whitespace character sets the strip depth; that many leading whitespace characters are removed from every non-blank content line.
- Body construction. Every content line, blank or non-blank, contributes its (indent-stripped) text to the value. Lines are joined by a single \n; there is no implicit terminator after the final line. A body of N content lines therefore contains exactly N − 1 newlines from the join.
- Line endings normalize to \n. Source CRLF and bare LF both produce \n in the value, regardless of how the surrounding document is line-terminated. Heredoc bodies never emit \r from raw source bytes; to embed a literal CR, use \r in a basic heredoc.
- Blank lines (lines containing only whitespace, including empty lines) are exempt from the strip-depth check; their contributed text is the empty string. Trailing newlines in the value come from blank lines acquiring \n separators on either side via the join, not from a blank line emitting a \n of its own.
- Non-blank content lines whose indent is less than the strip depth are a parse error.
- Trailing newlines: user-controlled. Each trailing blank line adds exactly one \n to the value (it contributes "" and gains a separator from the join). To strip trailing newlines, use _trim("\n", ">") (see Modifiers).
- To preserve leading whitespace verbatim, place the terminator at column 0 so strip depth is zero.
- Terminator column is independent of the surrounding indent. The terminator may appear at any column ≤ the smallest non-blank body indent, regardless of how deeply the heredoc's enclosing block is nested. The heredoc closes when its trimmed line content equals the terminator string; the column where that happens does not have to align with anything in the outer block. After the terminator line, the next non-blank line is interpreted under the surrounding block's indent rule, not the terminator's column.
config:
  long_text: """EOF
    line one
    line two
EOF                    # ← terminator at column 0, but the parent
  next_key: 1          # block at indent 2 continues uninterrupted
This pattern is the canonical way to embed left-aligned multi-line text from inside an indented context: place the terminator at the outermost column you want stripped (often column 0), and the body's indentation is preserved verbatim.
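A minimal sketch of the indent-strip and join rules above, assuming the content lines have already been split out (without line terminators) and the strip depth has been read from the terminator's column. build_heredoc_body is a hypothetical name; escapes, line continuation, and modifiers are out of scope here.

```python
def build_heredoc_body(raw_lines, strip_depth):
    """Strip strip_depth leading whitespace chars per line, then join with \\n."""
    parts = []
    for line in raw_lines:
        if line.strip(" \t") == "":
            # Blank lines skip the strip-depth check and contribute "".
            parts.append("")
            continue
        indent = len(line) - len(line.lstrip(" \t"))
        if indent < strip_depth:
            raise ValueError("content line indented less than the strip depth")
        parts.append(line[strip_depth:])
    # N lines joined by "\n" gives exactly N - 1 newlines; no final terminator.
    return "\n".join(parts)


print(repr(build_heredoc_body(["  line 0", "  line 1", "", "", ""], 2)))
print(repr(build_heredoc_body(["    line one", "    line two"], 0)))
```

With a strip depth of zero (terminator at column 0), the second call keeps every leading space, which matches the column-0 pattern described above.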
Examples (each row is a heredoc body after indent-strip):
| Content lines | Value |
|---|---|
| ["line 0"] | "line 0" |
| ["line 0", ""] | "line 0\n" |
| ["line 0", "line 1"] | "line 0\nline 1" |
| ["line 0", "", "line 1"] | "line 0\n\nline 1" |
| ["line 0", "line 1", "", "", ""] | "line 0\nline 1\n\n\n" |
| [] (heredoc with no content lines) | "" |
Labels allow content to contain the triple-quote sequence. Without a
label, the first line whose trimmed content is the triple-quote itself
closes the heredoc — a simpler form for content that can't contain
""" or '''.
Line continuation (escape-on only)
In a """ heredoc, a \ that is the last non-whitespace character on
a line is a line continuation. The backslash, any trailing
whitespace on its line, the line terminator, and any leading whitespace
on the next non-blank line are all consumed — the two lines splice into
one. Applies before modifiers run.
prose: """EOF
The quick brown \
fox jumps over \
the lazy dog.
EOF
# → "The quick brown fox jumps over the lazy dog."
This gives per-line author control over where a newline is kept versus
chomped, which the global modifiers (_fold_paragraphs, _trim) can't
do at the character level.
Details and edge cases:
- Only """ (escapes on) supports line continuation. Inside ''', \ is a literal backslash; no line splicing happens.
- \\ at end of line is a literal backslash followed by a newline, not a continuation: the first \ escapes the second. The rule fires only on an unescaped trailing \.
- Trailing whitespace after the \ on the same line is allowed and consumed.
- Continuation consumes blank lines too: foo \ followed by a blank line and then bar produces foo bar.
- A line that is only \ (followed by a newline) splices to the next line.
- \ as the very last character before the terminator line (with no following content) is a parse error: there is nothing to splice to.
Modifiers
One or more modifiers may follow the opener (and label, if present), separated by whitespace. Every modifier is written in function-call form: an identifier followed by a parenthesized argument list. The parentheses are required even when the argument list is empty.
Modifiers work with both labeled and unlabeled heredocs. Whitespace rules (consistent with the rest of the language):
- A label, if present, attaches directly to the opener with no whitespace between them: """EOF, '''END.
- A modifier requires whitespace before it: """ foo(), """EOF foo().
Example:
sql: """EOF _trim("\n", ">")
SELECT id, name
FROM users
EOF
# → "SELECT id, name\nFROM users" (trailing newline stripped)
Disambiguation follows from those two rules combined with the
"modifiers always have parens" rule: a bare identifier touching the
opener is a label; an identifier with (...) preceded by whitespace is
a modifier. """foo() (no whitespace, has parens) is a parse error;
write """ foo() if you mean modifier-with-no-label, or """foo bar()
if you mean label-plus-modifier.
Modifiers run left-to-right, after indent-strip, before the value is returned. Each modifier operates on whatever the previous step produced.
The two standard modifiers
| Modifier | Effect |
|---|---|
| _trim(chars, where, replacement = "") | Find runs of characters matching chars, at positions selected by where, and replace each run with replacement. The Swiss-army content shaper: covers strip, chomp, and interior replace. |
| _fold_paragraphs() | Collapse runs of non-blank lines within a paragraph into space-joined single lines; blank-line paragraph breaks stay as a single \n. |
Unknown modifiers are a parse error. Argument types must match the
signatures; wrong types error at decode time (e.g. _trim(42, "*")).
_trim(chars, where, replacement = "")
- chars (string): a bag of characters to match. Matching is per-character; chars is not interpreted as a regex or a substring. "\n" matches newlines only; " \t" matches either spaces or tabs; " \t\n\r" matches any standard whitespace. An empty chars ("") is a no-op: nothing matches, so the body is returned unchanged.
- where (string): a DSL of position flags. Unknown characters in the flags string are silently ignored, so future flags will be forward-compatible.

| Flag | Meaning |
|---|---|
| < | Leading edge of the whole string (the first run of matching chars, if any). |
| > | Trailing edge of the whole string (the last run of matching chars, if any). |
| \| | Per-line edges: leading + trailing runs on every line, considered independently. |
| * | Every occurrence, anywhere (interior runs too). Subsumes <, >, \| when present. |

  Flags combine ("<|>" = leading + per-line + trailing). * in combination with other flags still means "everywhere"; the others become redundant.
- replacement (string, optional, default ""): what each matched run becomes. "" means strip; ", " means join-with-comma; "\n" means collapse-run-to-single-newline.
Run collapse. A consecutive run of matching characters becomes one
replacement, not one per char. So _trim("\n", "*", ", ") applied to
"a\n\nb" produces "a, b", not "a, , b".
Common recipes:
| Intent | Call |
|---|---|
| Strip trailing newlines | _trim("\n", ">") |
| Ensure exactly one trailing newline | _trim("\n", ">", "\n") |
| Ensure exactly three trailing newlines | _trim("\n", ">", "\n\n\n") |
| Join all lines with a comma-space | _trim("\n", "*", ", ") |
| Concatenate lines (no separator) | _trim("\n", "*") |
| Full whitespace-trim (both ends) | _trim(" \t\n\r", "<>") |
| Trim leading whitespace only | _trim(" \t\n\r", "<") |
| Strip per-line indentation remnants | _trim(" \t", "\|") |
| Tabs → spaces everywhere | _trim("\t", "*", " ") |
| Collapse all runs of whitespace to single space | _trim(" \t\n\r", "*", " ") |
The third argument is required only when you want to replace with something other than empty. All of the strip-style uses can omit it.
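A minimal sketch of _trim under one reading of the rules above. dms_trim is a hypothetical name; applying the per-line flag before the whole-string edge flags, and treating "\n" as the line separator for the per-line flag, are assumptions of this sketch rather than statements from the spec.

```python
def dms_trim(text, chars, where, replacement=""):
    """Replace runs of characters from `chars` at the positions named by `where`."""
    if not chars:
        return text  # empty bag: nothing matches
    bag = set(chars)
    flags = set(where)  # unknown flag characters are simply never consulted

    def replace_all_runs(s):
        out, in_run = [], False
        for ch in s:
            if ch in bag:
                if not in_run:          # one replacement per run, not per char
                    out.append(replacement)
                    in_run = True
            else:
                out.append(ch)
                in_run = False
        return "".join(out)

    def replace_edges(s, leading, trailing):
        start, end = 0, len(s)
        while leading and start < end and s[start] in bag:
            start += 1
        while trailing and end > start and s[end - 1] in bag:
            end -= 1
        head = replacement if start > 0 else ""
        tail = replacement if end < len(s) else ""
        return head + s[start:end] + tail

    if "*" in flags:
        return replace_all_runs(text)
    if "|" in flags:
        text = "\n".join(replace_edges(line, True, True) for line in text.split("\n"))
    return replace_edges(text, "<" in flags, ">" in flags)


print(repr(dms_trim("a\n\nb", "\n", "*", ", ")))
print(repr(dms_trim("SELECT id\nFROM users\n\n", "\n", ">")))
```

Note that under this reading a replacement is emitted only where a matching run actually exists at the selected position.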
_fold_paragraphs()
Collapses non-blank-line runs into space-joined single lines, preserving
blank-line paragraph breaks as single \ns. Not expressible as trim
because it's a structural transform on paragraphs (two-level: lines
within paragraphs, paragraphs within a body), not character-level
replacement. Takes no arguments. Fine to combine with _trim(...).
Mapping to YAML block scalars — every YAML block scalar mode is expressible as a modifier combination:
| DMS | YAML |
|---|---|
| (default — no modifier needed) | \|+ |
| _trim("\n", ">", "\n") | \| |
| _trim("\n", ">") | \|- |
| _fold_paragraphs() | >+ |
| _fold_paragraphs() _trim("\n", ">", "\n") | > |
| _fold_paragraphs() _trim("\n", ">") | >- |
Modifiers stack, applied left-to-right:
csv: """EOF _trim("\n", "*", ", ") _trim(" \t", "<>")
alpha
beta
gamma
EOF
# → "alpha, beta, gamma"
summary: """EOF _fold_paragraphs() _trim("\n", ">", "\n")
First paragraph line one
first paragraph line two.

Second paragraph line one
second paragraph line two.
EOF
# → "First paragraph line one first paragraph line two.\nSecond paragraph line one second paragraph line two.\n"
Escape hatch for triple-quote content
If the body may contain a line whose trimmed content is the triple-quote opener, the unlabeled form will close early. Use the labeled form instead — labels exist precisely for this case:
doc: """END
my_string = """
"""
END
The two forms differ in their fallback when labels aren't used:
- """ (escapes on): you can write \"\"\" on a content line to smuggle in a literal """ without tripping the terminator. Ugly; a label is almost always cleaner.
- ''' (literal, no escapes): there is no way to include a line whose trimmed content is ''' in an unlabeled body. Use a label.
sql: """END
SELECT id, name
FROM users
WHERE active = true
END
regex: '''
^\d{4}-\d{2}-\d{2}$
'''
note: """
First paragraph.

Second paragraph after a blank line.
"""
ascii_art: '''
 /\_/\
( o.o )
 > ^ <
'''
In the last example, the terminator sits at column 0, so strip depth is zero and every leading space in the art is preserved.
Integers
dec: 1_000_000 # underscores allowed between digits
hex: 0xDEAD_BEEF
oct: 0o755
bin: 0b1010_0110
neg: -42
Integers are 64-bit signed; values outside [-2^63, 2^63 - 1] are a parse error.
Leading zeros on decimal literals are a parse error (reserved for future
use). Underscores must be between two digits (never at the start, end,
or adjacent to the base prefix or sign).
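The integer rules above can be checked with a small Python sketch. It is not a reference decoder: `parse_int` is a hypothetical name, and the sketch assumes that hex digits may be either case, that only a leading "-" sign is accepted, and that the sign may precede a base prefix.

```python
import re

# Optional "-", then a prefixed literal or a decimal without leading zeros.
# Underscores only match between two digits.
_INT_RE = re.compile(
    r"(?P<sign>-?)(?:"
    r"0x(?P<hex>[0-9A-Fa-f]+(?:_[0-9A-Fa-f]+)*)|"
    r"0o(?P<oct>[0-7]+(?:_[0-7]+)*)|"
    r"0b(?P<bin>[01]+(?:_[01]+)*)|"
    r"(?P<dec>0|[1-9][0-9]*(?:_[0-9]+)*)"
    r")"
)


def parse_int(text: str) -> int:
    m = _INT_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid integer literal: {text!r}")
    value = 0
    for base, name in ((16, "hex"), (8, "oct"), (2, "bin"), (10, "dec")):
        digits = m.group(name)
        if digits is not None:
            value = int(digits.replace("_", ""), base)
            break
    if m.group("sign") == "-":
        value = -value
    # Range check for 64-bit signed integers.
    if not -2**63 <= value <= 2**63 - 1:
        raise ValueError(f"integer out of 64-bit signed range: {text!r}")
    return value
```

With this sketch, `parse_int("0o755")` gives 493, while "007", "1_", and "0x_FF" are all rejected.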
Floats
pi: 3.14159
avog: 6.022e23
small: 1.5e-10
inf_p: +inf
inf_n: -inf
nan: nan
IEEE 754 binary64. inf and nan are keywords, not identifiers.
Decimal floats require at least one digit on each side of the
decimal point. 1. and .5 are parse errors; write 1.0 and 0.5.
The exponent (e/E) is optional; if present, it's a signed decimal
integer.
Non-decimal floats use 0x / 0o / 0b prefixes and a binary
exponent marker p (mandatory — it's what distinguishes a non-decimal
float from a non-decimal integer). The value is mantissa × 2^exponent,
where the mantissa is the base-N number's value.
hex_f: 0x1.8p3 # 1.5 × 2^3 = 12.0
hex_int: 0xFp0 # 15.0 (no dot, but p makes it a float)
oct_f: 0o1.4p3 # (1 + 4/8) × 2^3 = 12.0
bin_f: 0b1.1p3 # 1.5 × 2^3 = 12.0
neg_e: 0x1p-3 # 0.125
The digit-on-both-sides rule applies to every form: 0x1.p3 and
0x.8p3 are parse errors. Underscore separators are allowed in the
mantissa but not in the exponent.
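The mantissa × 2^exponent rule can be illustrated with a short Python sketch. `parse_based_float` is a hypothetical helper; it assumes an optional leading sign and either-case hex digits, and it makes no attempt at correct rounding for mantissas longer than binary64 can hold.

```python
import math
import re

# Base and digit class for each prefix letter.
_BASES = {"x": (16, "0-9A-Fa-f"), "o": (8, "0-7"), "b": (2, "01")}


def parse_based_float(text: str) -> float:
    # The "p" exponent marker is mandatory; the exponent takes no underscores.
    head = re.fullmatch(r"([+-]?)0([xob])(.+)p([+-]?[0-9]+)", text)
    if head is None:
        raise ValueError(f"not a non-decimal float: {text!r}")
    sign, marker, mantissa, exponent = head.groups()
    base, digit = _BASES[marker]
    group = f"[{digit}]+(?:_[{digit}]+)*"
    # A dot, if present, needs at least one digit on each side.
    body = re.fullmatch(f"({group})(?:\\.({group}))?", mantissa)
    if body is None:
        raise ValueError(f"bad mantissa in {text!r}")
    int_part = body.group(1).replace("_", "")
    frac_part = (body.group(2) or "").replace("_", "")
    value = float(int(int_part, base))
    if frac_part:
        value += int(frac_part, base) / base ** len(frac_part)
    result = math.ldexp(value, int(exponent))  # value * 2**exponent
    return -result if sign == "-" else result
```

Under these assumptions, "0x1.8p3", "0o1.4p3", and "0b1.1p3" all evaluate to 12.0, and "0x1.p3" and "0x.8p3" raise `ValueError`.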
Round-trip. A decoder-emitter pair is required to preserve the
value of +inf, -inf, and nan across a decode-then-emit cycle,
though an emitter may normalize the spelling (e.g. always emit nan,
never NaN). Finite values round-trip to the shortest decimal literal
that produces the same binary64.
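The shortest-round-trip property can be observed in Python, whose `repr()` for floats already produces the shortest digit string that reads back as the same binary64. The helper name below is hypothetical, and the spelling `repr()` chooses (for example for very large or very small exponents) is Python's, so an emitter may still need to respell it as a DMS literal.

```python
import math


def shortest_decimal(x: float) -> str:
    """Shortest decimal digits that read back as the same binary64 (finite x only)."""
    if not math.isfinite(x):
        raise ValueError("inf and nan are spelled with keywords, not digits")
    return repr(x)


x = 0.1 + 0.2
text = shortest_decimal(x)      # "0.30000000000000004"
assert float(text) == x         # decode-then-emit preserves the value
```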
Booleans
enabled: true
debug: false
Date & time
RFC 3339 / ISO 8601 subset, four distinct types:
offset_dt: 1979-05-27T07:32:00-08:00 # offset datetime
local_dt: 1979-05-27T07:32:00 # local datetime
local_d: 1979-05-27 # local date
local_t: 07:32:00.999 # local time
The date/time separator is uppercase T only. RFC 3339 section 5.6
allows a lowercase t and a single space as alternates; DMS rejects
both to keep emitted output canonical. Lowercase t is a parse error;
a space between date and time is also a parse error (since a date
followed by whitespace is itself a complete local_d value).
Fractional seconds are optional and limited to nanosecond precision (up to 9 digits after the decimal point). More digits are a parse error rather than silently truncated, so the written precision always matches the stored value.
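A lexical classifier for the four types might look like the Python sketch below. It only checks shape, not calendar validity (month 13 would pass), the names `classify` and the kind labels are hypothetical, and it assumes seconds are always present and that offsets are numeric; this section does not say whether a "Z" suffix is accepted, so the sketch leaves it out.

```python
import re

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]{1,9})?"  # at most 9 fraction digits
_OFFSET = r"[+-][0-9]{2}:[0-9]{2}"

# Uppercase "T" is the only accepted date/time separator.
_KINDS = [
    ("offset_datetime", re.compile(_DATE + "T" + _TIME + _OFFSET)),
    ("local_datetime", re.compile(_DATE + "T" + _TIME)),
    ("local_date", re.compile(_DATE)),
    ("local_time", re.compile(_TIME)),
]


def classify(text: str) -> str:
    for kind, pattern in _KINDS:
        if pattern.fullmatch(text):
            return kind
    raise ValueError(f"not a date/time literal: {text!r}")
```

With this sketch, `classify("07:32:00.999")` returns "local_time", while a lowercase t, a space separator, or a tenth fractional digit raises `ValueError`.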
Arrays (flow form)
For inline or compact use. Block form uses + items (see Lists).
empty: [] # the empty list
ints: [1, 2, 3]
mixed: [1, "two", 3.0, true] # heterogeneous allowed
nested: [[1, 2], [3, 4]] # arrays of arrays
tables: [{x: 1}, {x: 2}, {x: 3}] # arrays of tables
multi: [
"first",
"second",
"third", # trailing comma ok
]
Flow arrays may span multiple lines. Between [ and the matching ],
whitespace (including newlines and any indentation) is insignificant —
the outer indent rule is suspended inside a flow form.
Flow tables
empty: {} # the empty table
point: { x: 1, y: 2 }
user: { name: "ada", email: "ada@example.com" }
quoted: { "with space": 1, plain: 2 } # quoted and bare keys mix
matrix: { rows: [[1, 2], [3, 4]], cols: 2 } # tables containing arrays
nested: { outer: { inner: { deep: true } } } # tables of tables
trailing: { a: 1, b: 2, } # trailing comma ok
multi: {
name: "ada",
email: "ada@example.com",
role: "admin",
}
Flow tables may span multiple lines. Between { and the matching },
whitespace (including newlines and any indentation) is insignificant —
the outer indent rule is suspended inside a flow form.
Keys in a flow table must be unique within that table — the same rule that applies to block tables. Repeating a key is a parse error.
Flow forms — canonical multi-line layout
Multi-line flow is permissive on the decode side: any whitespace between the brackets is accepted. On the encode side, every conforming port emits multi-line flow forms in one canonical layout, so two ports producing the same value tree always emit byte-identical output.
The close-bracket anchors the indent.
- The closing bracket (] for arrays, } for tables) sits at the indent level of the line that opened the form.
- Members are indented exactly one level deeper than the closing bracket.
- The opening bracket stays on the line that opened the form.
- A trailing comma after the last member is required in canonical form, for diff-friendliness and to match the decode-side permissiveness already documented.
# array, scalar root
xs: [
  "first",
  "second",
  "third",
]
# table, scalar root
user: {
  name: "ada",
  email: "ada@example.com",
  role: "admin",
}
# nested: closing bracket anchors at each level's opener
config: {
  servers: [
    {
      name: "web1",
      port: 8080,
    },
    {
      name: "web2",
      port: 8081,
    },
  ],
}
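The layout rules above are mechanical enough to sketch as a small recursive emitter in Python. This is an illustration, not a conforming encoder: `emit_flow` is a hypothetical name, the two-space indent unit is an assumption, every non-empty container is rendered multi-line, keys are assumed to be valid bare keys, and strings are assumed to need no escaping.

```python
def emit_flow(value, indent: int = 0, unit: str = "  ") -> str:
    """Render lists/dicts as multi-line flow forms; other values inline."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return '"' + value + '"'  # assumption: nothing in the string needs escaping
    if isinstance(value, list):
        open_, close = "[", "]"
        members = [emit_flow(v, indent + 1, unit) for v in value]
    elif isinstance(value, dict):
        open_, close = "{", "}"
        members = [f"{k}: {emit_flow(v, indent + 1, unit)}" for k, v in value.items()]
    else:
        raise TypeError(f"unsupported value: {value!r}")
    if not members:
        return open_ + close
    inner = unit * (indent + 1)  # members sit one level deeper than the closer
    lines = [open_]
    lines += [f"{inner}{m}," for m in members]  # trailing comma on every member
    lines.append(unit * indent + close)         # closer aligns with the opening line
    return "\n".join(lines)


config = {"servers": [{"name": "web1", "port": 8080}]}
rendered = "config: " + emit_flow(config)
```

The opening bracket stays on the line that introduced the value because the caller prepends the key and the function starts its output with the bracket itself.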
When to break to multi-line. Encoders SHOULD emit multi-line form when:
- The single-line rendering would exceed the configured line-width threshold (port default: 80 chars; user-configurable), OR
- The form contains a heredoc or another multi-line member whose own rendering spans multiple lines.
Encoders MAY emit multi-line for other reasons (e.g., user-set
always_multiline_above: 3 for tables of size ≥ 3) — the rule
only specifies layout when multi-line is chosen, not the
break threshold.
Encoders that canonicalize non-empty flow values to block form
(tier-0's typical strategy) never emit multi-line flow for
non-empty cases and so do not exercise the layout rule. The rule
binds when an encoder genuinely emits a multi-line [...] /
{...} — most often for tier-1 decorator-call parens (which have
no block-form alternative; see TIER1.md).
What flow forms cannot contain
Flow forms are restricted to inline values. The following are decode
errors inside [ ... ] or { ... }:
- Heredocs (""" / '''). Heredoc bodies start on the next line, which conflicts with flow's whitespace-insignificant rule. Use a single-line quoted string, or switch the container to block form.
- Comments of any kind: #, //, or block comments (/* ... */, ### ... ###). Place the comment outside the brackets, or switch to block form where comments attach as leading / inner / trailing / floating.
- Block-form children: + list items or indented child blocks. A flow value position accepts only scalars and other flow forms.
Nested flow values
Flow arrays and flow tables compose freely — any value position accepts another flow array, flow table, or any scalar. There is no depth limit.
Grammar (sketch)
The grammar is indentation-sensitive; the token stream includes synthetic
INDENT and DEDENT tokens produced by the lexer according to the indent
rule above.
document = { trivia } ( table_root | list_root | scalar_root | empty ) ;
trivia = blank_line | line_comment | block_comment ;
table_root = kvpair { trivia | kvpair } ;
list_root = list_item { trivia | list_item } ;
scalar_root = ( inline_value | heredoc_ref ) { trivia } ;
empty = { trivia } ;
block_item = kvpair | list_item | line_comment | block_comment ;
kvpair = key ":" ( inline_value | heredoc_ref | child_block ) ;
list_item = "+" ( inline_value | heredoc_ref | kvpair { INDENT kvpair DEDENT } | child_block ) ;
child_block = NEWLINE INDENT block_item { block_item } DEDENT ;
key = bare_key | basic_string | literal_string ;
bare_key = ( xid_cont_safe | "_" | "-" )+ ;
(* xid_cont_safe = any character in *)
(* XID_Continue \ Default_Ignorable_Code_Point *)
(* per Unicode 15.1+ derived properties *)
inline_value = string | integer | float | boolean | datetime
| flow_array | flow_table ;
heredoc_ref = ( '"""' | "'''" ) [ label ] { WS modifier } ;
modifier = ident "(" [ inline_value { "," inline_value } ] ")" ;
ident = ( letter | "_" ) { letter | digit | "_" } ;
label = ident ;
flow_array = "[" [ inline_value { "," inline_value } [ "," ] ] "]" ;
flow_table = "{" [ flow_kv { "," flow_kv } [ "," ] ] "}" ;
flow_kv = key ":" inline_value ;
(* Inside flow_array / flow_table, all whitespace including *)
(* newlines is insignificant; the outer indent rule is *)
(* suspended between the opening and closing bracket/brace. *)
(* inline_value excludes heredoc_ref, and no comment *)
(* production appears inside flow forms — both are decode *)
(* errors. See "What flow forms cannot contain". *)
line_comment = ( "#" | "//" ) { any_char_except_newline } ;
block_comment = hash_block | c_block ;
hash_block = "###" [ label ] NEWLINE { any_line } terminator_line ;
c_block = "/*" { any_char | c_block } "*/" ; (* nested *)
(String/integer/float/datetime productions follow the rules described above.)
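The c_block production is recursive, so a lexer needs a depth counter rather than a simple search for the first "*/". A minimal Python sketch follows; `skip_c_block` is a hypothetical helper name, and the sketch ignores any interaction with strings or other comment forms.

```python
def skip_c_block(text: str, start: int) -> int:
    """Return the index just past the "*/" that closes the "/*" at `start`."""
    if not text.startswith("/*", start):
        raise ValueError("no block comment starts here")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # nested opener
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # closer for the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")


source = "/* a /* b */ c */ x"
rest = source[skip_c_block(source, 0):]  # text after the whole nested comment
```

In this example `rest` is " x": the first "*/" only closes the inner comment, and scanning continues until the outer one is closed.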
Example document
# Server config
title: "production"
updated: 2026-04-22T10:00:00-04:00
database:
host: "db.internal"
port: 5432
pool:
size: 10
dsn: """END
host=db.internal
port=5432
sslmode=require
END
servers:
+ name: "web1"
ipv4: "10.0.0.1"
disks:
+ mount: "/"
size_gb: 100
+ mount: "/var"
size_gb: 500
+ name: "web2"
ipv4: "10.0.0.2"
###NOTE
The feature flags below are owned by the growth team.
See team-growth/flags.md before changing.
NOTE
features:
enabled: ["auth", "billing", "search"]
limits: { rpm: 1000, burst: 50 }
Design non-goals
Features that are intentionally absent from the spec.
- No null / none. Missing values are expressed by key absence.
- No string concatenation or line continuation for single-line strings. Heredocs are the only multi-line string mechanism.
- No unit suffixes (30s, 5MB). Consumers layer that on top.
- No references / interpolation / schemas.
- No anchors / aliases / type tags.