Porting DMS to a new language

This document is the checklist for bringing a new language port to feature parity, including the two-tier (pure + FFI) pattern when the host language warrants it.

The reference implementation to read alongside this doc is dms-rs — the Rust port is the cleanest end-to-end view of the decoder pipeline. It is single-tier (compiled languages don't ship an FFI tier; see the table below), so for the two-tier shape specifically — pure-language decoder + native-speed FFI sibling sharing one repo — read dms-lua (dms-lua + dms-c rocks), which is where the two-tier pattern was first codified.

Canonical API surface

Every conforming port exposes the same three top-level entry points, two on the decode side and one on the encode side, named per the host language's idiomatic casing:

Concept	Function (snake_case)	Idiomatic variants
Decode a full document	`decode(source)`	`decodeDocument`, `Decode`, `Parse` (legacy alias only)
Decode only the front matter	`decode_front_matter(source)`	`decodeFrontMatter`, `DecodeFrontMatter`
Encode a `Document` back to DMS	`encode(document)`	`encodeDocument`, `Encode`, `ToDMS` (legacy alias only)

Lite-mode variants (decode_lite, encode_lite) are optional — ports that ship them must follow the contract in SPEC.md §Decoding modes — full and lite. The front-matter-only entry point is required at tier 0 (SPEC.md §Front-matter-only decode); there is no capability flag.

Migration from the `parse`/`to_dms` era

SPEC v0.14 renames the canonical entry points from parse/to_dms to decode/encode. Existing ports were on the prior naming and need a one-release migration:

Add the new names as the primary surface. decode, decode_front_matter, encode, decode_lite/encode_lite if the port shipped lite mode. The conformance harness and bench drivers all target the new names.
Keep the old names as deprecated thin aliases for one release. parse calls decode; to_dms calls encode. Mark them deprecated in the host language's idiomatic way (#[deprecated], @deprecated, JSDoc @deprecated, etc.) so downstream consumers see the warning at compile time / on import.
Remove the aliases in the release after. Two-release deprecation window total — long enough for downstream consumers to migrate, short enough that the alias surface doesn't accrete.
Update the bench drivers. parse_dms → decode_dms, parse_json_pure → decode_json_pure, etc. (See the Bench harness section below.) dms-tests will accept either name during the deprecation window, then drop the legacy form.
Update README and CHANGELOG to reference the new names.

The old aliases must call the new implementation, not the other way around — the canonical name is decode/encode, and the deprecated names exist purely as a compatibility shim.
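A minimal sketch of the alias direction in Python, assuming a port whose canonical functions are already named decode and encode. The function bodies here are hypothetical placeholders, and warnings.warn with DeprecationWarning is only Python's idiom for the deprecation marker; other hosts would use their own mechanism.

```python
import warnings


def decode(source):
    # Placeholder for the port's real decoder (hypothetical body).
    return {"source": source}


def encode(document):
    # Placeholder for the port's real encoder (hypothetical body).
    return document["source"]


def parse(source):
    """Deprecated alias: forwards to decode(), never the other way around."""
    warnings.warn("parse() is deprecated; use decode()",
                  DeprecationWarning, stacklevel=2)
    return decode(source)


def to_dms(document):
    """Deprecated alias: forwards to encode()."""
    warnings.warn("to_dms() is deprecated; use encode()",
                  DeprecationWarning, stacklevel=2)
    return encode(document)
```

Because the aliases contain no logic of their own, removing them in the following release is a pure deletion.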

Tier 1

This section is the guide for bringing a port from tier-0 conformance to full tier-1 conformance: decorator parsing, dialect registry, hoist pass, family resolution, param-signature validation, and canonical-form encoding. The reference implementation is dms-rs (219 tests, 6 examples, criterion benches, streaming module); everything below distils lessons learned from that work.

Tier 1 scope

A tier-1 port adds the following on top of the tier-0 baseline:

Front-matter acceptance — _dms_tier: 1 recognised; _dms_imports parsed (sigil / semver / allow-deny / cross-import collision validation).
Decorator-call lexer — sigil run + name + optional dotted namespace + balanced (...) groups.
Per-position sidecar attachment — Leading / Inner / Trailing / Floating, mirroring the four tier-0 comment positions.
Decoration inside flow forms — [...] / {...} elements each carry their own sidecar.
Decoration-only form — + |meta(charset:"x") / key: |required using family.empty_default.
Hoist pass — family.content_slot moves the named param into the value tree at the decorator's path.
Family resolution — alias rewrite, allow/deny filter, ambiguity reporting, namespace-qualified calls.
Param-signature validation — wildcard / wildcard_with_typed / strict / positional modes; required keys; defaults; type matching including ListOf / MapOf / named structs (with cycle detection); variadic last-slot collection.
Dialect registry — registry trait + InMemoryRegistry impl with five version-match strategies (exact / caret / tilde / gte / any), default caret, semver parse including pre-release ordering.
Encoder — emit decorations at all four positions; canonical-form heuristics (empty_default suppression for inner + leading, flow re-fold for short scalar lists into content_slot).
Streaming (optional) — per TIER1.md §1466, a port may opt out of streaming; batch-only is spec-compliant. See "Streaming by batch+visitor" under Hot paths and gotchas.
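As a rough illustration of the hoist pass from the list above, the sketch below moves one named param into a decoded body at the decorator's path. Everything about the data model is an assumption made for the example: the body is plain dicts and lists, the path uses the single-key segment objects from the conformance wrapper ({"key": ...} / {"index": ...}), and the function name hoist is hypothetical. TIER1.md remains the authority on the real semantics.

```python
def hoist(body, path, params, content_slot):
    """Move params[content_slot] into `body` at `path`.

    Returns the params that remain on the decorator after hoisting.
    Assumed shapes: body is nested dict/list, path is a list of
    {"key": str} or {"index": int} segments.
    """
    if content_slot not in params:
        return dict(params)  # nothing to hoist; body is left untouched
    if not path:
        raise ValueError("this sketch does not hoist into the document root")

    # Walk to the parent container of the target position.
    node = body
    for seg in path[:-1]:
        node = node[seg["key"]] if "key" in seg else node[seg["index"]]

    # Overwrite the slot at the final segment with the hoisted value.
    last = path[-1]
    if "key" in last:
        node[last["key"]] = params[content_slot]
    else:
        node[last["index"]] = params[content_slot]

    return {k: v for k, v in params.items() if k != content_slot}
```

The function mutates body in place and returns the leftover params, which mirrors the idea that the hoisted value leaves the sidecar and joins the value tree.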

Recommended implementation ordering

Sixteen sub-steps, in the order that worked for dms-rs. Each step added roughly 5–13 new tests; the total test count grew from ~95 (tier-0 baseline) to 219.

Decorator-call lexer — pure structural scan, no AST integration.
_dms_imports parser + tier-1 FM acceptance — FM accepts _dms_tier: 1.
Decorator-call parser — use lex helpers + reuse tier-0 inline-value parser via wrap-and-decode.
Sidecar attachment: Leading + Floating — line-start positions via skip_trivia.
Sidecar attachment: Inner + Trailing — post-header / post-value positions.
Dialect registry types + trait + InMemoryRegistry stub — no resolution yet.
Hoist pass — content_slot lookup + body navigation by path.
Decoration-only form — placeholder + empty_default substitution after resolution.
Full resolution rules — alias / allow / deny / ambiguity.
Param-signature validation: wildcard / wildcard_with_typed / strict — typed-key checks + required + defaults.
Positional + variadic + named-struct validation.
Decoration inside flow forms — path push/pop on flow_array / flow_table elements.
params_dec nested decoration — sub-parser with tier=1 enabled; harvest decorations_raw.
Canonical-form encoder heuristic — suppression + flow re-fold.
Dialect versioning strategies — semver parse + per-strategy match.
Tier-1 encoder — per-position emit splice into block emitters.
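For the dialect versioning step, a sketch of semver parsing plus the five match strategies might look like the following. The caret and tilde ranges follow the widely used npm-style reading (caret keeps the left-most non-zero component fixed, tilde keeps major.minor fixed); that reading is an assumption here, and the function names parse_semver and matches are hypothetical. Pre-release ordering follows the Semantic Versioning 2.0.0 precedence rules.

```python
import re

_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def parse_semver(text):
    """Return a tuple that sorts in semver precedence order."""
    m = _SEMVER.match(text)
    if m is None:
        raise ValueError(f"not a semver string: {text!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    pre = m.group(4)
    if pre is None:
        # A release sorts after every pre-release of the same core version.
        return (major, minor, patch, 1, ())
    ids = tuple(
        (0, int(p), "") if p.isdigit() else (1, 0, p)  # numeric ids sort first
        for p in pre.split(".")
    )
    return (major, minor, patch, 0, ids)


def matches(strategy, wanted, candidate):
    """True if `candidate` satisfies `wanted` under the given strategy."""
    if strategy == "any":
        return True
    want, have = parse_semver(wanted), parse_semver(candidate)
    if strategy == "exact":
        return have == want
    if strategy == "gte":
        return have >= want
    major, minor, patch = want[:3]
    if strategy == "tilde":
        upper = (major, minor + 1, 0)
    elif strategy == "caret":
        if major > 0:
            upper = (major + 1, 0, 0)
        elif minor > 0:
            upper = (0, minor + 1, 0)
        else:
            upper = (0, 0, patch + 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Lower bound is inclusive; upper bound excludes the whole next core version.
    return want <= have and have[:3] < upper
```

Representing a version as a sortable tuple keeps each strategy to a comparison or two, which makes the per-strategy tests short.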

Hot paths and gotchas

Wrap-and-decode for param groups. dms-rs's parse_param_group wraps a slice as {...} or [...] and calls a sub-parser with tier=1 set. Mode (named vs positional) is detected by peeking the first significant token (skip whitespace + comments). This avoids reimplementing flow-table/array parsing from scratch.

Path-cloning is the cost of sidecar attachment. Every flush of pending leading decorations clones self.path. Tier-0 already pays this cost for comments; tier-1 doubles it. For documents over ~1 MB consider arena-backed paths or Rc<[Segment]>.

Sub-parser allocation in nested decoration. Each (...) param group constructs a fresh parser when tier-1 is active. dms-rs added a fast-path that scans the slice for sigil characters and falls back to tier-0 decode when none are present (~5% improvement on decorator-heavy documents).

Streaming by batch+visitor. dms-rs ships a post-process streaming visitor: parse fully, then emit a StreamEvent iterator. TIER1.md §1466 explicitly permits batch-only ports. True coroutine streaming requires a parser refactor; the visitor approach gives consumers the same API contract for typical use cases at much lower implementation cost.
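The batch+visitor shape can be sketched as a generator over an already-decoded tree. The event names and the (event, path, payload) tuple layout below are assumptions chosen for illustration; they are not the dms-rs StreamEvent definition.

```python
def stream_events(value, path=()):
    """Yield (event, path, payload) tuples from an already-decoded tree."""
    if isinstance(value, dict):
        yield ("start_table", path, None)
        for key, child in value.items():
            yield from stream_events(child, path + (key,))
        yield ("end_table", path, None)
    elif isinstance(value, list):
        yield ("start_list", path, None)
        for index, child in enumerate(value):
            yield from stream_events(child, path + (index,))
        yield ("end_list", path, None)
    else:
        # Anything that is not a container is reported as a scalar.
        yield ("scalar", path, value)
```

A consumer iterates the generator exactly as it would a true streaming decoder; the difference is only that the whole document was parsed before the first event was produced.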

Encoder canonicalization is lossy. The flow re-fold heuristic transforms block-form children into content_slot: [...] flow form when content is a scalar list. Round-trip equivalence is semantic, not byte-identical. Document this clearly in the port's encoder API docs.

SortedIndex pattern over HashMap. The tier-0 encoder uses sorted-Vec + binary_search for path → comment lookup. The tier-1 streaming module replaced its HashMap with the same pattern for a ~9% speedup. Prefer SortedIndex for any path-keyed lookup on a hot path.
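The data structure itself is small. The sketch below shows the sorted-list-plus-binary-search idea in Python; the class layout is an assumption, not the dms-rs type, and the ~9% figure is a Rust measurement that should not be expected to carry over to an interpreter whose built-in dict is already heavily optimized.

```python
from bisect import bisect_left


class SortedIndex:
    """Path-keyed lookup backed by a sorted list and binary search."""

    def __init__(self, items):
        # items: iterable of (path, value) pairs. Paths must be mutually
        # comparable, e.g. tuples of ("k", key) / ("i", index) segments so
        # that strings and integers are never compared with each other.
        pairs = sorted(items, key=lambda kv: kv[0])
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, path, default=None):
        i = bisect_left(self._keys, path)
        if i < len(self._keys) and self._keys[i] == path:
            return self._values[i]
        return default
```

The index is built once after decoding and is read-only afterwards, which is what makes the sorted layout viable.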

Family resolution is ambiguity-bearing. Under multi-import bind tables, the same fn_name across two families is a parse error requiring |<ns>.<fn>(...) qualification. The error message matters — list the candidate families explicitly.
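A sketch of an ambiguity report that names its candidates, assuming a bind table shaped as a mapping from namespace to the set of function names that family exports. The shape, the exception type, and the function name resolve_family are all hypothetical; the | sigil is taken from the examples in this document.

```python
class AmbiguousDecorator(Exception):
    """Raised when an unqualified call matches more than one family."""


def resolve_family(fn_name, bind_table, namespace=None):
    """Return the namespace that should handle `fn_name`."""
    if namespace is not None:
        # Qualified call: only the named family is consulted.
        if fn_name in bind_table.get(namespace, ()):
            return namespace
        raise LookupError(f"|{namespace}.{fn_name}: not exported by '{namespace}'")

    candidates = sorted(ns for ns, fns in bind_table.items() if fn_name in fns)
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise LookupError(f"|{fn_name}: no imported family defines it")

    # List every candidate so the author can copy the qualified form directly.
    options = ", ".join(f"|{ns}.{fn_name}(...)" for ns in candidates)
    raise AmbiguousDecorator(
        f"|{fn_name} is defined by {len(candidates)} families; "
        f"qualify it as one of: {options}"
    )
```

Sorting the candidates keeps the message stable across runs, which helps when invalid fixtures are compared against expected error text.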

Conformance corpus protocol (tier 1)

Tier-1 fixtures are an extension of the existing dms-tests corpus:

Valid fixtures live under dms-tests/valid_t1/<dialect-id>/ as paired <name>.dms + <name>.json files.
Invalid fixtures live under dms-tests/invalid_t1/<dialect-id>/ as <name>.dms only (error expected).
Wrapper schema is documented in dms-tests/TESTS_T1.md. Format: { tier, imports, body, decorators }. Path segments are encoded as single-key objects — {"key":"foo"} for map keys, {"index":N} for sequence positions.
Runner invocation: python3 run_conformance.py <encoder> --tier=1. Default --tier=0 preserves backwards compatibility.
Encoder protocol: read DMS from stdin; accept --tier=1 flag; write the tagged JSON wrapper to stdout; parse errors to stderr as <line>:<col>: <message>, exit 1.
Per-port matrix: each port declares which dialect IDs it supports. Pass count is reported as port × dialect → N/M.
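The encoder protocol above can be sketched as a small driver. The decoder here is a hypothetical stand-in (it only rejects NUL characters so the error path has something to report), and DecodeError and decode_for_harness are names invented for the example; a real port would call its own tier-1 entry point and emit the wrapper described in TESTS_T1.md.

```python
import json


class DecodeError(Exception):
    def __init__(self, line, col, message):
        super().__init__(message)
        self.line, self.col, self.message = line, col, message


def decode_for_harness(source, tier):
    # Hypothetical stand-in for the port's decoder.
    bad = source.find("\x00")
    if bad != -1:
        line = source.count("\n", 0, bad) + 1
        col = bad - (source.rfind("\n", 0, bad) + 1) + 1  # 1-based column
        raise DecodeError(line, col, "NUL character in source")
    if tier == 1:
        return {"tier": 1, "imports": [], "body": {}, "decorators": []}
    return {}  # placeholder; the tier-0 output shape is not shown here


def main(argv, stdin, stdout, stderr):
    """Read DMS from stdin, write JSON to stdout, report errors on stderr."""
    tier = 1 if "--tier=1" in argv else 0
    try:
        result = decode_for_harness(stdin.read(), tier)
    except DecodeError as err:
        stderr.write(f"{err.line}:{err.col}: {err.message}\n")
        return 1
    json.dump(result, stdout)
    stdout.write("\n")
    return 0
```

A real driver script would finish with sys.exit(main(sys.argv[1:], sys.stdin, sys.stdout, sys.stderr)); taking the streams as parameters keeps the function testable without spawning a process.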

Per-port API knobs

Required:

Entry point	Purpose
`decode_t1(src)`	Returns `DocumentT1` equivalent, no registry.
`decode_t1_with_registry(src, registry)`	Full hoist + family resolution + param validation.
`encode_t1(doc)`	Round-trip back to DMS source.
Dialect registry trait + in-memory impl	At minimum an empty / `InMemoryRegistry` impl.

Optional (port-level choice):

Feature	Notes
Streaming decoder	Visitor-based batch+stream is sufficient; coroutine streaming is not required.
`mutate_t1` helper API	Path-rewriting on body + sidecar in lockstep. Useful for transforms.
Built-in dialect specs	E.g. `dms+html` constants bundled into the port.
LSP / formatter integration	Port-level choice; no cross-port compatibility requirement.

Goal

Each language port should let users pick between two implementations that share the same public API and value shape:

Pure — written in the host language. Portable, no C toolchain required at install time.
Native (FFI) — a thin binding around the canonical C decoder (dms-c). Comparable in speed to the language's C-backed JSON parser.

The benchmark for the port reports both tiers like-for-like:

pure-DMS vs pure-language JSON parser and pure-language YAML parser
native-DMS vs C-backed JSON parser and C-backed YAML parser

This produces an honest "DMS costs ~Nx vs JSON, but is Mx faster than YAML" story that holds whether the user chose portability or speed.

When the FFI tier doesn't apply

For languages that compile to native code (Rust, Go, Zig, C#, Java, Crystal, C), the "pure" implementation already runs at C-comparable speed. Adding an FFI layer to dms-c only adds marshalling overhead, so those ports are single-tier:

Language tier	Languages	Tiers shipped
Compiled / native	C, Rust, Go, Zig, C#, Java, Crystal	one (pure-native)
Interpreted, C-stdlib JSON	Lua, Python, Node, Ruby, PHP, Perl	two (pure + FFI)

The bench format is identical in either case; the FFI row is just omitted for compiled languages.

Minimum deliverables

[ ] Pure-language decoder at the same SPEC tier (≥ tier 0) as the reference implementations
[ ] Native (FFI) decoder wrapping dms-c (only for interpreted languages — see table above)
[ ] Identical public API and value shape across both tiers
[ ] Frozen Unicode tables (NFC + bare-key set), generated from UCD 15.1, shipped in-tree (see Unicode tables)
[ ] 100% pass on the upstream conformance corpus (dms-tests)
[ ] Bench drivers: decode_dms, decode_dms_<ffi>, decode_json, decode_json_pure, decode_yaml, decode_yaml_pure
[ ] README with the standard two-tier perf table (see below)
[ ] Logo at the top of the README (per the cross-port convention)

FFI binding skeleton

The Lua C module is the smallest reference (~150 lines): dms-lua/dms-c/dms_c.c.

Three functions cover the shape:

push_table(L, &v->u.t)   // dms_table  → host language ordered map
push_list(L,  &v->u.l)   // dms_list   → host language array
push_value(L, v)         // dms_value  → switch on v->type

For datetime values, emit a wrapper carrying the source lexeme and a type tag — match the pure module exactly so encoders work unchanged across both tiers.

Vendoring

Vendor dms.c, dms.h, and vendor/utf8proc/* from the dms-c repo into the FFI package directory. Mirror the directory layout — dms.c includes "vendor/utf8proc/utf8proc.h" as a relative path:

<port>/<ffi-pkg>/
  binding.c                      # the language-specific glue
  dms.c                          # vendored from dms-c
  dms.h
  vendor/utf8proc/utf8proc.h
  vendor/utf8proc/utf8proc.c
  vendor/utf8proc/utf8proc_data.c
  vendor/utf8proc/LICENSE.md

Compile with -DUTF8PROC_STATIC and an include path that resolves "vendor/utf8proc/utf8proc.h" from dms.c's directory.

Package manifest patterns

Language	Pure manifest	FFI manifest
Lua	`dms-*.rockspec`	`dms-c-*.rockspec` (modules.sources includes glue + dms.c + utf8proc)
Python	`pyproject.toml`	second `setup.py` with `Extension(...)` (see `dms-py/dms-c/setup.py`)
Node	`package.json`	second package using `node-gyp` or `napi-rs`, sources in `binding.gyp`
Ruby	`dms.gemspec`	`dms-c.gemspec` with `extensions: ["ext/extconf.rb"]`
PHP	`composer.json`	PECL `package.xml` + `config.m4`
Perl	`Makefile.PL` (PP)	`Makefile.PL` (XS) — already shipped as `DMS-XS-Parser`

Both packages live in the same git repo (dms-<lang>/). The pure package has minimal deps; the FFI package's only language-side dep is the C toolchain.

Unicode tables (pure tier)

Two parts of the SPEC require Unicode property data:

NFC normalization — every decoded string is NFC-normalized (SPEC §Unicode normalization). Needs canonical decompositions, combining classes, composition pairs, and the QC_No quick-check list.
Bare-key acceptance set — bare keys outside ASCII are members of XID_Continue ∖ Default_Ignorable_Code_Point (SPEC §What counts as a bare key).

Both sets are frozen at Unicode 15.1.0 by SPEC and must be shipped in-tree by every pure-tier port. Do not delegate to the host runtime's Unicode library (Python's unicodedata, Go's golang.org/x/text/unicode/norm, ICU, JS's String.prototype.normalize, Ruby's String#unicode_normalize, etc.). Those track whichever Unicode version their runtime was built against, so the set of accepted bare-key characters and the NFC mapping of new codepoints would silently drift over time and across ports. Frozen tables make documents byte-identical across ports and across decades.
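To make the in-tree table idea concrete, here is a sketch of a bare-key membership test over a sorted list of inclusive code point ranges. The three ranges are an illustrative excerpt only; a real port commits the full generated table for XID_Continue ∖ Default_Ignorable_Code_Point, and the names BARE_KEY_RANGES and is_bare_key_char are hypothetical.

```python
from bisect import bisect_right

# Illustrative excerpt only. The real table is generated from the UCD 15.1
# files and committed to the port's source tree.
BARE_KEY_RANGES = [
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
    (0x00F8, 0x02C1),
]
_STARTS = [lo for lo, _ in BARE_KEY_RANGES]


def is_bare_key_char(ch):
    """True if `ch` falls inside one of the frozen, sorted, inclusive ranges."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= BARE_KEY_RANGES[i][1]
```

Nothing in this lookup consults the host runtime's Unicode data, so its answers cannot change when the runtime is upgraded.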

Reference generators

The Zig formatter dmsfmt under this repo is the reference: two Python generators read the UCD 15.1 source files and emit Zig source for the decoder to import.

dmsfmt/tools/
  gen_nfc_tables.py     # produces src/nfc_tables.zig
  gen_xid_table.py      # produces src/xid_table.zig
  ucd_15_1/
    UnicodeData.txt
    CompositionExclusions.txt
    DerivedCoreProperties.txt
    PropList.txt
    NormalizationTest.txt    # used by the conformance test, not the generator

To bring up a new pure port:

Copy dmsfmt/tools/ucd_15_1/ into the port repo (e.g. dms-<lang>/tools/ucd_15_1/). The UCD files are tiny (~5 MB total) and committing them keeps the port build hermetic.
Adapt gen_nfc_tables.py and gen_xid_table.py to emit the host language's syntax. The decoding logic stays; only the final "write Zig source" step changes:
  - Python → tuples / sets in a .py module
  - Go → []Range / map literals in a .go file
  - Rust → &[Range] / phf::Map in a .rs file
  - JS → typed-array literals in a .js module
  - Ruby → frozen-array literals in an .rb file
  - PHP → array literals in a .php module
  - Lua → table constructors in a .lua module
Commit the generated table source. Re-run the generator only when SPEC bumps the Unicode floor.
Add a conformance test that round-trips dmsfmt/tools/ucd_15_1/NormalizationTest.txt through the port's NFC routine (see dmsfmt/tests/nfc_conformance.zig for the pattern). 100% pass is required.
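The conformance test in the last step can be sketched as follows. It checks the NFC invariants stated in the header of NormalizationTest.txt (c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5)) against whatever NFC routine is passed in. The function name nfc_failures is hypothetical, and this is not a translation of nfc_conformance.zig.

```python
def _decode_column(field):
    # A column is a space-separated list of hexadecimal code points.
    return "".join(chr(int(cp, 16)) for cp in field.split())


def nfc_failures(lines, nfc):
    """Return the 1-based line numbers where `nfc` breaks the NFC invariants."""
    failures = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("@"):
            continue  # comments and @Part headers carry no test data
        c1, c2, c3, c4, c5 = (_decode_column(f) for f in line.split(";")[:5])
        ok = (c2 == nfc(c1) == nfc(c2) == nfc(c3)) and (c4 == nfc(c4) == nfc(c5))
        if not ok:
            failures.append(lineno)
    return failures
```

A port would call it with its own table-driven routine, for example nfc_failures(open(path, encoding="utf-8"), port_nfc) where port_nfc is the port's function, and require an empty result.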

Why not vendor a host Unicode library

For some languages it's tempting to depend on a pinned third-party library that already happens to ship UCD 15.1 (utf8proc 2.9, the one dms-c already vendors; Rust's unicode-normalization 0.1.x; etc.). That works for the freeze in the short term, but it puts the port one dependency upgrade away from a silent Unicode bump. Generated in-tree tables make the freeze visible in the diff: any future Unicode change shows up as a SPEC-controlled commit touching the generator and the table sources together.

The C port is the one exception. It vendors utf8proc directly because (a) utf8proc's release cadence is glacial, (b) its in-tree data file is auditable as a static blob, and (c) it would be silly to write a third C codebase just to compile-time-bake the same tables. FFI-tier ports inherit the C port's utf8proc vendor and don't need their own generator.

Bench harness

Six drivers: three formats (DMS, JSON, YAML) across two tiers (pure and C-backed). Each driver reads source from stdin, decodes once, prints ok\n. The runner subtracts an interpreter-startup probe and reports the best of N.

driver	decodes	library
`decode_dms`	DMS	pure-language port
`decode_dms_<ffi>`	DMS	FFI port (interpreted langs only)
`decode_json_pure`	JSON	pure-language JSON parser
`decode_json`	JSON	C-backed JSON parser (or stdlib if it is C)
`decode_yaml_pure`	YAML	pure-language YAML parser
`decode_yaml`	YAML	libyaml-backed parser

The Python harness in dms-lua/bench/bench_decoders.py is the reference orchestrator; copy and adapt it for other languages. Single-tier (compiled) ports keep only the three *_pure-equivalent drivers, one each for DMS, JSON, and YAML (their canonical libraries are already pure-native).
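The measurement loop described in this section (run a driver, take the best of N, subtract a startup probe) can be sketched as below. This is not the code in bench_decoders.py; the function names are hypothetical, and the sketch assumes the probe is a driver that starts the interpreter and prints ok without decoding anything.

```python
import subprocess
import time


def best_of(cmd, payload, runs=5):
    """Best wall-clock time in seconds for `cmd` fed `payload` on stdin."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        proc = subprocess.run(cmd, input=payload, capture_output=True, check=True)
        elapsed = time.perf_counter() - start
        if proc.stdout.strip() != b"ok":
            raise RuntimeError(f"driver did not print ok: {proc.stdout!r}")
        best = min(best, elapsed)
    return best


def net_decode_time(driver_cmd, probe_cmd, payload, runs=5):
    """Driver time minus the interpreter-startup probe, floored at zero."""
    driver = best_of(driver_cmd, payload, runs)
    probe = best_of(probe_cmd, payload, runs)
    return max(0.0, driver - probe)
```

Taking the minimum rather than the mean discards runs that were disturbed by other processes, and the floor at zero guards against a noisy probe exceeding a very fast driver.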

README perf table

Standardized format. The 50,000-key flat-map fixture lives in dms-tests — generate once, reuse across ports.

## Performance

50,000-key flat document, best-of-5, startup-subtracted.

| tier        | DMS port  | time   | JSON peer  | time   | YAML peer  | time   | DMS / JSON | DMS / YAML |
|-------------|-----------|--------|------------|--------|------------|--------|------------|------------|
| pure        | dms       | <X> ms | <pure-json>| <X> ms | <pure-yaml>| <X> ms | <r>×       | <r>×       |
| native (C)  | dms_c     | <X> ms | <ffi-json> | <X> ms | <ffi-yaml> | <X> ms | <r>×       | <r>×       |

Single-tier ports drop the second row.
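The two ratio columns are each the DMS time divided by the peer's time. A sketch of a row formatter for the table above, with a hypothetical function name and an assumed rounding choice (one decimal for milliseconds, two for ratios):

```python
def perf_row(tier, dms, json_peer, yaml_peer):
    """Format one table row; each argument after `tier` is (name, milliseconds)."""
    (dms_name, dms_ms) = dms
    (json_name, json_ms) = json_peer
    (yaml_name, yaml_ms) = yaml_peer
    cells = [
        tier,
        dms_name, f"{dms_ms:.1f} ms",
        json_name, f"{json_ms:.1f} ms",
        yaml_name, f"{yaml_ms:.1f} ms",
        f"{dms_ms / json_ms:.2f}×",  # DMS / JSON: above 1 means DMS is slower
        f"{dms_ms / yaml_ms:.2f}×",  # DMS / YAML: below 1 means DMS is faster
    ]
    return "| " + " | ".join(cells) + " |"
```

Generating the row from the measured numbers avoids hand-edited ratios drifting out of sync with the times next to them.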

Per-language baseline picks

For the JSON/YAML peers, default to these. Pick something equivalent if your language has a different community standard.

Language	Pure JSON	FFI JSON	Pure YAML	FFI YAML
Lua	`dkjson`	`lua-cjson`	`lua-tinyyaml`	`lyaml` (libyaml)
Python	n/a — stdlib is C	`json` (stdlib)	`ruamel.yaml` PP	`PyYAML`+libyaml
Node	n/a — V8 is C++	`JSON.parse` (V8)	`yaml` (npm)	`yaml` + libyaml WASM/native
Ruby	`json_pure` gem	`json` (C ext)	`psych` PP mode	`psych`+libyaml
PHP	`seld/jsonlint`	`json_decode` (stdlib)	`symfony/yaml`	`yaml` PECL+libyaml
Perl	`JSON::PP`	`JSON::XS`	`YAML::PP`	`YAML::XS`
Rust	`serde_json`	(single-tier)	`serde_yaml`	(single-tier)
Go	`encoding/json`	(single-tier)	`gopkg.in/yaml.v3`	(single-tier)
C#	`System.Text.Json`	(single-tier)	`YamlDotNet`	(single-tier)
Java	Jackson / Gson	(single-tier)	SnakeYAML	(single-tier)
Crystal	stdlib `JSON`	(single-tier)	stdlib `YAML`	(single-tier)
Zig	`std.json`	(single-tier)	`zig-yaml`	(single-tier)

When a language has no commonly-used pure-X parser (Python's json module, Node's JSON.parse), report only the FFI tier and call out the asymmetry in the bench note. Don't fabricate a "pure" comparison against an obscure library nobody uses.

Status matrix

Port	Pure	FFI	Two-tier bench
`dms-lua`	✓	✓	✓
`dms-js`	✓	✓	✓
`dms-rb`	✓	✓	✓
`dms-php`	✓	✓	✓
`dms-py`	✓	✓	—
`dms-pl`	✓	✓	—
`dms-rs`	✓	n/a	—
`dms-c`	✓	n/a	—
`dms-zig`	✓	n/a	—
`dms-go`	✓	n/a	—
`dms-cs`	✓	n/a	—
`dms-java`	✓	n/a	—
`dms-cr`	✓	n/a	—

— = not yet ported. n/a = compiled language; FFI tier doesn't apply.

Concrete steps

For an interpreted-language port that already has a pure implementation:

1. Add a sibling subdir under the port repo (e.g. dms-c/ for Lua, ext/dms_c/ for Ruby) and vendor dms.c, dms.h, vendor/utf8proc/*.
2. Write the binding glue. Mirror the pure module's value shape exactly (table flag, key-order array, list flag, datetime wrapper).
3. Add the FFI package manifest. Sources must include the glue file, dms.c, and utf8proc.c. Compile flag -DUTF8PROC_STATIC.
4. Build locally and smoke-test decode(src) against a few fixtures.
5. Add decode_dms_<ffi>, decode_json_pure, decode_yaml_pure drivers.
6. Extend the bench harness to print the two-tier table.
7. Run the conformance suite against the FFI module. It must hit the same pass count as the pure port.
8. Update the README perf section to the standardized format.
9. Commit and push.

For a compiled-language port, only steps 5–9 apply (drop the FFI row from the table).

Open question: package distribution

Shipping pre-built binaries per OS × arch is the next gate after publishing. PyPI wheels, npm prebuilt binaries, PECL binary releases, etc. each have their own conventions. Keep the source-build path working unconditionally; binary distribution can be opted into per port.