Porting DMS to a new language

This document is the checklist for bringing a new language port to feature parity, including the two-tier (pure + FFI) pattern when the host language warrants it.

The reference implementation to read alongside this doc is dms-rs — the Rust port is the cleanest end-to-end view of the decoder pipeline. It is single-tier (compiled languages don't ship an FFI tier; see the table below), so for the two-tier shape specifically — pure-language decoder + native-speed FFI sibling sharing one repo — read dms-lua (dms-lua + dms-c rocks), which is where the two-tier pattern was first codified.

Canonical API surface

Every conforming port exposes the same three top-level entry points, two on the decode side and one on the encode side, named per the host language's idiomatic casing:

| Concept | Function (snake_case) | Idiomatic variants |
|---------|-----------------------|--------------------|
| Decode a full document | decode(source) | decodeDocument, Decode, Parse (legacy alias only) |
| Decode only the front matter | decode_front_matter(source) | decodeFrontMatter, DecodeFrontMatter |
| Encode a Document back to DMS | encode(document) | encodeDocument, Encode, ToDMS (legacy alias only) |

Lite-mode variants (decode_lite, encode_lite) are optional — ports that ship them must follow the contract in SPEC.md §Decoding modes — full and lite. The front-matter-only entry point is required at tier 0 (SPEC.md §Front-matter-only decode); there is no capability flag.

Migration from the parse/to_dms era

SPEC v0.14 renames the canonical entry points from parse/to_dms to decode/encode. Existing ports were on the prior naming and need a one-release migration:

  1. Add the new names as the primary surface: decode, decode_front_matter, encode, plus decode_lite/encode_lite if the port shipped lite mode. The conformance harness and bench drivers all target the new names.
  2. Keep the old names as deprecated thin aliases for one release. parse calls decode; to_dms calls encode. Mark them deprecated in the host language's idiomatic way (#[deprecated], @deprecated, JSDoc @deprecated, etc.) so downstream consumers see the warning at compile time / on import.
  3. Remove the aliases in the release after. Two-release deprecation window total — long enough for downstream consumers to migrate, short enough that the alias surface doesn't accrete.
  4. Update the bench drivers: parse_dms → decode_dms, parse_json_pure → decode_json_pure, etc. (See the Bench harness section below.) dms-tests will accept either name during the deprecation window, then drop the legacy form.
  5. Update README and CHANGELOG to reference the new names.

The old aliases must call the new implementation, not the other way around — the canonical name is decode/encode, and the deprecated names exist purely as a compatibility shim.
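
A minimal sketch of the alias direction described above, using Python's standard warnings module. The decode body is a stub standing in for the port's real decoder, and the warning text is illustrative rather than mandated by SPEC:

```python
import warnings


def decode(source: str) -> dict:
    """Canonical entry point. Stubbed here; a real port runs its decoder."""
    return {"source_length": len(source)}


def parse(source: str) -> dict:
    """Deprecated alias kept for one release; it only forwards to decode()."""
    warnings.warn(
        "parse() is deprecated; use decode()",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this shim
    )
    return decode(source)
```

The shim holds no logic of its own, so deleting it in the following release cannot change decoder behaviour.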

Tier 1

This section is the guide for bringing a port from tier-0 conformance to full tier-1 conformance: decorator parsing, dialect registry, hoist pass, family resolution, param-signature validation, and canonical-form encoding. The reference implementation is dms-rs (219 tests, 6 examples, criterion benches, streaming module); everything below distils lessons learned from that work.

Tier 1 scope

A tier-1 port adds the following on top of the tier-0 baseline, broken into sixteen sub-steps in the order that worked for dms-rs. Each step added 5–13 new tests; the total test count grew from ~95 (tier-0 baseline) to 219.

  1. Decorator-call lexer — pure structural scan, no AST integration.
  2. _dms_imports parser + tier-1 FM acceptance — FM accepts _dms_tier: 1.
  3. Decorator-call parser — use lex helpers + reuse tier-0 inline-value parser via wrap-and-decode.
  4. Sidecar attachment: Leading + Floating — line-start positions via skip_trivia.
  5. Sidecar attachment: Inner + Trailing — post-header / post-value positions.
  6. Dialect registry types + trait + InMemoryRegistry stub — no resolution yet.
  7. Hoist pass: content_slot lookup + body navigation by path.
  8. Decoration-only form — placeholder + empty_default substitution after resolution.
  9. Full resolution rules — alias / allow / deny / ambiguity.
  10. Param-signature validation: wildcard / wildcard_with_typed / strict + typed-key checks + required + defaults.
  11. Positional + variadic + named-struct validation.
  12. Decoration inside flow forms — path push/pop on flow_array / flow_table elements.
  13. params_dec nested decoration — sub-parser with tier=1 enabled; harvest decorations_raw.
  14. Canonical-form encoder heuristic — suppression + flow re-fold.
  15. Dialect versioning strategies — semver parse + per-strategy match.
  16. Tier-1 encoder — per-position emit splice into block emitters.

Hot paths and gotchas

Wrap-and-decode for param groups. dms-rs's parse_param_group wraps a slice as {...} or [...] and calls a sub-parser with tier=1 set. Mode (named vs positional) is detected by peeking the first significant token (skip whitespace + comments). This avoids reimplementing flow-table/array parsing from scratch.

Path-cloning is the cost of sidecar attachment. Every flush of pending leading decorations clones self.path. Tier-0 already pays this cost for comments; tier-1 doubles it. For documents over ~1 MB consider arena-backed paths or Rc<[Segment]>.

Sub-parser allocation in nested decoration. Each (...) param group constructs a fresh parser when tier-1 is active. dms-rs added a fast-path that scans the slice for sigil characters and falls back to tier-0 decode when none are present (~5% improvement on decorator-heavy documents).
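
The fast path can be sketched as a single scan before choosing a decoder. This is an illustration only: the sigil set below contains just "|" (the one sigil this document shows), and decode_tier0 / decode_tier1 are hypothetical callables standing in for the port's two decode paths.

```python
# Assumption: "|" is the only decorator sigil. A real port would list every
# sigil character the SPEC defines.
SIGILS = frozenset("|")


def needs_tier1_subparser(param_slice: str) -> bool:
    """True if the slice contains any character that could start a decoration."""
    return any(ch in SIGILS for ch in param_slice)


def decode_param_group(param_slice, decode_tier0, decode_tier1):
    """Skip the tier-1 sub-parser when no sigil is present in the slice."""
    if needs_tier1_subparser(param_slice):
        return decode_tier1(param_slice)
    return decode_tier0(param_slice)
```

The scan is deliberately conservative: a sigil inside a string literal still routes to the tier-1 path, which costs time but not correctness, provided the tier-1 decoder accepts everything tier-0 does.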

Streaming by batch+visitor. dms-rs ships a post-process streaming visitor: parse fully, then emit a StreamEvent iterator. Spec §1466 explicitly permits batch-only ports. True coroutine streaming requires a parser refactor; the visitor approach gives consumers the API contract for typical use cases at much lower implementation cost.
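
As a rough illustration of the batch+visitor shape, the sketch below walks an already-decoded tree and yields events lazily. The StreamEvent fields and event kind names are assumptions made for this example, not the dms-rs definitions, and plain dict / list stand in for the port's table and list types.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Tuple

Path = Tuple[Any, ...]


@dataclass(frozen=True)
class StreamEvent:
    kind: str          # "start_table", "end_table", "start_list", "end_list", "scalar"
    path: Path         # keys and indexes leading to this node
    value: Any = None  # set only for scalar events


def stream_events(node: Any, path: Path = ()) -> Iterator[StreamEvent]:
    """Depth-first walk over a fully decoded document, yielding events."""
    if isinstance(node, dict):
        yield StreamEvent("start_table", path)
        for key, child in node.items():
            yield from stream_events(child, path + (key,))
        yield StreamEvent("end_table", path)
    elif isinstance(node, list):
        yield StreamEvent("start_list", path)
        for index, child in enumerate(node):
            yield from stream_events(child, path + (index,))
        yield StreamEvent("end_list", path)
    else:
        yield StreamEvent("scalar", path, node)
```

Consumers iterate over stream_events(decode(src)); the whole document is still decoded up front, which is the trade-off the batch approach accepts.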

Encoder canonicalization is lossy. The flow re-fold heuristic transforms block-form children into content_slot: [...] flow form when content is a scalar list. Round-trip equivalence is semantic, not byte-identical. Document this clearly in the port's encoder API docs.
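
One way to pin down "semantic, not byte-identical" in a port's test suite is a helper like the following sketch. The decode and encode arguments are whatever callables the port exposes; nothing here depends on DMS specifics.

```python
def assert_semantic_round_trip(decode, encode, source: str) -> None:
    """Check that re-encoding preserves the decoded value, not the bytes."""
    first = decode(source)
    second = decode(encode(first))
    assert first == second, "re-encoded document decodes to a different value"
```

The helper never compares encode(first) with source, so a canonicalizing encoder that re-folds block children into flow form still passes.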

SortedIndex pattern over HashMap. The tier-0 encoder uses sorted-Vec + binary_search for path → comment lookup. The tier-1 streaming module replaced its HashMap with the same pattern for ~9% speedup. Recommend SortedIndex for any path-keyed lookup hot path.
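
The pattern itself is small. The sketch below is a Python rendering of the idea (sorted keys plus binary search), not a translation of the dms-rs type; it assumes keys are mutually comparable, for example paths stored as tuples of strings.

```python
from bisect import bisect_left


class SortedIndex:
    """Immutable key -> value lookup backed by a sorted list and binary search."""

    def __init__(self, items):
        pairs = sorted(items, key=lambda pair: pair[0])
        self._keys = [key for key, _ in pairs]
        self._values = [value for _, value in pairs]

    def get(self, key, default=None):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default
```

Whether this beats the host language's hash map is something to measure per port; the speedup quoted above is the dms-rs result.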

Family resolution is ambiguity-bearing. Under multi-import bind tables, the same fn_name across two families is a parse error requiring |<ns>.<fn>(...) qualification. The error message matters — list the candidate families explicitly.
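
A sketch of an ambiguity error that names its candidates. The bind-table shape (namespace mapped to the set of function names it exports) and the exception class are assumptions for illustration; alias / allow / deny rules are left out.

```python
class AmbiguousDecoratorError(ValueError):
    """Raised when an unqualified decorator name matches several families."""


def resolve_family(fn_name, bind_table, namespace=None):
    """Return the namespace that provides fn_name, or raise with candidates."""
    if namespace is not None:
        if fn_name in bind_table.get(namespace, ()):
            return namespace
        raise LookupError(f"{namespace}.{fn_name} is not defined")
    candidates = sorted(ns for ns, fns in bind_table.items() if fn_name in fns)
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise LookupError(f"no imported family defines {fn_name!r}")
    options = ", ".join(f"|{ns}.{fn_name}(...)" for ns in candidates)
    raise AmbiguousDecoratorError(
        f"{fn_name!r} is defined by more than one family; qualify it as one of: {options}"
    )
```

Spelling out each qualified form in the message lets the user copy the fix directly from the error.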

Conformance corpus protocol (tier 1)

Tier-1 fixtures are an extension of the existing dms-tests corpus:

Per-port API knobs

Required:

| Entry point | Purpose |
|-------------|---------|
| decode_t1(src) | Returns DocumentT1 equivalent, no registry. |
| decode_t1_with_registry(src, registry) | Full hoist + family resolution + param validation. |
| encode_t1(doc) | Round-trip back to DMS source. |
| Dialect registry trait + in-memory impl | At minimum an empty / InMemoryRegistry impl. |

Optional (port-level choice):

| Feature | Notes |
|---------|-------|
| Streaming decoder | Visitor-based batch+stream is sufficient; coroutine streaming is not required. |
| mutate_t1 helper API | Path-rewriting on body + sidecar in lockstep. Useful for transforms. |
| Built-in dialect specs | E.g. dms+html constants bundled into the port. |
| LSP / formatter integration | Port-level choice; no cross-port compatibility requirement. |

Goal

Each language port should let users pick between two implementations that share the same public API and value shape: a pure-language decoder for portability and an FFI binding to dms-c for native speed.

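The benchmark for the port reports both tiers like-for-like: the pure DMS decoder against pure-language JSON and YAML parsers, and the FFI decoder against C-backed JSON and YAML parsers.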

This produces an honest "DMS costs ~Nx vs JSON, but is Mx faster than YAML" story that holds whether the user chose portability or speed.

When the FFI tier doesn't apply

For languages that compile to native code (Rust, Go, Zig, C#, Java, Crystal, C), the "pure" implementation already runs at C-comparable speed. Adding an FFI layer to dms-c only adds marshalling overhead, so those ports are single-tier:

| Language tier | Languages | Tiers shipped |
|---------------|-----------|---------------|
| Compiled / native | C, Rust, Go, Zig, C#, Java, Crystal | one (pure-native) |
| Interpreted, C-stdlib JSON | Lua, Python, Node, Ruby, PHP, Perl | two (pure + FFI) |

The bench format is identical in either case; the FFI row is just omitted for compiled languages.

Minimum deliverables

FFI binding skeleton

The Lua C module is the smallest reference (~150 lines): dms-lua/dms-c/dms_c.c.

Three functions cover the shape:

push_table(L, &v->u.t)   // dms_table  → host language ordered map
push_list(L,  &v->u.l)   // dms_list   → host language array
push_value(L, v)         // dms_value  → switch on v->type

For datetime values, emit a wrapper carrying the source lexeme and a type tag — match the pure module exactly so encoders work unchanged across both tiers.
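
On the pure side, such a wrapper can be as small as the sketch below. The class name, field names, and the example tag are assumptions for illustration; what matters is that the FFI glue builds a value that compares equal to the one the pure decoder builds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DateTimeValue:
    lexeme: str  # exact source text, kept so the encoder can re-emit it unchanged
    kind: str    # type tag; the tag names are illustrative, not taken from SPEC


def encode_datetime(value: DateTimeValue) -> str:
    """Encoders emit the preserved lexeme rather than re-formatting the value."""
    return value.lexeme
```

Because the wrapper carries the lexeme, neither tier needs a datetime library to round-trip a value.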

Vendoring

Vendor dms.c, dms.h, and vendor/utf8proc/* from the dms-c repo into the FFI package directory. Mirror the directory layout — dms.c includes "vendor/utf8proc/utf8proc.h" as a relative path:

<port>/<ffi-pkg>/
  binding.c                      # the language-specific glue
  dms.c                          # vendored from dms-c
  dms.h
  vendor/utf8proc/utf8proc.h
  vendor/utf8proc/utf8proc.c
  vendor/utf8proc/utf8proc_data.c
  vendor/utf8proc/LICENSE.md

Compile with -DUTF8PROC_STATIC and an include path that resolves "vendor/utf8proc/utf8proc.h" from dms.c's directory.
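
For a rough picture of what the build step has to pass to the compiler, the sketch below assembles a generic command line for the layout above. The output name is hypothetical, "cc" stands in for whatever compiler the host packaging tool selects, and the host language's own header include paths are omitted.

```python
from pathlib import Path


def ffi_compile_command(pkg_dir: str, output: str = "dms_c.so") -> list[str]:
    """Build an illustrative compile command for the vendored FFI package."""
    root = Path(pkg_dir)
    sources = [
        root / "binding.c",                      # language-specific glue
        root / "dms.c",                          # vendored from dms-c
        root / "vendor" / "utf8proc" / "utf8proc.c",
    ]
    return [
        "cc", "-shared", "-fPIC",
        "-DUTF8PROC_STATIC",                     # link utf8proc statically
        f"-I{root}",                             # lets "vendor/utf8proc/utf8proc.h" resolve
        *[str(source) for source in sources],
        "-o", output,
    ]
```

In practice the same three inputs (define, include path, source list) go into the rockspec, setup.py, binding.gyp, or extconf.rb rather than a hand-written command.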

Package manifest patterns

| Language | Pure manifest | FFI manifest |
|----------|---------------|--------------|
| Lua | dms-*.rockspec | dms-c-*.rockspec (modules.sources includes glue + dms.c + utf8proc) |
| Python | pyproject.toml | second setup.py with Extension(...) (see dms-py/dms-c/setup.py) |
| Node | package.json | second package using node-gyp or napi-rs, sources in binding.gyp |
| Ruby | dms.gemspec | dms-c.gemspec with extensions: ["ext/extconf.rb"] |
| PHP | composer.json | PECL package.xml + config.m4 |
| Perl | Makefile.PL (PP) | Makefile.PL (XS); already shipped as DMS-XS-Parser |

Both packages live in the same git repo (dms-<lang>/). The pure package has minimal deps; the FFI package's only language-side dep is the C toolchain.

Unicode tables (pure tier)

Two parts of the SPEC require Unicode property data: bare-key character classification (the XID table) and NFC normalization (the NFC tables).

Both sets are frozen at Unicode 15.1.0 by SPEC and must be shipped in-tree by every pure-tier port. Do not delegate to the host runtime's Unicode library (Python's unicodedata, Go's golang.org/x/text/unicode/norm, ICU, JS's String.prototype.normalize, Ruby's String#unicode_normalize, etc.). Those track whichever Unicode version their runtime was built against, so the set of accepted bare-key characters and the NFC mapping of new codepoints would silently drift over time and across ports. Frozen tables make documents byte-identical across ports and across decades.

Reference generators

The Zig formatter dmsfmt under this repo is the reference: two Python generators read the UCD 15.1 source files and emit Zig source for the decoder to import.

dmsfmt/tools/
  gen_nfc_tables.py     # produces src/nfc_tables.zig
  gen_xid_table.py      # produces src/xid_table.zig
  ucd_15_1/
    UnicodeData.txt
    CompositionExclusions.txt
    DerivedCoreProperties.txt
    PropList.txt
    NormalizationTest.txt    # used by the conformance test, not the generator
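
To give a feel for what a generator such as gen_xid_table.py has to do, here is a self-contained sketch, not the dmsfmt code: it reads lines in the DerivedCoreProperties.txt format, collects inclusive ranges for one property, and emits them as a Python tuple literal. All function names are made up for this example.

```python
from bisect import bisect_right


def parse_property_ranges(lines, prop):
    """Collect merged, inclusive (first, last) code point ranges for one property."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    ranges.sort()
    merged = []
    for first, last in ranges:
        if merged and first <= merged[-1][1] + 1:  # adjacent or overlapping
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def emit_python_table(name, ranges):
    """The host-specific step: render the ranges as source text."""
    body = "".join(f"    (0x{a:04X}, 0x{b:04X}),\n" for a, b in ranges)
    return f"{name} = (\n{body})\n"


def in_ranges(ranges, code_point):
    """Binary-search membership test over sorted, non-overlapping ranges."""
    i = bisect_right(ranges, (code_point, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= code_point <= ranges[i][1]
```

Only emit_python_table is host-specific; the parsing and merging carry over unchanged to any port.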

To bring up a new pure port:

  1. Copy dmsfmt/tools/ucd_15_1/ into the port repo (e.g. dms-<lang>/tools/ucd_15_1/). The UCD files are tiny (~5 MB total) and committing them keeps the port build hermetic.
  2. Adapt gen_nfc_tables.py and gen_xid_table.py to emit the host language's syntax. The decoding logic stays; only the final "write Zig source" step changes:
    • Python → tuples / sets in a .py module
    • Go → []Range / map literals in a .go file
    • Rust → &[Range] / phf::Map in a .rs file
    • JS → typed-array literals in a .js module
    • Ruby → frozen-array literals in an .rb file
    • PHP → array literals in a .php module
    • Lua → table constructors in a .lua module
  3. Commit the generated table source. Re-run the generator only when SPEC bumps the Unicode floor.
  4. Add a conformance test that round-trips dmsfmt/tools/ucd_15_1/NormalizationTest.txt through the port's NFC routine (see dmsfmt/tests/nfc_conformance.zig for the pattern). 100% pass is required.
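
The conformance test in step 4 can be sketched as follows, assuming the port exposes its in-tree NFC routine as a callable (passed in as nfc here). Each data line of NormalizationTest.txt holds five semicolon-separated columns c1 to c5, and the NFC invariants stated in the file header are c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5).

```python
def parse_code_points(field):
    """Turn a column such as "0044 0307" into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def nfc_failures(lines, nfc):
    """Yield (line_number, description) for every NFC invariant that fails."""
    for number, line in enumerate(lines, start=1):
        data = line.split("#", 1)[0].strip()
        if not data or data.startswith("@"):  # blank, comment, or @Part header
            continue
        c1, c2, c3, c4, c5 = (parse_code_points(f) for f in data.split(";")[:5])
        if not (nfc(c1) == nfc(c2) == nfc(c3) == c2):
            yield number, "c2 != NFC(c1) == NFC(c2) == NFC(c3)"
        if not (nfc(c4) == nfc(c5) == c4):
            yield number, "c4 != NFC(c4) == NFC(c5)"
```

A passing port produces no failures over the whole file. The file also requires that code points absent from its Part 1 are unchanged by normalization, which this sketch does not check.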

Why not vendor a host Unicode library

For some languages it's tempting to depend on a pinned third-party library that already happens to ship UCD 15.1 — utf8proc 2.9 (the one dms-c already vendors), Rust's unicode-normalization 0.1.x, etc. That works for the freeze in the short term, but it puts the port one bundler upgrade away from a silent Unicode bump. Generated in-tree tables make the freeze visible in the diff: any future Unicode change shows up as a SPEC-controlled commit touching the generator and the table sources together.

The C port is the one exception. It vendors utf8proc directly because (a) utf8proc's release cadence is glacial, (b) its in-tree data file is auditable as a static blob, and (c) it would be silly to write a third C codebase just to compile-time-bake the same tables. FFI-tier ports inherit the C port's utf8proc vendor and don't need their own generator.

Bench harness

Six drivers: three formats (DMS, JSON, YAML), each in a pure and a native tier. Each driver reads source from stdin, decodes once, prints ok\n. The runner subtracts an interpreter-startup probe and reports the best of N.

| driver | decodes | library |
|--------|---------|---------|
| decode_dms | DMS | pure-language port |
| decode_dms_<ffi> | DMS | FFI port (interpreted langs only) |
| decode_json_pure | JSON | pure-language JSON parser |
| decode_json | JSON | C-backed JSON parser (or stdlib if it is C) |
| decode_yaml_pure | YAML | pure-language YAML parser |
| decode_yaml | YAML | libyaml-backed parser |

The Python harness in dms-lua/bench/bench_decoders.py is the reference orchestrator; copy and adapt it for other languages. Single-tier (compiled) ports keep only the three *_pure-equivalent drivers, one per format (their canonical libraries are already pure-native).
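
The runner's timing logic can be sketched as below. This is not the bench_decoders.py code; the function names are made up, and subtracting the best probe time from the best driver time is one plausible reading of "startup-subtracted".

```python
import subprocess
import time


def best_of(cmd, source: bytes, runs: int = 5) -> float:
    """Best wall-clock seconds over `runs` executions of a driver command."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        done = subprocess.run(cmd, input=source, capture_output=True, check=True)
        elapsed = time.perf_counter() - start
        if done.stdout.strip() != b"ok":  # every driver must print ok
            raise RuntimeError(f"unexpected driver output: {done.stdout!r}")
        best = min(best, elapsed)
    return best


def startup_subtracted(driver_cmd, probe_cmd, source: bytes, runs: int = 5) -> float:
    """Driver time minus the time of a probe that only starts the interpreter."""
    decode_time = best_of(driver_cmd, source, runs)
    startup_time = best_of(probe_cmd, b"", runs)
    return max(0.0, decode_time - startup_time)
```

The probe is a script in the same language that prints ok without importing or decoding anything, so the difference isolates decode cost from process startup.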

README perf table

Standardized format. The 50,000-key flat-map fixture lives in dms-tests — generate once, reuse across ports.

## Performance

50,000-key flat document, best-of-5, startup-subtracted.

| tier        | DMS port  | time   | JSON peer  | time   | YAML peer  | time   | DMS / JSON | DMS / YAML |
|-------------|-----------|--------|------------|--------|------------|--------|------------|------------|
| pure        | dms       | <X> ms | <pure-json>| <X> ms | <pure-yaml>| <X> ms | <r>×       | <r>×       |
| native (C)  | dms_c     | <X> ms | <ffi-json> | <X> ms | <ffi-yaml> | <X> ms | <r>×       | <r>×       |

Single-tier ports drop the second row.
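
A small helper can keep the ratio columns consistent across ports. The sketch below assumes times are already startup-subtracted milliseconds and that each ratio is DMS time divided by the peer's time, as the column headers suggest; the function name is made up.

```python
def perf_row(tier, dms_name, dms_ms, json_name, json_ms, yaml_name, yaml_ms):
    """Render one row of the standardized README perf table."""
    cells = [
        tier,
        dms_name, f"{dms_ms:.1f} ms",
        json_name, f"{json_ms:.1f} ms",
        yaml_name, f"{yaml_ms:.1f} ms",
        f"{dms_ms / json_ms:.2f}×",  # DMS / JSON
        f"{dms_ms / yaml_ms:.2f}×",  # DMS / YAML
    ]
    return "| " + " | ".join(cells) + " |"
```

A ratio above 1× means DMS is slower than the peer; below 1× means it is faster.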

Per-language baseline picks

For the JSON/YAML peers, default to these. Pick something equivalent if your language has a different community standard.

| Language | Pure JSON | FFI JSON | Pure YAML | FFI YAML |
|----------|-----------|----------|-----------|----------|
| Lua | dkjson | lua-cjson | lua-tinyyaml | lyaml (libyaml) |
| Python | n/a (stdlib is C) | json (stdlib) | ruamel.yaml PP | PyYAML+libyaml |
| Node | n/a (V8 is C++) | JSON.parse (V8) | yaml (npm) | yaml + libyaml WASM/native |
| Ruby | pure_json gem | json (C ext) | psych PP mode | psych+libyaml |
| PHP | seld/jsonlint | json_decode (stdlib) | symfony/yaml | yaml PECL+libyaml |
| Perl | JSON::PP | JSON::XS | YAML::PP | YAML::XS |
| Rust | serde_json | (single-tier) | serde_yaml | (single-tier) |
| Go | encoding/json | (single-tier) | gopkg.in/yaml.v3 | (single-tier) |
| C# | System.Text.Json | (single-tier) | YamlDotNet | (single-tier) |
| Java | Jackson / Gson | (single-tier) | SnakeYAML | (single-tier) |
| Crystal | stdlib JSON | (single-tier) | stdlib YAML | (single-tier) |
| Zig | std.json | (single-tier) | zig-yaml | (single-tier) |

When a language has no commonly-used pure-X parser (Python's json module, Node's JSON.parse), report only the FFI tier and call out the asymmetry in the bench note. Don't fabricate a "pure" comparison against an obscure library nobody uses.

Status matrix

Port Pure FFI Two-tier bench
dms-lua
dms-js
dms-rb
dms-php
dms-py
dms-pl
dms-rs n/a
dms-c n/a
dms-zig n/a
dms-go n/a
dms-cs n/a
dms-java n/a
dms-cr n/a

= not yet ported. n/a = compiled language; FFI tier doesn't apply.

Concrete steps

For an interpreted-language port that already has a pure implementation:

  1. Add a sibling subdir under the port repo (e.g. dms-c/ for Lua, ext/dms_c/ for Ruby) and vendor dms.c, dms.h, vendor/utf8proc/*.
  2. Write the binding glue. Mirror the pure module's value shape exactly (table flag, key-order array, list flag, datetime wrapper).
  3. Add the FFI package manifest. Sources must include the glue file, dms.c, and utf8proc.c. Compile flag -DUTF8PROC_STATIC.
  4. Build locally and smoke-test decode(src) against a few fixtures.
  5. Add decode_dms_<ffi>, decode_json_pure, decode_yaml_pure drivers.
  6. Extend the bench harness to print the two-tier table.
  7. Run the conformance suite against the FFI module — must hit the same pass count as the pure port.
  8. Update the README perf section to the standardized format.
  9. Commit and push.

For a compiled-language port, only steps 5–9 apply (drop the FFI row from the table).
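
Beyond matching pass counts (step 7), it can help to compare the two tiers fixture by fixture. The sketch below assumes fixtures are available as a name → source mapping and that both decoders return values comparable with ==; the function name is made up.

```python
def tier_mismatches(decode_pure, decode_ffi, fixtures):
    """Return the names of fixtures on which the two tiers disagree."""

    def outcome(decode, source):
        try:
            return ("ok", decode(source))
        except Exception:
            # Rejection is an outcome too; error types differ between tiers,
            # so only the fact of failure is compared.
            return ("error", None)

    return [
        name
        for name, source in fixtures.items()
        if outcome(decode_pure, source) != outcome(decode_ffi, source)
    ]
```

An empty result means both tiers accept the same fixtures and produce equal values for them, which is a stricter check than equal pass counts.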

Open question: package distribution

Pre-built binaries for OS × arch is the next gate after publishing. PyPI wheels, npm prebuilt-binaries, PECL binary releases, etc. each have their own conventions. Keep the source-build path working unconditionally; binary distribution can be opted into per-port.