Porting DMS to a new language
This document is the checklist for bringing a new language port to feature parity, including the two-tier (pure + FFI) pattern when the host language warrants it.
The reference implementation to read alongside this doc is dms-rs: the Rust port is the cleanest end-to-end view of the decoder pipeline. It is single-tier (compiled languages don't ship an FFI tier; see the table below), so for the two-tier shape specifically (a pure-language decoder plus a native-speed FFI sibling sharing one repo) read dms-lua (the dms-lua and dms-c rocks), which is where the two-tier pattern was first codified.
Canonical API surface
Every conforming port exposes the same three top-level entry points, two on the decode side and one on the encode side, named per the host language's idiomatic casing:
| Concept | Function (snake_case) | Idiomatic variants |
|---|---|---|
| Decode a full document | decode(source) | decodeDocument, Decode, Parse (legacy alias only) |
| Decode only the front matter | decode_front_matter(source) | decodeFrontMatter, DecodeFrontMatter |
| Encode a Document back to DMS | encode(document) | encodeDocument, Encode, ToDMS (legacy alias only) |
Lite-mode variants (decode_lite, encode_lite) are optional —
ports that ship them must follow the contract in SPEC.md §Decoding
modes — full and lite. The front-matter-only entry point is
required at tier 0 (SPEC.md §Front-matter-only decode); there is
no capability flag.
Migration from the parse/to_dms era
SPEC v0.14 renames the canonical entry points from parse/to_dms
to decode/encode. Existing ports were on the prior naming and
need a one-release migration:
- Add the new names as the primary surface: decode, decode_front_matter, encode, and decode_lite / encode_lite if the port shipped lite mode. The conformance harness and bench drivers all target the new names.
- Keep the old names as deprecated thin aliases for one release. parse calls decode; to_dms calls encode. Mark them deprecated in the host language's idiomatic way (#[deprecated], @deprecated, JSDoc @deprecated, etc.) so downstream consumers see the warning at compile time / on import.
- Remove the aliases in the release after. That makes a two-release deprecation window in total: long enough for downstream consumers to migrate, short enough that the alias surface doesn't accrete.
- Update the bench drivers: parse_dms → decode_dms, parse_json_pure → decode_json_pure, etc. (See the Bench harness section below.) dms-tests will accept either name during the deprecation window, then drop the legacy form.
- Update README and CHANGELOG to reference the new names.
The old aliases must call the new implementation, not the other way
around — the canonical name is decode/encode, and the deprecated
names exist purely as a compatibility shim.
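As a minimal sketch of the shim direction described above, here is how a Python port might forward the legacy names to the canonical ones. The function bodies are stand-ins (a real port would run its decoder and encoder there), and the use of `warnings.warn` with `DeprecationWarning` is simply Python's idiomatic deprecation marker, not something the SPEC mandates.

```python
import warnings


def decode(source):
    """Canonical entry point. Stand-in body: a real port decodes DMS here."""
    return {"source_length": len(source)}


def encode(document):
    """Canonical entry point. Stand-in body: a real port emits DMS here."""
    return repr(document)


def parse(source):
    """Deprecated alias kept for one release; forwards to decode()."""
    warnings.warn("parse() is deprecated; use decode()",
                  DeprecationWarning, stacklevel=2)
    return decode(source)


def to_dms(document):
    """Deprecated alias kept for one release; forwards to encode()."""
    warnings.warn("to_dms() is deprecated; use encode()",
                  DeprecationWarning, stacklevel=2)
    return encode(document)
```

The aliases hold no logic of their own, so deleting them in the following release is a pure removal.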
Tier 1
This section is the guide for bringing a port from tier-0 conformance to
full tier-1 conformance: decorator parsing, dialect registry, hoist
pass, family resolution, param-signature validation, and canonical-form
encoding. The reference implementation is dms-rs (219 tests, 6
examples, criterion benches, streaming module); everything below
distils lessons learned from that work.
Tier 1 scope
A tier-1 port adds the following on top of the tier-0 baseline:
- Front-matter acceptance: _dms_tier: 1 recognised; _dms_imports parsed (sigil / semver / allow-deny / cross-import collision validation).
- Decorator-call lexer: sigil run + name + optional dotted namespace + balanced (...) groups.
- Per-position sidecar attachment: Leading / Inner / Trailing / Floating, mirroring the four tier-0 comment positions.
- Decoration inside flow forms: [...] / {...} elements each carry their own sidecar.
- Decoration-only form: + |meta(charset:"x") / key: |required using family.empty_default.
- Hoist pass: family.content_slot moves the named param into the value tree at the decorator's path.
- Family resolution: alias rewrite, allow/deny filter, ambiguity reporting, namespace-qualified calls.
- Param-signature validation: wildcard / wildcard_with_typed / strict / positional modes; required keys; defaults; type matching including ListOf / MapOf / named structs (with cycle detection); variadic last-slot collection.
- Dialect registry: registry trait + InMemoryRegistry impl with five version-match strategies (exact / caret / tilde / gte / any), default caret, semver parse including pre-release ordering.
- Encoder: emit decorations at all four positions; canonical-form heuristics (empty_default suppression for inner + leading, flow re-fold for short scalar lists into content_slot).
- Streaming (optional): per TIER1.md §1466, streaming is opt-out; batch-only is spec-compliant. See note below.
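The five version-match strategies named in the dialect-registry item can be sketched as follows. This is an illustration under assumptions, not the registry's actual code: the `Version` class and `matches` function are hypothetical names, pre-release ordering follows Semantic Versioning 2.0.0, and the caret and tilde upper bounds follow the common npm/Cargo convention, which the SPEC may define differently.

```python
import re
from functools import total_ordering

_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$")


@total_ordering
class Version:
    def __init__(self, text):
        m = _SEMVER.match(text)
        if m is None:
            raise ValueError(f"not a semver string: {text!r}")
        self.core = (int(m[1]), int(m[2]), int(m[3]))
        self.pre = tuple(m[4].split(".")) if m[4] else ()

    def _key(self):
        # Numeric identifiers sort below alphanumeric ones; a release
        # (no pre-release tag) sorts above every pre-release of its core.
        pre_key = tuple((0, int(p), "") if p.isdigit() else (1, 0, p)
                        for p in self.pre)
        return (self.core, 0 if self.pre else 1, pre_key)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()


def matches(strategy, wanted, candidate):
    """Return True if `candidate` satisfies `wanted` under `strategy`."""
    w, c = Version(wanted), Version(candidate)
    if strategy == "any":
        return True
    if strategy == "exact":
        return c == w
    if strategy == "gte":
        return c >= w
    major, minor, patch = w.core
    if strategy == "tilde":
        return c >= w and c.core < (major, minor + 1, 0)
    if strategy == "caret":
        # Assumed convention: the leftmost non-zero component is pinned.
        if major > 0:
            upper = (major + 1, 0, 0)
        elif minor > 0:
            upper = (0, minor + 1, 0)
        else:
            upper = (0, 0, patch + 1)
        return c >= w and c.core < upper
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Keeping the ordering key in one place means every strategy shares the same pre-release semantics, which is where hand-rolled comparisons tend to diverge across ports.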
Recommended implementation ordering
Sixteen sub-steps in the order that worked for dms-rs. Each step
carried 5–13 new tests on average; total test count grew from ~95
(tier-0 baseline) to 219.
1. Decorator-call lexer: pure structural scan, no AST integration.
2. _dms_imports parser + tier-1 FM acceptance: FM accepts _dms_tier: 1.
3. Decorator-call parser: use lex helpers + reuse tier-0 inline-value parser via wrap-and-decode.
4. Sidecar attachment, Leading + Floating: line-start positions via skip_trivia.
5. Sidecar attachment, Inner + Trailing: post-header / post-value positions.
6. Dialect registry types + trait + InMemoryRegistry stub: no resolution yet.
7. Hoist pass: content_slot lookup + body navigation by path.
8. Decoration-only form: placeholder + empty_default substitution after resolution.
9. Full resolution rules: alias / allow / deny / ambiguity.
10. Param-signature validation: wildcard / wildcard_with_typed / strict + typed-key checks + required + defaults.
11. Positional + variadic + named-struct validation.
12. Decoration inside flow forms: path push/pop on flow_array / flow_table elements.
13. params_dec nested decoration: sub-parser with tier=1 enabled; harvest decorations_raw.
14. Canonical-form encoder heuristic: suppression + flow re-fold.
15. Dialect versioning strategies: semver parse + per-strategy match.
16. Tier-1 encoder: per-position emit splice into block emitters.
Hot paths and gotchas
Wrap-and-decode for param groups. dms-rs's parse_param_group
wraps a slice as {...} or [...] and calls a sub-parser with tier=1
set. Mode (named vs positional) is detected by peeking the first
significant token (skip whitespace + comments). This avoids
reimplementing flow-table/array parsing from scratch.
Path-cloning is the cost of sidecar attachment. Every flush of
pending leading decorations clones self.path. Tier-0 already pays this
cost for comments; tier-1 doubles it. For documents over ~1 MB consider
arena-backed paths or Rc<[Segment]>.
Sub-parser allocation in nested decoration. Each (...) param group
constructs a fresh parser when tier-1 is active. dms-rs added a
fast-path that scans the slice for sigil characters and falls back to
tier-0 decode when none are present (~5% improvement on decorator-heavy
documents).
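The fast path described above amounts to a cheap pre-scan before choosing a decoder. A rough sketch, assuming the active sigil set is available as a collection of characters and that both decoders are passed in as callables (all names here are hypothetical, not dms-rs's):

```python
def decode_param_group(slice_text, sigils, decode_tier0, decode_tier1):
    """Use the tier-0 decoder when the slice contains no sigil character.

    The scan is conservative: a sigil inside a string literal still sends
    the slice down the tier-1 path, which is slower but never wrong.
    """
    if not any(ch in sigils for ch in slice_text):
        return decode_tier0(slice_text)
    return decode_tier1(slice_text)
```

Because the check can only produce false positives, it does not need to understand quoting or comments, which keeps it a single linear pass.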
Streaming by batch+visitor. dms-rs ships a post-process streaming
visitor: parse fully, then emit a StreamEvent iterator. Spec §1466
explicitly permits batch-only ports. True coroutine streaming requires a
parser refactor; the visitor approach gives consumers the API contract for
typical use cases at much lower implementation cost.
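To make the batch+visitor shape concrete, here is a small sketch that walks an already-decoded value and yields events lazily. The event kinds and the `StreamEvent` fields are illustrative assumptions; the actual dms-rs StreamEvent type may carry different variants.

```python
from typing import Any, NamedTuple, Tuple


class StreamEvent(NamedTuple):
    kind: str                 # hypothetical kinds, see stream_events below
    path: Tuple[Any, ...]     # keys and indexes from the document root
    value: Any = None         # only set for scalar events


def stream_events(value, path=()):
    """Yield events for a fully decoded value tree, depth first."""
    if isinstance(value, dict):
        yield StreamEvent("start_map", path)
        for key, child in value.items():
            yield from stream_events(child, path + (key,))
        yield StreamEvent("end_map", path)
    elif isinstance(value, list):
        yield StreamEvent("start_list", path)
        for index, child in enumerate(value):
            yield from stream_events(child, path + (index,))
        yield StreamEvent("end_list", path)
    else:
        yield StreamEvent("scalar", path, value)
```

Consumers iterate events exactly as they would with a true streaming parser; the only difference is that the whole document is decoded before the first event is produced.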
Encoder canonicalization is lossy. The flow re-fold heuristic
transforms block-form children into content_slot: [...] flow form when
content is a scalar list. Round-trip equivalence is semantic, not
byte-identical. Document this clearly in the port's encoder API docs.
SortedIndex pattern over HashMap. The tier-0 encoder uses sorted-Vec
+ binary_search for path → comment lookup. The tier-1 streaming module
replaced its HashMap with the same pattern for ~9% speedup. Recommend
SortedIndex for any path-keyed lookup hot path.
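The SortedIndex idea translates to most host languages as a sorted key array plus binary search. A minimal Python sketch, assuming path keys are tuples of (kind, value) segments so that every pair of keys is mutually comparable; the class below is an illustration, not the dms-rs type:

```python
from bisect import bisect_left


class SortedIndex:
    """Immutable path -> value lookup backed by a sorted list."""

    def __init__(self, pairs):
        ordered = sorted(pairs, key=lambda pair: pair[0])
        self._keys = [key for key, _ in ordered]
        self._values = [value for _, value in ordered]

    def get(self, key, default=None):
        # Binary search for the leftmost slot where `key` could live.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default
```

The index is built once after decoding and never mutated, which is what lets it drop hashing entirely.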
Family resolution is ambiguity-bearing. Under multi-import bind
tables, the same fn_name across two families is a parse error requiring
|<ns>.<fn>(...) qualification. The error message matters — list the
candidate families explicitly.
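One way to produce an error message that lists the candidates is sketched below. The bind-table shape (namespace mapped to a set of exported function names), the exception class, and the message wording are all assumptions for illustration; only the |<ns>.<fn>(...) qualification form comes from the text above.

```python
class AmbiguousDecorator(Exception):
    """Raised when an unqualified call matches more than one family."""


def resolve_family(fn_name, bind_table, namespace=None):
    """Return the namespace of the family that owns `fn_name`."""
    if namespace is not None:
        if fn_name in bind_table.get(namespace, ()):
            return namespace
        raise LookupError(f"unknown decorator |{namespace}.{fn_name}")
    candidates = sorted(ns for ns, names in bind_table.items()
                        if fn_name in names)
    if not candidates:
        raise LookupError(f"unknown decorator |{fn_name}")
    if len(candidates) == 1:
        return candidates[0]
    options = ", ".join(f"|{ns}.{fn_name}(...)" for ns in candidates)
    raise AmbiguousDecorator(
        f"|{fn_name} is exported by {len(candidates)} imported families; "
        f"qualify it as one of: {options}")
```

Sorting the candidates keeps the message stable across runs, which matters when fixtures or snapshot tests compare error text.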
Conformance corpus protocol (tier 1)
Tier-1 fixtures are an extension of the existing dms-tests corpus:
- Valid fixtures live under dms-tests/valid_t1/<dialect-id>/ as paired <name>.dms + <name>.json files.
- Invalid fixtures live under dms-tests/invalid_t1/<dialect-id>/ as <name>.dms only (error expected).
- Wrapper schema is documented in dms-tests/TESTS_T1.md. Format: { tier, imports, body, decorators }. Path segments are encoded as single-key objects: {"key":"foo"} for map keys, {"index":N} for sequence positions.
- Runner invocation: python3 run_conformance.py <encoder> --tier=1. Default --tier=0 preserves backwards compatibility.
- Encoder protocol: read DMS from stdin; accept --tier=1 flag; write the tagged JSON wrapper to stdout; parse errors to stderr as <line>:<col>: <message>, exit 1.
- Per-port matrix: each port declares which dialect IDs it supports. Pass count is reported as port × dialect → N/M.
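The encoder protocol above fits in a few lines of glue. The sketch below is an assumption-laden illustration: `DecodeError`, `encode_path`, and `run` are hypothetical names, the `decode` callable stands in for the port's real decoder, and the exact wrapper contents are defined by dms-tests/TESTS_T1.md rather than by this code.

```python
import json


class DecodeError(Exception):
    def __init__(self, line, col, message):
        super().__init__(message)
        self.line, self.col, self.message = line, col, message


def encode_path(path):
    """Encode path segments as single-key objects (key vs. index)."""
    return [{"index": seg} if isinstance(seg, int) else {"key": seg}
            for seg in path]


def run(argv, stdin, stdout, stderr, decode):
    """Conformance encoder: stdin -> JSON wrapper on stdout, exit code."""
    tier = 0  # default preserves backwards compatibility
    for arg in argv:
        if arg.startswith("--tier="):
            tier = int(arg.split("=", 1)[1])
    try:
        wrapper = decode(stdin.read(), tier)
    except DecodeError as err:
        stderr.write(f"{err.line}:{err.col}: {err.message}\n")
        return 1
    json.dump(wrapper, stdout)
    stdout.write("\n")
    return 0
```

A real entry script would call `sys.exit(run(sys.argv[1:], sys.stdin, sys.stdout, sys.stderr, decode))`; passing the streams in explicitly keeps the function testable without spawning a process.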
Per-port API knobs
Required:
| Entry point | Purpose |
|---|---|
| decode_t1(src) | Returns DocumentT1 equivalent, no registry. |
| decode_t1_with_registry(src, registry) | Full hoist + family resolution + param validation. |
| encode_t1(doc) | Round-trip back to DMS source. |
| Dialect registry trait + in-memory impl | At minimum an empty / InMemoryRegistry impl. |
Optional (port-level choice):
| Feature | Notes |
|---|---|
| Streaming decoder | Visitor-based batch+stream is sufficient; coroutine streaming is not required. |
| mutate_t1 helper API | Path-rewriting on body + sidecar in lockstep. Useful for transforms. |
| Built-in dialect specs | E.g. dms+html constants bundled into the port. |
| LSP / formatter integration | Port-level choice; no cross-port compatibility requirement. |
Goal
Each language port should let users pick between two implementations that share the same public API and value shape:
- Pure: written in the host language. Portable, no C toolchain required at install time.
- Native (FFI): a thin binding around the canonical C decoder (dms-c). Comparable in speed to the language's C-backed JSON parser.
The benchmark for the port reports both tiers like-for-like:
- pure-DMS vs pure-language JSON parser and pure-language YAML parser
- native-DMS vs C-backed JSON parser and C-backed YAML parser
This produces an honest "DMS costs ~Nx vs JSON, but is Mx faster than YAML" story that holds whether the user chose portability or speed.
When the FFI tier doesn't apply
For languages that compile to native code, ahead of time or through a JIT (Rust, Go, Zig, C#, Java, Crystal, C), the "pure" implementation already runs at C-comparable speed. Adding an FFI layer to dms-c only adds marshalling overhead, so those ports are single-tier:
| Language tier | Languages | Tiers shipped |
|---|---|---|
| Compiled / native | C, Rust, Go, Zig, C#, Java, Crystal | one (pure-native) |
| Interpreted, C-stdlib JSON | Lua, Python, Node, Ruby, PHP, Perl | two (pure + FFI) |
The bench format is identical in either case; the FFI row is just omitted for compiled languages.
Minimum deliverables
- [ ] Pure-language decoder at the same SPEC tier (≥ tier 0) as the reference implementations
- [ ] Native (FFI) decoder wrapping dms-c (only for interpreted languages; see table above)
- [ ] Identical public API and value shape across both tiers
- [ ] Frozen Unicode tables (NFC + bare-key set), generated from UCD 15.1, shipped in-tree (see Unicode tables)
- [ ] 100% pass on the upstream conformance corpus (dms-tests)
- [ ] Bench drivers: decode_dms, decode_dms_<ffi>, decode_json, decode_json_pure, decode_yaml, decode_yaml_pure
- [ ] README with the standard two-tier perf table (see below)
- [ ] Logo at the top of the README (per the cross-port convention)
FFI binding skeleton
The Lua C module is the smallest reference (~150 lines):
dms-lua/dms-c/dms_c.c.
Three functions cover the shape:
push_table(L, &v->u.t) // dms_table → host language ordered map
push_list(L, &v->u.l) // dms_list → host language array
push_value(L, v) // dms_value → switch on v->type
For datetime values, emit a wrapper carrying the source lexeme and a type tag — match the pure module exactly so encoders work unchanged across both tiers.
Vendoring
Vendor dms.c, dms.h, and vendor/utf8proc/* from the
dms-c repo into the FFI
package directory. Mirror the directory layout — dms.c includes
"vendor/utf8proc/utf8proc.h" as a relative path:
<port>/<ffi-pkg>/
binding.c # the language-specific glue
dms.c # vendored from dms-c
dms.h
vendor/utf8proc/utf8proc.h
vendor/utf8proc/utf8proc.c
vendor/utf8proc/utf8proc_data.c
vendor/utf8proc/LICENSE.md
Compile with -DUTF8PROC_STATIC and an include path that resolves
"vendor/utf8proc/utf8proc.h" from dms.c's directory.
Package manifest patterns
| Language | Pure manifest | FFI manifest |
|---|---|---|
| Lua | dms-*.rockspec | dms-c-*.rockspec (modules.sources includes glue + dms.c + utf8proc) |
| Python | pyproject.toml | second setup.py with Extension(...) (see dms-py/dms-c/setup.py) |
| Node | package.json | second package using node-gyp or napi-rs, sources in binding.gyp |
| Ruby | dms.gemspec | dms-c.gemspec with extensions: ["ext/extconf.rb"] |
| PHP | composer.json | PECL package.xml + config.m4 |
| Perl | Makefile.PL (PP) | Makefile.PL (XS), already shipped as DMS-XS-Parser |
Both packages live in the same git repo (dms-<lang>/). The pure
package has minimal deps; the FFI package's only language-side dep is
the C toolchain.
Unicode tables (pure tier)
Two parts of the SPEC require Unicode property data:
- NFC normalization: every decoded string is NFC-normalized (SPEC §Unicode normalization). Needs canonical decompositions, combining classes, composition pairs, and the QC_No quick-check list.
- Bare-key acceptance set: bare keys outside ASCII are members of XID_Continue ∖ Default_Ignorable_Code_Point (SPEC §What counts as a bare key).
Both sets are frozen at Unicode 15.1.0 by SPEC and must be
shipped in-tree by every pure-tier port. Do not delegate to the
host runtime's Unicode library (Python's unicodedata, Go's
golang.org/x/text/unicode/norm, ICU, JS's String.prototype.normalize,
Ruby's String#unicode_normalize, etc.). Those track whichever
Unicode version their runtime was built against, so the set of
accepted bare-key characters and the NFC mapping of new codepoints
would silently drift over time and across ports. Frozen tables make
documents byte-identical across ports and across decades.
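As a rough picture of what a frozen in-tree table looks like at lookup time, the sketch below checks a character against a sorted tuple of inclusive code point ranges. The two ranges shown are only an illustrative excerpt (Latin-1 letters); a real table is generated from UCD 15.1.0 and is far larger, and the names `BARE_KEY_RANGES` and `is_bare_key_char` are hypothetical.

```python
from bisect import bisect_right

# Illustrative excerpt only. A generated table covers all of
# XID_Continue minus Default_Ignorable_Code_Point at Unicode 15.1.0.
BARE_KEY_RANGES = (
    (0x00C0, 0x00D6),  # À..Ö
    (0x00D8, 0x00F6),  # Ø..ö
)
_RANGE_STARTS = tuple(lo for lo, _ in BARE_KEY_RANGES)


def is_bare_key_char(ch):
    """True if `ch` falls inside one of the frozen non-ASCII ranges."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_RANGE_STARTS, cp) - 1
    return i >= 0 and cp <= BARE_KEY_RANGES[i][1]
```

Because the table is plain data committed to the repo, the accepted set cannot change when the host runtime upgrades its own Unicode library.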
Reference generators
The Zig formatter dmsfmt under this repo is the
reference: two Python generators read the UCD 15.1 source files and
emit Zig source for the decoder to import.
dmsfmt/tools/
gen_nfc_tables.py # produces src/nfc_tables.zig
gen_xid_table.py # produces src/xid_table.zig
ucd_15_1/
UnicodeData.txt
CompositionExclusions.txt
DerivedCoreProperties.txt
PropList.txt
NormalizationTest.txt # used by the conformance test, not the generator
To bring up a new pure port:
- Copy dmsfmt/tools/ucd_15_1/ into the port repo (e.g. dms-<lang>/tools/ucd_15_1/). The UCD files are tiny (~5 MB total) and committing them keeps the port build hermetic.
- Adapt gen_nfc_tables.py and gen_xid_table.py to emit the host language's syntax. The decoding logic stays; only the final "write Zig source" step changes:
  - Python → tuples / sets in a .py module
  - Go → []Range / map literals in a .go file
  - Rust → &[Range] / phf::Map in a .rs file
  - JS → typed-array literals in a .js module
  - Ruby → frozen-array literals in an .rb file
  - PHP → array literals in a .php module
  - Lua → table constructors in a .lua module
- Commit the generated table source. Re-run the generator only when SPEC bumps the Unicode floor.
- Add a conformance test that round-trips dmsfmt/tools/ucd_15_1/NormalizationTest.txt through the port's NFC routine (see dmsfmt/tests/nfc_conformance.zig for the pattern). 100% pass is required.
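The NFC conformance step can be sketched as below. Each data line of NormalizationTest.txt holds five semicolon-separated columns of hex code points (c1 to c5), and the NFC invariants from the file's header are c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5). The function names are hypothetical, and `nfc` is the port's own table-driven routine passed in as a parameter; this is not a transcription of nfc_conformance.zig.

```python
def parse_test_line(line):
    """Return (c1..c5) as strings, or None for comments and @Part headers."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    return tuple("".join(chr(int(cp, 16)) for cp in field.split())
                 for field in fields)


def nfc_failures(lines, nfc):
    """Return the 1-based line numbers where an NFC invariant fails."""
    failures = []
    for number, line in enumerate(lines, start=1):
        columns = parse_test_line(line)
        if columns is None:
            continue
        c1, c2, c3, c4, c5 = columns
        ok = (nfc(c1) == c2 and nfc(c2) == c2 and nfc(c3) == c2
              and nfc(c4) == c4 and nfc(c5) == c4)
        if not ok:
            failures.append(number)
    return failures
```

A port's test then opens the committed NormalizationTest.txt, passes its lines and its NFC function to `nfc_failures`, and asserts that the result is empty.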
Why not vendor a host Unicode library
For some languages it's tempting to depend on a pinned third-party
library that already happens to ship UCD 15.1 — utf8proc 2.9 (the
one dms-c already vendors), Rust's unicode-normalization 0.1.x,
etc. That works for the freeze in the short term, but it puts the
port one bundler upgrade away from a silent Unicode bump. Generated
in-tree tables make the freeze visible in the diff: any future
Unicode change shows up as a SPEC-controlled commit touching the
generator and the table sources together.
The C port is the one exception. It vendors utf8proc directly
because (a) utf8proc's release cadence is glacial, (b) its
in-tree data file is auditable as a static blob, and (c) it would
be silly to write a third C codebase just to compile-time-bake the
same tables. FFI-tier ports inherit the C port's utf8proc vendor
and don't need their own generator.
Bench harness
Six drivers cover three formats (DMS, JSON, YAML) across two tiers. Each driver reads source from stdin, decodes it once, and prints ok\n. The runner subtracts an interpreter-startup probe and reports the best of N runs.
| driver | decodes | library |
|---|---|---|
| decode_dms | DMS | pure-language port |
| decode_dms_<ffi> | DMS | FFI port (interpreted langs only) |
| decode_json_pure | JSON | pure-language JSON parser |
| decode_json | JSON | C-backed JSON parser (or stdlib if it is C) |
| decode_yaml_pure | YAML | pure-language YAML parser |
| decode_yaml | YAML | libyaml-backed parser |
The Python harness in
dms-lua/bench/bench_decoders.py
is the reference orchestrator — copy and adapt for other languages.
Single-tier (compiled) ports keep only the four *_pure-equivalent
drivers (their canonical libraries are already pure-native).
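For orientation, a driver and the runner's timing arithmetic might look like the sketch below. It shows a decode_json driver built on the standard library `json` module; the helper `startup_subtracted_best` is a hypothetical name, and taking the minimum of the startup probes is an assumption about the runner, which bench_decoders.py may handle differently.

```python
import json
import sys


def main(stdin=sys.stdin, stdout=sys.stdout):
    """decode_json driver: read stdin, decode once, print ok."""
    source = stdin.read()
    json.loads(source)
    stdout.write("ok\n")


def startup_subtracted_best(run_times_ms, startup_times_ms):
    """Best-of-N wall time minus the best interpreter-startup probe."""
    return max(0.0, min(run_times_ms) - min(startup_times_ms))
```

A driver script would end by calling `main()`. The other five drivers differ only in the decode call, so the harness can treat them interchangeably.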
README perf table
Standardized format. The 50,000-key flat-map fixture lives in
dms-tests — generate
once, reuse across ports.
## Performance
50,000-key flat document, best-of-5, startup-subtracted.
| tier | DMS port | time | JSON peer | time | YAML peer | time | DMS / JSON | DMS / YAML |
|-------------|-----------|--------|------------|--------|------------|--------|------------|------------|
| pure | dms | <X> ms | <pure-json>| <X> ms | <pure-yaml>| <X> ms | <r>× | <r>× |
| native (C) | dms_c | <X> ms | <ffi-json> | <X> ms | <ffi-yaml> | <X> ms | <r>× | <r>× |
Single-tier ports drop the second row.
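The two ratio columns are each the DMS time divided by the peer's time, so a value above 1 means DMS is slower than that peer and a value below 1 means it is faster. A small formatter sketch, where the function name and the rounding (one decimal for milliseconds, two for ratios) are assumptions rather than part of the standardized format:

```python
def perf_row(tier, dms_name, dms_ms, json_name, json_ms, yaml_name, yaml_ms):
    """Format one row of the README perf table from measured times in ms."""
    cells = [
        tier,
        dms_name, f"{dms_ms:.1f} ms",
        json_name, f"{json_ms:.1f} ms",
        yaml_name, f"{yaml_ms:.1f} ms",
        f"{dms_ms / json_ms:.2f}×",   # DMS / JSON
        f"{dms_ms / yaml_ms:.2f}×",   # DMS / YAML
    ]
    return "| " + " | ".join(cells) + " |"
```

Generating rows from the harness output, rather than editing them by hand, keeps the README numbers and the ratios consistent with each other.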
Per-language baseline picks
For the JSON/YAML peers, default to these. Pick something equivalent if your language has a different community standard.
| Language | Pure JSON | FFI JSON | Pure YAML | FFI YAML |
|---|---|---|---|---|
| Lua | dkjson | lua-cjson | lua-tinyyaml | lyaml (libyaml) |
| Python | n/a (stdlib is C) | json (stdlib) | ruamel.yaml PP | PyYAML+libyaml |
| Node | n/a (V8 is C++) | JSON.parse (V8) | yaml (npm) | yaml + libyaml WASM/native |
| Ruby | pure_json gem | json (C ext) | psych PP mode | psych+libyaml |
| PHP | seld/jsonlint | json_decode (stdlib) | symfony/yaml | yaml PECL+libyaml |
| Perl | JSON::PP | JSON::XS | YAML::PP | YAML::XS |
| Rust | serde_json | (single-tier) | serde_yaml | (single-tier) |
| Go | encoding/json | (single-tier) | gopkg.in/yaml.v3 | (single-tier) |
| C# | System.Text.Json | (single-tier) | YamlDotNet | (single-tier) |
| Java | Jackson / Gson | (single-tier) | SnakeYAML | (single-tier) |
| Crystal | stdlib JSON | (single-tier) | stdlib YAML | (single-tier) |
| Zig | std.json | (single-tier) | zig-yaml | (single-tier) |
When a language has no commonly-used pure-X parser (Python's json
module, Node's JSON.parse), report only the FFI tier and call out the
asymmetry in the bench note. Don't fabricate a "pure" comparison
against an obscure library nobody uses.
Status matrix
| Port | Pure | FFI | Two-tier bench |
|---|---|---|---|
| dms-lua | ✓ | ✓ | ✓ |
| dms-js | ✓ | ✓ | ✓ |
| dms-rb | ✓ | ✓ | ✓ |
| dms-php | ✓ | ✓ | ✓ |
| dms-py | ✓ | ✓ | — |
| dms-pl | ✓ | ✓ | — |
| dms-rs | ✓ | n/a | — |
| dms-c | ✓ | n/a | — |
| dms-zig | ✓ | n/a | — |
| dms-go | ✓ | n/a | — |
| dms-cs | ✓ | n/a | — |
| dms-java | ✓ | n/a | — |
| dms-cr | ✓ | n/a | — |
— = not yet ported. n/a = compiled language; FFI tier doesn't apply.
Concrete steps
For an interpreted-language port that already has a pure implementation:
1. Add a sibling subdir under the port repo (e.g. dms-c/ for Lua, ext/dms_c/ for Ruby) and vendor dms.c, dms.h, vendor/utf8proc/*.
2. Write the binding glue. Mirror the pure module's value shape exactly (table flag, key-order array, list flag, datetime wrapper).
3. Add the FFI package manifest. Sources must include the glue file, dms.c, and utf8proc.c. Compile flag -DUTF8PROC_STATIC.
4. Build locally and smoke-test decode(src) against a few fixtures.
5. Add decode_dms_<ffi>, decode_json_pure, decode_yaml_pure drivers.
6. Extend the bench harness to print the two-tier table.
7. Run the conformance suite against the FFI module; it must hit the same pass count as the pure port.
8. Update the README perf section to the standardized format.
9. Commit and push.
For a compiled-language port, only steps 5–9 apply (drop the FFI row from the table).
Open question: package distribution
Pre-built binaries for OS × arch is the next gate after publishing. PyPI wheels, npm prebuilt-binaries, PECL binary releases, etc. each have their own conventions. Keep the source-build path working unconditionally; binary distribution can be opted into per-port.