Network File System Version 4                                 T. Haynes
Internet-Draft                                              Hammerspace
Intended status: Standards Track                          28 April 2026
Expires: 30 October 2026

           Proxy-Driven Server for Flexible Files Version 2
            draft-haynes-nfsv4-flexfiles-v2-proxy-server-00

Abstract

   Parallel NFS (pNFS) with the Flexible Files Version 2 layout type
   supports client-side erasure coding and per-chunk repair between
   clients and data servers.  This document extends that architecture
   with a proxy server (PS) role: a registered peer of the metadata
   server that polls the metadata server for work assignments and
   carries them out -- moving a file from one layout to another,
   reconstructing a whole file from surviving shards, or translating
   between codecs for clients that cannot participate in the file's
   native encoding (including NFSv3 clients).  All PS-MDS coordination
   is fore-channel: the metadata server returns work assignments inline
   in the response to a PS-initiated PROXY_PROGRESS poll, and the PS
   reports completion via a fore-channel PROXY_DONE.  No callback
   operations are required for the PS protocol.

Note to Readers

   Discussion of this draft takes place on the NFSv4 working group
   mailing list (nfsv4@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/search/?email_list=nfsv4.

   Source code and issues list for this draft can be found at
   https://github.com/ietf-wg-nfsv4/flexfiles-v2-data-mover.

   Working Group information can be found at
   https://github.com/ietf-wg-nfsv4.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Haynes Expires 30 October 2026 [Page 1] Internet-Draft FFv2 Proxy Server April 2026 Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on 30 October 2026. Copyright Notice Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 5 2.1. Relation to the Main Draft . . . . . . . . . . . . . . . 5 3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.1. In Scope . . . . . . . . . . . . . . . . . . . . . . . . 6 3.2. Out of Scope . . . . . . . . . . . . . . . . . . . . . . 7 4. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.1. Administrative Ingest . . . . . . . . . . . . . . . . . . 9 4.2. Policy-Driven Layout Transition . . . . . . . . . . . . . 9 4.3. DS Maintenance / Evacuation . . . . . . . . . . . . . . . 9 4.4. Whole-File Repair . . . . . . . . . . . . . . . . . . . . 10 4.5. TLS Coverage Transition . . . . . . . . . . . . . . . . . 10 4.6. Filehandle / Storage-Backend Transition . . . . . . . . . 10 4.7. Codec Translation for Codec-Ignorant Clients . . . . . . 11 4.7.1. Mechanism . 
. . . . . . . . . . . . . . . . . . . . . 11 4.7.2. Why the same PROXY_REGISTRATION machinery . . . . . . 12 5. Design Model . . . . . . . . . . . . . . . . . . . . . . . . 12 5.1. Roles . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.2. Session Between MDS and PS . . . . . . . . . . . . . . . 13 5.3. Flow Summary . . . . . . . . . . . . . . . . . . . . . . 14 5.4. Message Sequence: Policy-Driven Move . . . . . . . . . . 14 5.5. Message Sequence: Whole-File Repair . . . . . . . . . . . 16 5.6. Message Sequence: MDS-Initiated Cancellation . . . . . . 16 6. New NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . 17 6.1. proxy_stateid4: A New Stateid Type . . . . . . . . . . . 18 Haynes Expires 30 October 2026 [Page 2] Internet-Draft FFv2 Proxy Server April 2026 6.1.1. Value Space . . . . . . . . . . . . . . . . . . . . . 19 6.1.2. MDS Minting . . . . . . . . . . . . . . . . . . . . . 19 6.1.3. Lifetime . . . . . . . . . . . . . . . . . . . . . . 19 6.1.4. Renewal Semantics . . . . . . . . . . . . . . . . . . 20 6.1.5. Authorization . . . . . . . . . . . . . . . . . . . . 20 6.2. Operation 93: PROXY_REGISTRATION - Register as Data Mover . . . . . . . . . . . . . . . . . . . . . . . . . . 20 6.2.1. ARGUMENTS . . . . . . . . . . . . . . . . . . . . . . 20 6.2.2. RESULTS . . . . . . . . . . . . . . . . . . . . . . . 21 6.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 21 6.3. Operation 94: PROXY_PROGRESS - Heartbeat and Receive Work Assignments . . . . . . . . . . . . . . . . . . . . . . . 23 6.3.1. ARGUMENTS . . . . . . . . . . . . . . . . . . . . . . 23 6.3.2. RESULTS . . . . . . . . . . . . . . . . . . . . . . . 23 6.3.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 24 7. New Fore-Channel Operations: PROXY_DONE and PROXY_CANCEL . . 26 7.1. Operation 99: PROXY_DONE - Commit or Roll Back a Proxy Operation . . . . . . . . . . . . . . . . . . . . . . . . 26 7.1.1. ARGUMENTS . . . . . . . . . . . . . . . . . . . . . . 26 7.1.2. RESULTS . . 
. . . . . . . . . . . . . . . . . . . . . 27 7.1.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 27 7.2. Operation 100: PROXY_CANCEL - Abort a Proxy Operation . . 29 7.2.1. ARGUMENTS . . . . . . . . . . . . . . . . . . . . . . 29 7.2.2. RESULTS . . . . . . . . . . . . . . . . . . . . . . . 29 7.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 29 8. Multi-PS Assignment Fan-out . . . . . . . . . . . . . . . . . 30 9. Affinity Matching . . . . . . . . . . . . . . . . . . . . . . 31 10. Layout Shape During a Proxy Operation . . . . . . . . . . . . 32 10.1. Single-Layout Model . . . . . . . . . . . . . . . . . . 33 10.2. Two-Layout State on the MDS Side . . . . . . . . . . . . 33 10.2.1. Pinned definitions . . . . . . . . . . . . . . . . . 34 10.2.2. Atomic commit on PROXY_DONE . . . . . . . . . . . . 34 10.2.3. Late writes during the swap window . . . . . . . . . 34 10.2.4. No new claim type . . . . . . . . . . . . . . . . . 35 10.2.5. Drain interaction . . . . . . . . . . . . . . . . . 35 10.2.6. Per-instance migration deltas (informative) . . . . 36 11. Client Behavior . . . . . . . . . . . . . . . . . . . . . . . 37 11.1. When the Layout Is Recalled . . . . . . . . . . . . . . 37 11.2. In-Flight I/O When the PS Changes . . . . . . . . . . . 37 12. State Machine . . . . . . . . . . . . . . . . . . . . . . . . 37 12.1. Transitions . . . . . . . . . . . . . . . . . . . . . . 39 13. PS Failure and Recovery . . . . . . . . . . . . . . . . . . . 40 13.1. PS Crash During PROXY_ACTIVE . . . . . . . . . . . . . . 40 13.2. Cascading PS Failure . . . . . . . . . . . . . . . . . . 41 13.3. Source DS Crash During PROXY_ACTIVE . . . . . . . . . . 41 13.4. Destination DS Crash During PROXY_ACTIVE . . . . . . . . 41 14. MDS Crash Recovery . . . . . . . . . . . . . . . . . . . . . 41 14.1. PS Recovery Sequence . . . . . . . . . . . . . . . . . . 42 Haynes Expires 30 October 2026 [Page 3] Internet-Draft FFv2 Proxy Server April 2026 14.2. PS Identity Continuity . . 
. . . . . . . . . . . . . . . 43 14.3. Lost Migration Records . . . . . . . . . . . . . . . . . 43 15. Backward Compatibility . . . . . . . . . . . . . . . . . . . 44 15.1. Clients . . . . . . . . . . . . . . . . . . . . . . . . 44 15.2. Data Servers . . . . . . . . . . . . . . . . . . . . . . 44 15.3. NFSv3 Source DSes . . . . . . . . . . . . . . . . . . . 44 16. Security Considerations . . . . . . . . . . . . . . . . . . . 45 16.1. Credential Forwarding and the Privilege Boundary . . . . 46 16.2. Namespace Traversal Privilege . . . . . . . . . . . . . 49 16.3. PS-Side Policy Enforcement (informative) . . . . . . . . 50 17. Implementations . . . . . . . . . . . . . . . . . . . . . . . 50 17.1. reffs . . . . . . . . . . . . . . . . . . . . . . . . . 50 17.2. Demonstration . . . . . . . . . . . . . . . . . . . . . 51 18. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 52 19. Interaction with the Main Draft . . . . . . . . . . . . . . . 53 19.1. chunk_guard4 . . . . . . . . . . . . . . . . . . . . . . 53 19.2. CHUNK_LOCK . . . . . . . . . . . . . . . . . . . . . . . 53 19.3. CB_CHUNK_REPAIR . . . . . . . . . . . . . . . . . . . . 53 19.4. TRUST_STATEID / REVOKE_STATEID . . . . . . . . . . . . . 54 20. Open Questions . . . . . . . . . . . . . . . . . . . . . . . 54 21. Deferred . . . . . . . . . . . . . . . . . . . . . . . . . . 57 22. References . . . . . . . . . . . . . . . . . . . . . . . . . 57 22.1. Normative References . . . . . . . . . . . . . . . . . . 58 22.2. Informative References . . . . . . . . . . . . . . . . . 58 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 59 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 59 1. Introduction The Flexible Files Version 2 layout type ([I-D.haynes-nfsv4-flexfiles-v2]) introduces client-side erasure coding for pNFS and a per-chunk repair protocol (CB_CHUNK_REPAIR) that lets the metadata server direct an active client to reconstruct individual damaged chunks. 
That mechanism is sufficient for repairs whose scope is a handful of
chunks in a file that has at least one live client.

   Three classes of work are outside the per-chunk repair model.  The
   first is *whole-file repair*: the case in which enough data servers
   have failed that per-chunk reconstruction would require visiting
   every chunk, or in which no live client is available to drive the
   repair at all.  The second is *layout transitions*: a file must move
   from one layout geometry to another for policy reasons (migrating to
   a new coding type, or re-mirroring), for maintenance reasons
   (evacuating a data server ahead of decommission), or for
   environmental reasons (moving between transport-security profiles or
   between filehandle backends).  The third is *codec translation*: a
   client that cannot participate in the file's native codec --
   including every NFSv3 client, and any legacy or minimal NFSv4 client
   that does not implement the file's encoding type -- still needs to
   read and write the file.

   This document specifies a *proxy server (PS)* role to address those
   three cases with a single mechanism: the PS opens a session to the
   metadata server and registers its capabilities via
   PROXY_REGISTRATION; the PS then polls the MDS using PROXY_PROGRESS
   for work assignments (move, repair), which the MDS returns inline in
   the poll response; the PS carries out each assignment and signals
   completion via PROXY_DONE (or abort via PROXY_CANCEL).  All of the
   PS-MDS coordination is fore-channel; the PS does not require a
   back-channel callback program.  Clients discover that a file is
   being proxied through the normal pNFS layout path -- a layout that
   names the PS with a new data-server flag (FFV2_DS_FLAGS_PROXY) --
   and route their I/O through it while it is active.

   This design codifies a mechanism that is common, but
   implementation-specific, in existing Flex Files v1 deployments.
   Today, the lack of a standardized version is a significant
   interoperability gap between pNFS and the parallel filesystems that
   already expose migration primitives.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.1.  Relation to the Main Draft

   [I-D.haynes-nfsv4-flexfiles-v2] defines CB_CHUNK_REPAIR and the
   per-chunk repair model.  This document is the companion whole-file
   and per-client mechanism.  Per-chunk and whole-file operations are
   mutually exclusive for a given file at a given time; coexistence
   rules are in Section 19.

3.  Scope

   This section draws the boundary between the wire-level mechanism
   defined here and the much larger space of useful behaviours a proxy
   implementation might support.  Drawing the boundary tightly keeps
   the protocol small enough to specify and implement in a single
   revision; everything beyond the boundary is either future work or
   implementation latitude, and both categories are called out below so
   later readers know which is which.

3.1.  In Scope

   This document defines a new protocol role, a new session, and a
   small set of operations that flow on that session.  Around that core
   it specifies the layout conventions a client observes while a proxy
   operation is active, the rules that match a client to a co-resident
   proxy, the credential-forwarding rules a translating proxy must
   follow, and the recovery semantics for the three actor failures that
   matter during an operation (PS, MDS, DS).

   The new role is the *proxy server (PS)*, distinct from the MDS and
   DS roles defined in [I-D.haynes-nfsv4-flexfiles-v2].
A PS registers with an MDS and, on receipt of a directive from that MDS, performs a move, a repair, or an ongoing codec translation on behalf of a client. The PS is opaque to most clients and is visible to the MDS through a dedicated NFSv4.1+ session that the PS itself opens. All PS-MDS coordination is fore-channel: the PS issues ops to the MDS, and the MDS returns work assignments inline in its responses. No callback channel is required for the PS protocol. The fore-channel surface is deliberately small. PROXY_REGISTRATION (Section 6.2) lets the PS declare the codec set it supports, its co- residency affinity token, and its lease. PROXY_PROGRESS (Section 6.3) is the PS's poll-and-report op: the PS sends it as a heartbeat (with optional progress reports on in-flight migrations) and the MDS replies with zero or more new work assignments inline. PROXY_DONE (Section 7.1) commits or rolls back a migration when the PS finishes it; PROXY_CANCEL (Section 7.2) lets the PS abort early. All four ops are fore-channel PS-to-MDS. Around the operation set, the document specifies the layout conventions a client sees during a proxy operation and how a client discovers the PS in the first place. Co-residency attestation -- the "affinity matching" machinery that makes host-local proxy shortcuts safe -- travels as an affinity token carried in PROXY_REGISTRATION and is described in Section 9. Codec translation for codec-ignorant clients, including NFSv3 clients, is in scope, and is the one case that stretches the move/ repair vocabulary. The same proxy machinery that handles move and repair also provides the persistent per-client translation that lets a client incapable of participating in a file's native codec still read and write the file. Unlike move and repair, which are transient transitions on the file, codec translation is an ongoing routing arrangement that persists as long as the codec-ignorant client is active. 
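   As a non-normative illustration, the registration exchange sketched
   above might be rendered in XDR roughly as follows.  Only prr_codecs
   is named by this document; the remaining field names and types are
   placeholder assumptions pending the normative ARGUMENTS / RESULTS in
   Section 6.2.

```xdr
/* Illustrative sketch only; not the normative definition (Section
 * 6.2).  prr_codecs appears in this document; all other field names
 * and types are assumptions. */
struct PROXY_REGISTRATION4args {
    ffv2_coding_type4 prr_codecs<>;         /* codecs the PS can serve
                                               or translate between */
    opaque            prr_affinity_token<>; /* co-residency attestation
                                               (Section 9) */
    uint32_t          prr_lease_sec;        /* requested lease */
};

struct PROXY_REGISTRATION4resok {
    uint64_t          prr_reg_id;            /* registration id */
    uint32_t          prr_granted_lease_sec; /* granted lease; the PS
                                                polls at about lease/2 */
};

union PROXY_REGISTRATION4res switch (nfsstat4 prr_status) {
case NFS4_OK:
    PROXY_REGISTRATION4resok prr_resok;
default:
    void;
};
```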
Haynes Expires 30 October 2026 [Page 6] Internet-Draft FFv2 Proxy Server April 2026 Credential-forwarding rules for a proxy that translates on behalf of a client are defined in the Security Considerations section. The proxy is a translator, not an authority: authorization decisions MUST remain with the MDS, using the client's forwarded credentials. Getting this boundary wrong turns the PS into a privilege-elevation vehicle, so the rules are stated normatively and enforced at the MDS rather than policed on the wire. Finally, the document defines recovery semantics for the three actor failures that matter during a proxy operation -- PS failure, MDS failure, and DS failure -- each with its own fencing and re- registration rules so that a mid-operation crash does not leave a file in an unrecoverable state. 3.2. Out of Scope The items below are deliberately deferred. They split into three groups: features whose absence was an explicit design decision (delta journaling, partial-range moves), orchestration that belongs to a layer above a single proxy (multi-proxy pipelines, automated load balancing), and proxy-internal behaviour that does not surface on the wire and therefore needs no standardisation. Nothing on this list is precluded by the current design; each is a reasonable future extension. *Journaling and partial moves.* Move assignments in this revision are always whole-file. The PS performs a CSM-style write to all mirrors (source D, destination G, and any other mirrors in the file's mirror set) while reading source bytes from any mirror in the source set; the two-layout state on the MDS keeps client traffic on L1 throughout, with an atomic swap to L2 at PROXY_DONE time (Section 10.2). 
Delta-journaling mechanisms -- capturing writes against an otherwise-offline source, replaying them on completion, or maintaining reference integrity across detached clones -- are a future extension, as is a partial-range move that would move a byte range while the rest of the file stays on the source. The whole-file two-layout commit covers every motivating scenario the design currently has. *Orchestration beyond a single proxy.* Multi-proxy pipelines (staged moves for very large files) and automated load balancing or predictive selection across registered proxies are out of scope. An MDS in this revision selects a single PS per operation; load distribution across many proxies, when it matters, is expected to be handled by the MDS's selection policy and does not surface as new wire protocol. Haynes Expires 30 October 2026 [Page 7] Internet-Draft FFv2 Proxy Server April 2026 *Server-side copy as an alternative path.* Integration with server- side copy ([RFC7862] Section 4) as an alternative to PS-driven moves for single-file moves within one namespace is adjacent work. The two mechanisms are complementary (server-side copy is a client-directed intra-server operation; the PS-driven move is an MDS-directed inter- server operation), and their intersection -- for example, using server-side copy under the hood of a PS move assignment -- is better specified in its own extension rather than bolted into this document. *Proxy-internal features that do not surface on the wire.* A proxy MAY implement content-integrity and error-correction layers, encryption and compression pass-through, log-structured write staging, sector- alignment normalisation, and POSIX-loopback shortcuts when proxy and client are co-resident. 
   These are useful motivating scenarios for the move/repair vocabulary
   but do not require new protocol surface beyond what the
   PROXY_PROGRESS / PROXY_DONE / PROXY_CANCEL fore-channel set already
   provides, and so they are left to implementation rather than
   standardised here.

4.  Use Cases

   Seven motivating scenarios converge on the same mechanism.  Six of
   them are transient transitions on the file itself: the file is
   changing state (being ingested, re-coded, evacuated, reconstructed,
   migrated between transport-security profiles, or migrated between
   filehandle backends), and the transition is what the PS drives to
   completion.  The seventh is qualitatively different: the file is not
   changing, but specific clients -- those that cannot participate in
   the file's native codec -- are routed through a PS persistently, for
   as long as they are active.

   The distinction matters for how the MDS schedules work.  Transient
   transitions have a terminal state; the MDS expects each one to
   complete (reported via PROXY_DONE) and then to retire the associated
   layout.  The persistent routing case has no terminal state for the
   file as a whole; the PS stays in the layouts of codec-ignorant
   clients as long as those clients are open.

   In every case, a registered PS becomes the source of truth for a
   file's data during the operation, and clients are redirected to
   route I/O through that PS rather than directly to the original
   layout's data servers.  The individual scenarios are described
   below.

4.1.  Administrative Ingest

   An administrator rsyncs a file from an external source into the
   cluster as a single-copy file.  Server policy requires the file to
   be mirrored or erasure coded.  The MDS queues a MOVE assignment for
   the file; the next PROXY_PROGRESS poll from a registered proxy whose
   codec set covers the destination layout returns the assignment in
   its response.
The proxy populates the destination from the source, while any client that opens the file during the move sees a layout that routes I/O through the proxy. The source "layout" may not even be a Flex Files layout; it could be a non-pNFS NFS mount that the proxy reads as an NFSv4.2 client. Throughout the move the proxy presents the file to pNFS clients as if the move had not started, while populating the destination in the background. 4.2. Policy-Driven Layout Transition A server-objective or policy change ("files older than 30 days must be erasure coded", "high-access-rate files must have additional mirrors") requires transforming a file's layout without user visibility. The transformation is purely a layout change; the file contents are unchanged except at the shard level. The MOVE assignment carries the new layout's geometry and coding type via the destination dstore identifier in proxy_assignment4; the proxy reshapes the file's shards to match. Because the transformation type (encode / decode / transcode) is entirely specified by the (source, destination) dstore pair plus the file's recorded layout type, the assignment does not need a separate transformation-class field. 4.3. DS Maintenance / Evacuation A data server is scheduled for maintenance (hardware replacement, software upgrade, decommission). All files whose layouts reference that DS must be evacuated to replacement DSes before it is taken offline. The MDS queues a MOVE assignment per file (source = the outgoing DS, target = a replacement); registered proxies pick the assignments up via PROXY_PROGRESS polls. Evacuation can be large- scale (thousands or millions of files); running per-client per-chunk repair over every file would be prohibitively expensive, but a single registered proxy can drive many concurrent migrations subject to the per-PS in-flight cap (Section 8). Haynes Expires 30 October 2026 [Page 9] Internet-Draft FFv2 Proxy Server April 2026 4.4. 
Whole-File Repair Multiple DSes have failed such that per-chunk repair cannot reconstruct the file in place. The MDS constructs a new layout backed by replacement DSes and queues a REPAIR assignment. The next registered-PS PROXY_PROGRESS poll receives the assignment; the proxy drives reconstruction from whatever surviving shards remain. If fewer than k shards survive across the mirror set, the proxy reports terminal status via PROXY_DONE with status=NFS4ERR_PAYLOAD_LOST, matching the per-chunk repair semantics in the Repair Client Selection section of [I-D.haynes-nfsv4-flexfiles-v2]. 4.5. TLS Coverage Transition A file whose layout currently points at non-TLS-capable DSes needs to be migrated to TLS-capable DSes, or vice versa (an inventory change, a policy change mandating transport security, onboarding a new storage class whose DSes are TLS-only). A PROXY_OP_MOVE assignment applies: the destination DSes have the required transport security profile, the source DSes are retired. A client that arrives mid- transition is routed through the proxy and does not directly see the heterogeneous DS set. The proxy establishes its own RPC connections to source and destination DSes, potentially with different transport security profiles (non-TLS to source, mutual TLS [RFC9289] to destination, or any other combination). The proxy's per-DS security is independent of the client's security to the proxy. 4.6. Filehandle / Storage-Backend Transition A DS changes the filehandles it issues for a file; this happens when the DS's underlying storage is migrated (e.g., from one backend object store to another) and the old filehandles become unresolvable on the new backend. Without a proxy, every client holding a layout has to be individually recalled and re-issued. With a proxy, the MDS points all clients at the proxy (keeping their existing stateids and FHs intact), the proxy reconciles old-to-new FHs internally, and clients are recalled only at the end. 
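   The MOVE and REPAIR work items in the scenarios above share a single
   assignment descriptor.  A hypothetical XDR sketch follows; the
   identifiers proxy_assignment4, pa_kind, pa_file_fh,
   pa_target_dstore_id, PROXY_OP_MOVE, PROXY_OP_REPAIR, and
   PROXY_OP_CANCEL_PRIOR are taken from this document, while the
   source-side field, the assigned values, and all types are
   illustrative assumptions against the normative form in Section 6.3.

```xdr
/* Hypothetical sketch of proxy_assignment4; see Section 6.3 for the
 * normative definition. */
enum proxy_op_kind4 {
    PROXY_OP_MOVE         = 1,  /* populate destination from source */
    PROXY_OP_REPAIR       = 2,  /* reconstruct from surviving shards */
    PROXY_OP_CANCEL_PRIOR = 3   /* retract an unacknowledged
                                   assignment for the same pair */
};

struct proxy_assignment4 {
    proxy_op_kind4 pa_kind;
    nfs_fh4        pa_file_fh;          /* file to operate on */
    deviceid4      pa_source_dstore_id; /* assumption: source dstore */
    deviceid4      pa_target_dstore_id; /* destination dstore; with the
                                           source and the file's
                                           recorded layout type, this
                                           fully determines encode,
                                           decode, or transcode */
};
```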
   This same mechanism covers several related situations: an
   NFSv3-to-NFSv4.2 DS protocol upgrade where the DS FHs change as a
   side effect of migrating from [RFC1813] to [RFC8881] semantics; a
   DS-side format change that invalidates existing FHs (for example, a
   transition from a local POSIX store to an object store); and
   backend-opaque FH migration where the DS's FH structure is
   internally versioned and old clients hold stale versions.

4.7.  Codec Translation for Codec-Ignorant Clients

   The coding-type registry defined in the IANA Considerations of
   [I-D.haynes-nfsv4-flexfiles-v2] is expected to grow.  Not every
   client is required to implement every registered codec; a minimal
   client, a legacy client, or an NFSv3 client typically cannot
   participate in erasure-coded files at all.  Per the
   codec-negotiation rules in [I-D.haynes-nfsv4-flexfiles-v2], such a
   client either retries with a different supported_types hint, falls
   back to MDS-terminated I/O, or (this case) is routed through a proxy
   that translates on its behalf.

   Unlike the move / repair / evacuation / transition use cases above,
   codec translation is *persistent per client*.  The file itself is
   not changing state.  What changes is the layout the MDS hands to a
   codec-ignorant client: that client gets a layout with
   FFV2_DS_FLAGS_PROXY set and a coding_type the client does support
   (typically FFV2_CODING_MIRRORED, or for NFSv3 clients just a flat
   NFSv3 data surface).  The proxy encodes and decodes on the fly
   against the real DSes; the client sees a flat file.

   The same file may be accessed directly by codec-aware clients (with
   a normal layout naming the real DSes) and through the proxy by
   codec-ignorant clients (with a proxy layout) simultaneously.  The
   MDS issues a different layout per request; FFV2_DS_FLAGS_PROXY is
   set only in the layouts that need translation.

4.7.1.  Mechanism

   A translating proxy runs two sides that meet internally.
   On its *client-facing* side it speaks the protocol the
   codec-ignorant client can speak: for an NFSv3 [RFC1813] client, that
   is an NFSv3 server that re-exports the MDS's namespace; for a legacy
   NFSv4.2 client that understands only some codecs, it is an NFSv4.2
   data-server surface presenting FFV2_CODING_MIRRORED (or an
   equivalent codec the client supports).  On its *MDS-facing* side it
   is an NFSv4.2 client to the MDS plus whatever DS protocol the MDS's
   real DSes speak.  The proxy translates each client-facing op into
   the corresponding MDS or DS op, applies the codec transformation
   between the two, and returns results.

   For an NFSv3 client, a read flows:

   *  Client: NFSv3 READ against the proxy.

   *  Proxy: if it does not hold a layout for the file, issues
      LAYOUTGET on the MDS with the client's forwarded credentials (see
      Security Considerations).

   *  Proxy: issues CHUNK_READ (or v3 READ if the DS is NFSv3) against
      the real DSes, decodes the shards back to plaintext.

   *  Proxy: returns the plaintext bytes in the NFSv3 READ reply.

   A write flows:

   *  Client: NFSv3 WRITE with stable_how and a byte range.

   *  Proxy: encodes the bytes per the file's codec, issues CHUNK_WRITE
      / CHUNK_FINALIZE / CHUNK_COMMIT against the real DSes.

   *  Proxy: returns NFSv3 WRITE ok with the stable_how it was able to
      honor (which may be downgraded based on the back-end DS's own
      stable_how behavior).

4.7.2.  Why the same PROXY_REGISTRATION machinery

   The registered-PS mechanism gives the MDS the information it needs
   for translation-proxy selection: prr_codecs enumerates the codecs
   the PS can translate between, the MDS <-> PS session carries the
   fore-channel control-plane traffic, and the lease bounds the
   relationship.  No new op is required for the translation case -- the
   existing PROXY_REGISTRATION covers it.
   PROXY_OP_MOVE and PROXY_OP_REPAIR assignments are not used for pure
   translation (the file is not moving or being repaired); the PS
   simply serves the codec-ignorant client's I/O requests against the
   unchanged source layout.

5.  Design Model

5.1.  Roles

   This document introduces a third role alongside the pNFS metadata
   server (MDS) and data server (DS):

   Proxy server (PS):  A persistent, registered peer of the MDS that
      carries out whole-file operations on the MDS's behalf -- moving
      file content between layouts, reconstructing files whose source
      layout has been damaged, and translating codecs on behalf of
      clients that cannot participate in the file's native encoding.  A
      PS is a distinct role from a DS; a given server MAY implement
      both, and typically does, but the protocol does not require that.
      The MDS sees the PS through a dedicated session whose direction
      is defined in Section 5.2.

   The existing roles are unchanged:

   Metadata server (MDS):  As defined in
      [I-D.haynes-nfsv4-flexfiles-v2]: the coordinator for each file,
      and the authority that issues layouts, manages stateids, and
      selects repair participants.

   Data server (DS):  As defined in [I-D.haynes-nfsv4-flexfiles-v2]:
      serves the CHUNK data path to pNFS clients.

   Only one of the three pairs carries new ops in this document.  The
   MDS <-> PS pair gains the new PS-to-MDS regular ops for
   registration, progress, and terminal reporting (Section 6).  No new
   callback ops are introduced; the PS pulls work assignments from the
   MDS in the PROXY_PROGRESS reply on the fore-channel, and the PS
   reports completion or cancellation back via PROXY_DONE /
   PROXY_CANCEL on the same session, so no callback program is required
   for this protocol.  The Client <-> PS pair gains no new ops: clients
   reach a PS through the normal pNFS data path, seeing it as a DS with
   FFV2_DS_FLAGS_PROXY set in the layout (Section 10).
   The MDS <-> DS pair is also unchanged; the tight-coupling control
   session in [I-D.haynes-nfsv4-flexfiles-v2] carries over as defined
   there.

5.2.  Session Between MDS and PS

   The PS opens an NFSv4.1+ session to the MDS as a normal client.
   All PS-MDS coordination flows on the fore-channel of that session:
   PROXY_REGISTRATION establishes the relationship, PROXY_PROGRESS
   heartbeats and pulls work assignments, PROXY_DONE / PROXY_CANCEL
   report terminal state.  No callback program is required for the PS
   protocol -- the session's back-channel is unused by this draft (the
   PS may still establish one for unrelated NFSv4.1 callbacks if it
   wishes, but no PS-protocol op rides on it).

   The session direction is intentionally opposite to the MDS -> DS
   tight-coupling control session in [I-D.haynes-nfsv4-flexfiles-v2]:
   that session is opened by the MDS to carry MDS-originated stateid
   management to a DS.  The MDS <-> PS session is opened by the PS
   because registration is a PS-initiated act -- the PS is saying "here
   I am, with these capabilities."  Without a PS-to-MDS direction the
   capability advertisement would have to be inferred from
   session-setup flags alone, which is inadequate for the range of
   capabilities a PS can usefully advertise (codec set, affinity token,
   and -- as the DEVICEID_REGISTRATION open question anticipates --
   fault-zone coordinates and other deployment attributes).

   A consequence of this direction choice is that a server that
   implements both the DS and PS roles toward the same MDS runs two
   sessions between the same pair of hosts: the MDS opens the DS
   tight-coupling session toward the box, and the box's PS opens the PS
   session toward the MDS.  That is two EXCHANGE_ID exchanges, two
   CREATE_SESSION exchanges, and two TCP connections.
In deployments that use RPCSEC_GSS ([RFC7861]) or RPC-over-TLS ([RFC9289]) on the PS session -- which the credential-forwarding rules in Section 16 recommend for any PS that translates on behalf of clients -- reserved-port trust is not in use and the doubled connection has no security cost. In a strict AUTH_SYS-only deployment the second outbound reserved port is a real but typically negligible cost, because a storage box's outbound NFS traffic is usually limited to one connection per MDS it is registered with. 5.3. Flow Summary The PS opens a session to the MDS and issues PROXY_REGISTRATION, declaring its supported codecs and its affinity token; the MDS records the registration and returns a registration id with a granted lease. The PS then polls the MDS via PROXY_PROGRESS at lease/2 cadence (or as the MDS's ppr_lease_remaining_sec hint directs). When the MDS decides to move or repair a file, it selects a registered PS whose capabilities match the operation and queues an assignment for that PS; the next PROXY_PROGRESS reply delivers the assignment in its ppr_assignments<> array. The PS picks the work up by issuing OPEN + LAYOUTGET on the assignment's pa_file_fh, drives the byte-shoveling phase to completion, and reports terminal status by issuing LAYOUTRETURN + PROXY_DONE in a single fore-channel compound on the same session. The MDS may at any time retract an assignment that the PS has not yet acknowledged via OPEN+LAYOUTGET by including a PROXY_OP_CANCEL_PRIOR assignment for the same (pa_file_fh, pa_target_dstore_id) pair in a subsequent PROXY_PROGRESS reply; the PS may cancel work it has already started via the fore-channel PROXY_CANCEL (Section 7.2). Clients interact with the PS through the normal layout path. During a proxy operation the MDS hands out layouts that name the PS as a DS entry with FFV2_DS_FLAGS_PROXY set; clients route CHUNK I/O to that entry. 
Clients that arrive mid-operation see the proxy layout from the start and need no additional signalling; clients that held an older (non- proxy) layout are recalled via CB_LAYOUTRECALL and reacquire. 5.4. Message Sequence: Policy-Driven Move The simplest flow -- a quiesced whole-file move for a policy transition. Shown as a wire-level message sequence between the three protocol actors; clients are elided because in the quiesced case they are recalled before the PS work starts. Haynes Expires 30 October 2026 [Page 14] Internet-Draft FFv2 Proxy Server April 2026 PS MDS | | | ---- CREATE_SESSION ----------> | (PS opens session to MDS) | <--- session est. ------------- | | | | ---- PROXY_REGISTRATION ------> | (advertise codecs, affinity) | <--- reg_id, granted_lease ---- | | | | ---- PROXY_PROGRESS ----------> | (heartbeat poll) | <--- NFS4_OK, ppr_assignments | (zero or more entries; one | includes MOVE assignment | delivers the MOVE work) | | | ---- OPEN(file_FH) -----------> | (PS picks up the work) | ---- LAYOUTGET (L3 composite) > | | | | [PS drives move: reads source | | DSes, encodes per destination | | codec, writes destination | | DSes via L3 fan-out] | | | | ---- PROXY_PROGRESS ----------> | (heartbeat; lease renewal, | <--- NFS4_OK, ppr_lease_rem.... | no new assignment needed) | | | ... | | | | ---- SEQUENCE | | PUTFH(file_FH) | | LAYOUTRETURN(L3_stid) | | PROXY_DONE(L3_stid, OK) -> | (terminal: commit L1 -> L2) | <--- NFS4_OK ------------------ | | | | | --- CB_LAYOUTRECALL ---> | | (to clients holding L1) | | | | <-- LAYOUTRETURN ------ | | (from each client) | | | | (MDS retires source DSes; | | next LAYOUTGET on this | | file returns L2) v v Figure 1: Message sequence for a policy-driven move Haynes Expires 30 October 2026 [Page 15] Internet-Draft FFv2 Proxy Server April 2026 5.5. Message Sequence: Whole-File Repair Same shape as a move, but the assignment in PROXY_PROGRESS carries pa_kind = PROXY_OP_REPAIR and the source layout is degraded. 
Terminal outcomes: * *NFS4_OK* in pd_status: the PS reconstructed the file; the MDS proceeds as in Figure 1. * *NFS4ERR_PAYLOAD_LOST* in pd_status: fewer than k shards survived across the mirror set; the MDS marks the affected byte ranges lost and rolls back to L1. No CB_LAYOUTRECALL is issued because there is no valid destination layout to issue. 5.6. Message Sequence: MDS-Initiated Cancellation The MDS may decide to retract an assignment. Two cases: 1. *Assignment not yet acknowledged by the PS.* The MDS includes a PROXY_OP_CANCEL_PRIOR assignment in the next PROXY_PROGRESS reply, naming the same (file_FH, target_dstore_id) pair as the prior MOVE / REPAIR assignment. The PS, which has not yet OPEN'd the file, simply drops the prior assignment from its in-flight queue. 2. *Assignment acknowledged and in flight.* The MDS internally aborts the migration and discards the in-flight record; the PS's eventual PROXY_DONE returns NFS4ERR_BAD_STATEID (the L3 layout stateid no longer resolves to a record), and the PS abandons the work and releases its OPEN. The MDS may also let the PS's registration lease expire as a coarser cancellation. The PS-initiated cancellation case uses the fore-channel PROXY_CANCEL (Section 7.2). Haynes Expires 30 October 2026 [Page 16] Internet-Draft FFv2 Proxy Server April 2026 PS MDS | | | [MOVE assignment delivered in | | prior PROXY_PROGRESS reply, | | not yet acknowledged] | | | | | <-- (cancel decision) | | | ---- PROXY_PROGRESS ----------> | (next heartbeat poll) | <--- NFS4_OK, ppr_assignments | (CANCEL_PRIOR for the | includes CANCEL_PRIOR | same file_FH/target) | | | [PS drops the prior assignment | | from its in-flight queue; | | no PROXY_DONE / PROXY_CANCEL | | is issued for this work] | v v Figure 2: Message sequence for MDS-initiated cancellation (assignment not yet acknowledged) 6. 
New NFSv4.2 Operations

   This document defines four new NFSv4.2 operations that a proxy server
   (PS) issues to the metadata server (MDS) on the fore-channel of the
   PS -> MDS session defined in Section 5.2.  PROXY_REGISTRATION (93) is
   issued once at session setup and on renewal.  PROXY_PROGRESS (94) is
   issued by the PS as a heartbeat-with-poll: the PS reports periodic
   and terminal progress for in-flight migrations and optionally
   requests new work; the MDS replies inline with zero or more new work
   assignments.  PROXY_DONE (99) commits or rolls back an individual
   migration when the PS finishes it; PROXY_CANCEL (100) lets the PS
   abort early.  None of these operations is sent by pNFS clients.

   NFSv4.2 callback operation numbers 17-20 (the prior version of this
   draft used them for CB_PROXY_MOVE, CB_PROXY_REPAIR, CB_PROXY_STATUS,
   and CB_PROXY_CANCEL respectively) are *reserved* by this document and
   MUST NOT be reused.  See the "Major revision (2026-04-26)"
   front-matter section and the IANA Considerations section (Section 18)
   for the rationale and the wire-level reservation record.

   /// /* New operations for the Data Mover (PS -> MDS) */
   ///
   /// OP_PROXY_REGISTRATION = 93;
   /// OP_PROXY_PROGRESS     = 94;
   /// OP_PROXY_DONE         = 99;
   /// OP_PROXY_CANCEL       = 100;

                Figure 3: Data Mover operation numbers

   Opcodes 93 and 94 continue the MDS-to-DS control-plane range that
   [I-D.haynes-nfsv4-flexfiles-v2] opens at 88 (TRUST_STATEID through
   BULK_REVOKE_STATEID at 88-90).  Opcodes 99 and 100 were chosen to
   leave 95-98 reserved by this document (the prior revision had
   assigned them to the now-retired CB_PROXY_* set; see Section 18).  If
   other in-flight NFSv4.2 extensions collide on these values during
   IANA coordination, the final assignment will be reconciled by the
   consuming RFC editor.

   The following amendment blocks extend the nfs_argop4 and nfs_resop4
   dispatch unions from [RFC7863] with the new ops.
A consumer that combines this document's extracted XDR with the
   [RFC7863] XDR applies the amendments at the unions' extension point.

   /// /* nfs_argop4 amendment block */
   ///
   /// case OP_PROXY_REGISTRATION:
   ///         PROXY_REGISTRATION4args opproxyregistration;
   /// case OP_PROXY_PROGRESS:
   ///         PROXY_PROGRESS4args opproxyprogress;
   /// case OP_PROXY_DONE:
   ///         PROXY_DONE4args opproxydone;
   /// case OP_PROXY_CANCEL:
   ///         PROXY_CANCEL4args opproxycancel;

                 Figure 4: nfs_argop4 amendment block

   /// /* nfs_resop4 amendment block */
   ///
   /// case OP_PROXY_REGISTRATION:
   ///         PROXY_REGISTRATION4res opproxyregistration;
   /// case OP_PROXY_PROGRESS:
   ///         PROXY_PROGRESS4res opproxyprogress;
   /// case OP_PROXY_DONE:
   ///         PROXY_DONE4res opproxydone;
   /// case OP_PROXY_CANCEL:
   ///         PROXY_CANCEL4res opproxycancel;

                 Figure 5: nfs_resop4 amendment block

6.1.  proxy_stateid4: A New Stateid Type

   This document introduces proxy_stateid4, a new server-issued stateid
   type used as the canonical handle for an in-flight proxy migration.
   The wire shape reuses the standard NFSv4 stateid4 from [RFC8881]
   S3.3.12; no new XDR type is added:

   /// typedef stateid4 proxy_stateid4;

                  Figure 6: proxy_stateid4 wire shape

6.1.1.  Value Space

   The proxy_stateid value space is *disjoint* from the open, lock,
   layout, and delegation stateid value spaces defined in [RFC8881].
   Disjointness is enforced by _context_, not by an in-band tag: only
   PROXY_PROGRESS, PROXY_DONE, and PROXY_CANCEL arguments carry a
   proxy_stateid4.  An implementation MUST NOT use an open, lock,
   layout, or delegation stateid lookup table to resolve a
   proxy_stateid.  Conversely, a leaked proxy_stateid presented to a
   non-PROXY operation (e.g., READ, WRITE, SETATTR, CLOSE) MUST be
   rejected with NFS4ERR_BAD_STATEID by the per-op stateid validator:
   the per-op tables are disjoint by construction, and a value allocated
   as a proxy_stateid will not match any entry in any other table.

6.1.2.
MDS Minting The MDS mints a fresh proxy_stateid each time it accepts a work assignment for delivery to a PS, and includes it in the proxy_assignment4 carried in the next PROXY_PROGRESS reply. (The current XDR for proxy_assignment4 does not yet carry a proxy_stateid field; the field is added in the same revision that defines this section. See Section 6.3.) The MDS guarantees that no two proxy_stateids in the same (server_state, boot_seq) are equal. An implementation MAY embed the MDS boot_seq in the high-order bytes of other[12] to enable cheap NFS4ERR_STALE_STATEID detection across reboots; this is informative implementation guidance, not a wire requirement. One known implementation uses the layout { uint16_t boot_seq | uint16_t reserved | uint64_t opaque } where the opaque tail is getrandom(2) output. The reserved field is zero in this revision; implementations MUST emit zero and MUST NOT reject non-zero on receipt (left as a forward-compat slot for widening boot_seq). 6.1.3. Lifetime A proxy_stateid is valid from the instant the MDS mints it until either: * The PS issues PROXY_DONE(proxy_stateid, ...) or PROXY_CANCEL(proxy_stateid) and the MDS acknowledges it. On acknowledgment the proxy_stateid is retired; subsequent references return NFS4ERR_BAD_STATEID. Haynes Expires 30 October 2026 [Page 19] Internet-Draft FFv2 Proxy Server April 2026 * The PS's registration lease expires (Section 6.2), at which point all proxy_stateids minted for that PS are abandoned. Subsequent references return NFS4ERR_BAD_STATEID (or NFS4ERR_STALE_CLIENTID if the registration itself has been purged). * The MDS reboots. Subsequent references to a proxy_stateid minted in a prior boot return NFS4ERR_STALE_STATEID. 6.1.4. Renewal Semantics PROXY_PROGRESS may carry a proxy_stateid in its arguments to renew an in-flight assignment (a future revision of the PROXY_PROGRESS args extends ppa_flags for this purpose). 
The seqid field of proxy_stateid4 follows the standard NFSv4 stateid seqid semantics in [RFC8881] S8.2.4: * The MDS bumps seqid on each issuance, including renewal acknowledgments. * The PS sends the most recent seqid it has received. * Out-of-order seqids are rejected with NFS4ERR_OLD_STATEID. 6.1.5. Authorization Possession of a proxy_stateid is *not* sufficient to drive PROXY_DONE or PROXY_CANCEL on the corresponding migration. The MDS additionally validates that the calling session's registered-PS identity matches the migration record's recorded owner (see the "Authorization" subsection of Section 7.1 for the full normative rule). Without this check, a PS that learned another PS's proxy_stateid (through a packet capture, a leaked log, or any other channel) could drive its PROXY_DONE / PROXY_CANCEL on a migration it does not own. 6.2. Operation 93: PROXY_REGISTRATION - Register as Data Mover 6.2.1. ARGUMENTS /// struct PROXY_REGISTRATION4args { /// ffv2_coding_type4 prr_codecs<>; /// opaque prr_affinity<>; /// uint32_t prr_lease; /// uint32_t prr_flags; /// }; Figure 7: XDR for PROXY_REGISTRATION4args Haynes Expires 30 October 2026 [Page 20] Internet-Draft FFv2 Proxy Server April 2026 6.2.2. RESULTS /// struct PROXY_REGISTRATION4resok { /// uint64_t prr_registration_id; /// uint32_t prr_granted_lease; /// }; /// /// union PROXY_REGISTRATION4res switch (nfsstat4 prr_status) { /// case NFS4_OK: /// PROXY_REGISTRATION4resok prr_resok4; /// default: /// void; /// }; Figure 8: XDR for PROXY_REGISTRATION4res 6.2.3. DESCRIPTION A proxy server (PS) calls PROXY_REGISTRATION on the fore-channel of its session to the MDS (Section 5.2) to declare its capabilities. The MDS records the registration and MAY select that PS for subsequent MOVE / REPAIR work assignments delivered inline in the response to PROXY_PROGRESS. The prr_codecs field lists the ffv2_coding_type4 values the PS supports. 
The PS MUST be able to encode, decode, and transcode between any pair
   of values in this list.  Because the transformation class of a
   PROXY_OP_MOVE assignment is inherent in the (source, destination)
   layout pair, this codec-set membership is all the capability
   information the MDS needs to match.  An empty list results in
   NFS4ERR_INVAL in this revision.

   The prr_affinity field is an opaque token the PS supplies for
   co-residency attestation with a client.  The MDS does not interpret
   it.  At layout-grant time the MDS compares prr_affinity against the
   requesting client's co_ownerid (Section 18.35 of [RFC8881]); a match
   indicates that the client and the PS share a host, and the MDS MAY
   use it as input to PS selection.  See Section 9 for the matching
   algorithm.  An empty prr_affinity means the PS does not claim
   co-residency with any client.

   The prr_lease field is the lease duration the PS requests in seconds.
   The MDS MAY grant a shorter one, returned in prr_granted_lease.  The
   PS MUST renew before the granted lease expires; on expiry the MDS
   drops the registration and any in-flight migration record owned by
   this PS is abandoned (committed to L1 per Section 8).

   The prr_flags field is reserved for future use.  In this revision the
   PS MUST set prr_flags to 0, and an MDS that receives a
   PROXY_REGISTRATION with any bit of prr_flags set MUST reject it with
   NFS4ERR_INVAL.  The "reject, don't ignore" rule follows the NFSv4
   extension model in [RFC8178].  Section 8 of [RFC8178] specifies that
   when a flag bit is used that is not known in the specified minor
   version, NFS4ERR_INVAL is returned; Section 4.4.3 of [RFC8178] then
   explains that this same error is how a requester determines whether
   the responder understands the bit.
Silently ignoring an unknown bit would break that discovery contract: a PS that sets a future capability bit against an MDS that pre-dates the bit could not tell whether the MDS honored the capability or simply dropped it. A future revision of this specification (or a successor document that updates it) MAY define new bit values in prr_flags, following the extension rules of Section 4.2 of [RFC8178]. A PS that understands a newly defined bit MAY set it when registering with an MDS that supports it; on NFS4ERR_INVAL the PS MAY retry with the bit cleared, treating the response as the [RFC8178] Section 4.4.3 signal that the MDS does not recognize the bit. On success, the MDS returns a prr_registration_id that identifies this registration. The PS uses it to renew the registration before the granted lease expires (by re-issuing PROXY_REGISTRATION with the same prr_registration_id) and to identify itself across reconnects (see the squat-guard text below). Registration conveys capabilities only; the PS's network endpoint is conveyed through the same deviceinfo channel as any other DS's address. When the MDS selects a PS for an operation, the layout issued to clients includes a ffv2_data_server4 entry pointing at the PS's existing deviceinfo. PROXY_REGISTRATION is issued on the fore-channel of the MDS <-> PS session. That session is opened by the PS, not by the MDS; it is distinct from the MDS -> DS tight-coupling control session defined by [I-D.haynes-nfsv4-flexfiles-v2] even when the same host acts as both DS and PS. The PS MUST present EXCHGID4_FLAG_USE_NON_PNFS on the session so that the MDS can distinguish it from a regular pNFS client. An MDS that receives PROXY_REGISTRATION on a session whose owning client did not present EXCHGID4_FLAG_USE_NON_PNFS MUST reject it with NFS4ERR_PERM. Before recording the registration, the MDS MUST validate the PS's transport-security identity against a deployment allowlist. 
Acceptable identities are the PS's RPCSEC_GSS machine principal or
   the PS's RPC-over-TLS client-certificate identity, matching the MDS
   <-> PS transport-security rule in Section 16.1.  The MDS MUST reject
   a PROXY_REGISTRATION from any identity not on the allowlist with
   NFS4ERR_PERM.  AUTH_SYS, even over a transport that is otherwise
   authenticated, is never a valid authenticator for this operation; the
   MDS MUST reject it.  Because one PS proxies one MDS, a successful
   rogue registration displaces the legitimate PS and returns
   NFS4ERR_STALE to every client holding cached filehandles against the
   previous PS.

   To guard against registration squatting, the MDS MUST refuse a new
   PROXY_REGISTRATION from an allowlisted identity while an existing
   registration from that same identity still holds a valid lease; the
   MDS returns NFS4ERR_DELAY and SHOULD log the conflict.  A renewal --
   distinguished by the PS re-presenting the same prr_registration_id it
   received on the prior registration -- is not squatting and the MDS
   MUST accept it (refreshing the granted lease).

   Registration revocation before lease expiry is not a dedicated
   operation in this revision.  An MDS that needs to revoke a PS before
   its lease expires MUST cease delivering work assignments to that PS
   in PROXY_PROGRESS replies; MUST return NFS4ERR_STALE_CLIENTID on
   subsequent PROXY_PROGRESS or PROXY_REGISTRATION-renewal from the
   revoked PS; and MUST handle the PS's in-flight migration records as
   if the lease had expired (see the lease-expiry paragraph above): the
   records are abandoned and the affected layouts revert to the
   pre-migration state.  The revoked PS, on its next PROXY_PROGRESS,
   sees NFS4ERR_STALE_CLIENTID and may either re-register (if the
   deployment policy allows) or shut down.  A future revision MAY define
   a dedicated PROXY_REVOKE operation if operational experience shows
   lease revocation through silence is insufficient.

6.3.
Operation 94: PROXY_PROGRESS - Heartbeat and Receive Work Assignments

6.3.1.  ARGUMENTS

   /// struct PROXY_PROGRESS4args {
   ///         uint32_t        ppa_flags;
   /// };

                Figure 9: XDR for PROXY_PROGRESS4args

6.3.2.  RESULTS

   /// enum proxy_op_kind4 {
   ///         PROXY_OP_MOVE         = 0,
   ///         PROXY_OP_REPAIR       = 1,
   ///         PROXY_OP_CANCEL_PRIOR = 2
   /// };
   ///
   /// struct proxy_assignment4 {
   ///         proxy_op_kind4  pa_kind;
   ///         proxy_stateid4  pa_stateid;
   ///         nfs_fh4         pa_file_fh;
   ///         uint64_t        pa_source_dstore_id;
   ///         uint64_t        pa_target_dstore_id;
   ///         opaque          pa_descriptor<>;
   /// };
   ///
   /// struct PROXY_PROGRESS4resok {
   ///         uint32_t           ppr_lease_remaining_sec;
   ///         proxy_assignment4  ppr_assignments<>;
   /// };
   ///
   /// union PROXY_PROGRESS4res switch (nfsstat4 ppr_status) {
   /// case NFS4_OK:
   ///         PROXY_PROGRESS4resok ppr_resok;
   /// default:
   ///         void;
   /// };

                Figure 10: XDR for PROXY_PROGRESS4res

6.3.3.  DESCRIPTION

   A registered proxy server calls PROXY_PROGRESS on the fore-channel of
   its session to the MDS for two purposes:

   1.  *Heartbeat*: extend the PS's registration lease.  The MDS
       responds with ppr_lease_remaining_sec so the PS can size its next
       poll interval.

   2.  *Receive work assignments*: pick up zero or more units of work
       the MDS has queued for this PS.  Each assignment is a
       proxy_assignment4 describing one migration or repair the MDS
       wants this PS to drive.

   Per [RFC8178] S4.4.3, ppa_flags is a reserved-for-future-use flag
   word; the MDS MUST reject any non-zero bit with NFS4ERR_INVAL.  The
   slot allows future revisions to add PS-side appetite signaling (e.g.,
   "do not give me more assignments right now") without an XDR break.

   The MDS returns work assignments inline in ppr_assignments<>.  A PS
   that does not want new work simply ignores the assignments past its
   in-flight cap.  Apart from the PROXY_OP_CANCEL_PRIOR mechanism
   described below, the MDS does not retract assignments once delivered.
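The PS side of this poll loop -- folding a reply's assignments into the in-flight queue, and sizing the next poll interval per the cadence rules given later in this section -- can be sketched as follows. This is an illustrative sketch only: the record shapes and helper names are hypothetical, and the backoff thresholds (k=3) stand in for the deployment-chosen K.

```python
from enum import IntEnum

class ProxyOpKind(IntEnum):          # mirrors proxy_op_kind4
    MOVE = 0
    REPAIR = 1
    CANCEL_PRIOR = 2

def apply_assignments(inflight, assignments):
    """Fold one ppr_assignments<> array into the PS's queue (hypothetical
    shapes: each assignment is a (kind, stateid_other) pair; inflight maps
    stateid_other -> kind).  MOVE / REPAIR enqueue work to be picked up via
    OPEN + LAYOUTGET; CANCEL_PRIOR drops the matching queued entry, and no
    PROXY_DONE / PROXY_CANCEL is ever issued for a rescinded assignment."""
    for kind, stateid_other in assignments:
        if kind is ProxyOpKind.CANCEL_PRIOR:
            inflight.pop(stateid_other, None)   # MDS already retired it
        else:
            inflight[stateid_other] = kind
    return inflight

def next_poll_seconds(lease_sec, empty_replies, k=3):
    """Steady-state cadence is lease/2; after k consecutive empty replies
    back off to the full lease, after 2*k to 2*lease (the two-step schedule
    and k itself are illustrative).  A non-empty reply resets
    empty_replies to zero; an MDS ppr_lease_remaining_sec hint overrides."""
    if empty_replies >= 2 * k:
        return 2 * lease_sec
    if empty_replies >= k:
        return lease_sec
    return lease_sec // 2
```

A PS main loop would call apply_assignments on each PROXY_PROGRESS reply, then sleep next_poll_seconds before the next heartbeat.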
Each assignment names a single file (pa_file_fh), the source and target dstores the migration moves data between (pa_source_dstore_id / pa_target_dstore_id), and a kind-specific opaque descriptor (pa_descriptor<>) for future extensions (for example, a precomputed source-layout descriptor so the PS can dial source DSes without a second LAYOUTGET). The pa_stateid field carries the proxy_stateid4 (Section 6.1) the MDS has minted for this migration; the PS uses it as the handle in the eventual PROXY_DONE / PROXY_CANCEL. The pa_kind discriminates the work type: * PROXY_OP_MOVE: drain or migrate the file's data between the named dstores. pa_stateid is the proxy_stateid the PS will reference in PROXY_DONE / PROXY_CANCEL. * PROXY_OP_REPAIR: reconstruct a missing or corrupt mirror on pa_target_dstore_id from the surviving mirrors. pa_stateid is the proxy_stateid the PS will reference in PROXY_DONE / PROXY_CANCEL. * PROXY_OP_CANCEL_PRIOR: the MDS rescinds an assignment it delivered in a prior PROXY_PROGRESS reply, before the PS acknowledged it via OPEN+LAYOUTGET. pa_stateid is the proxy_stateid of the assignment being rescinded; the PS MUST drop any in-progress work tagged with this proxy_stateid and MUST NOT issue PROXY_DONE / PROXY_CANCEL for it (the MDS has already cleaned up the in-flight migration record on its side and retired the proxy_stateid). For each MOVE / REPAIR assignment, the PS picks the work up by issuing a normal NFSv4 OPEN+LAYOUTGET against pa_file_fh (the L3 composite layout), shovels bytes per the kind-specific protocol, and reports terminal status via PROXY_DONE(pa_stateid, ...) (Section 7.1) or PROXY_CANCEL(pa_stateid) (Section 7.2). pa_file_fh is an nfs_fh4 minted by the MDS and presented to the PS for use against the same MDS. 
Per [RFC8881] Section 4.2.3, NFSv4 filehandles are server-private
   opaque tokens; the receiving server treats the byte string as opaque,
   validates it only by attempting the lookup, and returns NFS4ERR_STALE
   or NFS4ERR_BADHANDLE if the bytes do not resolve.  The PS MUST NOT
   inspect, mutate, or shape-check pa_file_fh; it forwards the
   filehandle verbatim in PUTFH on the same MDS that issued it, and the
   existing PUTFH semantics apply unchanged.

   The ppr_lease_remaining_sec field is the MDS's acknowledgment of this
   PROXY_PROGRESS as a registration lease renewal.  It is the number of
   seconds remaining until the PS's registration would expire absent
   further PROXY_PROGRESS.  A well-behaved PS treats it as an upper
   bound on how long it may wait before polling again; the MDS MAY
   return a smaller value than the standard NFSv4 lease period to drive
   a busy PS to poll more often or to encourage a quiet one to back off.

   The polling cadence is lease/2 in steady state, with adaptive backoff
   to the full lease and then to 2*lease after K consecutive empty
   replies, resetting on any non-empty reply.  The MDS may override the
   cadence via ppr_lease_remaining_sec.

   The MDS-initiated cancellation case (the MDS abandons an in-flight
   assignment before the PS has driven it to terminal state) is signaled
   via the PROXY_OP_CANCEL_PRIOR assignment kind described above.  There
   is no separate cancel callback; the PS-initiated cancel is handled by
   the fore-channel PROXY_CANCEL (Section 7.2).

7.  New Fore-Channel Operations: PROXY_DONE and PROXY_CANCEL

   The PS-to-MDS protocol uses two new fore-channel operations in
   addition to the extended PROXY_PROGRESS:

   *  *PROXY_DONE (op 99)*: PS reports terminal success or failure on a
      specific in-flight migration.  The MDS uses pd_status to
      atomically commit (success: swap the inode's active layout from L1
      to L2) or roll back (failure: keep L1, drop L2/G).
*  *PROXY_CANCEL (op 100)*: PS aborts a work item it was assigned but
      cannot complete (e.g., source DS becomes unreachable, PS resource
      exhaustion).  The MDS treats this as PROXY_DONE with a
      fail-equivalent status: rolls back to L1, drops L2/G, frees the
      assignment for re-assignment by a later PROXY_PROGRESS poll.

   Both operations identify the affected migration by its
   proxy_stateid4 (Section 6.1), the stateid the MDS minted when it
   delivered the corresponding proxy_assignment4; the MDS keys its
   persisted in-flight migration record on that stateid, and the
   authorization rules in Section 7.1 bind it to the owning PS.

7.1.  Operation 99: PROXY_DONE - Commit or Roll Back a Proxy Operation

7.1.1.  ARGUMENTS

   /// struct PROXY_DONE4args {
   ///         proxy_stateid4  pd_stateid;
   ///         nfsstat4        pd_status;
   /// };

                   Figure 11: PROXY_DONE arguments

7.1.2.  RESULTS

   /// struct PROXY_DONE4res {
   ///         nfsstat4 pdr_status;
   /// };

                    Figure 12: PROXY_DONE results

7.1.3.  DESCRIPTION

   PROXY_DONE signals the terminal outcome of a migration the PS was
   assigned via PROXY_PROGRESS.  pd_stateid is the proxy_stateid the MDS
   minted when it delivered the corresponding proxy_assignment4
   (Section 6.1).  pd_status == NFS4_OK directs the MDS to commit the
   migration (swap the file's active layout from the pre-migration shape
   L1 to the post-migration shape L2); any other value directs the MDS
   to roll back (keep L1, discard L2 and the PS-only composite L3).

   The PS compounds PROXY_DONE after the byte-shoveling phase completes
   (or fails):

      SEQUENCE
      PUTFH(file_FH)
      LAYOUTRETURN(L3_stateid)
      PROXY_DONE(pd_stateid, status)

   LAYOUTRETURN runs FIRST per [RFC8881] S18.51, releasing the PS's
   reference to the L3 layout cleanly via the standard mechanism.
PROXY_DONE then operates on the persisted in-flight migration record
   keyed by the proxy_stateid; the record is the single source of truth
   for migration state, so PROXY_DONE remains valid even though L3 has
   just been returned.  The PS MAY issue PROXY_DONE in a subsequent
   compound, but the single-compound shape is RECOMMENDED to keep the
   recovery window short.

7.1.3.1.  Authorization

   The MDS MUST validate, in this priority order, returning the first
   failure encountered:

   1.  The calling session belongs to a registered PS (i.e., the
       session's owning client has nc_is_registered_ps == true).
       Otherwise: NFS4ERR_PERM.

   2.  pd_stateid.other was minted in the current (server_state,
       boot_seq) tuple.  A proxy_stateid minted in a prior boot returns
       NFS4ERR_STALE_STATEID.

   3.  A migration record exists in this boot whose recorded
       proxy_stateid.other matches pd_stateid.other.  Otherwise:
       NFS4ERR_BAD_STATEID.

   4.  The migration record's recorded *registered-PS identity* matches
       the calling session's registered-PS identity.  The identity
       captured at PROXY_REGISTRATION time -- the prr_registration_id if
       nonzero, or the matched GSS principal / mTLS fingerprint
       otherwise -- is the authorization principal, *not* the
       per-EXCHANGE_ID clientid4.  This makes PROXY_DONE / PROXY_CANCEL
       tolerant of PS reconnect: a PS that drops its session and
       reconnects with a fresh EXCHANGE_ID but the same
       prr_registration_id retains authority over its in-flight
       migrations.  Mismatch returns NFS4ERR_PERM.

   5.  The current filehandle (set by the preceding PUTFH) matches the
       migration record's recorded file_FH.  Otherwise:
       NFS4ERR_BAD_STATEID.

   6.  pd_stateid.seqid matches the most recently issued seqid for this
       proxy_stateid (per [RFC8881] S8.2.4 stateid sequence semantics).
       Otherwise: NFS4ERR_OLD_STATEID.
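The priority order above can be condensed into a single fall-through check. The sketch below is illustrative only: the session, stateid, and record shapes are hypothetical decompositions, while the numeric status values are the standard NFSv4 ones.

```python
from dataclasses import dataclass

# Standard NFSv4 status values ([RFC8881]).
NFS4_OK = 0
NFS4ERR_PERM = 1
NFS4ERR_STALE_STATEID = 10023
NFS4ERR_OLD_STATEID = 10024
NFS4ERR_BAD_STATEID = 10025

@dataclass
class PsSession:            # hypothetical session shape
    is_registered_ps: bool
    ps_identity: str        # registration id / GSS principal / mTLS fingerprint

@dataclass
class ProxyStateid:         # hypothetical decomposition of a proxy_stateid4
    boot_seq: int
    other: bytes
    seqid: int

@dataclass
class MigrationRecord:      # hypothetical in-flight migration record
    ps_identity: str
    file_fh: bytes
    seqid: int

def validate_proxy_done(session, stateid, current_fh, mds_boot_seq, records):
    """Return the first failing status in the priority order above,
    or NFS4_OK when every check passes."""
    if not session.is_registered_ps:              # 1. registered PS?
        return NFS4ERR_PERM
    if stateid.boot_seq != mds_boot_seq:          # 2. minted this boot?
        return NFS4ERR_STALE_STATEID
    rec = records.get(stateid.other)              # 3. record exists?
    if rec is None:
        return NFS4ERR_BAD_STATEID
    if rec.ps_identity != session.ps_identity:    # 4. owning PS matches?
        return NFS4ERR_PERM
    if rec.file_fh != current_fh:                 # 5. current FH matches?
        return NFS4ERR_BAD_STATEID
    if stateid.seqid != rec.seqid:                # 6. most recent seqid?
        return NFS4ERR_OLD_STATEID
    return NFS4_OK
```

The same function serves PROXY_CANCEL with pc_stateid substituted for pd_stateid.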
If all validations succeed, the MDS atomically:

   *  For pd_status == NFS4_OK: applies the migration's recorded
      per-instance deltas to the inode's active layout
      (i_layout_segments), removing DRAINING slots, promoting INCOMING
      slots to STABLE, drops the L3 PS-only composite, issues
      CB_LAYOUTRECALL on the prior layout to external clients still
      holding cached L1 references, defers final removal of
      decommissioned mirrors until all L1 holders return their layouts.
      See "Layout Shape During a Proxy Operation" (Section 10) for the
      per-instance delta machinery (informative).

   *  For pd_status != NFS4_OK: discards the migration's recorded
      deltas without touching i_layout_segments.  No CB_LAYOUTRECALL is
      needed (external clients never saw the post-image).  The PS owns
      cleanup of any half-written data it placed on INCOMING DSes.

   In both cases the migration record is unhashed and freed; the
   proxy_stateid is retired.  Atomicity is critical: external client
   traffic must transition cleanly across this op; either the
   per-instance deltas commit fully or they do not commit at all.

7.2.  Operation 100: PROXY_CANCEL - Abort a Proxy Operation

7.2.1.  ARGUMENTS

   /// struct PROXY_CANCEL4args {
   ///         proxy_stateid4 pc_stateid;
   /// };

                  Figure 13: PROXY_CANCEL arguments

7.2.2.  RESULTS

   /// struct PROXY_CANCEL4res {
   ///         nfsstat4 pcr_status;
   /// };

                   Figure 14: PROXY_CANCEL results

7.2.3.  DESCRIPTION

   PROXY_CANCEL discards an assigned-but-unfinished migration.  The PS
   uses it when it knows it cannot complete the assignment (the PS is
   being shut down gracefully, the source DS is unreachable, the
   destination DS rejected the writes, etc.) and wants to release the
   work item back to the MDS without computing a specific failure
   status.  pc_stateid is the proxy_stateid the MDS minted when it
   delivered the corresponding proxy_assignment4.
Compound shape:

      SEQUENCE
      PUTFH(file_FH)
      LAYOUTRETURN(L3_stateid)
      PROXY_CANCEL(pc_stateid)

   LAYOUTRETURN runs first (standard [RFC8881] S18.51 release of the L3
   layout); PROXY_CANCEL then operates on the persisted in-flight
   migration record only.

7.2.3.1.  Authorization

   The same priority-ordered validation as PROXY_DONE (Section 7.1)
   applies, with pc_stateid substituted for pd_stateid.  In particular,
   the migration record's recorded registered-PS identity MUST match
   the caller's, or the MDS returns NFS4ERR_PERM; a PS cannot cancel
   another PS's migration.

7.2.3.2.  Side effects

   If validation succeeds, the MDS discards the migration's recorded
   deltas, unhashes and frees the migration record, retires the
   proxy_stateid, and (informatively) updates its operator-facing
   telemetry to record the cancellation.  No CB_LAYOUTRECALL is needed.
   Side effects on i_layout_segments mirror PROXY_DONE with pd_status
   != NFS4_OK.

   The distinction between PROXY_DONE(FAIL) and PROXY_CANCEL is purely
   intent / accounting: PROXY_DONE(FAIL) records that the PS attempted
   the migration and ran into a recoverable error; PROXY_CANCEL records
   that the PS abandoned the assignment without attempting it (or while
   attempting, decided not to report a specific failure cause).  An MDS
   implementation MAY surface the distinction in operator telemetry but
   MUST NOT make any behavioral distinction on the wire.

8.  Multi-PS Assignment Fan-out

   [[REVISED 2026-04-26 -- new normative section.]]

   When multiple PSes are registered against the same MDS, the MDS
   coordinates assignment fan-out per the following rules:

   1.  *At most one in-flight migration per (file_FH, target_dstore)
       pair at any time.*  This is a HARD invariant.  The MDS keys its
       in-flight migrations registry on this pair and MUST NOT assign a
       migration whose key matches a still-in-flight entry.

   2.
*PS selection: round-robin within the registered set.* The MDS maintains an ordered list of currently-registered PSes (insertion order). For each new assignment, walk the list starting after the last PS assigned; pick the first PS whose in-flight migration count is below a per-PS cap (default 8, configurable). If no PS qualifies, the assignment stays in the queue; the next PS poll retries the walk. 3. *No PS affinity.* The MDS does not implement "this file always goes to this PS"; round-robin is the default. An implementation MAY add affinity (e.g., by client-IP locality) without breaking conformance. 4. *Cross-PS reassignment on lease expiry.* If a PS holding in-flight migrations loses its session (lease expiry, squat from a competing PS), the MDS: * Internally treats the abandoned migrations as PROXY_DONE(FAIL): commits L1, drops L2/G half-fills. Haynes Expires 30 October 2026 [Page 30] Internet-Draft FFv2 Proxy Server April 2026 * Re-queues each abandoned migration as a fresh assignment. * On reassignment, the per-(file_FH, target_dstore) invariant catches any duplicate (the new PS picks a fresh target if the old target was contaminated; the autopilot decides). 5. *PS identity continuity across reconnects.* A PS that reconnects with the same client_owner4 (recovers the same clientid via EXCHANGE_ID) retains ownership of its in-flight migrations. No new assignment is necessary; the PS reclaims its layouts via the recovery path in Section 14. The per-PS in-flight cap exists to prevent one PS from monopolizing the queue while others sit idle. It also bounds the PROXY_PROGRESS reply size for the assignment-fanout amplification case (e.g., the MDS has thousands of files to migrate; without a cap, the first PS poll receives all of them). Implementations SHOULD include a TSan-style invariant test that confirms, under concurrent PROXY_PROGRESS polls from N PSes, the MDS never assigns two records with the same (file_FH, target_dstore) key. 9.
Affinity Matching A PS MAY populate prr_affinity at registration time with a token that lets the MDS recognize co-residency with a client. The canonical use case: a client host is running a local PS, potentially in a container on the same physical machine as the client. Network-level checks alone are not reliable -- the PS may be on a container IP, a separate netns, or behind a NAT -- so an explicit token is required. The matching algorithm has three moving parts. First, clients opt in by embedding an affinity token in the co_ownerid they present at EXCHANGE_ID; clients that do not wish to participate send a normal co_ownerid with no embedded token. Second, when the MDS processes a LAYOUTGET on a file that has an active proxy MOVE or REPAIR migration, the MDS iterates over registered PSes and compares each PS's prr_affinity against the requesting client's co_ownerid; a match (equality, substring, or hash equivalence -- the match predicate is implementation-defined but MUST be deterministic for a given MDS instance) indicates co-residency. Third, when a match exists the MDS MAY use it as input to PS selection, preferring, when multiple PSes are eligible, the one that matches the most clients' affinity tokens. Haynes Expires 30 October 2026 [Page 31] Internet-Draft FFv2 Proxy Server April 2026 The layout returned names the selected PS first in ffs_data_servers. For mirrored coding types this makes the local PS the client's default read target (matching the FFv1 affinity pattern). For erasure-coded types it makes the local PS the preferred encode endpoint. Affinity is advisory. The MDS MUST NOT grant any authority based solely on affinity; the normal authentication model still applies, and a client that claims an affinity token it has no right to gains at most a sub-optimal layout, not unauthorized access. Note: there is no proposal to stamp the affinity token into the filehandle. 
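The three moving parts of the matching algorithm above can be sketched as follows. The token formats, digest choice, and function names are illustrative assumptions; the draft requires only that the predicate be deterministic for a given MDS instance.

```python
# Illustrative affinity-match predicate covering the three forms the text
# names (equality, substring, hash equivalence). Not normative.
import hashlib

def affinity_match(prr_affinity: bytes, co_ownerid: bytes) -> bool:
    """True when the PS token and a client's co_ownerid indicate co-residency."""
    if prr_affinity == co_ownerid or prr_affinity in co_ownerid:
        return True
    # hash equivalence: the client embedded a digest of the token instead
    return hashlib.sha256(prr_affinity).hexdigest().encode() in co_ownerid

def prefer_ps(registered_tokens, client_ownerids):
    """Among eligible PSes, prefer the one matching the most clients' tokens."""
    return max(registered_tokens,
               key=lambda t: sum(affinity_match(t, c) for c in client_ownerids))
```

Since the predicate is advisory input to PS selection only, a false positive costs at most a sub-optimal layout, consistent with the security posture described above.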
The MDS has already performed the match before the layout reaches the client, and the layout's deviceinfo carries all the information a client-side shortcut needs. 10. Layout Shape During a Proxy Operation The layout the MDS hands out to clients while a proxy operation is active is the mechanism's sole client-facing surface. Everything else in this document -- the session, the ops, the credential-forwarding rules -- is between the MDS and the PS. The layout shape is therefore what a client implementer needs to read to know how its code interacts with a proxied file. When a proxy MOVE or REPAIR migration is active for a file, the layout the MDS hands out to clients contains a *proxy DS entry* at the head of ffs_data_servers (or otherwise flagged for routing; see the flag below). This entry names the selected PS. Source and destination DS entries MAY also appear in the layout at the MDS's discretion, but in the simplest case the PS is the only visible DS, and the source and destination mirror sets are internal to the PS. The PROXY flag is a new bit on ffv2_ds_flags4: /// const FFV2_DS_FLAGS_PROXY = 0x00000040; Figure 15: FFV2_DS_FLAGS_PROXY As per [I-D.haynes-nfsv4-flexfiles-v2]'s Flag-Word Allocation rule, IANA does not maintain a registry for ffv2_ds_flags4; the bit is governed by this document. When FFV2_DS_FLAGS_PROXY is set on any data server entry in a layout, clients MUST direct all CHUNK I/O for this file to that DS rather than to any other data server in the layout. The PS internally dispatches reads and writes to the source and destination DSes. Haynes Expires 30 October 2026 [Page 32] Internet-Draft FFv2 Proxy Server April 2026 10.1. Single-Layout Model This design uses a single layout with a PROXY-flagged entry rather than two linked layouts. Three considerations drive that choice. pNFS clients already handle single layouts cleanly, so no new layout-linkage mechanism needs to be invented or implemented on the client.
The client's view of the file -- "the file's DS is the PS" -- is the truth during the operation, so exposing the source and destination DSes directly would invite confusion rather than clarity about which entry to address. And late-arriving clients see the proxy layout from the start, without any separate setup path to join an operation already in progress. The alternative -- a source layout plus a destination layout linked by a redirector record -- was considered and rejected on those three grounds. 10.2. Two-Layout State on the MDS Side [[REVISED 2026-04-26 -- new normative subsection. This specifies the MDS's internal layout state during a migration. External clients see only the active layout (L1 before the PROXY_DONE swap, L2 after); the PS sees a composite (L3) for the duration of the migration.]] For each file F whose mirror on a draining dstore D is being migrated, the MDS persists three logical layout records: * *L1* -- the *active* layout for external clients. Mirror set includes D unchanged. All external traffic (existing cached layouts, fresh LAYOUTGETs) sees L1. * *L2* -- the *candidate* post-migration layout. Mirror set has D replaced by the target G. Not visible to external clients during the migration window. * *L3* -- the *composite* layout served only to the registered PS that owns the migration. Two mirror entries: - M1 (read source): the current L1 mirror set. PS reads source bytes from any mirror in M1. - M2 (write target): L1's mirror set PLUS G. PS writes via CSM to every mirror in M2. The presence of D in M2 (alongside G) is intentional and load-bearing: PS-issued CSM writes land on D (keeping D in sync with concurrent external client writes, which also CSM to D via L1) and on G (filling the destination as the PS catches up). Both D and G converge to the same byte image. Haynes Expires 30 October 2026 [Page 33] Internet-Draft FFv2 Proxy Server April 2026 10.2.1.
Pinned definitions * L1.mirrors = the file's pre-migration mirror set, includes D * L2.mirrors = (L1.mirrors \ {D}) union {G} * L3.M1 = L1.mirrors (PS's read-source set) * L3.M2 = L1.mirrors union {G} (PS's CSM write-target set) 10.2.2. Atomic commit on PROXY_DONE When the PS issues PROXY_DONE(layout_stid, status=NFS4_OK), the MDS atomically (in one persisted transaction): 1. Flips the active layout to L2 (D dropped, G promoted) 2. Drops L1 from the inode's layout records 3. Drops L3 (PS no longer needs the migration view) 4. Issues CB_LAYOUTRECALL to external clients holding cached L1 5. Defers REMOVE_MIRROR(D) until all L1 holders return When PROXY_DONE indicates failure (or PROXY_CANCEL is issued): 1. Keeps L1 active, unchanged 2. Drops L2; the half-filled G instance is internally unlinked 3. Drops L3 4. No CB_LAYOUTRECALL needed (external clients never saw L2) 10.2.3. Late writes during the swap window The deferred REMOVE_MIRROR(D) (step 5 of the success path) matters because clients with cached L1 layouts may have writes in flight that land on D after the swap. Without the deferral, late writes would arrive at a destroyed D instance and the client's CSM would see NFS3ERR_STALE. Late writes that arrive at D between the swap and the deferred REMOVE are harmless: the bytes land on D's instance which is about to be unlinked; the bytes are dropped with the unlink. The client sees the write succeed (CSM to D succeeded) and the data is present on the surviving mirrors {E, G}. No data loss or client-visible error. Haynes Expires 30 October 2026 [Page 34] Internet-Draft FFv2 Proxy Server April 2026 Late writes that arrive after D's instance has been unlinked (rare: client held L1 cache past CB_LAYOUTRECALL) return NFS3ERR_STALE on the CSM to D. The client retries CSM with its cached layout; on next LAYOUTGET it gets L2 and writes only to {E, G} thereafter. This is a recoverable transient error, not data loss. 10.2.4. 
No new claim type The PS uses normal OPEN(CLAIM_NULL) to open the file. The MDS recognizes the registered-PS session (nc_is_registered_ps == true) and serves the L3 composite layout instead of a normal L1 RW grant. The L3 layout stateid is a normal NFSv4 layout stateid; the PS uses it for CHUNK / WRITE / READ I/O against the source and target DSes in the standard way. It is *distinct* from the per-migration handle, which is proxy_stateid4 (Section 6.1); the MDS keys its persisted in-flight migration record on the proxy_stateid, not on the layout stateid. Separating the two handles -- one for I/O on a specific layout, one for the migration as a whole -- keeps the migration record's lifetime independent of any single LAYOUTGET / LAYOUTRETURN cycle the PS may perform during the byte-shoveling phase. 10.2.5. Drain interaction The DRAINING state on dstore D is observable to external clients only through the absence of new instance allocations on D (via the runway-pop filter). Before the in-flight migration record becomes visible to the LAYOUTGET path, the MDS issues CB_LAYOUTRECALL on every layout outstanding for the file whose composition includes D. Once those layouts have been returned (or administratively revoked when a client's CB back-channel fails to ack within the recall window), the migration record is published and subsequent LAYOUTGETs return the post-image view (L2 from external clients' perspective; L3 from the PS's). This omit-and-replace ordering guarantees that no client write hits D after the migration has started. The alternative -- keep-and-shadow, in which the layout view continues to include D and the PS shadows client writes from D to G as they happen -- requires the PS to expose itself as a flex-files data server (an INTERPOSED instance taking the place of D in the visible layout, with the PS funneling writes to both D and G).
This shape is defined in the per-instance delta model below (informative) but is not exercised by the wire ops in this revision. Haynes Expires 30 October 2026 [Page 35] Internet-Draft FFv2 Proxy Server April 2026 10.2.6. Per-instance migration deltas (informative) The L1/L2/L3 framing above describes one valid implementation approach -- whole-layout swap -- that captures the simplest case (single mirror replacement under a Client Side Mirroring codec). An MDS implementation that supports more general migrations (e.g., a single shard add to an erasure-coded file, or a partial mirror-set rotation under FFv2 RS) MAY record migration state as *per-instance deltas* on the file's existing layout records, rather than as a complete L2/L3 pair. In this informative model, each migration record carries an array of per-instance deltas, each delta describing a transformation on one position within one segment of i_layout_segments. Four instance states are useful: * STABLE -- unchanged; client writes go here directly. * DRAINING -- a slot being decommissioned; under omit-and-replace, the LAYOUTGET view-build path omits this slot and replaces it with the matching INCOMING. * INCOMING -- a new slot the PS is filling; under omit-and-replace, the LAYOUTGET view-build path emits this slot in place of the matching DRAINING. * INTERPOSED -- a slot whose visible endpoint is the PS, with the PS internally fanning writes out to one or more target DSes. Used by keep-and-shadow (forward-compat; not produced by the wire ops in this revision). The current published layout (i_layout_segments) is built *through* the deltas: when LAYOUTGET runs while a migration is active, the layout-build path consults the migration record and emits the during-migration view by applying the deltas to the base segments. i_layout_segments itself is never mutated until PROXY_DONE(NFS4_OK) collapses the deltas into the base records permanently.
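The omit-and-replace view build described above can be sketched in a few lines. The delta representation here is an illustrative assumption, not a wire type from the draft.

```python
# Illustrative sketch: build the during-migration LAYOUTGET view through
# the migration record's deltas, leaving the base segments untouched.

STABLE, DRAINING, INCOMING, INTERPOSED = range(4)

def build_view(base_segments, deltas):
    """Emit the during-migration view without mutating the base.

    base_segments: the file's i_layout_segments (a list of instance ids)
    deltas: {instance_id: (state, replacement_id)} from the migration record
    """
    view = []
    for inst in base_segments:
        state, replacement = deltas.get(inst, (STABLE, None))
        if state == DRAINING:
            view.append(replacement)   # omit DRAINING slot, emit its INCOMING
        else:
            view.append(inst)          # STABLE slots pass through unchanged
    return view
```

With base mirrors {D, E} and a single delta marking D as DRAINING with incoming G, the emitted view is {G, E} while the base records remain {D, E} until PROXY_DONE(NFS4_OK) collapses the delta.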
This per-instance model and the L1/L2/L3 swap model agree on the wire-visible behavior in the simplest case (single mirror replacement, omit-and-replace). The wire ops in this draft do not require either implementation; an MDS chooses whichever matches its layout-record machinery. Haynes Expires 30 October 2026 [Page 36] Internet-Draft FFv2 Proxy Server April 2026 The per-instance delta model is, however, one known implementation strategy: it has been used to track the four record-builder invariants and to drive a lease-aware reaper for the migration record / proxy_stateid tables across the lifecycle described above. 11. Client Behavior A client that observes a layout with FFV2_DS_FLAGS_PROXY routes all CHUNK I/O to the PROXY-flagged DS entry and does not issue I/O directly to any non-PROXY DS in that layout. Non-PROXY DSes MAY appear in the layout for informational reasons but MUST NOT be addressed by the client. The client uses its existing layout stateid against the PROXY-flagged DS entry; the PS accepts CHUNK ops under that stateid because the MDS has registered the stateid via TRUST_STATEID on the PS, per the tight-coupling semantics in [I-D.haynes-nfsv4-flexfiles-v2]. The client handles PS-side errors (NFS4ERR_DELAY, connection loss, NFS4ERR_BAD_STATEID) exactly as it would any other DS error: report LAYOUTERROR to the MDS and expect either a new layout or a PS reassignment in return. 11.1. When the Layout Is Recalled If the MDS recalls the layout mid-operation (the PS failed and is being replaced, or the operation completed and normal DS layouts are being reissued), the client LAYOUTRETURNs as usual and reacquires via LAYOUTGET. The new layout may name a different PS, a different mirror set, or no PROXY-flagged entry if the operation has completed. 11.2.
In-Flight I/O When the PS Changes When the MDS recalls the layout, I/O already in flight MAY complete at the old PS; the results remain valid under the old PS's authority. New I/O issued after LAYOUTRETURN MUST go through the replacement PS (or, if the new layout has no PROXY-flagged entry, directly to the DSes named there). 12. State Machine A file's participation in a proxy operation passes through five states: READY (no operation in flight), ASSIGNED (the MDS has queued an assignment for a PS but the PS has not acknowledged it via OPEN+LAYOUTGET), PROXY_ACTIVE (the PS is driving a move or repair), COMMITTING (the PS has issued PROXY_DONE(OK) and the MDS is recalling the old layout from external clients), and DONE (clients are on the Haynes Expires 30 October 2026 [Page 37] Internet-Draft FFv2 Proxy Server April 2026 post-move layout, source DSes retired). The state is MDS-local: clients never observe these state names directly, but a client's behavior is shaped by which layout the MDS is currently handing out. A given file spends most of its lifetime in READY; a proxy operation is a relatively short excursion through the other four states, after which the file returns to READY with a new layout in place (or, on cancellation or failure, with the old layout preserved). The diagram below shows the happy-path progression; the table that follows enumerates every state transition, including the unhappy ones (cancellation, PS failure without replacement).
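Reduced to a lookup table, the lifecycle looks like the sketch below. The event names are illustrative; the states follow the text, and the transitions follow the transition table later in this section.

```python
# Illustrative sketch of the five-state lifecycle; MDS-internal only,
# shown here to make the happy and unhappy paths concrete.

TRANSITIONS = {
    ("READY",        "assign"):         "ASSIGNED",      # MDS queues work
    ("ASSIGNED",     "open_layoutget"): "PROXY_ACTIVE",  # PS picks it up
    ("PROXY_ACTIVE", "proxy_done_ok"):  "COMMITTING",    # PROXY_DONE(OK)
    ("COMMITTING",   "all_returned"):   "DONE",          # recalls complete
    ("ASSIGNED",     "mds_cancel"):     "READY",         # PROXY_OP_CANCEL_PRIOR
    ("PROXY_ACTIVE", "fail_or_cancel"): "READY",         # CANCEL / DONE(FAIL)
}

def step(state, event):
    return TRANSITIONS[(state, event)]
```

Absent transitions (e.g., PROXY_DONE while still ASSIGNED) raise an error in this sketch, mirroring the fact that the table enumerates the only legal moves.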
(admin, policy, repair, or maintenance trigger) | v +------------+ | READY | | source | | layout | | only | +-----+------+ | | MDS selects registered PS; | queues a proxy_assignment4 | for delivery in next | PROXY_PROGRESS reply v +--------------+ | ASSIGNED | | MDS has the | | in-flight | | record; PS | | has not yet | | OPEN'd file | +-----+--------+ | | PS picks up the assignment: | OPEN(file_FH) + LAYOUTGET | (L3 composite layout) v +--------------+ | PROXY_ACTIVE | | clients see | | layout with | | PS DS at | | head of | | ffs_data_ | | servers; | Haynes Expires 30 October 2026 [Page 38] Internet-Draft FFv2 Proxy Server April 2026 | PS drives | | source->dest | +-----+--------+ | | PS issues SEQUENCE | PUTFH LAYOUTRETURN | PROXY_DONE(stid, OK) v +------------+ | COMMITTING | | MDS issues | | CB_LAYOUT- | | RECALL for | | old layout | +-----+------+ | | all clients have | LAYOUTRETURNed v +------------+ | DONE | | new layout | | live; | | source | | DSes | | retired | +------------+ 12.1. 
Transitions +============+============+=====================+===================+ |From |To |Trigger |Actions | +============+============+=====================+===================+ |READY |ASSIGNED |MDS decides to move |MDS queues a | | | |or repair |proxy_assignment4 | | | | |(kind=MOVE or | | | | |REPAIR) for | | | | |delivery in the | | | | |next PROXY_PROGRESS| | | | |reply to the | | | | |selected PS; | | | | |creates the in- | | | | |flight migration | | | | |record | +------------+------------+---------------------+-------------------+ |ASSIGNED |PROXY_ACTIVE|PS picks up the |PS issues OPEN + | | | |assignment |LAYOUTGET against | | | | |pa_file_fh; MDS | Haynes Expires 30 October 2026 [Page 39] Internet-Draft FFv2 Proxy Server April 2026 | | | |starts handing out | | | | |the L3 composite | | | | |layout with | | | | |FFV2_DS_FLAGS_PROXY| | | | |set on the PS entry| +------------+------------+---------------------+-------------------+ |PROXY_ACTIVE|COMMITTING |PS issues PROXY_DONE |MDS begins | | | |with |CB_LAYOUTRECALL | | | |pd_status=NFS4_OK |fan-out to clients | | | | |still on the old | | | | |layout | +------------+------------+---------------------+-------------------+ |COMMITTING |DONE |All clients have |MDS issues post- | | | |LAYOUTRETURNed |move layouts (L2); | | | | |source DSes retired| +------------+------------+---------------------+-------------------+ |ASSIGNED |READY |MDS-initiated |MDS drops the in- | | | |cancellation: MDS |flight record; PS | | | |includes a |drops the | | | |PROXY_OP_CANCEL_PRIOR|assignment from its| | | |assignment in the |in-flight queue | | | |next PROXY_PROGRESS | | | | |reply | | +------------+------------+---------------------+-------------------+ |PROXY_ACTIVE|READY |PS failed and no |MDS reverts layouts| | | |replacement |to pre-move source | | | |available; or PS- |set (L1) | | | |initiated | | | | |cancellation via | | | | |PROXY_CANCEL; or | | | | |PROXY_DONE with | | | | |pd_status != NFS4_OK | | 
+------------+------------+---------------------+-------------------+ Table 1 13. PS Failure and Recovery 13.1. PS Crash During PROXY_ACTIVE When a PS crashes mid-operation, client I/O routed through it receives NFS4ERR_DELAY (if the PS is reachable but unhealthy) or connection errors (if unreachable), and the affected clients report LAYOUTERROR to the MDS. The MDS MAY select a replacement PS from the registered pool and queue a fresh proxy_assignment4 (kind MOVE or REPAIR) for that PS in its next PROXY_PROGRESS reply, with the source layout updated to reflect current reality -- destination DSes that the failed PS populated are now part of the source set -- and the Haynes Expires 30 October 2026 [Page 40] Internet-Draft FFv2 Proxy Server April 2026 destination layout unchanged; the replacement PS resumes from wherever the failed PS left off. The MDS then issues CB_LAYOUTRECALL on the old layout and the replacement PS's layout becomes live for new LAYOUTGETs. If the MDS cannot find a replacement within a policy timeout, it MUST cancel the operation: revert to the pre-move source layout, do not issue a destination layout, and mark the destination DSes for cleanup or retry. 13.2. Cascading PS Failure Repeated PS failures on the same operation SHOULD trigger escalation to deployment management rather than recursive retry. Recurring failures likely indicate an environmental issue the PS cannot work around. 13.3. Source DS Crash During PROXY_ACTIVE Reduces the PS's read parallelism but does not block forward progress as long as the erasure code can still reconstruct from surviving source DSes. If the source degrades past reconstructibility, the operation transitions to whole-file repair semantics automatically: partial reconstruction succeeds; ranges that cannot be reconstructed terminate the operation with NFS4ERR_PAYLOAD_LOST. 13.4. Destination DS Crash During PROXY_ACTIVE Treated as a normal DS failure on the destination side. 
The PS acts like a client to the destination DSes: LAYOUTERROR to the MDS, which MAY substitute a spare or mark the destination FFV2_DS_FLAGS_REPAIR. The PS continues pushing to the remaining destinations. Clients are unaffected. 14. MDS Crash Recovery Clients and the PS detect MDS session loss and enter RECLAIM per [RFC8881] S8.4 / S10.2.1. The MDS persists three independent things across restart: 1. *Client identity* for every NFSv4 client, including the PS. This is the standard NFSv4 client-state persistence: the PS's client_owner4 and assigned clientid4 survive MDS reboot. 2. *The nc_is_registered_ps attribute* on the PS's persisted client record. After reboot the MDS knows the PS was previously registered without re-issuing PROXY_REGISTRATION. Haynes Expires 30 October 2026 [Page 41] Internet-Draft FFv2 Proxy Server April 2026 3. *Per-file in-flight migration records*: {file_FH, source_dstore, target_dstore, owning_PS_clientid, started_at}. Lives alongside (1) and (2). Keyed on clientid (stable across PS-process restarts that preserve client_owner4), NOT on the layout stateid (which is volatile per-boot). 14.1. PS Recovery Sequence After detecting session loss, the PS: 1. *Re-establishes the session* with EXCHANGE_ID using its prior client_owner4 (recovers the same clientid4), followed by CREATE_SESSION. 2. *Issues RECLAIM_COMPLETE* to enter the reclaim phase. PROXY_REGISTRATION is NOT re-issued -- the persisted nc_is_registered_ps flag is sufficient. 3. *Per-file recovery* for each in-flight migration the PS was working on uses the standard NFSv4 reclaim path with one PS-side persistence requirement: * *PS implementations MUST persist each in-flight migration's layout stateid in PS-local stable storage* when the MDS grants the L3 layout. Lives in PS-side state (e.g., a small sidecar file or DB table keyed by file FH). 
This is the only PS-side persistence the data mover requires beyond what a normal NFSv4 client persists; without it the PS cannot reclaim its layouts after a PS-process restart and the migration is abandoned. * OPEN_RECLAIM(CLAIM_PREVIOUS, file_FH) per [RFC8881] S9.11.1. The MDS validates that the prior clientid4 had an OPEN on this file (which it did -- the OPEN was created when the PS picked up the assignment from a PROXY_PROGRESS reply). The MDS re-grants the OPEN. * LAYOUTGET(reclaim=true) per [RFC8881] S18.43.3, supplying the previously-persisted layout stateid as the reclaim key. The MDS validates that: - The session's client has nc_is_registered_ps == true - A persisted in-flight migration record exists for this (clientid, file_FH, layout_stateid) triple Haynes Expires 30 October 2026 [Page 42] Internet-Draft FFv2 Proxy Server April 2026 - The reclaim falls within the server grace window and re-grants the L3 composite layout with a fresh stateid (the stateid the PS supplied is consumed; a new one is issued for the resumed migration session, which the PS MUST persist again per the rule above). This is the standard [RFC8881] layout reclaim path; no new claim type, no side-channel grant signal. The data-mover-specific contribution is the nc_is_registered_ps session attribute and the persisted in-flight migration record on the MDS side. 4. *Continues the migration* from wherever it left off. Bytes already on G are preserved (the DS holds them); the PS reads the remaining source bytes from D and writes them to G. 5. *Eventually issues PROXY_DONE* with success or failure; the L1->L2 commit (or rollback) completes. 14.2. PS Identity Continuity A PS implementation SHOULD persist its client_owner4 to survive PS-process restart so that post-restart EXCHANGE_ID recovers the same clientid and the persisted in-flight migration records remain valid. If the PS's client_owner4 rotates (e.g., because PS process state was lost), the new EXCHANGE_ID gets a fresh clientid4.
The PS's prr_registration_id (if matching the prior registration) identifies it as the "same operator-meaningful PS instance" via the squat-guard, but the in-flight migration records keyed on the OLD clientid cannot be claimed by the NEW clientid. Those moves get re-assigned fresh by the autopilot via the next PROXY_PROGRESS poll. 14.3. Lost Migration Records If the persisted migration record cannot be matched (e.g., MDS state corruption), the per-file reclaim returns NFS4ERR_NO_GRACE or NFS4ERR_RECLAIM_BAD. The PS reports the move as failed via PROXY_DONE(layout_stid, FAIL); the autopilot retries on a fresh PROXY_PROGRESS poll with a new target dstore (or the same target if it is still the best choice). Haynes Expires 30 October 2026 [Page 43] Internet-Draft FFv2 Proxy Server April 2026 15. Backward Compatibility This section addresses client/DS compatibility with the proxy-server protocol. For implementations of an earlier revision of this draft that pre-built against the prior CB_PROXY_* shape, see "Major revision (2026-04-26)" at the front of the draft: pre-built CB_PROXY_* implementations SHOULD treat CB op numbers 17-20 as defunct (return NFS4ERR_NOTSUPP if received), MUST migrate driving logic to the PROXY_PROGRESS poll path, and MUST replace any CB_PROXY_CANCEL receive logic with the new fore-channel PROXY_CANCEL send logic. 15.1. Clients Client behavior is a normal layout path with a new flag. Clients that do not recognize FFV2_DS_FLAGS_PROXY will treat the PROXY- flagged DS entry as any other DS and route I/O to it normally; that is in fact the correct behavior. No client-side version negotiation is needed. Clients that require strict per-DS identity checking (e.g., "the layout's DS must match a pre-allowlisted fingerprint") should extend their allowlist to include registered proxy servers. This is a deployment concern, not a protocol one. 15.2. 
Data Servers A host that does not implement the proxy server role simply does not call PROXY_REGISTRATION and is never selected for a MOVE or REPAIR assignment. A deployment with no registered PS falls back to per-chunk CB_CHUNK_REPAIR for single-shard repair, to admin-coordinated offline procedures for policy transitions and DS evacuation, and to blocking DS maintenance -- the DS cannot drain through a PS, so it must remain reachable to clients throughout its service life. Deployments SHOULD ensure at least one registered PS exists per failure domain to avoid a single point of failure on move operations. 15.3. NFSv3 Source DSes When the source mirror is an NFSv3 DS, the PS reads from it using NFSv3 semantics and writes to the NFSv4.2 destination using CHUNK semantics. This is the same pattern [I-D.haynes-nfsv4-flexfiles-v2] uses for InBand I/O. Haynes Expires 30 October 2026 [Page 44] Internet-Draft FFv2 Proxy Server April 2026 16. Security Considerations The security surface added by this document sits in two places: the session the PS establishes with the MDS, and the data path clients take through the PS during a proxy operation. The session is narrower than the data path -- only the MDS talks to the PS over it, and the MDS has long been a trusted coordinator in the pNFS model -- but it carries operations that affect every client whose layouts reach a PS. The data path is broader, because it exposes the PS to every client whose layout has the FFV2_DS_FLAGS_PROXY flag; a compromised PS on that path has the same observational and modification reach as a compromised DS, and in the translating-PS case a larger reach because of the elevated identity the PS typically runs with. The numbered items below name the specific threats the design either addresses or explicitly leaves out of scope. The rule on credential forwarding, because it is the most consequential and the most easily implemented incorrectly, is expanded in Section 16.1. 1.
*PS authority.* A PS in PROXY_ACTIVE sees all client I/O for the proxied file. A compromised PS can observe or modify file data. Deployments MUST treat PS-capable hosts as at least as trusted as the DSes they proxy for. PROXY_REGISTRATION SHOULD be gated by a deployment-level allowlist; arbitrary hosts that present the op without prior provisioning SHOULD be rejected. 2. *Transport security across the operation.* The PS's connections to source and destination DSes are independent of the client's connection to the PS. A PS MAY read from an AUTH_SYS source and write to a TLS destination (or any other combination). The PS is responsible for enforcing the effective security policy (e.g., do not downgrade encrypted data to a plaintext DS). 3. *Principal binding during a proxy operation.* For PS-to-DS traffic (the PS reading source DSes and writing destination DSes to carry out a MOVE or REPAIR assignment), the PS presents a principal to those DSes that they will accept; this is the PS's own service identity unless constrained delegation or equivalent is arranged. Forwarding the client's identity to the peer DSes for PS-driven data movement is NOT required and is typically NOT practical (the client is not in the conversation at that point). See the Credential Forwarding and Privilege Boundary section below for the case of client-initiated file I/O through a translating PS, where the credential-forwarding rule is different and stricter. Haynes Expires 30 October 2026 [Page 45] Internet-Draft FFv2 Proxy Server April 2026 4. *PS impersonation.* A malicious MDS could register a hostile entity as a PS. The existing MDS trust model already grants the MDS this capability via CB_LAYOUTRECALL and the ability to issue any layout it chooses; PROXY_REGISTRATION does not weaken it. Clients that require stronger PS identity verification SHOULD validate the PS's transport-security credentials against a deployment allowlist. 5. 
*Affinity token.* prr_affinity is a co-residency attestation, not an authentication mechanism. Matching an affinity token between a PS and a client grants the client no new access rights; it is used only as input to PS-selection preferences. A client cannot elevate privilege by spoofing an affinity token. 6. *Registration lease expiry.* If a PS's lease expires mid-operation, the MDS MUST abandon the operation: discard the in-flight migration record, revert the affected layouts to the pre-operation state, and arrange cleanup of any half-populated destination DSes. The MDS MUST NOT continue to route client I/O to a PS whose registration has lapsed. 16.1. Credential Forwarding and the Privilege Boundary A translating PS (see Section 4.7) has structurally elevated privilege by design. To perform its management tasks -- moves, repairs, evacuations, cross-tenant re-exports -- the deployment grants the PS's service identity broad access: typically not root-squashed, often read/write to every file in the namespace, and session authority to every DS. That privilege is intentional. A codec-ignorant client that reaches the PS, however, arrives with its own RPC credentials that the PS does not itself need in order to function. An NFSv3 client's uid/gid, an AUTH_SYS-squashed identity, an RPCSEC_GSS principal -- none of these are the PS's own. If the PS ignores the client's credentials and issues MDS or DS operations under its own service identity when translating client I/O, every client that reaches the PS silently inherits the PS's privilege. This is a protocol-level privilege-escalation vector, and this document calls it out rather than hiding it. The normative requirements below apply whenever a PS is translating client-initiated file I/O (as distinct from PS-driven move / repair work, which runs under the PS's own authority on directives from the MDS).
They form a cohesive set: credential pass-through (rule 1) is the core requirement; no-squash-inversion (rule 2) closes the most common way rule 1 can be implemented incorrectly; authorization-remains-with-MDS (rule 3) names the responsibility on the MDS side of the same contract; service-identity-is-for-the-control-plane (rule 4) draws the line between the op paths where the PS uses its own credentials and the op paths where it does not; and the failure-mode rule (rule 5) specifies the correct refusal behavior rather than letting a silent fall-through become the escape hatch.

1. *Credential pass-through.* The PS MUST present the client's credentials (RPC auth flavor and principal) on every MDS or DS operation it issues as a consequence of a client-initiated request. Specifically, a client READ that the PS expands into LAYOUTGET + CHUNK_READ MUST carry the client's credentials on both the LAYOUTGET against the MDS and the CHUNK_READ against the DSes. The PS MUST NOT substitute its own service identity for client-initiated operations.

2. *No squash inversion.* If the client arrives with a root-squashed identity (for example, uid 0 mapped to nobody by the NFSv3 export configuration on the client-facing side of the PS), the PS MUST preserve the squashed identity when forwarding. The PS MUST NOT translate a client's squashed credentials back into unsquashed root, even though the PS's own identity is typically unsquashed.

3. *Authorization remains with the MDS.* When a client-initiated operation reaches the MDS over a PS <-> MDS session, the MDS MUST use the RPC credentials carried on that compound for authorization and MUST NOT substitute the PS's session-level identity. Equivalently: the MDS performs access-control checks against the forwarded client credentials, not against the PS's service identity, for any client-initiated file operation.
The PS is a translator, not an authority. This is what prevents PS deployment from becoming a blanket ACL override.

4. *PS service identity is for the control plane only.* The PS MAY, and typically MUST, use its own service identity for:

   * The MDS <-> PS session (the session the PS opens to the MDS, on which PROXY_REGISTRATION, PROXY_PROGRESS, PROXY_DONE, and PROXY_CANCEL all flow on the fore-channel; the session's back-channel is not used by this draft).

   * Peer-DS session setup for PS-driven data movement (reading source DSes, writing destination DSes under a MOVE assignment the MDS has delivered via PROXY_PROGRESS).

   * PS housekeeping.

   The PS's service identity MUST NOT be used for client-initiated file data operations.

5. *Failure mode on missing credentials.* If the PS cannot forward a client's credentials (e.g., the client presented AUTH_NONE, or the client-facing side used a security flavor the PS cannot propagate), the PS MUST reject the client operation with the equivalent of NFS4ERR_ACCESS (or NFS3ERR_ACCES for NFSv3 clients). The PS MUST NOT fall back to serving the operation under its own identity.

Deployment-level requirements:

* PROXY_REGISTRATION MUST be allowlisted. An unknown host presenting PROXY_REGISTRATION MUST be rejected. This is the only wire-level defense against a hostile entity registering as a PS and then receiving client-forwarded credentials.

* The MDS <-> PS session MUST use RPCSEC_GSS [RFC7861] or RPC-over-TLS [RFC9289] with mutual authentication. AUTH_SYS on the MDS <-> PS session is forbidden.

* Deployments SHOULD audit both the PS's credential-forwarding behavior (the PS logs what it forwards) and the MDS's authorization checks (the MDS logs what principal authorized each operation). Divergence between the two indicates a credential-forwarding bug or compromise.
What the protocol cannot defend against:

* A compromised PS has direct access to whatever credentials pass through it. Credential confidentiality collapses the moment the PS is under adversary control. Mitigation is operational: restrict which hosts can register as a PS, audit PROXY_REGISTRATION events, rotate deployment-level keys.

* A deployment that configures a PS to run as root while the client is root-squashed has already violated rule 2 above; no wire mechanism detects a PS deliberately mis-implementing credential forwarding. Deployments SHOULD verify their PS implementation's credential-forwarding behavior through conformance testing before production use.

Future work (noted as an Open Question below): RPCSEC_GSSv3 structured privilege assertion per [RFC7861] Section 2.5.2 is the natural strong-authentication mechanism for PS-forwarded credentials. This revision does not require GSSv3 because the broader NFSv4 deployment base does not yet support it; deployments that can use GSSv3 SHOULD prefer it over AUTH_SYS passthrough for the credential-forwarding channel.

16.2. Namespace Traversal Privilege

A PS that translates client I/O has to know how the MDS's namespace is shaped: which paths are exported, what filehandle each path resolves to, how the exports mount within one another. The PS acquires this information by traversing the MDS's namespace -- LOOKUP, LOOKUPP, PUTFH, PUTROOTFH, GETFH on the PS <-> MDS session. This traversal cannot always run under forwarded client credentials: at the point the PS needs to discover a new export (a client has not yet asked for it, or the PS has just restarted and has no FH cache), there is no client whose credentials the PS could forward.

Deployments have two choices for how the PS acquires namespace shape:

1.
*Grant a narrow traversal privilege.* The MDS MAY treat a registered PS's service identity as authorized for LOOKUP, LOOKUPP, PUTFH, PUTROOTFH, GETFH, and SEQUENCE on the PS <-> MDS session without applying the MDS's export-rule filtering that would normally gate those names. This is strictly a structural privilege: it permits the PS to see that paths exist and to obtain their filehandles, but grants no data access. All operations that carry or require data authorization (OPEN, READ, WRITE, LAYOUTGET, GETATTR of privileged attributes, etc.) MUST still run under the rules of Section 16.1: forwarded client credentials for client-initiated operations, and the PS service identity only for control-plane operations.

   A deployment that grants this privilege discloses the MDS's namespace shape to the PS's service identity -- specifically, names that the PS's source address would not be able to see through the MDS's normal export filtering. Deployments SHOULD audit traversal compounds on registered-PS sessions so the disclosure is reviewable; the MDS SHOULD log each LOOKUP / GETFH that benefits from the bypass.

2. *Do not grant the privilege.* The PS is required to translate every client-originated LOOKUP into a separate LOOKUP against the MDS under the client's forwarded credentials, caching only what the client's credentials authorized the MDS to return. This eliminates the namespace-shape disclosure but costs an extra MDS round-trip per client LOOKUP miss and leaves the PS unable to pre-discover exports.

This document does not normatively prefer one approach over the other. Implementations SHOULD document which they use; deployment guidance for the common combined DS+PS case is that option (1) is expected and the narrower privilege is acceptable, given that the PS is already a trusted control-plane peer of the MDS.

The traversal privilege is distinct from and narrower than root_squash bypass.
A forwarded-uid-0 client operation (OPEN, READ, etc.) under option (1) is still subject to normal root_squash handling on the PS's source-address rule at the MDS; the traversal privilege applies only to the six ops enumerated above.

16.3. PS-Side Policy Enforcement (informative)

A PS implementation MAY perform per-client export-rule enforcement locally, rejecting operations the MDS would also reject before forwarding them. This is a performance optimization: it keeps bad requests off the PS <-> MDS wire and lets the PS return NFS4ERR_WRONGSEC / NFS4ERR_ACCESS without paying a round-trip.

Local enforcement is not a security boundary. Rule 3 of Section 16.1 names the MDS as the authority for every client-initiated file operation. A PS that performs local enforcement is checking its own cached copy of the MDS's per-client rules; if the copy is stale, wrong, or absent, the PS MUST forward the operation and let the MDS decide. A PS implementation that declines to perform local enforcement is conformant with this specification.

Deployments that want local enforcement need a mechanism for the PS to acquire the MDS's per-export client-rule list. This document does not standardize such a mechanism; implementation-specific options include a control-plane probe-protocol extension, out-of-band admin distribution, or a future revision of this specification. Any such mechanism MUST itself be authenticated against the PS allowlist (the rules are sensitive deployment policy) and MUST support a refresh path so PSes see rule changes within a bounded time.

17. Implementations

This section documents the publicly available implementation the editors are aware of at the time of writing, in line with [RFC7942].

17.1. reffs

reffs is an open-source NFSv4.2 server, MDS, and erasure-coding client. The reffs source ships an MDS, a Proxy Server, and a multi-codec client harness used as the working implementation for this draft. reffs is licensed AGPL-3.0-or-later.
The PS surface implemented in reffs covers, at the time of writing:

* The proxy listener model (one process serving its native NFS port and a per-[[proxy_mds]] PS port from independent SB namespaces, see Section 5.2).

* PROXY_REGISTRATION over RPCSEC_GSS-class auth, presently exercised via mutually-authenticated RPC-over-TLS ([RFC9289]) with a client-cert SHA-256 fingerprint allowlist. AUTH_SYS over plain TCP is rejected with NFS4ERR_PERM per Section 16.

* PROXY_PROGRESS lease renewal and the empty-assignment idle path used by every steady-state PS poll.

* Forwarding of LOOKUP, OPEN, READ, WRITE, GETATTR, CLOSE, LAYOUTGET, GETDEVICEINFO, LAYOUTRETURN, and LAYOUTERROR through the PS to the upstream MDS using the end-client's credentials.

Fore-channel ops not yet exercised end-to-end in the public implementation include PROXY_DONE / PROXY_CANCEL, which are issued only after a PROXY_PROGRESS reply that delivers a proxy_assignment4. The MDS-driven assignment model (move, repair) is wire-implemented, but the only assignment kind exercised by the published demo is the implicit no-assignment heartbeat that every PROXY_PROGRESS produces.

17.2. Demonstration

A reproducible demonstration of cross-PS proxying, exercising the layout-passthrough data path through PS-A and PS-B against a shared MDS + 6 DSes, lives in the reffs source under deploy/sanity/. The demo does not exercise migration, repair, or any proxy_assignment4; its purpose is to show that a client's codec-encoded write through one PS is recoverable byte-for-byte through a peer PS that shares the same MDS.
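The demo's pass criterion is a plain byte-for-byte comparison: write through one PS mount, read through the other, and compare digests. The sketch below is illustrative only; the function names and the mount-point paths in the comment are hypothetical stand-ins, not part of the reffs tree.

```python
# Illustrative sketch of the cross-PS demo check: data written through
# PS-A must read back byte-for-byte identical through PS-B.
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def cross_ps_check(write_path: str, read_path: str, payload: bytes) -> bool:
    """Write payload via one path, read it via the other, compare digests."""
    with open(write_path, "wb") as f:
        f.write(payload)
    with open(read_path, "rb") as f:
        echoed = f.read()
    return sha256_hex(payload) == sha256_hex(echoed)

# Against a real deployment, write_path and read_path would be the same
# exported file seen through two NFS mounts (hypothetically, e.g.,
# /mnt/ps-a/ffv2-rs/f and /mnt/ps-b/ffv2-rs/f); for a self-contained
# sketch, one local file can stand in for both.
```

A PASS in the matrix corresponds to this check returning True for every row's export path.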
The matrix:

+===============+========+==========================+========+
| Path          | Layout | Codec                    | Result |
+===============+========+==========================+========+
| /ffv1-csm     | FF v1  | plain mirror             | PASS   |
+---------------+--------+--------------------------+--------+
| /ffv1-stripes | FF v1  | stripe k=6, m=0          | PASS   |
+---------------+--------+--------------------------+--------+
| /ffv2-csm     | FF v2  | plain mirror, CHUNK      | PASS   |
+---------------+--------+--------------------------+--------+
| /ffv2-rs      | FF v2  | RS(4,2), CHUNK           | PASS   |
+---------------+--------+--------------------------+--------+
| /ffv2-mj      | FF v2  | Mojette systematic (4,2) | PASS   |
+---------------+--------+--------------------------+--------+

                          Table 2

For each row the client opens /codec_