Managing Netplan with Django in Production (Revisited)
Table of Contents
- Situation
- Problem
- What Changed
- Architecture
- Safety Rules
- Django Workflow
- Host Watcher
- UI Feedback
- Terminal Fallback
- Flow
- Takeaways
- Final Notes
Situation
Changing a server IP address remotely is the kind of task that punishes overconfidence. One wrong route, one malformed YAML file, or one DHCP lease with an unexpected default route can make the management UI disappear.
In field deployments we do not always have SSH access. Operators may need to:
- Move the appliance between networks
- Switch the default Ethernet interface
- Change an interface from static addressing to DHCP
- Leave a secondary interface with no address at all
- Recover from stale netplan or cloud-init files left by the OS image
So the project manages netplan through a guarded workflow instead of letting the web request run netplan apply directly.
The important idea is still simple:
Django stages the desired config. The host watcher applies it. The UI waits until the real host interface state matches what was requested.
Problem
Manually editing /etc/netplan/*.yaml is risky in production because:
- A YAML mistake can break networking immediately.
- A default route on the wrong interface can kill the active session.
- DHCP can inject routes unless it is explicitly controlled.
- Old netplan files can conflict with the file you intended to use.
- Cloud-init can reintroduce network settings after boot.
- A web request is the wrong place to run long, privileged host operations.
The project needed a UI-driven process with:
- A strict stage -> confirm -> apply flow
- Validation before any live netplan file is touched
- A clear default-interface rule
- Per-interface YAML files
- Backups before promotion
- A host-side service that survives errors
- A status endpoint that checks actual interface state
- Logs that operators can understand
What Changed
The old post described the core idea, but the implementation has grown.
The current code now has:
- A separate form for selecting the default interface
- Three addressing modes: static, dhcp, and no_ip
- A hidden DHCP flag in the web UI so disabled inputs do not confuse the server
- Server-side validation for interface name, gateway placement, static default interface, CIDR format, and gateway subnet membership
- A watcher-generated interface_config.json file used as the source of truth for UI status
- A polling endpoint that validates the final state from JSON, not only from log text
- A summarized network log with English and Persian user-facing messages
- A manual/terminal dialog path that uses the same pending-file and watcher contract
- A watcher startup routine that removes conflicting netplan entries, disables cloud-init networking, validates netplan syntax, and refreshes interface JSON
The rest of this post describes the current behavior without exposing real deployment values. Code blocks are pseudocode, not exact project code.
Architecture
The web container has controlled access to host networking files:
host /sys/class/net -> container /host_sys/class/net read-only
host /etc/netplan -> container /host_netplan read-write
app state directory -> container /code read-write
Django uses those mounts to list interfaces and stage netplan YAML. The host watcher runs outside the request path and uses the real host locations:
/etc/netplan
/sys/class/net
<app-state>/ip_reset
<app-state>/ip_refresh
<app-state>/netplan_result.log
<app-state>/interface_config.json
The split is intentional:
- Django owns validation, staging, confirmation, session state, and UI feedback.
- The host watcher owns netplan generate, netplan apply, cleanup, and JSON state generation.
- The wait page polls Django, and Django reads host-generated state.
The systemd service is deliberately small:
[Service]
ExecStart=/bin/bash <app-dir>/host_network_info.sh
Restart=always
RestartSec=2
StandardOutput=append:<app-dir>/netplan_result.log
StandardError=append:<app-dir>/netplan_result.log
The service is allowed to restart, but the script itself is also written to keep running through most failures and log what happened.
Safety Rules
The default interface is stored in a project config file, under the default_interface key of the system section. Operators can change it from the settings page, and every network change rereads the current value before building netplan.
Pseudocode:
def get_default_interface():
    return read_config_value(section="system", key="default_interface")
Only Ethernet-style interfaces are accepted:
def allowed_interfaces():
    return [
        name
        for name in list_directory("/host_sys/class/net")
        if name.startswith(("eth", "ens", "enp"))
    ]
The server validates the selected interface even though the UI already provides a dropdown:
if selected_interface not in allowed_interfaces():
    reject("Unknown interface")
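A runnable version of the interface check might look like the sketch below. The prefixes and the /host_sys/class/net mount come from the post; the function names, the sorted output, and the ValueError are assumptions made for illustration, not the project's actual code.

```python
from pathlib import Path

# Prefixes taken from the rule above.
ETHERNET_PREFIXES = ("eth", "ens", "enp")


def allowed_interfaces(sys_net_dir="/host_sys/class/net"):
    """Return sorted Ethernet-style interface names found under sys_net_dir."""
    root = Path(sys_net_dir)
    if not root.is_dir():
        # A missing mount yields no interfaces, so every request is rejected.
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.name.startswith(ETHERNET_PREFIXES)
    )


def require_known_interface(name, sys_net_dir="/host_sys/class/net"):
    """Return name unchanged, or raise ValueError if it is not an allowed interface."""
    if name not in allowed_interfaces(sys_net_dir):
        raise ValueError(f"Unknown interface: {name!r}")
    return name
```

Returning an empty list when the mount is missing means a misconfigured container fails closed instead of accepting arbitrary names.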
The critical route rules are:
- Only the default interface may have a gateway.
- The default interface must use a static IP.
- DHCP is allowed only on non-default interfaces.
- A gateway is only valid with a static IP.
- If a gateway is provided, it must belong to the same subnet as the static IP.
- Non-default DHCP must not install default routes.
Pseudocode:
if interface != default_interface and gateway:
    reject("Gateway is only allowed on the default interface")
if interface == default_interface and (use_dhcp or not ip_cidr):
    reject("Default interface requires a static IP")
if gateway and not ip_cidr:
    reject("Gateway requires a static IP")
if ip_cidr and gateway and gateway not in network(ip_cidr):
    reject("Gateway must be inside the IP subnet")
These rules are the main lockout protection. They keep the default route predictable and prevent secondary DHCP interfaces from taking over traffic.
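The rules above can be expressed with Python's standard ipaddress module. This is a minimal sketch, not the project's validator: the function name, the mode strings as arguments, and the choice to return a list of messages are assumptions.

```python
import ipaddress


def validate_request(interface, default_interface, mode, ip_cidr="", gateway=""):
    """Return a list of rule violations; an empty list means the request passes.

    mode is one of "static", "dhcp", "no_ip".
    """
    errors = []
    is_default = interface == default_interface

    if gateway and not is_default:
        errors.append("Gateway is only allowed on the default interface")
    if is_default and mode != "static":
        errors.append("Default interface requires a static IP")
    if mode == "static" and not ip_cidr:
        errors.append("Static mode requires an IP in CIDR form")
    if gateway and mode != "static":
        errors.append("Gateway requires a static IP")

    if mode == "static" and ip_cidr:
        # ip_interface() accepts a bare address as /32, so require the prefix explicitly.
        if "/" not in ip_cidr:
            errors.append("IP must include a prefix length, for example 192.0.2.10/24")
            return errors
        try:
            iface_addr = ipaddress.ip_interface(ip_cidr)
        except ValueError:
            errors.append("IP is not a valid address in CIDR form")
            return errors
        if gateway:
            try:
                gw = ipaddress.ip_address(gateway)
            except ValueError:
                errors.append("Gateway is not a valid IP address")
                return errors
            # Subnet membership: the gateway must fall inside the interface network.
            if gw not in iface_addr.network:
                errors.append("Gateway must be inside the IP subnet")
    return errors
```

For example, validate_request("eth1", "eth0", "dhcp") passes, while the same call with interface "eth0" is rejected because the default interface must stay static.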
Django Workflow
1. Read Current Interface State
The UI displays the watcher-generated interface snapshot:
[
  {
    "interface": "<iface>",
    "ip": "<current-ip-cidr-or-empty>",
    "gateway": "<default-gateway-or-empty>"
  }
]
Django does not shell out on every page load to discover state. It reads the JSON file generated by the watcher.
There is also a small endpoint for one interface:
def interface_status(request):
    iface = request.query["interface"]
    state = read_json("<app-state>/interface_config.json")
    return state_for(state, iface)
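Because the watcher may be rewriting the JSON file while Django reads it, the reader should tolerate a missing or half-written file. The sketch below shows one way to do that; the helper names and the empty-placeholder fallback are assumptions, and only the JSON shape comes from the post.

```python
import json
from pathlib import Path


def load_interface_state(json_path):
    """Return the watcher snapshot as a list of dicts, or [] if missing or unreadable."""
    try:
        data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing file or a partial write: report "no data yet" instead of failing.
        return []
    return data if isinstance(data, list) else []


def state_for(snapshot, iface):
    """Return the snapshot entry for iface, or an empty placeholder entry."""
    for entry in snapshot:
        if isinstance(entry, dict) and entry.get("interface") == iface:
            return entry
    return {"interface": iface, "ip": "", "gateway": ""}
```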
2. Control the Form
The frontend helps operators avoid invalid combinations:
- No interface selected: IP, gateway, and DHCP controls are disabled.
- Default interface selected: DHCP is disabled and gateway is editable.
- Non-default interface selected: DHCP is allowed and gateway is read-only.
- DHCP checked: IP input is disabled and the server receives a hidden use_dhcp=true field.
The hidden field matters because disabled inputs are not submitted by browsers. The server still revalidates everything, but the hidden DHCP flag makes the operator’s intent explicit.
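On the server side, the hidden flag can be turned into one of the three addressing modes before validation. The following is a sketch under assumptions: use_dhcp is the field named in the post, while the ip field name, the accepted truthy strings, and the function name are hypothetical.

```python
def resolve_mode(form):
    """Map submitted form fields to "static", "dhcp", or "no_ip".

    form is any mapping of field name to string, such as request.POST in Django.
    """
    use_dhcp = str(form.get("use_dhcp", "")).strip().lower() in {"true", "on", "1"}
    ip_cidr = str(form.get("ip", "")).strip()
    if use_dhcp:
        # The explicit flag wins, because the disabled IP input was never submitted.
        return "dhcp"
    return "static" if ip_cidr else "no_ip"
```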
3. Build Netplan In Memory
Django builds a per-interface netplan document in memory first:
config = {
    "network": {
        "version": 2,
        "renderer": "networkd",
        "ethernets": {
            interface: {
                "optional": True
            }
        }
    }
}
Static mode:
config["network"]["ethernets"][interface].update({
    "dhcp4": False,
    "addresses": ["<ip>/<cidr>"]
})
DHCP mode for a non-default interface:
config["network"]["ethernets"][interface].update({
    "dhcp4": True,
    "addresses": [],
    "dhcp4-overrides": {
        "use-routes": False,
        "route-metric": 500
    }
})
No-IP mode for a non-default interface:
config["network"]["ethernets"][interface].update({
    "dhcp4": False,
    "addresses": []
})
A default route is added only when the selected interface is the default interface and a gateway was provided:
if interface == default_interface and gateway:
    config["network"]["ethernets"][interface]["routes"] = [
        {"to": "0.0.0.0/0", "via": "<gateway-ip>"}
    ]
Notice that the examples use placeholders. The real system writes the actual operator-provided values after validation.
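Put together, the fragments above could be collected into a single builder. This sketch reuses the keys shown in the post; the function name and signature are assumptions, and serializing the dictionary to YAML is left to whichever YAML library the project uses.

```python
def build_netplan(interface, default_interface, mode, ip_cidr="", gateway=""):
    """Build a per-interface netplan document as a plain dictionary."""
    iface_cfg = {"optional": True}

    if mode == "static":
        iface_cfg.update({"dhcp4": False, "addresses": [ip_cidr]})
    elif mode == "dhcp":
        iface_cfg.update({
            "dhcp4": True,
            "addresses": [],
            # Keep DHCP-provided routes out of the routing table.
            "dhcp4-overrides": {"use-routes": False, "route-metric": 500},
        })
    elif mode == "no_ip":
        iface_cfg.update({"dhcp4": False, "addresses": []})
    else:
        raise ValueError(f"Unknown mode: {mode!r}")

    # Only the default interface may carry the default route.
    if interface == default_interface and gateway:
        iface_cfg["routes"] = [{"to": "0.0.0.0/0", "via": gateway}]

    return {
        "network": {
            "version": 2,
            "renderer": "networkd",
            "ethernets": {interface: iface_cfg},
        }
    }
```

The builder assumes its inputs already passed validation; it does not repeat the route rules beyond the default-interface check.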
4. Stage Only
Saving the form does not activate the change. It writes only a pending file:
/host_netplan/<interface>.yaml.pending
Pseudocode:
remove_old_pending_file(interface)
pending = "<netplan-mount>/<interface>.yaml.pending"
write_yaml(pending, config)
chmod_owner_root_only(pending)
redirect_to_confirm_page(interface, mode, expected_ip)
At this point:
- The live YAML is untouched.
- netplan apply has not run.
- Routing has not changed.
- The operator still gets a confirmation screen.
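The staging step can be sketched in Python as follows. It assumes the YAML text has already been rendered and that the process runs as the user who should own the file; the function name and the 0600 mode are illustrative choices, not confirmed project details.

```python
import os
from pathlib import Path


def stage_pending(netplan_dir, interface, yaml_text):
    """Write <interface>.yaml.pending with owner-only permissions and return its path."""
    pending = Path(netplan_dir) / f"{interface}.yaml.pending"
    # Remove any earlier pending file so the exclusive create below cannot hit a stale one.
    pending.unlink(missing_ok=True)
    # Create the file with restrictive permissions from the start, not after writing.
    fd = os.open(pending, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(yaml_text)
        handle.flush()
        os.fsync(handle.fileno())
    return pending
```

The live <interface>.yaml is never opened here, which is what keeps staging free of side effects.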
5. Confirm Is the Commit Point
The confirmation view is the only place where pending config becomes live.
On confirm:
- Ensure the pending file exists.
- Copy the current live file to a .bak file if it exists.
- Promote pending to live with a same-filesystem replace.
- Atomically write the ip_reset signal file.
- Store expected state in the session for polling.
- Redirect to the wait page, or to the new default-interface IP if the default interface IP actually changed.
Pseudocode:
if not exists(pending):
    reject("No pending config")
if exists(live):
    copy(live, backup)
replace(pending, live)
atomic_write("<app-state>/ip_reset", "request metadata")
session["ip_change_iface"] = interface
session["ip_change_mode"] = mode
session["ip_change_expected_ip"] = expected_ip_or_empty
session["ip_change_started_at"] = now()
The signal file is written atomically:
def atomic_write(path, content):
    tmp = path + ".tmp"
    write_and_fsync(tmp, content)
    replace(tmp, path)
This gives the watcher a clean file modification event and avoids partially written signals.
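The commit step maps closely onto Python's standard library. The sketch below is one possible rendering: os.replace and shutil.copy2 are real standard-library calls, while the promote function, its arguments, and the <interface>.yaml.bak naming are assumptions.

```python
import os
import shutil
from pathlib import Path


def atomic_write(path, content):
    """Write content to a temporary sibling, fsync it, then rename it over path."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as handle:
        handle.write(content)
        handle.flush()
        os.fsync(handle.fileno())
    # The rename is atomic when tmp and path are on the same filesystem.
    os.replace(tmp, path)


def promote(netplan_dir, interface, signal_path, metadata):
    """Back up the live YAML, promote the pending YAML, then signal the watcher."""
    netplan_dir = Path(netplan_dir)
    live = netplan_dir / f"{interface}.yaml"
    pending = netplan_dir / f"{interface}.yaml.pending"
    backup = netplan_dir / f"{interface}.yaml.bak"

    if not pending.exists():
        raise FileNotFoundError("No pending config")
    if live.exists():
        shutil.copy2(live, backup)
    os.replace(pending, live)
    # Signal last, so the watcher never sees the signal before the new YAML.
    atomic_write(signal_path, metadata)
    return live
```

Because the pending and live files sit in the same directory, the replace stays on one filesystem, which is the condition for the rename to be atomic.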
Host Watcher
The watcher is the part that touches the real host networking stack.
On startup it:
- Removes conflicting netplan entries.
- Keeps only per-interface YAML files with expected Ethernet-style names.
- Disables cloud-init network configuration if needed.
- Runs netplan generate to validate syntax.
- Waits briefly for interfaces to settle.
- Writes interface_config.json.
Pseudocode:
cleanup_netplan_dir
disable_cloud_init_network
netplan_generate_for_validation
wait_for_interfaces
generate_interface_json
remember_signal_file_timestamps
The cleanup is intentionally strict. It keeps files shaped like:
<interface>.yaml
and removes stale or conflicting entries. That prevents old cloud-init or installer-generated files from defining the same interface behind our back.
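The keep-or-remove decision can be illustrated with a small classifier. The real watcher is a shell script, so this Python sketch only mirrors the idea. The regular expression is an assumption based on the eth, ens, and enp prefixes, and the post does not say how .pending and .bak files are treated, so the sketch keeps them.

```python
import re

# Per-interface files such as eth0.yaml or enp3s0.yaml.
IFACE_YAML = re.compile(r"^(?:eth|ens|enp)[a-z0-9]+\.yaml$")
# Assumption: staging and backup files from the workflow are left alone.
WORKFLOW_SUFFIXES = (".yaml.pending", ".yaml.bak")


def plan_cleanup(filenames):
    """Split netplan directory entries into (keep, remove) lists without deleting anything."""
    keep, remove = [], []
    for name in sorted(filenames):
        if IFACE_YAML.match(name) or name.endswith(WORKFLOW_SUFFIXES):
            keep.append(name)
        else:
            remove.append(name)
    return keep, remove
```

Separating the plan from the deletion makes it possible to log what would be removed before anything is touched.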
The main loop watches two signal files:
while true; do
    if refresh_signal_changed; then
        regenerate_interface_json
        write_refresh_log
    fi
    if reset_signal_changed; then
        detect_duplicate_interface_definitions
        run_netplan_apply_without_crashing_watcher
        wait_for_interfaces
        regenerate_interface_json
        write_apply_log
    fi
    sleep 1
done
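The change detection behind refresh_signal_changed and reset_signal_changed is not shown in the post. One plausible approach, given that startup remembers the signal file timestamps, is to compare modification times on each pass. The Python sketch below illustrates that idea only; the actual watcher is a shell script, and the class name is hypothetical.

```python
import os


class SignalFile:
    """Track one signal file and report when its modification time changes."""

    def __init__(self, path):
        self.path = path
        # Remember the timestamp at startup so an old signal is not replayed.
        self.last_seen = self._mtime()

    def _mtime(self):
        try:
            return os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return None

    def changed(self):
        """Return True once for each new modification time observed."""
        current = self._mtime()
        if current is not None and current != self.last_seen:
            self.last_seen = current
            return True
        return False
```

A loop would call changed() on each signal once per second and run the matching branch when it returns True.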
After an apply, the watcher logs:
- Whether netplan apply completed with success or warning
- Duplicate interface definitions, if detected
- Which interfaces changed IP
- Any relevant netplan output, with noisy Open vSwitch warnings filtered out
It then rewrites interface_config.json from the actual host state:
for iface in ethernet_interfaces; do
    ip = current_ipv4_for(iface)
    gateway = default_gateway_for(iface)
    append_json({ "interface": iface, "ip": ip, "gateway": gateway })
done
That JSON file is the bridge between the host and Django.
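One way to produce that snapshot is from the JSON output of iproute2. The sketch below assumes the inputs are the parsed results of ip -j addr show and ip -j route show default, using the ifname, addr_info, family, local, prefixlen, dev, and gateway fields. How the real watcher gathers this data is not stated in the post, and the function name is hypothetical.

```python
def build_snapshot(addr_entries, default_routes, prefixes=("eth", "ens", "enp")):
    """Build the interface snapshot from parsed iproute2 JSON output.

    addr_entries: parsed output of `ip -j addr show`
    default_routes: parsed output of `ip -j route show default`
    """
    # Map each device to the gateway of its first default route.
    gateways = {}
    for route in default_routes:
        dev = route.get("dev")
        if dev and "gateway" in route:
            gateways.setdefault(dev, route["gateway"])

    snapshot = []
    for entry in addr_entries:
        name = entry.get("ifname", "")
        if not name.startswith(prefixes):
            continue
        ip = ""
        for info in entry.get("addr_info", []):
            if info.get("family") == "inet":
                # Report the first IPv4 address in CIDR form.
                ip = f"{info['local']}/{info['prefixlen']}"
                break
        snapshot.append({"interface": name, "ip": ip, "gateway": gateways.get(name, "")})
    return snapshot
```

An interface with no IPv4 address still appears in the result with an empty ip, which is what the no_ip readiness check depends on.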
UI Feedback
The wait page does not assume that a log line means the network is correct. Instead, Django checks both time and actual state.
When the confirm view starts an apply, it records:
session = {
    "interface": "<iface>",
    "mode": "static | dhcp | no_ip",
    "expected_ip": "<ip-without-cidr-or-empty>",
    "started_at": now()
}
The polling endpoint reads only logs and JSON newer than started_at, so old success messages do not accidentally satisfy a new request.
Readiness is mode-specific:
if mode == "static":
    ready = current_ip_without_cidr(interface) == expected_ip
elif mode == "dhcp":
    ready = interface_has_some_ipv4(interface)
elif mode == "no_ip":
    ready = interface_has_no_ipv4(interface)
else:
    ready = interface_has_some_ipv4(interface)
Then the endpoint returns:
- updated when the host state matches the request
- pending while the watcher or JSON refresh is still catching up
- error when apply status fails or the wait times out
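The readiness check and the three statuses can be combined into a small pure function. This is a sketch under assumptions: the 60-second timeout, the argument names, and the use of the snapshot file's modification time as the freshness test are all hypothetical.

```python
def is_ready(mode, current_ip_cidr, expected_ip=""):
    """Return True when the observed address matches what the mode requires."""
    current_ip = current_ip_cidr.split("/", 1)[0].strip()
    if mode == "static":
        return bool(expected_ip) and current_ip == expected_ip
    if mode == "no_ip":
        return current_ip == ""
    # dhcp, and any unknown mode, only require that some IPv4 address is present.
    return current_ip != ""


def poll_status(mode, current_ip_cidr, expected_ip, snapshot_mtime, started_at, now,
                timeout=60.0, apply_failed=False):
    """Return "updated", "pending", or "error" for one polling request."""
    if apply_failed:
        return "error"
    # Ignore snapshots written before this request started.
    fresh = snapshot_mtime >= started_at
    if fresh and is_ready(mode, current_ip_cidr, expected_ip):
        return "updated"
    if now - started_at > timeout:
        return "error"
    return "pending"
```

Keeping the function free of I/O makes the freshness rule easy to test: a matching IP in a stale snapshot still yields pending.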
On error or timeout, Django copies the .bak file back over the live YAML when a backup exists, then sends the operator back to the network settings page.
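The restore step is small enough to sketch directly. The function name and the <interface>.yaml.bak naming are assumptions; the post only states that the backup is copied back over the live YAML.

```python
import shutil
from pathlib import Path


def restore_backup(netplan_dir, interface):
    """Copy <interface>.yaml.bak over the live YAML; return False if no backup exists."""
    netplan_dir = Path(netplan_dir)
    live = netplan_dir / f"{interface}.yaml"
    backup = netplan_dir / f"{interface}.yaml.bak"
    if not backup.exists():
        return False
    # copy2 keeps the backup in place, so the restore can be repeated if needed.
    shutil.copy2(backup, live)
    return True
```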
The raw watcher log is summarized before display. Instead of showing every shell line, the UI maps known messages to short operator-facing text, for example:
Preparing network settings.
Validating network settings.
Network changes were applied successfully.
Address updated on interface <iface>.
The same summarizer has Persian labels as well.
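A summarizer of this kind can be a table of patterns mapped to message keys. In the sketch below the English messages are the ones quoted above, but the raw log patterns are hypothetical, since the post does not show the watcher's log format. A Persian table would be a second entry in MESSAGES with the same keys.

```python
import re

# Hypothetical raw-log patterns mapped to message keys.
RULES = [
    (re.compile(r"netplan generate"), "validating"),
    (re.compile(r"netplan apply.*(ok|success)", re.IGNORECASE), "applied"),
    (re.compile(r"ip changed on (?P<iface>\S+)"), "address_updated"),
]

MESSAGES = {
    "en": {
        "validating": "Validating network settings.",
        "applied": "Network changes were applied successfully.",
        "address_updated": "Address updated on interface {iface}.",
    },
}


def summarize(raw_lines, lang="en"):
    """Return short operator-facing messages for the known lines in raw_lines."""
    table = MESSAGES.get(lang, MESSAGES["en"])
    summary = []
    for line in raw_lines:
        for pattern, key in RULES:
            match = pattern.search(line)
            if match:
                text = table[key].format(**match.groupdict())
                # Skip repeats so a noisy log does not flood the UI.
                if text not in summary:
                    summary.append(text)
                break
    return summary
```

Lines that match no rule are dropped, which is how shell noise stays out of the operator view.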
Terminal Fallback
There is also a host-side dialog tool for operators who are working locally or through a console session.
It follows the same contract:
- Read the default interface from the same config file.
- List Ethernet-style interfaces.
- Warn before changing the default interface.
- Stage <interface>.yaml.pending.
- Back up the live YAML before promotion.
- Promote pending files to live files.
- Signal the same watcher with ip_reset.
- Wait for the watcher log and show the result.
The terminal path is not a separate networking system. It is another front end for the same stage -> promote -> signal -> apply design.
Flow
- Operator opens Network Settings.
- Django reads the default interface and watcher-generated interface JSON.
- Operator optionally changes the default interface.
- Operator selects an Ethernet interface.
- UI enables only valid controls for that interface.
- Operator chooses static, DHCP, or no-IP behavior.
- Django validates the submitted values.
- Django writes <interface>.yaml.pending.
- Operator reviews the confirmation page.
- Confirm backs up live YAML and promotes pending YAML.
- Confirm atomically writes ip_reset.
- The host watcher runs netplan apply.
- The watcher regenerates interface_config.json.
- The wait page polls Django.
- Django compares requested state to actual host state.
- The operator is returned to settings, or the browser moves to the new IP if the default-interface address changed.
Downtime is usually just the time netplan needs to apply and the browser needs to reconnect.
Takeaways
- Keep web validation separate from privileged host application.
- Treat netplan changes as a staged transaction.
- Make confirmation the only commit point.
- Store the default interface in one place and enforce it everywhere.
- Keep the default interface static-only.
- Never allow gateways on non-default interfaces.
- Disable DHCP route injection on non-default DHCP interfaces.
- Verify success from actual interface state, not only from logs.
- Generate a small JSON snapshot for the UI.
- Keep backups before promotion.
- Clean stale netplan and cloud-init conflicts.
- Keep the watcher alive even when individual operations fail.
Final Notes
The design works because it accepts that network changes are dangerous.
Instead of pretending a web form can safely edit networking directly, the project turns the change into a controlled handoff:
validate -> stage -> confirm -> promote -> signal -> apply -> verify
That gives operators a predictable workflow without requiring SSH, while still keeping the risky part on the host side where netplan actually belongs.
The most important part is not the YAML generation. It is the discipline around when the YAML becomes live, who applies it, and how the UI decides the change really worked.