Optimize boot time and consolidate confd into a single daemon #1412

Draft
troglobit wants to merge 41 commits into main from initviz

@troglobit troglobit commented Feb 22, 2026

Description

This PR replaces sysrepo-plugind + bootstrap + load scripts with a single confd that handles config generation, datastore init, config loading, and plugin management. The daemon uses a libev event loop with SR_SUBSCR_NO_THREAD instead of ~30 per-subscription sysrepo threads. Care has been taken to handle crashes more gracefully, both at runtime and at bootstrap, where confd falls back to failure-config.

The new initviz package was used for boot visualization.

Initial work on this branch sped up the sysctl-sync-ip-conf and mnt scripts, and moved hostname.d setup from runtime to build time. To optimize the bootstrap process and free up CPU time on single-core systems, the start of many services, like dbus, dnsmasq, and statd, has been postponed to either runlevel 2 or until after confd bootstrap. To reduce overhead, we also drop unnecessary logger processes for services that already log to syslog. Even BusyBox has been inspected: we now enable NOEXEC/NOFORK applets.

Another notable change in this PR is dropping WiFi and GPS from the minimal defconfigs, and moving the WiFi/firmware selects from the RPi board Config.in to the full defconfigs, so features are explicit rather than implicit via the BSP.

Checklist

Tick relevant boxes, this PR is-a or has-a:

  • Bugfix
    • Regression tests
    • ChangeLog updates (for next release)
  • Feature
    • YANG model change => revision updated?
    • Regression tests added?
    • ChangeLog updates (for next release)
    • Documentation added?
  • Test changes
    • Checked in changed Readme.adoc (make test-spec)
    • Added new test to group Readme.adoc and yaml file
  • Code style update (formatting, renaming)
  • Refactoring (please detail in commit messages)
  • Build related changes
  • Documentation content changes
    • ChangeLog updated (for major changes)
  • Other (please describe):

@troglobit troglobit added the ci:main Build default defconfig, not minimal label Feb 22, 2026
@troglobit troglobit marked this pull request as ready for review February 22, 2026 11:24

@troglobit troglobit force-pushed the initviz branch 2 times, most recently from b8ae1be to 98772d7 Compare March 2, 2026 10:18
@troglobit troglobit force-pushed the initviz branch 2 times, most recently from fbe6691 to daab75e Compare March 17, 2026 13:49
@troglobit troglobit marked this pull request as draft March 17, 2026 13:49
 - The biggest changes are syncing with latest BusyBox (busybox-update-config)
 - Disable optimize for size
 - Enable feature "SH_NOFORK" which allows /bin/sh to call applet_main()
   directly without having to fork+exec busybox

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
The v2025.01 release supports the Microchip SamA7G5* eval kit(s), which
means we can enjoy the same patch level of U-Boot as other Infix boards

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
For details, see:
 - linux4sam/u-boot-at91@23ac019
 - linux4sam/linux-at91@5b35500

U-Boot patches imported and refreshed in local KernelKit fork of U-Boot,
see https://github.com/kernelkit/u-boot/tree/v2025.01-kkit

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Backport fixes from upstream post v4.16 release.  Mainly to fix
mdns-alias crash+restart counter issue when avahi-daemon has to
be restarted.  Finit did not properly clear the dependency that
mdns-alias had on avahi-daemon, causing it to crash and have its
restart counter incremented.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
 - Markdown syntax
 - Grammar fixes
 - Use lowdown's admonition syntax
 - Update examples

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
On single-core Cortex-A7, YANG bootstrap and config loading are the
dominant boot bottleneck.  The current sequence spawns three serial
phases (bootstrap, sysrepo-plugind, load), each performing independent
sr_connect()/sr_disconnect() cycles, and every sysrepoctl/sysrepocfg
invocation is a fork+exec that rebuilds SHM from scratch.

Replace all three with a single confd binary that does one sr_connect()
and performs all datastore operations in-process:

 - Wipe stale /dev/shm/sr_* for a clean slate
 - sr_install_factory_config() from the generated JSON
 - Smart migration: compare config version via libjansson, only
   fork+exec the migrate script when versions actually differ
 - Load startup-config (or test-config) via lyd_parse_data() +
   sr_replace_config(), mirroring what sysrepocfg -I does internally
 - On failure: revert to factory-default, load failure-config, set
   login banners (Fail Secure mode)
 - On first boot: copy factory-default to running, export to file
 - dlopen plugins and enter event loop

The bootstrap shell script is split: config generation (gen-hostname,
gen-interfaces, etc.) stays in the new gen-config script, while all
sysrepo operations move into the C daemon.  The finit boot sequence
collapses from 5 stanzas to 2 (gen-config -> confd).
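
The "smart migration" step above can be illustrated with a minimal sketch. It is written in Python for brevity, while the daemon itself does this in C with libjansson. The JSON key `infix-meta:meta` / `version` and the `migrate_cmd` argument are assumptions made for the example, not taken from this commit.

```python
import json
import subprocess


def config_version(path):
    """Return the version string stored in a JSON config file, or None."""
    with open(path, encoding="utf-8") as fp:
        data = json.load(fp)
    # Assumption: the version is kept under a top-level "infix-meta:meta" key.
    return data.get("infix-meta:meta", {}).get("version")


def maybe_migrate(path, current_version, migrate_cmd):
    """Run the (hypothetical) migrate command only when versions differ.

    Returns True if the migrate command was executed, False if it was skipped.
    """
    if config_version(path) == current_version:
        return False  # versions match: no fork+exec needed
    subprocess.run([*migrate_cmd, path], check=True)
    return True
```

The point of the comparison is that the common case, an already up-to-date startup-config, costs one JSON parse instead of a fork+exec.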

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Infix technically does not need to start dbus and dnsmasq before confd
has loaded the system startup-config.  So we start them later to save
some CPU cycles for confd & Co.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Also, we don't need to start a logger process for statd, it behaves
nicely and uses syslog.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Note, rousette does not support SIGHUP, so let's mark it as such.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Allow confd to start even earlier at boot and call 'gen-config' as a
background task, waiting for it to complete before we load the sysrepo
factory datastore.

Also, add Finit style progress to console so users can see the phases
of the bootstrap process.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Follow-up to 90f619b which first introduced the /etc/hostname.d concept.

This commit moves the setup of /etc/hostname.d to post-build.sh and drops
the initial call to the hostname activation script, since it is
called anyway after bootstrap has finished.  The script is also given a
bit of a refresh, reducing overhead and needless log messages.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Relocate probe of wifi radios from gen-hardware to 00-probe.  This saves
one python invocation and some precious CPU cycles at boot.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
netopeer2 and rousette were starting as soon as confd had a PID, while
confd was still loading the startup config via sr_replace_config().  This
held a write lock on the running datastore.  A test calling test/reset
shortly after the NETCONF port opened would have its own sr_replace_config()
block on that lock for up to 60 s, at which point sysrepo timed out the
whole RPC with "SHM event 'rpc' processing timed out".

Fix by adding the usr/bootstrap condition to both management servers so
they don't accept connections until confd signals that bootstrap is complete.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Neither the Raspberry Pi 2B nor the Microchip SAMA7G54-EK board has WiFi
hardware by default, so drop WiFi and GPS support to allow for smaller
builds with only the bare essentials in kernel and system.

Minimal builds in general don't need WiFi or GPS either, so let's disable
them from all.  This may become a central theme going forward, keeping
the minimal builds ... minimal.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
 - Add x509-public-key-format identity to crypto-types
 - Add certificate node to web services container
 - Use certificate from ietf-keystore as web cert

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Use SR_SUBSCR_NO_THREAD for all subscriptions and integrate sysrepo
event pipes into a libev event loop.  This eliminates approximately 30
per-subscription threads, reducing overhead on embedded ARM hardware.

A temporary poll-based "event pump" thread handles callback dispatch
during bootstrap (where sr_replace_config blocks waiting for callbacks),
then exits.  After bootstrap, the single-threaded libev loop takes over
for steady-state event processing.

Note, the confd-test-mode plugin still requires use of threads so we do
not create deadlocks when calling sr_replace_config().
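
As an illustration of the idea only, here is a minimal Python sketch of one thread servicing many event pipes. The real daemon is C with libev, and presumably pairs each subscription's event pipe with sysrepo's event processing call; the `sources` mapping and the callbacks below are hypothetical stand-ins.

```python
import os
import selectors


def run_event_loop(sources, max_idle_rounds=1):
    """Dispatch callbacks for readable file descriptors from a single thread.

    sources maps a readable fd to a callback that receives the bytes read.
    The loop returns after max_idle_rounds consecutive timeouts, which keeps
    the sketch finite; a daemon would loop forever instead.
    """
    sel = selectors.DefaultSelector()
    for fd, callback in sources.items():
        sel.register(fd, selectors.EVENT_READ, callback)

    idle = 0
    while idle < max_idle_rounds:
        events = sel.select(timeout=0.1)
        if not events:
            idle += 1
            continue
        idle = 0
        for key, _mask in events:
            # One watcher per pipe replaces one thread per subscription.
            key.data(os.read(key.fd, 4096))
    sel.close()
```

With this shape, adding a subscription adds one watched descriptor rather than one thread, which is where the savings on small ARM systems come from.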

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Only install the keys on CHANGE event, fixes this annoying issue:

Nov 5 01:32:10 ix confd[2011]: Installing HTTPS gencert certificate "self-signed"
Nov 5 01:32:10 ix confd[2011]: Installing SSH host key "genkey".
Nov 5 01:32:11 ix confd[2011]: Installing HTTPS gencert certificate "self-signed
Nov 5 01:32:11 ix confd[2011]: Installing SSH host key "genkey".

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Replace logging + logging.handlers with a lightweight syslog wrapper,
and argparse with manual argv parsing.  On a sama7g54, this cuts yanger
startup from ~770ms to ~470ms by eliminating ~300ms of stdlib imports.
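
A minimal sketch of what such a replacement can look like, assuming a hypothetical command line of the form `[-p PARAM] MODEL`; the class and option names are illustrative and not taken from yanger itself.

```python
import sys
import syslog


class Log:
    """Thin wrapper over the syslog module, avoiding the logging package."""

    def __init__(self, ident):
        syslog.openlog(ident, syslog.LOG_PID, syslog.LOG_DAEMON)

    def error(self, msg):
        syslog.syslog(syslog.LOG_ERR, msg)

    def info(self, msg):
        syslog.syslog(syslog.LOG_INFO, msg)


def parse_args(argv):
    """Parse '[-p PARAM] MODEL' by hand instead of importing argparse."""
    opts = {"param": None, "model": None}
    args = list(argv)
    while args:
        arg = args.pop(0)
        if arg == "-p":
            if not args:
                raise SystemExit("option -p requires an argument")
            opts["param"] = args.pop(0)
        elif arg.startswith("-"):
            raise SystemExit(f"unknown option: {arg}")
        elif opts["model"] is None:
            opts["model"] = arg
        else:
            raise SystemExit(f"unexpected argument: {arg}")
    if opts["model"] is None:
        raise SystemExit("usage: [-p PARAM] MODEL")
    return opts


if __name__ == "__main__":
    print(parse_args(sys.argv[1:]))
```

The trade-off is losing argparse's generated help text in exchange for skipping its import cost on every invocation.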

Also batch external command invocations:

 - ietf_routing: two sysctl calls instead of two per interface
 - ietf_hardware: one ls per hwmon device instead of six
 - bridge: fetch mctl querier data once instead of once per VLAN

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
 - Use same log framework as the rest of confd
 - Use existing primitives from libite + libsrx
 - Drop remaining pthreads
 - Coding style fixes

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
When the mdns service is stop-started (e.g. after a config change),
statd's avahi client fires AVAHI_CLIENT_FAILURE momentarily.  With
AVAHI_CLIENT_NO_FAIL the client reconnects automatically, but the
immediate ERROR log is misleading:

  ERROR: avahi: client failure: Daemon connection failed

New behavior:
- On AVAHI_CLIENT_FAILURE: start a 2 s deferred timer (no immediate log)
- Timer fires up to 3 times (~6 s total); on the 3rd attempt, check if
  mDNS is enabled in the running config via a temporary sysrepo session
- Log ERROR only if the daemon is still down AND mDNS is enabled
- On AVAHI_CLIENT_S_RUNNING: cancel the timer, reset the counter, and
  log NOTE "mDNS daemon reconnected" if a failure had been seen

This silences the error entirely when the operator has disabled mDNS
(expected), and defers it by ~6 s for a brief restart (self-heals
before the timer fires).
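
The behavior described above amounts to a small state machine. A minimal Python sketch follows; statd itself is C, and the `is_enabled` and `log` callables as well as the exact ERROR text are hypothetical stand-ins for the sysrepo lookup and the logger.

```python
RETRY_INTERVAL = 2  # seconds between timer callbacks (driven by the caller)
MAX_ATTEMPTS = 3    # ~6 s in total before an error may be logged


class MdnsMonitor:
    def __init__(self, is_enabled, log):
        self.is_enabled = is_enabled  # callable returning True if mDNS is enabled
        self.log = log                # callable taking (level, message)
        self.attempts = 0
        self.timer_armed = False
        self.failure_seen = False

    def on_failure(self):
        """Client failure: arm the deferred timer, log nothing yet."""
        self.failure_seen = True
        if not self.timer_armed:
            self.attempts = 0
            self.timer_armed = True

    def on_timer(self):
        """Timer callback, expected every RETRY_INTERVAL seconds while armed."""
        if not self.timer_armed:
            return
        self.attempts += 1
        if self.attempts < MAX_ATTEMPTS:
            return
        self.timer_armed = False
        # Still down after the last attempt: only an error if mDNS is wanted.
        if self.is_enabled():
            self.log("ERROR", "mdns: daemon connection failed")

    def on_running(self):
        """Client reconnected: cancel the timer and reset the counter."""
        self.timer_armed = False
        self.attempts = 0
        if self.failure_seen:
            self.failure_seen = False
            self.log("NOTE", "mDNS daemon reconnected")
```

The config lookup happens only on the final attempt, so a brief restart never opens a sysrepo session at all.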

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
The operator sees Infix through YANG models and should not need to know
which library implements a given feature.  Rename the public-facing
parts of the avahi module to use the mdns vocabulary:

- Log strings: "avahi: ..." → "mdns: ..."
- Public API:  avahi_ctx_init/exit → mdns_ctx_init/exit
- Main type:   struct avahi_ctx → struct mdns_ctx
- statd field: statd.avahi → statd.mdns

Internal types (struct avahi_neighbor, avahi_service, …) and file names
(avahi.c, avahi.h) are kept as-is — developers debugging at the C level
benefit from knowing the underlying implementation.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Save a few CPU cycles by skipping a new dagger generation when no
interfaces have been modified/added/deleted.

Uses d->next_fp as the sentinel: NULL means no claim was made for this
transaction.  dagger_evolve() and dagger_abandon() now NULL it after
fclose, so subsequent unclaimed transactions also get the clean early
return.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Previously svc_enadis() would call 'initctl enable + touch' for every
config change event, even when the service's enabled state was unchanged.
This caused rousette to be unnecessarily restarted on every test_reset(),
racing with the active RESTCONF connection on slow hardware.

Replace svc_enadis() and svc_change() with svc_enable() which only
manages nginx symlinks and calls 'initctl enable/disable' -- never touch.
Each handler now checks the diff for the specific leaf that changed:

- If /enabled appears in diff: call svc_enable() to start or stop it
- If other config leaves changed with service already enabled: touch only

This ensures rousette, ttyd, netbrowse, avahi, sshd, and lldpd are only
restarted when their configuration actually requires it.
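
The per-handler decision can be summarized with a small sketch. It is in Python for illustration only; the handlers are C, and `diff_leaves` is a hypothetical set of leaf names changed for one service in the current transaction.

```python
def service_action(diff_leaves, enabled):
    """Decide what a service handler should do for one transaction.

    diff_leaves: names of the leaves that changed for this service
    enabled:     the service's enabled state in the new configuration
    """
    if not diff_leaves:
        return "none"      # nothing changed: never restart
    if "enabled" in diff_leaves:
        return "enable" if enabled else "disable"
    if enabled:
        return "touch"     # config changed while running: schedule a reload
    return "none"          # config changed but service is off
```

For example, a test reset that leaves a service's subtree untouched yields an empty diff and therefore no restart.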

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Add a finit_enable/disable/reload() family in core.c that directly
manipulates Finit's service state without fork+exec overhead:

  finit_enable(svc)  -- create symlink in /etc/finit.d/enabled/
  finit_disable(svc) -- remove symlink from /etc/finit.d/enabled/
  finit_delete(svc)  -- remove both symlink and service entirely
  finit_reload(svc)  -- utimensat() on .conf to schedule reload

Printf-style variants (finit_enablef/disablef/reloadf) handle
template instance names such as container@foo and hostapd@wlan0.

All systemf("initctl ... enable/disable/touch ...") call sites across
containers, dhcp-server, firewall, hardware, ntp, routing, services,
syslog, and system are converted to the new API.
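
A rough Python sketch of the same idea, for illustration only since the real helpers are C in core.c. The `root` parameter, the `available/` directory as symlink target, and the choice of which .conf file gets touched are assumptions made so the example can run against a scratch directory.

```python
import os


def finit_enable(root, svc):
    """Create (or refresh) the symlink <root>/enabled/<svc>.conf."""
    link = os.path.join(root, "enabled", f"{svc}.conf")
    if os.path.lexists(link):
        os.unlink(link)
    # Assumption: service files live in a sibling "available" directory.
    os.symlink(os.path.join("..", "available", f"{svc}.conf"), link)


def finit_disable(root, svc):
    """Remove the symlink; a missing link is not an error."""
    try:
        os.unlink(os.path.join(root, "enabled", f"{svc}.conf"))
    except FileNotFoundError:
        pass


def finit_reload(root, svc):
    """Bump the .conf timestamps to now, the same effect as a touch."""
    os.utime(os.path.join(root, "available", f"{svc}.conf"))
```

Each call is a single file-system operation in the calling process, where the previous approach forked a shell and initctl for the same result.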

As a related cleanup in services.c, drop the remaining srx_enabled()
calls in favour of reading the already-fetched config tree directly
via lydx_is_enabled(lydx_get_xpathf(config, ...)), eliminating the
last sysrepo round-trips from that module.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Replace remaining systemf() calls that invoke simple file-system
operations with direct C API equivalents, eliminating unnecessary
fork/exec overhead:

 - mkdir -p     → mkpath() from libite
 - ln -sf       → erase() + symlink() from libite/POSIX
 - rm -rf       → rmrf() from libsrx helpers
 - rm -f dir/*  → rmrf() + mkpath() to clear and recreate the dir

Files updated: dagger.c, containers.c, firewall.c, services.c, system.c

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
After a successful bootstrap, confd writes a sentinel to /run/confd.boot.
If Finit restarts confd — whether after a crash or a clean exit — the
sentinel is found and the destructive bootstrap phases are skipped:

 - gen-config fork (factory/failure configs already exist)
 - wipe_sysrepo_shm() (other daemons, e.g. statd, are live)
 - sr_install_factory_config() (datastores are already initialised)
 - sr_replace_config(NULL, NULL) (running datastore is consistent)
 - bootstrap_config() / load startup-config (not needed; sysrepo has
   the right state; plugins resync via SR_EV_ENABLED on re-subscribe)

On restart confd reconnects to sysrepo, re-initialises plugins (which
re-subscribe and receive SR_EV_ENABLED to resync with the live running
datastore), then enters the steady-state event loop.

The sentinel lives on tmpfs so a real reboot always produces a clean
slate.  Crash-loop protection is delegated to Finit's max-restarts (10).

As a side-effect this also enables a future "run-once" mode for resource
constrained systems: confd can exit after bootstrap and the sentinel
ensures any later restart just re-attaches without re-bootstrapping.
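
The control flow can be sketched as follows, in Python for illustration only. The `bootstrap` and `run` callables are hypothetical stand-ins for the destructive bootstrap phases and for the shared "connect, init plugins, enter event loop" path.

```python
import os


def confd_start(sentinel, bootstrap, run):
    """Bootstrap only when the sentinel is missing, then run as usual.

    Returns "boot" on a first start and "restart" when re-attaching.
    """
    first_boot = not os.path.exists(sentinel)
    if first_boot:
        bootstrap()
        # Written only after bootstrap succeeded, so a failed bootstrap
        # is retried in full on the next start.
        open(sentinel, "w").close()
    run()
    return "boot" if first_boot else "restart"
```

Because the sentinel path is on tmpfs, a reboot removes it and the first start after power-up always takes the "boot" branch.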

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
test_reset() triggers a config reload which causes services such as
rousette to restart.  Wait for the transport to become reachable again
before returning from attach(), preventing subsequent API calls from
racing with a still-restarting backend.
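
One way to implement such a wait is a plain TCP connect poll. A minimal sketch, assuming the transport is reachable when its port accepts connections; the function name and defaults are hypothetical and not the test framework's actual API.

```python
import socket
import time


def wait_reachable(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connect to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: the backend is probably still restarting.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port only shows that something is listening, so a caller may still want one cheap protocol-level request before trusting the session.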

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
A service restart triggered by finit during sysrepo callbacks can drop
an in-flight HTTP connection, causing copy("candidate", "running") to
fail with RemoteDisconnected.  Add retry logic consistent with the
existing PATCH retry pattern in put_config_dict().
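
A minimal sketch of such retry logic, assuming the caller passes the operation as a zero-argument callable. The helper name and defaults are hypothetical. Note that `http.client.RemoteDisconnected` is a subclass of the built-in `ConnectionError`; an HTTP library that wraps it in its own exception type would need that type added to the `except` clause.

```python
import time
from http.client import RemoteDisconnected


def with_retry(operation, attempts=3, delay=1.0):
    """Call operation(), retrying when the peer drops the connection."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Covers RemoteDisconnected; re-raise once attempts are used up.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


def demo():
    """Usage example: fail twice with RemoteDisconnected, then succeed."""
    state = {"calls": 0}

    def flaky_copy():
        state["calls"] += 1
        if state["calls"] < 3:
            raise RemoteDisconnected("Remote end closed connection")
        return "ok"

    return with_retry(flaky_copy, attempts=3, delay=0), state["calls"]
```

Retrying is only safe here because copying candidate to running is idempotent: repeating it after a dropped connection gives the same end state.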

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Replace fragile SSH+jq approach for reading the chassis MAC with a
NETCONF query to ietf-hardware:hardware/component[name='mainboard'],
reading phys-address directly from the YANG model.

Also add until() polling to the chassis MAC and chassis+offset MAC
verification steps, consistent with the reset-to-default steps.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
has_fix() only checks that fix-mode is set (2d/3d), but altitude and
other fields may not yet be populated in the operational datastore when
gpsd is still processing its first NMEA cycles after boot.  Calling
verify_position() immediately after has_fix() passes can therefore race
and fail with:

  KeyError: 'altitude'

This manifests reliably on the second GPS receiver (gps1) after reboot,
because it is initialized slightly later than gps0 and hits the window
where fix-mode is set but altitude has not yet appeared.

  not ok 11 - Verify gps1 position is near the coordinates
  # Traceback (most recent call last):
  #   File "test/case/hardware/gps_simple/test.py", line 29, in verify_position
  #     alt = float(state["altitude"])
  #                 ~~~~~^^^^^^^^^^^^
  # KeyError: 'altitude'

Add has_position() to infamy/gps.py, which gates on fix-mode AND all
position fields (latitude, longitude, altitude, satellites-used) being
present.  Replace the has_fix() polls in both the pre- and post-reboot
verify steps with has_position().
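
A sketch of the gating logic, assuming the operational state of one receiver is available as a plain dict; the actual helper in infamy/gps.py may take different arguments.

```python
POSITION_FIELDS = ("latitude", "longitude", "altitude", "satellites-used")


def has_position(state):
    """True once a 2d/3d fix is reported AND every position field is present."""
    if state.get("fix-mode") not in ("2d", "3d"):
        return False
    return all(field in state for field in POSITION_FIELDS)
```

With this gate, the state that triggered the KeyError above (fix-mode set, altitude missing) simply makes the poll try again.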

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Also, update testing-overview.svg to support dark mode view.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>