11 Deployment

This page is the concise operator-facing entry point for SysNDD deployment.

11.1 Quick Start

git clone https://github.com/berntpopp/sysndd.git
cd sysndd
cp .env.example .env
# edit .env
docker compose up -d

Legacy archive-downloader deployment scripts are not part of the supported deployment path. Do not use unverified downloaded shell scripts to provision runtime configuration.

11.2 Key Runtime Settings

api/config.yml

The production API image does not include api/config.yml. Provide runtime configuration through the Compose read-only mount, an operator secret, or an equivalent deployment-specific config injection mechanism. Never re-add COPY config.yml config.yml to api/Dockerfile; local credentials can otherwise be baked into image layers.

MIRAI_WORKERS

Controls background worker count for long-running jobs.

  • small server: 1
  • medium server: 2
  • large server: 4

Rule of thumb:

Peak memory ~= 500 MB base + workers x 2 GB

DB_POOL_SIZE

Controls the database connection pool.

Recommended baseline:

  • MIRAI_WORKERS=1 -> DB_POOL_SIZE=3-5
  • MIRAI_WORKERS=2 -> DB_POOL_SIZE=5-7
  • MIRAI_WORKERS=4 -> DB_POOL_SIZE=10-12
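The two sizing rules above can be turned into a quick host check. This is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes 1 GB is counted as 1024 MB.

```shell
# Estimate peak memory in MB from the rule of thumb:
# 500 MB base + workers x 2 GB (assumption: 1 GB = 1024 MB).
estimate_peak_mb() {
  workers=$1
  echo $((500 + workers * 2048))
}

# Map a worker count to the recommended DB_POOL_SIZE range listed above.
# Worker counts outside the documented baseline print "unknown".
recommended_pool_range() {
  case "$1" in
    1) echo "3-5" ;;
    2) echo "5-7" ;;
    4) echo "10-12" ;;
    *) echo "unknown" ;;
  esac
}
```

For a medium server, `estimate_peak_mb 2` prints the estimated peak in MB and `recommended_pool_range 2` prints the matching pool range, which can then be compared against the memory actually available to the worker container.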

CACHE_VERSION

Increment CACHE_VERSION when cached function behavior or cached result shape changes and you need cache invalidation on next startup.

External genomic proxy caches live under /app/cache/external/{static,stable,dynamic} by default and can be relocated with EXTERNAL_PROXY_CACHE_DIR. The proxy layer caches successful and true not-found responses, but transient upstream errors (error = TRUE, mapped to 503) are evicted immediately so a timeout does not poison the cache for the full 7/14/30-day source TTL.

External provider request budgets default to short fail-fast values: EXTERNAL_PROXY_TIMEOUT_SECONDS=6, EXTERNAL_PROXY_MAX_SECONDS=10, EXTERNAL_PROXY_MAX_TRIES=2, and EXTERNAL_PROXY_AGGREGATE_MAX_SECONDS=12. Override per source with names such as EXTERNAL_PROXY_MGI_TIMEOUT_SECONDS, EXTERNAL_PROXY_MGI_MAX_SECONDS, and EXTERNAL_PROXY_MGI_MAX_TRIES. The same per-source pattern covers the two-step and batch providers: EXTERNAL_PROXY_UNIPROT_* (its features fetch now uses the budget instead of a 30–120s window), EXTERNAL_PROXY_GENEREVIEWS_* (NCBI E-utilities), and EXTERNAL_PROXY_GNOMAD_BATCH_* (worker-only batch path, higher defaults of 20s timeout / 30s window / 3 tries). The aggregate external gene route remains serial and returns partial = TRUE with skipped_sources when the aggregate budget is exhausted.
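The per-source override pattern can be illustrated with a small lookup. This is a sketch of the naming convention only: the real resolution happens inside the API, the function name is hypothetical, and the precedence shown (per-source value, then global value, then built-in default) is an assumption drawn from the description above.

```shell
# Resolve the effective budget for one source and one setting.
# $1 = source name in upper case (e.g. MGI)
# $2 = setting suffix (e.g. TIMEOUT_SECONDS)
# $3 = built-in default used when neither variable is set
effective_budget() {
  src=$1
  setting=$2
  fallback=$3
  # Indirect variable lookup; inputs are fixed identifiers, not user data.
  eval "per_source=\${EXTERNAL_PROXY_${src}_${setting}:-}"
  eval "global=\${EXTERNAL_PROXY_${setting}:-}"
  if [ -n "$per_source" ]; then
    echo "$per_source"
  elif [ -n "$global" ]; then
    echo "$global"
  else
    echo "$fallback"
  fi
}
```

With `EXTERNAL_PROXY_TIMEOUT_SECONDS=8` and `EXTERNAL_PROXY_MGI_TIMEOUT_SECONDS=4` set, `effective_budget MGI TIMEOUT_SECONDS 6` prints the MGI-specific value while other sources fall back to the global one.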

Beyond the per-source and aggregate budgets, a per-request external-time ceiling caps the total time any single request may spend in external calls: EXTERNAL_PROXY_REQUEST_MAX_SECONDS (default 15s). Once a request crosses it, subsequent external fetches short-circuit to a degraded 503 (request_budget_exceeded = TRUE) without contacting the upstream, so even a request that touches several providers cannot occupy a worker indefinitely. This is independent of the 12s aggregate budget, which only governs the multi-source /api/external/gene/<symbol> route. Per-request timing is logged by the postroute hook as [request-timing] method=<m> path=<p> status=<http> duration_ms=<n> external_ms=<n> slow=<bool> (to the API log file); external_ms is the wall time that request spent in external providers (0 for cheap routes), and slow=true flags requests over API_SLOW_REQUEST_MS (default 2000). Use external_ms to confirm whether a slow request was slow because of an upstream provider.
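To check whether a slow request was upstream-bound, the [request-timing] fields can be compared directly. The helpers below are a hypothetical sketch that assumes the line format quoted above, with space-separated key=value pairs and no spaces inside values.

```shell
# Print the value of one key=value field from a single log line.
# $1 = key, $2 = log line
log_field() {
  printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Print the integer percentage of a request's duration that was spent
# in external providers (external_ms / duration_ms).
external_share_pct() {
  duration=$(log_field duration_ms "$1")
  external=$(log_field external_ms "$1")
  # Guard against missing fields and division by zero.
  [ "${duration:-0}" -gt 0 ] || { echo 0; return; }
  echo $(( ${external:-0} * 100 / duration ))
}
```

A high percentage on a line with slow=true points at an upstream provider; a low percentage suggests the time was spent in the API or database instead.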

Each external provider emits a structured timing line on stderr of the form [external-proxy] source=<provider> event=complete status=<http> elapsed_ms=<n> cache=<hit|miss> (transient failures additionally log event=error_not_cached). gnomad, ensembl, uniprot, and alphafold log this at the memoise chokepoint, while mgi and rgd log it from their inline timing wrapper. Use elapsed_ms to spot upstream slowdowns, cache=hit/cache=miss to confirm the disk caches are serving traffic, and status to track 404/503 rates per source. The log is cheap (one cache-key probe plus two clock reads) and adds no latency on the hot path.

This per-request fast-fail plus observability bounds how long any single external request can occupy an API worker; true cross-request isolation between heavy and light routes (separate worker pools / queue) is tracked in issue #154. Until then, run the API with more than one replica and non-sticky routing so a worker held by a slow request does not stall cheap routes such as /api/health/, auth, and stats.

The curator GeneReviews coverage feature (/api/genereviews, Curator+) resolves GeneReviews availability through NCBI E-utilities and caches it in the same success-only external cache (30-day static TTL). The API container therefore needs outbound egress to eutils.ncbi.nlm.nih.gov when curators run the live availability pass or attach a GeneReviews reference. NCBI credentials are optional: set NCBI_API_KEY and NCBI_EUTILS_EMAIL to raise NCBI rate limits; anonymous low-volume use works without them. The cheap (already-linked) coverage view and CSV export make no external calls.
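The [external-proxy] timing lines can be summarized per source to confirm that the disk caches are serving traffic. This is a minimal sketch assuming the line format quoted in this section; the function name is hypothetical and it reads log lines from standard input.

```shell
# Summarize cache hits per source from [external-proxy] event=complete lines.
# Output: one "<source> hits=<n> total=<n>" line per source, sorted by name.
cache_hit_summary() {
  awk '/\[external-proxy\]/ && /event=complete/ {
    src = ""; cache = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^source=/) src = substr($i, 8)    # strip "source="
      if ($i ~ /^cache=/)  cache = substr($i, 7)  # strip "cache="
    }
    if (src == "") next
    total[src]++
    if (cache == "hit") hits[src]++
  }
  END {
    for (s in total) printf "%s hits=%d total=%d\n", s, hits[s], total[s]
  }' | sort
}
```

Piping the API container's stderr through `cache_hit_summary` gives a quick per-provider view; a source with many requests and no hits is worth a closer look.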

Operational log cleanup (log-cleanup)

The log-cleanup Compose service prunes old rows from the operational request log table (logging) on a daily schedule so the table does not grow unbounded. It reuses the API image (so it shares the renv dependencies, RMariaDB, and the existing connection-pool/config helpers) and connects over the internal backend network only — it needs the database but no outbound egress. The service runs a small no-root scheduler loop that invokes api/scripts/delete_old_logs.R once per day; the script delegates to the unit-tested helpers in api/functions/log-cleanup.R.

Configuration (environment variables, with defaults):

  • LOG_RETENTION_DAYS=30 — delete logging rows whose timestamp is older than this many days. Validated to a positive integer before it reaches SQL.
  • LOG_CLEANUP_AT=03:00 — daily run time, HH:MM in container (UTC) time.
  • LOG_CLEANUP_DRY_RUN=false — when truthy (1/true/yes/on), count and log the candidate rows but delete nothing. Use this to verify scope before enabling deletion.

Only the high-volume logging table is pruned. Low-volume audit tables (for example llm_generation_log, async_job_events) are intentionally left alone; async_job_events already cascades from async_jobs. The script exits non-zero on failure and the scheduler logs and continues to the next cycle rather than crash-looping.
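The validation rules for the settings above can be sketched in shell for use in a pre-deploy check. These helpers are hypothetical and do not mirror the R implementation; in particular, treating the truthy values as case-insensitive is an assumption.

```shell
# Succeed when the value is one of the documented truthy spellings
# (1/true/yes/on). Case-insensitive matching is an assumption.
is_truthy() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    1|true|yes|on) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed only for a positive integer made of digits, so the value is
# safe to pass on as a retention window in days.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}
```

A wrapper could refuse to start the cleanup service when `is_positive_int "$LOG_RETENTION_DAYS"` fails, and print a reminder when `is_truthy "$LOG_CLEANUP_DRY_RUN"` succeeds so nobody expects rows to disappear.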

PubtatorNDD nightly refresh (pubtatornidd-cron)

The pubtatornidd-cron Compose service keeps the PubtatorNDD analysis current automatically. It is a dumb scheduler: once per night it enqueues a single durable pubtatornidd_nightly async job (via api/scripts/pubtatornidd_nightly_enqueue.R) and exits the run. The existing worker service — which already has the PubTator/PubMed egress — claims and runs the actual refresh (orchestrator in api/functions/pubtatornidd-nightly.R), so all retries, single-flight locking, and history live there. Like log-cleanup it reuses the API image and connects over the internal backend network only (it needs the database to enqueue, not egress).

Each run, the worker-side orchestrator: single-flights via a non-blocking MySQL advisory lock (GET_LOCK('pubtatornidd_nightly', 0)) so overlapping runs skip cleanly; resolves the standing query (job payload → PUBTATORNDD_NIGHTLY_QUERY → most-recent cached query); incrementally fetches new publications (soft page-watermark, ≤3 req/s); refreshes the per-gene enrichment snapshot; and refreshes the precomputed gene-summary table when present. The structured run summary is persisted in the job result_json for observability; a failed refresh step marks the job failed.

Configuration (environment variables, with defaults):

  • PUBTATORNDD_NIGHTLY_AT=02:30 — daily enqueue time, HH:MM in container (UTC) time.
  • PUBTATORNDD_NIGHTLY_QUERY= — optional PubTator query override for the standing corpus. When empty, the worker refreshes the most-recently-cached query in pubtator_query_cache.
  • PUBTATORNDD_NIGHTLY_MAX_PAGES= — optional page cap for the incremental fetch (defaults to 50 inside the worker).
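A once-per-day scheduler of this kind mostly needs to know how long to sleep until the next HH:MM. The sketch below shows one way to compute that; the function names are hypothetical, it assumes two-digit HH:MM values as documented, and it is not the scheduler loop shipped in the image.

```shell
# Convert a two-digit HH:MM value to minutes since midnight.
to_minutes() {
  h=${1%%:*}
  m=${1##*:}
  # Strip one leading zero so values like 08 are not read as octal.
  h=${h#0}
  m=${m#0}
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Print the seconds from "now" ($1) until the next occurrence of the
# target time ($2). If the target equals now, wait a full day.
seconds_until() {
  now=$(to_minutes "$1")
  target=$(to_minutes "$2")
  diff=$(( target - now ))
  [ "$diff" -gt 0 ] || diff=$(( diff + 1440 ))
  echo $(( diff * 60 ))
}
```

In a loop this could be used as `sleep "$(seconds_until "$(date -u +%H:%M)" "${PUBTATORNDD_NIGHTLY_AT:-02:30}")"`, which matches the container (UTC) time convention described above. The same calculation applies to LOG_CLEANUP_AT.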

The worker resets the per-request external-time accumulator at the start of every job, and the enrichment batch additionally resets it per external call, so the per-request external ceiling (EXTERNAL_PROXY_REQUEST_MAX_SECONDS) — intended for public request paths — does not short-circuit this legitimately external-heavy nightly batch.

Database version (DB_VERSION / DB_COMMIT)

The human-facing database version (issue #22) is tracked in the single-row db_version table (migration 028_add_db_version.sql), separate from the migration runner’s schema_version apply ledger and from about_content.version. The migration seeds a baseline semantic version, and the API exposes it in the database block of the public GET /api/version response (semantic version, last db/-folder git commit, optional description/updated_at, and an available flag). The App surfaces it on the About page. The endpoint degrades gracefully: if the DB or table is unreachable it reports version/commit as "unknown" and available: false instead of failing.

To stamp the deployed values at release time, set DB_VERSION (semantic major.minor.patch) and/or DB_COMMIT (last db/-folder git short hash) in the API container environment. The running container has no git checkout, so capture them on a host that has the repo:

# Prints DB_VERSION=<semver> and DB_COMMIT=<short-hash> for the current checkout.
./db/scripts/update-db-version.sh            # version from the seeded migration
./db/scripts/update-db-version.sh 1.1.0      # pin a specific semantic version
./db/scripts/update-db-version.sh 1.1.0 >> .env   # inject, then redeploy

docker-compose.yml passes DB_VERSION and DB_COMMIT through to the api service. On startup, after migrations, db_version_sync_from_env() updates the db_version row (id = 1) when either variable is set; it is a non-fatal no-op otherwise. Bump the seeded version (in a new NNN_*.sql migration) when the DB schema or core seed data changes meaningfully.
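Before writing DB_VERSION into .env it can help to confirm that the value is a plain major.minor.patch string. The check below is a hypothetical helper, not part of update-db-version.sh, and it deliberately rejects prefixes and pre-release suffixes.

```shell
# Succeed when $1 is a semantic version of the form major.minor.patch
# with no leading zeros, prefix, or suffix.
is_semver() {
  printf '%s\n' "$1" |
    grep -Eq '^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$'
}
```

For example, `is_semver "$DB_VERSION" || echo "DB_VERSION is not major.minor.patch" >&2` can guard a release script.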

Public analysis snapshots

Public analysis endpoints and MCP analysis tools read public-ready rows from analysis_snapshot_manifest and normalized snapshot payload tables. On a request-path miss they do not compute STRING networks, phenotype clusters, correlations, or fCoSE layouts, and they do not call external providers or generate Gemini summaries.

After curated public data changes, submit analysis_snapshot_refresh durable jobs for the supported presets and let the worker build and activate snapshots. Activation is scoped to one public-ready row per (analysis_type, parameter_hash), so refreshing one preset does not replace another preset. Refresh jobs must use approved-public inputs only.

A fresh deploy bootstraps the snapshots automatically (#420): after migrations, start_sysndd_api.R runs analysis_snapshot_bootstrap_on_startup(), which enqueues a refresh job for any supported preset that has no active public-ready snapshot. It is idempotent (a restart with snapshots already present enqueues nothing), dedup-safe, never crashes boot, and is gated by ANALYSIS_SNAPSHOT_BOOTSTRAP_ON_STARTUP (default true; set to false to disable). The worker must be running to consume the jobs.

To reduce first-start contention on a small host (#447), the startup bootstrap staggers heavy builds: the heavy functional_clusters build is enqueued with a scheduled_at offset (ANALYSIS_SNAPSHOT_BOOTSTRAP_STAGGER_SECONDS, default 120; set 0 to disable) so it is not claim-eligible at the same instant as the cheap presets, and the PubtatorNDD startup bootstrap is offset separately (PUBTATORNIDD_BOOTSTRAP_STAGGER_SECONDS, default 240) so it does not co-launch with the snapshot bootstrap. Only the automatic startup path staggers — the admin force refresh and the operator script submit immediately, so a manual rebuild is never delayed. These knobs only affect scheduling; they require no DB schema change and no extra worker.

There are three ways to (re)build snapshots, all sharing one submit function:

  • Automatic — the startup bootstrap above.

  • Admin HTTP (no SSH/docker needed): POST /api/admin/analysis/snapshots/refresh (Administrator token) submits the jobs and returns the job ids; pass {"force": true} to rebuild even when a current snapshot exists, or {"analysis_type": "gene_network_edges"} to target one preset. GET /api/admin/analysis/snapshots/status reports per-preset state (missing / available / stale / source_version_mismatch) with timestamps and row counts so an operator can watch a rebuild progress. Example:

    curl -X POST https://<host>/api/admin/analysis/snapshots/refresh \
      -H "Authorization: Bearer <admin-token>" -H "Content-Type: application/json" -d '{}'
    curl https://<host>/api/admin/analysis/snapshots/status -H "Authorization: Bearer <admin-token>"

  • Operator script (SSH fallback): make refresh-analysis-snapshots (or docker exec sysndd-api-1 Rscript /app/scripts/refresh-analysis-snapshots.R) forces a rebuild of all presets.
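The request body for the admin refresh route can be assembled from the two documented options. This is a hypothetical helper; whether force and analysis_type may be combined in one request is an assumption that should be confirmed against the API before relying on it.

```shell
# Build the JSON body for POST /api/admin/analysis/snapshots/refresh.
# $1 = "true" to force a rebuild, anything else to omit the flag
# $2 = optional analysis_type to target a single preset
snapshot_refresh_body() {
  force=$1
  analysis_type=${2:-}
  body=""
  if [ "$force" = "true" ]; then
    body='"force": true'
  fi
  if [ -n "$analysis_type" ]; then
    # Separate the two fields only when both are present.
    if [ -n "$body" ]; then
      body="$body, "
    fi
    body="$body\"analysis_type\": \"$analysis_type\""
  fi
  printf '{%s}\n' "$body"
}
```

The output can be passed to the curl call shown above with `-d "$(snapshot_refresh_body true gene_network_edges)"`; calling it with `false` and no preset yields the same empty object as the example.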

While a snapshot is still building, the public GeneNetworks and PhenotypeClusters pages show a friendly “analysis is being prepared” panel (with a retry) instead of a raw error.

Snapshot status meanings:

  • unsupported_parameter: the requested parameters are not in the fixed public preset matrix; change the request or predefine and refresh a new preset in code.
  • snapshot_missing: the preset is supported, but no public-ready snapshot is active yet; run the refresh job.
  • snapshot_stale: an active snapshot exists but is past stale_after; public REST and MCP analysis reads report it unavailable until the preset is refreshed.
  • source_version_mismatch: the stored source data version no longer matches the cheap current source version; public REST and MCP analysis reads report it unavailable until the preset is refreshed.

Available snapshot responses carry a meta.snapshot provenance block sourced from the public-ready manifest row: snapshot_id, analysis_type, parameter_hash, schema_version, data_class, generated_at, stale_after, source_data_version, input_hash, payload_hash, and record_counts. input_hash binds the snapshot to its supported parameter set plus the public source-data version; payload_hash binds it to the materialized result; record_counts reports the stored payload row counts (it excludes generated network metadata). These fields let operators and downstream clients audit lineage and completeness without a second query.

Gemini model configuration

The effective Gemini model resolves in this order: GEMINI_MODEL, api/config.yml key gemini_model, then the SysNDD default gemini-3.5-flash. The admin LLM configuration endpoint reports the source, default model, validity, and any warning so operators can see when an environment override is active.

Invalid or shut-down models are rejected before Gemini is called. If Google releases a model before the built-in catalog is updated, set GEMINI_ALLOWED_MODELS_EXTRA to a comma-separated allowlist of the new IDs; unknown allowlisted models are accepted but surfaced with an operator warning. The allowlist does not re-enable cataloged shut-down models.

GeneNetworks layout artifacts

GeneNetworks display layouts are precomputed derived-analysis artifacts, not request-path work. The API and worker images contain Node 24 plus the minimal api/layout/ dependencies needed to run the headless Cytoscape/fCoSE helper. The worker should run the durable network_layout_prewarm job after data/cache refreshes that can change the displayed gene network.

The public /api/analysis/network_edges request path only reads matching artifacts from /app/cache/network_layouts; it must not run fCoSE synchronously. If an artifact is absent, invalid, or stale, the API marks the display layout as unavailable and the browser falls back to its existing fCoSE layout. Cache invalidation is controlled by the content-aware layout key, which includes the displayed node/edge set, query parameters, layout options, Cytoscape/fCoSE versions, and the current CACHE_VERSION.
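The idea behind a content-aware key is that any change to one of its inputs produces a different artifact name, so stale layouts are simply never matched. The sketch below illustrates that property with a SHA-256 over the listed components; the actual key derivation used by the API is not documented here, so the function name, the hash choice, and the component strings are all assumptions.

```shell
# Derive an illustrative layout key by hashing every component that
# should invalidate the artifact when it changes. One argument per
# component; argument order matters.
layout_key() {
  printf '%s\n' "$@" | sha256sum | cut -d ' ' -f 1
}
```

For example, `layout_key "nodes=ARID1B,SCN2A" "edges=ARID1B-SCN2A" "layout=fcose" "cache_version=1"` and the same call with `cache_version=2` print different keys, which is why bumping CACHE_VERSION is enough to retire every precomputed layout.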

MCP sidecar settings

The optional mcp service runs api/start_sysndd_mcp.R as a separate read-only process. It is not part of Plumber and does not run migrations or workers.

  • MCP_DB_POOL_SIZE defaults to 2 and controls the MCP-only DB pool.
  • MCP_PORT defaults to 8787.
  • MCP_OUTPUT_MODE defaults to json_text, matching the transport spike result for mcptools 0.2.1.9000.
  • MCP_CACHE_DIR defaults to /app/cache; the MCP container mounts the shared API cache read-only and uses it only for cache-hit checks and cache reads.
  • MCP_URL is used by make test-mcp-smoke and the lightweight MCP container liveness probe. Inside the container the default is http://127.0.0.1:8787.

The sidecar initializes with concise SysNDD-specific client instructions that describe the gene -> entity -> publication workflow, entity model, deferred-tool loading guidance, cheap-path payload controls, resource semantics, and read-only constraints. get_sysndd_capabilities provides the longer in-band guide for workflows, limits, payload modes, citation rules, resources, prompt opt-in status, errors, and v1 exclusions. Tool descriptions include short example calls and boolean defaults. Hidden deprecated aliases are not supported; clients should use the advertised schemas.

Payload controls are exposed as response_mode, abstract_mode, synopsis_mode, include flags, expand, and dedupe_publications. Use response_mode = "minimal" for structure-first retrieval; it defaults to no synopsis and no abstracts. Other modes default to citation metadata rather than prose. Tool results report meta.elapsed_ms. Entity phenotypes are grouped as modifier-keyed HPO ID arrays, and batch/expanded payloads keep schema_version only at the outer envelope.

get_gene_context defaults include_comparisons to false, reports meta.entity_total / meta.entity_has_more / meta.next_entity_offset, and supports expand = "entities" for an opt-in one-call gene + entity detail response. get_genes_context supports 1-10 genes with per-gene errors and optional cross-gene publication deduplication. Detailed entity expansion is capped at 20 IDs per call and reports meta.entity_detail_truncated_by_batch_cap when the requested entity limit exceeded that cap. get_entities_context defaults dedupe_publications to true so shared publication objects are returned once at the top level with per-entity publication_refs.

Publication tools expose recommended_citation, publication_date_sysndd_record, publication_date_confidence, optional abstract fields, and separate sysndd_curation_date values for linked entities. abstract_mode = "metadata" reports abstract_available and omits excerpt fields. Historical rows remain unverified until the one-off PubMed backfill is applied.

Use get_genes_context for 1-10 genes, get_entities_context for 1-20 entity IDs, and get_publications_context for 1-20 PMIDs instead of issuing many single-record calls.
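Clients with more IDs than a batch tool accepts need to split them into groups that respect the caps above. The helper below is a hypothetical client-side sketch; it reads one ID per line from standard input and prints one comma-separated batch per line.

```shell
# Group IDs from stdin into batches of at most $1 entries.
# Each output line is one comma-separated batch.
batch_ids() {
  xargs -n "$1" | tr ' ' ','
}
```

For instance, piping 45 PMIDs through `batch_ids 20` yields three lines, each of which fits a single get_publications_context call; use 10 for get_genes_context and 20 for get_entities_context.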

MCP_SCHEMA_VERSION 1.2 analysis tools are limited to the analysis catalog, gene research context, NDDScore context, curation comparison context, phenotype analysis context, and gene network context. They label every analysis payload as curated_sysndd_evidence, curated_derived_analysis, ml_prediction, llm_generated_summary, external_reference_identifier, or operational_metadata. NDDScore remains an ML prediction layer, separate from curated SysNDD evidence and not an evidence tier. LLM summary data is cache-only: current validated summaries generated by the admin workflow may be read, but MCP must not trigger Gemini/LLM generation or expose prompts/queries. MCP analysis tools must not call live external providers; stored external IDs may be returned only as external_reference_identifier.

The MCP sidecar must not write to the database, call write routes, execute raw SQL/R, expose admin/user/log/job data, expose draft reviews or re-review workflows, call live external providers, or trigger Gemini/LLM generation.

Snapshot-backed MCP analysis sections depend on public-ready API/worker-derived snapshots. The sidecar may inspect manifest status but must not call bootstrap_init_cache_version(), clear cache files, compute STRING networks, generate phenotype correlations, run phenotype clustering, or generate LLM summaries. If dry_run reports snapshot_missing, snapshot_stale, or source_version_mismatch, refresh the corresponding analysis_snapshot_refresh preset before expecting MCP to return current records.

Large analysis calls default to response_mode = "compact" and max_response_chars = "auto". Responses include budget metadata and may include dropped_summary, recovery, per-section status, dry_run, or response_mode = "diagnostics" output so clients can narrow broad requests. The recommended low-token path is get_sysndd_analysis_catalog, then get_gene_research_context(dry_run = TRUE, response_mode = "compact"), then focused analysis tools for the sections the client actually needs.

The sidecar patches mcptools so tools/list advertises read-only annotations and output schemas, and resources/list / resources/read serve distinct static sysndd://schema/overview and sysndd://schema/tool-guide resources. MCP prompts are disabled by default because Claude Code exposes them as user-invoked slash commands rather than automatically discovered LLM workflows; set MCP_ENABLE_PROMPTS=true only when the deployment intentionally wants prompts/list / prompts/get to expose the four SysNDD workflow prompts. Recoverable validation failures return stable tool-result JSON envelopes with isError = true; malformed or unknown user inputs should not surface as JSON-RPC -32603 internal errors. The container healthcheck uses only initialize and tools/list; make test-mcp-smoke remains the heavier end-to-end probe.

The production Compose file keeps MCP internal-only by default: no host port and no Traefik labels are configured. If an operator exposes it as /mcp, the proxy must protect it with a static bearer-token middleware, equivalent private network control, or a future OAuth flow, and should strip /mcp before forwarding to the mcptools HTTP root endpoint. A safe deployment can route MCP protocol requests (POST /mcp and GET /mcp with Accept: text/event-stream) to the protected sidecar while letting normal browser GET /mcp requests reach the public informational Vue page. Without such a protected proxy route, the public app serves /mcp as a short informational page for humans and MCP client setup guidance; that page must not be treated as the transport endpoint.
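The routing rule described above can be written down as a small decision function, which is useful when reasoning about proxy rules or writing a smoke test. This is an illustrative sketch with hypothetical names; the real decision is made by the reverse proxy, not by shell code.

```shell
# Decide where a request for /mcp should go.
# $1 = HTTP method, $2 = value of the Accept header (may be empty)
# Prints "sidecar" for MCP protocol traffic and "app" for everything else.
mcp_route() {
  method=$1
  accept=${2:-}
  case "$method" in
    POST) echo sidecar ;;
    GET)
      case "$accept" in
        *text/event-stream*) echo sidecar ;;
        *) echo app ;;
      esac ;;
    *) echo app ;;
  esac
}
```

A browser visit (`mcp_route GET "text/html"`) resolves to the informational page, while an MCP client opening a stream (`mcp_route GET "text/event-stream"`) resolves to the protected sidecar.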

For Traefik, the intended shape is:

# Operator overlay example; do not expose this without authentication.
mcp:
  networks:
    - backend
    - proxy
  labels:
    - "traefik.enable=true"
    - "traefik.docker.network=sysndd_proxy"
    - "traefik.http.routers.mcp-post.rule=Host(`sysndd.dbmr.unibe.ch`) && Path(`/mcp`) && Method(`POST`)"
    - "traefik.http.routers.mcp-post.entrypoints=web"
    - "traefik.http.routers.mcp-post.priority=200"
    - "traefik.http.routers.mcp-post.middlewares=mcp-strip,mcp-auth"
    - "traefik.http.routers.mcp-sse.rule=Host(`sysndd.dbmr.unibe.ch`) && Path(`/mcp`) && HeadersRegexp(`Accept`, `.*text/event-stream.*`)"
    - "traefik.http.routers.mcp-sse.entrypoints=web"
    - "traefik.http.routers.mcp-sse.priority=200"
    - "traefik.http.routers.mcp-sse.middlewares=mcp-strip,mcp-auth"
    - "traefik.http.middlewares.mcp-strip.stripprefix.prefixes=/mcp"
    - "traefik.http.services.mcp.loadbalancer.server.port=8787"

Add mcp-auth through the deployment’s chosen authentication middleware before enabling the route.

11.3 Operations Notes

  • Migrations run at API startup and should fail hard if broken. Migration 025_create_core_views.sql codifies the core read views (ndd_entity_view, users_view, search_non_alt_loci_view, search_disease_ontology_set) so a brand-new MySQL volume boots without manually running db/C_Rcommands_set-table-connections.R. The views use SQL SECURITY INVOKER; on an existing DB where they were created by the legacy script, CREATE OR REPLACE swaps them in place (the app DB user already has the required SELECT grants).
  • Public clustering submission has a queue-depth cap. ASYNC_PUBLIC_JOB_CAP (default 8) bounds simultaneously queued/running jobs on the default queue; over the cap the public submit routes return 503 + Retry-After: 60 (CAPACITY_EXCEEDED). Raise it in the deployed .env if the worker fleet can sustain more concurrent STRING-db clustering jobs.
  • The public LLM cluster-summary endpoints (/api/analysis/functional_cluster_summary, /api/analysis/phenotype_cluster_summary) are cache-hit-only for anonymous/Viewer callers and return 404 on a cache miss; on-demand Gemini generation requires a Curator+ token. Pre-warm summaries via admin generation rather than expecting the public path to generate them.
  • Access-token lifetime is configured by token_expiry (seconds) in each api/config.yml block (default 3600); it drives both the JWT exp and the expires_in returned by POST /api/auth/authenticate. The legacy refresh key now only controls the password-reset link TTL. Set token_expiry explicitly in the production config block.
  • Async job polling now reads durable MySQL-backed state, so sticky sessions are optional for correctness.
  • make cache-clear removes nested .rds cache files under /app/cache, including external proxy caches.
  • Run the worker service alongside the API service; mirai daemons live in the worker service and jobs are executed by the worker entrypoint, not the web process.
  • The worker service healthcheck should verify the worker process is alive, not probe an HTTP endpoint from the worker container.
  • The worker service needs both internal database access and outbound provider access. In Compose it should stay on backend for MySQL/API internals and on the egress-capable proxy network for Gemini, PubMed, PubTator, and other external calls. Do not attach it only to the internal backend network.
  • Keep the MCP service on internal/private access unless a protected route is deliberately configured. MCP tools and prompts are read-only and must not call Gemini/LLM generation, live external providers, raw SQL/R execution, write routes, admin/user/log/job routes, draft reviews, or re-review data. Analysis tools must remain cache-only for LLM summaries and bounded by compact defaults plus max_response_chars.
  • Refresh public analysis snapshots after curated data changes or analysis algorithm changes. Submit analysis_snapshot_refresh jobs for each supported preset, watch /api/jobs/<job_id>/status, and run make test-mcp-smoke against the MCP sidecar after activation.
  • Run NDDScore updates from the administrator /ManageNDDScore page: Check Zenodo, Download & validate, then Import & activate latest release. The worker needs outbound egress to Zenodo. The previous active release keeps serving until the new release validates and activates successfully. All imported releases are retained for history; there is no automatic pruning. On failure, inspect the release import_status and last_error_message in the admin view or database.
  • Configure the default NDDScore Zenodo source in the production .env file. NDDSCORE_ZENODO_RECORD_ID defaults to 20258027, and NDDSCORE_ZENODO_API_BASE_URL defaults to https://zenodo.org/api/records. The API and worker containers both receive these variables; if they are missing, api/config.yml provides the same defaults.
  • publication.publication_date_source records how each Publication_date was derived (pubmed, pubmed_partial, medline_date, unknown). New ingestions set it automatically. After deploying the migration, run the one-time operator backfill with PubMed egress: Rscript db/updates/backfill_publication_dates.R --dry-run --limit=25 for a small rehearsal, Rscript db/updates/backfill_publication_dates.R --dry-run to preview the full run, then Rscript db/updates/backfill_publication_dates.R --apply to update historical rows. The script is dry-run by default, uses an advisory lock, limits PubMed fallback requests with NCBI_REQUEST_DELAY_SECONDS, skips unresolved IDs, and commits writes in batches controlled by BACKFILL_UPDATE_BATCH_SIZE.
  • Use HTTPS in production. See TLS Certificate Renewal below for the yearly certificate workflow and the dry-run-safe CSR helper.

11.4 TLS Certificate Renewal

Design rationale and the full decision record live in .planning/decisions/2026-06-11-tls-certificate-renewal-automation.md (issue #25). This section is the operator runbook.

How TLS is served today

The application Compose stack (docker-compose.yml) runs Traefik on the web (:80) entrypoint only; it has no :443 entrypoint and no ACME resolver, so HTTPS for the public host sysndd.dbmr.unibe.ch is terminated by an upstream institutional reverse proxy. A legacy standalone nginx TLS config also exists (app/docker/nginx/prod.conf, terminating :443 from a cert.pem/key.pem mount at /etc/nginx/certificates/); it is retained but not wired into the current stack. Confirm which terminator is active in your deployment before installing a new certificate.

Two paths

Option A — ACME / Let’s Encrypt (preferred if a public CA is acceptable). If the institution allows a public CA for this host, add a :443 entrypoint plus an ACME resolver to Traefik (or to the upstream proxy). Traefik then issues and auto-renews certificates with no CSR, no email, and no restart — this eliminates the manual yearly process entirely. Confirm public-CA acceptability and inbound :80/:443 reachability for the ACME challenge first.

Option B — scripted CSR for an institutional CA (current default). Keep the institutional/internal CA but remove the manual openssl step. Use the helper to produce a reproducible key + CSR, submit it to the authority, install the returned certificate, and reload the terminator.

CSR helper (scripts/cert/generate-csr.sh)

The helper is dry-run by default and refuses to write key material inside the repository tree. Configure it via scripts/cert/cert-renewal.conf (copied from cert-renewal.conf.example; the real config and any local key/CSR output are gitignored) or CERT_* environment variables.

# 1. Configure (one-time): copy the example and edit subject/SAN/output dir.
cp scripts/cert/cert-renewal.conf.example scripts/cert/cert-renewal.conf
$EDITOR scripts/cert/cert-renewal.conf      # CERT_OUT_DIR must be OUTSIDE the repo

# 2. Inspect resolved config and the exact openssl command (writes nothing).
scripts/cert/generate-csr.sh --print-config
scripts/cert/generate-csr.sh                # DRY-RUN: prints the openssl command

# 3. Generate the real key + CSR (the only live operation in this helper).
scripts/cert/generate-csr.sh --apply --out-dir /etc/sysndd/certs

The private key is written under umask 077 + chmod 600; the CSR is safe to share with the signing authority.

Remaining operator steps (TODO hooks — CA-/deployment-specific)

The helper intentionally does not submit, install, or reload — those depend on your CA and active terminator. It prints guidance for each:

  1. Submit the CSR to the authority (portal upload, CA API, or email).

  2. Install the returned certificate (+ intermediate chain). Validate it matches the key — these two MD5s must be identical:

    openssl x509 -noout -modulus -in cert.pem | openssl md5
    openssl rsa  -noout -modulus -in key.pem  | openssl md5

    Keep the previous cert.pem/key.pem as .bak for rollback, then place the new pair at the active terminator’s mount.

  3. Reload the terminator without dropping connections:

    • nginx: docker compose exec <proxy> nginx -s reload
    • Traefik file-provider: dynamic cert files hot-reload automatically on change.

    Verify afterward:

    echo | openssl s_client -connect sysndd.dbmr.unibe.ch:443 \
      -servername sysndd.dbmr.unibe.ch 2>/dev/null | openssl x509 -noout -dates
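The certificate/key match from step 2 can be wrapped in a function so an install script stops before placing a mismatched pair. This is a minimal sketch that assumes an RSA key, as the openssl rsa command in step 2 does; the function name is hypothetical, and it compares the moduli directly rather than their MD5 digests.

```shell
# Succeed when the certificate ($1) and RSA private key ($2) share the
# same modulus. Returns 2 when either file cannot be read by openssl.
cert_matches_key() {
  cert_mod=$(openssl x509 -noout -modulus -in "$1" 2>/dev/null) || return 2
  key_mod=$(openssl rsa -noout -modulus -in "$2" 2>/dev/null) || return 2
  # An empty modulus means openssl printed nothing useful; treat as mismatch.
  [ -n "$cert_mod" ] && [ "$cert_mod" = "$key_mod" ]
}
```

A guarded install could then read `cert_matches_key cert.pem key.pem || { echo "certificate does not match key" >&2; exit 1; }` before copying the pair to the terminator's mount.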

Yearly schedule

Run the generator on a yearly cadence from host cron or a systemd timer (never inside the nginx app container) and notify an operator that a fresh CSR is ready to submit. Allow round-trip slack (e.g. ~6 weeks before expiry):

# 06:00 on Oct 1 each year — generate the renewal CSR and log the result.
0 6 1 10 * /opt/sysndd/scripts/cert/generate-csr.sh --apply --out-dir /etc/sysndd/certs >> /var/log/sysndd-cert-renew.log 2>&1

For ACME (Option A) no schedule is needed; Traefik renews automatically.

Rollback

The generated key/CSR are inert until a signed certificate is installed, so generation carries no production risk. If a freshly installed certificate breaks TLS, restore the .bak cert.pem/key.pem and reload again — fast and connection-preserving.

11.5 SEO Prerender Operations

The default production path is build-time prerendering into the frontend image. Set Docker build arg SEO_GENERATE=true to generate crawlable public route HTML after the Vite build. If SEO_API_BASE_URL is set, the generator reads /api/seo/*; otherwise it uses deterministic fixtures.

The production frontend image builds with VUE_MODE=production by default so Vite reads app/.env.production. Do not build the production image with VUE_MODE=docker; that mode is reserved for the local development container.

docker build \
  -f app/Dockerfile \
  --build-arg VUE_MODE=production \
  --build-arg SEO_GENERATE=true \
  --build-arg SEO_API_BASE_URL=https://sysndd.dbmr.unibe.ch/api \
  app

Verify the generated output locally with:

make verify-seo-app

Runtime refresh is optional and intentionally outside API startup and nginx. The profiled sidecar keeps nginx single-purpose:

docker compose --profile ops run --rm seo-prerender

If a deployment mounts app/dist through an explicit shared or bind-mounted artifact volume, restart the app only after successful generation:

docker compose restart app

Do not add cron or Node to the nginx app container. For periodic refreshes after data releases, run the profiled sidecar from host cron or rebuild the app image with SEO_GENERATE=true.

11.6 Security headers

The frontend nginx config (app/docker/nginx/security-headers.conf) emits the following on every SPA response:

  • Strict-Transport-Security: max-age=63072000; includeSubDomains; preload — one-way policy decision; see .planning/decisions/2026-04-25-csp-hsts-policy.md for the operator-facing caveats (sub-domain inventory and preload-list submission).
  • Content-Security-Policy: in script-src, 'unsafe-inline' is replaced by build-time sha256-... hashes (see “Vite upgrade maintenance” below); 'unsafe-eval' is retained intentionally because vendor JS (NGL Web Workers, Vue runtime template compiler, markdown-it) needs it. In style-src, 'unsafe-inline' is retained intentionally because Bootstrap-Vue-Next, NGL, and d3 emit unhashable inline style="" attributes. Full rationale in the ADR.
  • X-Content-Type-Options: nosniff
  • Referrer-Policy: strict-origin-when-cross-origin
  • X-Frame-Options: SAMEORIGIN and CSP frame-ancestors 'self' together restrict framing.
  • Permissions-Policy denies geolocation, camera, microphone, and other powerful APIs we do not use.

The Playwright spec app/tests/e2e/security-headers.spec.ts is the regression net for the directive shape; any future loosening must red-line that spec before merge.

Vite upgrade maintenance

After bumping Vite, NGL, markdown-it, or any other vendor that may add or remove inline <script> content, regenerate the CSP script-src hashes:

cd app
npm run build
node scripts/audit-csp-violations.mjs --build dist
# Update the 'sha256-...' list in app/docker/nginx/security-headers.conf

CI’s Playwright security-headers.spec.ts and the audit script catch missed updates.
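For reference, a CSP script hash is the base64-encoded SHA-256 digest of the exact inline script text, wrapped as 'sha256-<value>'. The helper below is a hypothetical sketch for spot-checking a single inline script by hand; it is not the audit script, and the input must match the content between the script tags byte for byte, including whitespace.

```shell
# Print the CSP source expression for one inline script body.
# $1 = exact text between <script> and </script>
csp_hash() {
  digest=$(printf '%s' "$1" | openssl dgst -sha256 -binary | openssl base64 -A)
  printf "'sha256-%s'\n" "$digest"
}
```

Comparing the output of `csp_hash` for a suspicious inline script against the list in security-headers.conf shows quickly whether a browser CSP violation is caused by a missing or outdated hash.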

For full deployment details, runtime tuning context, and troubleshooting history, see the repository and infrastructure configuration alongside the compose files.