# Server-Sent Events (SSE) for real-time updates

Galaxy can push history changes, interactive-tool entry-point changes, and
in-app notifications to connected browsers via [Server-Sent Events][sse-mdn]
instead of polling. This replaces the legacy 3-second history poll and
10-second entry-point poll with a single long-lived HTTP connection per
browser tab, dramatically reducing API load on busy servers and giving users
sub-second update latency.

This document describes the architecture, the configuration knobs, the
metrics admins should watch, and how to configure NGINX so the long-lived
event connection is not buffered or prematurely terminated.

[sse-mdn]: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events

## How it works

A high-level view of the moving parts on the server:

```
   ┌────────────────┐  LISTEN/NOTIFY     ┌──────────────────────┐
   │  Postgres      │ ─────────────────► │ HistoryAuditMonitor  │
   │  history_audit │  (or audit poll    │ (one elected process) │
   │                │   on SQLite)       └──────────┬───────────┘
   └────────────────┘                               │ Kombu
                                                    ▼
                                     ┌─────────────────────────────┐
                                     │  galaxy_queue_worker        │
                                     │  control task fan-out:      │
                                     │   • history_update          │
                                     │   • entry_point_update      │
                                     │   • notification_update     │
                                     │   • broadcast_update        │
                                     └──────────────┬──────────────┘
                                                    │ in-process
                                                    ▼
                              ┌─────────────────────────────────────┐
                              │  Each gunicorn worker:              │
                              │   SSEConnectionManager → asyncio    │
                              │   queue per connected browser tab   │
                              └──────────────┬──────────────────────┘
                                             │ HTTP chunked
                                             ▼
                              ┌─────────────────────────────────────┐
                              │  Browser EventSource                │
                              │  (single stream, multiplexed events)│
                              └─────────────────────────────────────┘
```

Concretely:

1. **One stream per browser, many event types.** The browser opens a single
   `EventSource` against `/api/events/stream`. The same connection carries
   `history_update`, `entry_point_update`, `notification_update`,
   `broadcast_update`, and `notification_status` events.
2. **Per-process registries.** Each Gunicorn worker keeps an
   `SSEConnectionManager` that holds an `asyncio.Queue` per connected tab,
   indexed by user id (and Galaxy session id, so anonymous users still
   receive their own history's updates).
3. **Producers.**
   - History updates come from a `HistoryAuditMonitor` that watches
     `history_audit` via PostgreSQL `LISTEN/NOTIFY` (instant) or by polling
     the audit table on SQLite. Only one process in the cluster is the
     producer, picked by `DatabaseHeartbeat` leader election. If a
     standalone `galaxy-sse-monitor` process is running it always wins;
     otherwise one webapp picks it up.
   - Entry-point updates are dispatched directly from the code paths that
     mutate interactive-tool entry points — there is no separate watcher.
   - Notifications dispatch SSE events from `NotificationManager` whenever
     a notification or broadcast is created.
4. **Cross-process fan-out.** Producers don't know which worker holds the
   recipient's connection, so all events go through a Kombu control task
   broadcast on the internal AMQP bus. Every worker receives every event
   and locally drops the ones for users it doesn't currently hold a
   connection for.
5. **Reconnect catch-up.** When a browser reconnects after a network blip,
   it sends `Last-Event-ID`. The server replays an aggregated
   `notification_status` covering everything since that timestamp. History
   updates come with an `update_time` cursor in the payload, so the client
   can request the delta itself.

### Standalone monitor (recommended for production)

The `galaxy-sse-monitor` console script (installed by the `galaxy-app`
package) runs the `HistoryAuditMonitor` outside the webapp processes. This
is the recommended layout for production because:

- The webapp processes never compete with each other for the audit-monitor
  role on cold starts.
- Restarting the webapp tier doesn't briefly stall history updates while
  another worker is elected.
- The monitor needs only DB + AMQP access, so it can be sized
  independently and runs with a much smaller resident-set than a webapp.

A typical Gravity supervisor entry looks like the existing
`galaxy-celery-worker` block — point at the same `galaxy.yml` and run
`galaxy-sse-monitor` on its own. With the daemon present, the
`HistoryAuditMonitor` registered on the webapp side stays idle (the
heartbeat election picks the daemon) but is still wired up so it can take
over if the daemon goes away.

If `enable_sse_updates` is `false`, `galaxy-sse-monitor` will start, log a
warning, and idle — it does no work and produces no events.

## Configuration

There is a single admin-facing flag for SSE-driven updates:

```yaml
galaxy:
  enable_sse_updates: true
```

This controls **all three** SSE-driven paths (history, entry-point,
notifications). When `false`:

- `HistoryAuditMonitor` is not registered, so the cluster does no
  `LISTEN/NOTIFY` or audit-table polling for history changes.
- The browser falls back to its existing 3-second history poll and
  10-second entry-point poll.
- Notifications fall back to the existing 30-second polling against
  `/api/notifications/status`.

`enable_notification_system` is independent: it gates whether the
notification system is available at all (notification creation, delivery,
preferences, broadcasts). With `enable_notification_system: true`:

- `enable_sse_updates: true` → notifications arrive via SSE.
- `enable_sse_updates: false` → notifications are polled.

With `enable_notification_system: false` the entire notification system is
off — there is nothing to push or poll.

The polling-fallback knob for the history audit monitor stays available:

```yaml
galaxy:
  history_audit_monitor_poll_interval: 2  # seconds, SQLite / no-LISTEN only
```

This only matters when running on SQLite or in setups where PostgreSQL
`LISTEN/NOTIFY` is unavailable.

## What to monitor

When statsd is configured (via `statsd_host` and friends), the SSE
plumbing emits the following metrics. Capture these on the same dashboard
you use for Gunicorn worker health:

| Metric                                     | Type    | Source                | Meaning                                                      |
| ------------------------------------------ | ------- | --------------------- | ------------------------------------------------------------ |
| `galaxy.sse.connections.dropped`           | counter | `SSEConnectionManager` | A per-connection asyncio queue filled up; an event was lost. |
| `galaxy.sse.dispatch.count` (tag: `task`) | counter | `SSEEventDispatcher`  | Control-task fan-outs by event kind.                         |
| `galaxy.sse.dispatch.latency_ms` (tag: `task`) | timing | `SSEEventDispatcher`  | Wall time spent enqueueing the control task.                 |
| `galaxy.sse.dispatch.skipped_no_qw`        | counter | `SSEEventDispatcher`  | Producer tried to dispatch with no queue worker bound — events would have been dropped. |

You can also expose a "currently connected SSE clients" gauge if you wire
one up: each `SSEConnectionManager` instance publishes
`total_broadcast_connections` (all connections, including anonymous) and
`total_per_user_connections` (connections bound to a specific user). These
are per-worker numbers; sum across workers for a cluster total.

Alerting recommendations:

- **`galaxy.sse.connections.dropped` > 0 sustained** indicates a slow or
  stuck client whose queue filled up. Occasional drops on a network blip
  are normal; a steady rate is a bug or a misconfigured proxy holding
  events back too long.
- **`galaxy.sse.dispatch.skipped_no_qw` > 0** means events are being lost
  because the producer process can't reach the AMQP bus. Check the
  `amqp_internal_connection` config and the AMQP broker health.
- **`galaxy.sse.dispatch.latency_ms` p95 climbing** points at an AMQP
  bottleneck (broker load, network) — events will land late.

In addition, watch the standard worker metrics. SSE connections are
long-lived (tens of minutes is common), but Galaxy runs Gunicorn with
`uvicorn.workers.UvicornWorker` (configured by Gravity by default), so
each worker process is async and can hold thousands of idle SSE
connections without blocking other requests. The practical concern on
busy servers is therefore memory and file-descriptor headroom, not
worker exhaustion: budget a few KB per connection plus one fd per
connection per worker, and raise `ulimit -n` accordingly.

## Configuring NGINX

The SSE endpoint is served at `/api/events/stream`. It is a normal
HTTP/1.1 chunked response, so it works through NGINX without any special
modules — but you must turn off response buffering and raise the
read/send timeouts, otherwise events will arrive in batched bursts (or
not at all until the connection times out).

Galaxy already sets `X-Accel-Buffering: no` on the response, which
disables NGINX's response buffering for that one endpoint without
affecting buffering on the rest of Galaxy. That alone is enough on most
setups. The block below adds the read/send timeouts and HTTP/1.1
upgrade-friendly headers explicitly so the connection survives long
idle periods between events:

```nginx
        # Long-lived Server-Sent Events stream.
        # Galaxy sends ``X-Accel-Buffering: no`` on the response, which
        # disables nginx response buffering just for this endpoint.
        # The keepalive comment fires every 30s so the read timeout
        # only needs to be a comfortable margin above that.
        location /api/events/stream {
            proxy_pass http://unix:/srv/galaxy/var/gunicorn.sock;
            proxy_set_header Host $http_host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Disable buffering and gzip explicitly. ``X-Accel-Buffering``
            # already does this, but pinning it here also covers setups
            # where a sub-filter strips upstream headers.
            proxy_buffering off;
            proxy_cache off;
            gzip off;

            # Keepalives fire every 30s; allow generous slack on top.
            proxy_read_timeout 1h;
            proxy_send_timeout 1h;
        }
```

Place this `location` block **above** the catch-all `location /` block in
your existing Galaxy `server {}` (see the [NGINX proxy guide](nginx.md)).
NGINX matches longest prefix first, so the order doesn't matter for
correctness, but keeping all the override blocks together at the top of
the server block makes the special-case handling easy to find.

If you serve Galaxy at a URL prefix (`/galaxy`), prefix the location too:

```nginx
        location /galaxy/api/events/stream {
            proxy_pass http://unix:/srv/galaxy/var/gunicorn.sock:/galaxy;
            # ...same body as above
        }
```

### Other proxies

If you front Galaxy with something other than NGINX, the same rules
apply: disable response buffering for `/api/events/stream`, allow the
connection to stay open for at least the `keepalive` interval (30 s by
default) plus a healthy margin, and pass the request through HTTP/1.1
without forcing `Connection: close`.

- **Apache `mod_proxy_http`**: add `ProxyPass` with
  `flushpackets=on flushwait=5` and bump `ProxyTimeout` to at least a
  few minutes. Avoid `mod_deflate` on this endpoint.
- **HAProxy**: the connection is a plain HTTP/1.1 chunked response and
  needs no special handling beyond `timeout server` and
  `timeout tunnel` raised above 30 s.
- **Cloudflare and other CDNs**: many CDNs buffer HTTP/1.1 chunked
  responses by default. Either bypass the CDN for `/api/events/stream`
  or follow the CDN's documented pattern for SSE streaming.

## Verifying the deployment

After enabling `enable_sse_updates`, three quick checks confirm the
stream is healthy end-to-end:

1. From the browser DevTools Network tab, open Galaxy and look for a
   `GET /api/events/stream` request that stays in the **pending** state
   with a constantly incrementing transferred-bytes count. The response
   `Content-Type` is `text/event-stream`.
2. Trigger a history change (run a tool, rename a dataset). The Network
   tab should show a `history_update` event in the EventStream view of
   that connection within a second or two, and the history panel
   refreshes without a polling round-trip.
3. On the server side, `galaxy.sse.dispatch.count` should be ticking up
   for each event kind your users exercise. If you wired up the
   connection gauges, they should reflect roughly one connection per
   open browser tab.

If the connection opens but no events arrive, the most common causes
are: a proxy buffering responses (revisit the NGINX section), or a
producer that can't reach AMQP (see `galaxy.sse.dispatch.skipped_no_qw`).