Deployment
Learn how to deploy and start the Router in Production.
The router is a Go application provided as a self-contained Docker container. You can initiate the router by executing a single Docker command:
```shell
docker run \
  --name cosmo-router \
  -e FEDERATED_GRAPH_NAME=<federated_graph_name> \
  -e GRAPH_API_TOKEN=<router_api_token> \
  -e LISTEN_ADDR=0.0.0.0:3002 \
  --platform=linux/amd64 \
  -p 3002:3002 \
  ghcr.io/wundergraph/cosmo/router:latest
```
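If you prefer Docker Compose, the same command can be expressed as a service definition. This is a sketch using the same image and environment variables as above; adapt it to your setup:

```yaml
services:
  cosmo-router:
    image: ghcr.io/wundergraph/cosmo/router:latest
    platform: linux/amd64
    environment:
      FEDERATED_GRAPH_NAME: <federated_graph_name>
      GRAPH_API_TOKEN: <router_api_token>
      LISTEN_ADDR: 0.0.0.0:3002
    ports:
      - "3002:3002"
```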
`--platform=linux/amd64` is required because we don't provide ARM images yet.

By default, the health check endpoint is exposed at the path `/health`. The health check reports whether the router is alive. If you are on Kubernetes, you can make use of our liveness and readiness checks. For convenience, we expose both at `/health/live` and `/health/ready`:

```yaml
containers:
  - name: router
    livenessProbe:
      httpGet:
        path: "/health/live"
        port: 3002
    readinessProbe:
      httpGet:
        path: "/health/ready"
        port: 3002
```
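To verify the probes locally, you can hit the endpoints directly, assuming the router listens on port 3002 as configured above:

```shell
# Liveness: responds 200 as soon as the router process is up
curl -i http://localhost:3002/health/live

# Readiness: responds 200 only once the router can serve requests
curl -i http://localhost:3002/health/ready
```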
The liveness route returns `200` when the router is up. The readiness handler returns `200` only when the router is ready to serve requests.

Before a Pod is terminated, Kubernetes sends a SIGTERM signal to the Pod. Afterward, the Pod has 30 seconds to terminate itself (see `terminationGracePeriodSeconds`). Ensure that the router-specific timeouts are set below this threshold, or increase the termination period appropriately.
For example, the following configuration was tested and could serve as a good starting point:
Router

```
GRACE_PERIOD_SECONDS=20
SHUTDOWN_DELAY_SECONDS=30
```

Pod

```
terminationGracePeriodSeconds=60
```
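Wired together, these values might appear in a Deployment manifest like the following sketch (the metadata names are illustrative, and only the fields relevant to shutdown behavior are shown):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cosmo-router
spec:
  template:
    spec:
      # Must exceed the router's shutdown-related timeouts
      terminationGracePeriodSeconds: 60
      containers:
        - name: router
          image: ghcr.io/wundergraph/cosmo/router:latest
          env:
            - name: GRACE_PERIOD_SECONDS
              value: "20"
            - name: SHUTDOWN_DELAY_SECONDS
              value: "30"
```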
With that configuration, your router has 30 seconds to respond to all active connections before the server is forcefully shut down. Kubernetes will not kill the container until the termination period of 60 seconds is over. Because the readiness endpoint no longer returns `200` while the container is in the termination state, no new traffic is routed to the instance.

`GRACE_PERIOD_SECONDS` becomes relevant when the router updates its schema: it is the time the server has to gracefully shut down old clients before the switch is enforced.

To make the application reachable within the container network, you need to expose the router on `0.0.0.0`, e.g., `LISTEN_ADDR=0.0.0.0:3002`.
This is also required when trying to access Prometheus metrics from outside the Docker network. Use `PROMETHEUS_LISTEN_ADDR="0.0.0.0:8088"` to make the endpoint accessible.

Structured logging, also known as JSON logging, is enabled by default and can be controlled using the `JSON_LOG` environment variable. For development, we recommend turning this off because the logs are then rendered in a more human-friendly format.

The router exposes useful metrics about the process and memory. All metrics can be scraped from the Prometheus metrics endpoint. For more information, please visit the dedicated Metrics section.
In scenarios with low traffic, a sampling rate of 1 (100%) is acceptable. However, for high-volume situations, we strongly recommend a lower sampling rate. For instance, in a project handling 60 requests per second, a sampling rate of 0.1 (10%, i.e., 6 req/s) is sufficient to generate valuable insights. We are also considering introducing certain limitations within Cosmo Cloud to prevent overloads in the future.
Ensure that your subgraphs use parent-based sampling to inherit the sampling rate. For more information see OTEL instrumentation on Subgraphs.
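A sampling rate is typically set in the router's configuration file. The exact keys below are an assumption based on the router's telemetry configuration and should be verified against the configuration reference:

```yaml
telemetry:
  tracing:
    # Assumed key; 0.1 samples 10% of requests
    sampling_rate: 0.1
```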
Please ensure `ENGINE_ENABLE_REQUEST_TRACING` (ART) is set to `false` so that you do not expose information about your subgraphs. You will see a warning when ART is enabled.

The specific resource recommendations depend on your constraints, but we have observed that low- to medium-traffic projects benefit from the following setup:
- 3 instances for high availability.
- Each instance is equipped with 2 CPUs and 1GB of RAM.
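Expressed as Kubernetes settings, that recommendation could look like the following sketch (setting requests equal to limits here is an assumption, not a requirement):

```yaml
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: router
          resources:
            requests:
              cpu: "2"
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 1Gi
```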