Metrics

Artemis metrics

Artemis and components Artemis depends on, RabbitMQ and PostgreSQL, provide metrics in a well-known format suitable for ingestion by Prometheus. Artemis core metrics are served by Artemis API server under dedicated path, /metrics.

Artemis

  • artemis_identity_info (gauge): Artemis identity info. Labels provide information about identity aspects.
  • artemis_package_info (gauge): Artemis packaging info. Labels provide information about package versions.

Messages and tasks

  • overall_message_count (counter): Overall total number of messages processed by queue and actor.
  • overall_errored_message_count (counter): Overall total number of errored messages by queue and actor.
  • overall_retried_message_count (counter): Overall total number of retried messages by queue and actor.
  • overall_rejected_message_count (counter): Overall total number of rejected messages by queue and actor.
  • current_message_count (gauge): Current number of messages being processed by queue and actor.
  • current_delayed_message_count (gauge): Current number of messages being delayed by queue and actor.
  • message_duration_milliseconds (histogram): The time spent processing messages by queue and actor.

API server HTTP traffic

  • http_requests_total (counter): Request count by method, path and status line.
  • http_requests_inprogress_count (gauge): Requests in progress by method and path.

API server DB metrics

  • db_pool_size (gauge): Maximal number of connections available in the pool.
  • db_pool_checked_in (gauge): Current number of connections checked in.
  • db_pool_checked_out (gauge): Current number of connections out.
  • db_pool_overflow (gauge): Current overflow of connections.

Routing metrics

  • overall_policy_calls_count (counter): Overall total number of policy call by policy name.
  • overall_policy_cancellations_count (counter): Overall total number of policy canceling a guest request by policy name.
  • overall_policy_rulings_count (counter): Overall total number of policy rulings by policy name, pool name and whether the pool was allowed.

Provisioning metrics

  • current_guest_request_count (gauge): Current number of guest requests being provisioned by pool and state.
  • current_guest_request_count_total (gauge): Current total number of guest requests being provisioned.
  • overall_provisioning_count (counter): Overall total number of all requested guest requests.
  • overall_successfull_provisioning_count (counter): Overall total number of successfully provisioned guest requests by pool.
  • overall_failover_count (counter): Overall total number of failovers to another pool by source and destination pool.
  • overall_successfull_failover_count (counter): Overall total number of successful failovers to another pool by source and destination pool.
  • guest_request_age (gauge): Guest request ages by pool, state and age threshold. Thresholds are 5 minutes long in the first hour, then 60 minutes until 48 hours. inf bucket collects all guests that are older than 48 hours.
  • provisioning_duration_seconds (histogram): The time spent provisioning a machine.
  • guest_request_state_transitions (counter): Overall total number of guest request state transitions, per pool, current state and new state.

Pool resource metrics

Following metrics are generated for every known pool, and represent limits and current usage of pool resources. All metrics have a label called dimension set to either limit or usage to represent which side of the equation it provides.

  • pool_resources_instances (gauge): Current limits and usage of pool instances.
  • pool_resources_cores (gauge) Current limits and usage of pool cores.
  • pool_resources_memory_bytes (gauge): Current limits and usage of pool memory.
  • pool_resources_diskspace_bytes (gauge): Current limits and usage of pool diskspace.
  • pool_resources_snapshot (gauge): Current limits and usage of pool snapshot.
  • pool_resources_network_addresses (gauge) Current limits and usage of pool networks.
  • pool_resources_updated_timestamp (gauge): Last time the pool metrics were updated, as Unix timestamp.
  • pool_costs (counter): Overall total cost of resources used by a pool, per pool and resource type.
  • pool_image_info_count (gauge): Number of cached image info entries.
  • pool_image_info_updated_timestamp (gauge): Last time pool image info has been updated.
  • pool_flavor_info_count (gauge): Number of cached flavor info entries.
  • pool_flavor_info_updated_timestamp (gauge): Last time pool flavor info has been updated.

Pool operations

  • pool_enabled (gauge): Current enabled/disabled pool state by pool.
  • cli_calls (counter): Overall total number of CLI commands executed, per pool and command name.
  • cli_calls_exit_codes (counter): Overall total number of CLI commands exit codes, per pool, command name and exit code.
  • cli_call_duration_seconds (histogram): The time spent executing CLI commands, by pool and command name.
  • pool_errors (counter): Overall total number of pool errors, per pool and error.

Other components

TODO: List other components and where to find their metrics, plus links to their documentation - we don’t want to document their metrics here. But this will need to somehow standardise where their metrics can be found, e.g. by listing routes and checking k8s the routes are available.

PostgreSQL

RabbitMQ

Redis