Metrics

Clusterman uses metrics to record cluster state that can later be used for autoscaling or simulation.

Clusterman uses a metrics interface API to ensure that all metric values are stored in a consistent format that can be used for both autoscaling and simulation workloads. At present, all metric data is stored in DynamoDB and accessed using the ClustermanMetricsBotoClient. The interface layer allows us to change backends transparently in the future if necessary.

Interacting with the Metrics Client

Metric Types

Every metric in Clusterman belongs to one of three types. Each metric type is stored in a separate namespace. Within each namespace, metric values are uniquely identified by their key and timestamp.

clusterman_metrics.APP_METRICS: metrics collected from client applications (e.g., number of application runs)

clusterman_metrics.METADATA: metrics collected about the cluster (e.g., current spot prices, instance types present)

clusterman_metrics.SYSTEM_METRICS: metrics collected about the cluster state (e.g., CPU, memory allocation)

Application metrics are designed to be read and written by the application owners to provide input into their autoscaling signals. System metrics can be read by application owners, but are written by batch jobs inside the Clusterman code base. Metadata metrics cannot be read by application owners and are only used for monitoring and simulation purposes.
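The namespace and uniqueness rules above can be pictured with a small in-memory model. This is only an illustrative sketch: the MetricStore class and the namespace strings are hypothetical stand-ins, not the DynamoDB schema or the actual values of the clusterman_metrics constants.

```python
from collections import defaultdict

# Hypothetical namespace labels; the real values of the
# clusterman_metrics constants are not assumed here.
APP_METRICS = "app_metrics"
METADATA = "metadata"
SYSTEM_METRICS = "system_metrics"


class MetricStore:
    """Toy store: one namespace per metric type, values keyed by (key, timestamp)."""

    def __init__(self):
        self._namespaces = defaultdict(dict)

    def put(self, metric_type, key, timestamp, value):
        # In this toy model, writing the same (key, timestamp) pair again
        # replaces the earlier value.
        self._namespaces[metric_type][(key, timestamp)] = value

    def get(self, metric_type, key, timestamp):
        # Returns None when the namespace holds no value for this pair.
        return self._namespaces[metric_type].get((key, timestamp))


store = MetricStore()
cpu_key = "cpus_allocated|cluster=norcal-prod,pool=appA_pool"
store.put(SYSTEM_METRICS, cpu_key, 1502405756, 22)
store.put(APP_METRICS, "app_A,my_runs", 1502405756, 2)
```

Because each type has its own namespace, looking up cpu_key under APP_METRICS finds nothing even though the same key and timestamp exist under SYSTEM_METRICS.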

Metric Keys

Metric keys have two components: a metric name and a set of dimensions. The metric key format is:

metric_name|dimension1=value1,dimension2=value2

This format allows metrics to be easily converted into SignalFX datapoints: the metric name is used as the timeseries name, and the dimensions are converted to SignalFX dimensions. The generate_key_with_dimensions() helper function returns the full metric key in its proper format. Use it to get the correct key when reading or writing metrics.
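To make the key format concrete, here is a minimal sketch of composing and parsing such a key. The functions build_metric_key and parse_metric_key are hypothetical illustrations, not the library's generate_key_with_dimensions(); sorting the dimensions and returning the bare name when there are no dimensions are assumptions of this sketch.

```python
def build_metric_key(metric_name, dimensions=None):
    """Compose 'metric_name|dim1=value1,dim2=value2' (hypothetical helper)."""
    if not dimensions:
        # Assumption: a metric without dimensions is keyed by its name alone.
        return metric_name
    # Assumption: sort dimensions so that equal inputs always give equal keys.
    dims = ",".join(f"{name}={value}" for name, value in sorted(dimensions.items()))
    return f"{metric_name}|{dims}"


def parse_metric_key(key):
    """Split a key back into (metric_name, dimensions dict)."""
    metric_name, _, dim_part = key.partition("|")
    if not dim_part:
        return metric_name, {}
    dimensions = dict(item.split("=", 1) for item in dim_part.split(","))
    return metric_name, dimensions


key = build_metric_key("cpus_allocated", {"pool": "appA_pool", "cluster": "norcal-prod"})
print(key)  # cpus_allocated|cluster=norcal-prod,pool=appA_pool
```

In real code, prefer the library helper so that keys stay consistent with what Clusterman itself writes.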

Reading Metrics

The metrics client provides a function called ClustermanMetricsBotoClient.get_metric_values(), which can be used to query the metrics datastore.

Note

In general, signal authors should not need to read metrics through the metrics client, because the BaseSignal takes care of reading metrics for the signal.

Writing Metrics

The metrics client provides a function called ClustermanMetricsBotoClient.get_writer(). This function returns an “enhanced generator” or coroutine (not an asyncio coroutine) which can be used to write metrics data into the datastore. The generator pattern allows writes to be batched together, which reduces the throughput capacity consumed in DynamoDB. See the API documentation for how to use this generator.
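For readers unfamiliar with the pattern, the following sketch shows how a generator-based coroutine can batch writes. It is not the implementation behind get_writer(): the batched_writer function, the list used as a stand-in datastore, the batch size, and the (key, timestamp, value) tuple shape are all assumptions made for illustration. Whether the real writer needs priming or flushes on close is a detail to check in the API documentation.

```python
def batched_writer(sink, batch_size=2):
    """Generator-based coroutine: receive datapoints via send(), flush in batches."""
    batch = []
    try:
        while True:
            datapoint = yield  # paused here until the caller calls send()
            batch.append(datapoint)
            if len(batch) >= batch_size:
                # One "request" per full batch instead of one per datapoint.
                sink.append(list(batch))
                batch.clear()
    finally:
        # close() raises GeneratorExit at the yield; flush any leftovers.
        if batch:
            sink.append(list(batch))


flushed_batches = []  # stand-in for the datastore
writer = batched_writer(flushed_batches, batch_size=2)
next(writer)  # prime the generator so it advances to the first yield

writer.send(("cpus_allocated|cluster=norcal-prod,pool=appA_pool", 1502405756, 22))
writer.send(("mem_allocated|cluster=norcal-prod,pool=appB_pool", 1502405810, 20))
writer.send(("disk_allocated|cluster=norcal-prod,pool=appB_pool", 1502405811, 18))
writer.close()  # triggers the final flush of the partial batch
```

Three datapoints are sent, but the stand-in datastore receives only two writes: one full batch of two and one final batch of one.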

Example and Reference

DynamoDB Example Tables

The following tables show examples of how our data is stored in DynamoDB:

Application Metrics
metric name	timestamp	value
app_A,my_runs	1502405756	2
app_B,my_runs	1502405810	201
app_B,metric2	1502405811	1.3

System Metrics
metric name	timestamp	value
cpus_allocated|cluster=norcal-prod,pool=appA_pool	1502405756	22
mem_allocated|cluster=norcal-prod,pool=appB_pool	1502405810	20

Metadata
metric name	timestamp	value	<c3.xlarge, us-west-2a>	<c3.xlarge, us-west-2c>
spot_prices|aws_availability_zone=us-west-2a,aws_instance_type=c3.xlarge	1502405756	1.30
spot_prices|aws_availability_zone=us-west-2c,aws_instance_type=c3.xlarge	1502405756	5.27
fulfilled_capacity|cluster=norcal-prod,pool=seagull	1502409314		4	20

Metric Name Reference

The following is a list of metric names and dimensions that Clusterman collects:

System Metrics

cpus_allocated|cluster=<cluster name>,pool=<pool>
mem_allocated|cluster=<cluster name>,pool=<pool>
disk_allocated|cluster=<cluster name>,pool=<pool>

Metadata Metrics

cpus_total|cluster=<cluster name>,pool=<pool>
disk_total|cluster=<cluster name>,pool=<pool>
fulfilled_capacity|cluster=<cluster name>,pool=<pool> (separate column per InstanceMarket)
mem_total|cluster=<cluster name>,pool=<pool>
spot_prices|aws_availability_zone=<availability zone>,aws_instance_type=<AWS instance type>
target_capacity|cluster=<cluster name>,pool=<pool>