clusterman¶
Clusterman autoscales Mesos clusters based on the values of user-defined signals of resource utilization. It also provides tools to manage those clusters manually and to simulate how changes to the autoscaling logic will impact cost and performance.
Overview¶
Clusterman scales the pools in each Mesos cluster independently. In other words, it treats each group of instances with the same reserved pool in a Mesos cluster as a single unit. For each pool in a cluster, Clusterman determines the total target capacity by evaluating signals. These signals are user-defined functions of metrics collected through Clusterman.
Assumptions and Definitions (aka the “Clusterman Contract”)¶
A cluster is a Mesos cluster, that is, a distributed system managed by Apache Mesos
A pool is a group of machines belonging to a cluster; a resource group is a logical grouping of machines in a pool corresponding to a cloud provider’s API (for example, an Amazon Spot Fleet Request might be a resource group in a pool). A cluster may have many pools and each pool may have many resource groups.
Resource groups must have a way of assigning weights to machines running in that resource group. These weights are used by Clusterman to determine how many resources should be added to a particular pool.
Note
It is recommended (but not required) that the weight has a consistent meaning across all the types of machines that can appear in the resource group; the definition of weight in a resource group can be arbitrarily chosen by the operator, as long as it is consistent. For example, a resource group may define 1 unit of weight to equal 50 vCPUs. Moreover, each resource group in the pool should use the same definition of weight.
An application is a Mesos framework running on a pool; note that applications can span resource groups but they cannot span pools; thus each pool has a dedicated “purpose” or “set of applications” that is managed by Clusterman.
Every pool has at most one Clusterman autoscaler, which is responsible for managing the size of that pool
How Clusterman Works¶
Pool Manager¶
Clusterman manages a group of agents in a pool through a pool manager; the pool manager consists of one or more resource group units, which represent groups of machines that can be modified together, such as via an AWS spot fleet request.
Signals¶
For each pool, Clusterman determines the target capacity by evaluating signals. A signal reports the estimated resource requirements (e.g. CPUs, memory) for an application running on that pool. Clusterman compares this estimate to the current number of resources available and changes the target capacity for the pool accordingly.
These signals are functions of metrics and may be defined per application, by the owners of that application (see How to Write a Custom Signal). Each application may define at most one signal; if no custom signal is defined for an application, a default signal defined by Clusterman will be used.
Metrics¶
Signals are functions of metrics, values collected by Clusterman over time. Clusterman uses a metrics API layer to ensure that all metric values are stored in a consistent format that can be used both for autoscaling and simulation workloads. At present, all metrics data is stored in DynamoDB.
Application owners may use the metrics library to record application-specific metrics. The clusterman service also collects a number of metrics that may be used by anyone for autoscaling signals or simulation.
Simulator¶
In addition to the live autoscaler, Clusterman comes with a simulator that allows operators to test changes to their code or signals, experiment with different parameter values on live data, or compute operating costs, all without impacting production clusters. The simulator uses the same metrics and signals as the live autoscaler, except that it does not interact with live resource groups but instead operates in a simulated environment.
Metrics¶
Metrics are used by Clusterman to record state about clusters that can be used later for autoscaling or simulation.
Clusterman uses a metrics interface API to ensure that all metric values are stored in a consistent format that can be
used both for autoscaling and simulation workloads. At present, all metric data is stored in DynamoDB and accessed
using the ClustermanMetricsBotoClient. The interface layer will also allow us to transparently change backends in the
future, if necessary.
Interacting with the Metrics Client¶
Metric Types¶
Metrics in Clusterman can be classified into one of three different types. Each metric type is stored in a separate namespace. Within each namespace, metric values are uniquely identified by their key and timestamp.
- clusterman_metrics.APP_METRICS: metrics collected from client applications (e.g., number of application runs)
- clusterman_metrics.METADATA: metrics collected about the cluster (e.g., current spot prices, instance types present)
- clusterman_metrics.SYSTEM_METRICS: metrics collected about the cluster state (e.g., CPU, memory allocation)
Application metrics are designed to be read and written by application owners to provide input into their autoscaling signals. System metrics can be read by application owners but are written by batch jobs inside the Clusterman code base. Metadata metrics are likewise written by Clusterman batch jobs, but cannot be read by application owners; they are used only for monitoring and simulation purposes.
Metric Keys¶
Metric keys have two components, a metric name and a set of dimensions. The metric key format is:
metric_name|dimension1=value1,dimension2=value2
This allows for metrics to be easily converted into SignalFX datapoints, where the metric name is used as the timeseries
name, and the dimensions are converted to SignalFX dimensions. The generate_key_with_dimensions()
helper
function will return the full metric key in its proper format. Use it to get the correct key when reading or writing
metrics.
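As a rough illustration, the following sketch builds the key for the cpus_allocated metric shown in the tables below; the import path and exact signature of generate_key_with_dimensions() are assumptions, so consult the clusterman_metrics API documentation:

from clusterman_metrics import generate_key_with_dimensions  # assumed import path

key = generate_key_with_dimensions(
    'cpus_allocated',
    {'cluster': 'norcal-prod', 'pool': 'appA_pool'},
)
# key == 'cpus_allocated|cluster=norcal-prod,pool=appA_pool'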
Reading Metrics¶
The metrics client provides a function called ClustermanMetricsBotoClient.get_metric_values()
which can be
used to query the metrics datastore.
Note
In general, signal authors should not need to read metrics through the metrics client, because the
BaseSignal
takes care of reading metrics for the signal.
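For operators who do need to query the datastore directly, a hedged sketch follows; the constructor arguments, argument order, and return shape of get_metric_values() are assumptions, so consult the API documentation:

from clusterman_metrics import SYSTEM_METRICS, ClustermanMetricsBotoClient

client = ClustermanMetricsBotoClient(region_name='us-west-2')  # assumed constructor
# Query one timeseries between two UNIX timestamps (bounds assumed inclusive)
values = client.get_metric_values(
    'cpus_allocated|cluster=norcal-prod,pool=appA_pool',
    SYSTEM_METRICS,
    1502405000,
    1502406000,
)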
Writing Metrics¶
The metrics client provides a function called ClustermanMetricsBotoClient.get_writer(); this function returns
an “enhanced generator” or coroutine (not an asyncio coroutine) which can be used to write metrics data into the
datastore. The generator pattern allows writes to be batched together, reducing the write throughput consumed in
DynamoDB. See the API documentation for how to use this generator.
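A minimal sketch of the writer pattern, assuming get_writer() is usable as a context manager and that each datapoint is sent as a (key, timestamp, value) tuple; see the API documentation for the authoritative usage:

from clusterman_metrics import APP_METRICS, ClustermanMetricsBotoClient

client = ClustermanMetricsBotoClient(region_name='us-west-2')  # assumed constructor
with client.get_writer(APP_METRICS) as writer:
    # Each send() queues one (metric key, UNIX timestamp, value) datapoint;
    # the generator batches queued writes before flushing them to DynamoDB.
    writer.send(('my_runs', 1502405756, 2))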
Example and Reference¶
DynamoDB Example Tables¶
The following tables show examples of how our data is stored in DynamoDB:
Application Metrics

| metric name | timestamp | value |
|---|---|---|
| app_A,my_runs | 1502405756 | 2 |
| app_B,my_runs | 1502405810 | 201 |
| app_B,metric2 | 1502405811 | 1.3 |

System Metrics

| metric name | timestamp | value |
|---|---|---|
| cpus_allocated\|cluster=norcal-prod,pool=appA_pool | 1502405756 | 22 |
| mem_allocated\|cluster=norcal-prod,pool=appB_pool | 1502405810 | 20 |

Metadata

| metric name | timestamp | value | <c3.xlarge, us-west-2a> | <c3.xlarge, us-west-2c> |
|---|---|---|---|---|
| spot_prices\|aws_availability_zone=us-west-2a,aws_instance_type=c3.xlarge | 1502405756 | 1.30 | | |
| spot_prices\|aws_availability_zone=us-west-2c,aws_instance_type=c3.xlarge | 1502405756 | 5.27 | | |
| fulfilled_capacity\|cluster=norcal-prod,pool=seagull | 1502409314 | | 4 | 20 |
Metric Name Reference¶
The following is a list of metric names and dimensions that Clusterman collects:
System Metrics¶
cpus_allocated|cluster=<cluster name>,pool=<pool>
mem_allocated|cluster=<cluster name>,pool=<pool>
disk_allocated|cluster=<cluster name>,pool=<pool>
Metadata Metrics¶
cpus_total|cluster=<cluster name>,pool=<pool>
disk_total|cluster=<cluster name>,pool=<pool>
fulfilled_capacity|cluster=<cluster name>,pool=<pool> (separate column per InstanceMarket)
mem_total|cluster=<cluster name>,pool=<pool>
spot_prices|aws_availability_zone=<availability zone>,aws_instance_type=<AWS instance type>
target_capacity|cluster=<cluster name>,pool=<pool>
Signals¶
Each Clusterman autoscaler instance manages the capacity for a single pool in a Mesos cluster. Clusterman determines the target capacity by evaluating signals. Signals are a function of metrics and represent the estimated resources (e.g. CPUs, memory) required by an application running on that pool. Clusterman compares this estimate to the current number of resources available and changes the target capacity for the pool accordingly (see Scaling Logic).
Signal Evaluation¶
During each autoscaling run, Clusterman evaluates each signal defined for the pool; any metrics requested by the signal are automatically read from the metrics datastore by the autoscaler, and passed in to the signal, along with any additional parameters that the signal needs in order to run. The signal then returns a resource request, indicating how many resources the application wants for the current period. Clusterman combines the resource requests from all the signals to determine how many resources to add or remove from the pool; these resource requests are subject to final capacity limits on the cluster to ensure that the pool does not ever contain too many or too few resources (which might cost extra money or impact availability).
A signal’s resource request is defined as follows:
{ "Resources": { "cpus": requested_cpus, "mem": requested_memory_in_MB, "disk": requested_disk_in_MB } }
If an application does not define its own signal, or if Clusterman is unable to load or evaluate the application’s
signal for any reason, Clusterman will fall back to using a default signal, defined in Clusterman’s own service
configuration file. See the configuration file and the clusterman namespace within the clusterman_signals package
for the latest definitions. In general, the default signal uses recent values of allocated CPUs, memory, and disk to
estimate the resources required in the future.
How to Write a Custom Signal¶
Code for custom signals should be defined in the clusterman_signals
package. Once a signal is defined there, the
Pool Configuration section below describes how Clusterman can be configured to use it for a pool.
Signal code¶
In clusterman_signals, there is a separate directory for each application (called the signal namespace). If
there is not already a namespace for your signal, create a directory within clusterman_signals and create an
__init__.py file within that directory.
Within that directory, application owners may choose how to organize signal classes within files. The only requirement
is that the signal class must be able to be imported directly from that subpackage, i.e. from clusterman_signals.poolA
import MyCustomSignal
. Typically, in the __init__.py
, you would import the class and then add it to __all__
:
from clusterman_signals.poolA.custom_signal import MyCustomSignal
...
__all__ = ['MyCustomSignal', ...]
Define a new class that implements clusterman_signals.base_signal.BaseSignal
(the class name should be
unique). In this class, you only need to overwrite the value()
method. value()
should use metric
values to return a clusterman_signals.base_signal.SignalResources
tuple, where the units
of the tuple should match the Mesos units: shares for CPUs, MB for memory and disk.
When you configure your custom signal, you specify the metric names that your signal
requires and how far back the data for each metric should be queried. The autoscaler handles the querying of metrics for
you, and passes these into the value() method, along with the current UNIX timestamp. The metrics
argument is a dictionary of metric timeseries data, keyed by the timeseries name, where each metric timeseries
is a list of (unix_timestamp_seconds, value) pairs, sorted from oldest to most recent.
The signal also has available any configuration parameters that you specified in the parameters
dict, and the
cluster and pool that the signal is operating on are available in the cluster
and pool
attributes
on the signal.
Note
For application metrics, the clusterman metrics client will automatically prepend the application name to the metric key to avoid conflicts between metrics for different applications. However, Clusterman strips this prefix from the metric name before sending it to the signal, so you do not need to handle this in your signal code.
Note
For system metrics, the metrics client will add the cluster and pool as dimensions to the metric name to
prevent conflicts between different clusters and pools. These dimensions are also stripped from the metric name
before being sent to the signal, since they are accessible via the cluster and pool attributes
on the signal.
Example¶
A custom signal class that averages cpus_allocated
values:
from clusterman_signals.base_signal import BaseSignal
from clusterman_signals.base_signal import SignalResources

class AverageCPUAllocation(BaseSignal):
    def value(self):
        cpu_values = [val for timestamp, val in self.metrics_cache['cpus_allocated']]
        average = sum(cpu_values) / len(cpu_values)
        return SignalResources(cpus=average)
And configuration for a pool, so that the autoscaler will evaluate that signal every 10 minutes, over data from the last 20 minutes:
autoscaling_signal:
name: AverageCPUAllocation
branch_or_tag: v1.0.0
period_minutes: 10
required_metrics:
- name: cpus_allocated
type: system_metrics
minute_range: 20
Under the hood (supervisord)¶
In order to ensure that the autoscaler can work with multiple clients that specify different versions of the
clusterman_signals
repo, we do not import clusterman_signals
into the autoscaler. Instead, Clusterman launches
each signal in a separate process and communicates with them over abstract Unix domain sockets. The orchestration of the signal subprocesses and the autoscaler
is performed by supervisord, a client/server system that controls the operation of all the
independent subprocesses. In turn, supervisord is controlled by an autoscaler bootstrap batch daemon. The way this
works is outlined in detail below:
1. When a new signal version is written, tagged, and pushed to master, Jenkins builds a virtual environment for that signal, creates a tarball of the virtualenv, and uploads it to S3.
2. When the autoscaler bootstrap batch starts, it reads the CMAN_CLUSTER and CMAN_POOL environment variables to determine what cluster and pool it should be operating on.
3. The autoscaler bootstrap script reads the version of the signal that should be used for this specific cluster and pool from the configuration. It sets all of the environment variables needed for supervisord to run. Once the bootstrap initialization is complete, it starts supervisord.
4. Since there may be multiple applications running on the pool, and each application can pin a different version of the signal code, we may need to download multiple versions of the signal code. The first thing supervisord does when it starts, therefore, is download all needed versions of the signal from S3, as specified in the CMAN_VERSIONS_TO_FETCH environment variable. supervisord uses a so-called homogeneous process group to fetch the signals. That is, it runs one copy of the signal-fetcher script for each version of the signal code that needs to be downloaded; it reports completion only when all of the processes in the group have completed successfully. The CMAN_NUM_VERSIONS environment variable controls the size of this process group, and each fetcher script takes %(process_num) as an argument to determine its task.
5. The autoscaler bootstrap waits for that step to complete, and then triggers supervisord to start the signal process(es) via the CMAN_SIGNAL_NAMESPACES, CMAN_SIGNAL_NAMES, and CMAN_SIGNAL_APPS environment variables. As above, supervisord runs the signals in homogeneous process groups.
6. Each signal listens for incoming connections on an abstract Unix domain socket named \0{signal_namespace}-{signal_name}-{app}-socket, where signal_namespace is the subdirectory of clusterman_signals containing the signal specified by signal_name, and app is the application running the signal (see the illustrative helper after this list).

Note
The name of the default signal is __default__.

7. If the signal process dies for any reason, supervisord will automatically restart it, and the autoscaler will attempt to reconnect on the next iteration.
8. The autoscaler bootstrap waits for that step to complete and then starts the autoscaler batch daemon, which connects to all running signals and then proceeds to autoscale the pool.
9. The autoscaler bootstrap periodically polls the configuration files in srv-configs and the AWS credentials files, and will restart the entire process if any of these files change.
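To make the socket naming scheme concrete, here is an illustrative helper (not part of the Clusterman codebase) that reproduces the name described in step 6:

def signal_socket_name(signal_namespace, signal_name, app):
    # The leading NUL byte places the socket in the Linux abstract namespace.
    return '\0{}-{}-{}-socket'.format(signal_namespace, signal_name, app)

# For example, the default signal for application 'myapp' in the
# 'clusterman' namespace would listen on:
#   signal_socket_name('clusterman', '__default__', 'myapp')
#   == '\0clusterman-__default__-myapp-socket'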
Running the Signal Process¶
To initialize the signal, run.py
is called in the clusterman_signals
repo; this script takes two command-line
arguments: the pool of the signal to load, and the name of the signal to load. The socket name is constructed from
these two parameters, which (should) guarantee that different pools communicate over different sockets. The script
then connects to the specified Unix socket and waits for the autoscaler to initialize the signal. The JSON object for
signal initialization looks like the following:
{
"cluster": what cluster this signal is operating on,
"pool": what pool this signal is operating on for the specified cluster,
"parameters": the values for any parameters from configuration that the signal should reference
}
Once the signal is properly initialized, the run.py
script waits for input from the autoscaler indefinitely. Since
metrics data could be arbitrarily large, the communication protocol for this data looks like the following:
1. First, the autoscaler must send the length of the encoded metrics data object as an unsigned integer.
2. The signal run loop must ACK the length by sending 0x1 back to the autoscaler.
3. The autoscaler then must send the actual metrics data, broken up into chunks if necessary.
4. When the signal run loop has received all data, it must ACK the data by sending 0x1 back to the autoscaler, unless the run loop detects some error in the communication; in this case, it must send 0x2 to the autoscaler.
5. If the autoscaler receives 0x2 from the signal, it will throw an exception; otherwise, it will wait for a response from the signal.
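A minimal sketch of the autoscaler side of this exchange, where sock is a connected Unix domain socket; the struct format and chunk size here are illustrative assumptions, not the exact Clusterman implementation:

import json
import struct

def send_metrics(sock, metrics_payload):
    data = json.dumps(metrics_payload).encode('utf-8')
    # 1. Send the length of the encoded data as an unsigned integer
    sock.sendall(struct.pack('I', len(data)))
    # 2. Wait for the signal to ACK the length with 0x1
    if sock.recv(1) != b'\x01':
        raise RuntimeError('signal did not ACK the metrics length')
    # 3. Send the data itself, in chunks
    for i in range(0, len(data), 4096):
        sock.sendall(data[i:i + 4096])
    # 4. 0x1 ACKs the data; 0x2 signals a communication error
    if sock.recv(1) == b'\x02':
        raise RuntimeError('signal reported a communication error')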
The metrics input data takes the form of the following JSON blob:
{
"metrics": {
"metric-name-1": [[timestamp, value1], [timestamp, value2], ...],
"metric-name-2": [[timestamp, value1], [timestamp, value2], ...],
...
}
}
In other words, the autoscaler passes in all of the required_metrics
values for the signal, which have been
collected over the last period_minutes
window for each metric. The signal then will give the following response to
the autoscaler:
{ "Resources": { "cpus": requested_cpus, "mem": requested_memory_in_MB, "disk": requested_disk_in_MB } }
The value in this response is the result from running the signal with the specified data.
supervisord Environment Variables¶
- CMAN_CLUSTER: the name of the cluster to autoscale
- CMAN_POOL: the name of the pool to autoscale
- CMAN_ARGS: any additional arguments to pass to the autoscaler batch job
- CMAN_VERSIONS_TO_FETCH: a space-separated list of signal versions to fetch from S3
- CMAN_SIGNAL_VERSIONS: a space-separated list of versions to use for each signal
- CMAN_SIGNAL_NAMESPACES: a space-separated list of namespaces to use for each signal
- CMAN_SIGNAL_NAMES: a space-separated list of signal names
- CMAN_SIGNAL_APPS: a space-separated list of applications scaled
- CMAN_NUM_VERSIONS: the number of signal versions to fetch from S3
- CMAN_NUM_SIGNALS: the number of signals to run
- CMAN_SIGNALS_BUCKET: the location of the signal artifact bucket in S3
Autoscaler Batch¶
The autoscaler controls the autoscaling function of Clusterman. It runs for each cluster and pool managed by Clusterman. Within each cluster, it evaluates the signals for each configured application. The difference between the signalled resources and the current number of resources available for the pool determines how the cluster will be scaled.
Note
Currently, Clusterman can only handle a single application per cluster.
Scaling Logic¶
Clusterman tries to maintain a certain level of resource utilization, called the setpoint. It uses the value of signals as the measure of utilization. If current utilization differs from the setpoint, Clusterman calculates a new desired target capacity for the cluster that will keep utilization at the setpoint.
Clusterman calculates the percentage difference between this desired target capacity and the current target capacity. If this value is greater than the target capacity margin, then it will add or remove resources to bring the cluster to the desired target capacity. This prevents Clusterman from scaling too frequently in response to small changes.
The setpoint and target capacity margin are configured under autoscaling
in Service Configuration.
There are also some absolute limits on scaling, e.g. the maximum units that can be added or removed at a time.
These are configured under scaling_limits
in Pool Configuration.
For example, suppose the setpoint is 0.8 and the target capacity margin is 0.1. If the total number of CPUs is 100 and the signalled number of CPUs is 96, the current level of utilization is \(96/100=0.96\), which differs from the setpoint. Suppose also that both the current target capacity and the actual capacity are 100.0. The desired target capacity is then \(100.0 * 0.96 / 0.8 = 120.0\), and the relative change is \((120.0 - 100.0)/100.0 = 0.2\), which exceeds the target capacity margin. Clusterman would therefore scale the pool to a target capacity of 120.0.
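The same arithmetic, restated as a small illustrative helper (it mirrors the prose above, not any specific Clusterman API):

def desired_target_capacity(current_target, utilization, setpoint, margin):
    # Capacity that would bring utilization back to the setpoint
    desired = current_target * utilization / setpoint
    # Only scale if the relative change exceeds the target capacity margin
    change = abs(desired - current_target) / current_target
    return desired if change > margin else current_target

# The worked example above: setpoint 0.8, margin 0.1, utilization 96/100
# desired_target_capacity(100.0, 0.96, 0.8, 0.1) -> 120.0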
Draining and Termination Logic¶
Clusterman uses a set of complex heuristics to identify hosts to terminate when scaling the cluster down. In
particular, it looks to see if hosts have joined the Mesos cluster, if they are running any tasks, and if any of the
running tasks are “critical” workloads that should not be terminated. It combines this information with information
about the resource groups in the pool, and it will work to prioritize agents to terminate based on this information,
while attempting to keep capacity balanced across each of the resource groups. See the MesosPoolManager
class for more information.
Moreover, if the draining: true
flag is set in the pool’s configuration file, Clusterman will attempt to drain tasks
off the host before terminating. This means that it will attempt to gracefully remove running tasks from the agent and
re-launch them elsewhere before terminating the host. This system is controlled by submitting hostnames to an Amazon
SQS queue; a worker watches this queue and, for each hostname, places the host in maintenance mode (which prevents new
tasks from being scheduled on the host), and then removes any running tasks on the host. Finally, the host is submitted
to a termination SQS queue, where another worker handles the final shutdown of the host.
Configuration¶
There are two levels of configuration for Clusterman. The first configures the Clusterman application or service itself, for operators of the service. The second provides per-pool configuration, for client applications to customize scaling behavior.
Service Configuration¶
The following is an example configuration file for the core Clusterman service and application:
aws:
access_key_file: /etc/boto_cfg/clusterman.json
region: us-west-1
autoscale_signal: # configures the default signal for Clusterman
name: MostRecentCPU
# What version of the signal to use (a branch or tag in the clusterman_signals Git repo)
branch_or_tag: v1.0.2
# How frequently the signal will be evaluated.
period_minutes: 10
required_metrics:
- name: cpus_allocated
type: system_metrics
# The metric will be queried for the most recent data in this range.
minute_range: 10
autoscaling:
# signal namespace for the default signal
default_signal_pool: clusterman
# Percentage utilization that Clusterman will try to maintain.
setpoint: 0.7
# Clusterman will only scale if percentage change of current and new target capacities is
# beyond this margin.
target_capacity_margin: 0.1
# Clusterman will not scale down if, since the last run, the capacity has decreased by more than
# this threshold. Note that this includes capacity removed by Clusterman in the last run and
# capacity lost for any other reason (e.g. instance failures).
prevent_scale_down_after_capacity_loss: true
instance_loss_threshold: 2
# How long to wait for an agent to "drain" before terminating it
drain_termination_timeout_seconds:
sfr: 100
batches:
cluster_metrics:
# How frequently the batch should run to collect metrics.
run_interval_seconds: 60
spot_prices:
# Max one price change for each (instance type, AZ) in this interval.
dedupe_interval_seconds: 60
# How frequently the batch should run to collect metrics.
run_interval_seconds: 60
node_migration:
# Maximum number of worker processes the batch can spawn
# (every worker can handle a single migration for a pool)
max_worker_processes: 6
# How frequently the batch should check for migration triggers.
run_interval_seconds: 60
# Number of failed attempts tolerated for event workers.
# Job is marked as failed once this number of attempts is surpassed,
# and timespan from event creation is higher than this many times
# the estimated migration time for the pool.
failed_attemps_margin: 5
clusters:
cluster-name:
aws_region: us-west-2
mesos_api_url: <Mesos cluster FQDN>
kubeconfig_path: /path/to/kubeconfig.conf
cluster_config_directory: /nail/srv/configs/clusterman-pools/
module_config:
- namespace: clog
config:
log_stream_name: clusterman
file: /nail/srv/configs/clog.yaml
initialize: yelp_servlib.clog_util.initialize
- namespace: clusterman_metrics
file: /nail/srv/configs/clusterman_metrics.yaml
- namespace: yelp_batch
config:
watchers:
- aws_key_rotation: /etc/boto_cfg/clusterman.json
- clusterman_yaml: /nail/srv/configs/clusterman.yaml
The aws
section provides the location of access credentials for the AWS API, as well as the region in which
Clusterman should operate.
The autoscale_signal
section defines the default signal for autoscaling. This signal will be used for a pool, if
that pool does not define its own autoscale_signal
section in its pool configuration.
The autoscaling
section defines settings for the autoscaling behavior of Clusterman.
The batches
section configures specific Clusterman batches, such as the autoscaler and metrics collection batches.
The clusters
section provides the location of the clusters which Clusterman knows about.
The module_config
section loads additional configuration values for Clusterman modules, such as
clusterman_metrics
.
Pool Configuration¶
To configure a pool, a directory with the cluster’s name should be created in the cluster_config_directory
defined in the service configuration. Within that directory, there should be a file named <pool>.yaml
.
The following is an example configuration file for a particular Clusterman pool:
draining:
draining_time_threshold_seconds: 1200
force_terminate: true
redraining_delay_seconds: 60
resource_groups:
- sfr:
tag: 'my-custom-resource-group-tag'
scaling_limits:
min_capacity: 1
max_capacity: 800
max_weight_to_add: 100
max_weight_to_remove: 100
max_tasks_to_kill: 100
min_node_scalein_uptime_seconds: 300
autoscale_signal:
name: CustomSignal
namespace: my_application_signal
# What version of the signal to use (a tag in the clusterman_signals Git repo)
branch_or_tag: v3.7
# How frequently the signal will be evaluated.
period_minutes: 10
required_metrics:
- name: cpus_allocated
type: system_metrics
# The metric will be queried for the most recent data in this range.
minute_range: 10
# custom parameters to be passed into the signal (optional)
parameters:
- paramA: 'typeA'
- paramB: 10
node_migration:
trigger:
max_uptime: 90d
event: true
strategy:
rate: 5
prescaling: '2%'
precedence: highest_uptime
bootstrap_wait: 5m
bootstrap_timeout: 15m
disable_autoscaling: false
expected_duration: 2h
The resource_groups
section provides information for loading resource groups in the pool manager.
The scaling_limits section provides global pool-level limits on scaling that the autoscaler and
other Clusterman commands should respect. The optional field min_node_scalein_uptime_seconds
indicates a timespan during which freshly bootstrapped nodes are deprioritized when selecting
nodes for termination.
The autoscale_signal
section defines the autoscaling signal used by this pool. This section is optional. If it is
not present, then the autoscale_signal
from the service configuration will be used.
For required metrics, there can be any number of sections, each defining one desired metric. The metric type must be one of Metric Types.
The node_migration
section contains settings controlling how Clusterman should recycle nodes
inside the pool. Enabling this configuration is useful for keeping the average uptime of your pool low and/or
for performing ad-hoc migrations of nodes according to some conditional parameter.
See Pool Configuration for all details.
Reloading¶
The Clusterman batches will automatically reload on changes to the clusterman service config file and the AWS
credentials file. This is specified in the namespace: yelp_batch
section of the main configuration file. The
autoscaler batch and the metrics collector batch also will automatically reload for changes to any pools that are
configured to run on the specified cluster.
Warning
Any changes to these configuration files will cause the signal to be reloaded by the autoscaling batch. Test your config values before pushing. If the config values break the custom signal, then the pool will start using the default signal.
Resource Groups¶
Resource groups are wrappers around cloud provider APIs that enable scaling groups of machines up and down. A resource
group implements the ResourceGroup interface, which provides the set of methods required for
Clusterman to interact with the resource group. Currently, Clusterman supports the following types of resource groups:
- AutoScalingResourceGroup: AWS autoscaling groups
- SpotFleetResourceGroup: AWS spot fleet requests
Cluster Management¶
Clusterman comes with a number of command-line tools to help with cluster management.
Discovery¶
The clusterman list-clusters and clusterman list-pools commands can aid in determining what clusters and pools Clusterman knows about.
Management¶
The clusterman manage command can be used to directly change the state of the cluster.
The --target-capacity
option allows users to directly change the size of the Mesos cluster specified by the
--cluster
and --pool
arguments.
Note that there can be up to a few minutes of “lag time” between when the manage command is issued and when
changes are reflected in the cluster. This is due to potential delays introduced into the pipeline while AWS finds and
procures new instances for the cluster. Therefore, it is not recommended to run clusterman manage
repeatedly in
short succession, or immediately after the autoscaler batch has run.
Note
Future versions of Clusterman may include a rate-limiter for the manage command
Note
By providing the existing target capacity value as the argument to --target-capacity
, you can force
Clusterman to attempt to prune any fulfilled capacity
that is above the
desired target capacity
.
Status¶
The clusterman status command provides information on the current state of the cluster.
As noted above, the state of the cluster may take a few minutes to equilibrate after a clusterman manage
command or
the autoscaler has run, so the output from clusterman status
may not accurately reflect the desired status.
Simulation¶
Running the Simulator¶
The simulator is run via the clusterman simulate command; see Sample Usage under generate-data below for an example invocation.
Experimental Input Data¶
The simulator can accept experimental input data for one or more metric timeseries using the --metrics-data-file
argument to clusterman simulate
. The simulator expects this file to be stored as a compressed (gzipped) JSON file;
the JSON schema is as follows:
{
'metric_name_1': [
[<date-time-string>, value],
[<date-time-string>, value],
...
],
'metric_name_2': [
[<date-time-string>, value],
[<date-time-string>, value],
...
],
...
}
Optional Multi-valued Timeseries Data¶
Some timeseries data needs to have multiple y-values per timestamp. The metrics data file can optionally accept timeseries in a dictionary with the dictionary keys corresponding to the names of the individual timeseries. For example:
{
'metric_a': [
[
<date-time-string>,
{
'key1': value,
'key2': value
}
],
[
<date-time-string>,
{
'key3': value
}
],
[
<date-time-string>,
{
'key1': value,
'key2': value,
'key3': value
}
]
]
}
Additional Tools¶
There are two ways to get data for simulation, if you would like to use values different from the actual values recorded in the metrics store: generate-data allows you to generate values as a function of pre-existing metrics or from a random distribution, and the SignalFX scraper allows you to query values from SignalFX.
generate-data¶
The clusterman generate-data
command is a helper function for the clusterman simulator to generate “fake” data,
either as some function of pre-existing metric data or as drawn from a specified random distribution. The command takes
as input an experimental design YAML file, and produces as output a compressed JSON file that can be directly used in a
simulation.
Note
If the output file already exists, new generated metrics will be appended to it; existing metrics in the output file that share the same name as generated metrics will be overwritten, pending user confirmation
Experimental Design File Specification¶
An experimental design file contains details for how to generate experimental metric data for use in a simulation. The specification for the experimental design is as follows:
metric_type:
metric_name:
start_time: <date-time string>
end_time: <date-time string>
frequency: <frequency specification>
values: <values specification>
dict_keys: (optional) <list of dictionary keys>
- The metric_type should be one of the Metric Types. There should be one section containing all the applicable metric names for each type.
- Each metric_name is arbitrary; it should correspond to a metric value that clusterman simulate will use when performing its simulation. Multiple metrics can be specified for a given experimental design by repeating the above block in the YAML file for each desired metric; note that if multiple metrics should follow the same data generation specification, YAML anchors and references can be used.
- The <date-time string> fields can be in a wide variety of different formats, both relative and exact. In most cases dates and times should be specified in ISO-8601 format; for example, 2017-08-03T18:08:44+00:00. However, in some cases it may be useful to specify relative times; these can be in human-readable format, for example one month ago or -12h.
- The <frequency specification> can take one of three formats:
  - Historical data: to generate values from historical values, specify historical here and follow the specification for historical values below.
  - Random data: if values will be randomly generated, then the frequency can be in one of two formats:
    - Regular intervals: by providing a <date-time string> for the frequency specification, metric values will be generated periodically; for example, a frequency of 1m will generate a new data point every minute.
    - Random intervals: to generate new metric event arrival times randomly, specify a <random generator> block for the frequency, as shown below:

      distribution: dist-function
      params:
        dist_param_a: param-value
        dist_param_b: param-value

      The dist-function should be the name of a function in the Python random module. The params are the keyword arguments for the chosen function. All parameter values relating to time should be defined in seconds; for example, if gauss is chosen for the distribution function, the units for the mean and standard deviation should be seconds.

Note
A common choice for the dist-function is expovariate, which creates exponentially-distributed interarrival times, a.k.a. a Poisson process; this is a good baseline model for the arrival times of real-world data (see the short snippet after this list).

- Similarly, the <values specification> can take one of two formats:
  - Function of historical data: historical values can be linearly transformed by \(ax+b\). Specify the following block:

      aws_region: <AWS region to read historical data from>
      params:
        a: <value>
        b: <value>

  - Random values: for this mode, specify a <random generator> block as shown above for frequency.
- The dict_keys field takes a list of strings which are used to generate a single timeseries with (potentially) multiple data points per time value. For example, given the following dict_keys configuration:

      metric_a:
        dict_keys:
          - key1
          - key2
          - key3

  the resulting generated data for metric_a might look something like the example in the Optional Multi-valued Timeseries Data format.
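For intuition, the following illustrative snippet shows what an expovariate frequency specification produces: event times whose interarrival gaps are drawn from Python's random.expovariate:

import random

# One hour of simulated event times (in seconds), with exponentially
# distributed gaps averaging ~300s (lambd = 1/300, as in the sample
# design file below).
t = 0.0
timestamps = []
while t < 3600:
    t += random.expovariate(0.0033333)
    timestamps.append(t)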
Output Format¶
The generate-data
command produces a compressed JSON containing the generated metric data. The format for this file
is identical to the simulator’s Experimental Input Data format.
Sample Usage¶
drmorr ~ > clusterman generate-data --input design.yaml --output metrics.json.gz
Random Seed: 12345678
drmorr ~ > clusterman simulate --metrics-data-file metrics.json.gz \
> --start-time "2017-08-01T08:00:00+00:00" --end-time "2017-08-01T08:10:00+00:00"
=== Event 0 -- 2017-08-01T08:00:00+00:00 [Simulation begins]
=== Event 2 -- 2017-08-01T08:00:00+00:00 [SpotPriceChangeEvent]
=== Event 28 -- 2017-08-01T08:00:00+00:00 [SpotPriceChangeEvent]
=== Event 21 -- 2017-08-01T08:00:00+00:00 [SpotPriceChangeEvent]
=== Event 22 -- 2017-08-01T08:02:50+00:00 [SpotPriceChangeEvent]
=== Event 3 -- 2017-08-01T08:05:14+00:00 [SpotPriceChangeEvent]
=== Event 23 -- 2017-08-01T08:06:04+00:00 [SpotPriceChangeEvent]
=== Event 0 -- 2017-08-01T08:00:00+00:00 [Simulation ends]
Sample Experimental Design File¶
metadata:
spot_prices|aws_availability_zone=us-west-2a,aws_instance_type=c3.8xlarge: &spot_prices
# If no timezone is specified, generator will use YST
start_time: "2017-12-01T08:00:00Z"
end_time: "2017-12-01T09:00:00Z"
frequency:
distribution: expovariate
params:
lambd: 0.0033333 # Assume prices change on average every five minutes
values:
distribution: uniform
params:
a: 0
b: 1
spot_prices|aws_availability_zone=us-west-2b,aws_instance_type=c3.8xlarge: *spot_prices
spot_prices|aws_availability_zone=us-west-2c,aws_instance_type=c3.8xlarge: *spot_prices
capacity|cluster=norcal-prod,role=seagull:
start_time: "2017-12-01T08:00:00Z"
end_time: "2017-12-01T09:00:00Z"
dict_keys:
- c3.8xlarge,us-west-2a
- c3.8xlarge,us-west-2b
- c3.8xlarge,us-west-2c
frequency:
distribution: expovariate
params:
lambd: 0.001666 # Assume capacity changes on average every ten minutes
values:
distribution: randint
params:
a: 10
b: 50
app_metrics:
seagull_runs:
start_time: "2017-12-01T08:00:00Z"
end_time: "2017-12-01T09:00:00Z"
frequency:
distribution: expovariate
params:
lambd: 0.0041666 # 15 seagull runs per hour
values: 1
system_metrics:
cpu_allocation|cluster=everywhere-testopia,role=jolt:
start_time: "2017-12-01T08:00:00Z"
end_time: "2017-12-01T09:00:00Z"
frequency: historical
values:
aws_region: "us-west-2"
params: # calculate value by a*x + b
a: 1.5
b: 10
The above design file and a sample output file are located in docs/examples/design.yaml and
docs/examples/metrics.json.gz, respectively.
SignalFX scraper¶
A tool for downloading data points from SignalFX and saving them in the compressed JSON format that the Clusterman simulator can use. This is an alternative to generating data if the data you’re interested in is in SignalFX, but it’s not yet in Clusterman metrics.
Note
Only data from the last month is available from SignalFX.
The tool will interactively ask you the metric type to save each metric as.
Sample usage:
python -m clusterman.tools.signalfx_scraper --start-time 2017-12-03 --end-time 2017-12-10 \
--src-metric-names 'seagull.fleet_miser.cluster_capacity_units' --dest-file capacity \
--api-token <secret> --filter rollup:max region:uswest2-testopia cluster_name:releng
Node Migration¶
Node Migration is a functionality which allows Clusterman to recycle nodes of a pool according to various criteria, in order to reduce the amount of manual work necessary when performing infrastructure migrations.
NOTE: this is only compatible with Kubernetes clusters.
Node Migration Batch¶
The Node Migration batch is the entrypoint of the migration logic. It takes care of fetching migration trigger events, spawning the worker processes that actually perform the node recycling procedures, and monitoring their health.
Batch specific configuration values are described as part of the main service configuration in Service Configuration.
The batch code can be invoked from the clusterman.batch.node_migration
Python module.
Pool Configuration¶
The behaviour of the migration logic for a pool is controlled by the node_migration
section of the pool configuration.
The allowed values for the migration settings are as follows:
- trigger:
  - max_uptime: if set, monitor nodes’ uptime to ensure it stays lower than the provided value; human-readable time string (e.g. 30d).
  - event: if set to true, accept async migration triggers for this pool; details about event triggers are described below in Migration Event Trigger.
- strategy:
  - rate: rate at which nodes are selected for termination; percentage or absolute value (required).
  - prescaling: if set, pool size (in nodes) is increased by this amount before performing node recycling; percentage or absolute value (0 by default). This directly sets a capacity value for the pool if autoscaling is disabled, or applies a temporary capacity offset otherwise.
  - precedence: precedence with which nodes are selected for termination:
    - highest_uptime: select older nodes first (default);
    - lowest_task_count: select nodes with fewer running tasks first;
    - az_name_alphabetical: group nodes by availability zone, and select groups in alphabetical order.
- bootstrap_wait: indicative time necessary for a node to be ready to run workloads after boot; human-readable time string (3 minutes by default).
- bootstrap_timeout: maximum wait for nodes to be ready after boot; human-readable time string (10 minutes by default).
- allowed_failed_drains: allow up to this many nodes to fail draining and be requeued before aborting (3 by default).
- disable_autoscaling: turn off the autoscaler while recycling instances (false by default).
- ignore_pod_health: avoid loading and checking pod information to determine pool health (false by default).
- health_check_interval: how long to wait between checks when monitoring pool health (2 minutes by default).
- orphan_capacity_tollerance: acceptable ratio of orphan capacity over target capacity to still consider the pool healthy (float, 0 by default, max 0.2).
- max_uptime_worker_skips: maximum number of times the uptime monitoring worker can skip churning nodes due to unsatisfied pool capacity (6 by default; set to 0 to always allow skipping).
- expected_duration: estimated duration for the migration of the whole pool; human-readable time string (1 day by default).
See Pool Configuration for an example configuration block.
Migration Event Trigger¶
Migration trigger events are submitted as Kubernetes custom resources of type nodemigration
.
They can be easily generated and submitted by using the clusterman migrate CLI sub-command and its related options.
In case jobs for a pool need to be stopped, it is possible to use the clusterman migrate-stop
utility.
The manifest for the custom resource definition is as follows:
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: nodemigrations.clusterman.yelp.com
spec:
scope: Cluster
group: clusterman.yelp.com
names:
plural: nodemigrations
singular: nodemigration
kind: NodeMigration
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
required:
- spec
properties:
spec:
type: object
required:
- cluster
- pool
- condition
properties:
cluster:
type: string
pool:
type: string
label_selectors:
type: array
items:
type: string
condition:
type: object
properties:
trait:
type: string
enum: [kernel, lsbrelease, instance_type, uptime]
target:
type: string
operator:
type: string
enum: [gt, ge, eq, ne, lt, le, in, notin]
In more readable terms, an example resource manifest would look like:
---
apiVersion: "clusterman.yelp.com/v1"
kind: NodeMigration
metadata:
name: my-test-migration-220912
labels:
clusterman.yelp.com/migration_status: pending
spec:
cluster: kubestage
pool: default
condition:
trait: uptime
operator: lt
target: 90d
The fields in each migration event control which nodes are affected by the event and what the desired final condition for them is. More specifically:

- cluster: name of the cluster to be targeted.
- pool: name of the pool to be targeted.
- label_selectors: list of additional Kubernetes label selectors to filter affected nodes.
- condition: the desired final state for the nodes, e.g. all nodes must have a kernel version higher than X.
  - trait: metadata to be compared; currently supports kernel, lsbrelease, instance_type, or uptime.
  - operator: comparison operator; supports gt, ge, eq, ne, lt, le, in, notin.
  - target: right side of the comparison expression, e.g. a kernel version or an instance type; may be a single string or a comma-separated list when using the in/notin operators.
Drainer¶
The drainer is the component that drains pods off a node before it is terminated. It may drain and terminate nodes for three reasons:

- spot_interruption
- node_migration
- scaling_down
NOTE: all settings are only compatible with Kubernetes clusters.
Drainer Batch¶
The Drainer batch is the entrypoint of the draining logic.
The batch code can be invoked from the clusterman.batch.drainer
Python module.
Pool Configuration¶
The behaviour of the drainer logic for a pool is controlled by the draining
section of the pool configuration.
The allowed values for the drainer settings are as follows:
- draining_time_threshold_seconds: maximum time to complete the draining process (1800 by default).
- redraining_delay_seconds: how long to wait between draining attempts if a drain fails (15 by default).
- force_terminate: forcibly terminate the node after draining_time_threshold_seconds is reached (false by default).
See Pool Configuration for an example configuration block.