clusterman

Clusterman autoscales Mesos clusters based on the values of user-defined signals of resource utilization. It also provides tools to manually manage those clusters, and to simulate how changes to the autoscaling logic will impact cost and performance.

Overview

Clusterman scales the pools in each Mesos cluster independently; that is, it treats each group of instances reserved for the same pool in a Mesos cluster as a single unit. For each pool in a cluster, Clusterman determines the total target capacity by evaluating signals. These signals are user-defined functions of metrics collected through Clusterman.

Assumptions and Definitions (aka the “Clusterman Contract”)

  1. A cluster is a Mesos cluster, that is, a distributed system managed by Apache Mesos.

  2. A pool is a group of machines belonging to a cluster; a resource group is a logical grouping of machines in a pool corresponding to a cloud provider’s API (for example, an Amazon Spot Fleet Request might be a resource group in a pool). A cluster may have many pools and each pool may have many resource groups.

  3. Resource groups must have a way of assigning weights to machines running in that resource group. These weights are used by Clusterman to determine how many resources should be added to a particular pool.

    Note

    It is recommended (but not required) that the weight has a consistent meaning across all the types of machines that can appear in the resource group; the definition of weight in a resource group can be arbitrarily chosen by the operator, as long as it is consistent. For example, a resource group may define 1 unit of weight to equal 50 vCPUs. Moreover, each resource group in the pool should use the same definition of weight.

  4. An application is a Mesos framework running on a pool; note that applications can span resource groups but they cannot span pools; thus each pool has a dedicated “purpose” or “set of applications” that is managed by Clusterman.

  5. Every pool has at most one Clusterman autoscaler, which is responsible for managing the size of that pool.

How Clusterman Works

Pool Manager

Clusterman manages a group of agents in a pool through a pool manager; the pool manager consists of one or more resource group units, which represent groups of machines that can be modified together, such as via an AWS spot fleet request.

Signals

For each pool, Clusterman determines the target capacity by evaluating signals. A signal reports the estimated resource requirements (e.g., CPUs, memory) for an application running on that pool. Clusterman compares this estimate to the resources currently available and changes the target capacity for the pool accordingly.

These signals are functions of metrics and may be defined per application, by the owners of that application (see How to Write a Custom Signal). Each application may define exactly one signal; if no custom signal is defined for an application, a default signal defined by Clusterman is used.

Metrics

Signals are functions of metrics, values collected by Clusterman over time. Clusterman uses a metrics API layer to ensure that all metric values are stored in a consistent format that can be used both for autoscaling and simulation workloads. At present, all metrics data is stored in DynamoDB.

Application owners may use the metrics library to record application-specific metrics. The clusterman service also collects a number of metrics that may be used by anyone for autoscaling signals or simulation.

Simulator

In addition to the live autoscaler, Clusterman comes with a simulator that allows operators to test changes to their code or signals, experiment with different parameter values on live data, or compute operating costs, all without impacting production clusters. The simulator uses the same metrics and signals as the live autoscaler, except that it does not interact with live resource groups but instead operates in a simulated environment.

Metrics

Metrics are used by Clusterman to record state about clusters that can be used later for autoscaling or simulation.

Clusterman uses a metrics interface API to ensure that all metric values are stored in a consistent format that can be used both for autoscaling and simulation workloads. At present, all metric data is stored in DynamoDB and accessed using the ClustermanMetricsBotoClient; the interface layer allows us to transparently change backends in the future if necessary.

Interacting with the Metrics Client

Metric Types

Metrics in Clusterman can be classified into one of three different types. Each metric type is stored in a separate namespace. Within each namespace, metric values are uniquely identified by their key and timestamp.

clusterman_metrics.APP_METRICS

metrics collected from client applications (e.g., number of application runs)

clusterman_metrics.METADATA

metrics collected about the cluster (e.g., current spot prices, instance types present)

clusterman_metrics.SYSTEM_METRICS

metrics collected about the cluster state (e.g., CPU, memory allocation)

Application metrics are designed to be read and written by application owners to provide input into their autoscaling signals. System metrics can be read by application owners, but are written by batch jobs inside the Clusterman code base. Metadata metrics cannot be read or written by application owners; they are only used for monitoring and simulation purposes.

Metric Keys

Metric keys have two components, a metric name and a set of dimensions. The metric key format is:

metric_name|dimension1=value1,dimension2=value2

This allows metrics to be easily converted into SignalFX datapoints: the metric name is used as the timeseries name, and the dimensions are converted to SignalFX dimensions. The generate_key_with_dimensions() helper function returns the full metric key in the proper format; use it to get the correct key when reading or writing metrics.
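
For example, a sketch of composing a key with this helper (the import path and the dictionary argument are assumptions; check the clusterman_metrics package for the real location and signature):

from clusterman_metrics import generate_key_with_dimensions  # import path assumed

# Builds 'cpus_allocated|cluster=norcal-prod,pool=appA_pool';
# dimensions passed as a dict (assumed)
key = generate_key_with_dimensions(
    'cpus_allocated',
    {'cluster': 'norcal-prod', 'pool': 'appA_pool'},
)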

Reading Metrics

The metrics client provides a function called ClustermanMetricsBotoClient.get_metric_values() which can be used to query the metrics datastore.

Note

In general, signal authors should not need to read metrics through the metrics client, because the BaseSignal takes care of reading metrics for the signal.
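
For operators or tools that do query the datastore directly, a usage sketch might look like the following; the constructor arguments and the get_metric_values() argument names and order are assumptions, so check the API documentation for the real signatures:

import time
from clusterman_metrics import ClustermanMetricsBotoClient, SYSTEM_METRICS  # import path assumed

client = ClustermanMetricsBotoClient(region_name='us-west-1')  # constructor arguments assumed
end = int(time.time())
# Query the last 10 minutes of a system metric; argument names are illustrative.
values = client.get_metric_values(
    'cpus_allocated|cluster=norcal-prod,pool=appA_pool',
    SYSTEM_METRICS,
    end - 600,
    end,
)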

Writing Metrics

The metrics client provides a function called ClustermanMetricsBotoClient.get_writer(); this function returns an “enhanced generator” or coroutine (not an asyncio coroutine) which can be used to write metrics data into the datastore. The generator pattern allows writes to be batched together, reducing the throughput capacity consumed in DynamoDB. See the API documentation for how to use this generator.
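
A hedged sketch of the batched-write pattern (whether get_writer() is used as a context manager, and the exact tuple format passed to send(), are assumptions; consult the API documentation):

import time
from clusterman_metrics import APP_METRICS, ClustermanMetricsBotoClient  # import path assumed

client = ClustermanMetricsBotoClient(region_name='us-west-1')  # constructor arguments assumed
with client.get_writer(APP_METRICS) as writer:  # context-manager usage assumed
    # Batching several send() calls reduces the write throughput consumed in DynamoDB.
    writer.send(('app_A,my_runs', int(time.time()), 1))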

Example and Reference

DynamoDB Example Tables

The following tables show examples of how our data is stored in DynamoDB:

Application Metrics

metric name      timestamp    value
app_A,my_runs    1502405756   2
app_B,my_runs    1502405810   201
app_B,metric2    1502405811   1.3

System Metrics

metric name                                         timestamp    value
cpus_allocated|cluster=norcal-prod,pool=appA_pool   1502405756   22
mem_allocated|cluster=norcal-prod,pool=appB_pool    1502405810   20

Metadata

metric name                                                                timestamp    value
spot_prices|aws_availability_zone=us-west-2a,aws_instance_type=c3.xlarge   1502405756   1.30
spot_prices|aws_availability_zone=us-west-2c,aws_instance_type=c3.xlarge   1502405756   5.27
fulfilled_capacity|cluster=norcal-prod,pool=seagull                        1502409314   <c3.xlarge, us-west-2a>: 4, <c3.xlarge, us-west-2c>: 20

Metric Name Reference

The following is a list of metric names and dimensions that Clusterman collects:

System Metrics
  • cpus_allocated|cluster=<cluster name>,pool=<pool>

  • mem_allocated|cluster=<cluster name>,pool=<pool>

  • disk_allocated|cluster=<cluster name>,pool=<pool>

Metadata Metrics
  • cpus_total|cluster=<cluster name>,pool=<pool>

  • disk_total|cluster=<cluster name>,pool=<pool>

  • fulfilled_capacity|cluster=<cluster name>,pool=<pool> (separate column per InstanceMarket)

  • mem_total|cluster=<cluster name>,pool=<pool>

  • spot_prices|aws_availability_zone=<availability zone>,aws_instance_type=<AWS instance type>

  • target_capacity|cluster=<cluster name>,pool=<pool>

Signals

Each Clusterman autoscaler instance manages the capacity for a single pool in a Mesos cluster. Clusterman determines the target capacity by evaluating signals. Signals are a function of metrics and represent the estimated resources (e.g. CPUs, memory) required by an application running on that pool. Clusterman compares this estimate to the current number of resources available and changes the target capacity for the pool accordingly (see Scaling Logic).

Signal Evaluation

During each autoscaling run, Clusterman evaluates each signal defined for the pool. Any metrics requested by a signal are automatically read from the metrics datastore by the autoscaler and passed in to the signal, along with any additional parameters the signal needs in order to run. The signal then returns a resource request, indicating how many resources the application wants for the current period. Clusterman combines the resource requests from all the signals to determine how many resources to add to or remove from the pool; these resource requests are subject to final capacity limits on the cluster, which ensure that the pool never contains too many or too few resources (which might cost extra money or impact availability).

A signal’s resource request is defined as follows:

{
  "Resources": {
    "cpus": requested_cpus,
    "mem": requested_memory_in_MB,
    "disk": requested_disk_in_MB
  }
}

If an application does not define its own signal, or if Clusterman is unable to load or evaluate the application’s signal for any reason, Clusterman will fall back to using a default signal, defined in Clusterman’s own service configuration file. See the configuration file and the clusterman namespace within clusterman_signals package for the latest definitions. In general, the default signal uses recent values of allocated CPUs, memory, and disk to estimate the resources required in the future.

How to Write a Custom Signal

Code for custom signals should be defined in the clusterman_signals package. Once a signal is defined there, the Pool Configuration section below describes how Clusterman can be configured to use it for a pool.

Signal code

In clusterman_signals, there is a separate directory for each application (called the signal namespace). If there is not already a namespace for your signal, create a directory within clusterman_signals and add an __init__.py file within that directory.

Within that directory, application owners may choose how to organize signal classes within files. The only requirement is that the signal class must be able to be imported directly from that subpackage, i.e. from clusterman_signals.poolA import MyCustomSignal. Typically, in the __init__.py, you would import the class and then add it to __all__:

from clusterman_signals.poolA.custom_signal import MyCustomSignal
...

__all__ = ['MyCustomSignal', ...]

Define a new class that implements clusterman_signals.base_signal.BaseSignal (the class name should be unique). In this class, you only need to override the value() method. value() should use metric values to return a clusterman_signals.base_signal.SignalResources tuple, where the units of the tuple match the Mesos units: shares for CPUs, MB for memory and disk.

When you configure your custom signal, you specify the metric names that your signal requires and how far back the data for each metric should be queried. The autoscaler handles querying the metrics for you and passes them into the value() method, along with the current UNIX timestamp. The metrics argument is a dictionary of metric timeseries data, keyed by timeseries name, where each metric timeseries is a list of (unix_timestamp_seconds, value) pairs sorted from oldest to most recent.
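
For illustration, the metrics argument for a signal requiring cpus_allocated might be shaped like this (the values are made up):

metrics = {
    'cpus_allocated': [
        (1502405756, 22),   # (unix_timestamp_seconds, value), oldest first
        (1502405816, 24),
    ],
}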

The signal also has available any configuration parameters that you specified in the parameters dict, and the cluster and pool that the signal is operating on are available in the cluster and pool attributes on the signal.

Note

For application metrics, the clusterman metrics client will automatically prepend the application name to the metric key to avoid conflicts between metrics for different applications. However, Clusterman strips this prefix from the metric name before sending it to the signal, so you do not need to handle this in your signal code.

Note

For system metrics, the metrics client will add the cluster and pool as dimensions to the metric name to prevent conflicts between different clusters and pools. These dimensions are also stripped from the metric name before being sent to the signal, since they are accessible via the cluster and pool attributes on the signal.

Example

A custom signal class that averages cpus_allocated values:

from clusterman_signals.base_signal import BaseSignal
from clusterman_signals.base_signal import SignalResources

class AverageCPUAllocation(BaseSignal):

    def value(self):
        # self.metrics_cache holds the queried metric timeseries, keyed by metric name
        cpu_values = [val for timestamp, val in self.metrics_cache['cpus_allocated']]
        average = sum(cpu_values) / len(cpu_values)
        return SignalResources(cpus=average)

And the configuration for a pool, so that the autoscaler evaluates this signal every 10 minutes, over data from the last 20 minutes:

autoscaling_signal:
    name: AverageCPUAllocation
    branch_or_tag: v1.0.0
    period_minutes: 10
    required_metrics:
        - name: cpus_allocated
          type: system_metrics
          minute_range: 20

Under the hood (supervisord)

In order to ensure that the autoscaler can work with multiple clients that specify different versions of the clusterman_signals repo, we do not import clusterman_signals into the autoscaler. Instead, Clusterman launches each signal in a separate process and communicates with them over abstract Unix domain sockets. The orchestration of the signal subprocesses and the autoscaler is performed by supervisord, a client/server system that controls the operation of all the independent subprocesses. In turn, supervisord is controlled by an autoscaler bootstrap batch daemon. The way this works is outlined in detail below:

  1. When a new signal version is written, tagged, and pushed to master, Jenkins builds a virtual environment for that signal, creates a tarball of the virtualenv, and uploads it to S3.

  2. When the autoscaler bootstrap batch starts, it reads the CMAN_CLUSTER and CMAN_POOL environment variables to determine what cluster and pool it should be operating on.

  3. The autoscaler bootstrap script reads the version of the signal that should be used for this specific cluster and pool from the configuration. It sets all of the environment variables needed for supervisord to run. Once the bootstrap initialization is complete, it starts supervisord.

  4. Since there may be multiple applications running on the pool, and each application can pin a different version of the signal code, we may need to download multiple different versions of the signal code. The first thing supervisord does when it starts, therefore, is to download all needed versions of the signal from S3 as specified in the CMAN_VERSIONS_TO_FETCH environment variable.

    supervisord uses a so-called homogeneous process group to fetch the signals. That is, it runs one copy of the signal-fetcher script for each version of the signal code that needs to be downloaded; it reports completion only when all of the processes in the group have completed successfully. The CMAN_NUM_VERSIONS environment variable controls the size of this process group, and each fetcher script takes %(process_num) as an argument to determine its task.

  5. The autoscaler bootstrap waits for that step to complete, and then triggers supervisord to start the signal process(es) running via the CMAN_SIGNAL_NAMESPACES, CMAN_SIGNAL_NAMES, and CMAN_SIGNAL_APPS environment variables. As above, supervisord runs the signals in homogeneous process groups.

    Each signal listens for incoming connections on an abstract Unix domain socket named \0{signal_namespace}-{signal_name}-{app}-socket, where signal_namespace is the subdirectory of clusterman_signals containing the signal specified by signal_name, and app is the application running the signal.

    Note

    the name of the default signal is __default__

    If the signal process dies for any reason, supervisord will automatically restart it, and the autoscaler will attempt to reconnect on the next iteration.

  6. The autoscaler bootstrap waits for that step to complete and then starts the autoscaler batch daemon, which connects to all running signals and then proceeds to autoscale the pool.

  7. The autoscaler bootstrap periodically polls the configuration files in srv-configs and the AWS key files, and will restart the entire process if any of these files change.

Running the Signal Process

To initialize the signal, run.py is called in the clusterman_signals repo; this script takes two command-line arguments: the pool of the signal to load, and the name of the signal to load. The socket name is constructed from these two parameters, which (should) guarantee that different pools communicate over different sockets. The script then connects to the specified Unix socket and waits for the autoscaler to initialize the signal. The JSON object for signal initialization looks like the following:

{
    "cluster": what cluster this signal is operating on,
    "pool": what pool this signal is operating on for the specified cluster,
    "parameters": the values for any parameters from configuration that the signal should reference
}

Once the signal is properly initialized, the run.py script waits for input from the autoscaler indefinitely. Since metrics data could be arbitrarily large, the communication protocol for this data looks like the following:

  1. First the autoscaler must send the length of the encoded metrics data object as an unsigned integer

  2. The signal run loop must ACK the length by sending 0x1 back to the autoscaler

  3. The autoscaler then must send the actual metrics data, broken up into chunks if necessary

  4. When the signal run loop has received all data, it must ACK the data by sending 0x1 back to the autoscaler, unless the run loop detects some error in the communication; in this case, it must send 0x2 to the autoscaler

  5. If the autoscaler receives 0x2 from the signal, it will throw an exception; otherwise, it will wait for a response from the signal
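
A minimal sketch of the autoscaler's side of this exchange, assuming sock is already connected to the signal's socket (the chunk size and the exact struct format are illustrative assumptions):

import json
import socket
import struct

def send_metrics(sock: socket.socket, metrics: dict, chunk_size: int = 4096) -> None:
    payload = json.dumps({'metrics': metrics}).encode()
    sock.sendall(struct.pack('I', len(payload)))    # 1. send the payload length
    if sock.recv(1) != b'\x01':                     # 2. wait for the length ACK
        raise RuntimeError('signal did not ACK the message length')
    for i in range(0, len(payload), chunk_size):    # 3. send the data in chunks
        sock.sendall(payload[i:i + chunk_size])
    ack = sock.recv(1)                              # 4. wait for the data ACK
    if ack == b'\x02':                              # 5. 0x2 signals a communication error
        raise RuntimeError('signal reported a communication error')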

The metrics input data takes the form of the following JSON blob:

{
    "metrics": {
        "metric-name-1": [[timestamp, value1], [timestamp, value2], ...],
        "metric-name-2": [[timestamp, value1], [timestamp, value2], ...],
        ...
    }
}

In other words, the autoscaler passes in all of the required_metrics values for the signal, which have been collected over the last period_minutes window for each metric. The signal then will give the following response to the autoscaler:

{
  "Resources": {
    "cpus": requested_cpus,
    "mem": requested_memory_in_MB,
    "disk": requested_disk_in_MB
  }
}

The value in this response is the result from running the signal with the specified data.

supervisord Environment Variables

  • CMAN_CLUSTER: the name of the cluster to autoscale

  • CMAN_POOL: the name of the pool to autoscale

  • CMAN_ARGS: any additional arguments to pass to the autoscaler batch job

  • CMAN_VERSIONS_TO_FETCH: a space-separated list of signal versions to fetch from S3

  • CMAN_SIGNAL_VERSIONS: a space-separated list of versions to use for each signal

  • CMAN_SIGNAL_NAMESPACES: a space-separated list of namespaces to use for each signal

  • CMAN_SIGNAL_NAMES: a space-separated list of signal names

  • CMAN_SIGNAL_APPS: a space-separated list of applications scaled

  • CMAN_NUM_VERSIONS: the number of signal versions to fetch from S3

  • CMAN_NUM_SIGNALS: the number of signals to run

  • CMAN_SIGNALS_BUCKET: the location of the signal artifact bucket in S3

Autoscaler Batch

The autoscaler controls the autoscaling function of Clusterman. One autoscaler instance runs for each cluster and pool managed by Clusterman. Within each pool, it evaluates the signals for each configured application. The difference between the signalled resources and the resources currently available to the pool determines how the pool will be scaled.

Note

Currently, Clusterman can only handle a single application per pool.

Scaling Logic

Clusterman tries to maintain a certain level of resource utilization, called the setpoint. It uses the value of signals as the measure of utilization. If current utilization differs from the setpoint, Clusterman calculates a new desired target capacity for the cluster that will keep utilization at the setpoint.

Clusterman calculates the percentage difference between this desired target capacity and the current target capacity. If this value is greater than the target capacity margin, then it will add or remove resources to bring the cluster to the desired target capacity. This prevents Clusterman from scaling too frequently in response to small changes.

The setpoint and target capacity margin are configured under autoscaling in Service Configuration. There are also some absolute limits on scaling, e.g. the maximum units that can be added or removed at a time. These are configured under scaling_limits in Pool Configuration.

For example, suppose the setpoint is 0.8 and the target capacity margin is 0.1. If the total number of CPUs is 100 and the signalled number of CPUs is 96, the current level of utilization is \(96/100 = 0.96\), which differs from the setpoint. Suppose also that both the target capacity and the actual capacity are currently 100.0. The desired target capacity is then \(100.0 \times 0.96 / 0.8 = 120.0\), and the percentage change in target capacity is \((120.0 - 100.0)/100.0 = 0.2\), which exceeds the target capacity margin. So Clusterman would scale the pool to a target capacity of 120.0.
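
The arithmetic above can be condensed into a small sketch (names here are illustrative, not Clusterman's actual internals):

def desired_target_capacity(current_target, signalled, total, setpoint, margin):
    """Return the new target capacity, or the current one if within the margin."""
    utilization = signalled / total                    # 96 / 100 = 0.96
    desired = current_target * utilization / setpoint  # 100.0 * 0.96 / 0.8 = 120.0
    change = abs(desired - current_target) / current_target  # 0.2 > margin of 0.1
    return desired if change > margin else current_target

desired_target_capacity(100.0, 96, 100, setpoint=0.8, margin=0.1)  # -> 120.0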

Draining and Termination Logic

Clusterman uses a set of complex heuristics to identify hosts to terminate when scaling the cluster down. In particular, it checks whether hosts have joined the Mesos cluster, whether they are running any tasks, and whether any of the running tasks are “critical” workloads that should not be terminated. It combines this with information about the resource groups in the pool, prioritizing agents for termination while attempting to keep capacity balanced across the resource groups. See the MesosPoolManager class for more information.

Moreover, if the draining: true flag is set in the pool’s configuration file, Clusterman will attempt to drain tasks off the host before terminating. This means that it will attempt to gracefully remove running tasks from the agent and re-launch them elsewhere before terminating the host. This system is controlled by submitting hostnames to an Amazon SQS queue; a worker watches this queue and, for each hostname, places the host in maintenance mode (which prevents new tasks from being scheduled on the host) and then removes any running tasks on the host. Finally, the host is submitted to a termination SQS queue, where another worker handles the final shutdown of the host.

Configuration

There are two levels of configuration for Clusterman. The first configures the Clusterman application or service itself, for operators of the service. The second provides per-pool configuration, for client applications to customize scaling behavior.

Service Configuration

The following is an example configuration file for the core Clusterman service and application:

aws:
    access_key_file: /etc/boto_cfg/clusterman.json
    region: us-west-1

autoscale_signal:  # configures the default signal for Clusterman
    name: MostRecentCPU

    # What version of the signal to use (a branch or tag in the clusterman_signals Git repo)
    branch_or_tag: v1.0.2

    # How frequently the signal will be evaluated.
    period_minutes: 10

    required_metrics:
        - name: cpus_allocated
          type: system_metrics

          # The metric will be queried for the most recent data in this range.
          minute_range: 10

autoscaling:
    # signal namespace for the default signal
    default_signal_pool: clusterman

    # Percentage utilization that Clusterman will try to maintain.
    setpoint: 0.7

    # Clusterman will only scale if the percentage change between the current and new target
    # capacities exceeds this margin.
    target_capacity_margin: 0.1

    # Clusterman will not scale down if, since the last run, the capacity has decreased by more
    # than this threshold. Note that this includes capacity removed by Clusterman in the last run
    # and capacity lost for any other reason (e.g. instance failures).
    prevent_scale_down_after_capacity_loss: true
    instance_loss_threshold: 2

# How long to wait for an agent to "drain" before terminating it
drain_termination_timeout_seconds:
  sfr: 100

batches:
    cluster_metrics:
        # How frequently the batch should run to collect metrics.
        run_interval_seconds: 60

    spot_prices:
        # Max one price change for each (instance type, AZ) in this interval.
        dedupe_interval_seconds: 60

        # How frequently the batch should run to collect metrics.
        run_interval_seconds: 60

    node_migration:
        # Maximum number of worker processes the batch can spawn
        # (every worker can handle a single migration for a pool)
        max_worker_processes: 6

        # How frequently the batch should check for migration triggers.
        run_interval_seconds: 60

        # Number of failed attempts tolerated for event workers.
        # A job is marked as failed once this number of attempts is surpassed
        # and the timespan from event creation exceeds this many times
        # the estimated migration time for the pool.
        failed_attemps_margin: 5

clusters:
    cluster-name:
        aws_region: us-west-2
        mesos_api_url: <Mesos cluster FQDN>
        kubeconfig_path: /path/to/kubeconfig.conf

cluster_config_directory: /nail/srv/configs/clusterman-pools/

module_config:
  - namespace: clog
    config:
        log_stream_name: clusterman
    file: /nail/srv/configs/clog.yaml
    initialize: yelp_servlib.clog_util.initialize

  - namespace: clusterman_metrics
    file: /nail/srv/configs/clusterman_metrics.yaml

  - namespace: yelp_batch
    config:
        watchers:
          - aws_key_rotation: /etc/boto_cfg/clusterman.json
          - clusterman_yaml: /nail/srv/configs/clusterman.yaml

The aws section provides the location of access credentials for the AWS API, as well as the region in which Clusterman should operate.

The autoscale_signal section defines the default signal for autoscaling. This signal will be used for a pool if that pool does not define its own autoscale_signal section in its pool configuration.

The autoscaling section defines settings for the autoscaling behavior of Clusterman.

The batches section configures specific Clusterman batches, such as the autoscaler and metrics collection batches.

The clusters section provides the location of the clusters which Clusterman knows about.

The module_config section loads additional configuration values for Clusterman modules, such as clusterman_metrics.

Pool Configuration

To configure a pool, a directory with the cluster’s name should be created in the cluster_config_directory defined in the service configuration. Within that directory, there should be a file named <pool>.yaml. The following is an example configuration file for a particular Clusterman pool:

draining:
  draining_time_threshold_seconds: 1200
  force_terminate: true
  redraining_delay_seconds: 60

resource_groups:
  - sfr:
      tag: 'my-custom-resource-group-tag'

scaling_limits:
    min_capacity: 1
    max_capacity: 800
    max_weight_to_add: 100
    max_weight_to_remove: 100
    max_tasks_to_kill: 100
    min_node_scalein_uptime_seconds: 300


autoscale_signal:
    name: CustomSignal
    namespace: my_application_signal

    # What version of the signal to use (a tag in the clusterman_signals Git repo)
    branch_or_tag: v3.7

    # How frequently the signal will be evaluated.
    period_minutes: 10

    required_metrics:
        - name: cpus_allocated
          type: system_metrics

          # The metric will be queried for the most recent data in this range.
          minute_range: 10

    # custom parameters to be passed into the signal (optional)
    parameters:
        - paramA: 'typeA'
        - paramB: 10

node_migration:
    trigger:
        max_uptime: 90d
        event: true
    strategy:
        rate: 5
        prescaling: '2%'
        precedence: highest_uptime
        bootstrap_wait: 5m
        bootstrap_timeout: 15m
    disable_autoscaling: false
    expected_duration: 2h

The resource_groups section provides information for loading resource groups in the pool manager.

The scaling_limits section provides global pool-level limits on scaling that the autoscaler and other Clusterman commands should respect. The optional min_node_scalein_uptime_seconds field indicates a timespan during which freshly bootstrapped nodes are deprioritized when selecting nodes for termination.

The autoscale_signal section defines the autoscaling signal used by this pool. This section is optional. If it is not present, then the autoscale_signal from the service configuration will be used.

For required metrics, there can be any number of sections, each defining one desired metric. The metric type must be one of Metric Types.

The node_migration section contains settings controlling how Clusterman recycles nodes inside the pool. Enabling this configuration is useful for keeping the average uptime of your pool low and/or for performing ad-hoc migrations of nodes according to some conditional parameter. See Pool Configuration under Node Migration for all the details.

Reloading

The Clusterman batches will automatically reload on changes to the Clusterman service config file and the AWS credentials file. This is specified in the namespace: yelp_batch section of the main configuration file. The autoscaler batch and the metrics collector batch will also automatically reload on changes to any pools that are configured to run on the specified cluster.

Warning

Any changes to these configuration files will cause the signal to be reloaded by the autoscaling batch. Test your config values before pushing. If the config values break the custom signal, then the pool will start using the default signal.

Resource Groups

Resource groups are wrappers around cloud provider APIs that enable scaling groups of machines up and down. A resource group implements the ResourceGroup interface, which provides the set of methods that Clusterman requires to interact with the resource group. Currently, Clusterman supports the following types of resource groups: SpotFleetResourceGroup and AutoScalingResourceGroup.

Cluster Management

Clusterman comes with a number of command-line tools to help with cluster management.

Discovery

The clusterman list-clusters and clusterman list-pools commands can aid in determining what clusters and pools Clusterman knows about.


Management

The clusterman manage command can be used to directly change the state of the cluster.


The --target-capacity option allows users to directly change the size of the Mesos cluster specified by the --cluster and --pool arguments.
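
A hypothetical invocation (the cluster, pool, and capacity values are placeholders):

clusterman manage --cluster norcal-prod --pool appA_pool --target-capacity 120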

Note that there can be up to a few minutes of “lag time” between when the manage command is issued and when changes are reflected in the cluster. This is due to potential delays introduced into the pipeline while AWS finds and procures new instances for the cluster. Therefore, it is not recommended to run clusterman manage repeatedly in short succession, or immediately after the autoscaler batch has run.

Note

Future versions of Clusterman may include a rate-limiter for the manage command

Note

By providing the existing target capacity value as the argument to --target-capacity, you can force Clusterman to attempt to prune any fulfilled capacity that is above the desired target capacity.

Status

The clusterman status command provides information on the current state of the cluster.


As noted above, the state of the cluster may take a few minutes to equilibrate after a clusterman manage command or the autoscaler has run, so the output from clusterman status may not accurately reflect the desired status.

Simulation

Running the Simulator

Simulations are run with the clusterman simulate command; see Sample Usage under generate-data below for an example invocation.

Experimental Input Data

The simulator can accept experimental input data for one or more metric timeseries using the --metrics-data-file argument to clusterman simulate. The simulator expects this file to be stored as a compressed (gzipped) JSON file; the JSON schema is as follows:

{
    'metric_name_1': [
        [<date-time-string>, value],
        [<date-time-string>, value],
        ...
    ],
    'metric_name_2': [
        [<date-time-string>, value],
        [<date-time-string>, value],
        ...
    ],
    ...
}
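
Since the file is just gzipped JSON in the schema above, one can be produced by hand; a small sketch (the metric name and values here are purely illustrative):

import gzip
import json

data = {
    'cpus_allocated|cluster=norcal-prod,pool=appA_pool': [
        ['2017-08-01T08:00:00+00:00', 22],
        ['2017-08-01T08:01:00+00:00', 24],
    ],
}
with gzip.open('metrics.json.gz', 'wt') as f:
    json.dump(data, f)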

Optional Multi-valued Timeseries Data

Some timeseries data needs to have multiple y-values per timestamp. The metrics data file can optionally accept timeseries in a dictionary with the dictionary keys corresponding to the names of the individual timeseries. For example:

{
    'metric_a': [
        [
            <date-time-string>,
            {
              'key1': value,
              'key2': value
            }
        ],
        [
            <date-time-string>,
            {
              'key3': value
            }
        ],
        [
            <date-time-string>,
            {
              'key1': value,
              'key2': value,
              'key3': value
            }
        ]
    ]
}

Additional Tools

There are two ways to get data for simulation if you would like to use values different from those actually recorded in the metrics store: generate-data allows you to generate values as a function of pre-existing metrics or from a random distribution, and the SignalFX scraper allows you to query values from SignalFX.

generate-data

The clusterman generate-data command is a helper function for the clusterman simulator to generate “fake” data, either as some function of pre-existing metric data or as drawn from a specified random distribution. The command takes as input an experimental design YAML file, and produces as output a compressed JSON file that can be directly used in a simulation.

Note

If the output file already exists, newly generated metrics will be appended to it; existing metrics in the output file that share a name with newly generated metrics will be overwritten, pending user confirmation.

Experimental Design File Specification

An experimental design file contains details for how to generate experimental metric data for use in a simulation. The specification for the experimental design is as follows:

metric_type:
    metric_name:
        start_time: <date-time string>
        end_time: <date-time string>
        frequency: <frequency specification>
        values: <values specification>
        dict_keys: (optional) <list of dictionary keys>
  • The metric_type should be one of the Metric Types. There should be one section containing all the applicable metric names for each type.

  • Each metric_name is arbitrary; it should correspond to a metric value that clusterman simulate will use when performing its simulation. Multiple metrics can be specified for a given experimental design by repeating the above block in the YAML file for each desired metric; note that if multiple metrics should follow the same data generation specification, YAML anchors and references can be used.

  • The <date-time string> fields can be in a wide variety of different formats, both relative and exact. In most cases dates and times should be specified in ISO-8601 format; for example, 2017-08-03T18:08:44+00:00. However, in some cases it may be useful to specify relative times; these can be in human-readable format, for example one month ago or -12h.

  • The <frequency specification> can take one of three formats:

    • Historical data: To generate values from historical values, specify historical here and follow the specification for historical values below.

    • Random data: if values will be randomly generated, then the frequency can be in one of two formats:

      • Regular intervals: by providing a <date-time string> for the frequency specification, metric values will be generated periodically; for example, a frequency of 1m will generate a new data point every minute.

      • Random intervals: to generate new metric event arrival times randomly, specify a <random generator> block for the frequency, as shown below:

        distribution: dist-function
        params:
            dist_param_a: param-value
            dist_param_b: param-value
        

        The dist-function should be the name of a function in the Python random module. The params are the keyword arguments for the chosen function. All parameter values relating to time should be defined in seconds; for example, if gauss is chosen for the distribution function, the units for the mean and standard deviation should be seconds.

Note

A common choice for the dist-function is expovariate, which creates exponentially-distributed interarrival times, a.k.a. a Poisson process. This is a good baseline model for the arrival times of real-world data.

  • Similarly, the <values specification> can take one of two formats:

    • Function of historical data: historical values can be linearly transformed by \(ax+b\). Specify the following block:

      aws_region: <AWS region to read historical data from>
      params:
          a: <value>
          b: <value>
      
    • Random values: for this mode, specify a <random generator> block as shown above for frequency.

  • The dict_keys field takes a list of strings which are used to generate a single timeseries with (potentially) multiple data points per time value. For example, given the following dict_keys configuration:

    metric_a:
        dict_keys:
            - key1
            - key2
            - key3
    

    the resulting generated data for metric_a might look something like the example in Optional Multi-valued Timeseries Data format.
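
As an aside, a <random generator> block maps directly onto Python's random module; a tiny sketch of that correspondence:

import random

spec = {'distribution': 'expovariate', 'params': {'lambd': 0.0033333}}
dist = getattr(random, spec['distribution'])   # -> random.expovariate
interarrival_seconds = dist(**spec['params'])  # one sample, in seconds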

Output Format

The generate-data command produces a compressed JSON file containing the generated metric data. The format of this file is identical to the simulator’s Experimental Input Data format.

Sample Usage

drmorr ~ > clusterman generate-data --input design.yaml --output metrics.json.gz
Random Seed: 12345678

drmorr ~ > clusterman simulate --metrics-data-file metrics.json.gz \
> --start-time "2017-08-01T08:00:00+00:00" --end-time "2017-08-01T08:10:00+00:00"

=== Event 0 -- 2017-08-01T08:00:00+00:00        [Simulation begins]
=== Event 2 -- 2017-08-01T08:00:00+00:00        [SpotPriceChangeEvent]
=== Event 28 -- 2017-08-01T08:00:00+00:00       [SpotPriceChangeEvent]
=== Event 21 -- 2017-08-01T08:00:00+00:00       [SpotPriceChangeEvent]
=== Event 22 -- 2017-08-01T08:02:50+00:00       [SpotPriceChangeEvent]
=== Event 3 -- 2017-08-01T08:05:14+00:00        [SpotPriceChangeEvent]
=== Event 23 -- 2017-08-01T08:06:04+00:00       [SpotPriceChangeEvent]
=== Event 0 -- 2017-08-01T08:00:00+00:00        [Simulation ends]

Sample Experimental Design File

metadata:
    spot_prices|aws_availability_zone=us-west-2a,aws_instance_type=c3.8xlarge: &spot_prices

        # If no timezone is specified, generator will use YST
        start_time: "2017-12-01T08:00:00Z"
        end_time: "2017-12-01T09:00:00Z"

        frequency:
            distribution: expovariate
            params:
                lambd: 0.0033333   # Assume prices change on average every five minutes

        values:
            distribution: uniform
            params:
                a: 0
                b: 1

    spot_prices|aws_availability_zone=us-west-2b,aws_instance_type=c3.8xlarge: *spot_prices
    spot_prices|aws_availability_zone=us-west-2c,aws_instance_type=c3.8xlarge: *spot_prices

    capacity|cluster=norcal-prod,role=seagull:
        start_time: "2017-12-01T08:00:00Z"
        end_time: "2017-12-01T09:00:00Z"

        dict_keys:
            - c3.8xlarge,us-west-2a
            - c3.8xlarge,us-west-2b
            - c3.8xlarge,us-west-2c

        frequency:
            distribution: expovariate
            params:
                lambd: 0.001666   # Assume capacity changes on average every ten minutes

        values:
            distribution: randint
            params:
                a: 10
                b: 50

app_metrics:
    seagull_runs:
        start_time: "2017-12-01T08:00:00Z"
        end_time: "2017-12-01T09:00:00Z"
        frequency:
            distribution: expovariate
            params:
                lambd: 0.0041666 # 15 seagull runs per hour
        values: 1


system_metrics:
    cpu_allocation|cluster=everywhere-testopia,role=jolt:
        start_time: "2017-12-01T08:00:00Z"
        end_time: "2017-12-01T09:00:00Z"
        frequency: historical
        values:
            aws_region: "us-west-2"
            params:   # calculate value by a*x + b
                a: 1.5
                b: 10

The above design file and a sample output file are located in docs/examples/design.yaml and docs/examples/metrics.json.gz, respectively.

SignalFX scraper

A tool for downloading data points from SignalFX and saving them in the compressed JSON format that the Clusterman simulator can use. This is an alternative to generating data when the data you’re interested in is available in SignalFX but not yet in Clusterman metrics.

Note

Only data from the last month is available from SignalFX.

The tool will interactively ask you which metric type to save each metric as.


Sample usage:

python -m clusterman.tools.signalfx_scraper --start-time 2017-12-03 --end-time 2017-12-10 \
  --src-metric-names 'seagull.fleet_miser.cluster_capacity_units' --dest-file capacity \
  --api-token <secret> --filter rollup:max region:uswest2-testopia cluster_name:releng

Node Migration

Node Migration allows Clusterman to recycle the nodes of a pool according to various criteria, reducing the amount of manual work necessary when performing infrastructure migrations.

NOTE: this is only compatible with Kubernetes clusters.

Node Migration Batch

The Node Migration batch is the entrypoint of the migration logic. It takes care of fetching migration trigger events, spawning the worker processes that actually perform the node recycling procedures, and monitoring their health.

Batch specific configuration values are described as part of the main service configuration in Service Configuration.

The batch code can be invoked from the clusterman.batch.node_migration Python module.

Pool Configuration

The behaviour of the migration logic for a pool is controlled by the node_migration section of the pool configuration. The allowed values for the migration settings are as follows:

  • trigger:

    • max_uptime: if set, monitor nodes’ uptime to ensure it stays lower than the provided value; human readable time string (e.g. 30d).

    • event: if set to true, accept async migration trigger for this pool; details about event triggers are described below in Migration Event Trigger.

  • strategy:

    • rate: rate at which nodes are selected for termination; percentage or absolute value (required).

    • prescaling: if set, pool size (in nodes) is increased by this amount before performing node recycling; percentage or absolute value (0 by default). This directly sets a capacity value for the pool if autoscaling is disabled, or applies a temporary capacity offset otherwise.

    • precedence: precedence with which nodes are selected for termination:

      • highest_uptime: select older nodes first (default);

      • lowest_task_count: select node with fewer running tasks first;

      • az_name_alphabetical: group nodes by availability zone, and select group in alphabetical order;

    • bootstrap_wait: indicative time necessary for a node to be ready to run workloads after boot; human readable time string (3 minutes by default).

    • bootstrap_timeout: maximum wait for nodes to be ready after boot; human readable time string (10 minutes by default).

    • allowed_failed_drains: allow up to this many nodes to fail draining and be requeued before aborting (3 by default).

  • disable_autoscaling: turn off autoscaler while recycling instances (false by default).

  • ignore_pod_health: avoid loading and checking pod information to determine pool health (false by default).

  • health_check_interval: how long to wait between checks when monitoring pool health (2 minutes by default).

  • orphan_capacity_tollerance: acceptable ratio of orphan capacity over target capacity to still consider the pool healthy (float, 0 by default, max 0.2).

  • max_uptime_worker_skips: maximum number of times the uptime monitoring worker can skip churning nodes due to unsatisfied pool capacity (6 by default, set to 0 to always allow skipping).

  • expected_duration: estimated duration for migration of the whole pool; human readable time string (1 day by default).

See Pool Configuration for an example configuration block.

Migration Event Trigger

Migration trigger events are submitted as Kubernetes custom resources of type nodemigration. They can easily be generated and submitted using the clusterman migrate CLI sub-command and its related options. If jobs for a pool need to be stopped, the clusterman migrate-stop utility can be used. The manifest for the custom resource definition is as follows:

---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: nodemigrations.clusterman.yelp.com
spec:
  scope: Cluster
  group: clusterman.yelp.com
  names:
    plural: nodemigrations
    singular: nodemigration
    kind: NodeMigration
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          required:
            - spec
          properties:
            spec:
              type: object
              required:
                - cluster
                - pool
                - condition
              properties:
                cluster:
                  type: string
                pool:
                  type: string
                label_selectors:
                  type: array
                  items:
                    type: string
                condition:
                  type: object
                  properties:
                    trait:
                      type: string
                      enum: [kernel, lsbrelease, instance_type, uptime]
                    target:
                      type: string
                    operator:
                      type: string
                      enum: [gt, ge, eq, ne, lt, le, in, notin]

In more readable terms, an example resource manifest would look like:

---
apiVersion: "clusterman.yelp.com/v1"
kind: NodeMigration
metadata:
  name: my-test-migration-220912
  labels:
    clusterman.yelp.com/migration_status: pending
spec:
  cluster: kubestage
  pool: default
  condition:
    trait: uptime
    operator: lt
    target: 90d

The fields in each migration event control which nodes are affected by the event and the desired final condition for them. More specifically:

  • cluster: name of the cluster to be targeted.

  • pool: name of the pool to be targeted.

  • label_selectors: list of additional Kubernetes label selectors to filter affected nodes.

  • condition: the desired final state for the nodes, e.g. all nodes must have a kernel version higher than X.

    • trait: metadata to be compared; currently supports kernel, lsbrelease, instance_type, or uptime.

    • operator: comparison operator; supports gt, ge, eq, ne, lt, le, in, notin.

    • target: right side of the comparison expression, e.g. a kernel version or an instance type; may be a single string or a comma separated list when using in / notin operators.
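
As an illustration only (not Clusterman's actual implementation), the condition operators map naturally onto Python's operator module:

import operator

# Hypothetical sketch of evaluating a migration condition; not Clusterman's real code.
COMPARISONS = {
    'gt': operator.gt, 'ge': operator.ge, 'eq': operator.eq,
    'ne': operator.ne, 'lt': operator.lt, 'le': operator.le,
}

def satisfies(node_value, op, target):
    if op in ('in', 'notin'):
        targets = [t.strip() for t in target.split(',')]
        return (node_value in targets) if op == 'in' else (node_value not in targets)
    return COMPARISONS[op](node_value, target)

satisfies('c3.xlarge', 'in', 'c3.xlarge,c3.8xlarge')  # -> True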

Drainer

The Drainer is the component that drains pods off a node before termination. It may drain and terminate nodes for three reasons:

  • spot_interruption

  • node_migration

  • scaling_down

NOTE: all settings are only compatible with Kubernetes clusters.

Drainer Batch

The Drainer batch is the entrypoint of the draining logic.

The batch code can be invoked from the clusterman.batch.drainer Python module.

Pool Configuration

The behaviour of the drainer logic for a pool is controlled by the draining section of the pool configuration. The allowed values for the drainer settings are as follows:

  • draining_time_threshold_seconds: maximum time allowed for the draining process to complete (1800 by default).

  • redraining_delay_seconds: how long to wait between draining attempts after a draining failure (15 by default).

  • force_terminate: forcibly terminate the node after draining_time_threshold_seconds has elapsed (false by default).

See Pool Configuration for an example configuration block.

API Reference

The API reference covers the following classes and modules:

  • AutoScalingResourceGroup
  • Autoscaler
  • AWSResourceGroup
  • AWS Markets (clusterman.aws.markets.InstanceMarket(instance, availability_zone))
  • clusterman_metrics
  • MesosPoolManager
  • Signal
  • SpotFleetResourceGroup