Configuration

There are two levels of configuration for Clusterman. The first configures the Clusterman application or service itself, for operators of the service. The second provides per-pool configuration, for client applications to customize scaling behavior.

Service Configuration

The following is an example configuration file for the core Clusterman service and application:

aws:
    access_key_file: /etc/boto_cfg/clusterman.json
    region: us-west-1

autoscale_signal:  # configures the default signal for Clusterman
    name: MostRecentCPU

    # What version of the signal to use (a branch or tag in the clusterman_signals Git repo)
    branch_or_tag: v1.0.2

    # How frequently the signal will be evaluated.
    period_minutes: 10

    required_metrics:
        - name: cpus_allocated
          type: system_metrics

          # The metric will be queried for the most recent data in this range.
          minute_range: 10

autoscaling:
    # signal namespace for the default signal
    default_signal_pool: clusterman

    # Fraction of utilization that Clusterman will try to maintain (0.7 means 70%).
    setpoint: 0.7

    # Clusterman will only scale if the percentage change between the current and new target
    # capacities is beyond this margin.
    target_capacity_margin: 0.1

    # Clusterman will not scale down if, since the last run, the capacity has decreased by more than
    # this threshold. Note that this includes capacity removed by Clusterman in the last run and
    # capacity lost for any other reason (e.g. instance failures).
    prevent_scale_down_after_capacity_loss: true
    instance_loss_threshold: 2

# How long to wait for an agent to "drain" before terminating it
drain_termination_timeout_seconds:
  sfr: 100

batches:
    cluster_metrics:
        # How frequently the batch should run to collect metrics.
        run_interval_seconds: 60

    spot_prices:
        # Max one price change for each (instance type, AZ) in this interval.
        dedupe_interval_seconds: 60

        # How frequently the batch should run to collect metrics.
        run_interval_seconds: 60

    node_migration:
        # Maximum number of worker processes the batch can spawn
        # (every worker can handle a single migration for a pool)
        max_worker_processes: 6

        # How frequently the batch should check for migration triggers.
        run_interval_seconds: 60

        # Number of failed attempts tolerated for event workers.
        # Job is marked as failed once this number of attempts is surpassed,
        # and timespan from event creation is higher than this many times
        # the estimated migration time for the pool.
        failed_attemps_margin: 5

clusters:
    cluster-name:
        aws_region: us-west-2
        mesos_api_url: <Mesos cluster FQDN>
        kubeconfig_path: /path/to/kubeconfig.conf

cluster_config_directory: /nail/srv/configs/clusterman-pools/

module_config:
  - namespace: clog
    config:
        log_stream_name: clusterman
    file: /nail/srv/configs/clog.yaml
    initialize: yelp_servlib.clog_util.initialize

  - namespace: clusterman_metrics
    file: /nail/srv/configs/clusterman_metrics.yaml

  - namespace: yelp_batch
    config:
        watchers:
          - aws_key_rotation: /etc/boto_cfg/clusterman.json
          - clusterman_yaml: /nail/srv/configs/clusterman.yaml

The aws section provides the location of access credentials for the AWS API, as well as the region in which Clusterman should operate.

The autoscale_signal section defines the default signal for autoscaling. This signal is used for any pool that does not define its own autoscale_signal section in its pool configuration.

The autoscaling section defines settings for the autoscaling behavior of Clusterman.
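
As an illustration of the target_capacity_margin setting, the following is a minimal sketch of the check described in the example comments. The function name and the exact formula (relative change measured against the current target) are assumptions made for illustration, not Clusterman's actual implementation.

```python
def should_scale(current_target: float, new_target: float, margin: float = 0.1) -> bool:
    """Return True when the relative change in target capacity exceeds the margin.

    Hypothetical helper: the real autoscaler may compute the change differently.
    """
    if current_target == 0:
        # Starting from zero capacity, any non-zero request counts as a change.
        return new_target != 0
    relative_change = abs(new_target - current_target) / current_target
    return relative_change > margin
```

With a margin of 0.1, a move from 100 to 105 would be ignored under this sketch, while a move from 100 to 120 would trigger scaling.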

The batches section configures specific Clusterman batches, such as the autoscaler and metrics collection batches.

The clusters section provides the location of the clusters which Clusterman knows about.

The module_config section loads additional configuration values for Clusterman modules, such as clusterman_metrics.

Pool Configuration

To configure a pool, create a directory named after the cluster inside the cluster_config_directory defined in the service configuration, and add a file named <pool>.yaml to that directory. The following is an example configuration file for a particular Clusterman pool:

draining:
  draining_time_threshold_seconds: 1200
  force_terminate: true
  redraining_delay_seconds: 60

resource_groups:
  - sfr:
      tag: 'my-custom-resource-group-tag'

scaling_limits:
    min_capacity: 1
    max_capacity: 800
    max_weight_to_add: 100
    max_weight_to_remove: 100
    max_tasks_to_kill: 100
    min_node_scalein_uptime_seconds: 300


autoscale_signal:
    name: CustomSignal
    namespace: my_application_signal

    # What version of the signal to use (a branch or tag in the clusterman_signals Git repo)
    branch_or_tag: v3.7

    # How frequently the signal will be evaluated.
    period_minutes: 10

    required_metrics:
        - name: cpus_allocated
          type: system_metrics

          # The metric will be queried for the most recent data in this range.
          minute_range: 10

    # custom parameters to be passed into the signal (optional)
    parameters:
        - paramA: 'typeA'
        - paramB: 10

node_migration:
    trigger:
        max_uptime: 90d
        event: true
    strategy:
        rate: 5
        prescaling: '2%'
        precedence: highest_uptime
        bootstrap_wait: 5m
        bootstrap_timeout: 15m
    disable_autoscaling: false
    expected_duration: 2h
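
The file layout described before the example (a directory per cluster inside cluster_config_directory, holding one <pool>.yaml per pool) can be sketched as a path computation. The helper name and the pool name "default" are hypothetical; only the directory structure comes from the text above.

```python
from pathlib import PurePosixPath


def pool_config_path(cluster_config_directory: str, cluster: str, pool: str) -> PurePosixPath:
    """Build the expected location of a pool's configuration file.

    Hypothetical helper: <cluster_config_directory>/<cluster>/<pool>.yaml
    """
    return PurePosixPath(cluster_config_directory) / cluster / f"{pool}.yaml"


# Example with the directory and cluster name used in the service configuration.
example_path = pool_config_path(
    "/nail/srv/configs/clusterman-pools/", "cluster-name", "default"
)
```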

The resource_groups section provides information for loading resource groups in the pool manager.

The scaling_limits section provides global pool-level limits on scaling that the autoscaler and other Clusterman commands should respect. The optional field min_node_scalein_uptime_seconds sets a timespan during which freshly bootstrapped nodes are deprioritized when selecting nodes for termination.
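
To show how such limits might interact, here is a minimal sketch that constrains a requested target capacity using the keys from the example. The class, the function, and the order in which the limits are applied (step size first, then pool bounds) are assumptions for illustration and may differ from what Clusterman does.

```python
from dataclasses import dataclass


@dataclass
class ScalingLimits:
    """Subset of the scaling_limits keys from the example pool configuration."""
    min_capacity: float
    max_capacity: float
    max_weight_to_add: float
    max_weight_to_remove: float


def constrain_target(current: float, requested: float, limits: ScalingLimits) -> float:
    """Clamp a requested target capacity (assumed order of operations)."""
    # Limit the size of a single step up or down.
    delta = requested - current
    delta = max(-limits.max_weight_to_remove, min(limits.max_weight_to_add, delta))
    # Then keep the result inside the pool-wide bounds.
    return max(limits.min_capacity, min(limits.max_capacity, current + delta))


limits = ScalingLimits(min_capacity=1, max_capacity=800,
                       max_weight_to_add=100, max_weight_to_remove=100)
```

Under these assumptions, a request to go from 500 to 900 would be limited to 600 by max_weight_to_add, and a request to go from 750 to 900 would stop at max_capacity.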

The autoscale_signal section defines the autoscaling signal used by this pool. This section is optional. If it is not present, then the autoscale_signal from the service configuration will be used.
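
The fallback described above can be sketched with plain dictionaries standing in for the parsed YAML files. The function name is hypothetical, and treating an empty section the same as a missing one is an assumption.

```python
from typing import Any, Dict


def resolve_signal_config(service_config: Dict[str, Any],
                          pool_config: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the pool's autoscale_signal if defined, else the service default."""
    pool_signal = pool_config.get("autoscale_signal")
    if pool_signal:
        return pool_signal
    # Assumption: the service configuration always defines a default signal.
    return service_config["autoscale_signal"]


service = {"autoscale_signal": {"name": "MostRecentCPU", "branch_or_tag": "v1.0.2"}}
pool_with_signal = {"autoscale_signal": {"name": "CustomSignal", "branch_or_tag": "v3.7"}}
pool_without_signal = {"scaling_limits": {"min_capacity": 1}}
```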

For required_metrics, there can be any number of entries, each defining one desired metric. The metric type must be one of the types listed under Metric Types.

The node_migration section contains settings controlling how Clusterman recycles nodes inside the pool. Enabling this configuration is useful for keeping the average uptime of your pool low, for performing ad hoc migrations of nodes according to some conditional parameter, or both. See Pool Configuration for all details.
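
The example node_migration values use short duration strings such as 90d, 5m, and 2h. The following sketch converts strings of that shape into seconds. The helper and the set of accepted suffixes are assumptions based only on the values in the example; the formats Clusterman actually accepts are not stated here.

```python
import re

# Assumed suffixes: seconds, minutes, hours, days.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration_seconds(value: str) -> int:
    """Convert a string like '90d' or '15m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```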

Reloading

The Clusterman batches will automatically reload on changes to the Clusterman service config file and the AWS credentials file. This is specified in the namespace: yelp_batch section of the main configuration file. The autoscaler batch and the metrics collector batch will also automatically reload on changes to the configuration of any pool that is configured to run on the specified cluster.

Warning

Any changes to these configuration files will cause the signal to be reloaded by the autoscaling batch. Test your config values before pushing. If the config values break the custom signal, then the pool will start using the default signal.
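
One way to test config values before pushing is a small structural check on the autoscale_signal section. The sketch below works on an already-parsed dictionary; the function name and the choice of which keys are treated as required are assumptions drawn from the examples above, not a schema published by Clusterman.

```python
from typing import Any, Dict, List


def check_signal_config(signal: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an autoscale_signal section."""
    problems: List[str] = []
    # Keys that appear in every example signal section (assumed to be required).
    for key in ("name", "branch_or_tag", "period_minutes"):
        if key not in signal:
            problems.append(f"missing key: {key}")
    period = signal.get("period_minutes")
    if period is not None and (not isinstance(period, int) or period <= 0):
        problems.append("period_minutes must be a positive integer")
    # Each required metric in the examples carries a name, a type, and a range.
    for i, metric in enumerate(signal.get("required_metrics", [])):
        for key in ("name", "type", "minute_range"):
            if key not in metric:
                problems.append(f"required_metrics[{i}] missing key: {key}")
    return problems
```

An empty list means the section passed these checks; it does not guarantee that the signal itself will load or behave as intended.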