Automate default AWS SageMaker scaling policy
TL;DR: To set the target value for the default (built-in) autoscaling metric on SageMaker endpoints, make sure you use SageMakerEndpointInvocationScalingPolicy as your policy name and SageMakerVariantInvocationsPerInstance as your target metric:
target_tracking_json='{"TargetValue": SCALING_VALUE, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}}'
echo "$target_tracking_json" > target_tracking_scaling_policy.json
aws application-autoscaling put-scaling-policy --region REGION \
--policy-name SageMakerEndpointInvocationScalingPolicy \
--service-namespace sagemaker \
--resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
--scalable-dimension sagemaker:variant:DesiredInstanceCount \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json
Where SCALING_VALUE, REGION, ENDPOINT_NAME, and VARIANT_NAME need to be updated with your endpoint-specific values. Read on for a more in-depth explanation.
We deploy our machine learning models for realtime inference using AWS SageMaker. For our use case, this involves creating a container, along with an ML model, and deploying it as an endpoint. In this post, I’d like to focus specifically on setting up a SageMaker endpoint so that it autoscales properly. Registering an autoscaling group (ASG; strictly speaking, a scalable target in Application Auto Scaling) to your endpoint and setting max/min node values for it will get you most of the way there. However, one of the missing pieces is automatically setting the target metric that tells the ASG when to start scaling out/in.
SageMaker endpoints with autoscaling enabled scale on the SageMakerVariantInvocationsPerInstance metric by default. This means that, given some target value, if the invocations (inference calls) per instance stay above that number over a given amount of time, the ASG will start spinning up new instances to handle the increased load. The ASG will keep adding instances until either SageMakerVariantInvocationsPerInstance goes below the target value, or it hits the max number of instances set when registering the ASG.
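To make the threshold concrete, here is a small sketch of the comparison with made-up numbers. It ignores the evaluation window and cooldowns that the real service applies, and every value in it is an assumption for illustration:

```shell
#!/usr/bin/env bash
# Made-up numbers for illustration only.
TOTAL_INVOCATIONS_PER_MINUTE=3000   # requests hitting the variant each minute
INSTANCE_COUNT=4                    # instances currently behind the variant
TARGET_VALUE=600                    # target invocations per instance per minute

# The metric is the per-minute invocation count averaged over instances.
PER_INSTANCE=$(( TOTAL_INVOCATIONS_PER_MINUTE / INSTANCE_COUNT ))

if [ "$PER_INSTANCE" -gt "$TARGET_VALUE" ]; then
  DECISION="scale out"
else
  DECISION="hold or scale in"
fi

echo "per-instance invocations: $PER_INSTANCE, target: $TARGET_VALUE, decision: $DECISION"
```

With these numbers each instance sees 750 invocations per minute, which is above the target of 600, so the sketch reports a scale-out.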
This is what we want, especially for endpoints that don’t have predictable load, or load that changes greatly throughout the day. It can be a helpful cost saving tactic and increases availability under heavy traffic. However, it’s not clear how to programmatically set this target value, hence this post.
We mostly use the awscli to automate our endpoint deploys, but you could substitute any of these CLI commands with analogous API calls. Once you have an endpoint live in SageMaker, you can enable autoscaling by running the following:
aws application-autoscaling register-scalable-target --region REGION \
--service-namespace sagemaker \
--resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
--scalable-dimension sagemaker:variant:DesiredInstanceCount \
--min-capacity MIN_VALUE \
--max-capacity MAX_VALUE
Where REGION, ENDPOINT_NAME, VARIANT_NAME, MIN_VALUE, and MAX_VALUE should be set with your specific endpoint-related values. This will register an autoscaling group to the SageMaker endpoint. In the SageMaker UI you should now see, under “Endpoints” -> YOUR_ENDPOINT -> “Endpoint runtime settings”, that automatic scaling is set to “Yes”. You’ll also see the max/min instance settings updated with the values you used in the command above.
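The same check can be done from the CLI instead of the UI. A minimal sketch, assuming the describe-scalable-targets subcommand with its --resource-ids filter and the CLI's global --query option; the endpoint and variant names are placeholders, and show_scalable_target is a helper name made up for this example:

```shell
#!/usr/bin/env bash
# Placeholder names; substitute your own endpoint and variant.
ENDPOINT_NAME="my-endpoint"
VARIANT_NAME="AllTraffic"
RESOURCE_ID="endpoint/${ENDPOINT_NAME}/variant/${VARIANT_NAME}"

# Print the registered min/max capacity for the variant (hypothetical helper).
show_scalable_target() {
  aws application-autoscaling describe-scalable-targets \
    --service-namespace sagemaker \
    --resource-ids "$RESOURCE_ID" \
    --query 'ScalableTargets[].{Min: MinCapacity, Max: MaxCapacity}'
}

# Usage (needs AWS credentials): show_scalable_target
echo "Would query scalable target: $RESOURCE_ID"
```

The function is only defined here, not called, so the snippet can be pasted into a deploy script and invoked once credentials are in place.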
Confusingly, we need to use another command to specify when our ASG actually triggers its scaling. This can be done manually by going to your endpoint in the UI and selecting your variant under the endpoint runtime settings:
This allows you to select “Configure auto scaling” which will take you to a screen that allows you to edit your max/min instance counts, deregister/disable auto scaling, and, what we are most interested in, set your target scaling value for the built-in scaling policy:
There is a lot of useful information here. First, you’ll notice the policy name is SageMakerEndpointInvocationScalingPolicy with a target metric called SageMakerVariantInvocationsPerInstance. This means that, by default, SageMaker endpoints will scale based on the number of incoming requests per instance. You can get more information on this metric here.
If you’re not sure what value to use, you can estimate it using the load testing techniques described in this post.
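One rule of thumb, which I believe comes from AWS's own load-testing guidance for SageMaker autoscaling, is to take the peak requests per second a single instance can sustain, multiply by a safety factor, and multiply by 60 because the metric is counted per minute. A sketch of that arithmetic, where MAX_RPS and SAFETY_FACTOR are placeholder numbers and not measurements from our endpoints:

```shell
#!/usr/bin/env bash
# Placeholder inputs: replace with numbers from your own load test.
MAX_RPS=20          # peak requests/second one instance handled (assumed)
SAFETY_FACTOR=0.5   # headroom multiplier (assumed)

# The metric is counted per minute, hence the factor of 60.
# awk is used because shell arithmetic has no floating point.
SCALING_VALUE=$(awk -v rps="$MAX_RPS" -v sf="$SAFETY_FACTOR" \
  'BEGIN { printf "%d", rps * sf * 60 }')

echo "SCALING_VALUE=$SCALING_VALUE"
```

With these inputs the result is 600, which would then be a starting point to refine through further testing.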
We also have the option to set our scale-in and scale-out cooldown values. All of these values will need to be tweaked, experimented with, and tested for each of your endpoints, since most models will have different performance characteristics for the various problems they solve.
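The cooldowns can live in the same target tracking configuration file as the target value. A sketch, assuming the ScaleInCooldown and ScaleOutCooldown fields (both in seconds) of the Application Auto Scaling target tracking configuration; the numbers are placeholders, not recommendations:

```shell
#!/usr/bin/env bash
# Placeholder values; tune these per endpoint.
SCALING_VALUE=600
SCALE_IN_COOLDOWN=300    # seconds to wait after a scale-in activity
SCALE_OUT_COOLDOWN=60    # seconds to wait after a scale-out activity

# Write the policy configuration, including both cooldowns, to a file.
cat > target_tracking_scaling_policy.json <<EOF
{
  "TargetValue": ${SCALING_VALUE},
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
  },
  "ScaleInCooldown": ${SCALE_IN_COOLDOWN},
  "ScaleOutCooldown": ${SCALE_OUT_COOLDOWN}
}
EOF

cat target_tracking_scaling_policy.json
```

The resulting file is passed to put-scaling-policy through --target-tracking-scaling-policy-configuration, the same way as the smaller configuration used elsewhere in this post.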
It’s not super clear from the documentation how we programmatically set these values. After reading the documentation, it looks like we will need to define a scaling policy.
This makes sense given what we see in the UI. We can use either the SageMakerVariantInvocationsPerInstance predefined metric, or define one ourselves using a custom metric. AWS “strongly” recommends using the default built-in metric, so we’ll stick with that for now.
Now it’s time for us to apply our scaling policy. We can do this through the AWS CLI:
target_tracking_json='{"TargetValue": SCALING_VALUE, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}}'
echo "$target_tracking_json" > target_tracking_scaling_policy.json
aws application-autoscaling put-scaling-policy --region REGION \
--policy-name POLICY_NAME \
--service-namespace sagemaker \
--resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
--scalable-dimension sagemaker:variant:DesiredInstanceCount \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json
Where SCALING_VALUE, REGION, POLICY_NAME, ENDPOINT_NAME, and VARIANT_NAME are set specific to our endpoint. Should solve our problem, right? Well, sort of. If you name this scaling policy anything other than SageMakerEndpointInvocationScalingPolicy, the name used in the TL;DR above (and most examples do use another name), then this will actually create a custom scaling policy. When someone goes to view this endpoint they are greeted with this:
Okay, I guess I just need to trust that the values I set in the code are correct. There isn’t much information given in the UI about what that custom policy actually is. We can run:
aws application-autoscaling describe-scaling-policies --service-namespace sagemaker
Which will give us all the settings related to our autoscaling policies on our SageMaker endpoints. You’ll then have to search through these to find the endpoint you care about to verify that we set things correctly. This can be confusing for end users taking a look at the endpoint, and seems a little black box asking them to just trust that this custom policy is what they wanted.
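To avoid searching through every policy, the query can be narrowed to a single variant. A sketch, assuming the --resource-id filter of describe-scaling-policies and the CLI's global --query option; ENDPOINT_NAME and VARIANT_NAME are placeholders, and both helper function names are made up for illustration:

```shell
#!/usr/bin/env bash
# Build the Application Auto Scaling resource ID for an endpoint variant.
resource_id() {
  printf 'endpoint/%s/variant/%s' "$1" "$2"
}

# Show only the policies attached to one variant (hypothetical helper).
describe_variant_policies() {
  aws application-autoscaling describe-scaling-policies \
    --service-namespace sagemaker \
    --resource-id "$(resource_id "$1" "$2")" \
    --query 'ScalingPolicies[].{Name: PolicyName, Config: TargetTrackingScalingPolicyConfiguration}'
}

# Usage (needs AWS credentials):
#   describe_variant_policies ENDPOINT_NAME VARIANT_NAME
```

This prints just the policy name and its target tracking configuration for the one variant, which is usually all you need to confirm the values you set.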
We said earlier that we just want to use the built-in SageMakerVariantInvocationsPerInstance metric. We don’t want to set a custom scaling policy for this endpoint, just set the default built-in one. You may have already figured out where we went wrong. We need to use SageMakerEndpointInvocationScalingPolicy as our policy name when running the put-scaling-policy command. Here is the corrected command:
target_tracking_json='{"TargetValue": SCALING_VALUE, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}}'
echo "$target_tracking_json" > target_tracking_scaling_policy.json
aws application-autoscaling put-scaling-policy --region REGION \
--policy-name SageMakerEndpointInvocationScalingPolicy \
--service-namespace sagemaker \
--resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
--scalable-dimension sagemaker:variant:DesiredInstanceCount \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json
This is almost identical to the last command; the only change is that we updated POLICY_NAME to be SageMakerEndpointInvocationScalingPolicy. Now we can go back to the UI and verify we have programmatically set these values via the CLI. The end users should be happy since they can see what their endpoint is using as a target scaling value.
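Putting the two steps together, a deploy script might look roughly like the sketch below. The default values, the DRY_RUN switch, and the run helper are assumptions added for illustration and are not part of our actual tooling; with DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Placeholder defaults; override them through the environment.
REGION="${REGION:-us-east-1}"
ENDPOINT_NAME="${ENDPOINT_NAME:-my-endpoint}"
VARIANT_NAME="${VARIANT_NAME:-AllTraffic}"
MIN_VALUE="${MIN_VALUE:-1}"
MAX_VALUE="${MAX_VALUE:-4}"
SCALING_VALUE="${SCALING_VALUE:-600}"
DRY_RUN="${DRY_RUN:-1}"   # 1 prints the commands, 0 runs them

RESOURCE_ID="endpoint/${ENDPOINT_NAME}/variant/${VARIANT_NAME}"
# The built-in policy name, so the UI shows the target value.
POLICY_NAME="SageMakerEndpointInvocationScalingPolicy"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Write the target tracking configuration used by put-scaling-policy.
printf '{"TargetValue": %s, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}}\n' \
  "$SCALING_VALUE" > target_tracking_scaling_policy.json

# Step 1: register the variant for autoscaling with min/max capacity.
run aws application-autoscaling register-scalable-target --region "$REGION" \
  --service-namespace sagemaker \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity "$MIN_VALUE" \
  --max-capacity "$MAX_VALUE"

# Step 2: set the target value on the built-in policy.
run aws application-autoscaling put-scaling-policy --region "$REGION" \
  --policy-name "$POLICY_NAME" \
  --service-namespace sagemaker \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json
```

Running it with DRY_RUN=0 and real values for the other variables would apply both commands against the endpoint.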
This is not called out anywhere in the documentation, which is unfortunate. I also didn’t see any examples online that point this out, which motivated writing this post.
It might seem obvious to do this, but without explicit documentation from AWS, I was left guessing, and it took longer than I would have liked to figure out the right solution. Hopefully this post helps out anyone else who is in a similar position.