Lambda payload size workaround

Another AWS Lambda + API Gateway limitation is the size of the response body we can return.

AWS states that the maximum payload size for API Gateway is 10 MB, and that Lambda's invocation payload (request and response) is limited to 6 MB.

In practice, I've found this number to be significantly lower than 6 MB, but perhaps I'm just calculating incorrectly.

Using a Flask route like this:

from flask import Flask, request

app = Flask(__name__)

@app.route('/giant')
def giant():
    payload = "x" * int(request.args.get('size', 1024 * 1024 * 6))
    return payload

…and calling it with curl, I get the following cutoff:

$ curl -s 'https://REDACTED/dev/giant?size=4718559' | wc -c
 4718559
$ curl -s 'https://REDACTED/dev/giant?size=4718560'
{"message": "Internal server error"}

Checking the logs (with zappa tail), I see the non-obvious-unless-you've-come-across-this-before error message:

body size is too long

The cutoff of 4718559 bytes is just shy of 4.5 MiB (4.5 × 1024 × 1024 = 4718592), so presumably a few dozen bytes of headers or metadata count against the limit. Let's just call this limit "4 MB" to be safe.

So, why does this matter? Well, sometimes—like it or not—APIs need to return more than 4 MB of data. In my opinion, this should usually (but not always) be resolved by requesting smaller results. But sometimes we don't get control over this, or it's just not practical.

Take Kibana, for example. In the past year, we started using Elasticsearch for logging certain types of structured data. We elected to use the AWS Elasticsearch Service to host this. AWS ES has an interesting authentication method: it requires signed requests, based on AWS IAM credentials. This is super useful for our Lambda-based app because we don't have to rely on DB connection pools, firewalls, VPCs, and much of the other pain that comes with using an RDBMS in a highly-distributed system. Our app can use its inherited IAM profile to sign requests to AWS ES quite easily, but we also wanted to give our developers and certain partners access to our structured logs.
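
For illustration, signing a request with whatever IAM credentials the environment provides looks roughly like this (a minimal sketch, assuming the requests-aws4auth package and a made-up ES_ENDPOINT; not the exact code we run):

import boto3
import requests
from requests_aws4auth import AWS4Auth

ES_ENDPOINT = 'https://search-example.us-east-1.es.amazonaws.com'  # made up

# boto3 finds credentials the usual way: a Lambda/instance role,
# or a developer's own access keys
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    creds.access_key, creds.secret_key, 'us-east-1', 'es',
    session_token=creds.token)  # the token matters for temporary role credentials

resp = requests.get(ES_ENDPOINT + '/_cluster/health', auth=awsauth)
print(resp.status_code, resp.json()['status'])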

At first, we had our developers run a local copy of aws-es-kibana, which is a proxy server that uses the developer's own AWS credentials (we distribute user or role credentials to our devs) to sign requests. Running a local proxy is a bit of a pain, though—especially for 3rd parties.

So, I wrote access-es to allow our users to access Kibana (and Elasticsearch). It's still in a very early "unstable" state, though we do use it "in production" (just not in user request flows). access-es runs on Lambda and effectively works as a reverse HTTPS proxy that signs requests for session/cookie-authenticated users, based on the app's IAM profile. This was a big win for our no-permanent-servers-managed-by-us architecture.

But the very first day we used access-es to load some large logs in Kibana, it failed on us.

It turns out that if you have large documents in Elasticsearch, Kibana loads very large blobs of JSON in order to render the discover stream (and possibly other streams). Larger than "4 MB", I noticed. Our (non-structured) logs filled with "body size is too long" messages, and I had to make some adjustments to the page size in the discover feed. This bought us some time, but we ran into the payload size limitation far too often, and at the most inopportune moments, such as when trying to rapidly diagnose a production issue.

The "easy" solution to this problem is to concede that we probably can't use Lambda + API Gateway to serve this kind of app. Maybe we should fire up some EC2 instances, provision them with Salt, manage upgrades, updates, security alerts, autoscalers, load balancers… and all of those things that we know how to do so well, but were really hoping to leave behind with the new "serverless" (no permanent servers managed by us) architecture.

This summer, I did a lot of driving, and during one of the longest of those driving sessions, I came up with an idea about how to handle this problem of using Lambda to serve documents that are larger than the Lambda maximum response size.

"What if," I thought, "we could calculate the response, but never actually serve it with Lambda? That would fix it." Turns out it did. The solution, which will probably seem obvious once I state it, is to use Lambda to calculate the response body, store that body in an S3 bucket (where we don't have to manage any servers), and then use Lambda + API Gateway to redirect the client to the new resource on S3.

Here's how I did it in access-es:

import boto3
from botocore.client import Config
from uuid import uuid4
from urllib.parse import urlparse

from flask import Response, redirect, request

# not shown: `method` (the outbound requests call matching the incoming verb),
# `awsauth` (the SigV4 request signer), target_url, headers, overflow_bucket,
# and overflow_size are all set up earlier in the proxy
req = method(
    target_url,
    auth=awsauth,
    params=request.query_string,
    data=request.data,
    headers=headers,
    stream=False
)

content = req.content

if overflow_bucket is not None and len(content) > overflow_size:

    # the response would be bigger than overflow_size, so instead of trying to serve it,
    # we'll put the resulting body on S3, and redirect to a (temporary, signed) URL
    # this is especially useful because API Gateway has a body size limitation, and
    # Kibana serves *huge* blobs of JSON

    # UUID filename (same suffix as original request if possible)
    u = urlparse(target_url)
    if '.' in u.path:
        filename = str(uuid4()) + '.' + u.path.split('.')[-1]
    else:
        filename = str(uuid4())

    s3 = boto3.resource('s3')
    s3_client = boto3.client(
        's3', config=Config(signature_version='s3v4'))

    bucket = s3.Bucket(overflow_bucket)

    # actually put it in the bucket. beware that boto is really noisy
    # for this in debug log level
    obj = bucket.put_object(
        Key=filename,
        Body=content,
        ACL='authenticated-read',
        ContentType=req.headers['content-type']
    )

    # URL only works for 60 seconds
    url = s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': overflow_bucket, 'Key': filename},
        ExpiresIn=60)

    # "see other"
    return redirect(url, 303)

else:
    # otherwise, just serve it normally
    return Response(content, content_type=req.headers['content-type'])

If the body size is larger than overflow_size, we store the result on S3, and the client receives a 303 See Other with an appropriate Location header, completely bypassing the Lambda body size limitation and saving the day for our "serverless" architecture.
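
From the client's side, the exchange looks roughly like this (illustrative output only; the path, bucket name, key, and signature parameters are all placeholders):

$ curl -s -D - -o /dev/null 'https://REDACTED/dev/some/huge/result'
HTTP/1.1 303 SEE OTHER
Location: https://OVERFLOW-BUCKET.s3.amazonaws.com/UUID.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=60&...

Browsers (and curl with -L) follow the redirect automatically, so callers don't need to change anything.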

The resulting URL is signed by AWS to make it only valid for 60 seconds, and the resource isn't available without such a signature (unless otherwise authenticated with IAM + appropriate permissions). Additionally, we use S3's lifecycle management to automatically delete old objects.
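
The lifecycle rule is a one-time bucket configuration; something like this sketch (hypothetical bucket name and a one-day expiration, not necessarily the exact rule we use):

import boto3

s3_client = boto3.client('s3')

# expire overflow objects a day after they're written; the presigned URLs
# are only good for 60 seconds anyway
s3_client.put_bucket_lifecycle_configuration(
    Bucket='my-overflow-bucket',  # hypothetical
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-overflow-bodies',
            'Filter': {'Prefix': ''},
            'Status': 'Enabled',
            'Expiration': {'Days': 1},
        }]
    }
)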

For clients that are modern browsers, though, you'll need to properly manage the CORS configuration on that S3 bucket.
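
Again, as a sketch (the allowed origin is a placeholder for wherever the browser-facing app is served from):

import boto3

s3_client = boto3.client('s3')

# let the browser GET the presigned overflow objects from our app's origin
s3_client.put_bucket_cors(
    Bucket='my-overflow-bucket',  # hypothetical
    CORSConfiguration={
        'CORSRules': [{
            'AllowedMethods': ['GET'],
            'AllowedOrigins': ['https://app.example.com'],  # placeholder origin
            'AllowedHeaders': ['*'],
            'MaxAgeSeconds': 3000,
        }]
    }
)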

This approach fixed our Kibana problem, and now sits in our arsenal of tools for when we need to handle large responses in our other serverless Lambda + API Gateway apps.