API Gateway timeout workaround

2017-Sep-22

The last of the four (see previous posts) big API Gateway limitations is the 30 second integration timeout.

This means that API Gateway will give up on trying to serve your request to the client after 30 seconds—even though Lambda has a 300 second limit.

In my opinion, this is a reasonable limit. No one wants to be waiting around for an API for more than 30 seconds. And if something takes longer than 30s to render, it should probably be batched. "Render this for me, and I'll come back for it later. Thanks for the token. I'll use this to retrieve the results."

In an ideal world, all HTTP requests should definitely be served within 30 seconds. But in practice, that's not always possible. Sometimes realtime requests need to go to a slow third party. Sometimes the client isn't advanced enough to use the batch/token method hinted at above.

Indeed, 30s often falls below the client's own timeout. We're seeing a practical limitation where clients can often run for 90-600 seconds before timing out.

Terrible user experience aside, I needed to find a way to serve long-running requests, and I really didn't want to violate our serverless architecture to do so.

But this 30 second timeout in API Gateway is a hard limit. It can't be increased via the regular AWS support request method. In fact, AWS says that it can't be increased at all, which might even be true. (-:

As I mentioned in an earlier post, I did a lot of driving this summer. Lots of driving led to lots of thinking, and lots of thinking led to a partial solution to this problem.

What if I could use API Gateway to handle the initial request, but somehow buy an additional 30 seconds? Or better yet, what if I could buy up to an additional 270 seconds (5 minutes total)?

Simply put, an initial request can kick off an asynchronous job. If the job takes a long time, then after nearly 30 seconds we can return an HTTP 303 (See Other) redirect to another endpoint that checks the status of the job. If the result still isn't available after another (nearly) 30s, redirect again. Repeat until the Lambda function running the job is killed at its hard 300s limit. If the job finishes before that, we return its result instead of a 303.

But I didn't really have a simple way to kick off an asynchronous job. Well, that's not quite true. I did have a way to do that: Zappa's asynchronous task execution. What I didn't have was a way to get the results from these jobs.

So I wrote one, and Zappa's maintainer, Rich, graciously merged it. And this week, it was released. Here's a post I wrote about it over on the Zappa blog.

The result:

$ time curl -L 'https://REDACTED.apigwateway/dev/payload?delay=40'
{
  "MESSAGE": "It took 40 seconds to generate this."
}

real    0m52.363s
user    0m0.020s
sys     0m0.025s

Here's the code (that uses Flask and Zappa); you'll notice that it also uses a simple backoff algorithm:

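from time import sleep

from flask import Flask, abort, jsonify, redirect, request, url_for
# Older Zappa releases exposed these as zappa.async; newer ones use
# zappa.asynchronous, since "async" became a reserved word in Python 3.7.
from zappa.asynchronous import get_async_response, task

app = Flask(__name__)


@app.route('/payload')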
def payload():
    delay = request.args.get('delay', 60)
    x = longrunner(delay)
    if request.args.get('noredirect', False):
        return 'Response will be here in ~{}s: <a href="{}">{}</a>'.format(
            delay, url_for('response', response_id=x.response_id), x.response_id)
    else:
        return redirect(url_for('response', response_id=x.response_id))


@app.route('/async-response/<response_id>')
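def response(response_id):
    # Look up the stored result of the asynchronous task.
    result = get_async_response(response_id)
    if result is None:
        abort(404)

    # Seconds to wait before the next check; grows with each redirect.
    backoff = float(request.args.get('backoff', 1))

    if result['status'] == 'complete':
        return jsonify(result['response'])

    # Not done yet: wait, then redirect back here with a longer backoff,
    # capped at 28s to stay under API Gateway's 30s timeout.
    sleep(backoff)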
    return "Not yet ready. Redirecting.", 303, {
        'Content-Type': 'text/plain; charset=utf-8',
        'Location': url_for(
            'response', response_id=response_id,
            backoff=min(backoff*1.5, 28)),
        'X-redirect-reason': "Not yet ready.",
    }


@task(capture_response=True)
def longrunner(delay):
    sleep(float(delay))
    return {'MESSAGE': "It took {} seconds to generate this.".format(delay)}
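
To get a feel for the backoff, here is a minimal sketch that replays the sleep schedule from the response view offline. The function name poll_schedule and its parameters are mine, not part of Flask or Zappa, and the model is idealized: it ignores network latency and Lambda overhead entirely.

```python
def poll_schedule(job_seconds, first=1.0, factor=1.5, cap=28.0):
    """Return the idealized times (in seconds) at which the status
    endpoint checks on a job that takes job_seconds to finish.

    Hypothetical helper for illustration only. It mirrors the view's
    logic: sleep for `backoff`, redirect, then multiply the backoff by
    `factor`, never letting it exceed `cap`.
    """
    elapsed = 0.0
    backoff = first
    checks = [elapsed]  # the first check happens right away
    while elapsed < job_seconds:  # job not finished at this check
        elapsed += backoff  # the view sleeps, then redirects
        checks.append(elapsed)
        backoff = min(backoff * factor, cap)
    return checks


checks = poll_schedule(40)
print(len(checks) - 1, "redirects; result served at about", checks[-1], "seconds")
```

For the 40 second job above, this works out to eight redirects from the status endpoint, with the result served at about 49 seconds before any overhead. That is in line with the 52 seconds that curl measured. No single sleep exceeds 28 seconds, so each individual request stays under the 30 second limit.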

That's it. Long-running tasks in API Gateway. Another tool for our serverless arsenal.