There comes a time in every dataset’s life when it wants to become an API. That might be because of consumer demand or an executive order. How are you going to make a good one?
When is an API appropriate?
There are certain datasets that are so large or volatile that downloading the whole thing and/or keeping it up to date becomes burdensome. An API is one strategy to lower the barrier to entry. As Ben Balter wrote:
Go to any agency website, and chances are you’ll find at least one dataset sitting idly by because the barrier to consume it is too damn high. It doesn’t matter how hard stakeholders had to fight to get the data out the door or how valuable the dataset is, it’s never going to become the next GPS or weather data.
A web-based, read-only API is a tool that in some cases can make it easier for consumers to use your data.
To put this in context, I assume here that the data is already available as a bulk data download. As I’ve written ad nauseam elsewhere (such as at http://opengovdata.io/maturity/), an API is almost never a starting point. Posting open data, bulk data, and structured data and using good identifiers all come first, probably in that order, before an API becomes useful. You can’t make a good API without working through all of that first, and all of that addresses important and common use cases that APIs do not. So from here on I assume that bulk data is available and that the work of making the data good data has already been done. So…
The term “API” is vague. It’s often used as shorthand for a web-based method of programmable access to a system. But “API” is just another way of saying “protocol”. There were APIs before there was an Internet. Merely having an “API” doesn’t mean an actual use case has been solved: you can make a protocol without it being useful at all.
What makes an API good?
Let’s take the common case where you have a relatively static, large dataset that you want to provide read-only access to. Here are 19 common attributes of good APIs for this situation. Thanks to Alan deLevie, Ben Balter, Eric Mill, Ed Summers, Joe Wicentowski, and Dave Caraway for some of these ideas.
Granular Access. If the user wanted the whole thing they’d download it in bulk, so an API must be good at providing access to the most granular level practical for data users (h/t Ben Balter for the wording on that). When the data comes from a table, this usually means the ability to read a small slice of it using filters, sorting, and paging (limit/offset), the ability to get a single row by identifying it with a persistent, unique identifier (usually a numeric ID), and the ability to select just which fields should be included in the result output (good for optimizing bandwidth in mobile apps, h/t Eric Mill). (But see “intents” below.)
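To make that concrete, here is a rough sketch of granular access in Flask. Everything in it (the /api/1/records path, the fields, the in-memory data) is made up for illustration; the point is the shape of the filtering, paging, and field-selection parameters, not a production implementation:

```python
# A minimal sketch of granular read access using Flask.
# The dataset, endpoint path, and field names are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a real data store.
RECORDS = {
    1: {"id": 1, "title": "First record", "year": 2013, "agency": "GSA"},
    2: {"id": 2, "title": "Second record", "year": 2014, "agency": "DOT"},
}

@app.route("/api/1/records")
def list_records():
    # Filtering, sorting, and paging via ordinary query-string parameters.
    year = request.args.get("year", type=int)
    limit = request.args.get("limit", default=20, type=int)
    offset = request.args.get("offset", default=0, type=int)
    fields = request.args.get("fields")  # e.g. ?fields=id,title

    rows = [r for r in RECORDS.values() if year is None or r["year"] == year]
    rows = sorted(rows, key=lambda r: r["id"])[offset:offset + limit]

    if fields:
        # Let the caller choose which fields come back (good for mobile apps).
        wanted = set(fields.split(","))
        rows = [{k: v for k, v in r.items() if k in wanted} for r in rows]
    return jsonify({"results": rows})

@app.route("/api/1/records/<int:record_id>")
def get_record(record_id):
    # Access to a single row by its persistent, unique identifier.
    record = RECORDS.get(record_id)
    if record is None:
        return jsonify({"error": "record not found"}), 404
    return jsonify(record)

if __name__ == "__main__":
    app.run()
```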
Deep Filtering. An API should be good at needle-in-haystack problems. Full text search is hard to do, so an API that can do it relieves a big burden for developers — if your API has any big text fields. Filters that can span relations or cross tables (i.e. joins) can be very helpful as well. But don’t go overboard. (Again, see “intents” below.)
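Here is a rough sketch of the kind of machinery that can back a full-text filter, using SQLite’s FTS5 extension. It assumes your Python’s bundled SQLite was compiled with FTS5, and the table and query are made up; a production API would more likely sit on a dedicated search engine:

```python
# A sketch of full-text search backing a ?q= filter, using SQLite FTS5.
# Assumes the bundled SQLite has FTS5 compiled in; data is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE records USING fts5(title, description)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("First record", "hourly weather observations for 2013"),
     ("Second record", "federal spending by agency")])

# The API's ?q= parameter would feed the MATCH query,
# e.g. /api/1/records?q=weather
for (title,) in conn.execute(
        "SELECT title FROM records WHERE records MATCH ? ORDER BY rank",
        ("weather",)):
    print(title)
```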
Typed Values. Response data should be typed. That means that whether a field’s value is an integer, text, list, floating-point number, dictionary, null, or date should be encoded as a part of the value itself. JSON and XML with XSD are good at this. CSV and plain XML, on the other hand, are totally untyped. Types must be strictly enforced: each column must choose a data type and stick with it, no exceptions. When encoding other sorts of data as text, every value must be valid according to the narrowest regular expression you can write, and that regular expression should be provided to API users in the documentation.
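A sketch of what strict enforcement might look like in Python; the field names, types, and regular expressions are made-up examples of the idea:

```python
# A sketch of strict type enforcement before serialization.
# Field names, types, and patterns are hypothetical.
import json
import re

# The narrowest patterns you can write for text-encoded values, published
# in the API documentation so users know exactly what to expect.
FIELD_PATTERNS = {
    "record_id": re.compile(r"^[0-9]{1,10}$"),
    "date":      re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$"),  # ISO 8601 date
    "state":     re.compile(r"^[A-Z]{2}$"),                    # two-letter code
}

FIELD_TYPES = {"title": str, "year": int, "amount": float}

def validate_record(record: dict) -> dict:
    """Raise ValueError unless every field matches its declared type/pattern."""
    for name, pattern in FIELD_PATTERNS.items():
        value = record.get(name)
        if value is not None and not pattern.match(str(value)):
            raise ValueError(f"field {name!r} does not match {pattern.pattern}")
    for name, expected in FIELD_TYPES.items():
        value = record.get(name)
        if value is not None and not isinstance(value, expected):
            raise ValueError(f"field {name!r} must be {expected.__name__}")
    return record

record = {"record_id": "42", "date": "2014-02-10", "state": "MD",
          "title": "Example", "year": 2014, "amount": 1234.5}
print(json.dumps(validate_record(record)))  # types survive in the JSON output
```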
Normalize Tables, Then Denormalize. Normalization is the process of removing redundancy from tables by splitting them into multiple tables. You should do that. Have lots of primary keys that link related tables together. But… then… denormalize. The bottleneck of most APIs isn’t disk space but speed. Queries over denormalized tables are much faster than queries with JOINs over multiple tables. It’s faster to get data if it’s all in one response than if the user has to issue multiple API calls (across multiple tables) to get it. You still have to normalize first, though. Denormalized data is hard to understand and hard to maintain.
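Here’s a tiny illustration of the normalize-then-denormalize flow using SQLite. The tables and columns are hypothetical; the point is that the JOIN is paid once, at load time, instead of on every API call:

```python
# A sketch of "normalize, then denormalize" using sqlite3.
# Table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized source tables: agencies and records linked by a foreign key.
conn.executescript("""
    CREATE TABLE agencies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        agency_id INTEGER NOT NULL REFERENCES agencies(id),
        title TEXT NOT NULL,
        year INTEGER NOT NULL
    );
""")
conn.execute("INSERT INTO agencies VALUES (1, 'GSA'), (2, 'DOT')")
conn.execute("INSERT INTO records VALUES (1, 1, 'First record', 2013),"
             " (2, 2, 'Second record', 2014)")

# Denormalized table the API actually reads from.
conn.executescript("""
    CREATE TABLE api_records AS
    SELECT r.id, r.title, r.year, a.name AS agency_name
    FROM records r JOIN agencies a ON a.id = r.agency_id;
""")
print(conn.execute("SELECT * FROM api_records WHERE year = 2014").fetchall())
```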
Be RESTful, And More. “REST” is a set of practices. There are whole books on this. Here it is in short. Every object named in the data (often that’s the rows of the table) gets its own URL. Hierarchical relationships in the data are turned into nice URL paths with slashes. Put the URLs of related resources in output too (HATEOAS, h/t Ed Summers). Use HTTP GET and normal query string processing (a=x&b=y) for filtering, sorting, and paging. The idea of REST is that these are patterns already familiar to developers, and reusing existing patterns — rather than making up entirely new ones — makes the API more understandable and reusable. Also, use HTTPS for everything (h/t Eric Mill), and provide the API’s status as an API itself possibly at the root URL of the API’s URL space (h/t Eric Mill again). Some more tips about the use of JSON in requests and responses, URL structures, and more, are in the Heroku HTTP API Guide.
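A sketch of those patterns in Flask: hierarchical URLs, links to related resources in the response body, and a status resource at the root of the API’s URL space. The paths and fields are invented for illustration:

```python
# A sketch of RESTful URL structure, HATEOAS-style links, and a status
# endpoint, using Flask. Paths and fields are hypothetical.
from flask import Flask, jsonify, url_for

app = Flask(__name__)

@app.route("/api/1/")
def api_status():
    # The API's own status, served at the root of its URL space.
    return jsonify({"status": "ok", "version": "1"})

@app.route("/api/1/agencies/<int:agency_id>")
def get_agency(agency_id):
    return jsonify({"id": agency_id})

@app.route("/api/1/agencies/<int:agency_id>/records/<int:record_id>")
def get_agency_record(agency_id, record_id):
    # A hierarchical relationship expressed as a URL path, plus links to
    # related resources in the output.
    return jsonify({
        "id": record_id,
        "agency_id": agency_id,
        "links": {
            "self": url_for("get_agency_record", agency_id=agency_id,
                            record_id=record_id, _external=True),
            "agency": url_for("get_agency", agency_id=agency_id,
                              _external=True),
        },
    })

if __name__ == "__main__":
    app.run()
```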
Multiple Output Formats. Provide alternatives for the output format, commonly JSON, XML, and CSV, because different formats are best for different use cases. Do this to the extent that you actually have users who want these formats: CSV is nice for researchers but not great for other developers, and developers lately are moving away from XML toward JSON, so see what formats your users want. A RESTful API (see above) will let the caller choose the output format by simply tacking a file extension onto the end of the URL, or you can use content negotiation (h/t Dave Caraway).
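A sketch of supporting both a file extension and Accept-header content negotiation in Flask, with made-up data:

```python
# A sketch of format selection by file extension or Accept header
# (content negotiation), using Flask. Endpoint and data are hypothetical.
import csv
import io
from flask import Flask, Response, jsonify, request

app = Flask(__name__)
ROWS = [{"id": 1, "title": "First record"}, {"id": 2, "title": "Second record"}]

def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return Response(buf.getvalue(), mimetype="text/csv")

@app.route("/api/1/records")
@app.route("/api/1/records.<ext>")
def list_records(ext=None):
    if ext is None:
        # No extension given: fall back to the Accept header.
        best = request.accept_mimetypes.best_match(
            ["application/json", "text/csv"], default="application/json")
        ext = "csv" if best == "text/csv" else "json"
    if ext == "csv":
        return to_csv(ROWS)
    return jsonify({"results": ROWS})

if __name__ == "__main__":
    app.run()
```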
Nice Errors. Error messages, either because of an invalid request from the user or a problem on the server side, should be clear and provided in a structured data format (e.g. JSON). A RESTful API (see above) additionally uses HTTP status codes where they apply, especially 200, 400, 404, and 500.
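For example, here is a sketch of structured, JSON-formatted errors paired with the right status codes in Flask; the messages and the limit check are hypothetical:

```python
# A sketch of structured error responses with appropriate HTTP status codes,
# using Flask. Messages and paths are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.errorhandler(404)
def not_found(error):
    return jsonify({"error": "not_found",
                    "message": "No resource exists at this URL."}), 404

@app.errorhandler(500)
def server_error(error):
    return jsonify({"error": "server_error",
                    "message": "Something went wrong on our end."}), 500

@app.route("/api/1/records")
def list_records():
    limit = request.args.get("limit", default=20, type=int)
    if limit < 1 or limit > 1000:
        # 400 for an invalid request, with a message explaining what to fix.
        return jsonify({"error": "invalid_parameter",
                        "message": "limit must be between 1 and 1000"}), 400
    return jsonify({"results": []})

if __name__ == "__main__":
    app.run()
```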
Turn Intents into URLs. An API won’t satisfy everyone’s use case, so pick the most important and make them dead-simple for the user. These use cases are also called “verbs” and “intents.” If a common use case is to get the latest entry added to the dataset, make an API called “/api/1/most-recent-entry.” Don’t make users add filtering, sorting, and paging to do common operations. It’s tempting to build a kitchen-sink API that can do anything generically and nothing specifically, but it misses the point: As Ben Balter put it, “APIs should absorb the complexities of using the data, not simply expose it in a machine-readable format.” Intents are also good for hiding implementation details, which gives you flexibility to make back-end changes in the future.
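A sketch of such an intent endpoint in Flask, using the /api/1/most-recent-entry example above with made-up data; the point is that the sorting logic is hidden behind the URL:

```python
# A sketch of an intent-based endpoint: /api/1/most-recent-entry answers a
# common question directly instead of making callers compose filters,
# sorting, and paging. Data is hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)
RECORDS = [
    {"id": 1, "title": "First record", "added": "2014-02-09"},
    {"id": 2, "title": "Second record", "added": "2014-02-10"},
]

@app.route("/api/1/most-recent-entry")
def most_recent_entry():
    # The "latest" logic lives behind the URL, so the back end can change
    # later without breaking callers.
    return jsonify(max(RECORDS, key=lambda r: r["added"]))

if __name__ == "__main__":
    app.run()
```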
Documentation. This is incredibly important. An API without documentation is useless. Totally useless. Because no one will know how to use it. Documentation should cover why the dataset is important, what the data fields mean, how to use the API, and examples examples examples.
Client Libraries. Your users will be accessing your API through software. They’re going to have to write code. Provide re-usable, fully working, modular code for accessing the API in the most common languages that the developers will be using (usually Python, Ruby, and perhaps PHP). This code gives developers a head start, and since every developer will need to write the same basic API-accessing code you get a big win by taking care of writing it once for everyone. (h/t Alan deLevie)
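Here’s a sketch of what a minimal Python client library could look like, built on the requests package. The base URL, the records endpoint, and the use of a Bearer token for the optional key are assumptions for illustration:

```python
# A sketch of a small reusable client library for a hypothetical JSON API
# at https://api.example.gov/api/1/. Uses the requests package.
import requests

class ExampleClient:
    """Minimal wrapper so developers don't each rewrite the same HTTP code."""

    def __init__(self, base_url="https://api.example.gov/api/1/", api_key=None):
        self.base_url = base_url.rstrip("/") + "/"
        self.session = requests.Session()
        if api_key:
            # Optional key in the Authorization header
            # (see "Never Require Registration" below).
            self.session.headers["Authorization"] = "Bearer " + api_key

    def get(self, path, **params):
        resp = self.session.get(self.base_url + path.lstrip("/"),
                                params=params, timeout=10)
        resp.raise_for_status()  # surface 4xx/5xx errors to the caller
        return resp.json()

    def records(self, **filters):
        return self.get("records", **filters)

# Usage:
# client = ExampleClient(api_key="...")
# for record in client.records(year=2014, limit=10)["results"]:
#     print(record["title"])
```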
Versioning. You will make changes to the API. Nothing is right the first time. Put a version into every API URL so that when it’s time for Version 2 you don’t disrupt the Version 1 users. The best version numbers are actually release dates. So your API URLs should look like: /api/2014-02-10/…. Using a date as a version can relieve anxiety around making updates. You could also version with an Accept header.
High Performance. Your API should be fast. Users will appreciate it, but the most important reason is for you: slow APIs create a risk that your server will be overloaded too quickly. Some users will inadvertently (if not maliciously) issue extremely slow and resource-intensive queries if you make such queries possible, and if they issue a lot of them, or if too many users make those queries, your API can come down hard. Try to avoid making that possible, and if you need long-running queries, make it hard for users to start them inadvertently. In addition, query results should be cached on the server side by URL (i.e. don’t put authentication in the URL!) and should be cacheable, in principle, on the client side if the user chooses, so that repeated accesses to exactly the same query are lightning-fast (e.g. use ETags).
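A sketch of caching query results by URL on the server and exposing ETags so clients can revalidate cheaply, again in Flask; the cache and the query here are stand-ins for real infrastructure:

```python
# A sketch of server-side caching by URL plus ETags for client-side
# revalidation, using Flask. The cache and endpoint are hypothetical.
import hashlib
from flask import Flask, jsonify, request

app = Flask(__name__)
CACHE = {}  # full request URL (with no credentials in it!) -> response data

def expensive_query(year):
    # Stand-in for a slow database query.
    return {"results": [{"id": 1, "year": year}]}

@app.route("/api/1/records")
def list_records():
    key = request.full_path  # cache key is the path plus the query string
    if key not in CACHE:
        CACHE[key] = expensive_query(request.args.get("year", type=int))
    resp = jsonify(CACHE[key])
    # The ETag lets clients revalidate cheaply; make_conditional returns
    # 304 Not Modified when the client's If-None-Match header matches.
    resp.set_etag(hashlib.sha1(resp.get_data()).hexdigest())
    return resp.make_conditional(request)

if __name__ == "__main__":
    app.run()
```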
High Availability. You don’t know when users will be using the API, so it needs to be available all the time. This is really hard. (It’s one reason bulk data is so much easier.) Basic precautions like rate limiting should be taken to reduce the risk that the API fails under high load. When updating the data behind the API, the API should never be left in a state where it provides incomplete answers. Maintenance windows should be short because they are incredibly disruptive to users, and notice should be posted ahead of time.
Know Your Users. Log what happens in your API and have some analytics so you can tell if anyone is using it and what they’re using it for, and whether the API is really addressing the use cases you want it to.
Know Your Committed Users More. Have a relationship with your committed users so you can alert them to upcoming maintenance and changes to the API, and so you can know who is making resource-intensive queries in case those queries get out of control. This is often done by having an API key (which is like a password for access, but it should be optional! See the next section). Your system for issuing API keys should be automated and real-time so that developers don’t have to wait to get started. In the API, pass the API key in the HTTP Authorization header (h/t Ed Summers). (Or consider another standard method of authorization like OAuth; h/t Ben Balter.)
Never Require Registration. Don’t have authentication on your API to keep people out! In fact, a registration requirement may contradict other guidelines (such as the 8 Principles of Open Government Data). If you do use an API key, make it optional. A non-authenticated tier lets developers quickly test the waters, which is really important for getting developers in the door, and, again, it may be important for policy reasons as well. You can have a carrot to incentivize voluntary authentication: raise the rate limit for authenticated queries, for instance. (h/t Ben Balter)
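A sketch of that carrot: the key travels in the Authorization header (as described above), and authenticated callers simply get a higher limit. The key store and the limits are made up, and a real deployment would enforce the limit with something like API Umbrella or another proper rate-limiting layer:

```python
# A sketch of an optional API key that raises the rate limit, using Flask.
# The key store and limits are hypothetical; real enforcement would sit in
# a rate-limiting layer rather than in the view function.
from flask import Flask, jsonify, request

app = Flask(__name__)
KNOWN_KEYS = {"demo-key-123"}          # issued automatically, in real time
ANONYMOUS_LIMIT, AUTHENTICATED_LIMIT = 100, 10000  # requests per hour

@app.route("/api/1/records")
def list_records():
    # The optional key travels in the Authorization header, never in the URL.
    auth = request.headers.get("Authorization", "")
    key = auth[len("Bearer "):].strip() if auth.startswith("Bearer ") else None
    limit = AUTHENTICATED_LIMIT if key in KNOWN_KEYS else ANONYMOUS_LIMIT
    # ... enforce `limit` against a per-key or per-IP counter here ...
    return jsonify({"results": [], "rate_limit": limit})

if __name__ == "__main__":
    app.run()
```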
Interactive Documentation. An API explorer is a web page that users can visit to learn how to build API queries and see results for test queries in real time. It’s an interactive browser tool, like interactive documentation. Lacking that, executable examples are a great alternative. Relatedly, an “explain mode” in queries, which instead of returning results says what the query was and how it would be processed, can help developers understand how to use the API (h/t Eric Mill). An API endpoint that gives a user his/her rate-limiting status is helpful too.
Developer Community. Life is hard. Coding is hard. The subject matter your data is about is probably very complex. Don’t make your API users wade into your API alone. Bring the users together, bring them to you, and sometimes go to them. Let them ask questions and report issues in a public place (such as github). You may find that users will answer other users’ questions. Wouldn’t that be great? Have a mailing list for longer questions and discussion about the future of the API. Gather case studies of how people are using the API and show them off to the other users. It’s not a requirement that the API owner participates heavily in the developer community — just having a hub is very helpful — but of course the more participation the better.
Create Virtuous Cycles. Create an environment around the API that makes the data and API stronger. For instance, other individuals within your organization who need the data should go through the public API to the greatest extent possible. Those users are experts and will help you make a better API, once they realize they benefit from it too. Create a feedback loop around the data, meaning find a way for API users to submit reports of data errors and have a process to carry out data updates, if applicable and possible. Do this in public as much as possible so that others see they can also join the virtuous cycle.
How do you build a good API?
Actually I don’t know yet, but here are some things that might be useful:
- API Umbrella, which is used at api.data.gov, provides API key management, rate limiting, and so on as a wrapper around an existing API. It was also part of the inspiration for this blog post.
- instant-api, by Waldo Jaquith, creates an API from static data.
- qu, by the CFPB, creates a platform for serving data.