October marks my third year of working on docs.microsoft.com. A lot has changed since we first started on the project, and in this essay I want to focus on one very specific aspect of our tooling and infrastructure. The topic is closely related to my personal mission: empower people to spend time in meaningful ways.
At docs.microsoft.com, we push tens of thousands of lines of content daily in the API documentation space. Don’t take my word for it - check one of the many pull requests in the Azure Python SDK repo, where one time we added almost 3 million lines of auto-generated docs. Some are parameter name updates, some are examples, and others are nothing more than a package version revision. It’s all done automatically and dropped for consumption in a structured format. This is the topic of our conversation today - the merits of structured content.
(No, this is not an official logo - just something that 20 minutes in Photoshop gets you when you need to tell people what my team is doing. Yes, the acronym is BAD.)
Structured content, in the DocFX realm, is typically represented through YAML. There are variations as you look from platform to platform - for example, what we generate for .NET API content might be slightly different from what we have for TypeScript, as the language constructs are vastly different. But at the end of the day, the system has one format it expects to process. So why is it so beneficial to keep the content structured instead of having it in a more “human-friendly” format, like Markdown?
Content created in structured form is, by construction, meant to be consistent with some defined baseline. One could argue that this is not always the case - given enough leniency in defining the structure, one can make it wildly inconsistent with what is expected - but that’s not what we are after with DocFX.
All YAML content has a specific set of properties that are universally understood by the system. As an example, one of those elements is the unique identifier (UID), which allows developers and authors to easily reference content with the help of the cross-reference syntax. We know for a fact that any automatically-generated API documentation file will have this identifier. There are also a number of specific fields that we use to identify the type of APIs involved.
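As a sketch, a structured item for a hypothetical Python API might look something like this. The field names here are illustrative only - this is not the exact DocFX schema - but it shows the kind of universally understood properties described above, including the UID:

```yaml
# Illustrative example - field names are not the exact DocFX schema.
items:
- uid: contoso.storage.BlobClient.upload   # unique identifier, resolvable via cross-reference syntax
  name: upload
  type: method                             # field identifying the kind of API
  summary: Uploads a blob to the container.
  syntax:
    parameters:
    - id: data
      type: bytes
      description: The payload to upload.
```

Because every generated file carries a `uid`, any other page can point at this method without knowing where it physically lives.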
Consistency is important when it comes to building any processing tooling - it helps us avoid re-engineering things for each new language, and instead rely on standard conventions.
Abstracting out the complex
Humans are bad at understanding complex structures, especially really large ones. Imagine having to write documentation for 500 namespaces, each with a couple hundred types and even more members under each. How sure could one be that no errors were introduced in any of those? What happens when the documentation needs to be updated? Good luck making sure that every single field and section name stays consistent. Even better - good luck making sure that you don’t miss something, like an example or a remark.
When the content is structured, API surface updates - whether new APIs come out or old ones are sunsetted - are completely abstracted away from the user; one does not have to care about the method Foo.Bar(string param). The user only cares about always having the latest documentation published. All of this comes with the added bonus of the baseline standard being automatically enforced - everywhere (more on that in the next point).
Strict structure and quality enforcement
Having structured content means that it is possible to enforce strict quality and content rules. No matter what your testing gate is, having it bound to structured representations of data allows you to quickly validate whether the content itself is compliant or not. For example, on docs.microsoft.com, we have the ability to verify whether the generated API identifiers are valid or not. There is no need to validate links or call a service to check whether the page returns a 404 - we just check whether the ID can be resolved by the publishing service, and if it can’t, throw a warning.
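A minimal sketch of what such a check might look like, assuming we have loaded the set of UIDs the publishing service can resolve (the function and data here are invented for illustration, not the actual docs.microsoft.com service):

```python
# Toy sketch of UID-based link validation - not the real publishing service.

# Assume this set holds every UID the publishing service can resolve.
known_uids = {
    "contoso.storage.BlobClient",
    "contoso.storage.BlobClient.upload",
}

def validate_xrefs(referenced_uids):
    """Return a warning for every cross-reference that cannot be resolved."""
    warnings = []
    for uid in referenced_uids:
        if uid not in known_uids:
            warnings.append(f"warning: unresolved xref '{uid}'")
    return warnings
```

No HTTP requests, no 404 checks - membership in the resolved-UID set is the whole test, which is exactly what makes the gate fast.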
With API documentation, we are also very strict about what we refer to as “nutrition facts” - the set of fields and their order on a page. We want to ensure that whenever a user goes through, say, a Python API doc, they get the exact same structure when they jump to another Python API doc. This guarantee is not enforced exclusively through policy, but rather through a combination of policy (API content has to be structured) and format (structured content has a schema) - it’s all defined in a consistent way.
With no parsing overhead, we also minimize the risk of lower-quality content making it through the test gates.
Reduction of human involvement
This follows from the earlier point about abstracting out complexity - humans never touch structured content. This allows us to update API documentation quickly and reliably: what took weeks with MSDN is now taking hours, if not minutes, with the new, modern infrastructure. Removing humans from this equation is not a bad thing. Instead of doing the dull work of writing out API signatures, they can now write great examples, valuable tutorials, and other API-related content.
Think of it this way - instead of assembling the car yourself, we ask you to drive the car instead and use it for your tasks. We’ll handle the engineering of the vehicle.
When someone hand-writes API documentation, a lot of it becomes very hard to change at scale. If a user creates a method table, even assuming it is somehow automatically updated to account for the latest changes, it remains an inflexible UX affordance. Think of a scenario where we now carry information in the underlying model about method implementation - we now know which method is static and which is not. How do we integrate that into the method table in a way that is instantly consistent with all the .NET documentation we host? There are tens of situations like that, and with structured content it’s simply a matter of changing the template once the model changes.
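To make the “change the template once” idea concrete, here is a toy rendering function. The model and field names (including the isStatic flag) are invented for illustration - the point is that adding one line to the template propagates the new badge to every generated page:

```python
# Toy template sketch - not DocFX's actual templating system.
# The underlying model now carries an "isStatic" flag for each method.

def render_method_row(method):
    """Render one row of a method table from the structured model."""
    # One template tweak here adds the "static" badge everywhere at once.
    badge = "static " if method.get("isStatic") else ""
    return f"| {badge}{method['name']} |"

methods = [
    {"name": "Parse", "isStatic": True},
    {"name": "ToString", "isStatic": False},
]
rows = [render_method_row(m) for m in methods]
```

With hand-written tables, the same change would mean editing every page that contains one.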
This model is at the core of DocFX and has served us quite well since the inception of the API documentation system on docs.microsoft.com.
An important point when it comes to changes is how to make them in a way that does not break existing content - we’ve done some work in this space, reflected in the schema-based document processor (SDP). When a new platform comes on board, we can quickly define a new schema that acts as an “outline” of the target content format. When changes happen, rev-ing the schema is much easier than doing significant tooling or Markdown content overhauls.
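A toy validator shows what a schema acting as an “outline” might look like. This is not the real SDP - the required fields and allowed types below are made up - but it illustrates how adding one entry to the schema revs the contract without touching any tooling:

```python
# Toy schema-as-outline validator - not the real schema-based
# document processor (SDP). Field names are illustrative.

schema = {
    "required": ["uid", "name", "type", "summary"],
    "allowed_types": {"class", "method", "property"},
}

def check_document(doc):
    """Return a list of violations of the schema outline."""
    errors = []
    for field in schema["required"]:
        if field not in doc:
            errors.append(f"missing required field: {field}")
    if doc.get("type") not in schema["allowed_types"]:
        errors.append(f"unknown item type: {doc.get('type')}")
    return errors
```

Onboarding a new platform, in this sketch, is just writing a new `schema` dictionary rather than a new processor.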
Inferrable from existing tooling
Last but not least, a lot of existing tooling that produces API content does so in a way that is already structured. Try using Javadoc, mdoc, TypeDoc or Sphinx - they all output information about various APIs in a way that makes updates easy at scale. Can we re-use that? Yes, absolutely - but each tool has its own format, and that’s where we come in with code2yaml, ecma2yaml, type2docfx and sphinx-docfx-yaml to snap everything into one consistent representation of the API content.
This essay is not meant to be the be-all and end-all guideline for convincing you that having your API content in a structured format is a good idea. It merely reflects my own experience with API content, so take it with a grain of salt. I am curious to learn from you, dear reader, how you manage API documentation in your organization.
Have any thoughts? Let me know on Twitter!