An API-first, composable mobile Operating System

An operating system based on APIs and composable blocks.

When I use my phone, I mainly do three things.

  • Tap
  • Speak
  • Read

Your phone’s Operating System (OS) does a lot of cool stuff, but the way we use it is usually very simple.
Unless they spend all day in a browser, most users rely on apps to achieve their goals.
I use Gmail’s app to read my emails all the time, YouTube to stream videos from their platform and WhatsApp to stay in touch with friends!

Ultimately, all these apps are nothing but vessels that encapsulate data (sent or received) and display a desired output in one form or another (audio, video, text or a mix of these).

Most apps do extremely simple things: take a picture, show you some web results, or show you data in one format or another.
They do this using Application Programming Interfaces (APIs). There are multiple types of APIs: your OS has APIs to let apps access various components (camera, microphone, etc.), software has APIs, and so do web services.

APIs are really neat: they allow us to communicate with other components and delegate tasks instead of having to reinvent the wheel each time.
This allows us to build flexibility into our applications.

Flexibility matters

When I fetch a list of videos from YouTube, or tracks from Spotify, I am presented with something that looks like a table with many rows. Each row has some thumbnails, descriptions and other data. Usually, tapping on a row of the table lets me access that data, video or audio track.

Given that many apps share this aspect, wouldn’t it make sense to re-use the same layout for both services?
Intents (represented as buttons on today’s phones) would be mapped accordingly, in plain sight or within menus.
[ ⚡️ BTW: An Intent is basically a message to say you did or want something to happen. This is how intents are defined in Android’s dev documents]

For example, the Queue button on Spotify might not apply to YouTube’s app.
However, rendering results, viewing the media, fast-forwarding, rewinding and play/pause/stop are controls shared by both applications.
To my mind, this means the interface can be simplified.
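To make the idea concrete, here is a minimal sketch of one row layout shared by two services. Everything in it is an assumption made for illustration: the field names (`title`, `channel`, `name`, `artist` and so on) are hypothetical and are not taken from the real YouTube or Spotify APIs.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One row of the shared list layout (hypothetical structure)."""
    title: str
    subtitle: str
    thumbnail_url: str


def row_from_video(video: dict) -> Row:
    # Map a made-up video payload onto the shared row.
    return Row(video["title"], video["channel"], video["thumbnail"])


def row_from_track(track: dict) -> Row:
    # Map a made-up track payload onto the same row.
    return Row(track["name"], track["artist"], track["cover"])


# Controls every media service shares, plus per-service extras.
SHARED_CONTROLS = {"play", "pause", "stop", "fast_forward", "rewind"}
SERVICE_CONTROLS = {"spotify": {"queue"}, "youtube": set()}


def controls_for(service: str) -> set:
    """Return shared controls plus whatever the service adds on top."""
    return SHARED_CONTROLS | SERVICE_CONTROLS.get(service, set())
```

With this shape, `controls_for("spotify")` includes `queue` while `controls_for("youtube")` returns only the shared controls, and both services render through the same `Row`.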

Composable user interfaces: Speed or design?

In a world where apps are on average 50 MB to download (and growing!), I feel like things are getting a little out of hand.
Creativity and app designs are cool but ultimately unnecessary when a few KB worth of text could be enough to let an OS understand what it needs to do:

  • How to log in to the service
  • How to handle results from each API endpoint
  • Which intents a user might have
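As a rough illustration of how small such a description could be, here is a sketch of a service descriptor covering the three points above. The format, the key names and the `music.example` URLs are all invented for this example; no existing standard is implied.

```python
import json

# Hypothetical descriptor: how to log in, how to render each endpoint,
# and which intents the service supports.
DESCRIPTOR = {
    "service": "example-music",
    "login": {
        "type": "oauth2",
        "authorize_url": "https://music.example/oauth/authorize",
    },
    "endpoints": {
        "/tracks": {"method": "GET", "render_as": "list"},
        "/tracks/{id}": {"method": "GET", "render_as": "player"},
    },
    "intents": ["play", "pause", "queue", "like"],
}


def validate(descriptor: dict) -> list:
    """Return the names of required sections missing from a descriptor."""
    required = ("login", "endpoints", "intents")
    return [key for key in required if key not in descriptor]


# Size of the descriptor once serialized as UTF-8 JSON text.
size_in_bytes = len(json.dumps(DESCRIPTOR).encode("utf-8"))
```

Serialized, this descriptor is well under a kilobyte, which is the whole point: the OS would ship the rendering logic once and each service would only ship text like this.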

Here’s an example. Think about a voice command: “Play my favorite song”.
This would require the OS to parse the language, understand what the command actually needs and execute it.

Simple actions, everywhere.

Let’s take a simple task such as tweeting. Tweeting as an action is literally pushing data to an API endpoint. What happens behind the scenes at Twitter doesn’t matter. Send data, pull data. Write tweets, read tweets, act on a tweet.

Rendering tweets as an operation is just a succession of tiny cards of text and images.


Now maybe you want to add buttons and support image tweets?

Every interface should be composable and have mounting points where you can optionally show the intents or actions a user can take on a given node.

A tweet has a favorite intent and a retweet intent, possibly you could also add a save for later or reply intent directly on it. Obviously this is only a simple example but it goes to show that composable interfaces are extremely simple to put together once you understand how they work.
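Here is a minimal sketch of what such a card could look like, with the image and the intents as optional mounting points. The `Card` class and the text-based `render` function are hypothetical stand-ins for whatever UI components an OS would really provide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Card:
    """A tiny content card with optional mounting points."""
    text: str
    image_url: Optional[str] = None                    # optional image slot
    intents: List[str] = field(default_factory=list)   # optional action slot


def render(card: Card) -> str:
    """Render the card as plain text, skipping any empty mounting point."""
    lines = [card.text]
    if card.image_url:
        lines.append(f"[image: {card.image_url}]")
    if card.intents:
        lines.append(" | ".join(f"({name})" for name in card.intents))
    return "\n".join(lines)


plain = render(Card("Hello world"))
rich = render(Card("Hello world", intents=["favorite", "retweet", "reply"]))
```

The plain card renders as just its text, while the second one gains a row of actions. Adding a "save for later" intent would only mean appending one more name to the list.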


So why do I need to download an app to do any of these things?

You don’t download a keyboard every time you run an app. The keyboard app is shared amongst all apps.

As we’ve seen, intents vary per app or service. I cannot technically retweet a Spotify track, and clicking Like on a tweet won’t secretly add it to a list, unlike on Spotify.

These are interesting problems to have: In a futuristic world where every app uses the same UI elements and structure but the intents underneath provide different outcomes, how do we inform the users of what’s really happening when they request the OS to act for them?

“Hey Phone - show me my 10 most played songs”
“Hey Phone - show me the new pictures from my Instagram timeline”

These intents are hard to map today, but they should be easy in the future.
Breaking down intents might be a useful way to tell your phone which API requests have to happen.

The operating system should be instructed to act like a huge API client that reads a certain API spec, so that it can learn how to perform actions or chains of actions based on the intents parsed by the natural language processing unit of the OS.
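As a sketch of that idea, the snippet below turns a parsed intent into a chain of API requests by reading a small spec. The spec format, the intent names and the paths are assumptions made up for this example, and nothing is sent over the network; the function only builds the plan an OS-level client would then execute.

```python
# Hypothetical spec: each intent maps to one or more API calls, in order.
SPEC = {
    "show_top_tracks": [
        {"method": "GET", "path": "/me/top-tracks", "params": {"limit": "{count}"}},
    ],
    "share_photo": [
        {"method": "POST", "path": "/media", "params": {"file": "{photo}"}},
        {"method": "POST", "path": "/posts", "params": {"text": "{caption}"}},
    ],
}


def plan_requests(intent: str, slots: dict) -> list:
    """Build the ordered list of requests for an intent, filling in its slots."""
    steps = SPEC.get(intent)
    if steps is None:
        raise KeyError(f"no API mapping for intent {intent!r}")
    plan = []
    for step in steps:
        # Substitute slot values (e.g. count=10) into the parameter templates.
        params = {key: value.format(**slots) for key, value in step["params"].items()}
        plan.append({"method": step["method"], "path": step["path"], "params": params})
    return plan


# "Hey Phone - show me my 10 most played songs"
top_tracks_plan = plan_requests("show_top_tracks", {"count": 10})
```

A single-step intent yields one request, while `share_photo` yields two requests that must run in order, which is what a "chain of actions" would look like in practice.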

But what about other functionalities, that are more unique or hard to define with a vocal intent?

Some apps like to use custom stickers, add image filters, display maps with custom indicators. Operating systems should be better at handling these things.

Intents should become a large library that an operating system can expand with every update. It’s down to app developers to define which intents a user can trigger and which of them are common or unique.

Instagram, Facebook, Twitter or Spotify: uncomplicated apps that weigh 100 MB - WHY?

  • They’re just cards with content that I can scroll, tap or dismiss and interact with in other ways
  • Displayed as grids or lists

These apps could easily be redrawn with OS components.

Abstracting apps?

Can we abstract all of these UI elements so that we only need to provide an interaction layer?

The OS must be fed a list of actions that can be taken by the user and what API to hit to publish or retrieve data.

  • Take a picture on Instagram
  • Play a song on Spotify
  • Show me the most retweeted message by…

Interacting via voice

When I create an app and I have an action associated with a button, there should be a specification that allows me to map a voice command to it.
A little like intents:

  • Conversational intents: Thank you, hello, hey, what do you think
  • Objective-driven intents: Retrieve image, take selfie, play song, email target, turn on flashlight

This way if I want to take a picture and share it on Twitter, the OS should recognize:

  • I intend to take a picture and I want to use the Twitter API to push it to their servers.
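A very rough sketch of that mapping is below. It uses plain phrase matching as a stand-in for the real natural language processing an OS would need, and the intent names and trigger phrases are made up for the example.

```python
# Hypothetical mapping from intents to the phrases that trigger them.
VOICE_MAP = {
    "take_picture": ("take a picture", "take a photo", "take selfie"),
    "play_song": ("play song", "play my favorite song"),
    "share_on_twitter": ("share it on twitter", "tweet it"),
}


def intents_in(utterance: str) -> list:
    """Return the intents found in an utterance, in the order they are spoken."""
    text = utterance.lower()
    found = []
    for intent, phrases in VOICE_MAP.items():
        # Record where the earliest matching phrase for this intent starts.
        positions = [text.find(phrase) for phrase in phrases if phrase in text]
        if positions:
            found.append((min(positions), intent))
    return [intent for _, intent in sorted(found)]


chain = intents_in("Take a picture and share it on Twitter")
```

For that sentence, `chain` is `["take_picture", "share_on_twitter"]`: two intents, in spoken order, that the OS could then resolve into a camera call followed by an API request.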

Clever APIs

Example: Hypermedia APIs increase the discoverability of an API’s resources and methods.
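To show what that discoverability could mean for intents, here is a small sketch that reads the available actions straight out of a response. The payload is invented and is not Twitter's API; it only borrows the convention, used by HAL-style hypermedia formats, of listing related links under a `_links` key.

```python
# Made-up hypermedia response: the resource advertises what can be done with it.
RESPONSE = {
    "id": 42,
    "text": "Hello world",
    "_links": {
        "self": {"href": "/tweets/42"},
        "favorite": {"href": "/tweets/42/favorite"},
        "retweet": {"href": "/tweets/42/retweet"},
    },
}


def available_intents(resource: dict) -> dict:
    """Map each advertised link relation (except 'self') to its URL."""
    links = resource.get("_links", {})
    return {rel: link["href"] for rel, link in links.items() if rel != "self"}


intents = available_intents(RESPONSE)
```

Here the OS would learn that this item supports `favorite` and `retweet` without the developer hard-coding either, and a resource with no `_links` would simply expose no intents.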