OAuth • httr2

library(httr2)

OAuth¹ is an authorization framework design for performing work on the behalf of a user. You’ve probably used it a bunch without knowing what it’s called: you use it when you login to a non-Google website using your Google account, when you give your phone access to your twitter account, or when you login to a streaming app on your smart TV.

You’ll notice none of these common scenarios quite match to using an R package to scrape data from a web API, and that’s the pleasure and pain of OAuth. OAuth gives you the incredible power to extract data from popular web services, but it’s fundamentally designed for a different use case. This means that while httr2 can make the interactive experience pretty seamless, unattended usage (i.e. on CI or in an automated script) is usually going to include some pain.

This vignette builds on the techniques discussed in vignette("wrapping-apis"), so I recommend you read that vignette first. I’ll also assume that you’re working in a package, since most people want to wrap up an API into a bunch of R functions, but most of the ideas will also apply if you’re creating a one-off script.

OAuth basics

OAuth is a broad framework that has many different variants, called flows, which makes it hard to provide sweeping generalisations, but the basic idea of OAuth is to create a hierarchy of increasingly more specific and shorter-lived credentials, so that the impact of a credential being lost is as small as possible.

The longest lived and most powerful credential is typically a user name and password. Most people don’t change their passwords regularly, and often (against all advice) reuse the same password on multiple websites. And if you have a user name and password, you have total control over that account; you can even use it to change the password so the actual user can’t log in. That means as a programmer you never want to touch user name-password pairs because if they’re lost or stolen, they give very wide access.

Avoiding that problem lead to the creation of OAuth. The basic idea is that instead of your package asking the user to give you their user name and password, you instead ask them to log in and give your package permission to use the API on their behalf. The API gives this permission in the form of an access token which is essentially a random string of numbers and letters, e.g. UfNlXaEog03hdRPTUPpEInEiIW01jI1WcjOB. An access token is very short lived, lasting maybe only days, and is bound to a specified scope of access. The access token has some big advantages over a user name and password:

It’s short-lived so if it’s lost or stolen, there’s only a limited amount of time where it can be abused.
It has limited scope, so even if stolen, it can’t be used to do something particularly nefarious like changing your password or contact details.
It’s bound to a specific application, so it can be invalidated (cancelled) without affecting any other uses.

If you have an access token you can use it to authenticate an API by passing it as a bearer token in the Authorization header, which you can do with req_auth_bearer_token(). However, in most cases you will want to let httr2 manage this by calling one of the req_oauth_ functions we’ll talk about shortly.

One of the reasons that you want httr2 to manage your tokens is that because access tokens are so short lived, they’re often accompanied by a refresh token. A refresh token lasts for a longer amount of time and has one job: it allows you to get a new access token when the previous one expires. You need to look after a refresh token a little more carefully than an access token. In particular, you should never include a refresh token in a HTTP request as that’s the job of the access token.

Overall this leads to a hierarchy of credentials from weakest to strongest:

The access token usually only lasts a couple hours and needs to be submitted with every request.
The refresh token lasts for days to weeks and it’s designed to be stored locally so you can regenerate access tokens. (We’ll talk about this more in Caching, below.)
The user name + password gives access to everything so your code should never touch them!

Now that you’ve got the basic idea of OAuth, lets talk about the details.

Clients

The first step in working with any OAuth API is to create an application client. It’s called an application client because in the wider world OAuth is usually used with a web, phone, or tv app, but here your “app” is going to be your R package. For this reason, httr2 calls the application client a client, but many APIs will call it an OAuth application.

To create a client, you’ll need to first register for a developer account on the API’s website. In most cases this is easy, totally automated, and only takes a couple of minutes. That will give you access to some sort of developer portal which you can then use to register a new OAuth app (aka a client). This process varies from API to API (it’s normal to spend some time hunting through the docs and settings), but at the end of it you’ll get a client id (another random string of numbers and letters). Sometimes the client id is all you need, and you can create a httr2 client with oauth_client(), e.g.:

client <- oauth_client(
  id = "28acfec0674bb3da9f38",
  token_url = "https://github.com/login/oauth/access_token",
  name = "hadley-oauth-test-2"
)

The call to oauth_client() also includes a name and a token_url.

The name is human-facing, and should typically be the same as your package (the thing that prompted you to create the client). I have a bunch of apps that I’ve used for testing, so here I’ve used the name hadley-oauth-test-2 to remind me which app this client corresponds to.
The token_url points to the URL that’s used to obtain an access token. You’ll need to find this from the documentation for the API you’re wrapping; it will typically be found in the section that describes the OAuth process and will be an endpoint that returns an access token. Don’t be surprised if this endpoint feels very different to the rest of the API; auth is often implemented by a third-party package with slightly different conventions to the rest of the API.

Client secret

In most cases, however, the API will also require a client secret. While this is called a secret, it’s typically not that important to keep it a secret because of two reasons:

It’s typically easy to create an new app on the developer website so stealing yours wouldn’t save much time.
It’s unusual for an OAuth client to be able to do anything in its own right, so stealing your secret doesn’t have much benefit.

That means that unless you have paid for the app or given it private information while creating it, it’s ok to embed the client in your package. That said, httr2 provides some tooling to obfuscate your client secret so that the client secret isn’t directly embedded in your source code, and hence vulnerable to scraping.

To obfuscate a string, call obfuscate(), then copy and paste the result into your package. For example, if your client secret was “secret”, you’d call obfuscate() then you’d copy and paste obfuscated("B4Evdd5x4wl0XTWvtTpuGaw7nM7GEg") into your client specification.

obfuscate("secret")
#> obfuscated("IHCPMSJWkgL-l7p6yTUcN0m7MWSCeg")

Here’s what a complete client specification for GitHub looks like, using a real app that I created specifically for this vignette:

client <- oauth_client(
  id = "28acfec0674bb3da9f38",
  secret = obfuscated("J9iiGmyelHltyxqrHXW41ZZPZamyUNxSX1_uKnvPeinhhxET_7FfUs2X0LLKotXY2bpgOMoHRCo"),
  token_url = "https://github.com/login/oauth/access_token",
  name = "hadley-oauth-test"
)

You can certainly uncover my client secret if you are an experienced R programmer and are willing to spend a bit of time experimenting, but I’m pretty sure it’d be easy for you to just create your own app on GitHub.

Packaging

I recommend wrapping client creation in a function in your package, e.g.:

github_client <- function() {
  oauth_client(
    id = "28acfec0674bb3da9f38",
    secret = obfuscated("J9iiGmyelHltyxqrHXW41ZZPZamyUNxSX1_uKnvPeinhhxET_7FfUs2X0LLKotXY2bpgOMoHRCo"),
    token_url = "https://github.com/login/oauth/access_token",
    name = "hadley-oauth-test"
  )
}

You’ll need this in order to run tests for your package, but you’ll probably also want to use it as the default client for your users. In some cases, it will be necessary for each user to create their own app and matching client (e.g. if rate limits are applied by app, not by user), but this is much less user friendly so you should avoid it if possible. That said, you should always provide a way for a user to supply their own client when you bundle a default. You can see an example of this in the googledrive package.

Authorization code flow

Once you have a client you need to use it with a flow in order to get a token. You’ll need to read the docs for your API to figure out which flows it supports, but the most common is the authorization code² flow, which works something like this:

httr2 opens a browser using the authorization URL provided by the API. The URL includes parameters that identify your app and what scope of access you’re looking for (e.g. tweet.read, userinfo.write).
The user logs in using their user name and password (hopefully using a password manager) and approves the request.
The API sends an authorization code back to httr2 using a callback URL that was supplied in the initial request.
httr2 sends the authorization code to the token URL to get an access token.

In httr2 this flow is implemented by a pair of functions: oauth_flow_auth_code() and req_oauth_auth_code(). Start with oauth_flow_auth_code() to check that you have all the parameters correctly specified, then use req_oauth_auth_code() to authenticate requests. These two steps are described in the following sections.

Creating a token

oauth_flow_auth_code() is the best way to verify you’ve correctly specified the parameters to the client and the auth_url, without depending on correctly understanding any other part of the API. For example, you could get a token to access the GitHub API (using the client defined above) with this code:

token <- oauth_flow_auth_code(
  client = client,
  auth_url = "https://github.com/login/oauth/authorize"
)

This flow can’t be used inside a vignette because it’s designed specifically for interactive use, but if you do run it and print out the token, you’ll see something like this:

token
#> <httr2_token>
#> token_type: bearer
#> access_token: <REDACTED>
#> scope: ''

There’s not much to see here because httr2 automatically redacts the access token (because that could be used to perform actions on behalf of the user).

If your call to oauth_flow_auth_code() succeeds then you’ve got everything set up correctly and you can proceed to the next step. Otherwise, you’ll get an HTTP error. If you’re very lucky, that error will be informative and will help you figure out want went wrong. However, in most cases, you’ll need to carefully double check that you’ve correctly copied and pasted the client id and secret, and check that you’ve supplied the correct authorization and token urls (auth_url and token_url). If the docs have multiple candidates for each and you’re unclear about which is which, you’ll need to do some systematic experimentation.

Authenticating a request

Initial configuration is the only time that you’ll see an httr2_token object because you’ll generally want to rely on httr2 to manage the tokens for you. You’ll do that with req_oauth_auth_code(). To check that it’s working correctly, I recommend finding the simplest possible API endpoint to test it with. A good place to start is an endpoint that provides information about the “current” user, if your API provides one.

For example, the GitHub API provides a GET endpoint at /user that returns information about the current user. If we make a request to this endpoint without authentication, we’ll get an error:

req <- request("https://api.github.com/user")
req |>
  req_perform()
#> Error in `req_perform()`:
#> ! HTTP 401 Unauthorized.

We can authenticate this request with req_oauth_auth_code(), using the same arguments as our previous call to oauth_flow_auth_code():

req |>
  req_oauth_auth_code(
    client = github_client(),
    auth_url = "https://github.com/login/oauth/authorize"
  ) |>
  req_perform() |>
  resp_body_json() |>
  str()

When you run this code, you’ll see something like this, but it will obviously contain information about you, not me.

#> List of 32
#>  $ login        : chr "hadley"
#>  $ id           : int 4196
#>  $ node_id      : chr "MDQ6VXNlcjQxOTY="
#>  $ avatar_url   : chr "https://avatars.githubusercontent.com/u/4196?v=4"
#>  $ gravatar_id  : chr ""
#>  $ url          : chr "https://api.github.com/users/hadley"
#>  $ html_url     : chr "https://github.com/hadley"
#>  ...
#>  $ type         : chr "User"
#>  $ site_admin   : logi FALSE
#>  $ name         : chr "Hadley Wickham"
#>  $ company      : chr "@posit-pbc"
#>  $ blog         : chr "http://hadley.nz"
#>  $ location     : chr "Houston, TX"

Caching

There are two big reasons to allow httr2 to manage tokens for you. The first is that httr2 will automatically refresh the token if it’s expired. The second is cross-session caching, as described below.

By default, the OAuth token will be cached in memory. That means that you will only need to authenticate once in the current session, but you’ll need to re-authenticate if you restart R. In some cases, you may want to save the tokens (refresh and access) so that they can be used across sessions. This is easy to do (just set cache_disk = TRUE) but you need to think through the consequences of saving the refresh token on disk.

httr2 does the best it can to save these credentials securely. They are stored in a local cache directory (oauth_cache_path()) that is accessible to the current user, and they are encrypted so they will be hard for any package other than httr2 to read. However, there’s no way to prevent other R code from using httr2 to access them, so if you do choose to cache tokens, you should inform the user and give them the ability to opt-out. httr2 automatically deletes any cached tokens that are older than 30 days whenever it’s loaded. This means that you’ll need to re-auth at least once a month, but prevents tokens from hanging around on disk long after you’ve forgotten you created them.

You can see which clients have cached tokens by looking in the cache directory used by httr2:

dir(oauth_cache_path(), recursive = TRUE)

Each client gets its own subdirectory named using the client name, so if you turn caching on, it’s particularly important to give your client a good name so that the user can easily tell which package the tokens belong to.

Other flows

When wrapping an API, you’ll need to carefully read the documentation to figure out which flows it provides. If possible you’ll want to use the authorization code flow since it generally provides the best experience, but if it’s not available you’ll need to carefully consider the others. Currently, httr2 supports the following flows:

req_oauth_device() uses the “device” flow which is designed for devices like TVs that don’t have an easy way to enter data. It also works well from the console.
req_oauth_client_credentials() and req_oauth_bearer_jwt() are often needed for service accounts, accounts that represent automated services, not people, and are often used in non-interactive environments.
req_oauth_password() exchanges a user name and password for an access token.
req_oauth_refresh() works directly with a refresh token that you already have. It’s useful for testing and automation.

httr2 doesn’t support the implicit grant flow. This was historically important but is now mostly deprecated and was never a particularly good fit for R because it relies on a technique for returning the access token that only reliably works inside a web browser.

Regardless of which flow you use, you’ll need to follow the same process as the example above: first figure out how to get a token using the oauth_flow_ function, then actually use oauth with a request, by calling the matching req_oauth_ function.

One additional wrinkle is that many APIs don’t implement the flow in exactly the same way as the spec so httr2’s built-in flows might not work at all. If your initial attempt doesn’t work, you’re going to need to do some sleuthing. This is going to be painful, but unfortunately there’s no way around it. I recommend wrapping your httr2 code in with_verbosity() so you can see exactly what httr2 is sending to the server. You’ll then need to carefully compare this to the API documentation and play “spot the difference”. You’re very welcome to file an issue and I’ll do my best to help you out.