Exploring ActivityPub/ActivityStreams

Tim Chase

2024-02-01

Overview

Having wanted to dig into ActivityPub/ActivityStreams for a while, I decided to read the specs and dive in. Having read plenty of RFC specs before, I figured this wouldn't take too long. Boy was I wrong.

This article is currently a WIP so it may get updated multiple times as I learn more and read deeper into the various specs/RFCs and their sub-dependencies.

The whole family of specs feels like the authors wanted a grab-bag of front-end features rather than detailing the protocols with the precision of engineers that write RFCs. The observations that follow document some of my frustrations along the way. Sorry if this comes across a little more ranty & grumbling compared to my usual posts.

On top of all the standardized mess, Mastodon adds its own layer of non-standardized attributes that other ActivityPub software is expected to understand.

ActivityPub (part 1)

Upon embarking, the first thing I learned was that Mastodon, GoToSocial, Pleroma, PixelFed, PeerTube, and other distributed-web services all run on a backbone of ActivityPub so I figured I'd start there.

The ActivityPub spec clocks in at about 36 pages when printed. Pretty manageable.

The very first thing you read there?

The ActivityPub protocol is a decentralized social networking protocol based upon the ActivityStreams 2.0 data format. It provides a client to server API for creating, updating and deleting content, as well as a federated server to server API for delivering notifications and content.

Okay, I guess I need to read the ActivityStreams spec first.

ActivityStreams (part 1)

Okay, the ActivityStreams spec adds roughly another 46 pages of reading.

ActivityStreams consist of JSON data, formally defined in RFC-7159. Fortunately, I'm reasonably familiar with JSON so I won't go into that here.

However, barely into the second section and we discover that we need to learn about JSON-LD (another roughly 200 pages).

JSON-LD part 1

Based on the spec, the web-server should serve JSON-LD documents with a MIME type of application/json.

And one of the first things we learn? There's a relationship with RDF so time to take a detour to learn about RDF concepts.

RDF (part 1?)

Goody, another 41 pages of reading, with cross-references to other things like RDF-Schema. I'll defer digging into that for now.

The general gist is that RDF specifies things in terms of Subject→Predicate→Object. The Subject may be an IRI, or a "blank node" (but not a literal). The Object may be an IRI, a "blank node", or a literal. See further documentation. Why can an Object be any of them, but a Subject can't be a Literal? Kinda glossed over.
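To make that shape concrete, here's a minimal sketch of the constraint in Python. The class names (IRI, BlankNode, Literal, Triple) are my own invention for illustration and don't come from any RDF library; the only rules encoded are the ones described above, plus the RDF concepts document's requirement that the Predicate be an IRI.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: Union[IRI, BlankNode]
    predicate: IRI
    object: Union[IRI, BlankNode, Literal]

    def __post_init__(self):
        # Dataclasses don't enforce annotations, so check by hand.
        if not isinstance(self.subject, (IRI, BlankNode)):
            raise TypeError("subject must be an IRI or a blank node")
        if not isinstance(self.predicate, IRI):
            raise TypeError("predicate must be an IRI")
        if not isinstance(self.object, (IRI, BlankNode, Literal)):
            raise TypeError("object must be an IRI, blank node, or literal")
```

Building a Triple with a Literal as the subject raises a TypeError, while the same Literal as the object is accepted.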

ActivityStreams (part 2)

Popping from the stack, we return to ActivityStreams. Note that JSON-LD has a MIME type of application/json while ActivityStreams specifies application/activity+json so we already have conflicting information.

However, before getting a paragraph into Serialization we learn the spec defines an entire ActivityStreams vocabulary so time for another detour.

ActivityStreams Vocabulary (part 1)

The vocabulary definition clocks in at roughly another 54 pages. For those playing along at home, we're up to 200+36+46+54=336 pages without counting JSON specifications or the FoaF & Relationship specs referenced by the ActivityStreams Vocabulary document.

Looking over this document, the particulars will require knowing more about ActivityPub/ActivityStreams so I'll return to this later.

ActivityStreams (part 3)

The Good

Encoding: The specification clearly states that all data gets serialized as UTF-8. This is good. I've encountered too many file-formats where this doesn't get defined and I have to guess the encoding. It's nice to be able to identify non-UTF8 text and reject malformed requests without having to retry some other encoding.
Date formats: Additionally, the spec clearly defines date formats according to RFC-3339. However, I find the "time-offset isn't a time-zone" mildly annoying.

The Bad

Context data-type

What is the data-type of a @context's value? According to the JSON-LD spec

…the value of a term definition can either be a simple String, mapping the term to an IRI, or a map.

In the case of an ActivityStream, the String form has the fixed URI value of "https://www.w3.org/ns/activitystreams". In the case of an object/map value, the document's type resides in the @vocab sub-node along with other attributes such as the @language or namespace definitions.

Except when it's a list. Wait, the JSON-LD spec clearly said a simple String…or a map. Nothing about a list. Yet the JSON-LD spec itself uses lists as values for the @context.

Fine. It can be a list. Is it a list of homogeneous elements? Of course not. Each element can be either a fixed String or a JSON object/mapping.

So how do you find the document type? Maybe it's the URI String found at .@context or at .@context.@vocab or at .@context[0] or at .@context[0].@vocab or maybe someplace else. Could it be .@context[1] instead? The specs don't have much to say about that.
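For what it's worth, here's roughly the defensive lookup this forces on a consumer. It's a sketch under my own assumptions: the function names are made up, it only checks the places listed above (a bare String, a map's @vocab, or any entry of a list), and it is nowhere near a real JSON-LD context processor.

```python
from typing import Any, Iterator

AS2 = "https://www.w3.org/ns/activitystreams"


def context_iris(context: Any) -> Iterator[str]:
    """Yield every IRI that might name the vocabulary, in document order."""
    if isinstance(context, str):
        yield context
    elif isinstance(context, dict):
        vocab = context.get("@vocab")
        if isinstance(vocab, str):
            yield vocab
    elif isinstance(context, list):
        # Entries may be strings, maps, or (who knows) more lists.
        for entry in context:
            yield from context_iris(entry)


def looks_like_activitystreams(doc: dict) -> bool:
    """True if the ActivityStreams IRI shows up anywhere we know to look."""
    return AS2 in context_iris(doc.get("@context"))
```

Note that this treats every position in a list as equally valid, which is a guess on my part since nothing says the first entry wins.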

But the specs could have mandated a single way to do it, declaring that it's a list of objects, each object with a @vocab property, and that the first one is the document-type. One place to look. But no.

What type of value should you expect?

According to section 4.1

In addition to the global identifier (expressed as an absolute IRI using the id property) and an "object type" (expressed using the type property), all instances of the Object type share a common set of properties

Okay, so Actors & Objects should have a mapping as a value, and that mapping should contain at least id & type attributes. Except they don't have to:

All properties are optional (including the id & type).

Would it hurt to make those required? But even worse, according to Example 4 it looks like those values can be strings instead of maps/objects. They appear to be URIs but it's not documented. So if you get a String here, all bets are off.

Links suffer the same value-type issue

Similar to above, a Link like an image can have a String as the value, an Object as the value, or a list. And the list can be composed of heterogeneous types, strings, objects, and maybe lists? Who knows? It's not clearly defined anywhere. This could have been defined once as a list of objects, and everything would fit. But no.

Hierarchy inversion

An IntransitiveActivity has an Actor but no Object. But the ActivityStreams spec defines an IntransitiveActivity as

specializations of the Activity type that represent intransitive actions.

This means from an object-oriented perspective an IntransitiveActivity is a sub-class of an Activity. However this means that Activity objects have an Object, but their IntransitiveActivity sub-classes don't.

Type of type

Normally an object has a single type value, a String representing the type of the object. Cool. But sometimes instead of a String, the type can be a list of types. Again, are the entries in the list homogeneous? Nope. Composed of strings and objects. Can a list-entry be another list? Who knows. Check out Example 22 to see this abomination in play. And with disjoint types, you can end up with redundant data, the same value in multiple keys. It's not like this is a specification requiring precision or anything.

How big is a Collection?

A Collection and its sub-classes (OrderedCollection, CollectionPage, and OrderedCollectionPage) have both a totalItems property and an items property. But there are no requirements that these be disjoint. This means that you can have both properties and they can conflict. You could have a list of items with 5 elements in it, yet have the totalItems report 3 or 7 elements. Which should be displayed? It's not in the spec.
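A consumer can at least detect the disagreement, even if the spec won't say what to do about it. This is a minimal sketch; the function name and the three labels it returns are mine, and it assumes the items live under either items or orderedItems.

```python
def compare_sizes(collection: dict) -> str:
    """Compare the declared totalItems against the items actually present."""
    items = collection.get("items")
    if items is None:
        items = collection.get("orderedItems", [])
    if not isinstance(items, list):
        # A single item may show up without its enclosing list.
        items = [items]
    declared = collection.get("totalItems")
    if declared is None or declared == len(items):
        return "consistent"
    return "undercount" if declared < len(items) else "overcount"
```

With 5 items present, a totalItems of 3 comes back as "undercount" and a totalItems of 7 as "overcount". What to show the user in either case is still anybody's guess.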

Pagination of an (unordered) Collection

If a collection is unordered, pagination through it makes zero sense. Pagination requires ordering.

Should you expect a String vs. a map?

The spec describes natural language values as a way to detect whether you should expect a String-value or a map-value. So if you use the name attribute, you get a String; and if you use the nameMap attribute, you get a map/object. Not too bad. But why not keep that consistent across all the fields (above) that are sometimes a String and sometimes a map? Like that @context attribute. Why not a @contextMap then? But why start with consistency now?

Which language wins?

If you specify the @context as an object, you can include a @language property to specify the default language of the object. Alternatively, you can specify various languages in certain *Map attributes such as Example 15. However, it also specifies

The special language tag "und" [undefined] can be used within the object form to explicitly identify a value whose language is unknown or undetermined.

So if you've specified a default language using @language, what language should "und" text be rendered as?
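Here's a sketch of the lookup order I'd probably end up writing, which shows where the question bites. Everything about it is my own assumption rather than anything the spec mandates: the function name, the preference order (reader's languages, then "und", then whatever comes first), and the shortcut of only reading @language when @context is a plain map.

```python
def pick_text(obj: dict, prop: str, preferred: list):
    """Return (text, language_tag) for a natural-language property, or None."""
    lang_map = obj.get(prop + "Map")
    if isinstance(lang_map, dict) and lang_map:
        # Try the reader's languages first, then the "unknown" bucket.
        for tag in (*preferred, "und"):
            if tag in lang_map:
                return lang_map[tag], tag
        # Nothing matched: fall back to whichever entry comes first.
        tag = next(iter(lang_map))
        return lang_map[tag], tag
    if prop in obj:
        context = obj.get("@context")
        default = context.get("@language") if isinstance(context, dict) else None
        return obj[prop], default or "und"
    return None
```

Note that an "und" entry comes back tagged "und" even when the @context declares a default @language, because I have no idea whether the default is supposed to apply to it.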

Markup? What flavor is it?

Some values contain markup. Some values don't. The content & summary do; the name doesn't. Which other fields do? Who knows. It's not well documented. What markup flavor? There are hints that markup in the content field is HTML, but is that explicitly required? And if it is, what DocType? HTML5? HTML4.01? XHTML? Why not use Markdown?

As we'll discover later, ActivityPub specifies a source attribute/extension that can have any flavor of source markup contingent on the mediaType attribute. That then gets converted to the markup in the content (or whatever other fields support markup). But that's a level higher than ActivityStreams which we're reading about here.

HTML markup part Ⅱ

Additionally, if the content is HTML, this makes it easy to bypass user-agent filters. If I want to filter out posts containing "emacs" but someone posts em<span>a</span>cs, a simple filter can't find "emacs" in there.
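The workaround is for every filter to strip the markup before matching. Here's a minimal sketch using only Python's standard-library html.parser; the helper names are mine, and it makes no attempt to handle other evasions like zero-width characters or look-alike letters.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes, discarding all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup: str) -> str:
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end
    return "".join(parser.chunks)


def matches_filter(markup: str, word: str) -> bool:
    """Case-insensitive substring match against the tag-free text."""
    return word.casefold() in visible_text(markup).casefold()
```

With this, matches_filter("em<span>a</span>cs", "emacs") catches the post that a naive substring search on the raw content would miss.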

HTML markup part Ⅲ

Furthermore, inline CSS can trigger strange effects if not sanitized. A content value containing something like <span style="position:absolute; left:0; top:0; width:100vw; height:100vh; background-color:black; color:red">Hah!</span> can throw off the rendering of the whole interface.

HTML markup part Ⅳ

How about other markup concerns like <form> input in your posts? Should this be allowed? How should it be sanitized? 🤷

HTML markup part Ⅴ

If we're sticking arbitrary blobs of HTML in fields such as the name what security considerations have been taken into effect? What happens if some <script>nefarious code</script> shows up in a field's value? If improperly sanitized, it can still allow <script> tags through. The spec is notably silent on these issues.
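For fields that are supposed to be plain text, the least a consumer can do is escape the value before dropping it into a page. This is a minimal sketch using the standard library's html.escape; the render_name function and the <h1> wrapper are hypothetical, and it deliberately doesn't attempt the much harder job of sanitizing fields like content where some HTML is meant to survive.

```python
import html


def render_name(obj: dict) -> str:
    """Embed the plain-text name in a page, escaping any markup it carries."""
    name = obj.get("name", "")
    # html.escape turns <, >, &, and both quote characters into entities.
    return "<h1>" + html.escape(name) + "</h1>"
```

A name of <script>alert(1)</script> then shows up on the page as literal text instead of running.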

Security handwaving

In the section on Security Considerations, the spec advises consumers to take care with malicious user-input and when re-emitting ingested content. What sort of care? 🦗 And the exhortation to beware of potential spoofing attacks? What assurances does the spec have for determining the integrity of an ActivityStreams? None. We'll revisit this when we get to ActivityPub where there's some effort here, but it hasn't been standardized.

Privacy handwaving

Similar to the Security issue, the Privacy Considerations section does a lot of handwaving. There's no standardization of users or audience-groups vs. public vs. private postings. The spec talks of "opting in" to disclosure of posts, but doesn't detail how.

Namespaces vs. things in that namespace

A @context can have a map as a value, including attributes that don't begin with an @ mapping to a URI. According to Example 3 we have css defined as a URI and then used directly as "css" in the object. However in Example 3 the same syntax defines gr as a namespace, and it then gets used as gr:category further down, rather than using just gr. When is it a stand-alone attribute, and when is it a namespace-prefix? From the context it looks like something comes after the # in the URI. But it doesn't appear to be explicitly documented from what I can tell.

Compact URI namespaces are a nightmare

While this is a JSON-LD thing, it makes it particularly challenging to re-serialize an Object.

So an ActivityStream consists of maybe an action with an Actor and an optional object, or maybe it's just an object. And the id might be present or it might not, and properties might be strings or objects or lists, and some might have external schema links with corresponding namespaces, or they use properties from those namespaces directly without a namespace prefix, and sometimes values consist of mappings from language-to-(possibly-unknown-)value, where those values might have some sort of undefined markup (that might or might not have unspecified security or privacy concerns), and to find the @context, you have to look in multiple places.

Got it. That's ActivityStreams.

Except we should investigate the ActivityStreams vocabulary before we return to ActivityPub.

ActivityStreams Vocabulary (part 2)

The Good

The vocabulary seems to cover a reasonable range of activities and objects, and the attributes mostly make sense.

We also get a bit of clarification here. All those "it can be a String, or it can be an Object, or it could be a List" confusions in the ActivityStreams spec get clarified here.

Properties marked as being "Functional" can have only one value. Items not marked as "Functional" can have multiple values.

If a "Functional" attribute has a String as the value and the type allows for a Link, it will be a URI, otherwise the String will be the particular value (such as the latitude or radius). If an attribute is not labeled as "Functional", it can be a heterogeneous list of Links, Objects, and Strings (where String values are usually URIs). Mildly annoying to have to deal with all three possible value-types for most attributes. But at least this makes a bit more sense.
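In practice this means every consumer ends up writing some flavor of the same normalizer. Here's a sketch of mine; the function name is made up, and wrapping a bare String as {"id": ...} rests on my assumption that the String is a URI, which only holds for attributes whose type allows a Link or Object.

```python
from typing import Any, Dict, List


def as_object_list(value: Any) -> List[Dict[str, Any]]:
    """Coerce a String, Object, or heterogeneous List into a flat list of dicts."""
    if value is None:
        return []
    if isinstance(value, str):
        # Assumption: a bare string here is a URI naming the thing.
        return [{"id": value}]
    if isinstance(value, dict):
        return [value]
    if isinstance(value, list):
        flattened: List[Dict[str, Any]] = []
        for entry in value:
            flattened.extend(as_object_list(entry))
        return flattened
    raise TypeError(f"unexpected value type: {type(value).__name__}")
```

After this, the rest of the code only ever has to deal with a list of maps, which is what the spec could have required in the first place.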

The Bad

Okay, this is a bit of a disaster.

Alphabetization

There are several large tables here consisting of object-types and properties. But the tables aren't sorted alphabetically. This makes it next to impossible to find things in my print-outs. Sure, they have internal HTML links but that's useless on paper. I get that they're grouped by similar functionality, but that doesn't make it any easier to find things because you have to know the groupings a priori to know where to find them.

Same issues with ActivityStreams

Again we see the hierarchy-inversion of Activity and IntransitiveActivity.

Handling out-of-spec items

The spec declares that a Question may have a oneOf or an anyOf, but not both. However, if a malformed activity provides both, what is the correct response? Reject the activity? Choose one arbitrarily? There were a couple other places where such requirements left open-ended the handling of errors and non-conformance.
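Lacking guidance, an implementation has to pick a policy. This sketch picks "reject", which is purely my choice and not anything the spec asks for; the function name and the "open" label for a Question with neither property are also mine.

```python
def question_mode(question: dict) -> str:
    """Return "oneOf", "anyOf", or "open"; raise on the ambiguous case."""
    has_one = "oneOf" in question
    has_any = "anyOf" in question
    if has_one and has_any:
        # Policy choice (mine, not the spec's): reject rather than guess.
        raise ValueError("Question supplies both oneOf and anyOf")
    if has_one:
        return "oneOf"
    if has_any:
        return "anyOf"
    return "open"
```

Another server could just as defensibly prefer oneOf and ignore anyOf, and then the two servers disagree about what the same poll means.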

Simple carelessness/inconsistency?

Sometimes the Vocabulary spec declares attributes as Object/Link and other times as Link/Object. Does the order matter? Why are they different?

Similarly, why are items explicitly called out to accept a List?

As cited above, if an attribute is not labeled as "Functional" (the items attribute isn't), it can take a List of values. Yet the spec for items explicitly spells out that it can take a List of Object/Link. What makes this special? Why not annotate that every non-Functional attribute can take a List?

Speaking of redundant attributes, how about the url?

The spec defines a url as accepting a Link or an xsd:anyURI. But based on everything I've seen in the spec, any Link can just be a String containing the URI. Why the redundancy?

Codify ambiguity

Reading about the context attribute,

The notion of "context" used is intentionally vague. The intended function is to serve as a means of grouping objects and activities that share a common organizing context or purpose.

If your spec says something is intentionally vague, the spec has issues.

Is HTML allowed in String values?

The content and summary attributes state that the data defaults to HTML markup. However, the name can also contain text, but the spec explicitly disallows HTML. This feels irrationally inconsistent.

Some attributes take a "Map" suffix, others don't

If you have a content, name, or summary attribute, and you want to specify multiple languages, you use the contentMap, nameMap, and summaryMap variants. It feels like this should have been used across the board. Use one attribute-name for a String value, a different name for an Object value, and yet a third different name for a List value.

However, the ActivityStreams vocabulary specification doesn't detail these *Map fields beyond a passing reference to Natural Language Values. The only reference to "Natural Language Values" in the ActivityStreams Vocabulary is the summary/summaryMap and nothing is mentioned regarding name/nameMap or content/contentMap being "Natural Language Values". They only allude to nameMap and contentMap without documenting why/how.

Additionally, can an object have both the single version and the Map version? The spec only shows the exclusive cases, but doesn't say whether an object can have both a name and a nameMap attribute.

Start a new Relationship wit' you

The Relationship attribute accepts multiple (and semi-arbitrary) values, some from the FoaF spec, and some from the Relationship spec (side rant: the vocab.org site does a redirect and requires JavaScript enabled just to view the spec; a spec is a ████ text document). Are those the only values? What happens if they conflict or overlap? It's just kinda handwavey here. Yet more specs to read, I guess.

In a Relationship? Says who?

The documentation for modeling a friend request references four different actors:

the actor offering the friend-request
the target receiving the friend-request
the object.subject (the person seeking friendship)
the object.object (the friend-to-be)

However, nothing in this example requires that the actor & object.subject be the same; nor does anything require that the target & object.object be the same. This means that Alice could send a request to Bob asking if Dave would accept Carol's relationship. Or Mallory could ask Bob if Bob would like to accept a relationship with Alice. This seems fraught with potential concerns that the spec leaves unaddressed.
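If I were receiving these, I'd want a sanity check along the following lines before showing anybody a friend request. It's a sketch of a rule the spec doesn't impose: the function names are mine, and it assumes each party is either a bare String identifier or a map with an id.

```python
def ident(value):
    """Reduce a String-or-Object reference to its identifier, if it has one."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return value.get("id")
    return None


def offer_is_self_consistent(offer: dict) -> bool:
    """True only if the actor is the subject and the target is the object."""
    relationship = offer.get("object")
    if not isinstance(relationship, dict):
        return False
    actor = ident(offer.get("actor"))
    target = ident(offer.get("target"))
    return (
        actor is not None
        and target is not None
        and actor == ident(relationship.get("subject"))
        and target == ident(relationship.get("object"))
    )
```

Alice offering Bob a relationship between Carol and Dave fails this check, as does Mallory offering Bob a relationship with Alice. It says nothing about whether the actor is who they claim to be, which is a separate problem.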

Random acct: prefix

Continuing with the modeling a friend request, all four of those accounts use an acct: prefix on a username. Why? Where is this defined? Not in the ActivityPub, ActivityStreams, ActivityStreams-Vocabulary, or the JSON-LD spec. Digging a bit, it looks like RFC-7565 defines this acct: but none of the ActivityPub-related specs reference this scheme.

Just like, Like and Unlike?

The vocabulary only defines a Like/Dislike activity. A post might elicit a whole range of reactions beyond like/dislike. It might make me laugh, or cry, or angry, or high-five, or any of a number of other emotions/emoji. As swell as it is to Like/Dislike things, it really needs a generic React action.

I reject your answers and substitute my own

When responding to a Question it's entirely possible to provide an inReplyTo with a name (answer to the survey) that doesn't correspond to any of the answers in the original Question. What should happen in this case? Reject the answer/reply? Add the answer/reply to the list of existing answers?
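Detecting the mismatch is the easy part. This sketch assumes the options live under oneOf or anyOf and that answers are matched by exact name, both of which are my reading rather than a stated rule; the function names are mine too.

```python
def option_names(question: dict) -> set:
    """Collect the names of the answers the Question actually offered."""
    options = question.get("oneOf", question.get("anyOf", []))
    if isinstance(options, dict):
        # A lone option may arrive without its enclosing list.
        options = [options]
    return {o["name"] for o in options if isinstance(o, dict) and "name" in o}


def is_known_answer(question: dict, reply: dict) -> bool:
    """True if the reply's name matches one of the offered answers exactly."""
    return reply.get("name") in option_names(question)
```

What to do when is_known_answer returns False is exactly the part the spec leaves to the imagination.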

Survey-respondent privacy?

The result of a Question appears to return each vote along with the ID of everyone who cast it, leaked via the attributedTo attribute. Is this expected? Is this required, even though folks might want their answers kept private?

ActivityPub (part 3)

Okay, now we can finally return to ActivityPub.

Naked Objects/Links

If you submit a naked Object/Link (one that doesn't have an activity associated with it) the server is supposed to automatically convert it into a Create activity.

However, if the message has been cryptographically signed for authentication, changing from a naked Object to an Activity also changes the message's cryptographic signature.

Additionally, having two source fields (the Create.actor and the Create.object.attributedTo) leaves room for a bad actor to create objects with mis-attribution. E.g. Mallory publishes a Create action of Bob picking his nose, and attributes it to Alice, causing Bob to get mad at Alice rather than Mallory. The spec acknowledges this as a possible issue

it should dereference the id both to ensure that it exists and is a valid object, and that it is not misrepresenting the object. (In this example [Example 7], Mallory could be spoofing an object allegedly posted by Alice).

However, the spec doesn't detail how to secure against this. Dereferencing the ID only checks that something exists at that URI. Unless it's exactly the same Object content (which can easily change since we're already munging objects, normalizing things, etc.), we likely can't do an exact-match comparison. Furthermore, because objects can be Links and Links can consist of a URI, that means that the linked object-ID could be some protocol other than HTTP or HTTPS. Does my "dereference the id" code need to support gopher:, ftp:, git:, svn:, imap:, smtp:, irc:, ldap:, smb:, or whatever scheme? Maybe the object refers to other URIs/resources that exist, but aren't properly attributed. Determining if something is misrepresenting the object seems handwavey.
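My inclination is to refuse anything that isn't on a short allow-list before making a request at all. A minimal sketch with the standard library's urllib.parse follows; the HTTPS-only allow-list and the function name are my own choices, not something the spec requires.

```python
from urllib.parse import urlsplit

# Assumption: only HTTPS is worth dereferencing; adjust to taste.
ALLOWED_SCHEMES = {"https"}


def may_dereference(uri: str) -> bool:
    """True if the URI uses an allowed scheme and names a host."""
    parts = urlsplit(uri)
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

This only decides whether to fetch. It does nothing to establish that whatever comes back was really authored by the claimed actor.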

Security/authentication/authorization

Standards, schmandards.

ActivityPub uses authentication for two purposes; first, to authenticate clients to servers, and secondly in federated implementations to authenticate servers to each other. Unfortunately at the time of standardization, there are no strongly agreed upon mechanisms for authentication.

(source) Couldn't have made this part of the initial requirements? Security is best left for an afterthought…

Serve it as what MIME-type now?

The JSON-LD spec says to serve as application/json while the ActivityStreams spec calls for application/activity+json so just to be ornery, the ActivityPub spec mandates application/ld+json;profile="https://www.w3.org/ns/activitystreams" as the MIME-type. Sure, you can accept other types, but why not make the spec consistent to begin with?
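So a server ends up checking for more than one spelling. Here's a sketch that leans on the standard library's email.message.Message to parse the header parameters; the function name is mine, and which types to accept beyond the two ActivityStreams ones is a policy decision I've left out.

```python
from email.message import Message

AS_PROFILE = "https://www.w3.org/ns/activitystreams"


def is_activity_media_type(header: str) -> bool:
    """True for activity+json, or ld+json carrying the ActivityStreams profile."""
    msg = Message()
    msg["Content-Type"] = header
    media_type = msg.get_content_type()  # lower-cased, parameters stripped
    if media_type == "application/activity+json":
        return True
    if media_type == "application/ld+json":
        # The profile parameter may list several space-separated URIs.
        profile = msg.get_param("profile", "")
        return AS_PROFILE in str(profile).split()
    return False
```

Plain application/json and profile-less application/ld+json both come back False here, so whether to be lenient about those is up to the implementer.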

ActivityPub (part 4)

So how do you go about starting the whole process? You use WebFinger (which the ActivityPub, ActivityStreams, ActivityStreams Vocabulary, and JSON-LD specs don't mention). You start by hitting a well-known URL like https://example.com/.well-known/webfinger?resource=acct%3Ausername%40example.com. This should return an initial JSON object whose links point at the document describing the person, including their name, id (URI), Inbox, Outbox, cryptographic public keys, as well as various other properties. And for what it's worth, because we haven't already had enough MIME-types for ActivityPub data from other specs, WebFinger returns as application/jrd+json, yet one more.
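Building that URL is at least straightforward. This sketch uses only urllib.parse from the standard library; the function name is mine, and it assumes HTTPS and a plain username@host account.

```python
from urllib.parse import urlencode, urlunsplit


def webfinger_url(username: str, host: str) -> str:
    """Build the well-known WebFinger lookup URL for an acct: resource."""
    # urlencode percent-encodes the ":" and "@" in the resource value.
    query = urlencode({"resource": f"acct:{username}@{host}"})
    return urlunsplit(("https", host, "/.well-known/webfinger", query, ""))
```

Calling webfinger_url("username", "example.com") yields the same URL as the example above.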

Additionally, because the well-known WebFinger URL must be rooted at /.well-known/webfinger, it prevents the ActivityPub server from being rooted in some subdirectory. So either the ActivityPub server needs to largely control the web-root; or it needs two distinct processes, one listening for the WebFinger request, and another one rooted off a name-spacing sub-resource.

Authentication, authorization, and signing

There's some discussion of Authentication and Authorization but only descriptive aspects in the context of "this is what Mastodon does," not a prescriptive "this is how everybody should do it."

TODO: WIP