First Draft of Atom Cross-posting Extension
I wrote up a first draft of an Atom extension for declaring cross-posting duplicates. It defines both a way for the primary version to declare the duplicates and a way for the duplicates to declare the primary version.
In practice it's very unlikely that the duplicates will declare the primary version, because most of the time they don't know they're duplicates and even if they did they probably wouldn't want to admit to being "just a copy". But I hope that publishers that create cross-post duplicates will see the benefit in declaring these in the feed to improve the usability of the feed when it's consumed into a multi-feed aggregator such as FriendFeed or MT Action Streams.
An Atom extension for declaring cross-post dupes
It's becoming increasingly common for content-publishing applications to include a feature where they'll duplicate (in some sense) the content a user creates on other services such as Twitter or Facebook.
Unfortunately, this has the side-effect that multi-feed aggregators cannot easily detect the duplication and often end up showing the same content more than once.
However, I think we can go some way towards a technical solution to this problem without trying to boil the ocean and stop people cross-posting: have the publisher that's creating the duplicate content declare that it has done so in its feeds.
What does this look like? It feels like this just takes one very simple extension element with the same attributes as the in-reply-to element introduced by Atom Threading Extensions: a ref attribute giving the id of the duplicate entry, and a type,href pair linking to a representation of the duplicate entry. For example:
<crosspost:dupe
ref="http://twitter.com/apparentlymart/statuses/3641424947"
href="http://twitter.com/apparentlymart/statuses/3641424947"
type="text/html"
/>
This alone isn't enough to do the de-duping, since we can't trust publishers not to lie about what's a duplicate. But in an application such as FriendFeed or MT Action Streams, where a user has configured a list of feeds to import, it is easier to assume that all of the referenced feeds are trustworthy in the context of that user: if I've added both my notes blog and Twitter to MT Action Streams and the notes blog declares a Twitter entry from my account as a duplicate, it's fair to assume that it is indeed a duplicate.
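To make the aggregator side concrete, here is a rough sketch in Python of how the de-duping might work once the feeds have been parsed. The Entry shape and the collapse_dupes name are my own inventions for illustration and are not part of the draft; the only assumption carried over from the text is that every feed in the list was added by the same user, so dupe declarations are taken at face value.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One feed entry as an aggregator might hold it (hypothetical shape)."""
    entry_id: str   # the Atom id of this entry
    feed: str       # which of the user's configured feeds it came from
    dupe_refs: list = field(default_factory=list)  # ref values of its crosspost:dupe elements


def collapse_dupes(entries):
    """Drop every entry that another entry in the same user's feeds declares as a dupe.

    All feeds were configured by one user, so declarations are trusted.
    """
    declared = {ref for entry in entries for ref in entry.dupe_refs}
    return [entry for entry in entries if entry.entry_id not in declared]
```

With the notes blog entry declaring the tweet from the example above as a dupe, only the notes blog entry survives; tweets that nobody declared are left alone.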
This is not a complete solution, since it is possible that I've cross-posted to both Twitter and Facebook and you consume those two feeds but not the "origin" feed; however, I think this is a step in the right direction and solves the immediate problem at hand. It would be nice if the services that tend to receive these duplicates would extend their APIs such that publishers can declare that they're posting a dupe and so the receiving service can create a reverse-dupe element, but that's not something we can bootstrap so easily today.
I'm interested to see if any providers who offer the functionality to duplicate their content on Twitter and/or Facebook would be willing to work on this. It ought to be a reasonably easy, tightly-scoped specification and should not be a burden for implementers as long as they know how to form an Atom id (or RSS equivalent) for the services they publish to based on a service-local id returned from the API.
A Protocol for Batch HTTP Requests
There have been murmurings of discussion about the possibility of doing batch HTTP requests for some time now. Some folks maintain that it isn't necessary for one reason or another, while several popular web services provide batch request mechanisms that are tailored to their specific API but are not generally applicable.
A while back I wrote up a draft spec for a general-purpose batch request protocol, loosely based on James Snell's proposals. I've discussed this with folks at various conferences and people seemed generally receptive to the idea. I notice that since then he's posted HTTP Multipart Batched Request Format, which is also based on his initial thoughts but he went in a different direction to me. I regret that I didn't post about my draft sooner so that we could have potentially worked together on this.
My approach is to mimic the behavior of an HTTP proxy server. Although in practice I expect many implementations won't be acting as proxies in the traditional sense, it was my expectation that this would make it easier to adapt existing implementations that already know how to deal with proxies.
Simon Wistow wrote a Perl implementation of this spec (now living on my Github) and there's a Python library implementation based on httplib2 as well as a standalone proxy server written in Python using Twisted done by some of my Python-speaking colleagues, both of which will hopefully be released soon.
As with many of these things, the main win in having a standard here is that it should cease to be necessary to write a separate implementation for each new web service. Of course, until more than a couple folks adopt a standard it's just another proprietary request format, so I hope to have some discussion about this to figure out where we can meet in the middle of the various proposals and come out with something that multiple services would be happy to support.
Streaming JSON Parser and Generator
My contribution to the ongoing trend of reinventing the entire XML toolchain for JSON is a pair of Perl libraries which allow JSON to be produced and consumed in a streaming manner, rather than requiring the data to be represented as a complete, in-memory data structure.
For many applications a traditional one-shot JSON library is more than sufficient, but a streaming JSON parser might be useful if you already know what data structure you're expecting, because you can skip over or reject parts of the data that do not conform to the expected structure without them ever manifesting as real objects in your program.
The JSON generator has more limited utility but might be useful for serializing large data structures without them needing to exist in memory in their entirety: you can load data in stages, producing the relevant output and then freeing the memory before moving on to the next part.
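As a rough illustration of the generating side, here is a minimal sketch in Python rather than Perl. It is not the API of JSON::Streaming::Writer; the stream_json_array name is made up for this example, and it only handles the simple case of a top-level array whose members are each small enough to serialize in one go.

```python
import json


def stream_json_array(items):
    """Yield the text of a JSON array piece by piece.

    `items` can be any iterable, including a generator that loads each
    record lazily, so only one record needs to be in memory at a time.
    """
    yield "["
    first = True
    for item in items:
        if not first:
            yield ","
        first = False
        yield json.dumps(item)  # serialize just this one record
    yield "]"
```

In real use each chunk would be written straight to a file or socket instead of being collected; joining the chunks back together gives an ordinary JSON document that any one-shot parser can read.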
Much as with a streaming XML library, the programming model is more awkward than with a library that loads everything into memory, but as JSON becomes the web's de-facto data serialization I think having the ability to stream it will be important for more and more applications.
I'd love to see others implementing similar functionality for other languages, hopefully with a similar API. When it comes to Perl, you can download JSON::Streaming::Reader and JSON::Streaming::Writer from CPAN today. The latest versions, 0.03 and 0.02 respectively, are winging their way through the CPAN indexer as I write this.
JSON for Standard APIs
The Web 2.0 world seems to hate XML. And who can blame them, when JSON maps much more neatly onto the data structures they're used to dealing with in their everyday programming languages?
It is a shame, however, that this is leading to the return of the previous generation's big API design anti-pattern: public APIs that are just straight dumps of internal data structures. This means that everyone's API is different, and you need to rewrite everything from scratch for each provider. Worse still, representations and protocols and access patterns are all inextricably tied together, making it hard to reuse good ideas from one provider's API in another.
In a world that's rejected XML, it's getting pretty hard to push AtomPub as a standard API for submitting and managing web content. Some providers are paying lip service to it by supporting Atom as a representation format, but AtomPub is not really about Atom... it's about using HTTP principles as your API model, which of course is what we today call the REST architectural style.
I personally think AtomPub has a pretty nice API model. It defines two simple concepts: a "collection" and an "item". It also defines a few different operations you can perform: you can add a new item to a collection, you can retrieve items from a collection, you can retrieve individual items and you can update and delete individual items. If you ignore the Atom Syndication Format part the basic model is a nice basis on which you can build almost any higher-level API, but it's particularly suitable for creating web content.
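Stripped of both HTTP and Atom, that model is small enough to sketch in a few lines of Python. This is only an in-memory stand-in to show the shape of the operations; the Collection class and its method names are invented for this example, and the comments note the HTTP method AtomPub uses for each operation.

```python
import itertools


class Collection:
    """In-memory stand-in for an AtomPub-style collection (illustrative only)."""

    def __init__(self):
        self._items = {}
        self._ids = itertools.count(1)

    def add(self, item):
        """Add a new item (POST to the collection) and return its id."""
        item_id = str(next(self._ids))
        self._items[item_id] = item
        return item_id

    def items(self):
        """Retrieve the items in the collection (GET on the collection)."""
        return list(self._items.items())

    def get(self, item_id):
        """Retrieve an individual item (GET on the item)."""
        return self._items[item_id]

    def update(self, item_id, item):
        """Replace an existing item (PUT on the item)."""
        if item_id not in self._items:
            raise KeyError(item_id)
        self._items[item_id] = item

    def delete(self, item_id):
        """Remove an item (DELETE on the item)."""
        del self._items[item_id]
```

Nothing here cares what an item actually is, which is the point: the representation can be swapped without touching the operations.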
The basic Atom Syndication Format as defined by RFC 4287 has a pretty strong slant towards weblog entries, which is no surprise since it was created by a bunch of bloggers. However, part of the Activity Streams effort has been to find ways to expand Atom to make it possible to model other common social web objects such as people, events, photos, etc via Atom. If you apply the activity streams extensions for representing such objects to the AtomPub model you suddenly have a general API for submitting, editing and deleting anything that's catered for by the activity schema specification. But we're back in XML now, right? How can we get from here to JSON?
There have been a number of efforts to define mappings between Atom and JSON, many of them trying to achieve lossless round-tripping. I'm of the opinion that the XML and JSON models are so radically different that you're never going to achieve a lossless transformation of arbitrary XML (which is what Atom is once you augment it with extensions) to JSON. If it were that easy, then we'd already be doing this to go from XML to our language's intrinsic data structures and we wouldn't need JSON!
But if we assume for the moment (and this is certainly not true today, but is a goal) that Atom and AtomActivity (along with its dependencies) together provide a rich enough vocabulary to describe 90% of social objects users deal with on the social web, can we define a JSON vocabulary that refactors only these features into a non-extensible-but-good-enough standard JSON schema? I think we can!
So today I'd like to unveil my first draft of The JSON Syndication Format. While far from complete, it currently describes a basic data model inspired by the basic Atom spec and some bits and bobs from elsewhere. Ultimately I think this format should combine the most important features of the following specifications:
- The Atom Syndication Format
- Atom Threading Extensions
- Atom Feed Paging and Archiving
- Atom Media Extensions
- Atom Activity Extensions
If you read the spec as it stands today you'll note a few interesting things that are not direct mappings from Atom. Firstly, the various different ways of representing content and other text properties in Atom are not supported: you'll use HTML and you'll like it. Secondly, I'm intending to use the PortableContacts schema to describe people rather than a simple mapping of the Atom Person construct.
Since the basic model of AtomPub is agnostic to representation format, it's easy to substitute out application/atom+xml for the two new media types described in this specification. While I doubt Atom and RSS are going to go anywhere for syndication feeds anytime soon, it's clear that Atom (and other XML formats) are on the way out for APIs, so I hope that something like JSON-Syn over AtomPub can be useful as a general social web CRUD protocol that does not require each provider to invent its own schema.
It's unlikely that I'll work much more on JSON-Syn other than this initial prototype until the Activity Streams specs are further along, because the research there will give a better idea of what metadata properties are useful to include in JSON-Syn. The lack of an extensibility mechanism is of course quite limiting, but there's nothing to stop someone who's doing something unusual from inventing a new JSON-based object model, publishing that in an open specification and applying AtomPub principles to that too.
This is just a proof-of-concept effort for now, and I'm sure there are flaws in my approach, but my hope is that we can start moving again towards provider-agnostic APIs despite having refocused efforts on a new serialization format.
Muni Releases Data Files
Muni (aka SFMTA) has released GTFS data files describing its routes to the public. This is the data format that is used to feed data into Google Transit. It's under a wordy licence, though.
Given that SFMTA is a government-run thing (at least, that's what the "municipal" in its name suggests), surely any data it creates should be freely available? It seems strange that they should restrict the usage of it.
Still, it's nice to see SFMTA join the ranks of transit agencies who are making their data available under various non-payola licences, and also nice to have a way to get this data without screen-scraping NextBus.
I Love Blogging
The official TypePad blog asks why I love blogging. That is a good question, I think.
Way back in the mid-nineties, when I was still new to this Internet thing, I started a personal website. It was ugly (blue on cream!), and there was little content to speak of, but it was mine. After a while I started to write about what was going on in my life and post it on my website. At the time I didn't really know what to call what I was doing; the section of my website was called "Inside My Head", and I referred to it as my "online diary" at times.
After a while manually copying and hacking HTML files got a bit tiresome, so I busted out some PHP scripts and learned some SQL and before long I had my own blogging software! Of course, the word "blogging" didn't exist yet, and once again I didn't really know what to call it.
Now that I was able to post stuff to my website through a web form, it occurred to me that I could give access to my friends to post as well. After a bit more hacking, some of my friends (the more nerdy ones, naturally!) were posting their content on the site too. "Inside My Head" became the name of my section, and my friends each had their own section on what I called "Grey Matter". Suddenly what was once a one-way publishing tool became a way for us to share cool stuff we found online, to share reviews of movies, post pictures and other stuff we'd made, and various other things.
I didn't realise it at the time, but in parallel to all this my homeboy Brad Fitzpatrick was doing much the same thing. Brad of course went one step further and opened his site up for anyone to join, and this of course became the LiveJournal we know today. A friend at the time pointed out LiveJournal to me, and the temptation to connect with an even wider group of people led to me signing up for my LiveJournal account.
Glossing over the next few years, I started using LiveJournal more and more, my Grey Matter site languished and eventually vanished, and I eventually ended up submitting patches to LiveJournal once Brad released the source back in 2001; LJ ended up being a large part of what I did online in my spare time for several years.
These days there are plenty of ways to express yourself online, what with dedicated photo hosting sites, microblogging, full-on social networks and niche sites for just about every niche you can think of. Blogging still holds my interest despite all that, and here's why: whether it be my friends writing about everyday life or someone I barely know writing about a subject that interests me, weblog entries are generally long enough to be interesting while still remaining personal. You know that there's a human somewhere writing this stuff: it's full of opinion, hearsay, emotion and personality that you just don't get in any other medium.
I love blogging so much that a blogging service is now what I do for a living, and I moved half-way around the world to do it. Twitter and Facebook have in many ways taken the minutiae out of blogging, which has only served to focus bloggers and blogging platforms on what they do best. I hope people will still be blogging in some sense 50 years from now when I'm old and grey. Blogging will certainly evolve over the years, but the essence of self-publishing and self-expression ought to be timeless.
Here's to another half-century of blogging!
Apparently.me.uk Moves to TypePad
Until now this blog was (more or less) being served from LiveJournal. It's now being served by some crazy hybrid of LiveJournal and TypePad, since TypePad lacks a good way to import all of my old content and comments.
Entries that were posted before February 2009 can be accessed via the monthly archives.