August 18, 2006

[GEEKY] About the subjectivity of “correct”

Filed under: Geeky — Pito Salas on 10:59 am

This is a nice little dissertation on one of the micro-technical challenges in building an aggregator. It has to do with the detection of duplicates in an RSS feed.

The article, called “RSS Duplicate Detection”, goes through the challenges of ‘correctly’ detecting whether two posts are identical; its author concludes that there is no provably “correct” behavior.

By the way, if you wonder why there would be duplicates in a feed, here’s a very common scenario.

The aggregator polls the RSS feed once, displays the resulting posts, and stores them in some kind of archive, so that they can be displayed again without another poll (and so that posts can be collected over time).

An hour or several later, the aggregator polls again, and lo and behold the feed has been updated, so it is fetched again. But in fact only 1 new item was posted in the interim. So the feed contains that new item, followed by a repeat of the same items fetched the first time.

The question: how is the aggregator to figure out that that single post is “new”? Somehow it will compare it with the ones already there, asking the question: “is this post a duplicate of one already there?”

Voila. If the aggregator answers that question ‘incorrectly’ then one of two things happens: if it thinks it’s new but it wasn’t, then the user sees a duplicate, and writes some hate mail to me. If it thinks it’s not new but it was actually new, then the user sees that a post is ‘missing’ and writes some hate mail to me.

So knowing when two posts are ‘the same’ is critical. Unfortunately there is a right answer, and it is: “two posts are the same if the user thinks they are the same, and vice versa.”
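
To make the polling scenario above concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular aggregator does it: the post fields (`guid`, `link`, `title`, `description`), the helper names `post_key` and `merge_poll`, and the choice of "guid, else link, else a hash of title plus description" as the identity rule are all assumptions made for the example.

```python
import hashlib


def post_key(post):
    """Return a (hypothetical) identity key for a post stored as a dict."""
    # Assumed rule: trust the publisher's guid, then the link,
    # and fall back to a hash of title + description.
    if post.get("guid"):
        return ("guid", post["guid"])
    if post.get("link"):
        return ("link", post["link"])
    text = post.get("title", "") + "\n" + post.get("description", "")
    return ("hash", hashlib.sha1(text.encode("utf-8")).hexdigest())


def merge_poll(archive, fetched):
    """Add unseen posts from `fetched` to `archive`; return only the new ones."""
    new_posts = []
    for post in fetched:
        key = post_key(post)
        if key not in archive:
            archive[key] = post
            new_posts.append(post)
    return new_posts


archive = {}
first_poll = [{"guid": "a", "title": "One"}, {"guid": "b", "title": "Two"}]
# An hour later: one new item, followed by the same items as before.
second_poll = [{"guid": "c", "title": "Three"}] + first_poll

merge_poll(archive, first_poll)
new = merge_poll(archive, second_poll)
print([p["title"] for p in new])  # ['Three']
```

Note where this breaks: a post with no guid and no link whose title gets edited produces a new hash and shows up as a duplicate, while a publisher that reuses a guid for different content makes a genuinely new post go ‘missing’. Any fixed rule for the key trades one kind of hate mail for the other.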

6 Comments »

  1. I dunno. My only real complaint with BlogBridge is how it handles (or doesn’t) updates to a feed. Even with an Atom feed with strict IDs BlogBridge sometimes shows me the same thing twice if something (I haven’t worked out what) else in the feed changed.

    Also I would love it if it correctly handled article updates. If the publisher sends something with the same ID but a more recent update timestamp PLEASE show that. (It would be cool if there were a way to optionally show the older versions, but that is less important to me).

    -ben

    Comment by Ben Bennett — August 18, 2006 @ 1:58 pm

  2. Glad to see some thought going into a feature suggestion I asked for :).
    I think the linked blog raises good points to consider for duplicate identification. As you said, the best way to call something a duplicate is if the “user thinks it is so”.

    So let the user think! What I mean is to build an algorithm (actually a simple if..then to start with) which takes input from the user (I mean, make it configurable in Options) on how the user would like BB to treat duplicates. Put all the points the linked blog lists in there. So some users might say: OK, if GUIDs match then consider it a duplicate, else I want to see it. Someone else might say GUID first, then Title + Description; someone else might say more. The point is that all these criteria are already precoded by the BB team. It is then only the user’s choice which of them he wants to be included. Sounds good?

    Then going forward you could implement a more auto-learning algorithm, or one based on the way email clients detect duplicates/spam.

    Hope I made sense :)

    Comment by BlogBridge User — August 19, 2006 @ 2:23 am

  3. Hi Pito, I thought that’s the reason why the word ‘unique’ (like in URL or URI) was created.

    As long as the permalink for a post is different then also the word unique is forbidden :-)
    Regarding e.g. FeedBurner, where a redirection URL is used … I can live with that. The same is true for Flickr photos, which get a pool URL after they are added to a photo pool. Del.icio.us entries must be handled differently because RSS feeds from bookmark services link to the original content. I see it’s getting more complicated the longer I comment on that subject :-)
    If you have different content for the same URL it would be a great feature to keep older (read ‘other’) versions in the BlogBridge database. This is very helpful when doing journalistic research, where proof of evidence can become critical.

    That said, a pinned article should be frozen in the database. A new version must be a new BB database entry. An export of pinned articles would also be something nice to have for personal archive reasons.

    Comment by Markus Merz — August 20, 2006 @ 1:22 pm

  4. RFE: Alarm sound for new articles in certain feeds

    This request also relates to determining duplicates, in a way :-)
    I want an audio alarm to go off when a new article arrives in a feed. This would be very handy (in my case) for smart feeds or what I call temporary feeds, which I often add to follow the development of current news.

    Yes, I am still using W2K and such an integrated ‘ping user for attention’ configurable per feed would be nice. Configurable sound file per feed would be the cream on top.

    Comment by Markus Merz — August 21, 2006 @ 3:34 am

  5. To Markus, regarding the idea of showing ‘older’ versions of a post too, allowing ‘journalistic research’:

    Yeah we had thought about a cute “diff” display showing how the author changed his mind about the tone of his post but decided that we would get flamed for such a feature. (Always a big risk in this part of the web - getting flamed :)

    Comment by Pito Salas — August 21, 2006 @ 5:40 pm

  6. The solution used by FeedDemon to address this is to provide a per-feed “show updated articles as new” option, which uses a stricter algorithm for detecting duplicates in that feed. From memory, the default for this setting, which is off, updates the existing local copy of an article when a new article appears in the feed with the same URL or ID and the old version is not still present in the feed.

    Comment by Kevin Yank — August 21, 2006 @ 11:58 pm
