Taking newspapers beyond tonight’s fishwrap

The newspaper business works much like an old-style manufacturing business: stories go from reporters to assigning editors to copy editors to layout editors, with the next day’s newspaper as the final destination. A lot of thought and knowledge goes into creating the newspaper, but most of it ends up getting thrown out, just like the daily paper.

Take a hypothetical sentence:

“Howard Stern is scheduled to appear in court Friday in Arlington.”

This sentence contains three of the five Ws they teach in journalism school: the who, when and where. When this appears in a newspaper the next day, it makes a lot of sense. The reader knows from context what the italicized words (Howard Stern, Friday and Arlington) mean.

Copy editors ensure that the meaning of an article is accurately conveyed and that the language is precise. They work to make sure that “who” and “whom” are used correctly, that tenses agree and that punctuation is correct.

But the effort is focused on the next day’s paper. If you were to find the sentence in the example above using a search engine later, all three of the Ws are muddled. Which Howard Stern? What date does the Friday refer to? Which Arlington? Without the context, most of the value is lost. It doesn’t matter if who and whom are used correctly — or that every comma is perfect — if I can’t find the story I’m looking for.

Searchers resort to a variety of strategies to home in on what they’re looking for, such as adding extra terms: “Howard Stern and Anna Nicole Smith” or “Howard Stern shock jock”. If you’re using a news search, you might have the option to filter by publication date.

Companies like Topix.net (owned by newspaper companies) and Relegence (owned by AOL) come in after the fact and try to infer the context using a technology called entity extraction. They algorithmically try to determine which Howard Stern the story refers to. For example, if Howard Stern and Anna Nicole Smith appear together frequently, the story is probably about the lawyer; if the terms Howard Stern and Sirius appear frequently, the story is probably about the shock jock. But this isn’t foolproof: the shock jock and the former stripper probably appeared together in stories, too.
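To make the co-occurrence idea concrete, here is a minimal sketch in Python. It is not how Topix.net or Relegence actually work; the cue words, the function name and the example sentences are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical cue words for each candidate. A real entity-extraction
# system would learn associations like these from a large body of stories.
CANDIDATE_CUES = {
    "Howard Stern (shock jock)": {"sirius", "radio", "show"},
    "Howard Stern (lawyer)": {"anna", "nicole", "smith", "estate"},
}


def guess_entity(story_text):
    """Return the best-scoring candidate, or None when there is no clear signal."""
    words = Counter(re.findall(r"[a-z]+", story_text.lower()))
    scores = {
        candidate: sum(words[cue] for cue in cues)
        for candidate, cues in CANDIDATE_CUES.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score == 0 or best_score == runner_up_score:
        return None  # no cues, or a tie: any guess would be arbitrary
    return best


# Invented example sentences.
clear_case = "Howard Stern signed a new deal with Sirius for his radio show."
mixed_case = "Howard Stern interviewed Anna Nicole Smith on his Sirius radio show."
print(guess_entity(clear_case))
print(guess_entity(mixed_case))
```

The second sentence is the failure case: both sets of cues show up, so the sketch declines to guess rather than pick one arbitrarily.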

Location can be guessed at heuristically. If the story appeared in The Washington Post, it’s probably a reference to Arlington, Virginia. If the story appeared in the Dallas Morning News, it’s probably a reference to Arlington, Texas.
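A rough sketch of that publication-based heuristic, in Python. The lookup tables and the function name are hypothetical; a real system would need a full gazetteer and a much better notion of each paper's coverage area.

```python
# Hypothetical lookup tables, for illustration only.
NEARBY_STATES = {
    "The Washington Post": ["DC", "VA", "MD"],
    "Dallas Morning News": ["TX"],
}
KNOWN_PLACES = {
    "Arlington": {"VA": "Arlington, VA, US", "TX": "Arlington, TX, US"},
}


def guess_place(place_name, publication):
    """Pick the candidate place in a state the publication covers, else None."""
    candidates = KNOWN_PLACES.get(place_name, {})
    for state in NEARBY_STATES.get(publication, []):
        if state in candidates:
            return candidates[state]
    return None  # unknown paper or place: the heuristic has nothing to go on


print(guess_place("Arlington", "The Washington Post"))
print(guess_place("Arlington", "Dallas Morning News"))
```

A wire story printed in both papers would get two different answers, which is exactly why this remains a guess.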

Again, this is all guesswork. If the original sentence had the italicized items tagged, this wouldn’t be necessary:

“Howard Stern [shock jock] is scheduled to appear in court Friday [March 9, 2007] in Arlington [Arlington, VA, US].”

It doesn’t have to change what appears in the paper; the additional information just needs to be in the story.
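One way to picture the tagged sentence as data is sketched below in Python. The `Tag` structure and the `resolve_weekday` helper are hypothetical, and the sketch assumes the story was written on Thursday, March 8, 2007, so that “Friday” resolves to the date in the example above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


@dataclass
class Tag:
    printed: str  # the words the reader sees in the paper
    kind: str     # "who", "when" or "where"
    meaning: str  # the unambiguous value stored with the story


def resolve_weekday(weekday_name, written_on):
    """Return the first date after written_on that falls on weekday_name."""
    target = WEEKDAYS.index(weekday_name.lower())
    days_ahead = (target - written_on.weekday()) % 7 or 7
    return written_on + timedelta(days=days_ahead)


# Assumption: the story was written on Thursday, March 8, 2007.
court_date = resolve_weekday("Friday", date(2007, 3, 8))

# The printed sentence is unchanged; the tags travel alongside it.
sentence = "Howard Stern is scheduled to appear in court Friday in Arlington."
tags = [
    Tag("Howard Stern", "who", "shock jock"),
    Tag("Friday", "when", court_date.isoformat()),
    Tag("Arlington", "where", "Arlington, VA, US"),
]

for tag in tags:
    print(f"{tag.kind}: {tag.printed} -> {tag.meaning}")
```

The person writing or editing the story already knows all three answers, so capturing them at that point costs far less than guessing them later.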

Copy editors already tag content in stories, but it’s mostly about presentation. For example, bylines, credit lines and subheadings are all tagged so that they appear in the right font and size when printed.

Another area that newspapers spend a lot of time on is deciding what’s important and what’s not. At a major newspaper, this takes more than 40 person-hours a day from the most senior editors. The result is conveyed in the newspaper by what page a story appears on, its position on the page and the size of the headline. Again, most of this information is lost by the time the story goes online.

If you do a search for “Iraq war” at washingtonpost.com, you get more than 1,700 stories in the past 60 days. Good luck finding the important ones. (The Post, like some newspapers, does show the page on which the story appeared. But those stories are mixed in with all the others.)

If the who, when and where of stories and pictures were adequately captured, the data could be used to automatically generate much more compelling user experiences and improve the reporting process:

  • More precise search results. When a user searches for an ambiguous term like “Howard Stern”, the search can guide him to the correct stories.
  • A timeline on any topic that highlights the most important stories on the topic. Newspapers periodically do timelines by having editors and reporters manually sift through archives. This can be automated and expanded to cover any topic for which there is data. If I want a timeline on amusement park accidents, I can get one. These automatically generated timelines can also save a lot of time when editors decide to print a timeline.
  • A crime map that shows activity in a given neighborhood for any time period. This could also be used by reporters and editors to see patterns that might get lost in day-to-day reporting.
  • A photo explorer that lets readers see what has happened in an area. Imagine Yahoo’s World Explorer, with newspaper photography on the map. You could also use it to time travel a neighborhood.

These are just a few examples of the types of experiences you can create that extend the value of all the work that goes into each day’s paper.
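As a sketch of how little code the timeline idea needs once the data exists, here is a hypothetical Python example. The `Story` fields, the use of page “A1” as the signal for importance and the archive entries are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Story:
    headline: str
    published: date
    page: str          # e.g. "A1"; front-page placement signals importance
    topics: frozenset  # topic tags assigned in the newsroom


def timeline(stories, topic, front_page_only=True):
    """Return stories on a topic in date order, optionally front-page only."""
    matches = [
        s for s in stories
        if topic in s.topics and (not front_page_only or s.page == "A1")
    ]
    return sorted(matches, key=lambda s: s.published)


# Made-up archive entries, for illustration only.
accidents = frozenset({"amusement park accidents"})
archive = [
    Story("Example story C", date(2007, 3, 9), "A1", accidents),
    Story("Example story A", date(2006, 7, 4), "A1", accidents),
    Story("Example story B", date(2006, 9, 1), "B7", accidents),
    Story("Example story D", date(2007, 1, 2), "A1", frozenset({"city budget"})),
]

for story in timeline(archive, "amusement park accidents"):
    print(story.published.isoformat(), story.headline)
```

The editors’ judgment about placement does the ranking here; the code only has to keep it instead of throwing it away.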

Of all the companies in the media business, newspapers have the strongest assets for capturing knowledge about current events. Original content of the type, quality and volume they create is incredibly expensive to produce. They just need to decide to move from the fishwrap business to the knowledge business.

Background: I majored in journalism and worked in the newspaper business (including as a copy editor). I’ve worked at startribune.com and washingtonpost.com. You can read more of my prescriptions for the newspaper business.


About Rakesh Agrawal

Rakesh Agrawal is Senior Director of product at Amazon (Audible). Previously, he launched local and mobile products for Microsoft and AOL. He tweets at @rakeshlobster.

9 Responses to Taking newspapers beyond tonight’s fishwrap

  2. very smart thinking on the semweb / tagging role news organizations could play. sort of the next level of SEO.

    i agree that old news is only useful if put in context. if i understood correctly, some newspapers already did this for their expensive archives. why hide this valuable metadata behind the paywall? it makes you more relevant, after all.

  3. Your post brings up some interesting issues – some of the temporal aspects of context were researched by Douglas Koen at MIT’s Media Lab. The results were published in the IBM Systems Journal article entitled “Time Frames: Temporal augmentation of the news” – http://www.research.ibm.com/journal/sj/393/part1/koen.pdf

    Lots of good work done as part of the News in the Future consortium in the mid-90s.

  4. Krista says:

    Rocky – thanks for your note. Yes — this is what Gerry and the ClearForest gang have been working on.

    This is the sort of metatagging we are hoping to facilitate and automate, free of charge, with the Calais Web Service.

    We just released version 1.0 of the Calais API, which is available to commercial and non-commercial developers alike at OpenCalais.com
