Paragraph Level Search Results on WordPress Using Digress.it and Yahoo Pipes

One of the many RSS-related feature requests I put in when we were working on the JISCPress project was the ability to get a page-level RSS feed out in which each paragraph is represented as a separate item.

WordPress already delivers a single-item RSS feed for each page containing just the substantive content of the page (i.e. the content without the header, footer and sidebar fluff), which means you can do things like this, but what I wanted was for the paragraphs on each page to be atomised as separate feed items.

Eddie implemented support for this, but I didn’t do anything with it at the time, so here’s an example of just why I thought it might be handy – paragraph level search.

At the moment, searching a document on WriteToReply returns page level results – that is, you get a list of search results detailing the pages on which the search term(s) appear. As you might expect with WordPress, we can get access to these results as a feed by shoving feed in the URI, like this:
http://ouseful.wordpress.com/feed?s=test
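As a minimal sketch of building that sort of search feed URL programmatically (the helper name search_feed_url is my own invention, and I'm assuming the site serves its search feed at /feed?s=... as in the example above):

```python
from urllib.parse import urlencode


def search_feed_url(site_url: str, query: str) -> str:
    """Build the RSS feed URL for a WordPress search (hypothetical helper).

    Assumes the feed lives at <site>/feed?s=<query>, as in the example URL.
    """
    # urlencode takes care of escaping spaces and other awkward characters.
    return f"{site_url.rstrip('/')}/feed?{urlencode({'s': query})}"


print(search_feed_url("http://ouseful.wordpress.com", "test"))
```

For a document hosted in a subdirectory, you would presumably pass the document's base URL as site_url instead.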

Paragraph level feeds, as implemented in the Digress.it WordPress theme we were developing, are keyed by URLs of the form:
http://writetoreply.org/legaldeposit/feed/paragraphlevel/annex-c-online-content-to-be-published/#56

That is:

http://writetoreply.org/DOCNAME/feed/paragraphlevel/PAGENAME/#PARA_NUMBER

So can you guess what I’m gonna do yet…?

First of all, grab the search feed for a particular query on a particular document into a Yahoo Pipe:

Rewrite the URI of each page linked to in the results feed as the full fat, itemised paragraph feed for the page, and emit those items (that is, replace each original search results item with the set of paragraph items from that page).
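Outside of Yahoo Pipes, the URI rewrite might look something like the following sketch. It assumes that page URLs have the form http://writetoreply.org/DOCNAME/PAGENAME/ (my guess from the feed URL pattern above, not something I've checked against every document), and paragraph_feed_url is a made-up helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def paragraph_feed_url(page_url: str) -> str:
    """Rewrite a page URL as its paragraph-level feed URL (hypothetical helper).

    Assumption: the page URL path is exactly /DOCNAME/PAGENAME/.
    """
    parts = urlsplit(page_url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) != 2:
        raise ValueError(f"Expected /DOCNAME/PAGENAME/ in {page_url!r}")
    docname, pagename = segments
    path = f"/{docname}/feed/paragraphlevel/{pagename}/"
    # Drop any query string or fragment from the original page link.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(paragraph_feed_url(
    "http://writetoreply.org/legaldeposit/annex-c-online-content-to-be-published/"
))
```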

The next step is to filter those paragraph feed items for just the paragraphs that contain the original search terms:
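The filter step can be sketched in a few lines. This is not what the pipe does internally; it assumes each feed item has been parsed into a dict with a "description" field holding the paragraph text, and that a paragraph should match only if it contains every search term (case-insensitive substring matching):

```python
def matches_all_terms(text: str, query: str) -> bool:
    """True if every whitespace-separated term in query appears in text."""
    haystack = text.lower()
    return all(term in haystack for term in query.lower().split())


def filter_paragraphs(items: list, query: str) -> list:
    """Keep only the paragraph items whose text contains the search terms.

    Assumption: each item is a dict with the paragraph text under "description".
    """
    return [
        item for item in items
        if matches_all_terms(item.get("description", ""), query)
    ]


paragraphs = [
    {"description": "Online content must be deposited."},
    {"description": "Nothing relevant here."},
]
print(filter_paragraphs(paragraphs, "online deposited"))
```

Note that an empty query matches everything here, which may or may not be what you want.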

We need to rewrite the link because (at the time of writing) the page paragraphs feed doesn’t link to each paragraph, it links to the parent page (a bug report has been made;-)
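One way of doing that sort of link rewrite is sketched below. It rests on a big assumption of mine: that paragraph numbers start at 1 and follow the order of the items in the page's paragraph feed. The function name add_paragraph_anchors and the dict-based item format are also just for illustration:

```python
def add_paragraph_anchors(items: list, page_url: str) -> list:
    """Return copies of the items with links pointing at individual paragraphs.

    Assumption: the Nth item in the feed is paragraph N of the page, so the
    paragraph link is the page URL with #N appended.
    """
    base = page_url.split("#", 1)[0]  # discard any existing fragment
    return [
        {**item, "link": f"{base}#{number}"}
        for number, item in enumerate(items, start=1)
    ]


page = "http://writetoreply.org/legaldeposit/annex-c-online-content-to-be-published/"
feed_items = [{"title": "First", "link": page}, {"title": "Second", "link": page}]
for item in add_paragraph_anchors(feed_items, page):
    print(item["link"])
```

Because the numbering depends on item position, this would need to run on the complete paragraph feed for a page, before any filtering throws paragraphs away.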

You can find the pipe here: Double dip JISCPress search

Note that at the time of writing, there’s also a problem with the paragraph number reported in the link (again a report has been made), a workaround patch for which is included in this pipe.

What this means is that we now have a workaround for indexing into individual paragraphs using a search term. If we tag content at the paragraph level (e.g. by running a page-level paragraph feed, or double dip search results feed, through OpenCalais), we can generate related search links into the document, or other documents on the platform, at a paragraph level, increasing the relevance, or resolution (in terms of increased focus), of the returned results.

Just by the by, the approach shown above is based on a search, expand and filter pattern (cf. a search within results pattern) in which a search query is used to obtain an initial set of results which are then expanded to give higher resolution detail over the content, and then filtered using the original search query to deliver the final results. If a patent for this doesn’t already exist, then if I worked for Google, Yahoo, etc. you could imagine it being patented. B*****ds.
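Stripped of the WordPress and Pipes specifics, the search, expand and filter pattern might be sketched like this. The three callables are placeholders you would supply yourself, and the in-memory PAGES data is made up purely so the example runs:

```python
from typing import Callable, Iterable, List, TypeVar

Result = TypeVar("Result")
Detail = TypeVar("Detail")


def search_expand_filter(
    query: str,
    search: Callable[[str], Iterable[Result]],
    expand: Callable[[Result], Iterable[Detail]],
    matches: Callable[[Detail, str], bool],
) -> List[Detail]:
    """Search coarsely, expand each hit into finer items, then re-filter."""
    final = []
    for result in search(query):          # e.g. page level search results
        for detail in expand(result):     # e.g. the paragraphs on that page
            if matches(detail, query):    # reapply the original query
                final.append(detail)
    return final


# Made-up data standing in for a document: page name -> paragraphs.
PAGES = {
    "page-1": ["Deposit libraries collect online content.",
               "Fees are not discussed here."],
    "page-2": ["This page is about printed works."],
}


def search_pages(query: str) -> List[str]:
    return [name for name, paras in PAGES.items()
            if any(query.lower() in p.lower() for p in paras)]


def page_paragraphs(name: str) -> List[str]:
    return PAGES[name]


def paragraph_matches(paragraph: str, query: str) -> bool:
    return query.lower() in paragraph.lower()


print(search_expand_filter("online", search_pages, page_paragraphs, paragraph_matches))
```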

PS here’s a trick I picked up from Joss’ blog somewhere for reversing the order of feed items published by WordPress:
http://writetoreply.org/legaldeposit/feed/?orderby=ID&order=ASC
I assume these parameters also work?
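If you want to bolt the orderby and order parameters from the URL above onto an arbitrary feed URL, something like this sketch would do it. The helper name oldest_first is hypothetical, and I haven't tested whether WordPress honours these parameters on every kind of feed:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def oldest_first(feed_url: str) -> str:
    """Add orderby=ID&order=ASC to a feed URL, keeping any existing parameters."""
    parts = urlsplit(feed_url)
    params = dict(parse_qsl(parts.query))
    params.update({"orderby": "ID", "order": "ASC"})
    return urlunsplit(parts._replace(query=urlencode(params)))


print(oldest_first("http://writetoreply.org/legaldeposit/feed/"))
```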