mashed library

mashing up libraries since 2008

Owen Stephens

Extracting ISBNs from RSS feeds, web pages, or anywhere really

ISBNs are by no means a perfect identifier for books - but they are widely used, and so it is often useful to be able to grab an ISBN from one place and pass it to another (e.g. extract an ISBN from a web page, and then use it to search a library catalogue). Post recipes for extracting ISBNs here.


Replies to This Discussion

Recipe 1: Yahoo Pipes (with a healthy side helping of regular expressions)

First prepare your regular expression.

A regular expression is "a pattern describing a certain amount of text" (http://www.regular-expressions.info/tutorial.html) - a bit like a supercharged version of Word's 'Find and Replace'. Regular expressions are incredibly powerful, and can be extremely complex. For this recipe we are going to create a basic regular expression which will match an ISBN within a longer piece of text.

At their most basic, regular expressions are literally just the set of characters you want to match. If we know we are looking for a specific ISBN (e.g. 186094311X) the regular expression would look like:

186094311X

However, for this example we want to match not just a specific ISBN but any ISBN. Luckily ISBNs have pretty strict rules about how they are structured, which tend to make them easy to spot. For this recipe I'm going to assume that ISBNs are always 10 or 13 characters long, and that every character is a digit with the possible exception of the last character, which can be an X (N.B. this is an oversimplification; I'll expand a bit more on its limitations in a separate post).

Regular Expressions allow us to match patterns by introducing some special characters, and allowing some operators which express how often a character appears (amongst other things). For example any single digit (0-9) can be matched with the special character:

\d

This would find any single digit (no matter which digit this was) in a body of text. If we wanted to find 4 digits next to each other (e.g. a year) we can write an expression like:

\d{4}

The {4} indicates the number of repetitions of a character - e.g. x{4} would find the pattern 'xxxx'. Because we have used \d{4} this would match 1111 or 2009 equally well. Getting back to the ISBN example, we now know how to search for a 10 digit number:

\d{10}

This would find any 10 digit number - and if we know that we are working with bibliographic data then we can be pretty sure that any 10 digit number will be an ISBN. There are two problems with this. Firstly, \d only matches digits, so the regular expression \d{10} wouldn't find the ISBN 186094311X because this is 9 digits followed by an X - not 10 digits. Secondly, we know that ISBNs can be either 10 OR 13 characters long these days. These problems mean that we need a more sophisticated regular expression.

Another operator in a regular expression allows us to express a pattern that includes one character OR another. This might be useful when searching for spelling variations. We can match both 'organisation' and 'organization' by using the expression:

organi[s|z]ation

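Here the use of [s|z] says look for an 's' OR a 'z' - one or the other. (Strictly speaking, square brackets already mean 'any one of the characters listed', so the | inside them is treated as a literal character that can also match; [sz] on its own does the same job more tightly. The same applies to the bracketed parts of the expressions below.)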

So we can now build a regular expression that says 'look for 9 digits followed by another digit or an X' as follows:

\d{9}[\d|X]

This will now match my example ISBN 186094311X (9 digits, followed by an X)
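
As a quick check outside Pipes, the two patterns so far can be tried with Python's standard re module. This is only a sketch: the variable names are made up for illustration, and the second sample string is simply an ISBN that happens to end in a digit.

```python
import re

ten_digits = re.compile(r"\d{10}")     # ten digits, no allowance for a final X
isbn10 = re.compile(r"\d{9}[\d|X]")    # nine digits, then a digit or an X

samples = ["186094311X", "0571089895"]

# For each sample, record whether each pattern finds a match anywhere in it.
results = {s: (bool(ten_digits.search(s)), bool(isbn10.search(s))) for s in samples}

for s, (plain, with_x) in results.items():
    print(s, "ten digits:", plain, "| nine digits + digit/X:", with_x)
```

The plain ten-digit pattern misses 186094311X, while the second pattern finds both samples.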

To deal with the possibility of an ISBN with 13 characters (i.e. 12 digits followed by another digit or an X) we can add an element to the regular expression which uses the fact that when specifying the number of repetitions we can put more than one possibility:

\d{9,12}[\d|X]

OK, now we have a regular expression that will match both ISBN-10 and ISBN-13. There are a few more elements to the final regular expression - I'm just going to cover these really briefly:

^ = start of line
$ = end of line
. = match any character at all
* = 0 or more repetitions of the preceding character (so .* matches any number of any characters)
\b = called a 'word boundary' - the position between a word character (a letter, digit or underscore) and a non-word character (like a space or punctuation), or the start or end of the text

I'm also going to allow for the possibility of the ISBN ending with a lower case x rather than an upper case X - regular expressions are generally case sensitive. As a final touch, I'm going to use parentheses ( and ) to surround the part of the regular expression that will match the actual ISBN - we'll come to why we do this in a moment.

The final regular expression (regexp) is:

^.*\b(\d{9,12}[\d|X|x])\b.*$

This says something like 'look for the start of a line, followed by any number of any characters, followed by a word boundary, followed by an ISBN-10 or ISBN-13, followed by a word boundary, followed by any number of any characters, followed by the end of the line'

Right - now we've prepared the regular expression, we've done the hard bit, and we are ready to plug this into Yahoo Pipes.
  • Login to Yahoo Pipes at http://pipes.yahoo.com (you need a Yahoo account of course)
  • Click the 'Create a new Pipe' link
  • From the 'Sources' section on the lefthand side find the 'Fetch Feed' module and drag into the pipes area on the right. Enter the URL for the first feed into the box in the 'Fetch Feed' module - as an example I'm going to use the New Books feed from the University of Bradford - the URL is http://www.brad.ac.uk/library/newbooks/newbooks.rss
  • From the 'Operators' section on the lefthand side, find the 'Rename' module, and drag into the pipes area. Plumb the output of the Fetch Feed module into the input of the Rename block. In the Rename block you need to set up a 'Mapping'. In the lefthand box choose from the dropdown 'item link'. In the middle box choose from the dropdown 'Copy'. In the righthand box type 'isbn'
  • From the 'Operators' section on the lefthand side, find the 'Regex' module, and drag into the pipes area. Plumb the output of the 'Rename' module into the input of the 'Regex' block. In the Regex block you need to set up a Rule. In the lefthand box choose from the dropdown 'item.isbn' - this is the copy of the Link that you have just made. In the replace box enter the regexp we prepared above '^.*\b(\d{9,12}[\d|X|x])\b.*$'. In the 'with' box you simply enter '$1' - this is where the parentheses we added to the regex come in - by using the parentheses we mark part of the regular expression as something we can reuse later - and the pattern marked by the first set of parentheses is referred to as $1 (and any patterns in subsequent parentheses can be reused as $2, $3 etc.)
  • This is essentially taking the whole contents of the newly created 'isbn' field - from the start of the line to the end of the line - and replacing it with just that bit of the whole line that matches an ISBN-10 or ISBN-13
  • Finally, plumb the output of the Regex block into the Pipe Output. You should find that in the output you can see the 'isbn' field. This could then be plugged into other pipes or used to make links to other catalogues etc.
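
For anyone who wants to try the same idea outside Yahoo Pipes, the Rename and Regex steps above can be sketched in Python. This is a rough equivalent rather than what Pipes does internally: the function name add_isbn and the sample item are made up, the item is a plain dictionary instead of a fetched feed, and the ISBN on the end of the link is assumed for illustration.

```python
import re

# The regexp from the recipe. Pipes' "$1" is written "\1" in Python.
ISBN_IN_LINE = re.compile(r"^.*\b(\d{9,12}[\d|X|x])\b.*$")

def add_isbn(item):
    """Copy the item's link into an 'isbn' field, then reduce it to the captured ISBN."""
    out = dict(item)
    out["isbn"] = out["link"]                            # Rename step: copy link to isbn
    out["isbn"] = ISBN_IN_LINE.sub(r"\1", out["isbn"])   # Regex step: replace with group 1
    return out

# Hypothetical feed item; nothing is fetched from the real feed here.
item = {
    "title": "Example title",
    "link": "http://ipac.brad.ac.uk/ipac20/ipac.jsp?index=ISBN&term=186094311X",
}

print(add_isbn(item)["isbn"])
```

Note that in this sketch a field with no ISBN-like number in it is left unchanged, because the substitution only happens when the whole pattern matches.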

Reply to This

Probably time to come clean and admit there are some problems with the Regular Expression in the recipe above. It assumes that the ISBN will be just a continuous set of 10 or 13 characters. However, in real life an ISBN might well be split into blocks of characters separated by hyphens (or even spaces). E.g.

0 571 08989 5
90-70002-34-5

To take into account all the possible variations you'd need a much more complex regular expression.

Reply to This

How about this:

^.*\b((\d[-| ]?){9,12}[\d|X|x])\b.*$

If I understand this right, it searches for (a digit then either a hyphen or space, 0 or 1 times) 9 to 12 times then is the same as your one above. I tested it using the pipes as you suggested and it also seems to work on this regular expression checker:
http://members.ziggo.nl/h.schotel/testaregex/


One other problem, though, is that it still matches 11 and 12 digit ISBNs, as {9,12} means at least 9 and a maximum of 12.

Reply to This

Refinement to only accept 10 or 13 digit ISBNs:

^.*\b((\d[-| ]?){9}((\d[-| ]?){3})?[\d|X|x])\b.*$

^.*\b(
(\d[-| ]?){9} finds 1 group of 9 digits, each followed by 0 or 1 spaces or hyphens
((\d[-| ]?){3})? finds 0 or 1 group of 3 digits, each followed by 0 or 1 spaces or hyphens
[\d|X|x] finds 1 digit, or an upper or lower case x
)\b.*$
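
The difference between the {9,12} version and this refinement can be checked with a short Python sketch. It tests only the part inside the outer parentheses (using fullmatch, so the ^.*\b and \b.*$ wrapper is not needed), and the test strings are just runs of the digit 1, not real ISBNs.

```python
import re

# Inner part of the earlier pattern: 9 to 12 digit groups, then a check character.
loose = re.compile(r"(\d[-| ]?){9,12}[\d|X|x]")
# Inner part of the refinement: 9 digit groups, optionally 3 more, then a check character.
refined = re.compile(r"(\d[-| ]?){9}((\d[-| ]?){3})?[\d|X|x]")

lengths = range(9, 15)
loose_lengths = [n for n in lengths if loose.fullmatch("1" * n)]
refined_lengths = [n for n in lengths if refined.fullmatch("1" * n)]

print("loose accepts lengths:  ", loose_lengths)
print("refined accepts lengths:", refined_lengths)
```

The first pattern accepts every length from 10 to 13, while the refined one accepts only 10 and 13.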

Reply to This

Thanks for picking up on that error - I'd forgotten that the {9,12} meant a Minimum of 9 repetitions and Maximum of 12 repetitions (bit rusty on this!)

The suggestion you've made is more flexible and would match a wider variety of patterns than mine - so this would match all of the following ISBNs:

186094311X
0 571 08989 5
90-70002-34-5
978-0747557869

These are all real life examples. However, the expression would also match:

0 5 7 1 0 8 9 8 9 5
9-0-7-0-0-0-2-3-4-5

These are highly unlikely patterns for valid ISBNs.
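
A small Python sketch makes the point: all six strings above get through the refined pattern. Only the part inside the outer parentheses is tested here, and the variable names are made up for illustration.

```python
import re

# Inner part of the refined pattern suggested earlier in the thread.
refined = re.compile(r"(\d[-| ]?){9}((\d[-| ]?){3})?[\d|X|x]")

candidates = [
    "186094311X",
    "0 571 08989 5",
    "90-70002-34-5",
    "978-0747557869",
    "0 5 7 1 0 8 9 8 9 5",   # unlikely layout, still accepted
    "9-0-7-0-0-0-2-3-4-5",   # unlikely layout, still accepted
]

matched = [c for c in candidates if refined.fullmatch(c)]
print(len(matched), "of", len(candidates), "candidates matched")
```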

In theory the structure of an ISBN is strictly defined (by the ISBN Users' Manual - http://www.isbn-international.org/en/download/2005%20ISBN%20Users'%20Manual%20International%20Edition.pdf):

Prefix (ISBN-13 only): 3 digits, currently "978" or "979"
Registration group element: 1-5 digits
Registrant element: 1-7 digits
Publication element: 1-6 digits
Check digit: Single character (digit or X)

Although each element can vary in length, the ISBN will always be 10 or 13 characters long.

The ISBN Users' Manual also says:

When printed, the ISBN is always preceded by the letters “ISBN”.
Note: In countries where the Latin alphabet is not used, an abbreviation in the characters
of the local script may be used in addition to the Latin letters “ISBN”.
The ISBN is divided into five elements, three of them of variable length; the first and last
elements are of fixed length. The elements must each be separated clearly by hyphens or
spaces when displayed in human readable form:
ISBN 978-0-571-08989-5
or
ISBN 978 0 571 08989 5
Note: The use of hyphens or spaces has no lexical significance and is purely to enhance
readability.


In reality you only need to look at a page in Amazon to see that the spaces/hyphens are often omitted, so I wouldn't recommend writing a regular expression that assumes this, although there are some examples of this floating around - e.g. http://www.regexlib.com/REDetails.aspx?regexp_id=463 (ISBN-10 only)

You might assume that if there are spaces or hyphens then they will only appear in the appropriate places - e.g. you'd never get a space/hyphen after the first digit of an ISBN-13 - and you could write a regexp to find the valid patterns within these restrictions - but I think we are into diminishing returns here.

Finally worth saying that you could of course check for valid ISBNs by following the rules in the ISBN Users' Manual, including calculating the check digit and ensuring the ISBN is indeed valid. However, this takes us way beyond a simple recipe for a quick and easy mashup!
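
For anyone who does want to go that far, the check digit test is short. The sketch below is a minimal Python version, assuming hyphens and spaces are simply stripped first; the function name is made up. It uses the standard rules: for ISBN-10 the digits are weighted 10 down to 1 (with X counting as 10) and the total must divide by 11; for ISBN-13 the digits are weighted alternately 1 and 3 and the total must divide by 10.

```python
import re

def is_valid_isbn(raw):
    """Return True if raw is an ISBN-10 or ISBN-13 with a correct check digit."""
    chars = raw.replace("-", "").replace(" ", "").upper()

    if re.fullmatch(r"[0-9]{9}[0-9X]", chars):
        # ISBN-10: weights 10, 9, ..., 1; a final X stands for the value 10.
        values = [10 if c == "X" else int(c) for c in chars]
        total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
        return total % 11 == 0

    if re.fullmatch(r"[0-9]{13}", chars):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
        return total % 10 == 0

    return False

for candidate in ["186094311X", "0 571 08989 5", "90-70002-34-5",
                  "978-0747557869", "1860943110"]:
    print(candidate, is_valid_isbn(candidate))
```

The last candidate is the first one with its check character changed, so it should fail while the four real examples pass.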

Reply to This

Yes, it's a balance between finding all ISBNs and finding only correct ones. As you say, ideally you'd validate them too, although there are so many incorrect ISBNs in bibliographic databases (020 $z for instance or $a and entered anyway) that it wouldn't be useful to exclude them if you can be reasonably sure the ones you are pulling out are likely to be OK.

As for the hyphens and spaces, it does really depend on the data you are working on. If the data has come from a library catalogue, then it's probably irrelevant as I've only seen them in really dodgy MARC catalogue records (I am an academic library cataloguer), so something like the following would be fine (I think):
^.*\b(\d{9}(\d{3})?[\d|X|x])\b.*$
My earlier regexp was as you say a little too free and could possibly pick up other numbers or sequences of numbers separated by spaces. I can't think of a concrete example but it seems plausible.

That said, data pulled from publishers' catalogues would probably have loads of hyphens and spaces so would need to cater for them. As long as you could narrow the field you are searching on then it would probably be OK to use a freer version as there are not likely to be other numbers in an ISBN data element aside from things like volume number, print date, numbers in publisher's name, all still fairly unlikely.

Reply to This

Agreed. Also in a specific source there may well be other patterns you can look for that will indicate the presence of an ISBN. In the example I use (University of Bradford New Books feed) the Link field always looks like:

http://ipac.brad.ac.uk/ipac20/ipac.jsp?index=ISBN&term=

Which is pretty easy to match even if you don't do any clever stuff with the ISBN itself.
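
As a sketch of that approach in Python, the ISBN can be read straight from the 'term' parameter of the link using the standard urllib.parse module. The function name is made up, and the full link shown (with an ISBN on the end) is an assumed example of the pattern rather than a real feed entry.

```python
from urllib.parse import urlparse, parse_qs

def isbn_from_link(link):
    """Return the 'term' value if the link is an ISBN index search, else None."""
    params = parse_qs(urlparse(link).query)
    if params.get("index") != ["ISBN"]:
        return None
    # parse_qs drops blank values, so a bare "term=" gives None here.
    return params.get("term", [None])[0]

# Assumed example link following the pattern above.
link = "http://ipac.brad.ac.uk/ipac20/ipac.jsp?index=ISBN&term=186094311X"
print(isbn_from_link(link))
```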

Reply to This

The National Digital Archive of Datasets includes the ability to search using Regular Expressions - there is some documentation at http://www.ndad.nationalarchives.gov.uk/help/data_browsing/regular_..., and http://tinyurl.com/qgny42 which give a good basic intro

Reply to This

Just found this tutorial tool for Regular Expressions developed at Dev8D last year
http://users.ecs.soton.ac.uk/cjg/regexp/

Reply to This

This looks like it could be a really useful "ingredient", Owen; many thanks.

We could do the same for ISSNs, which, at least, are slightly more predictable than ISBNs (only slightly - they're at least always 8 digits, and only contain the one hyphen, tho' this is sometimes omitted or replaced by a space). Would it be something along the lines of....

^.*\b(\d{4}[-| ]?\d{3}[\d|X|x])\b.*$

I.e.

Start of the line.
Any number of characters.
A word-boundary.
(
Four digits.
A hyphen or a space, 0 or 1 times.
Three digits.
A digit or an X or an x.
)
A word-boundary.
Any number of characters.
End of the line.

So, this matches all of: 1234-1234, 1234-123X, 1234-123x, 12341234, 1234123X, 1234123x, 1234 1234, 1234 123X, and 1234 123x.

As with the ISBN examples, it'd be perfectly possible for some of those to turn out not to be ISSNs at all, but strings of numbers for some other purpose - but in a page of information about journals, they'd probably turn out to be ISSNs, and that's the important thing!
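
A quick Python check of the list above, as a sketch only. The pattern is written to follow the step-by-step description (four digits, optional hyphen or space, three digits, then a digit or X or x), and the sentence used in the second test is invented.

```python
import re

# Pattern following the breakdown above; group 1 captures the ISSN itself.
ISSN_IN_LINE = re.compile(r"^.*\b(\d{4}[-| ]?\d{3}[\d|X|x])\b.*$")

examples = ["1234-1234", "1234-123X", "1234-123x", "12341234", "1234123X",
            "1234123x", "1234 1234", "1234 123X", "1234 123x"]

# Collect any example the pattern fails to match.
unmatched = [e for e in examples if not ISSN_IN_LINE.match(e)]
print("unmatched examples:", unmatched)

# Pull an ISSN out of a longer (made-up) line of text.
m = ISSN_IN_LINE.match("Print ISSN 1234-123X, published quarterly")
print(m.group(1))
```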

Paul

Reply to This

O'Reilly's "Regular Expressions Cookbook" (2009) has the following regexes for ISBNs:

ISBN10 ^(?:ISBN(?:-10)?:? )?(?=[-0-9X ]{13}$|[0-9X]{10}$)[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$

ISBN13 ^(?:ISBN(?:-13)?:? )?(?=[-0-9 ]{17}$|[0-9]{13}$)97[89][- ]?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9]$

ISBN10 & 13 ^(?:ISBN(?:-1[03])?:? )?(?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$
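
To see how the combined expression behaves, here is a small Python sketch that runs it against the examples from earlier in the thread. The regex is copied from the line above (split over two string literals only for readability); the variable names are made up. Unlike the earlier recipes it is anchored with ^ and $, so it tests a whole string rather than finding an ISBN inside a longer line.

```python
import re

# Combined ISBN-10 / ISBN-13 pattern as quoted above.
ISBN_10_OR_13 = re.compile(
    r"^(?:ISBN(?:-1[03])?:? )?(?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)"
    r"(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$"
)

candidates = [
    "ISBN 978-0-571-08989-5",
    "186094311X",
    "0 571 08989 5",
    "90-70002-34-5",
    "978-0747557869",        # only one hyphen: 14 characters in total
    "0 5 7 1 0 8 9 8 9 5",   # a separator after every digit
]

accepted = [c for c in candidates if ISBN_10_OR_13.match(c)]
rejected = [c for c in candidates if c not in accepted]
print("accepted:", accepted)
print("rejected:", rejected)
```

Because the lookahead only allows total lengths of 10, 13 or 17 characters, the partly hyphenated 978-0747557869 is rejected along with the one-digit-at-a-time layout, so this expression is stricter about separators than the ones earlier in the thread.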

Reply to This

Looks thorough!

Reply to This


© 2009 Created by Owen Stephens on Ning.