Exploring Astronomy Dataset Links with Gridworks

At ADS we are looking at new ways to index and provide full text searching for the Astronomy and Physics literature we manage to obtain, either through scanning + OCR of historical content, or from digital material provided by some publishers. Two options we’re looking at are Apache Solr and CDS-Invenio. But that’s not what this post is about.

While parsing and indexing a pile of about 42k articles from the past dozen or so years of the ApJ, AJ, ApJL and ApJS, formatted in the NLM XML schema, I noticed that many of the articles contained external links to various things, most interestingly, astronomical datasets.* My first thought was, “hmm, I wonder what’s at the other end of all those links…,” followed closely by, “hey, crawling those links would make a nice dataset to load into that nifty new Freebase Gridworks tool I heard about the other day.” So that’s what I did.

Out of 13652 articles there were 33600 links in total, which fell into three categories: HTTP URLs (28555), dataset links (938) and supplement links (4107). Dataset links consist of an identifier that looks something like ADS/Sa.CXO#obs/927. To get the goods you have to feed that id to a resolver which, assuming a valid identifier, will redirect you to the real location of the dataset. Supplement links took a bit more head-scratching, as their values consisted of just a relative file name, like datafile3.txt or 69491.figures.html. We figured out that the solution was to append the filename to the publisher’s URL for the article, e.g., article and dataset or article and figures.
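Roughly, the two transformations look like this in Python. This is a sketch, not the actual crawl script: the resolver base URL and the article URL are placeholders (the real ones are not given here), the function names are made up for illustration, and percent-encoding the identifier is an assumption about what a resolver would want.

```python
from urllib.parse import quote

# Placeholder: the real resolver address is not given in this post.
RESOLVER_BASE = "http://resolver.example.org/"


def dataset_url(dataset_id, resolver_base=RESOLVER_BASE):
    """Build the URL that asks the resolver to redirect to a dataset."""
    # Percent-encode characters such as '#' so the whole identifier
    # reaches the resolver instead of being cut off as a URL fragment.
    # Whether the real resolver expects this encoding is an assumption.
    return resolver_base + quote(dataset_id)


def supplement_url(article_url, filename):
    """Append a relative supplement filename to the article's URL."""
    return article_url.rstrip("/") + "/" + filename


print(dataset_url("ADS/Sa.CXO#obs/927"))
print(supplement_url("http://publisher.example.org/article/123", "datafile3.txt"))
```

Following the redirect from the resolver is then left to whatever HTTP client does the crawling.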

The ultimate objective was to load the results of crawling these links into Gridworks, but that meant getting the data into csv or tsv form. Rather than have the crawl script output straight to csv, I stashed the results in a MongoDB instance. Here’s an example of one of the resulting json documents in Mongo:

{u'_id': ObjectId('4bfc3737a1f714263b000012'),
 u'anchor_text': u'http://astronomy.swin.edu.au/staff/dforbes/glob.html',
 u'bibcode': u'2001ApJ...556L..83F',
 u'content': u'\n\nDuncan A. Forbes, Swinburne University, Globular Clusters\n\n\n

Globular Cluster Research

\n\nI am interested in various aspects of Extragalactic Globular\n    Cluster research. In particular the formation and evolution\n    of Globular Cluster Systems and their host galaxies. \n
\n\n\n\n\n\n\n\n\n\n\n\n
    \n\n\t  HREF="http://www.physics.mcmaster.ca/resources/fs3_resources.html"> HARRIS DATABASE\n
\n\n\n\n
\n\n \n'
,
 u'context': u'

The combined sample data are available at http://astronomy.swin.edu.au/staff/dforbes/glob.html.

\n'
,
 u'doi': u'10.1086/323006',
 u'ft_source': u'/proj/ads/articles/sources/AAS/ApJL/2001/556/2/323006/323006.xml',
 u'link_id': u'http://astronomy.swin.edu.au/staff/dforbes/glob.html',
 u'link_type': u'UrlLink',
 u'response': {u'accept-ranges': u'bytes',
               u'content-length': u'781',
               u'content-location': u'http://astronomy.swin.edu.au/~dforbes/glob.html',
               u'content-type': u'text/html; charset=UTF-8',
               u'date': u'Tue, 25 May 2010 10:14:07 GMT',
               u'server': u'Apache/2.2.15 (Unix) DAV/2 mod_ssl/2.2.15 OpenSSL/0.9.8e-fips-rhel5',
               u'status': u'200'},
 u'solr_id': u'31908',
 u'url': u'http://astronomy.swin.edu.au/staff/dforbes/glob.html',
 u'xpath': u'/html/article/body/sec[5]/fn-group/fn/p/ext-link'}

From there it was easy to dump what I needed to csv and load into Gridworks. I’m not going to get into how totally awesome the Gridworks software is, except to say you should watch the demo videos.
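For the curious, the dump step can be sketched in a few lines of Python. The field selection below is hypothetical (picked from the example document above), and the helper names are illustrative rather than what the original script used.

```python
import csv

# Hypothetical choice of top-level fields, taken from the example document.
FIELDS = ["_id", "solr_id", "bibcode", "doi", "xpath", "link_type",
          "link_id", "anchor_text", "ft_source", "url"]
# Fields pulled out of the nested 'response' sub-document.
RESPONSE_FIELDS = ["status", "content-type", "content-location",
                   "content-length"]


def flatten(doc):
    """Turn one crawl document into a flat dict of strings (one CSV row)."""
    row = {name: str(doc.get(name, "")) for name in FIELDS}
    response = doc.get("response") or {}
    for name in RESPONSE_FIELDS:
        row[name] = str(response.get(name, ""))
    return row


def write_csv(docs, out):
    """Write an iterable of crawl documents to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS + RESPONSE_FIELDS)
    writer.writeheader()
    for doc in docs:
        writer.writerow(flatten(doc))
```

With pymongo, `docs` could be the cursor returned by a collection's `find()`, and `out` a file opened with `newline=""`. Missing fields, such as the response headers of a link that never answered, simply come out as empty cells.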

I can’t post the entire Gridworks project, but here are some screencaps, a column list and some of the more interesting facets.

Initial data load plus some derived columns

Column list:

  • Id of the MongoDB doc
  • Id of the solr doc
  • ADS bibcode identifier of the article
  • Publication year – derived from the bibcode
  • DOI
  • xpath expression of the element
  • parent tag – the containing element type
  • link context – the containing element’s serialized xml contents
  • link type – one of url, dataset or supplement
  • anchor text – the text contents of the link element
  • full text source file
  • journal
  • full text source – publisher
  • extlink id – either the url or the dataset id or the supplement filename
  • domain – derived from the url
  • status – http status returned when requesting the resource
  • content-type – content-type header returned in the response
  • mimetype – derived from the content-type response header
  • location – the final url of the resource following any redirects
  • content length
  • response headers – list of all the header attribute names returned in the response (just to see what other interesting stuff might be there)
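Three of the derived columns are simple string manipulations. A Python sketch of the idea follows; the function names are illustrative, and the post doesn't say whether the derivations were done in a script or inside Gridworks itself.

```python
from urllib.parse import urlparse


def publication_year(bibcode):
    """The first four characters of an ADS bibcode are the publication year."""
    return int(bibcode[:4])


def domain(url):
    """Host name portion of a URL (empty string if there is none)."""
    return urlparse(url).hostname or ""


def mimetype(content_type):
    """Strip parameters such as charset from a Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()


print(publication_year("2001ApJ...556L..83F"))
print(domain("http://astronomy.swin.edu.au/staff/dforbes/glob.html"))
print(mimetype("text/html; charset=UTF-8"))
```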

<p class="wp-caption-text">
  Still to be determined how many of the url links point to some kind of data
</p>

<p class="wp-caption-text">
  Knowing the container could help parsing out something about the semantics of the link
</p>

  <p class="wp-caption-text">
    About 70% returning 200 was more than I expected. Of course, a 200 doesn't mean the request actually found something interesting.
  </p>
</div>

<div id="attachment_151" style="width: 283px" class="wp-caption aligncenter">
  <a href="http://blog.reallywow.com/static/uploads/2010/05/gridworks_contenttype.png"><img class="size-full wp-image-151" title="gridworks_contenttype" src="http://blog.reallywow.com/static/uploads/2010/05/gridworks_contenttype.png" alt="" width="273" height="473" /></a>

  <p class="wp-caption-text">
    I would have hoped for fewer text/html responses
  </p>
</div>

<div id="attachment_152" style="width: 285px" class="wp-caption aligncenter">
  <a href="http://blog.reallywow.com/static/uploads/2010/05/gridworks_domain.png"><img class="size-full wp-image-152  " title="gridworks_domain" src="http://blog.reallywow.com/static/uploads/2010/05/gridworks_domain.png" alt="" width="275" height="321" /></a>

  <p class="wp-caption-text">
    All the gcn.gsfc.nasa.gov hits look like observation reports, like this one, which I think is a good thing
  </p>
</div>

<p style="text-align: left;">
  Finally, thanks to <a href="http://dysinterested.com/">Sean Hannan</a>, who worked out <a href="http://gist.github.com/414927">a hack</a> to a bit of the Gridworks javascript that automatically turns any cell values beginning with &#8220;http://&#8221; or &#8220;https://&#8221; into active links. The nice thing about that was it let me turn the column containing the MongoDB id into a link to a little <a href="http://webpy.org">web.py</a> script that dumps a JSON representation of the document.
</p>

<p>
  * NLM allows for links to external resources using either <a href="http://dtd.nlm.nih.gov/articleauthoring/tag-library/2.3/n-ju50.html">&lt;ext-link&gt;</a> or <a href="http://dtd.nlm.nih.gov/articleauthoring/tag-library/2.3/n-2hw0.html">&lt;supplementary-material&gt;</a> elements.
</p>
Written on May 27, 2010