Tuesday, August 4, 2015

Parsing the DICOM Standard with JavaScript in the browser

This morning I decided to enhance the DICOM Dump with Data Dictionary example from the dicomParser library to include the human-friendly name for UIDs.  Doing this would require a lookup table of UIDs to names.  I didn't have such a lookup table, so I would have to borrow it from somewhere else or make it myself.  I know that DICOM publishes itself in a variety of electronic formats, some of which are intended to be easy to parse for exactly this purpose, but I had never tried parsing them before.  I figured I would give it a shot and see how it went.  I started out by checking out David Clunie's DICOM Status page and noticed that there were several formats - PDF, HTML, CHTML, Word, ODT and XML.  My heart immediately sank as I realized that I was going to have to parse XML.  Then an idea came to me - why not just use the JavaScript console embedded in the web browser to extract the data I wanted from the HTML using JavaScript?  I opened the HTML page for PS 3.6 and used the "inspect element" feature to look at the structure.  I noticed that the table I wanted was marked with an ID, which meant I could probably build a selector to find the tbody I needed.  A few tries later, I came up with the following selector:

$('#table_A-1 ~ div tbody')

Next up was to write some JavaScript to iterate over each tr in the tbody and write out the UID and name as JavaScript so I could paste it into my file.  A bit of trial and error later, I came up with the following:

(function () {
  var elements = document.querySelectorAll('#table_A-1 ~ div tbody tr');
  var result = "";
  for (var i = 0; i < elements.length; i++) {
    result += "'" + elements[i].childNodes[1].childNodes[1].innerText + "':'" +
      elements[i].childNodes[3].childNodes[1].innerText + "',\n";
  }
  return result;
})();
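As an aside, the `childNodes[1]`/`childNodes[3]` indexing works because the whitespace between tags shows up as text nodes, so the actual td elements land at the odd indices. A slightly more robust variant, sketched here as a standalone function (the name `rowsToEntries` is mine, not from the original snippet), uses the row's standard `cells` collection, which contains only the cell elements:

```javascript
// Build the "'uid':'name'," lines from a list of table rows.
// HTMLTableRowElement.cells holds only <td>/<th> elements, so the
// whitespace text nodes that childNodes[1]/childNodes[3] were stepping
// over simply don't appear.
function rowsToEntries(rows) {
  var result = "";
  for (var i = 0; i < rows.length; i++) {
    var cells = rows[i].cells; // element cells only, no text nodes
    result += "'" + cells[0].innerText + "':'" + cells[1].innerText + "',\n";
  }
  return result;
}

// In the DevTools console you would call it like this:
// rowsToEntries(document.querySelectorAll('#table_A-1 ~ div tbody tr'));
```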

Which generated exactly what I wanted!  I pasted the resulting string into a new file and tried it out - but it wasn't working.  For some reason, the lookup on UID was not matching.  I looked a bit closer and noticed that the values in the HTML had some non-printable characters in them:

1.2.840.10008.5.1.4.1.&#8203;1.&#8203;2
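That `&#8203;` entity is U+200B, ZERO WIDTH SPACE - invisible when rendered, but still part of the string, which is why the lookup fails. A small console sketch shows the problem and the fix (the regex is the same "printable ASCII only" filter used below):

```javascript
// A UID as copied from the rendered HTML: it contains two invisible
// U+200B (zero width space) characters, written here as \u200b escapes.
var fromHtml = '1.2.840.10008.5.1.4.1.\u200b1.\u200b2';

// Strip everything outside the printable ASCII range 0x20-0x7E.
var clean = fromHtml.replace(/[^\x20-\x7E]+/g, '');

// clean is now '1.2.840.10008.5.1.4.1.1.2' - two characters shorter,
// and it matches the key in the lookup table.
```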

I made another change to my JavaScript to strip out non-printable characters:

(function () {
  var elements = document.querySelectorAll('#table_A-1 ~ div tbody tr');
  var result = "";
  for (var i = 0; i < elements.length; i++) {
    result += "'" + elements[i].childNodes[1].childNodes[1].innerText.replace(/[^\x20-\x7E]+/g, '') + "':'" +
      elements[i].childNodes[3].childNodes[1].innerText.replace(/[^\x20-\x7E]+/g, '') + "',\n";
  }
  return result;
})();

And now I had the data I wanted!  Here is a link to the resulting JavaScript.  Pretty cool little hack demonstrating the power of what you can do with JavaScript in a web browser.  The same strategy can be used to quickly extract data from any web page into any format you want.
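For context, the generated file is essentially one big object literal mapping UID to name. A minimal sketch of how such a table gets consumed (these few entries are real, well-known UIDs from PS 3.6, but the variable and function names here are illustrative, not the actual generated file):

```javascript
// Hypothetical excerpt of the generated lookup table; the real file
// contains all of Table A-1 from PS 3.6.
var uidNames = {
  '1.2.840.10008.1.2': 'Implicit VR Little Endian',
  '1.2.840.10008.1.2.1': 'Explicit VR Little Endian',
  '1.2.840.10008.5.1.4.1.1.2': 'CT Image Storage'
};

// Resolve a UID to its human-friendly name, falling back to the raw
// UID string when it isn't in the table.
function uidToName(uid) {
  return uidNames[uid] || uid;
}
```

With a table like this in place, the dump example can show "Explicit VR Little Endian" instead of a bare "1.2.840.10008.1.2.1".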

2 comments:

  1. LOL, I always feel the same way about XML.

  2. This "fear of HTML" is pretty amusing, when you consider that I use the HTML form of tables to encode the DocBook XML source of the standard (from which the HTML and CHTML and everything else is derived); so your JavaScript could just as easily look for <table/> elements with an id or label attribute in the DocBook XML without having to skip the rendering cruft in the HTML.

    There are also a bunch of XSL-T stylesheets in the "support" folder of the "sourceandrenderingpipeline" file in the distribution of each release, which were intended to inspire folks to do this sort of thing. E.g., you could just do <xsl:template match="docbook:table[@label = 'A-1']"/>, etc.

    I.e., XSL-T is nothing to be afraid of either.

    David
