Tuesday, August 4, 2015

Parsing the DICOM Standard with Javascript in the browser

This morning I decided to enhance the DICOM Dump with Data Dictionary example from the dicomParser library to include the human-friendly name for UIDs.  Doing this would require a lookup table of UIDs to names.  I didn't have such a lookup table, so I would have to borrow it from somewhere else or make it myself.  I knew that DICOM publishes itself in a variety of electronic formats, some of which are intended to be easy to parse for exactly this purpose, but I had never tried parsing them before.  I figured I would give it a shot and see how it went.

I started out by checking out David Clunie's DICOM Status page.  I noticed that there were several formats - PDF, HTML, CHTML, Word, ODT and XML.  My heart immediately sank as I realized that I was going to have to parse XML.  Then an idea came to me - why don't I just use the Javascript console embedded in the web browser to extract the data I want from the HTML using Javascript?

I open the HTML page for PS 3.6 and use the "inspect element" feature to look at the structure.  I notice that the table I want is marked with an ID, which means I can probably build a selector to find the tbody I want.  A few tries later I come up with the following selector:

$('#table_A-1 ~ div tbody')

Next up is to write some Javascript to iterate over each tr in the tbody and write out the UID and name as a line of Javascript source, so I can paste the result into my file.  A bit of trial and error later, I come up with the following:

(function () {
  var elements = document.querySelectorAll('#table_A-1 ~ div tbody tr');
  var result = "";
  for (var i = 0; i < elements.length; i++) {
    result += "'" + elements[i].childNodes[1].childNodes[1].innerText + "':'" +
      elements[i].childNodes[3].childNodes[1].innerText + "',\n";
  }
  return result;
})();

Which generates exactly what I want!  I paste the resulting string into a new file and try it out - but it's not working.  For some reason, the lookup on UID is not matching.  I look a bit closer and notice that the values in the HTML have some non-printable characters in them:

1.2.840.10008.5.1.4.1.&#8203;1.&#8203;2
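The &#8203; entity is a zero-width space (U+200B), so the scraped UID looks identical on screen but is a different string.  Here is a small sketch of why the lookup misses.  The one-entry uidNames table is made up for illustration; it is not my real lookup table.

```javascript
// UID as scraped from the HTML: U+200B (zero-width space) hides after two of the dots.
var scraped = '1.2.840.10008.5.1.4.1.\u200B1.\u200B2';
// The same UID as it would be read from a DICOM file.
var clean = '1.2.840.10008.5.1.4.1.1.2';

// Hypothetical one-entry lookup table keyed by the scraped string.
var uidNames = {};
uidNames[scraped] = 'CT Image Storage';

// The scraped key carries two extra invisible characters...
var lengthDifference = scraped.length - clean.length;
// ...so looking up the clean UID finds nothing.
var lookupResult = uidNames[clean];

console.log(lengthDifference, lookupResult);
```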

I make another change to my Javascript to strip out the non-printable characters:

(function () {
  var elements = document.querySelectorAll('#table_A-1 ~ div tbody tr');
  var result = "";
  for (var i = 0; i < elements.length; i++) {
    result += "'" + elements[i].childNodes[1].childNodes[1].innerText.replace(/[^\x20-\x7E]+/g, '') + "':'" +
      elements[i].childNodes[3].childNodes[1].innerText.replace(/[^\x20-\x7E]+/g, '') + "',\n";
  }
  return result;
})();
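The cleanup step can also be pulled out into a small helper, which makes it easy to check outside the browser.  This is only a sketch: stripNonPrintable and toEntry are names I made up for illustration, and the sample row is hard-coded rather than read from the page.

```javascript
// Hypothetical helper: keep only printable ASCII (space through tilde).
function stripNonPrintable(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

// Hypothetical helper: format one UID/name pair as a line of an object literal,
// in the same shape the console snippet produces.
function toEntry(uid, name) {
  return "'" + stripNonPrintable(uid) + "':'" + stripNonPrintable(name) + "',\n";
}

// Sample input with the hidden zero-width spaces still in place.
var entry = toEntry('1.2.840.10008.5.1.4.1.\u200B1.\u200B2', 'CT Image Storage');
console.log(entry);
```

One caveat with this approach: a name that contains a single quote would break the generated line, and any legitimate non-ASCII character in a name would be stripped along with the invisible ones.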

And now I have the data I want!  Here is a link to the resulting javascript.  Pretty cool little hack demonstrating the power of what you can do with Javascript in a web browser.  This same strategy can be used to quickly extract data from any web page into any format you want.
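As a rough sketch of how the same strategy might generalize, the rows can be collected into an object and serialized with JSON.stringify instead of concatenating strings by hand.  The rowsToObject helper and the fakeRows stand-in data are assumptions for illustration only; in the browser you would pass in the result of document.querySelectorAll, and the column indexes depend on the table you are scraping.

```javascript
// Hypothetical helper: turn table rows into a { key: value } object.
// `rows` is any array-like of <tr> elements (or objects shaped like them).
function rowsToObject(rows, keyColumn, valueColumn) {
  // Drop anything outside printable ASCII, then trim stray whitespace.
  var clean = function (text) {
    return text.replace(/[^\x20-\x7E]+/g, '').trim();
  };
  var result = {};
  for (var i = 0; i < rows.length; i++) {
    var cells = rows[i].cells;
    result[clean(cells[keyColumn].textContent)] = clean(cells[valueColumn].textContent);
  }
  return result;
}

// Stand-in rows so the sketch also runs outside a browser.
// The placement of the zero-width spaces here is illustrative.
var fakeRows = [
  { cells: [{ textContent: '1.2.840.10008.1.\u200B1' }, { textContent: 'Verification SOP Class' }] },
  { cells: [{ textContent: '1.2.840.10008.1.\u200B2' }, { textContent: 'Implicit VR Little Endian' }] }
];

var json = JSON.stringify(rowsToObject(fakeRows, 0, 1), null, 2);
console.log(json);
```

Letting JSON.stringify do the quoting sidesteps the problem of names that contain quote characters.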