Monthly Archives: April 2007

XML parsing – python

XML is a great way to organize information. I first learnt of the power of XML to systematize information when I used it to output a whole bunch of search results from NCBI in the TinySeq XML format. Once I had this XML document, I could read it into Excel and analyze the information very easily, since it was nicely laid out as an Excel sheet.

Backpackit, a service I use to take notes detailing my experimental research results, outputs all of the account data in XML format. Before I can move this data elsewhere, it helps to understand the data structure. So the first task I set myself was to parse the XML output.

I decided to use Python for this, because I felt using Java here would be like using an elephant to crush a fly (or whatever the expression is). Also, a lot of the data is text, and I have always used Perl to handle text. So a general basis for my codeitch will be: what I did in Perl before, I would like to do in Python now. Java will be reserved for more heavyweight tasks.

What I needed my program to do was:

  1. Read the XML output
  2. Create objects for each element or node in the output

I can then imagine that once I have these objects I can ask questions like how many objects have embedded images, how many objects have outgoing links, and so on.
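As a rough sketch of what such questions could look like in code, the snippet below counts pages that contain at least one image or link element. The sample XML, the tag names (`page`, `note`, `img`, `a`) and the helper `count_pages_containing` are made up for illustration; the real Backpackit export may be structured quite differently.

```python
from xml.dom import minidom

# Hypothetical sample; the real Backpackit export schema may differ.
SAMPLE = """<pages>
  <page title="Gel run 12"><note>Result <img src="gel12.png"/></note></page>
  <page title="Reading list"><note><a href="http://example.org/paper">paper</a></note></page>
  <page title="Plain notes"><note>No media here.</note></page>
</pages>"""


def count_pages_containing(doc, page_tag, inner_tag):
    """Count page_tag elements that have at least one inner_tag descendant."""
    return sum(
        1
        for page in doc.getElementsByTagName(page_tag)
        if page.getElementsByTagName(inner_tag)
    )


doc = minidom.parseString(SAMPLE)
pages_with_images = count_pages_containing(doc, "page", "img")
pages_with_links = count_pages_containing(doc, "page", "a")
print(pages_with_images, pages_with_links)
```

If the export stores note bodies as escaped text rather than nested elements, the body text would have to be parsed separately before a count like this works.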

The “Dive Into Python” book gave me a quick introduction to the xml.dom package. I then ran into some encoding or codec issues and learnt all about “utf8” and “iso8859” character encoding. Once I learnt how to handle the UnicodeEncodeError, I had a full-fledged three-line program that parsed my input file, created the document object and, as proof of successful parsing, printed my XML file back out.
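The original three lines are not reproduced in this post, so what follows is only a minimal sketch of the same idea in present-day Python 3, using `xml.dom.minidom`. The function name `roundtrip` and the sample note are assumptions for illustration. Serializing to UTF-8 bytes and writing them to a binary stream is one way to sidestep a UnicodeEncodeError, since no default codec ever has to encode the non-ASCII characters.

```python
import io
from xml.dom import minidom


def roundtrip(xml_bytes, out_stream):
    """Parse XML bytes into a DOM document and write it back out as UTF-8 bytes."""
    doc = minidom.parseString(xml_bytes)
    # toxml(encoding=...) returns bytes, so the output never passes
    # through the terminal's default text codec.
    out_stream.write(doc.toxml(encoding="utf-8"))
    return doc


# Hypothetical note with non-ASCII characters (e-acute and the micro sign).
sample = '<note title="caf\u00e9">5 \u00b5l loaded</note>'.encode("utf-8")

buffer = io.BytesIO()
doc = roundtrip(sample, buffer)
output = buffer.getvalue()
```

To print to a terminal instead of an in-memory buffer, `sys.stdout.buffer` can be passed as `out_stream`.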

The screencast above documents my travails.


Moving Data from Backpackit to Jotspot

First, a tangent: in order to reduce desktop clutter and focus on the task at hand, I have decided to experiment with “A distraction free Desktop” as screencast by Jon Udell. His accompanying blog post pointed me to a great Mac utility to clean up my desktop, and also to iTerm, which I believe is a very good alternative to the Terminal app offered by Apple.

One of the first tasks I want to focus on is organizing my data from Backpackit to enable an impending move to Jotspot. There are several issues to be handled.

  • First, a lot of my image and file links are hosted on my university account. I need to keep track of those files and links, as I plan to consolidate all of them and store them in one place.
  • A lot of my text is free-form. I want to give it more structure so that I can make better sense of it. For example, an entire experiment is detailed in one long paragraph instead of a more organized Goal, Method, Conclusion type of structure.
  • My image links are quite “stupid” and cannot be queried in any smart way. For example, all my protein gel images have random file names. I have to come up with a smart link or file naming system to bring more order and query-ability to the gel images.
  • A significant number of my graphs and plotted data are embedded as PNG or JPEG files. I need to plot data dynamically using JavaScript or Flash. Also, JPEG plots are not query-able!
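To make the file naming idea a little more concrete, here is one possible sketch. The scheme `gel_<protein>_<YYYYMMDD>_r<run>.<ext>` and the helpers `make_gel_name` and `parse_gel_name` are purely hypothetical, not a system I have settled on. The point is only that a structured name can be generated and parsed back, which makes the images query-able.

```python
import re
from datetime import date

# Hypothetical naming scheme: gel_<protein>_<YYYYMMDD>_r<run>.<ext>
NAME_RE = re.compile(
    r"^gel_(?P<protein>[A-Za-z0-9]+)_(?P<day>\d{8})_r(?P<run>\d+)\.(?P<ext>png|jpe?g)$"
)


def make_gel_name(protein, day, run, ext="png"):
    """Build a structured file name from its parts."""
    return f"gel_{protein}_{day:%Y%m%d}_r{run:02d}.{ext}"


def parse_gel_name(name):
    """Return the parts of a structured name, or None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    day = match.group("day")
    return {
        "protein": match.group("protein"),
        "day": date(int(day[:4]), int(day[4:6]), int(day[6:])),
        "run": int(match.group("run")),
        "ext": match.group("ext"),
    }


names = [
    make_gel_name("actin", date(2007, 4, 12), 1),
    make_gel_name("tubulin", date(2007, 4, 13), 2),
    "IMG_0042.jpg",  # an old, unstructured name
]

# Query: which files are actin gels?
actin_gels = []
for name in names:
    parts = parse_gel_name(name)
    if parts is not None and parts["protein"] == "actin":
        actin_gels.append(name)
print(actin_gels)
```

Old files with random names simply fail to parse, so they can be listed separately and renamed by hand.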

Codeitch, or the desire to codify and systematize everything I do

This blog is roughly about my attempts to codify, algorithmize and systematize everything I do. In it I will hopefully detail my march towards coding and becoming proficient in a bunch of computer languages. After a long process of looking around, I have narrowed my focus to the following three languages, in no particular order:
Java, Python and JavaScript.

The reasons for these will hopefully emerge as I begin posting, but I will try to spell them out here.

Java: I like Java for two reasons. It is one of the most widely used languages in the enterprise space, and the second and very important reason is the Java IDEs. Both Java IDEs I use, namely Eclipse and NetBeans, are free and amazingly featured. The code prompting available in both IDEs makes mastering an API a lot easier than learning the same functionality in other languages or platforms. Also, I love Javadoc! It really makes picking up new APIs a little easier.

Python: My first crack at automating anything came with Perl scripting. I will not lie: if I have to do anything today, I will first reach for Perl. But after several years of Perl use I found I was re-using very little code. I have to get more object oriented in the way I code, and I never quite got the hang of Perl objects and its namespace conventions! Python, which is at its heart an object-oriented scripting language with libraries that easily rival Java's, was a natural scripting alternative. Learning Python, I hope, will teach me how to script smart objects that will beg to be reused.

JavaScript: This is a surprising bedfellow to my codeitch. I want to learn JavaScript simply because it is becoming very fashionable. Google Maps and Gmail have AJAX at their core, and JavaScript is the J in AJAX. Plus, I have always fancied having a web frontend to everything I do, and I am sure JavaScript will beg to be used when that happens.

The above are the three languages I want to master.

Apart from these, there are two platforms that I want to get comfortable with:

Excel: Everyone in the business world uses Excel. Spreadsheets were the PC's killer app, and Excel is, in my mind, Microsoft's great product. I have seen the amazing things you can do in Excel without writing a single line of code, and I want to learn to use its power.

Matlab: This platform from MathWorks is the bread and butter of engineering computation. I am anything but an engineer, but I have seen Matlab's power when it comes to simulations. A lot of the very academic questions that I have in my research could really benefit from learning the Matlab platform. And no, I am not fully convinced why exactly I need to use Matlab; maybe I will find a more concrete reason.

refs: My tumblr feed