<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Harry Mangalam&#039;s Weblog</title>
	<atom:link href="http://hjmangalam.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://hjmangalam.wordpress.com</link>
	<description>Just another WordPress.com weblog</description>
	<lastBuildDate>Sat, 25 Jun 2011 00:43:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='hjmangalam.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Harry Mangalam&#039;s Weblog</title>
		<link>http://hjmangalam.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://hjmangalam.wordpress.com/osd.xml" title="Harry Mangalam&#039;s Weblog" />
	<atom:link rel='hub' href='http://hjmangalam.wordpress.com/?pushpress=hub'/>
		<item>
		<title>scut &amp; cols &#8211; utilities to slice, dice, join, and view columnar data</title>
		<link>http://hjmangalam.wordpress.com/2009/09/16/scut-cols-utilities-to-slice-dice-join-and-view-columnar-data/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/16/scut-cols-utilities-to-slice-dice-join-and-view-columnar-data/#comments</comments>
		<pubDate>Wed, 16 Sep 2009 18:36:56 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Utilities]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=30</guid>
		<description><![CDATA[Introduction There is already a standard Linux utility called cut that has some of scut&#8217;s functionality bundled with every Linux distro (try man cut), however, it&#8217;s fairly stupid and has no regular expression (aka regex) capability. scut&#8217;s cut functions scut (code here) was written originally to find similar lines in 2 different files and output [...]]]></description>
			<content:encoded><![CDATA[
<hr />
<h2><a name="_introduction"></a>Introduction</h2>
<p>Every Linux distro already bundles a standard utility called <em>cut</em> (try <em>man cut</em>) that has some of <em>scut&#8217;s</em> functionality. However, it&#8217;s fairly stupid and has no <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expression</a> (aka regex) capability.</p>
<hr />
<h2><a name="_scut_8217_s_cut_functions"></a>scut&#8217;s cut functions</h2>
<p>scut <a href="http://moo.nac.uci.edu/~hjm/scut">(code here)</a> was originally written to find similar lines in 2 different files and write a combined output, sort of like the <a href="http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#9">join command</a>, but as I worked on it, I realized that I was reinventing an SQL engine. Since there were already great SQL engines available, I stopped work on that aspect of it (tho it still works), but it&#8217;s still a better cut than cut. The name is a contraction of <em>super-cut</em>, or perhaps <em>the util that does your &#8220;scut&#8221; work for you</em>.  Anyway, it&#8217;s designed to cut and re-order <em>columns</em> of data from a file. The original functions are still intact, so if you need to look for lines in one file that exist in another file, that should still work, and if <em>join</em> is unable to do what you need, you might want to try <em>scut</em>.</p>
<p>Consider for example, a long (11 million lines) gene expression data file named <em>OMG</em> that looks like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>chr10:100018403-100020903;CHR10FS100020403;1    chr10_0 100020403       100020452       0.01
chr10:100018403-100020903;CHR10FS100019203;1    chr10_1 100019203       100019256       0.22
chr10:100018403-100020903;CHR10FS100019503;1    chr10_2 100019503       100019559       0.08
chr10:100018403-100020903;CHR10FS100019903;1    chr10_3 100019903       100019952       0.66
chr10:100018403-100020903;CHR10FS100020203;1    chr10_4 100020203       100020259       0.26
chr10:100018403-100020903;CHR10FS100019803;1    chr10_5 100019803       100019853       0.50
chr10:100018403-100020903;CHR10FS100018703;1    chr10_6 100018703       100018752       0.03
chr10:100018403-100020903;CHR10FS100018403;1    chr10_7 100018403       100018466       -0.12
chr10:100018403-100020903;CHR10FS100018903;1    chr10_8 100018903       100018952       -1.05
chr10:100018403-100020903;CHR10FS100020303;1    chr10_9 100020303       100020364       -0.76
[continues for 11,353,343 lines]</pre>
</td>
</tr>
</table>
<p>and you wanted to break it on both <em>;</em> and whitespace (any contiguous run of spaces and tabs), writing fields <em>1 3 4</em> (numbered from 0) but in the order <em>3 1 4</em>, and have the output data fields separated by &#8220; ^ &#8221;, you could do this with scut in a single pass:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>scut --id1=';|\s+'  --od=' ^ ' --c1='3 1 4'  &lt; OMG</pre>
</td>
</tr>
</table>
<p>Out would spring:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>chr10_0 ^ CHR10FS100020403 ^ 100020403
chr10_1 ^ CHR10FS100019203 ^ 100019203
chr10_2 ^ CHR10FS100019503 ^ 100019503
chr10_3 ^ CHR10FS100019903 ^ 100019903
chr10_4 ^ CHR10FS100020203 ^ 100020203
chr10_5 ^ CHR10FS100019803 ^ 100019803
chr10_6 ^ CHR10FS100018703 ^ 100018703
chr10_7 ^ CHR10FS100018403 ^ 100018403
chr10_8 ^ CHR10FS100018903 ^ 100018903
chr10_9 ^ CHR10FS100020303 ^ 100020303
[etc]</pre>
</td>
</tr>
</table>
<p>The secret here is that scut allows the use of <a href="http://www.perl.com/doc/manual/html/pod/perlre.html">Perl Regular Expressions</a> that can encode just about any pattern. Regular Expressions are both fearsomely confusing and insanely powerful. <a href="http://en.wikipedia.org/wiki/Regular_expression">Wikipedia</a> has a good description.</p>
<p>Because <em>scut</em> can break data on arbitrary regexes, you can tame most text data in a single pass or, failing that, in multiple passes by piping the output of one operation into the input of another.</p>
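<p>For readers who think more easily in Python than in Perl regexes, here is a minimal sketch of the same split-and-reorder idea. It is not scut&#8217;s code: the function name <em>cut_columns</em> and its parameters are made up for illustration, and it assumes fields are numbered from 0, as in the example above.</p>

```python
import re

def cut_columns(line, in_delim=r";|\s+", columns=(3, 1, 4), out_delim=" ^ "):
    """Split one line on a regex and return the chosen fields, re-ordered."""
    # Hypothetical helper, not part of scut: fields are numbered from 0.
    fields = re.split(in_delim, line.strip())
    return out_delim.join(fields[i] for i in columns)

# First line of the OMG listing above.
sample = "chr10:100018403-100020903;CHR10FS100020403;1    chr10_0 100020403       100020452       0.01"
print(cut_columns(sample))
```

<p>Run on that first data line, the sketch prints the same text as the first line of the scut output shown above.</p>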
<p><a href="http://moo.nac.uci.edu/~hjm/scut_help.txt">scut --help</a> will give you the help pages, and the internal code is fairly clear, tho not tremendously well-documented.  It&#8217;s also not entirely free of bugs &#8211; some of the routines were written for specific tasks and were not generalized well, but I still find myself using it a lot.</p>
<p>Here&#8217;s another example of scut helping me find out the size distribution of all files in a directory:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>ls -l | scut --c1='4' |stats</pre>
</td>
</tr>
</table>
<p>(note that the GNU <em>ls</em> command shipped with Linux produces a slightly different format than other <em>ls</em> variants.  This one works with GNU <em>ls</em>).</p>
<p>The above command takes the output of ls -l, pipes it into <em>scut</em> which extracts field 4, then pipes that data into <a href="http://moo.nac.uci.edu/~hjm/stats">stats</a>, another Perl utility that generates descriptive stats of whatever goes into it.</p>
<p>It produces (with [my comments added]):</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>Sum       193109146  [total of 193MB)
Number    18 [in 18 files]
Mean      10728285.8888889  [mean size of 10.7MB]
Median    27761 [but obviously due to some large files]
Mode      FLAT
NModes    No # was represented more than once
Min       0
Max       73690201 [here's a 73.7MB  reason for the skew]
Range     73690201
Variance  401893120826683  [and the rest]
Std_Dev   20047272.1542529
SEM       4725187.36152149
Skew      2.14197169376939
Std_Skew  3.71000380198295
Kurtosis  2.96898153032708</pre>
</td>
</tr>
</table>
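<p>To make the numbers above less mysterious, here is a rough Python sketch of how a few of them can be computed. It is not the <em>stats</em> utility itself: the function name <em>describe</em> and the file sizes are made up, and it assumes the sample (n - 1) standard deviation. The SEM is the standard deviation divided by the square root of the count, which agrees with the Std_Dev and SEM values printed above for 18 files.</p>

```python
import math
import statistics

def describe(values):
    """Return a few of the descriptive statistics that 'stats' reports."""
    # Illustrative only: 'describe' is a made-up name, not the stats utility.
    n = len(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "Sum": sum(values),
        "Number": n,
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        "Min": min(values),
        "Max": max(values),
        "Range": max(values) - min(values),
        "Std_Dev": std_dev,
        "SEM": std_dev / math.sqrt(n),
    }

# Made-up file sizes in bytes; one large file drags the mean above the median.
sizes = [0, 120, 4096, 27761, 73690201]
for name, value in describe(sizes).items():
    print(f"{name:8s}  {value}")
```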
<hr />
<h2><a name="_scut_8217_s_join_feature"></a>scut&#8217;s join feature</h2>
<p>The <em>join</em> feature of <em>scut</em> works a little differently than does <em>join</em> itself. For a good example of how join works, see this IBM DeveloperWorks tutorial called <a href="http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#9">Simplify data extraction using Linux text utilities</a>.</p>
<p>In <em>scut</em>, like <em>join</em>, you have to specify 2 filenames, the key fields, and what fields you want output, but with scut, you can specify different input and output delimiters (which can be multicharacter instead of single character), the fields in any order, and a host of other options.</p>
<p>If we have a file of needles (called <em>needles</em>) that need to be found in a haystack file (called <em>haystack</em>), and <em>needles</em> looks like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>apbA
carA+pyrA
cdsA+cds
codA
ddlA
dinJ
dnaX+dnaZ
fepB
fhuB
fixC
folK
&lt;etc&gt;</pre>
</td>
</tr>
</table>
<p>and the <em>haystack</em> file looks like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>      1     0                 1                2            3            4            5
      2     b              gene               EC         C1.0         C1.1         C2.0
      3 b0001              thrL      0.000423101   0.00046544  0.000429262  0.000433869 ...
      4 b0002  thrA+thrA1+thrA2  1.1.1.3+2.7.2.4  0.001018277  0.001268078  0.001312524 ...
      5 b0003              thrB         2.7.1.39  0.000517967  0.000457605  0.000582354 ...
      6 b0004              thrC         4.2.99.2  0.000670075  0.000558063  0.000789501 ...
      &lt;etc&gt;</pre>
</td>
</tr>
</table>
<p>and you wanted to match the entries in the single column of the <em>needles</em> file with the <em>gene</em> field in the <em>haystack</em> file, and you wanted the output to include:</p>
<ul>
<li> haystack columns 1, 0, 2, 5 in that order </li>
<li> the output delimiter being <em>|</em> instead of &lt;TAB&gt; </li>
<li> written to a file called <em>hits</em> </li>
</ul>
<p>the command to do so would be:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>scut --f1='needles' --f2='haystack' --k1=0 --k2=1  --c2='1 0 2 5'  --od='|' &gt;hits</pre>
</td>
</tr>
</table>
<p>and the <em>hits</em> file would look like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>apbA|b0425|1.55E-05|1.34E-05|
carA+pyrA|b0032|6.3.5.5|0.000155796|
cdsA+cds|b0175|2.7.7.41|2.91E-05|
codA|b0337|3.5.4.1|3.20E-05|
ddlA|b0381|6.3.2.4|0|
dinJ|b0226|0|0|
dnaX+dnaZ|b0470|2.7.7.7|4.48E-05|
&lt;etc&gt;</pre>
</td>
</tr>
</table>
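<p>The matching logic is easier to see in a few lines of Python. This is only a sketch of the idea, not scut&#8217;s implementation: <em>join_on_key</em> is a made-up name, the rows are borrowed from the haystack listing above, the needles are invented so that they actually occur in those rows, and fields are split on any whitespace and numbered from 0.</p>

```python
def join_on_key(needles, haystack_rows, key_col=1, out_cols=(1, 0, 2, 5), out_delim="|"):
    """Yield re-ordered haystack rows whose key column matches a needle."""
    # Hypothetical helper, not scut's code; columns are numbered from 0.
    wanted = set(needles)
    for row in haystack_rows:
        fields = row.split()
        if fields[key_col] in wanted:
            yield out_delim.join(fields[i] for i in out_cols)

# Rows borrowed from the haystack listing above (line numbers and '...' dropped).
haystack = [
    "b0001 thrL 0.000423101 0.00046544 0.000429262 0.000433869",
    "b0003 thrB 2.7.1.39 0.000517967 0.000457605 0.000582354",
    "b0004 thrC 4.2.99.2 0.000670075 0.000558063 0.000789501",
]
needles = ["thrB", "thrC"]  # made-up needles that occur in those rows
for hit in join_on_key(needles, haystack):
    print(hit)
```

<p>Putting the needles in a set keeps each lookup cheap, which matters when the haystack has millions of lines.</p>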
<p>I&#8217;ve just modified scut to handle column specifiers more efficiently. The syntax is spelled out in the help section, i.e.:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>      --c1='# # ..'   - the numbers of the columns from file1 that you want
                         printed out in the order in which you want them.  If
                         you DON'T want any columns from the file, just
                         omit the --c1 option completely.
                         If you want the whole line, type --c1='ALL'.

                         You can also use discontinuous ranges like '2:4 8:10'
                         to print [2 3 4 8 9 10] and decreasing ranges like
                         '8:4' to print cols [8 7 6 5 4].  You can also negate
                         columns to remove them from a larger range '9:12 -11'
                         to print [9 10 12] or 12:1 -7:-4 to print
                         [12 11 10 9 8 3]. You can also use the 'ALL' keyword
                         to print all cols and negate the ones you don't
                         want with negative ranges - 'ALL -8:-14' to print all
                         columns EXCEPT 8-14.

                         Notes:
                         1) #s are split on whitespace, not commas.
                         2) scut also supports Excel-style column specifiers such as:
         or
      --c1='A C F ..'    (A B F AD BG etc) for up to 78 columns (-&gt;BZ)  If you want
                         more, add them to the %excel_ids hash in the code or create an
                         algorithm that does it right.</pre>
</td>
</tr>
</table>
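<p>Here is one way the numeric range syntax could be expanded, written as a small Python sketch. It reflects my reading of the help text rather than scut&#8217;s own parser: <em>expand_columns</em> and its <em>ncols</em> parameter are made up, the Excel-style letters are left out, and the &#8216;ALL&#8217; keyword is assumed to mean columns 0 to ncols - 1.</p>

```python
def expand_columns(spec, ncols=None):
    """Expand a column spec such as '2:4 8:10' or '9:12 -11' into a list."""
    # Sketch of the syntax described in the help text, not scut's own parser.
    keep, drop = [], set()
    for token in spec.split():          # specs are split on whitespace
        if token == "ALL":
            if ncols is None:
                raise ValueError("'ALL' needs the number of columns")
            keep.extend(range(ncols))   # assumption: columns are 0-based
            continue
        negated = token.startswith("-")
        ends = [abs(int(part)) for part in token.split(":")]
        start, stop = ends[0], ends[-1]
        step = -1 if start > stop else 1
        cols = range(start, stop + step, step)
        if negated:
            drop.update(cols)
        else:
            keep.extend(cols)
    return [col for col in keep if col not in drop]

print(expand_columns("2:4 8:10"))
print(expand_columns("8:4"))
print(expand_columns("9:12 -11"))
```

<p>The three calls return the same column lists that the help text quotes for those specs.</p>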
<hr />
<h2><a name="_the_cols_utility"></a>The cols utility</h2>
<p>In my work, the input to <em>scut</em> is often a horrendously wide data file that overflows even the widest terminal screens with the tiniest fonts. And even if it didn&#8217;t overflow the terminal, when the data is printed to the terminal, the columns are almost always mismatched due to tab skips (if a field exceeds a tab boundary, it will skip to the next tab). Trying to figure out if a floating point number is the 21st or 22nd field can be trying if you have to do it all day.</p>
<p><em>cols</em> <a href="http://moo.nac.uci.edu/~hjm/cols">(code here)</a> was developed to take such very &#8220;wide&#8221; files that have 10s of columns and present them in a terminal window so that you don&#8217;t have to import them into a GUI spreadsheet app to check that the parsing operation has gone well.  The output of the <em>scut</em> operation is typically piped to <em>cols</em> and then to <em>less -S</em> to promote easier viewing of long lines.</p>
<p>An example is probably the best way to describe this. The file <em>DataSet</em> is a long file of TAB-separated identifiers and data that has a header line of column IDs prefixed with a hash mark to mark it as a comment in the standard unixy way.</p>
<p>Would you rather try to decipher the column values from this output :</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>      $ head  -15 DataSet | less -S
          1 #b      gene    EC      C1.0    C1.1    C2.0    C2.1    C3.0    C3.1    E1.0    E1.1    E2.0    E2.1
          2 b0001   thrL    0.000423101     0.00046544      0.000429262     0.000433869     0.000250998     0.000
          3 b0002   thrA+thrA1+thrA2        1.1.1.3+2.7.2.4 0.001018277     0.001268078     0.001312524     0.001
          4 b0003   thrB    2.7.1.39        0.000517967     0.000457605     0.000582354     0.000640462     0.000
          5 b0004   thrC    4.2.99.2        0.000670075     0.000558063     0.000789501     0.000801508     0.000
          6 b0005   0       0       0       2.64E-07        0       0       0       0       0       0       0
          7 b0006   yaaA    0       0       0       0       0       0       0       0       0       0       0
          8 b0007   yaaJ    8.52E-06        8.87E-06        1.54E-05        2.74E-05        0       0       0
          9 b0008   talB    2.2.1.2 0.001160911     0.00118164      0.001263549     0.001345351     0.001103703
         10 b0009   mog+chlG        1.87E-05        1.91E-05        1.95E-05        1.70E-05        0       0
         11 b0010   yaaH    0       0       0       0       0       0       0       0       0       0       0
         12 b0011   0       0       0       0       0       0       0       0       0       2.86E-05        0</pre>
</td>
</tr>
</table>
<p>or from this output?</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>     $ head  -15 DataSet | cols | less -S
          1     0                 1                2            3            4            5            6
          2    #b              gene               EC         C1.0         C1.1         C2.0         C2.1
          3 b0001              thrL      0.000423101   0.00046544  0.000429262  0.000433869  0.000250998  0.00026
          4 b0002  thrA+thrA1+thrA2  1.1.1.3+2.7.2.4  0.001018277  0.001268078  0.001312524  0.001398845  0.00078
          5 b0003              thrB         2.7.1.39  0.000517967  0.000457605  0.000582354  0.000640462   0.0003
          6 b0004              thrC         4.2.99.2  0.000670075  0.000558063  0.000789501  0.000801508  0.00055
          7 b0005                 0                0            0     2.64E-07            0            0
          8 b0006              yaaA                0            0            0            0            0
          9 b0007              yaaJ         8.52E-06     8.87E-06     1.54E-05     2.74E-05            0
         10 b0008              talB          2.2.1.2  0.001160911   0.00118164  0.001263549  0.001345351  0.00110
         11 b0009          mog+chlG         1.87E-05     1.91E-05     1.95E-05     1.70E-05            0
         12 b0010              yaaH                0            0            0            0            0</pre>
</td>
</tr>
</table>
<p>(output of both is from the less pager to avoid wrap artefacts.)</p>
<p>They both look nice and columnar, but if you look closely, you&#8217;ll see some variants that will drive you straight to the Advil after 30 minutes of trying to figure it out, especially if you&#8217;ve side-scrolled enough to miss one of the tab-skips. For example: most of the gene names are short enough to fit in the tab space, but in line 3 of the top listing, <em>thrA+thrA1+thrA2</em> is wide enough to cause a tab-skip, which then throws off the rest of the line. Similarly, in line 9 of the top listing, does the value <em>0.001345351</em> belong to the <em>C3.0</em> or to the <em>C3.1</em> column? Actually, it belongs to the <em>C2.1</em> column, as shown in the second listing, which has columnized it correctly.</p>
<p>Also note that the lower one has column headers inserted (line 1) which give you the 0-based column count (change to 1-based numbering with --ch=1).</p>
<p>Note that this utility is meant for visualizing, not actual processing, except in edge cases, as the padding is all spaces. Also, the utility aborts after reading 22 lines, unless told not to.</p>
<p>Here is the help output from <em>cols --help</em>:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>    cols is a small Perl-based utility to view columns of data to help
    programmers check that the columns correspond to what they want.
    It strips tabs from the input and pads columns with spaces so it's
    NOT meant to be used as a pipeline processing tool, only as a checking
    tool.

    usage: pipe or redirect X-delimited tabular data (where X=TAB by default,
    but can be set to any Perl regex) to 'cols' with the following options:

    --mw=#  set max width to this many chars or the max per-col width if smaller.
            Defaults to 20.

    --ml=#  process this many lines of input. (Defaults to 22)

    --ch=#  add a line of column headers (starting at # - defaults to 0)
             to the output to tell where you are in very wide output
             (very useful)

    --delim=s the delimiter to use to split the fields (Defaults to TAB)
              Use 'ws' for whitespace (but you can use '\s+' if you want).

    --help  dumps this help

    Pipe output to 'less -S' to view long lines without wrap and arrow keys to
    scroll around.

    example:
              cols --mw=11 --ch=1 &lt;huge.data |less -S</pre>
</td>
</tr>
</table>
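<p>The core trick is simple enough to sketch in Python: find the widest cell in each column, cap it at a maximum width, and right-justify everything under a row of column numbers. This is an illustration, not the cols source. The name <em>columnize</em> and its parameters are made up (loosely mirroring --delim, --mw and --ch), and cutting over-long cells down to the maximum width is my assumption about what the width cap means.</p>

```python
def columnize(lines, delim="\t", max_width=20, first_header=0):
    """Right-justify delimited fields under a row of column numbers."""
    # Illustrative sketch only; 'columnize' is a made-up name, not cols itself.
    rows = [line.rstrip("\n").split(delim) for line in lines]
    ncols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    rows = [row + [""] * (ncols - len(row)) for row in rows]
    header = [str(first_header + i) for i in range(ncols)]
    table = [header] + rows
    # Each column is as wide as its widest cell, capped at max_width.
    widths = [min(max_width, max(len(row[i]) for row in table))
              for i in range(ncols)]
    return ["  ".join(cell[:width].rjust(width)
                      for cell, width in zip(row, widths))
            for row in table]

data = [
    "#b\tgene\tEC\tC1.0",
    "b0002\tthrA+thrA1+thrA2\t1.1.1.3+2.7.2.4\t0.001018277",
    "b0003\tthrB\t2.7.1.39\t0.000517967",
]
for line in columnize(data):
    print(line)
```

<p>Because every cell in a column is padded to the same width, the long gene name can no longer push the rest of its line out of alignment.</p>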
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/16/scut-cols-utilities-to-slice-dice-join-and-view-columnar-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>
	</item>
		<item>
		<title>HOWTO convert a commandline FORTRAN program to a GUI Python program</title>
		<link>http://hjmangalam.wordpress.com/2009/09/14/howto-convert-a-commandline-fortran-program-to-a-gui-python-program/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/14/howto-convert-a-commandline-fortran-program-to-a-gui-python-program/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 18:19:35 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Linux & Open Source]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=20</guid>
		<description><![CDATA[v0.1, 01 Aug 2008 Note This HOWTO is still in beta This piece is still in progress. There are still some unfinished bits in both the Python wrapper and this HOWTO. The GUI is still unfinished, the database connectivity is still not fully described, and there are some typos (and probably a few thinko&#8217;s as [...]]]></description>
			<content:encoded><![CDATA[<p>v0.1, 01 Aug 2008</p>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>This HOWTO is still in beta</b></p>
<p>This piece is still in progress.  There are still some unfinished bits in both the Python wrapper and this HOWTO.  The GUI is still unfinished, the database connectivity is still not fully described, and there are some typos (and probably a few thinko&#8217;s as well). However, what&#8217;s written below is essentially correct.</p>
</td>
</tr>
</table>
<hr />
<h2><a name="_introduction"></a>Introduction</h2>
<p>FORTRAN has a reputation for being old, crufty, quite hard to use and stuck in some very old programming paradigms.  However, recent versions of FORTRAN include very modern abilities and many people are still using it, especially for pure number crunching, as the compilers are still among the best for doing so.  FORTRAN does not provide easy access to GUIs, relational databases, or methods for handling options (AFAIK &#8211; please correct), while many scripting languages, such as Python &amp; Perl, do.</p>
<p>This is how I converted a very sophisticated, but fairly UI-ugly (and hard-to-modify) FORTRAN program to one that uses <a href="http://en.wikipedia.org/wiki/Python_(programming_language)">Python</a> as the application glue.  Python was used to do the scut work of command-line user interface and configuration file management.  It was also used to add an optional GUI to it and record some usage to a relational database.  I used the <a href="http://www.scipy.org/F2py">f2py</a> module of the scientific Python package <a href="http://numpy.scipy.org/">numpy</a> to do the FORTRAN compilation and generation of the shared lib, and then used <a href="http://trolltech.com/products/qt/features/tools/designer">Qt-designer</a> and the <a href="http://trolltech.com/products/qt">Qt widget set</a> to <em>draw</em> the GUI and then <a href="http://www.riverbankcomputing.com/software/pyqt/intro">PyQt</a> to convert it to Python.  That sounds quite complex, but as you&#8217;ll see, it&#8217;s not, especially if you use a recent version of Linux, as all the required packages are available for free.</p>
<p>This was the 1st time I&#8217;ve used numpy for this and it worked much better than I had expected.  My previous experience had been with <a href="http://www.swig.org/">SWIG</a> (the Rosetta Stone for mixing computer languages), and while SWIG allows you to do tremendous magic, it was a slog to get it to work.  With numpy and f2py, it just worked.</p>
<p>The end result is a more easily maintainable program that separates the FORTRAN math engine from the user interface, provides more standard option-handling and configuration file capabilities, provides the (optional) GUI, and also adds reporting to a remote relational database for use and platform tracking.  All this in about 300 lines of code, which includes much debugging.</p>
<hr />
<h2><a name="_the_problem"></a>The Problem</h2>
<p>The initial problem was that a researcher had a great Magnetic Resonance (MR) analysis program for proteins that he wanted to be more user-friendly.  It was written in FORTRAN and ran quite efficiently, but it was difficult to use.</p>
<p>The last time I wrote anything in FORTRAN was in the 70&#8217;s but I got the code and figured out approximately what it did.  I then ran it thru a profiler (<a href="http://oprofile.sourceforge.net/news/">oprofile</a>) and was able to tell where it spent its time (90% in 1 nested set of functions):</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ opreport --exclude-dependent --demangle=smart --symbols /home/hjm/shaka/1D-Mangalam-py/fd_rrt1d.so
CPU: Core 2, speed 1667 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
1107170  61.4356  cqzhes_
345869   19.1919  cqzvec_
179710    9.9719  cqz_
144523    8.0194  fd_rrt1d_
6700      0.3718  _g95_exp_z8
4593      0.2549  _g95_power_z8_i4
4171      0.2314  umatrix1d_ms0_
3231      0.1793  fd_1d_
1336      0.0741  .plt
1091      0.0605  cqzval_
...</pre>
</td>
</tr>
</table>
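<p>The &#8220;90%&#8221; figure can be checked by adding up the percent column for the relevant symbols. The Python sketch below does that for the rows shown above. It assumes the nested set of functions is the family whose names start with <em>cqz</em>, which is my guess from the symbol names, and the helper name <em>percent_for_prefix</em> is made up.</p>

```python
# The first rows of the opreport listing above: samples, percent, symbol.
REPORT = """\
1107170  61.4356  cqzhes_
345869   19.1919  cqzvec_
179710    9.9719  cqz_
144523    8.0194  fd_rrt1d_
6700      0.3718  _g95_exp_z8
4593      0.2549  _g95_power_z8_i4
4171      0.2314  umatrix1d_ms0_
3231      0.1793  fd_1d_
1336      0.0741  .plt
1091      0.0605  cqzval_
"""

def percent_for_prefix(report, prefix):
    """Add up the percent column for symbols that start with 'prefix'."""
    total = 0.0
    for line in report.splitlines():
        samples, percent, symbol = line.split()
        if symbol.startswith(prefix):
            total += float(percent)
    return total

# Assumption: the 'nested set of functions' is the cqz* family.
print(round(percent_for_prefix(REPORT, "cqz"), 2))
```

<p>For these rows the cqz* symbols add up to roughly 90.7% of the samples, in line with the estimate above.</p>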
<p>I was initially going to try to improve the efficiency but for a number of reasons that was not a priority at this time.  Ease of use, ability of others to help improve it, and multi-platform ability were higher priority for the researcher.</p>
<p>Since I had some experience with Python, I decided that this would be a good time to try out the f2py functionality of numpy.</p>
<p>The <a href="http://moo.nac.uci.edu/~hjm/fd_rrt1d/fd_rrt1d_code.tgz">original FORTRAN code</a> I was given included 3 FORTRAN source files totalling 1880 lines and some associated configuration and support files.</p>
<hr />
<h2><a name="_converting_the_fortran_to_a_shared_lib"></a>Converting the FORTRAN to a shared lib</h2>
<p>The first thing that I did was to convert the FORTRAN main program to a subroutine so it could be called from Python. Since I wasn&#8217;t re-writing the application, I just needed a Python front-end to set everything up and then kick off the run by passing all the required variables to the native FORTRAN routines. This takes only a few lines of code &#8211; primarily to add the subroutine declaration with all the variables that are set from the calling Python:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>      subroutine fd_rrt1d(signal,theta,method,Nsig,wmino,wmaxo,par,
     &amp;     threshhold,ReSp,ImSp,AbsSp,rho,Nb0,Nbc,Nsp,Gamm,cheat,
     &amp;     cheatmore,ros)</pre>
</td>
</tr>
</table>
<p>That&#8217;s really all it took.  Apart from that single change, there were few changes to the FORTRAN code other than inserting some debugging variables and comments to myself to clarify the code a bit more.  Here&#8217;s a diff view:</p>
<p><img style="border-width:0;" src="http://hjmangalam.files.wordpress.com/2009/09/kompare_1d_s.jpg?w=450" alt="images/kompare_1d_s.jpg"></p>
<p>To compile the whole thing into a shared lib that can be called from Python took little more work:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>f2py --opt="-O3" -c -m fd_rrt1d --fcompiler=gnu95  --link-lapack_opt *.f</pre>
</td>
</tr>
</table>
<p>The above line uses the <a href="http://gcc.gnu.org/fortran/">gfortran</a> compiler (aka <strong>gnu95</strong>), which seems both to generate marginally faster code and to be more compatible with MacOSX than the <a href="http://www.g95.org">g95</a> compiler I 1st tried (g95 worked fine on Linux, but had problems on MacOSX due to API incompatibilities with numpy on MacOSX).  The end result of this command was a shared lib <strong>fd_rrt1d.so</strong> which is callable by subroutine name by both Python and FORTRAN (the FORTRAN code calls several subroutines spread over those 3 files).  That is one of the <em>magic</em> things about the numpy package; it all works the way it&#8217;s supposed to, hiding the considerable complexity underneath.</p>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Undefined symbols in library</b></p>
<p>One hiccup was that when I tried this on a different system that had a version of liblapack, I was able to compile the shared lib, but importing it from Python complained about an undefined symbol:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ ./1d.py
Traceback (most recent call last):
  File "./1d.py", line 27, in &lt;module&gt;
    from fd_rrt1d import *
ImportError: /home/hjm/shaka/1D-Mangalam-py/fd_rrt1d.so: undefined
symbol: zgemm_</pre>
</td>
</tr>
</table>
<p>Sure enough, nm reports it as undefined:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ nm fd_rrt1d.so |tail
000065b0 t string_from_pyobj
         U strlen@@GLIBC_2.0
         U strncpy@@GLIBC_2.0
0001df40 b u.1294
00013bee T umatrix1d_dms0_
000128c3 T umatrix1d_dms1_
00015458 T umatrix1d_ms0_
         U zgemm_            &lt;---
         U zgemv_
         U zgesv_</pre>
</td>
</tr>
</table>
<p>However, I installed a newer version of liblapack (<strong>liblapack-dev</strong>, from the Ubuntu 8.04 tree), recompiled, and that seems to have addressed the issue, even tho the previously offending symbols are <em>still undefined</em>. No, I don&#8217;t understand this.</p>
</td>
</tr>
</table>
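<p>If you want to pull the undefined entries out of a long nm listing, a few lines of Python will do it. The sketch below uses a handful of lines copied from the nm output above and a made-up helper name, <em>undefined_symbols</em>. Keep in mind that type U only says the symbol has to be supplied by some other library at load time, so seeing zgemm_ listed as U in a shared lib is expected whether or not the import works.</p>

```python
# A few lines from the nm output above (the arrow annotation is left out).
NM_OUTPUT = """\
000065b0 t string_from_pyobj
         U strlen@@GLIBC_2.0
0001df40 b u.1294
00015458 T umatrix1d_ms0_
         U zgemm_
         U zgemv_
         U zgesv_
"""

def undefined_symbols(nm_text):
    """Return the symbols that nm marks with type 'U' (undefined)."""
    names = []
    for line in nm_text.splitlines():
        fields = line.split()
        # Undefined entries have no address, so the type letter comes first.
        if len(fields) == 2 and fields[0] == "U":
            names.append(fields[1])
    return names

print(undefined_symbols(NM_OUTPUT))
```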
<p>I wrote a skeleton Python program that assigned the variables and called the FORTRAN code. Astonishingly, it worked on the 1st try, so I continued to expand the skeleton to add the commandline option-handling.</p>
<hr />
<h2><a name="_commandline_option_handling"></a>Commandline option-handling</h2>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Option Handling</b></p>
<p>There are 4, count &#8216;em, 4 ways of setting options in the 1D app.  The easiest is to set nothing, which causes the internal, hard-coded defaults to be used. If there is a configuration file, the values set in that file will override the defaults.  The variables that are not set in that file will use the defaults.  If you set values from the commandline (--wmaxo=4000), those will override those set from the config file as well as the defaults. Finally, those variables that are set from the GUI have the highest precedence. There&#8217;s a bit of logic code that determines all that, but it&#8217;s not complicated.</p>
</td>
</tr>
</table>
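<p>The precedence rules in the note above can be sketched as a simple layered merge. This is a minimal illustration, not the code from the 1D app: the function <strong>resolve_options</strong>, the <strong>guicfg</strong> dictionary and the sample values are assumptions made for the example (<strong>clcfg</strong> and <strong>fcfg</strong> are the dictionary names used elsewhere in this post), and it is written for a current Python 3 rather than the Python 2 of the original listings.</p>

```python
def resolve_options(defaults, from_file=None, from_cli=None, from_gui=None):
    """Merge option sources; each later source overrides the earlier ones.

    Precedence, lowest to highest: defaults, config file, commandline, GUI.
    """
    merged = dict(defaults)                 # start from the hard-coded defaults
    for layer in (from_file, from_cli, from_gui):
        if layer:                           # a source that was not used is skipped
            merged.update(layer)
    return merged


# Illustrative values only (hypothetical, not taken from the real app).
defaults = {"wmino": -9000, "wmaxo": 4500, "theta": 1.5708}
fcfg = {"wmaxo": 4000}                      # set in the configuration file
clcfg = {"wmaxo": 3000, "theta": 1.5}       # set on the commandline
guicfg = {"theta": 1.0}                     # set in the GUI

final = resolve_options(defaults, fcfg, clcfg, guicfg)
print(final)
```

<p>Here <strong>wmino</strong> keeps its default, <strong>wmaxo</strong> takes the commandline value because it outranks the config file, and <strong>theta</strong> takes the GUI value because the GUI outranks everything else.</p>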
<p>Python has a standard way of handling commandline options via its <strong>getopt</strong> module.  It&#8217;s probably not the best, but there&#8217;s a lot to be said for doing it in a semi-standard way.  It&#8217;s also very easy to implement.</p>
<p>The following is the entire option-handling code for the MANY options that the program supports.  It reads in any defined commandline option, then either calls a function (such as <strong>--gui</strong>) or does some munging of the variable (<strong>--wmaxo</strong>) and sticks it in a <a href="http://www.diveintopython.org/getting_to_know_python/dictionaries.html">dictionary</a> (aka hash) for easy lookup and passing to other functions.  If the option is unrecognized, it just calls the usage() function to let the user figure out the error of his ways.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>import getopt
import sys
 ...
try:
    opts, args = getopt.getopt(sys.argv[1:], 'hD', ['help', 'debug', 'help1d', 'gui', 'nodb', 'paramfile=', 'signal=', 'theta=', 'method=', 'Nsig=', 'wmino=', 'wmaxo=', 'par=', 'threshhold=', 'ReSp=', 'ImSp=', 'AbsSp=', 'rho=', 'Nb0=', 'Nbc=', 'Nsp=', 'Gamm=', 'cheat=', 'cheatmore=', 'ros='])
except getopt.GetoptError:
    # print help information and exit:
    print "There was an error specifying an option flag.  Here's the correct usage:"
    usage(1)
# set up the options required
for opt, arg in opts:
    if opt in ('-h', '--help'):   usage(1)
    elif opt in ('-D', '--debug'):      DEBUG = 1
    elif opt in ('--help1d'):     usage1d(1)
    elif opt in ('--gui'):        gui()
    elif opt in ('--nodb'):       USEDB = 0
    elif opt in ('--paramfile'):  paramfile = arg
    elif opt in ('--signal'):     clcfg['signal'] = arg         # file name
    elif opt in ('--theta'):      clcfg['theta'] = float(arg)   # float
    elif opt in ('--method'):     clcfg['nmr_method'] = arg         # FDM, RRT or DFT
    elif opt in ('--Nsig'):       clcfg['Nsig'] = int(round(float(arg)))      #int
    elif opt in ('--wmino'):      clcfg['wmino'] = int(round(float(arg)))   #int
    elif opt in ('--wmaxo'):      clcfg['wmaxo'] = int(round(float(arg)))   #int
    elif opt in ('--par'):        clcfg['par'] =  arg           # linelist output file
    elif opt in ('--threshhold'): clcfg['threshhold'] = float(arg)
    elif opt in ('--ReSp'):       clcfg['ReSp'] = arg           # file name
    elif opt in ('--ImSp'):       clcfg['ImSp'] = arg           # file name
    elif opt in ('--AbsSp'):      clcfg['AbsSp'] = arg          # file name
    elif opt in ('--rho'):        clcfg['rho'] = float(arg)   #
    elif opt in ('--Nb0'):        clcfg['Nb0'] = int(round(float(arg)))   #
    elif opt in ('--Nbc'):        clcfg['Nbc'] = int(round(float(arg)))   #
    elif opt in ('--Nsp'):        clcfg['Nsp'] = int(round(float(arg)))   #
    elif opt in ('--Gamm'):       clcfg['Gamm'] = float(arg)   #
    elif opt in ('--cheat'):
        clcfg['cheat'] = float(arg)   # float
        if clcfg['cheat'] not in (0, 1):   # the original "!= 1 or != 0" test was always true
            print &gt;&gt; sys.stderr, "cheat must be '1' or '0'"
            sys.exit(1)
    elif opt in ('--cheatmore'):
        clcfg['cheatmore'] = arg   # T or F
        if clcfg['cheatmore'] not in ('T', 'F'):   # the original "!= 'T' or != 'F'" test was always true
            print &gt;&gt; sys.stderr, "cheatmore must be 'T' or 'F'"
            sys.exit(1)
    elif opt in ('--ros'):        clcfg['ros'] = float(arg)   #</pre>
</td>
</tr>
</table>
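<p>Since most of the branches above differ only in the dictionary key and the type conversion, the same idea can be expressed with a lookup table. The following is a sketch under stated assumptions, not the code shipped with the app: <strong>CONVERTERS</strong>, <strong>to_int</strong> and <strong>parse_args</strong> are hypothetical names, only a handful of the options are included, and it is written for Python 3. It uses only the standard <strong>getopt.getopt()</strong> call.</p>

```python
import getopt


def to_int(text):
    """Accept '4000' as well as '4e3', like int(round(float(arg))) above."""
    return int(round(float(text)))


# long option name -> (dictionary key, converter); a subset of the real options
CONVERTERS = {
    "signal": ("signal", str),
    "theta": ("theta", float),
    "method": ("nmr_method", str),
    "Nsig": ("Nsig", to_int),
    "wmino": ("wmino", to_int),
    "wmaxo": ("wmaxo", to_int),
    "rho": ("rho", float),
}


def parse_args(argv):
    """Return a dict of converted values for the options found in argv."""
    long_opts = [name + "=" for name in CONVERTERS]   # '=' means "takes a value"
    opts, _args = getopt.getopt(argv, "", long_opts)
    clcfg = {}
    for opt, arg in opts:
        key, convert = CONVERTERS[opt[2:]]            # strip the leading '--'
        clcfg[key] = convert(arg)
    return clcfg


example = parse_args(["--wmaxo=4e3", "--theta", "1.5708", "--method=FDM"])
print(example)
```

<p>An unrecognized option still raises <strong>getopt.GetoptError</strong>, so the existing <strong>try/except</strong> wrapper that calls usage() would work unchanged around <strong>parse_args()</strong>.</p>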
<hr />
<h2><a name="_configuration_file_handling"></a>Configuration File Handling</h2>
<p>The original FORTRAN program supported a custom-written configuration file input of options that had this form:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>'2p-no-noise.txt'                               /signal
1.5708                                          /theta
'FDM'                                   /method (FDM/RRT/DFT)
4                                       /Nsig
-9000 4500                              /wmin wmax
'par',  1d-4                            /parameters, output threshhold
'fdm','none','none'                     /ReSpectrum,ImSpectrum,AbsSpectrum
1., 512, -20                            /rho, Nb0, Nbc
20000, 5d-2                             /Npower, Gamm
1 F                                     /cheat, cheatmore
1d-8                                    /ros</pre>
</td>
</tr>
</table>
<p>User-written FORTRAN code then parsed this to set the variables.    In providing the Python front-end, it was extremely easy to provide a more sophisticated way of doing this using the <a href="http://www.voidspace.org.uk/python/configobj.html">configobj</a> module which is designed to do just this.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>from configobj import ConfigObj  # for the configuration module
 ...
if paramfile != "": # If there's a param file named, try to get params from it.
    fcfg = ConfigObj(file(paramfile))  # reads all variables in file as strings

    # now have to coerce everything from the param file that is not a
    # string to the correct type
    # following members used to iterate over to coerce into int or float
    int_params = ("Nsig", "Nsp", "wmino", "wmaxo", "Nb0", "Nbc")
    float_params = ("cheat", "theta", "threshhold", "rho", "Gamm", "ros")

    for ip in int_params: fcfg[ip]=int(fcfg[ip])
    for fp in float_params: fcfg[fp]=float(fcfg[fp])</pre>
</td>
</tr>
</table>
<p>I had to add some logic to allow options entered at the commandline to override those in the configuration file, but essentially the code above was all that was needed to support a configuration file that allows key = value pairs that can be nested into arbitrary stanzas.</p>
<p>Here&#8217;s an extract of the config file showing assignment of strings, ints, and floats.  Note that all values are read as strings and have to be coerced into the appropriate type &#8211; see the coercion loops in the code above.  The config file included in the tarball includes a summary explanation of how the file is structured and the URL to the <a href="http://www.voidspace.org.uk/python/configobj.html">home page of the configobj module</a>.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre># signal is the file that contains the signal data; if no leading path, then it is
# assumed to be in the current directory.
signal = "2p-no-noise.txt"   # file containing the signal data
theta = 1.5708               # fl pt var returned as string, conv in wrapper code
method = "FDM"               # can be one of (FDM/RRT/DFT)
Nsig = 4                     # int var returned as string, conv in wrapper code</pre>
</td>
</tr>
</table>
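<p>One detail to watch when moving values over from the legacy parameter file shown earlier: FORTRAN double-precision literals such as <strong>1d-4</strong> or <strong>5d-2</strong> are not accepted by Python&#8217;s <strong>float()</strong>, which expects an <strong>e</strong> exponent. A minimal sketch of a converter follows; the helper name <strong>fortran_float</strong> is an assumption for illustration and is not part of the original wrapper.</p>

```python
def fortran_float(text):
    """Convert a FORTRAN-style real literal ('1d-4', '5D-2', '1.5708') to float."""
    # FORTRAN marks a double-precision exponent with 'd' or 'D'; Python wants 'e'.
    return float(text.strip().lower().replace("d", "e"))


values = [fortran_float(v) for v in ("1d-4", "5d-2", "1d-8", "1.5708")]
print(values)
```

<p>With a helper like this, <strong>float_params</strong> could be coerced with <strong>fortran_float</strong> instead of <strong>float</strong>, so old-style values keep working in the new config file.</p>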
<hr />
<h2><a name="_graphical_user_interface"></a>Graphical User Interface</h2>
<p>The addition of a GUI used to be the stuff of wizards and black arts.  It&#8217;s still not trivial but it&#8217;s considerably easier using the <strong>Designer</strong> approach, in which you use an application that allows you to drag control widgets to a canvas, arranging them as you like. I used Trolltech&#8217;s Qt widget library and their VERY easy-to-use Designer app to mock up an interface and then converted the interface XML description to Python using Riverbank&#8217;s PyQt toolkit.  Here&#8217;s a screenshot of the Qt4 Designer being used to design the fd_rrt1d GUI:</p>
<p><img style="border-width:0;" src="http://hjmangalam.files.wordpress.com/2009/09/designer-qt4_s.jpg?w=450" alt="images/designer-qt4_s.jpg"></p>
<p>After the UI is <em>drawn</em> and saved as <strong>GUI_1D.ui</strong> (an XML representation of the design), the UI file is converted to Python code using the PyQt utility <strong>pyuic4</strong>.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ pyuic4 GUI_1D.ui &gt; UI.py</pre>
</td>
</tr>
</table>
<p>The autogenerated code (<strong>UI.py</strong>) is then wired to functionality using conventional programming techniques or by using Qt&#8217;s system of <a href="http://techbase.kde.org/Development/Tutorials/Python_introduction_to_signals_and_slots">Signals &amp; Slots</a>, most of which can be done from within the Designer application.</p>
<p>The code required for making the proof of concept (a GUI that pops up and allows the user to set all the options graphically) is actually quite concise (the complexity is hidden in the Qt lib that you link to).</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>from PyQt4.QtCore import * # PyQt core libs
from PyQt4.QtGui import *  # PyQt GUI components
from UI import *           # the interface definition converted to Python code (UI.py, generated by pyuic4)
...

# the class multiply inherits from the library prototype and the specific interface
# class defined in the designer ui -&gt; py
class Form1D(QDialog,Ui_Dialog):
    # to pop it up, it only needs to __init__ itself, declare its parent (itself) as
    # it's a top-level dialog, and then call the designer -&gt; pyuic4-generated setupUi()
    # to make it do anything useful, I have to write all the glue code to pass
    # the params, connect signals &amp; slots, do error-checking, etc. but this pops it up
    # don't forget to erase the no-longer needed class and defs when finished.
    def __init__(self, parent=None):
        super(Form1D, self).__init__(parent)
        self.setupUi(self)

def gui():
    """To pop up the designer-built form, it only needs to declare an instance of the
    QtApplication, ditto the form itself, and show it.
    To make it do anything useful, still have to write all the glue code to pass
    the params, connect signals &amp; slots, do error-checking, etc. but this pops it up.
    """

    app = QApplication(sys.argv)
    form = Form1D()
    form.show()
    app.exec_()
...

# the above class is referenced from option-handling stanza:
for opt, arg in opts:
    if opt in ('-h', '--help'):   usage(1)
...
    elif opt in ('--gui'):        gui()</pre>
</td>
</tr>
</table>
<p>So if you started the app with the <strong>--gui</strong> option, the GUI window would pop up and allow you to fill out all the variables via the mouse. <strong>[This section still incomplete]</strong></p>
<hr />
<h2><a name="_relational_database_connectivity"></a>Relational Database connectivity</h2>
<p>When releasing a piece of academic software into the wild, it is often useful to the author to figure out how it&#8217;s being used so that she can rewrite instructions, concentrate on most-used features, find out the platform distribution, etc.  This kind of feedback can be collected trivially using one of Python&#8217;s Relational DataBase (RDB) connection modules.  During the run of this program, the Python wrapper provides all the variables, times the execution of the run, and can provide some network information to the author.  This information is presented at the end of the run with a request to send the info back to the author.  If the user agrees, the Python wrapper attempts to contact a pre-defined database server and send back the information.</p>
<p>The mechanism is straightforward:</p>
<ul>
<li> collect the information </li>
<li> compose an <strong>INSERT</strong> command to the RDB </li>
<li> show that information and ask the user&#8217;s permission to return it. </li>
<li> if granted, connect to the remote RDB and execute the INSERT command. </li>
</ul>
<p>The information returned includes the date, the hostname, IP #, and OS of the computer, all  program variables, the program run-time, and some platform information about the machine that ran the program.</p>
<p>Here&#8217;s the info collected.  Note that the sysinfo string is the full output of <strong>lshw -short</strong> and should be trimmed considerably.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>date:   Tue Jul 29 15:20:41 2008
user:   hjm
host:   bongo
ipnbr:  128.200.34.98
OS:     Linux
sysinfo:        H/W path       Device    Class       Description
================================================
                         system      Computer
/0                       bus         Motherboard
/0/0                     memory      3041MiB System memory
/0/1                     processor   Intel(R) Core(TM)2 CPU         T5500  @ 1.66GHz
/0/1/0.1                 processor   Logical CPU
/0/1/0.2                 processor   Logical CPU
/0/100                   bridge      Mobile 945GM/PM/GMS, 943/940GML and 945GT Express Memory Controller Hub
/0/100/1                 bridge      Mobile 945GM/PM/GMS, 943/940GML and 945GT Express PCI Express Root Port
/0/100/1/0               display     Radeon Mobility X1400
/0/100/1b                multimedia  82801G (ICH7 Family) High Definition Audio Controller
/0/100/1c                bridge      82801G (ICH7 Family) PCI Express Port 1
/0/100/1c/0    eth0      network     82573L Gigabit Ethernet Controller
/0/100/1c.1              bridge      82801G (ICH7 Family) PCI Express Port 2
/0/100/1c.1/0  wmaster0  network     PRO/Wireless 3945ABG Network Connection
/0/100/1c.2              bridge      82801G (ICH7 Family) PCI Express Port 3
/0/100/1c.3              bridge      82801G (ICH7 Family) PCI Express Port 4
/0/100/1d                bus         82801G (ICH7 Family) USB UHCI Controller #1
/0/100/1d.1              bus         82801G (ICH7 Family) USB UHCI Controller #2
/0/100/1d.2              bus         82801G (ICH7 Family) USB UHCI Controller #3
/0/100/1d.3              bus         82801G (ICH7 Family) USB UHCI Controller #4
/0/100/1d.7              bus         82801G (ICH7 Family) USB2 EHCI Controller
/0/100/1e                bridge      82801 Mobile PCI Bridge
/0/100/1e/0              bridge      PCI1510 PC card Cardbus Controller
/0/100/1f                bridge      82801GBM (ICH7-M) LPC Interface Bridge
/0/100/1f.1              storage     82801G (ICH7 Family) IDE Controller
/0/100/1f.2              storage     82801GBM/GHM (ICH7 Family) SATA AHCI Controller
/0/100/1f.3              bus         82801G (ICH7 Family) SMBus Controller

runtime:        5.47741103172
par :   FDM_par.out
Nb0 :   100
Nsig :  40960
ImSp :  ImSp_spectra.data
cheat : 1.0
signal :        2p-no-noise.txt
nmr_method :    FDM
cheatmore :     F
Nsp :   20000
ReSp :  ReSp_spectra.data
wmaxo : 4500
rho :   2.0
threshhold :    0.0001
Nbc :   -20
theta : 1.5
Gamm :  0.05
ros :   1e-08
AbsSp : AbsSp_spectra.data
wmino : -9000</pre>
</td>
</tr>
</table>
<p>The entire data to be returned (formatted as above) is presented to the user just prior to sending it, so they have the opportunity to refuse sending the information.</p>
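<p>The collect / show / ask / INSERT sequence described above can be sketched as follows. This is an illustration under loudly stated assumptions, not the code in the tarball: an in-memory <strong>sqlite3</strong> database stands in for the remote RDB server, the table <strong>runs</strong> and its columns are invented for the example, the function names are hypothetical, and only a few of the fields are collected. It is written for Python 3.</p>

```python
import platform
import socket
import sqlite3
import time


def collect_run_info(params, runtime):
    """Gather a few of the fields described above into one dictionary."""
    info = {
        "date": time.ctime(),
        "host": socket.gethostname(),
        "os": platform.system(),
        "runtime": runtime,
    }
    info.update(params)                      # program variables, e.g. nmr_method
    return info


def format_report(info):
    """Render the data exactly as it will be shown to the user."""
    return "\n".join("%s:\t%s" % (key, info[key]) for key in sorted(info))


def send_report(conn, info, confirm):
    """Show the report via confirm(); INSERT only if the user agrees."""
    if not confirm(format_report(info)):
        return False                         # user refused, nothing is sent
    conn.execute(
        "INSERT INTO runs (date, host, os, runtime, method) VALUES (?, ?, ?, ?, ?)",
        (info["date"], info["host"], info["os"], info["runtime"], info["nmr_method"]),
    )
    conn.commit()
    return True


# Stand-in for the remote database server (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (date TEXT, host TEXT, os TEXT, runtime REAL, method TEXT)")

info = collect_run_info({"nmr_method": "FDM"}, 5.48)
sent = send_report(conn, info, confirm=lambda report: True)       # user says yes
declined = send_report(conn, info, confirm=lambda report: False)  # user says no
rows = conn.execute("SELECT method, runtime FROM runs").fetchall()
print(rows)
```

<p>In a real program <strong>confirm</strong> would print the report and read a yes/no answer from the user. Passing the values as query parameters (the <strong>?</strong> placeholders) rather than pasting them into the SQL string keeps odd characters in hostnames or file names from breaking the INSERT.</p>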
<hr />
<h2><a name="_additional_useful_hints"></a>Additional useful hints</h2>
<p>Presenting help files in an easily navigable way usually requires a hypertext browser or a custom screen pager.  Python offers a very easy way to present any text file via any pager application on the system.  Since most *nix-like systems have the <em>less</em> pager, I just called that pager on the help file.  Here&#8217;s the entire function that presents pageable help text.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>from pydoc import pipepager
...
def usage1d(code):
    try:
        help_fp = file("1d_orig_help.txt", "r")
        help_txt = help_fp.read()                # read in any text from a text file.
    except IOError:                              # file missing or unreadable
        print "Can't find the help file - should be called '1d_orig_help.txt' - Did you rename it?"
        sys.exit(code)
    pipepager(help_txt, '/usr/bin/less -NS') # pipe help text into 'less -NS'
    sys.exit(code)</pre>
</td>
</tr>
</table>
<hr />
<h2><a name="_download_the_entire_code_tree"></a>Download the entire code tree</h2>
<p>The entire code tree can be downloaded <a href="http://moo.nac.uci.edu/~hjm/fd_rrt1d/f2py_1D_example.tgz">from here</a>.  The <a href="http://moo.nac.uci.edu/~hjm/fd_rrt1d/File_Manifest">File Manifest</a> is included; those files not explicitly named are probably not required.</p>
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/14/howto-convert-a-commandline-fortran-program-to-a-gui-python-program/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>

		<media:content url="http://hjmangalam.files.wordpress.com/2009/09/kompare_1d_s.jpg" medium="image">
			<media:title type="html">images/kompare_1d_s.jpg</media:title>
		</media:content>

		<media:content url="http://hjmangalam.files.wordpress.com/2009/09/designer-qt4_s.jpg" medium="image">
			<media:title type="html">images/designer-qt4_s.jpg</media:title>
		</media:content>
	</item>
		<item>
		<title>Mind your NegaBIT$</title>
		<link>http://hjmangalam.wordpress.com/2009/09/14/mind-your-negabit/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/14/mind-your-negabit/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 18:07:33 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Linux & Open Source]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=13</guid>
		<description><![CDATA[Abstract Free/Libre/Open Source Software (FLOSS) represents a huge pool of ready-made, well-designed, pre-integrated software that we can use to improve IT support. And not only is it free, but it promotes sharing and scalability, reduces licensing and support costs, and relieves its users of legal liability more than it encumbers them. There&#8217;s every reason to [...]]]></description>
			<content:encoded><![CDATA[<table bgcolor="#ffffee" width="100%" cellpadding="15">
<tr>
<td>
<p><em>Abstract</em></p>
<p>Free/Libre/Open Source Software (<a href="http://en.wikipedia.org/wiki/Open_source_software">FLOSS</a>) represents a huge pool of ready-made, well-designed, pre-integrated software that we can use to improve IT support. And not only is it free, but it promotes sharing and scalability, reduces licensing and support costs, and relieves its users of legal liability more than it encumbers them.</p>
</td>
</tr>
</table>
<p>There&#8217;s every reason to be careful in IT spending, especially (but not only) in these unsettled times.  Just as energy conservation is the most efficient mechanism for saving energy (see <a href="http://en.wikipedia.org/wiki/Negawatt_power">NegaWatts</a>), the most efficient mechanism for saving money in Information Technology is by NOT paying for that which you do not need to pay for. I call them <strong>NegaBIT$</strong> (as in <strong>B</strong>undles of <strong>IT $</strong> you don&#8217;t have to spend).</p>
<p>While the direct saving on software costs relative to <a href="http://www.universityofcalifornia.edu/news/article/21271">UC&#8217;s deficit</a> is tiny, it can have a multiplier effect, as the use of FLOSS not only has an immediate effect in reducing some costs, but has a much greater effect in the long term as a way of approaching IT and reducing other costs.</p>
<p>An example may be useful. I don&#8217;t know the cost of the &#8220;Enterprise Monitoring system&#8221; we use here (<a href="http://www.netreo.net/">Netreo</a>), but I do know that Google reports ~33,000 hits on it.  <a href="http://www.nagios.org/">Nagios</a>, a similar, but Open Source system, has &gt;5,000,000 hits.  This doesn&#8217;t mean that Netreo is bad software, nor that Nagios is good software, but it certainly should cause ears to perk up and prompt a closer look at Nagios to see why there are roughly 150 times more Google refs to Nagios than to Netreo.  Another way of evaluating mindshare is to check how many pages link to the site.  Using Google&#8217;s advanced search, about <em>1400 pages link to www.nagios.org</em>, while only <em>2 pages link to www.netreo.com</em>.</p>
<p>One of the complaints I&#8217;ve heard from people charged with evaluating software is that they can&#8217;t obtain comparable information about a competitive  FLOSS package as from a commercial vendor.  This is an inherent problem with FLOSS.  Since the software is free, how can the producers also be expected to provide salespeople and promotional kits that would compete with those from a commercial vendor?  If we are going to exploit FLOSS, we have to be prepared to dig a little harder. Or not &#8211; would you be willing to accept at face value whatever a commercial vendor tells you?  In some cases, especially for the more mature FLOSS systems, they do provide comparable propaganda, but regardless, we should certainly <em>do our own homework</em>.</p>
<p>The solution is to nominate someone to evaluate the best FLOSS products.  This is actually often much easier than doing the evaluation on commercial products, as the FLOSS packages are freely available in unrestricted form (and the best known packages are already in the FLOSS repositories), so that installation is often as simple as (for Nagios on a Debian-based system):</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>   sudo apt-get install nagios</pre>
</td>
</tr>
</table>
<p>There are an additional ~20 optional (and free) packages that work with Nagios that are available via the same mechanism.</p>
<table bgcolor="#ffffee" width="100%" cellpadding="15">
<tr>
<td>
<p><em>It is hard for me to emphasize this enough:</em></p>
<p><strong>The lack of organizational structure to find and present competing FLOSS systems highlights the holes in our evaluation process and inattention in our organization, not the lack of good free software.</strong></p>
<p><strong>The continued lack of appreciation and inability to evaluate the quality of Open Source Software is costing University of California tens of thousands to hundreds of thousands of dollars a year per campus.</strong></p>
</td>
</tr>
</table>
<p>There are extensive methodologies for doing such evaluations.  For example, one of the best documented is David Wheeler&#8217;s <a href="http://www.dwheeler.com/oss_fs_eval.html">IRCA approach</a>:</p>
<ul>
<li> <a href="http://www.dwheeler.com/oss_fs_eval.html#identify"><strong>I</strong>dentify</a> candidates </li>
<li> <a href="http://www.dwheeler.com/oss_fs_eval.html#review"><strong>R</strong>ead</a> existing reviews </li>
<li> <a href="http://www.dwheeler.com/oss_fs_eval.html#compare"><strong>C</strong>ompare</a> the leading programs&#8217; basic attributes to your needs, and then </li>
<li> <a href="http://www.dwheeler.com/oss_fs_eval.html#analyze"><strong>A</strong>nalyze</a> the top candidates in more depth. </li>
</ul>
<p>It would be very helpful if such evaluations were also published so that others at UC could make use of them or use them as a starting point for their own evaluations.</p>
<p><em>Paying</em> for software does not assure quality, and <em>not paying</em> for software does not assure catastrophe.  One of the best-selling software titles of 1995 was Syncronys Software&#8217;s SoftRAM, a Memory Optimizer program for Windows that was eventually shown to do &#8230; <a href="http://en.wikipedia.org/wiki/SoftRAM">nothing</a>.  This exemplifies just one problem with closed source software &#8211; short of actually disassembling the binary executable code, you can&#8217;t see how it works so you can&#8217;t tell if it&#8217;s doing what it says it does.  Not that most of us would be going thru the code line by line, but you can bet that there would be geeks out there who were.</p>
<p>There is also the problem of vendor lock-in and transitioning.  If you have dedicated an entire infrastructure around a particular commercial software, the vendor obviously knows this and can then increase the renewal or support costs to just shy of lethal levels, knowing that they can.  You now have no choice but to pay their ransom or face a huge transition cost.  Sometimes this transition is unexpectedly forced on you, as when rivals buy each other and kill or neglect the competing software lines offered by their now-subsidiary.</p>
<p>Why do people choose commercial software over the OSS equivalent? The <em>feature set</em> is a typical explanation of why an organization chooses a commercial product over an OSS one &#8211; there are often additional options and features in commercial software packages that OSS ones lack.  There is good reason for this.  Vendors <em>have</em> to offer something beyond what the OSS has, and often the reason the FLOSS equivalent lacks a feature is that no one uses or has requested it.</p>
<p>Besides the unit costs, commercial software has an additional hidden cost when provided by a large organization &#8211; that of negotiating license agreements, packaging and distributing the software, tracking the use and leakage of such software, and providing the accounting reports of such use.  For a UCI-sized campus, there are probably 3-4 FTE-equivalents that are involved in this process.</p>
<p>This is not to suggest that FLOSS has no costs &#8211; FLOSS packages have the same kinds of costs for configuration that non-FLOSS packages have.  However when you gain that functional knowledge, it can be used to roll out the package in bulk for little additional cost.</p>
<p>Once you start down this path of exploiting FLOSS, it&#8217;s a slippery slope to saving.  You can find all kinds of compatible software that can be mixed and matched and mashed up to provide additional functionality for no $ down.  And since much FLOSS uses the same underlying libraries, it tends to have a greater  overall compatibility.</p>
<p>And when you find a FLOSS solution that works well, you can roll it out to all 100,000 of your closest friends.  Now THAT is scalability.</p>
<p>NegaBIT$ indeed.</p>
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/14/mind-your-negabit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>
	</item>
		<item>
		<title>How to transfer large amounts of data via network.</title>
		<link>http://hjmangalam.wordpress.com/2009/09/14/how-to-transfer-large-amounts-of-data-via-network/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/14/how-to-transfer-large-amounts-of-data-via-network/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 18:06:16 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Linux & Open Source]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=9</guid>
		<description><![CDATA[Note Executive Summary If you have to transfer data, transfer only that which is necessary. If you unavoidably have lots of data to transfer, consider having your institution set up a GridFTP node. If GridFTP is not available, the fastest, easiest, user-mode, node-to-node method to move data for Linux and MacOSX is with bbcp. If [...]]]></description>
			<content:encoded><![CDATA[<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Executive Summary</b></p>
<p>If you have to transfer data, transfer only <a href="#rsync">that which is necessary</a>. If you unavoidably have <strong>lots of data</strong> to transfer, consider having your institution set up a <a href="#gridftp">GridFTP</a> node.</p>
<p>If GridFTP is not available, the fastest, easiest, user-mode, node-to-node method to move data for Linux and MacOSX is with <a href="#bbcp">bbcp</a>.</p>
<p>If you use Windows, <a href="#fdt">fdt</a> is Java-based and will run there as well.</p>
<p>Note that bbcp and the similar <a href="#bbftp">bbftp</a> can require considerable tuning to extract maximum bandwidth. If these applications do not work at expected rates, ESNet&#8217;s <a href="ftp://fasterdata.es.net/">Guide to Bulk Data Transfer over a WAN</a> is an excellent summary of the deeper network issues.</p>
</td>
</tr>
</table>
<p>We all need to transfer data, and the amount of that data is increasing as the world gets more digital.  If it&#8217;s not climate model data from the IPCC, it&#8217;s high energy particle physics data from the LHC, or audio &amp; video streams from a performance recording.</p>
<p>The usual methods of transferring data (<a href="http://en.wikipedia.org/wiki/Secure_copy">scp</a>, and <a href="http://en.wikipedia.org/wiki/Http">http</a> and <a href="http://en.wikipedia.org/wiki/Ftp">ftp</a> utilities such as <a href="http://curl.haxx.se/">curl</a> or <a href="http://en.wikipedia.org/wiki/Wget">wget</a>) work fine when your data is in the MB range, but when you have very large collections of data there are some tricks that are worth mentioning.</p>
<hr />
<h2><a name="comp_encrypt"></a>Compression &amp; Encryption</h2>
<p>Whether to compress and/or encrypt your data in transit depends on the cost of doing so.  On a modern desktop or laptop computer, the CPU(s) are usually not doing much of anything, so the cost of compression/encryption is generally not even noticed. On an otherwise loaded machine, however, it can be significant, so it depends on what else has to be done at the same time.  Compression can considerably reduce the amount of data that needs to be transmitted if the data is of a compressible type (text, XML, uncompressed images and music). Increasingly, however, such data is already compressed on disk (as jpeg or mp3, for example), and compressing already compressed data yields little improvement.  Some compression utilities try to detect already-compressed data and skip it, so there&#8217;s often no penalty in requesting compression, but others (like the popular Linux archiving utility <em>tar</em>) will not detect it and will waste lots of time trying.</p>
<p>As an extreme example, here&#8217;s the timing of making a tar archive of a large directory that consists mostly of already compressed data, with and without compression.</p>
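<p>One cheap way to decide is to compress only a small sample of the data first. The following is a minimal sketch, not part of any tool mentioned here: the function name <em>sample_ratio</em> and the 1 MiB default sample size are assumptions made for illustration.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# Hypothetical helper: gzip only the first chunk of a file and print the
# compressed size as a percentage of that chunk (100 means no gain).
sample_ratio() {
    file=$1
    bytes=${2:-1048576}    # sample size in bytes; 1 MiB is an arbitrary default
    orig=$(head -c "$bytes" "$file" | wc -c)
    comp=$(head -c "$bytes" "$file" | gzip -c | wc -c)
    awk -v o="$orig" -v c="$comp" \
        'BEGIN { if (o == 0) print 100; else printf "%d\n", 100 * c / o }'
}
```
</pre>
</td>
</tr>
</table>
<p>Calling <em>sample_ratio some.file</em> on jpeg or mp3 data should print a number close to 100, which suggests skipping compression for that transfer. The first chunk of a file is only a rough guide to the rest of it.</p>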
<p><strong>Using</strong> compression:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ time tar -czpf /bduc/data.tar.gz /data
tar: Removing leading `/' from member names

real    201m38.540s
user    95m32.114s
sys     7m13.807s

tar file = 84,284,016,900 bytes</pre>
</td>
</tr>
</table>
<p><strong>NOT using</strong> compression:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ time tar -cpf /bduc/data.tar /data
tar: Removing leading `/' from member names

real    127m13.404s
user    0m43.579s
sys     5m35.437s

tar file = 86,237,952,000 bytes</pre>
</td>
</tr>
</table>
<p>Using compression took more than 74 minutes (about 58%) longer and saved only about 2GB of storage (a 2.3% decrease in size). YMMV.</p>
<p>Similarly, there is a computational cost to encrypting and decrypting data, but less so than with compression.  <em>scp</em> uses <em>ssh</em> to do the underlying encryption and it does a very good job, but like other single-TCP-stream utilities such as <em>curl</em> and <em>wget</em>, it will only be able to push so much thru a connection.</p>
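<p>The comparison can be recomputed from the two <em>time</em> outputs and the two archive sizes. This is a small worked example using only the numbers shown above; the function name <em>tar_overhead</em> is made up for illustration.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# Recompute the cost and benefit of -z from the two tar runs shown above.
tar_overhead() {
    awk 'BEGIN {
        t_gz   = 201 * 60 + 38.540    # real time with compression, seconds
        t_raw  = 127 * 60 + 13.404    # real time without compression, seconds
        sz_gz  = 84284016900          # bytes in data.tar.gz
        sz_raw = 86237952000          # bytes in data.tar
        printf "extra time: %.1f min (%.1f%% longer)\n",
               (t_gz - t_raw) / 60, 100 * (t_gz - t_raw) / t_raw
        printf "space saved: %.2f GB (%.1f%% smaller)\n",
               (sz_raw - sz_gz) / 1e9, 100 * (sz_raw - sz_gz) / sz_raw
    }'
}

tar_overhead
```
</pre>
</td>
</tr>
</table>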
<hr />
<h2><a name="avoiding"></a>Avoiding data transfer</h2>
<p>The most efficient way to transfer data is not to transfer it at all.  There are a number of utilities that can be used to assist in NOT transferring data.  Some of them are listed below.</p>
<h3><a name="kdirstat"></a>kdirstat</h3>
<p>The elegant, open source <a href="http://kdirstat.sourceforge.net/">kdirstat</a> (and its counterparts on MacOSX, <a href="http://www.derlien.com/">Disk Inventory X</a>, and Windows, <a href="http://windirstat.info/">WinDirStat</a>) is a quick way to visualize what&#8217;s taking up space on your disk, so you can either exclude unwanted data from what needs to be copied or delete it to make more space.  All of these are fully native GUI applications that show disk space utilization by file type and directory structure.</p>
<p><img style="border-width:0;" src="http://hjmangalam.files.wordpress.com/2009/09/kdirstat-main.png?w=450" alt="kdirstat-main.png"></p>
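<p>On a machine with no GUI, a rough text-only approximation of the same view can be had from <em>du</em>. This is a sketch only; the function name <em>biggest_dirs</em> is made up for illustration and it assumes the directory has at least one subdirectory.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# Hypothetical helper: list the N largest immediate subdirectories of a
# directory, sizes in KB, largest first.
biggest_dirs() {
    dir=${1:-.}
    n=${2:-10}
    du -sk "$dir"/*/ | sort -rn | head -n "$n"
}
```
</pre>
</td>
</tr>
</table>
<p>For example, <em>biggest_dirs /data 5</em> would show the five largest subdirectories of /data, which are the first candidates to exclude from a transfer.</p>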
<h3><a name="rsync"></a>rsync</h3>
<p><a href="http://samba.anu.edu.au/rsync">rsync</a>, from the fertile mind of Andrew (<a href="http://us6.samba.org/samba/">samba</a>) Tridgell, is an application that will synchronize 2 directory trees, transferring only blocks which are different.</p>
<p>For example, if you had recently added some songs to your 120 GB MP3 collection and you wanted to refresh the collection to your backup machine, instead of sending the entire collection over the network, rsync would detect and send only the new songs.</p>
<p>The first time rsync is used to transfer a directory tree, there will be no speedup:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/6vxd7_10_2.pdf
FF/Advanced_Networking_SDSC_Feb_1_minutes_HJM_fw.doc
FF/Amazon Logitech $30 MIR MX Revolution mouse.pdf
FF/Atbatt.com_receipt.gif
FF/BAG_bicycle_advisory_group.letter.doc
FF/BAG_bicycle_advisory_group.letter.odt
 ...

sent 355001628 bytes  received 10070 bytes  11270212.63 bytes/sec
total size is 354923169  speedup is 1.00</pre>
</td>
</tr>
</table>
<p>But a few minutes later, after adding <em>danish_wind_industry.html</em> to the <em>FF</em> directory, only the new file is sent:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/danish_wind_industry.html

sent 63294 bytes  received 48 bytes  126684.00 bytes/sec
total size is 354971578  speedup is 5604.05</pre>
</td>
</tr>
</table>
<p>So the second synchronization has a speedup of about 5600-fold relative to sending the whole tree again.</p>
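<p>The <em>speedup</em> figure in these summaries is consistent with the total size divided by the bytes actually sent plus received. A minimal sketch that reproduces the two values above (the function name <em>rsync_speedup</em> is made up for illustration):</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# speedup = total size / (bytes sent + bytes received)
rsync_speedup() {
    awk -v total="$1" -v sent="$2" -v recv="$3" \
        'BEGIN { printf "%.2f\n", total / (sent + recv) }'
}

rsync_speedup 354923169 355001628 10070   # first run: the whole tree is sent
rsync_speedup 354971578 63294 48          # second run: one new file
```
</pre>
</td>
</tr>
</table>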
<p>Even more efficiently, if you had a huge database to back up and you had recently modified it so that most of the bits were identical, rsync would send only the blocks that contained the differences.</p>
<p>Here&#8217;s a modest example using a small binary database file:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db

sent 13580195 bytes  received 42 bytes  9053491.33 bytes/sec
total size is 13578416  speedup is 1.00</pre>
</td>
</tr>
</table>
<p>After the transfer, I update the database and rsync it again:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db

sent 632641 bytes  received 22182 bytes  1309646.00 bytes/sec
total size is 13614982  speedup is 20.79</pre>
</td>
</tr>
</table>
<p>There are many utilities based on rsync that are used to synchronize data on 2 sides of a connection by only transmitting the differences. The backup utility <a href="http://backuppc.sf.net">BackupPC</a> is one.</p>
<p>The open source rsync is included by default with almost all Linux distributions. Versions of rsync exist for Windows as well, via <a href="http://www.cygwin.com">Cygwin</a> and <a href="http://www.aboutmyip.com/AboutMyXApp/DeltaCopy.jsp">DeltaCopy</a>.</p>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>MacOSX</b></p>
<p>rsync is included with MacOSX as well, but because of the Mac&#8217;s twisted history of using the <a href="http://en.wikipedia.org/wiki/AppleSingle">AppleSingle/AppleDouble</a> file format (remember those <a href="http://en.wikipedia.org/wiki/Resource_fork">Resource fork</a> problems?), the version of rsync (2.6.9) shipped with OSX versions up to <em>Leopard</em> will not handle older Mac-native files correctly. However, rsync version 3.x <em>will</em> apparently do the conversions correctly.</p>
</td>
</tr>
</table>
<h3><a name="unison"></a>Unison</h3>
<p><a href="http://www.cis.upenn.edu/~bcpierce/unison/">Unison</a> is a slightly different take on transmitting only changes.  It uses a bi-directional sync algorithm to <em>unify</em> filesystems across a network.  Native versions exist for Windows as well as Linux/Unix and it is usually available from the standard Linux repositories.</p>
<p>On an Ubuntu or Debian machine, install it with:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ sudo apt-get install unison</pre>
</td>
</tr>
</table>
<hr />
<h2><a name="streaming"></a>Streaming Data Transfer</h2>
<h3><a name="bbcp"></a>bbcp</h3>
<p><a href="http://www.slac.stanford.edu/~abh/bbcp/">bbcp</a> seems to be a very similar utility to <a href="#bbftp">bbftp below</a>, with the exception that it does not require a remote server running. In this behavior, it&#8217;s much more like <em>scp</em> in that data transfer requires only user-executable copies on both sides of the connection.  Short of access to a GridFTP site, this appears to be the fastest, most convenient single-node method for transferring data.</p>
<p>The code compiled &amp; installed easily with one manual intervention:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>curl http://www.slac.stanford.edu/~abh/bbcp/bbcp.tar.Z |tar -xZf -
cd bbcp
# edit Makefile to change line 18 to: LIBZ       =  /usr/lib/libz.a
make
# there is no *install* stanza in the distributed 'Makefile'
cp bin/your_arch/bbcp ~/bin   # if that's where you store your personal bins.
hash -r   # or 'rehash' if using csh or tcsh
# bbcp now ready to use.</pre>
</td>
</tr>
</table>
<p><em>bbcp</em> can act very much like <em>scp</em> for simple usage:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ time bbcp  file.633M   user@remotehost.subnet.uci.edu:/high/perf/raid/file
real    0m9.023s</pre>
</td>
</tr>
</table>
<p>The 633MB file transferred in under 10s, giving &gt;63MB/s on a Gb net.  Note that this is over our very fast internal campus backbone. That&#8217;s pretty good, but the transfer rate is sensitive to a number of things and can be tuned considerably.  If you look at <a href="http://www.slac.stanford.edu/~abh/bbcp/">all the bbcp options</a>, it&#8217;s obvious that <em>bbcp</em> was written to handle lots of exceptions.</p>
<p>If you increase the number of streams (-s) from the default 4 (as above), you can squeeze a bit more bandwidth from it as well:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ bbcp -P 10 -w 2M -s 10 file.4.2G hjm@remotehost.subnet.uci.edu:/userdata/hjm/
bbcp: Creating /userdata/hjm/file.4.2G
bbcp: At 081210 12:48:18 copy 20% complete; 89998.2 KB/s
bbcp: At 081210 12:48:28 copy 41% complete; 89910.4 KB/s
bbcp: At 081210 12:48:38 copy 61% complete; 89802.5 KB/s
bbcp: At 081210 12:48:48 copy 80% complete; 88499.3 KB/s
bbcp: At 081210 12:48:58 copy 96% complete; 84571.9 KB/s</pre>
</td>
</tr>
</table>
<p>or almost 85MB/s for 4.2GB, which is a very good sustained transfer rate.</p>
<p>Even traversing the CENIC net from UCI to SDSC is fairly good:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ time bbcp -P 2 -w 2M -s 10 file.633M   user@machine.sdsc.edu:~/test.file

bbcp: Source I/O buffers (61440K) &gt; 25% of available free memory (200268K); copy may be slow
bbcp: Creating ./test.file
bbcp: At 081205 14:24:28 copy 3% complete; 23009.8 KB/s
bbcp: At 081205 14:24:30 copy 11% complete; 22767.8 KB/s
bbcp: At 081205 14:24:32 copy 20% complete; 25707.1 KB/s
bbcp: At 081205 14:24:34 copy 33% complete; 29374.4 KB/s
bbcp: At 081205 14:24:36 copy 41% complete; 28721.4 KB/s
bbcp: At 081205 14:24:38 copy 52% complete; 29320.0 KB/s
bbcp: At 081205 14:24:40 copy 61% complete; 29318.4 KB/s
bbcp: At 081205 14:24:42 copy 72% complete; 29824.6 KB/s
bbcp: At 081205 14:24:44 copy 81% complete; 29467.3 KB/s
bbcp: At 081205 14:24:46 copy 89% complete; 29225.5 KB/s
bbcp: At 081205 14:24:48 copy 96% complete; 28454.3 KB/s

real    0m26.965s</pre>
</td>
</tr>
</table>
<p>or almost 30MB/s.</p>
<p>When making the above test, I noticed that the disks from which the data is read and to which it is written can have a large effect on the transfer rate.  If the data is not (or cannot be) cached in RAM, the transfer will eventually require the data to be read from or written to the disk.  Depending on the storage system, this may slow the transfer if the disk I/O cannot keep up with the network.  On the systems that I used in the example above, I saw this effect when I transferred the data to the /home partition (on a slow IDE disk &#8211; see below) rather than the higher performance RAID system that I used above.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ time bbcp -P 2  file.633M  user@remotehost.subnet.uci.edu:/home/user/nother.big.file
bbcp: Creating /home/user/nother.big.file
bbcp: At 081205 13:59:57 copy 19% complete; 76545.0 KB/s
bbcp: At 081205 13:59:59 copy 43% complete; 75107.7 KB/s
bbcp: At 081205 14:00:01 copy 58% complete; 64599.1 KB/s
bbcp: At 081205 14:00:03 copy 59% complete; 48997.5 KB/s
bbcp: At 081205 14:00:05 copy 61% complete; 39994.1 KB/s
bbcp: At 081205 14:00:07 copy 64% complete; 34459.0 KB/s
bbcp: At 081205 14:00:09 copy 66% complete; 30397.3 KB/s
bbcp: At 081205 14:00:11 copy 69% complete; 27536.1 KB/s
bbcp: At 081205 14:00:13 copy 71% complete; 25206.3 KB/s
bbcp: At 081205 14:00:15 copy 72% complete; 23011.2 KB/s
bbcp: At 081205 14:00:17 copy 74% complete; 21472.9 KB/s
bbcp: At 081205 14:00:19 copy 77% complete; 20206.7 KB/s
bbcp: At 081205 14:00:21 copy 79% complete; 19188.7 KB/s
bbcp: At 081205 14:00:23 copy 81% complete; 18376.6 KB/s
bbcp: At 081205 14:00:25 copy 83% complete; 17447.1 KB/s
bbcp: At 081205 14:00:27 copy 84% complete; 16572.5 KB/s
bbcp: At 081205 14:00:29 copy 86% complete; 15929.9 KB/s
bbcp: At 081205 14:00:31 copy 88% complete; 15449.6 KB/s
bbcp: At 081205 14:00:33 copy 91% complete; 15039.3 KB/s
bbcp: At 081205 14:00:35 copy 93% complete; 14616.6 KB/s
bbcp: At 081205 14:00:37 copy 95% complete; 14278.2 KB/s
bbcp: At 081205 14:00:39 copy 98% complete; 13982.9 KB/s

real    0m46.103s</pre>
</td>
</tr>
</table>
<p>You can see how the transfer rate decays as it approaches the write capacity of the /home disk.</p>
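<p>The rates in that column look like running averages since the start of the copy (an inference from the output itself, not something taken from the bbcp documentation), so the real slowdown is sharper than the column suggests. A rough per-interval rate can be estimated from the change in percent complete. In this sketch the function name <em>interval_rates</em> is made up, and the file size (633 MB) and reporting interval (2 s) are passed in by hand.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# Estimate per-interval transfer rates from bbcp progress lines on stdin.
# $1 = file size in MB, $2 = seconds between progress reports.
interval_rates() {
    awk -v size="$1" -v dt="$2" '
        $5 == "copy" {
            pct = $6 + 0                  # "58%" is read as the number 58
            if (seen) printf "%s %.1f MB/s\n", $4, (pct - prev) / 100 * size / dt
            prev = pct
            seen = 1
        }'
}

# Two consecutive lines from the run above:
printf '%s\n' \
    "bbcp: At 081205 14:00:01 copy 58% complete; 64599.1 KB/s" \
    "bbcp: At 081205 14:00:03 copy 59% complete; 48997.5 KB/s" |
    interval_rates 633 2
```
</pre>
</td>
</tr>
</table>
<p>Between those two reports only about 1% of the file moved, which works out to roughly 3 MB/s, far below the average that bbcp was still printing. Because the percentages are whole numbers, the estimate is coarse.</p>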
<h3><a name="bbftp"></a>bbftp</h3>
<p><a href="http://doc.in2p3.fr/bbftp/">bbftp</a> is a modification of the FTP protocol that enables you to open multiple simultaneous TCP streams to transfer data.  It therefore allows you to sometimes bypass per-TCP restrictions that result from badly configured intervening machines.</p>
<p>In order to use it, you&#8217;ll need a bbftp client and server.  Most places that receive large amounts of data (SDSC, NCAR, other supercomputer centers, TeraGrid nodes) will already have a bbftp server running, but you can also compile and run the server yourself.</p>
<p>The more usual case is to run only the client.  It builds very easily on Linux with just the typical <em>curl/untar, cd, ./configure, make, make install</em> dance:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ curl http://doc.in2p3.fr/bbftp/dist/bbftp-client-3.2.0.tar.gz |tar -xzvf -
$ cd bbftp-client-3.2.0/bbftpc/
$ ./configure --prefix=/usr/local
$ make -j3
$ sudo make install</pre>
</td>
</tr>
</table>
<p>Using bbftp is more complicated than the usual ftp client because it has its own syntax:</p>
<p>To send data to a server:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ bbftp -s -e 'put file.154M  /gpfs/mangalam/big.file' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
&gt;&gt; COMMAND : put file.154M /gpfs/mangalam/big.file
&lt;&lt; OK
160923648 bytes send in 7.32 secs (2.15e+04 Kbytes/sec or 168 Mbits/s)

the arguments mean:
-s  use ssh encryption
-e  'local command'
-E  'remote command' (not used above, but often used to cd on the remote system)
-u  'user_login'
-p  # use # parallel TCP streams
-V  be verbose</pre>
</td>
</tr>
</table>
<p>The data was <em>sent</em> at 21MB/s to SDSC thru 10 parallel TCP streams, well below the peak bandwidth of about 90MB/s on a Gb network.</p>
<p>To get data from a server:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ bbftp -s -e 'get /gpfs/mangalam/big.file from.sdsc' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
&gt;&gt; COMMAND : get /gpfs/mangalam/big.file from.sdsc
&lt;&lt; OK
160923648 bytes got in 3.46 secs (4.54e+04 Kbytes/sec or 354 Mbits/s)</pre>
</td>
</tr>
</table>
<p>I was able to <em>get</em> the data at 45MB/s, about half of the theoretical maximum.</p>
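<p>The rates bbftp prints can be recomputed from the byte count and the elapsed time. This sketch assumes that 1 Kbyte is 1024 bytes and 1 Mbit is 1048576 bits, which is what matches the <em>put</em> output above; the function name <em>xfer_rate</em> is made up for illustration.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# Convert bytes and elapsed seconds into KB/s and Mbit/s.
# Unit assumptions: 1 KB = 1024 bytes, 1 Mbit = 1048576 bits.
xfer_rate() {
    awk -v b="$1" -v s="$2" \
        'BEGIN { printf "%.0f KB/s %.0f Mbit/s\n", b / s / 1024, b * 8 / s / 1048576 }'
}

xfer_rate 160923648 7.32   # the put above
xfer_rate 160923648 3.46   # the get above
```
</pre>
</td>
</tr>
</table>
<p>The <em>get</em> line comes out about 1 Mbit/s higher than bbftp reported, presumably because the elapsed time it prints is rounded.</p>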
<p>As a comparison, because the remote receiver is running an old (2.4) kernel which does not handle dynamic TCP window scaling, scp is only able to manage 2.2MB/s to this server:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>$ scp  file.154M mangalam@tg-login1.sdsc.teragrid.org:/gpfs/mangalam/junk
Password:
file.154M                                  100%  153MB   2.2MB/s   01:10</pre>
</td>
</tr>
</table>
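<p>Why a small TCP window hurts so much can be seen from the usual rule of thumb: a single stream cannot move more than one window of data per round trip. The sketch below uses a 64 KiB window (the most TCP allows without window scaling) and a round-trip time of 30 ms. The 30 ms figure is purely hypothetical, since the actual RTT to this server was not measured here, and the function name <em>tcp_ceiling_mbs</em> is made up for illustration.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# Upper bound on TCP throughput in MB/s: streams * window / round-trip time.
# $1 = window in bytes, $2 = RTT in milliseconds, $3 = parallel streams (default 1).
tcp_ceiling_mbs() {
    awk -v w="$1" -v rtt="$2" -v n="${3:-1}" \
        'BEGIN { printf "%.1f\n", n * w / (rtt / 1000) / 1000000 }'
}

tcp_ceiling_mbs 65536 30      # one stream, 64 KiB window, assumed 30 ms RTT
tcp_ceiling_mbs 65536 30 10   # ten such streams
```
</pre>
</td>
</tr>
</table>
<p>Under those assumptions one stream tops out at a little over 2 MB/s while ten streams could reach about ten times that, which is the reason multi-stream tools such as bbftp and bbcp help on paths like this one.</p>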
<h3><a name="fdt"></a>Fast Data Transfer (fdt)</h3>
<p><a href="http://monalisa.cern.ch/FDT">Fast Data Transfer</a> is an application for moving data quickly. It is written in Java, so it can theoretically run on any platform.  The performance results on the web page are very impressive, but in local tests it was slower than bbcp. The startup time for Java and its failure to work in <em>scp</em> mode (it couldn&#8217;t find the <em>fdt.jar</em>, even tho it was in the <strong>CLASSPATH</strong>), which requires you to explicitly start the receiving FDT server (not hard &#8211; see below, but another step), argue somewhat against it.</p>
<p>Starting the server is easy; it starts by default in server mode:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>java -jar ./fdt.jar
# usual Java verbosity omitted</pre>
</td>
</tr>
</table>
<p>The client uses the same jarfile but a different syntax:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>java -jar ./fdt.jar -ss 1M -P 10 -c remotehost.domain.uci.edu  ~/file.633M  -d /userdata/hjm

# where
# -ss 1M  ..... sets the TCP SO_SND_BUFFER size to 1 MB
# -P 10 ....... uses 10 parallel streams (default is 1)
# -c host ..... defines the remote host
# -d dir ...... sets the remote dir</pre>
</td>
</tr>
</table>
<p>The speed is certainly impressive, much higher than scp:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre># scp done over the same net, about the same time

$ scp file.4.2G  remotehost.domain.uci.edu:~
hjm@remotehost's password: ***********
 file.4.2G                   100% 4271MB  25.3MB/s   02:49
                                          ^^^^^^^^</pre>
</td>
</tr>
</table>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre># using the default 1 stream:
$ java -jar fdt.jar -c remotehost.domain.uci.edu ../file.4.2G -d /userdata/hjm/
[transferred in 86s for *53MB/s* ]

# with 10 streams and a larger buffer:
$ java -jar fdt.jar -P 10 -bs 1M -c remotehost.domain.uci.edu ../file.4.2G -d /userdata/hjm/
[transferred in 68s for *66MB/s* with 10 streams]</pre>
</td>
</tr>
</table>
<p>But fdt is slower than bbcp.  The following test was done at about the same time between the same hosts:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bbcp -P 10 -w 2M -s 10 file.4.2G hjm@remotehost.domain.uci.edu:/userdata/hjm/
bbcp: Creating /userdata/hjm/file.4.2G
bbcp: At 081210 12:48:18 copy 20% complete; 89998.2 KB/s
bbcp: At 081210 12:48:28 copy 41% complete; 89910.4 KB/s
bbcp: At 081210 12:48:38 copy 61% complete; 89802.5 KB/s
bbcp: At 081210 12:48:48 copy 80% complete; 88499.3 KB/s
bbcp: At 081210 12:48:58 copy 96% complete; 84571.9 KB/s</pre>
</td>
</tr>
</table>
<h3><a name="gridftp"></a>GridFTP</h3>
<p>If you and your colleagues have to transfer data in the range of multiple GBs and you have to do it regularly, it&#8217;s probably worth setting up a <a href="http://en.wikipedia.org/wiki/GridFTP">GridFTP</a> site.  Because it allows multipoint, multi-stream TCP connections, it can transfer data at multiple GB/s.  However, it&#8217;s beyond the scope of this simple doc to describe its setup and use, so if this sounds useful, bother your local network guru/sysadmin.</p>
<h3><a name="netcat"></a>netcat</h3>
<p><a href="http://netcat.sourceforge.net/">netcat</a> (aka <em>nc</em>) is installed by default on most Linux and MacOSX systems.  It provides a way of opening TCP or UDP network connections between nodes, acting as an open pipe thru which you can send any data as fast as the connection will allow, imposing no additional protocol load on the transfer. Because of its widespread availability and its speed, it can be used to transmit data between 2 points relatively quickly, especially if the data doesn&#8217;t need to be encrypted or compressed (or if it already is).</p>
<p>However, to use netcat, you have to have login privs on both ends of the connection and you need to explicitly set up a sender that waits for a connection request on a specific port from the receiver.  This is less convenient to do than simply initiating an <em>scp</em> or <em>rsync</em> connection from one end, but may be worth the effort if the size of the data transfer is very large. To monitor the transfer, you also have to use something like <em>pv</em> (pipeviewer); netcat itself is quite laconic.</p>
<p>How it works: On the sending end, you need to set up a listening port:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>[send_host]: $ pv -pet honkin.big.file | nc -q 1 -l -p 1234 &lt;enter&gt;</pre>
</td>
</tr>
</table>
<p>This sends the <em>honkin.big.file</em> thru <em>pv -pet</em> which will display  progress, ETA, and time taken.  The command will hang, listening (-l) for a connection from the other end.  The <em>-q 1</em> option tells the sender to wait 1s after getting the EOF and then quit.</p>
<p>On the receiving end, you connect to the nc listener:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>[receive_host] $ nc host.domain.uci.edu 1234 |pv -b &gt; honkin.big.file &lt;enter&gt;</pre>
</td>
</tr>
</table>
<p>(note: no <em>-p</em> to indicate port on the receiving side).  The <em>-b</em> option to <em>pv</em> shows only bytes received.</p>
<p>Once the receive_host command is initiated, the transfer starts, as can be seen by the pv output on the sending side and the bytecount on the receiving side.  When it finishes, both sides terminate the connection 1s after getting the EOF.</p>
<p>This arrangement is slightly arcane, but supports the unix tools philosophy which allows you to chain various small tools together to perform a task.  While the above example shows the case for a single large file, it can also be modified only slightly to do recursive transfers using tar, shown here recursively copying the local <em>sge</em> directory to the remote host.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>[send_host]: $ tar -czvf - sge | nc -q 1 -l -p 1234</pre>
</td>
</tr>
</table>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>[receive_host] $  nc host.domain.uci.edu 1234 |tar -xzvf -</pre>
</td>
</tr>
</table>
<p>In this case, I&#8217;ve added the verbose flag (-v) to the tar command so using <em>pv</em> is redundant.  It also uses tar&#8217;s built-in compression flag (-z) to compress as it transmits.</p>
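<p>The tar half of this pipeline can be tried locally, with no network and no nc, which is a handy way to check the flags before involving two hosts. This is only a sketch: the plain pipe stands in for the nc connection, and the directory and file names are throwaway examples created under a temporary directory.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>
```shell
# Local dry run of tar-over-a-pipe; the pipe takes the place of nc.
work=$(mktemp -d)
mkdir -p "$work/src/sge/bin" "$work/dest"
echo "hello" > "$work/src/sge/bin/test.txt"

# Sender: -c create, -z gzip, -f - writes the archive to stdout.
# Receiver: -x extract, -z gunzip, -f - reads the archive from stdin.
tar -C "$work/src" -czf - sge | tar -C "$work/dest" -xzf -

# Confirm that the copy matches the original.
if diff -r "$work/src/sge" "$work/dest/sge"; then
    echo "trees match"
fi
```
</pre>
</td>
</tr>
</table>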
<p>You could also bundle the sending and receiving commands together in a script, using ssh to execute the remote command.</p>
<hr />
<h2><a name="_latest_version_of_this_document"></a>Latest version of this Document</h2>
<p>The latest version of this document should always be <a href="http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/14/how-to-transfer-large-amounts-of-data-via-network/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>

		<media:content url="http://hjmangalam.files.wordpress.com/2009/09/kdirstat-main.png" medium="image">
			<media:title type="html">kdirstat-main.png</media:title>
		</media:content>
	</item>
		<item>
		<title>An R Overview &amp; Cheatsheet</title>
		<link>http://hjmangalam.wordpress.com/2009/09/14/an-r-overview-cheatsheet/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/14/an-r-overview-cheatsheet/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 17:50:33 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Linux & Open Source]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=6</guid>
		<description><![CDATA[The latest version of this document will always be found here. Note Some assumptions The following tutorial assumes that you&#8217;re using a Linux system from the bash shell, you&#8217;re familiar navigating directories on a Linux system using cd, using the basic shell utilities such as head, tail, less, and grep, and the minimum R system [...]]]></description>
			<content:encoded><![CDATA[<p>The latest version of this document <a href="http://moo.nac.uci.edu/~hjm/AnRCheatsheet.html">will always be found here</a>.</p>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Some assumptions</b></p>
<p>The following tutorial assumes that you&#8217;re using a Linux system from the bash shell, that you&#8217;re familiar with navigating directories on a Linux system using <strong>cd</strong> and with using the basic shell utilities such as <strong>head, tail, less, and grep</strong>, and that the minimum R system has been installed on your system.  If not, you should peruse a basic introduction to the bash shell.   The bash prompt is shown as <strong>bash %</strong>. The R shell prompt is <strong>&gt;</strong> and R commands will be prefixed by the <strong>&gt;</strong> to denote it. Inline comments are prefixed by <strong>#</strong> and can be copied into your shell along with the R or bash commands they comment &#8211; the <strong>#</strong> shields them from being executed, but do NOT copy in the R or bash shell prompt. For a quick introduction to basic data manipulation with Linux, please refer to the imaginatively named <a href="http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html">Manipulating Data on Linux</a>.</p>
</td>
</tr>
</table>
<hr />
<h2><a name="_r_8217_s_operational_modes"></a>R&#8217;s operational modes</h2>
<p>R is a programming language designed explicitly for statistical computation.  It nevertheless has many of the characteristics of a general-purpose language, including iterators, control loops, and network and database operations. Many of these are useful, but in general they are not as easy to use as those of the more general <a href="http://en.wikipedia.org/wiki/Python_(programming_language)">Python</a> or <a href="http://en.wikipedia.org/wiki/Perl">Perl</a>.</p>
<p>R can operate in 2 modes, as an interactive interpreter (the R shell)  and as a scripting language, much like Perl or Python.  Typically the R shell is used to try things out, and a series of R commands saved in a file is used to automate operations once the sequence is well-defined and debugged.</p>
<p>While it is not a great language for procedural programming, it does excel at mathematical  and statistical manipulations.  However, it does so in an odd way, especially for those who have done procedural programming before.  R is quite <em>object-oriented</em> in that you tend to deal with <em>data objects</em> rather than with individual integers, floats, arrays, etc.  If you have programmed in <em>C</em>, the best way to think of R data (typically termed <em>tables</em> or <em>frames</em>) is as C <em>structs</em>, arbitrary constructs that can be dealt with by name.  If you haven&#8217;t programmed in a procedural language, it may actually be easier for you.  R manipulates chunks of data similar to how you might think of them.  For example, &#8220;multiply that column by 3.54&#8221; or &#8220;pivot that spreadsheet&#8221;.</p>
<p>For a still-brief but more complete overview of R, see <a href="http://en.wikipedia.org/wiki/R_(programming_language)">Wikipedia&#8217;s entry for R</a>.</p>
<hr />
<h2><a name="_graphics_and_graphical_user_interfaces_for_r"></a>Graphics and Graphical User Interfaces for R</h2>
<p>R was developed as a commandline language.  However, it has gained progressively more graphics capabilities and graphical user interfaces (GUIs).  Some notable examples are the de facto standard R GUI <a href="http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/">R Commander</a>, the elegant and powerful interactive graphics utility <a href="http://www.ggobi.org">ggobi</a>, and the fully graphical statistics package <a href="http://gretl.sourceforge.net/">gretl</a> which was developed external to R for time-series econometrics but which now supports R as an external module.</p>
<p>In addition, many routines in R are packaged with their own GUIs so that when called from the commandline interface, the GUI will pop up and allow the user to interact with a mouse rather than the keyboard.</p>
<p>Unlike fully integrated commercial applications, which have a cohesive interface, these packages differ slightly in the way that they approach getting things done, but their mechanisms generally follow accepted user interface conventions.</p>
<hr />
<h2><a name="_getting_help_on_r"></a>Getting help on R</h2>
<p>After starting R, you can usually start R&#8217;s internal help with:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash %  R  # more comments on R's startup state below
# R startup messages deleted
&gt; help.start()</pre>
</td>
</tr>
</table>
<p>This will start a browser page with a lot of links to R documentation, both introductory and advanced.</p>
<p>If that help is not installed, most R help can be found in PDF and HTML at <a href="http://cran.r-project.org/manuals.html">the R documentation page</a>.</p>
<p>There is a very useful 100-page PDF book called <a href="http://cran.r-project.org/doc/manuals/R-intro.pdf">An Introduction to R</a> (also in <a href="http://cran.r-project.org/doc/manuals/R-intro.html">HTML</a>). Especially note <a href="http://cran.r-project.org/doc/manuals/R-intro.html#Introduction-and-preliminaries">Chapter 1</a> and <a href="http://cran.r-project.org/doc/manuals/R-intro.html#A-sample-session">Appendix A</a> on page 78, which is a more sophisticated (but less gentle) tutorial than this.</p>
<p>Note also that if you have experience in SAS or SPSS, there is a good introduction written especially with this experience in mind.  It has been expanded into a large book (<a href="http://www.amazon.com/SAS-SPSS-Users-Statistics-Computing/dp/0387094172">to be released imminently</a>), but much of the core is still available for free as <a href="http://oit.utk.edu/scc/RforSAS%26SPSSusers.pdf">R for SAS and SPSS Users</a>.</p>
<p>There is another nice Web page that provides a quick, broad intro to R appropriately called <a href="http://www.personality-project.org/r/">Using R for psychological research: A simple guide to an elegant package</a> which manages to compress a great deal of what you need to know about R into about 30 screens of well-cross-linked HTML.</p>
<p>There is also a website that expands on simple tutorials and examples, so after buzzing thru this very simple example, please continue to the <a href="http://www.statmethods.net/">QuickR website</a>.</p>
<p><a href="mailto:thomas.girke@ucr.edu">Thomas Girke</a> at UC Riverside has a very nice HTML <a href="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html">Introduction to R and Bioconductor</a> as part of his Bioinformatics Core, which also runs <a href="http://faculty.ucr.edu/~tgirke/Workshops.htm">frequent R-related courses</a> which may be of interest to those in the Southern California area.</p>
<hr />
<h2><a name="_start_the_r_interpreter"></a>Start the R interpreter</h2>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash %  R  # now that was simple, no?</pre>
</td>
</tr>
</table>
<p>If you get the line shown below:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>[Previously saved workspace restored]</pre>
</td>
</tr>
</table>
<p>you had saved the previous data environment, and if you type <strong>ls()</strong> you&#8217;ll see all the previously created variables:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; ls()
 [1] "aa"          "data.matrix" "fil_ss"      "i"           "ma5"
 [6] "mn"          "mn_tss"      "norm.data"   "sg5"         "ss"
[11] "ss_mn"       "sss"         "tss"         "t_ss"        "tss_mn"
[16] "x"</pre>
</td>
</tr>
</table>
<p>If the previous session was not saved, there will be no saved variables in the environment:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; ls()
character(0) # denotes an empty character vector (= nothing there)</pre>
</td>
</tr>
</table>
<p>NB: In many cases the data objects can be manipulated much as you would files in the Unix shell, although it takes some getting used to that R almost always requires parentheses after the function name:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; ls() # lists all data objects in view
&gt; ls(someframe) # may give additional information on the frame
&gt; rm(someframe) # deletes the 'someframe' data object</pre>
</td>
</tr>
</table>
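<p>As a minimal sketch of that shell-like workflow, the following creates two throwaway objects and then lists and removes them.  The names <em>someframe</em> and <em>y</em> and their contents are invented purely for illustration; only ls(), names(), exists() and rm() are standard R.</p>

```r
# Hypothetical objects, created only for this illustration.
someframe <- data.frame(id = c("a", "b", "c"), value = c(1.5, 2.5, 3.5))
y <- rnorm(50)

ls()                             # names of all objects in the workspace
names(someframe)                 # column names of the table: "id" "value"
before <- exists("someframe")    # TRUE: the object is present

rm(someframe)                    # delete the object (there is no undo)
after <- exists("someframe")     # FALSE: it is gone
c(before = before, after = after)
```

<p>If you want to clear the whole workspace rather than a single object, rm(list = ls()) does that, so use it with care.</p>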
<p>To determine what an object is, you can use a variety of methods, but one of the most useful is <strong>str</strong> (get the structure of an object).</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; str(aa)  # aa is a table or data.frame resulting from reading in a 25MB file.
'data.frame':   385238 obs. of  6 variables:
 $ ID    : Factor w/ 1649 levels "ENSXETG00000000007_SCAFFOLD_1356_47264_67263",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ BEGIN : int  1 70 139 232 301 370 439 508 577 658 ...
 $ END   : int  50 119 188 281 350 419 488 557 626 707 ...
 $ Blue  : num  -0.67 -0.47 0.36 0.42 0.09 0.22 -0.42 0.61 0.71 1.61 ...
 $ Red   : num  2.8 2.56 2.96 -0.76 -2.6 -1.52 -0.36 -3.24 -1.8 -3.36 ...
 $ RedNeg: num  -0.7 -0.64 -0.74 0.19 0.65 0.38 0.09 0.81 0.45 0.84 ...

&gt; str(y) # y is a vector of size 50
 num [1:50] -0.202 -0.418  0.325 -0.377 -0.405 ...</pre>
</td>
</tr>
</table>
<p>You can also use dim() to get the bare dimensions of the data structure.  This is always a good idea before you view an object by referencing it directly, since that tries to print the whole thing.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; dim (aa)  # aa is a table or data.frame resulting from reading in a 25MB file.
[1] 385238      6</pre>
</td>
</tr>
</table>
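<p>Here is a small sketch of sizing up an object before printing it.  The table <em>demo</em> is a made-up stand-in for <em>aa</em> (the values are invented); head() is not mentioned above but is a standard R function for peeking at just the first few rows.</p>

```r
# A small made-up table standing in for 'aa'; the values are invented.
demo <- data.frame(
  ID    = c("g1", "g1", "g2", "g2"),
  BEGIN = c(1L, 70L, 1L, 70L),
  END   = c(50L, 119L, 50L, 119L),
  Blue  = c(-0.67, -0.47, 0.36, 0.42)
)

dim(demo)       # rows, then columns: 4 4
nrow(demo)      # 4
ncol(demo)      # 4
str(demo)       # type and first few values of every column
head(demo, 2)   # only the first 2 rows, safe even for a huge table
```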
<hr />
<h2><a name="_some_basic_r_data_types"></a>Some  basic R data types</h2>
<p>Before we get too far, note that R has a variety of data structures that may be required for different functions.</p>
<p>Some of the main ones are:</p>
<ul>
<li> <em>numeric</em> &#8211; (aka double) the default double-precision (64bit) float representation for a number. </li>
<li> <em>integer</em> &#8211; a 32bit integer. </li>
<li> <em>single</em> &#8211; a single-precision (32bit) float.  R itself stores all floats as doubles; this designation matters mainly when passing data to C or Fortran code. </li>
<li> <em>string</em> &#8211; (<em>character</em> in R&#8217;s own terminology) any character string (&#8220;harry&#8221;, &#8220;r824_09_hust&#8221;, &#8220;888764&#8221;).  Note the last is a number defined as a string, so you couldn&#8217;t add &#8220;888764&#8221; + 765. </li>
<li> <em>vector</em> &#8211; a series of values of the same type. (3, 5, 6, 2, 15) is a vector of numerics. (&#8220;have&#8221;, &#8220;third&#8221;, &#8220;start&#8221;, &#8220;grep&#8221;) is a vector of strings. </li>
<li> <em>matrix</em> &#8211; an array of identical data types &#8211; all numerics, all strings, all booleans, etc. </li>
<li> <em>data.frame</em> &#8211; a table (another name for it) or array of mixed data types, where each column holds a single type: for example, a column of strings, a column of integers, a column of booleans, and 3 columns of numerics. </li>
<li> <em>list</em> &#8211; a concatenated series of data objects, which need not be of the same type. </li>
</ul>
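<p>The types listed above can be poked at directly from the prompt.  This is only a sketch with invented values and variable names; class(), is.matrix() and as.numeric() are standard R functions for asking what an object is and for converting it.</p>

```r
n  <- 3.54                      # numeric (double)
i  <- 42L                       # integer; the L suffix requests an integer
s  <- "888764"                  # character, even though it looks like a number
v  <- c(3, 5, 6, 2, 15)         # numeric vector
m  <- matrix(1:6, nrow = 2)     # 2 x 3 matrix of integers
df <- data.frame(name = c("a", "b"), count = c(1L, 2L), ok = c(TRUE, FALSE))
l  <- list(n, s, v)             # list holding objects of different types

class(n); class(i); class(s)    # "numeric", "integer", "character"
is.matrix(m)                    # TRUE
class(df)                       # "data.frame"
class(l)                        # "list"

# "888764" + 765 would fail; convert the string to a number first.
total <- as.numeric(s) + 765    # 889529
total
```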
<hr />
<h2><a name="_loading_data_into_r"></a>Loading data into R</h2>
<p>This process is described in much more detail in the document <a href="http://cran.r-project.org/doc/manuals/R-data.pdf">R Data Import/Export</a>, but this will give you a short example of one of the most popular ways of loading data into R.</p>
<p>The following reads the example data file above (red+blue_all.txt) into an R table called <em>aa</em>; note the <em>header=TRUE</em> option, which uses the file header line to name the columns.  In order for this to work, the column headers have to be separated by the same delimiter as the data (usually a space or a tab).</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; aa &lt;- read.table(file="red+blue_all.txt",header=TRUE) # try it</pre>
</td>
</tr>
</table>
<p>Note that in R, assignment can point left or right, altho the left-pointing arrow is the usual syntax. The above command could also have been given as:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; read.table(file="red+blue_all.txt",header=TRUE)  -&gt; aa   # try it</pre>
</td>
</tr>
</table>
<p>Did we get the col names labeled correctly?</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; names(aa)  # try it
[1] "ID"    "BEGIN" "END"   "Blue"  "Red"</pre>
</td>
</tr>
</table>
<p>Many functions will require the data as a matrix (a data structure whose components are all the same data type).  The following will load the file into a matrix, doing the conversions along the way; because this file contains a non-numeric ID column, every value ends up as a string.  This command also shows the use of the <strong>sep="\t"</strong> option which explicitly sets the data delimiter to the <strong>TAB</strong> character.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>Data &lt;- as.matrix(read.table("red+blue_all.txt", sep="\t", header=TRUE))</pre>
</td>
</tr>
</table>
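<p>If you don&#8217;t have the example file handy, the following sketch writes a tiny tab-delimited file to a temporary location and reads it back.  The file contents and the names <em>aa_demo</em>, <em>mat_all</em> and <em>mat_num</em> are invented for illustration; only the column names follow the example above.  It also shows one way, under these assumptions, to get a numeric matrix instead of a matrix of strings.</p>

```r
# Write a tiny tab-delimited file so the example is self-contained.
# The contents are invented; only the column names follow the text.
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID\tBEGIN\tEND\tBlue\tRed",
             "gene1\t1\t50\t-0.67\t2.80",
             "gene1\t70\t119\t-0.47\t2.56",
             "gene2\t1\t50\t0.36\t2.96"), tmp)

aa_demo <- read.table(file = tmp, sep = "\t", header = TRUE)
names(aa_demo)            # "ID" "BEGIN" "END" "Blue" "Red"
sapply(aa_demo, class)    # one class per column

# Converting the whole table: the text ID column forces every cell to a string.
mat_all <- as.matrix(aa_demo)
typeof(mat_all)           # "character"

# Converting only the numeric columns keeps the numbers as numbers.
mat_num <- as.matrix(aa_demo[, c("Blue", "Red")])
typeof(mat_num)           # "double"

unlink(tmp)               # remove the temporary file
```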
<hr />
<h2><a name="_viewing_data"></a>Viewing data</h2>
<p>To view some of the table, we only need to reference it.  R assumes that a naked variable is a request to view it.  An R table is referenced in [rows,columns] order. That is, the index is [row indices, col indices], so that a reference like aa[1:6,1:5] will show a <em>rectangular</em> window of the table: rows 1-6, columns 1-5.</p>
<p>Also: R indices are 1-based, not 0-based, tho R is pretty forgiving: an index of 0 is silently dropped, so if you ask for [0:6,0:5], you get the same result as [1:6,1:5].  Also note that R requires acknowledging that the data structures are n-dimensional.  With a single index, R selects columns of a table, so to see rows 1-11, you can&#8217;t just type:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; aa[1:11] # try it!</pre>
</td>
</tr>
</table>
<p>but must instead type:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; aa[1:11,]   # another dimension implied by the ','</pre>
</td>
</tr>
</table>
<p>To slice the table even further, you can view only those elements that interest you:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>  # R will provide row and column headers if they were provided.
  # if they weren't provided, it will provide column indices.
&gt; aa[3:12,4:5]  # try it.</pre>
</td>
</tr>
</table>
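<p>The indexing rules above are easier to see on a table small enough to read at a glance.  The table <em>demo</em> and its column names are made up for this sketch.</p>

```r
# A made-up 12-row table, small enough to check by eye.
demo <- data.frame(a = 1:12, b = 101:112, c = letters[1:12])

demo[1:3, ]       # rows 1-3, all columns
demo[, 1:2]       # all rows, columns 1-2
demo[3:5, 2:3]    # rows 3-5 of columns 2-3
demo[1:2]         # a single index selects COLUMNS 1-2, not rows
demo[0:3, ]       # index 0 is silently dropped, so this shows rows 1-3

nrow(demo[0:3, ]) # 3
ncol(demo[1:2])   # 2
```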
<hr />
<h2><a name="ggobi"></a>Visualizing Data with ggobi</h2>
<p>While there are many ways in R to <a href="http://www.statmethods.net/graphs/index.html">plot your data</a>, one unique benefit that R provides is its interface to <a href="http://www.ggobi.org">ggobi</a>, an advanced visualization tool for multivariate data, which for the sake of argument is more than 3 variables.</p>
<hr />
<h2><a name="_operations_on_data"></a>Operations on Data</h2>
<p>To operate on a column, e.g. to multiply it by some value, you only need to reference the column ID, not each element.  This is a key idea behind R&#8217;s object-oriented approach. The following multiplies the column <em>Red</em> of the table <em>aa</em> by -1 and stores the result in the column <em>RedNeg</em>.  There&#8217;s no need to pre-define the RedNeg column; R will allocate space for it as needed.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; aa$RedNeg = aa$Red * -1   # try it

# the table now has a new column called 'RedNeg'
&gt; aa[1:4,]      # try it</pre>
</td>
</tr>
</table>
<p>If you want to have the col overwritten by the new value, simply use the original col name on the left-hand side.  You can also combine statements by separating them with a <em>;</em></p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; aa$Red = aa$Red * -2; aa[1:4,] # mul aa$Red by -2 and print the 1st 4 lines</pre>
</td>
</tr>
</table>
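<p>The same whole-column arithmetic works between columns as well as with constants.  Here is a short sketch on an invented table (<em>demo</em> and the <em>Ratio</em> column are not part of the example data):</p>

```r
# Invented values; only the column names Red and Blue follow the text.
demo <- data.frame(Red = c(2.8, 2.56, -0.76), Blue = c(-0.67, -0.47, 0.42))

demo$RedNeg <- demo$Red * -1          # new column, created on assignment
demo$Ratio  <- demo$Red / demo$Blue   # element-by-element arithmetic between columns
demo$Red    <- demo$Red * -2          # overwrite an existing column in place

demo                                  # view the modified table
```

<p>Note that RedNeg was computed before Red was overwritten, so it still reflects the original Red values.</p>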
<p>To transpose (aka pivot) a table:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; aacopy &lt;- aa # copy the table

&gt; aac_t &lt;- t(aacopy) # transpose the table

# show all the rows &amp; 1st 3 columns of the transposed table
&gt; aac_t[,1:3]</pre>
</td>
</tr>
</table>
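<p>One thing to watch for when transposing: t() hands back a matrix rather than a table, so a table with mixed column types comes back as all strings.  A small sketch with invented data (the names <em>num</em> and <em>mixed</em> are hypothetical):</p>

```r
# All-numeric table: the transpose is a numeric matrix.
num   <- data.frame(Blue = c(-0.67, -0.47, 0.36), Red = c(2.8, 2.56, 2.96))
num_t <- t(num)
is.matrix(num_t)      # TRUE: the result is a matrix, not a data.frame
dim(num_t)            # 2 3 (columns became rows)
typeof(num_t)         # "double"

# Mixed table: the text column turns every cell into a string.
mixed   <- data.frame(ID = c("g1", "g2"), Blue = c(-0.67, 0.36))
mixed_t <- t(mixed)
typeof(mixed_t)       # "character"
```

<p>If you need the numbers to stay numbers, transpose only the numeric columns.</p>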
<hr />
<h2><a name="_basic_statistics"></a>Basic Statistics</h2>
<p>Now, let&#8217;s get some basic descriptive stats out of the original table. First, load the library that has the functions we&#8217;ll need (here <em>pastecs</em>, an add-on package which may need to be installed first):</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; library(pastecs)  # load the lib that has the correct functions
&gt; attach(aa) # make the Red &amp; Blue columns usable as variables.

# the following line combines the Red and Blue vectors column-wise into a matrix.
# type 'help(cbind)' or '?cbind' for more info on cbind, rbind

&gt; redblue &lt;- cbind(Red,Blue)

# the following calculates a table of descriptive stats for both the Red &amp; Blue variables
&gt; stat.desc(redblue)
                       Red          Blue
nbr.val       3.852380e+05  3.852380e+05
nbr.null      2.486000e+03  2.188000e+03
nbr.na        0.000000e+00  0.000000e+00
min          -1.504000e+01 -5.130000e+00
max           2.256000e+01  5.340000e+00
range         3.760000e+01  1.047000e+01
sum          -6.929548e+04  3.060450e+03
median        4.000000e-02  0.000000e+00
mean         -1.798771e-01  7.944310e-03
SE.mean       3.826047e-03  1.115451e-03
CI.mean.0.95  7.498938e-03  2.186250e-03
var           5.639358e+00  4.793247e-01
std.dev       2.374733e+00  6.923328e-01
coef.var     -1.320198e+01  8.714826e+01</pre>
</td>
</tr>
</table>
<p>There is also a basic function called <em>summary</em>, which will give summary stats on all components of a dataframe.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>&gt; summary(aa)
                                               ID             BEGIN
 ENSXETG00000004313_SCAFFOLD_317_890727_910726  :   289   Min.   :    1
 ENSXETG00000010135_SCAFFOLD_646_44207_64206    :   289   1st Qu.: 4875
 ENSXETG00000017293_SCAFFOLD_132_1100669_1120668:   289   Median : 9722
 ENSXETG00000020596_SCAFFOLD_474_677230_697229  :   289   Mean   : 9781
 ENSXETG00000021861_SCAFFOLD_101_1701655_1721654:   289   3rd Qu.:14639
 ENSXETG00000023499_SCAFFOLD_486_295358_315357  :   289   Max.   :19950
 (Other)                                        :383504
      END             Blue                Red               RedNeg
 Min.   :   50   Min.   :-5.130000   Min.   :-15.0400   Min.   :-5.64000
 1st Qu.: 4924   1st Qu.:-0.470000   1st Qu.: -1.7600   1st Qu.:-0.39000
 Median : 9771   Median : 0.000000   Median :  0.0400   Median :-0.01000
 Mean   : 9830   Mean   : 0.007944   Mean   : -0.1799   Mean   : 0.04497
 3rd Qu.:14688   3rd Qu.: 0.480000   3rd Qu.:  1.5600   3rd Qu.: 0.44000
 Max.   :19999   Max.   : 5.340000   Max.   : 22.5600   Max.   : 3.76000</pre>
</td>
</tr>
</table>
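<p>If the pastecs library isn&#8217;t available, several of the same numbers can be had from functions that ship with R.  This is only a sketch on invented values, not a replacement for stat.desc; the matrix <em>redblue</em> is rebuilt here from made-up data so the block runs on its own.</p>

```r
# Invented values; in the text, redblue comes from the columns of 'aa'.
redblue <- cbind(Red  = c(2.8, 2.56, 2.96, -0.76),
                 Blue = c(-0.67, -0.47, 0.36, 0.42))

means <- colMeans(redblue)                # mean of each column
sds   <- apply(redblue, 2, sd)            # standard deviation of each column
rng   <- apply(redblue, 2, range)         # min (row 1) and max (row 2) per column
# standard error of the mean: sd / sqrt(n)
sems  <- apply(redblue, 2, function(x) sd(x) / sqrt(length(x)))

rbind(mean = means, std.dev = sds, SE.mean = sems)
rng
```

<p>apply(x, 2, f) runs the function f over each column of x, which is why no loop is needed.</p>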
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/14/an-r-overview-cheatsheet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>
	</item>
		<item>
		<title>Save $ on Software (by not paying for it &#8230;legally)</title>
		<link>http://hjmangalam.wordpress.com/2009/09/14/save-on-software-by-not-paying-for-it-legally/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/14/save-on-software-by-not-paying-for-it-legally/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 19:48:09 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Linux & Open Source]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=24</guid>
		<description><![CDATA[Introduction Just as energy conservation is the most efficient mechanism for decreasing energy costs (see NegaWatts), the most efficient mechanism for saving money in the IT world is NOT paying for that which you do not need to pay for. You might call this saving NegaBIT$ (as in B undles of IT $ you don&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<hr />
<h2><a name="_introduction"></a>Introduction</h2>
<p>Just as energy <em>conservation</em> is the most efficient mechanism for decreasing energy costs (see <a href="http://en.wikipedia.org/wiki/Negawatt_power">NegaWatts</a>), the most efficient mechanism for saving money in the IT world is NOT paying for that which you do not need to pay for. You might call this saving <strong>NegaBIT$</strong> (as in <strong>B</strong>undles of <strong>IT $</strong> you don&#8217;t have to pay for).</p>
<p>One of the most obvious ways of not spending $ on software is by not paying for it &#8211; legally.  Using Free/Libre/Open Source Software (<a href="http://en.wikipedia.org/wiki/FLOSS">FLOSS</a>) has a number of advantages personally and organizationally, for both the short and especially the long term.</p>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Definitions of FLOSS</b></p>
<p>I use the term FLOSS advisedly &#8211; it&#8217;s an inclusive term that covers many incompatible and/or different definitions of <em>free</em>, but the end result is that the software is free to the end user.  There are many significant differences between different <em>Free</em> licenses, including source code availability, redistribution rights, etc.  More info about this can be <a href="http://en.wikipedia.org/wiki/Open-source_software">found here</a>.</p>
</td>
</tr>
</table>
<p>You don&#8217;t need to run Linux to take advantage of FLOSS; Windows and MacOSX are surprisingly amenable to running FLOSS, and with free Virtualization technology such as <a href="http://www.virtualbox.org/">VirtualBox</a>, you can run whatever Operating System (OS) you need to run whatever application you need (albeit a bit slower, and you may need to pay for the other OS).</p>
<p>The use of FLOSS makes using a computer cheaper, easier to maintain, less legally problematic, more friendly, and .. did I mention cheaper?  Read on..</p>
<hr />
<h2><a name="_floss_on_different_operating_systems"></a>FLOSS on Different Operating Systems</h2>
<p>Regardless of the Operating System you use (Mac, Windows, Linux), you can exploit FLOSS.</p>
<h3><a name="_windows"></a>Windows</h3>
<p>Despite what you might think, most FLOSS is used on Windows, the <em>least free</em> OS.  That&#8217;s because there are more people using Windows than all other OS&#8217;s combined, creating a huge &#8220;market&#8221; for FLOSS that programmers have started to address. There are even <a href="http://osswin.sourceforge.net/">web sites dedicated to enumerating Windows FLOSS</a>.  Among the standouts that run on Windows are the <a href="http://www.mozilla.com/en-US/firefox/upgrade.html">Firefox browser</a> and <a href="http://www.mozillamessaging.com/en-US/thunderbird/">Thunderbird email client</a>, <a href="http://www.pidgin.im/">Pidgin IM</a>, the <a href="http://www.videolan.org/vlc/">VLC</a> and <a href="http://www.getmiro.com/">Miro</a> video players, the <a href="http://www.openoffice.org">OpenOffice Suite</a>, <a href="http://audacity.sourceforge.net/">Audacity</a> for simple audio editing, <a href="http://www.gimp.org/">GIMP</a> for photo-editing, and <a href="http://www.blender.org/">Blender</a> for 3D modeling and animation.</p>
<h3><a name="_macos_x"></a>MacOS X</h3>
<p>MacOS X is based on FLOSS and in fact you can download the basic OS (<a href="http://developer.apple.com/Darwin/">Darwin</a>) for free.  The GUI add-on bits and proprietary applications are what Apple charges for.  Besides stand-alone FLOSS applications for the Mac, there are 2 systems that allow you easy access to thousands of the same FLOSS applications that run on Linux: <a href="http://www.macports.org">the Ports system</a> and <a href="http://www.finkproject.org">fink</a>.  These 2 systems use the same kind of installation architecture that the Linux <strong>apt</strong> system uses to handle and resolve the complications and dependencies of installing thousands of applications.  Note however, that most of these are apps of the geeky variety: commandline, powerful and user-surly.  There are some native Aqua apps such as <a href="http://www.neooffice.org">NeoOffice</a> (a Mac-native port of OpenOffice), <a href="http://tuppis.com/smultron/">the Smultron text editor</a>, <a href="http://cyberduck.ch/">Cyberduck</a>, as well as a number of the same apps as on Windows: Firefox/Thunderbird, Audacity, Blender, Gimp, VLC, etc (see above for links).</p>
<h3><a name="_linux"></a>Linux</h3>
<p>On Linux, as opposed to Windows and the Mac, you have to go looking for applications to pay for.  Almost <strong>everything</strong> is free, and most of the default applications are of very high quality.  I use Linux on my laptop and with the KDE desktop, I can point&#8217;n&#8217;click&#8217;n&#8217;drag&#8217;n&#8217;drop or use the commandline to do what I need.  I use Firefox, <a href="http://www.konqueror.org/">Konqueror</a>, and <a href="http://www.opera.com/">Opera</a> for browsing, <a href="http://www.nedit.org/">nedit</a> for text editing, <a href="http://kate-editor.org/">kate</a> or <a href="http://www.jedit.org/">jedit</a> (Java) as a programming editor, <a href="http://www.kontact.org/">Kontact</a> for a combined POPmail client / calendar / RSS reader / contact manager, <a href="http://www.openoffice.org">OpenOffice</a> for when I have to view MS Office docs (works about as well as one version of MS Office reading another version of MS Office), <a href="http://www.skype.com">Skype</a> &amp; <a href="http://gizmo5.com/">Gizmo</a> for <a href="http://en.wikipedia.org/wiki/VoIP">VOIPing</a>, VNC (<a href="http://www.realvnc.com/">RealVNC</a> or <a href="http://www.tightvnc.com/">TightVNC</a>) for remote control and desktop sharing, <a href="http://kpdf.kde.org/">kpdf</a> or <a href="http://okular.kde.org/">Okular</a> for viewing PDF docs (Adobe makes a Linux client, but the free ones are smaller and faster), <a href="http://www.digikam.org/">digiKam</a> or <a href="http://picasa.google.com/">Picasa</a> for direct downloading &amp; handling the zillion photos I have, <a href="http://www.videolan.org/vlc/">VLC</a> or <a href="http://www.mplayerhq.hu">Mplayer</a> and flash for video, <a href="http://audacity.sourceforge.net/">Audacity</a> for audio editing and <a href="http://amarok.kde.org/">Amarok</a> for handling my music collection and podcasts.</p>
<h3><a name="_virtualization_and_universal_applications"></a>Virtualization and Universal Applications</h3>
<p>With free <a href="http://en.wikipedia.org/wiki/Virtualization">Virtualization technology</a>, you can run any combination of Windows, MacOS X, and Linux simultaneously on the same computer.  If you haven&#8217;t guessed, I use Linux.  However, when I have to use Windows to debug a problem, I fire up the free <a href="http://www.virtualbox.org/">VirtualBox</a> virtual machine and run WinXP simultaneously. (Windows in VirtualBox actually boots faster and seems faster than Windows running native on my laptop, with the exception that the graphics are noticeably slower.)  Note that you <em>do</em> have to pay for the Windows or Mac OS that you run virtualized, altho you can run the <a href="http://www.winehq.org/">WINE Windows environment</a> for free.</p>
<p>There are also interpreted languages which allow programs written in that language to be run fairly easily on all platforms.  The most widely known of such languages is <a href="http://en.wikipedia.org/wiki/Java_(programming_language)">Java</a>, but <a href="http://en.wikipedia.org/wiki/Perl">Perl</a>, <a href="http://en.wikipedia.org/wiki/Python_(programming_language)">Python</a>, <a href="http://en.wikipedia.org/wiki/Ruby_(programming_language)">Ruby</a>, and <a href="http://en.wikipedia.org/wiki/PHP_(programming_language)">PHP</a> all can be used across platforms fairly well.</p>
<hr />
<h2><a name="_other_advantages_of_floss"></a>Other Advantages of FLOSS</h2>
<h3><a name="_scaling"></a>Scaling</h3>
<p>You don&#8217;t pay for the software..at all.  It&#8217;s free to download, to use, to give to your friends, and in many cases to modify or include in your own software if you&#8217;re so inclined (tho with caveats; the individual FLOSS licenses differ considerably).  Whether you support your family or a large organization, the scaling advantages of FLOSS are hard to overstate.</p>
<p>If you learn to use FLOSS, you&#8217;ll find that you can do without a lot of commercial software.  Those NegaBIT$ add up, not only from the initial non-payment, but especially if totaled up over a decade, or .. your lifetime.  And for many such systems, even if they are initially primitive to the senses (<a href="http://www.vim.org/">vi/vim</a> and <a href="http://www.gnu.org/software/emacs/">emacs</a>, for example), once you learn them, the interface doesn&#8217;t change (see <a href="#featureitis">Interface Changes &amp; Featuritis</a> below), they&#8217;re available on almost every system you&#8217;ll try and they are enormously powerful and robust.  That is a tremendous advantage.</p>
<h3><a name="_upgrading_the_os_amp_application_software"></a>Upgrading the OS &amp; Application Software</h3>
<p>Upgrading is much easier.  On my Linux system, a system-wide upgrade can be done with 2 lines in the terminal:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre># the following updates all the versioning info

sudo apt-get update

# the following upgrades the entire system, including kernel,
# kernel modules, applications, utilities, and all the supporting
# libraries, documentation, source code, and other software.

sudo apt-get dist-upgrade</pre>
</td>
</tr>
</table>
<p>And if you don&#8217;t like typing, the <a href="http://www.nongnu.org/synaptic/action.html">Synaptic GUI</a> will let you do it with a few clicks.</p>
<p>Note that I didn&#8217;t have to go to individual web sites for applications, disable my anti-virus software, spend hours on the phone to Bangalore convincing some guy that I really did pay for my copy of <em>Planet Blortfarg</em>, carve chunks out of my credit card, spend hours checking why my drivers didn&#8217;t work anymore, or do any of the other things a Windows upgrade often entails. It just pretty much works.</p>
<h3><a name="featureitis"></a>Interface Changes &amp; Featuritis</h3>
<p>Often, I find that the interfaces (commandline or GUI) of FLOSS do not change as often as those of commercial software.  Like car manufacturers, proprietary SW Vendors have to make visual changes in their product at each release cycle to make you perceive that their product is changing for the better.  But as anyone who has searched for an extra 10 minutes to find out where the <em>subscript button</em> has moved to can tell you, change is not always for the better.  When the interface is changed semi-randomly like this, the vendor is telling you &#8220;We don&#8217;t care how your productivity may suffer.  We have a new way of doing things and you&#8217;re going to learn it.&#8221;  In FLOSS, quite often the interface will stay the same or quite similar to the previous version, but the guts will be improved (and those improvements will typically be listed in their accompanying change log).</p>
<h3><a name="_legality_and_risk"></a>Legality and Risk</h3>
<p>If you use FLOSS, you escape forever the risk that a large, hungry, deep-pocketed software company will come looking to audit you for piracy or license abuse.  This alone is a <a href="http://news.cnet.com/2008-1082_3-5065859.html">powerful argument</a> for using FLOSS.</p>
<p>Note also that if you are involved in supporting software for a large organization, there are additional costs of managing a large roll-out of commercial software besides the unit costs. There are also the costs of negotiating license agreements, packaging and distributing the software, tracking the use and leakage of such software, and providing the accounting reports of such use.  For a UCI-sized campus, there are probably 3-4 FTEs involved in this process.</p>
<h3><a name="_some_cautions"></a>Some Cautions</h3>
<p>This may paint a rosier picture for FLOSS than I intend.  Good software is always hard to write and maintain.  That a software package is free does not assure high quality.  However, if that software makes it into the Linux repositories, it has at least gotten numerous recommendations from users, has received a cursory evaluation from the distribution team, and has passed a compilation and compatibility test with the rest of the software.  It may well not be up to your standards, but if that&#8217;s the case, you haven&#8217;t wasted anything but a few minutes in installing and testing it.</p>
<p>And uninstalling it is just as easy:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>sudo apt-get remove [package name]</pre>
</td>
</tr>
</table>
<hr />
<h2><a name="_where_does_floss_not_work_well"></a>Where does FLOSS not work well?</h2>
<p>FLOSS fails when there are software packages that you depend on that simply do not exist in the FLOSS world.  <a href="http://www.gimp.org">The GIMP</a> and <a href="http://www.koffice.org/krita/">Krita</a> are both very good digital image editing programs, but overall not of the same quality or depth of Adobe Photoshop.  So if you depend on the extra features or formats of Photoshop, you have no choice.  However, like many such programs, the FLOSS equivalent is designed to address the features requested by most of its users and for most users, I bet that the GIMP is as much as they need.  Or more than they need &#8211; most users would probably be happy with <a href="http://picasa.google.com/">Picasa</a>, a free (but not Open Source) application from Google.</p>
<p>While there are many excellent mathematics applications in the FLOSS world (<a href="http://www.gnu.org/software/octave/">Octave</a>, <a href="http://www.scilab.org/">Scilab</a>, <a href="http://www.r-project.org/">R</a>, and <a href="http://www.sagemath.org/">sage</a>, and <a href="http://www.opensourcemath.org/opensource_math.html">many more</a>) there&#8217;s nothing like <a href="http://wolfram.com/products/mathematica/index.html">Mathematica</a>.  It is one of the best arguments for commercial software. (But also note that Mathematica and <a href="http://www.mathworks.com/products/matlab/">MATLAB</a> have Linux versions).</p>
<p>Integration can be more difficult.  FLOSS is often written without regard for what other software an organization needs, and therefore some cutting and gluing at the interface may be necessary.  This is less troublesome than previously because of the near-universal use of XML, which allows output formats to be parsed more easily.  Similarly, connection to standard relational databases and the use of SQL can often help resolve integration problems.</p>
<hr />
<h2><a name="_questions_comments_updates"></a>Questions?  Comments? Updates?</h2>
<p>If you have comments, questions, criticisms, please let me know.  I&#8217;m &lt;<a href="mailto:harry.mangalam@uci.edu">harry.mangalam@uci.edu</a>&gt;</p>
<p>The latest version of this document can be found <a href="http://moo.nac.uci.edu/~hjm/OSS_NACS_News.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/14/save-on-software-by-not-paying-for-it-legally/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>
	</item>
		<item>
		<title>Manipulating Data on Linux</title>
		<link>http://hjmangalam.wordpress.com/2009/09/14/manipulating-data-on-linux/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/14/manipulating-data-on-linux/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 18:06:59 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Linux & Open Source]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=11</guid>
		<description><![CDATA[Note Assumptions I&#8217;m assuming that you&#8217;re logged into a bash shell on a Linux system with most of the usual Linux utilities installed as well as R. You should create a directory for this exercise &#8211; name it anything you want, but I&#8217;ll refer to it as $DDIR for DataDir. You can as well by [...]]]></description>
			<content:encoded><![CDATA[<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Assumptions</b></p>
<p>I&#8217;m assuming that you&#8217;re logged into a bash shell on a Linux system with most of the usual Linux utilities installed as well as R.  You should create a directory for this exercise &#8211; name it anything you want, but I&#8217;ll refer to it as $DDIR for DataDir.  You can do the same by assigning the real name to the shell variable DDIR:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>export DDIR=/the/name/you/gave/it</pre>
</td>
</tr>
</table>
<p>Shell commands are prefixed by <strong>bash &gt;</strong> and can be moused into your own shell to test, including the embedded comments (prefixed by <em>#</em>; they will be ignored).  Do not, of course, include the <strong>bash &gt;</strong> prefix.</p>
<p>Also, all the utilities described here will be available on the interactive <a href="http://moo.nac.uci.edu/%7ehjm/BDUC_USER_HOWTO.html">BDUC cluster</a> nodes at UC Irvine.  Unless otherwise stated, they are also freely available for any distribution of Linux.</p>
</td>
</tr>
</table>
<hr />
<h2><a name="_introduction"></a>Introduction</h2>
<p>If you&#8217;re coming from Windows, the world of the Linux command line can be perplexing &#8211; you have to know what you want before you can do anything &#8211; there&#8217;s nothing to click, no wizards, few hints.  So let me supply a few&#8230;</p>
<p>I assume you&#8217;ve been forced to the Linux shell prompt somewhat against your will and you have no burning desire to learn the cryptic and agonizing commands that form the basis of <a href="http://xkcd.org">xkcd</a> and other insider jokes.  You want to get your work done and you want to get it done fast.</p>
<p>However, there are some very good reasons for using the commandline for doing your data processing.  With more instruments providing digital output and new technologies providing LOTS (TBs) of digital data, trying to handle this data with Excel is <strong>just not going to work</strong>.  And <em>Comma Separated Value</em> (CSV) files are probably not going to be much help either in a bit.  But we can deal with all of these on Linux using some fairly simple, free utilities. The overwhelming majority of tools that you&#8217;ll use on Linux are free.  There are also some good proprietary tools that have been ported to Linux (<em>MATLAB, Mathematica, SAS</em>, etc), but I&#8217;m going to concentrate on the free ones until I hit a roadblock.  This is also not to say that you can&#8217;t do much of this on Windows, using native utilities and applications, but it&#8217;s second nature on Linux and it works better as well.  The additional point is that it will ALWAYS work like this on Linux.  No learning a new interface every 6 months because <em>Microsoft</em> or <em>Apple</em> need to bump their profit by releasing yet another pointless upgrade that you have to pay for in time and money.</p>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>MacOSX</b></p>
<p>Note that almost all the tools described here are available in identical form for the Mac.  If you aren&#8217;t using <a href="http://en.wikipedia.org/wiki/Fink">fink</a> or <a href="http://en.wikipedia.org/wiki/MacPorts">MacPorts</a>, you&#8217;re missing a huge chunk of functionality.  I&#8217;m not an expert, but I&#8217;ve used both and both work very well.  Tho they both provide Open Source packages to MacOSX, if there&#8217;s a philosophical difference, fink seems to be more tilted to GNU (the Mac is just another vector for OSS) and MacPorts seems to be tilted to Apple (how to make the Mac the best platform for OSS).  Both are very good, but you should choose one and stick to it, as the packages they provide will eventually conflict.</p>
</td>
</tr>
</table>
<p>There is a great, free, self-paced tutorial called <a href="http://swc.scipy.org">Software Carpentry</a> that examines much of what I&#8217;ll be zooming thru in better detail.  The title refers to the general approach of Unix and Linux: use of simple, well-designed tools that tend to do one job very well (think saw or hammer).  Unlike the physical tools, tho, the output of one can be piped into the input of another to form a surprisingly effective (if simple) workflow for many needs.</p>
<p><a href="http://www.showmedo.com">Showmedo</a> is another very useful website, sort of like <strong>YouTube for Computer tools</strong>.  It has tutorial videos covering Linux, Python, Perl, Ruby, the bash shell, Web tools, etc.  And especially for beginners, it has a section related specifically to the above-referenced <a href="http://www.showmedo.com/videos/series?name=pQZLHo5Df">Software Carpentry series</a>.</p>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>the man pages</b></p>
<p>Both Linux and MacOSX have internal documentation called <em>man pages</em> that documents how to use most of the features of a particular command.  Compared to a web page, they are pretty awful.  Compared to nothing, they&#8217;re pretty good.  They are often written by the utility&#8217;s author, so they tend to be terse, technical, tedious, and worst of all, lack examples.  However, since they are reviewed by other Linux users, they do tend to be accurate.  To read them, just prefix the name of the utility with the word <em>man</em> (e.g. <em>man join</em>).</p>
<p>On the other hand, there&#8217;s Google, which has dramatically changed support issues in that once <em>someone</em> has resolved an issue, <em>everyone</em> can see how they did it (unless the hosting company benefits from keeping it secret &#8211; another benefit of Open Source Software vs Proprietary Software).</p>
</td>
</tr>
</table>
<p>OK, let&#8217;s start.</p>
<hr />
<h2><a name="_getting_your_data_to_and_from_the_linux_host"></a>Getting your data to and from the Linux host.</h2>
<p>This has been covered in many such tutorials, and the short version is to use <a href="http://www.winscp.com">WinSCP</a> on Windows and <a href="http://cyberduck.ch/">Cyberduck</a> on the Mac.  I&#8217;ve written more on this, so go and take a look if you want the <a href="http://moo.nac.uci.edu/%7ehjm/BDUC_USER_HOWTO.html#toc3">short and sweet version</a> or the <a href="http://moo.nac.uci.edu/%7ehjm/HOWTO_move_data.html">longer, more sophisticated version</a>.  Also, see the MacOSX note above &#8211; you can do all of this on the Mac as well, altho there are some Linux-specific details that I&#8217;ll try to mention.</p>
<p>OK, you&#8217;ve got your data to the Linux host&#8230; Now what..?</p>
<hr />
<h2><a name="_simple_data_files_examination_and_slicing"></a>Simple data files &#8211; examination and slicing</h2>
<h3><a name="file"></a>What kind of file is it?</h3>
<p>First, even tho you might have generated the data in your lab, you might not know what kind of data it is.  While it&#8217;s not foolproof, a tool that may help is called <strong>file</strong>.  It tries to answer the question: &#8220;What kind of file is it?&#8221;  Unlike the Windows approach that maps file name endings to a particular type (filename.typ), <em>file</em> actually peeks inside the file to see if there are any diagnostic characteristics, so especially if the file has been renamed or name-mangled in translation, it can be very helpful.</p>
<p>For example:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; file /home/hjm/z/netcdf.tar.gz
/home/hjm/z/netcdf.tar.gz: gzip compressed data, from Unix, last modified: Thu Feb 17 13:37:35 2005

# now I'll copy that file to one called 'anonymous'

bash &gt; cp /home/hjm/z/netcdf.tar.gz  anonymous

bash &gt; file anonymous
anonymous: gzip compressed data, from Unix, last modified: Thu Feb 17 13:37:35 2005

# see - it still works.</pre>
</td>
</tr>
</table>
<h3><a name="_how_big_is_the_file"></a>How big is the file?</h3>
<p>We&#8217;re going to use a 25MB tab-delimited data file called <a href="http://moo.nac.uci.edu/%7ehjm/red+blue_all.txt">red+blue_all.txt</a>.  Download it by clicking on the previous link.  Save it to the $DDIR directory you&#8217;ve created for this exercise.</p>
<p><a name="ls"></a>We can get the total bytes with <em>ls</em></p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; mkdir -p ~/where/you/want/the/DDIR   # make the DDIR (-p also creates missing parent dirs)
bash &gt; export DDIR=~/where/you/want/the/DDIR  # create a shell variable to store it
bash &gt;  cd $DDIR               # cd into that data dir
bash &gt;  ls -l red+blue_all.txt
-rw-r--r-- 1 hjm hjm 26213442 2008-09-10 16:49 red+blue_all.txt
                     ^^^^^^^^</pre>
</td>
</tr>
</table>
<p>or in <em>human-readable form</em> with:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt;  ls -lh red+blue_all.txt
#            ^
-rw-r--r-- 1 hjm hjm 25M 2008-09-10 16:49 red+blue_all.txt
#                    ^^^ (25 Megabytes)</pre>
</td>
</tr>
</table>
<p><a name="wc"></a>We can get a little more information using <em>wc</em> (wordcount)</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt;  wc red+blue_all.txt
  385239  1926195 26213442 red+blue_all.txt</pre>
</td>
</tr>
</table>
<p><strong>wc</strong> shows that it&#8217;s 385,239 lines, 1,926,195 words, and 26,213,442 bytes (the same number that <em>ls -l</em> reported).</p>
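<p>If you only need one of those numbers, <em>wc</em> can report them individually.  The following is a minimal sketch, not part of the original walk-thru: it assumes a tiny made-up tab-delimited file (<em>sample.txt</em>) in a throwaway temp directory standing in for $DDIR, and it uses <em>awk</em> to count the columns on the first line.</p>

```shell
# Throwaway working directory and a tiny tab-delimited sample file
# (a hypothetical stand-in for red+blue_all.txt).
DDIR=$(mktemp -d)
cd "$DDIR"
printf 'id\tname\tvalue\n1\tred\t0.5\n2\tblue\t0.7\n' > sample.txt

# Reading from STDIN (note the <) stops wc from echoing the file name.
nlines=$(wc -l < sample.txt)     # lines only
nbytes=$(wc -c < sample.txt)     # bytes only

# Count the tab-separated fields on the first line with awk.
ncols=$(head -n 1 sample.txt | awk -F'\t' '{print NF}')

echo "lines: $nlines  bytes: $nbytes  columns: $ncols"
```

<p>Knowing the number of lines and columns before you start slicing makes it much easier to check that a later step has not silently dropped data.</p>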
<h3><a name="_native_spreadsheet_programs_for_linux"></a>Native Spreadsheet programs for Linux</h3>
<p>In some cases, you&#8217;ll want to use a spreadsheet application to review a spreadsheet. (For the few of you who don&#8217;t know how to use spreadsheets for data analysis (as opposed to just looking at columns of data), there is a <a href="http://swc.scipy.org/lec/spreadsheets.html">Linux-oriented tutorial</a> at the oft-referenced Software Carpentry site.)</p>
<p>While Excel is the acknowledged leader in this application area, there are some very good native free spreadsheets available on Linux that behave very similarly to Excel. There&#8217;s a good exposition on spreadsheet history as well as links to free and commercial spreadsheets for Linux <a href="http://www.cbbrowne.com/info/spreadsheets.html">here</a> and a <a href="http://en.wikipedia.org/wiki/Comparison_of_spreadsheets">good comparison of various spreadsheets</a>.</p>
<p>For normal use, I&#8217;d suggest either:</p>
<p> <a name="openoffice"></a>
<ul>
<li> <a href="http://en.wikipedia.org/wiki/OpenOffice.org_Calc">OpenOffice calc</a> (oocalc) </li>
<li> <a href="http://en.wikipedia.org/wiki/Gnumeric">Gnumeric</a> (gnumeric). </li>
</ul>
<p><a name="gnumeric"></a>The links will give you more information on them and either will let you view and edit most Excel spreadsheets.   In addition, there are 2 Mac-native versions of OpenOffice, one a port from the OpenOffice group and one a fork:</p>
<p> <a name="neooffice"></a>
<ul>
<li> <a href="http://porting.openoffice.org/mac/download/aqua.html">OpenOffice Aqua</a> the X11 port of OpenOffice </li>
<li> <a href="http://www.neooffice.org">NeoOffice</a>  &#8211; a native fork of OpenOffice. </li>
</ul>
<p>A NeoOffice-supplied <a href="http://neowiki.neooffice.org/index.php/NeoOffice_Feature_Comparison">comparison chart</a> may be worth reviewing.</p>
<p>While both OpenOffice and MS Office have bulked up considerably in their officially <a href="http://en.wikipedia.org/wiki/OpenOffice.org_Calc">stated capacity</a>, often spreadsheets are not appropriate to the kind of data processing we want to do.   So how to extract the data?</p>
<p>The first way is the most direct and may in the end be the easiest &#8211; open the files (<strong>oowriter</strong> for Word files, <strong>oocalc</strong> for Excel files) and export the data as plain text.</p>
<p>For Word files:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>File Menu &gt; Save as... &gt; set Filter to text (.txt), specify directory and name &gt; [OK]</pre>
</td>
</tr>
</table>
<p>For Excel files:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>File Menu &gt; Save As... &gt; set Filter: to text CSV (.csv), specify directory and name &gt; [OK] &gt;
   set 'Field delimiter' and 'Text delimiter' &gt; [OK]</pre>
</td>
</tr>
</table>
<p>Or use the much faster method below.</p>
<h3><a name="_extracting_data_from_ms_excel_and_ms_word_files_files"></a>Extracting data from MS Excel and MS Word files</h3>
<p>The above method of extracting data from a binary MS file requires a fair amount of clicking and mousing.  The <em>Linux Way</em> would be to use a commandline utility to do it in one line.</p>
<h4><a name="antiword"></a>Converting a Word file to text</h4>
<p>For MS Word documents, there is a utility, appropriately called&#8230; <strong>antiword</strong></p>
<p><strong>antiword</strong> does what the name implies; it takes a Word document and inverts it &#8211; turns it into a plain text document:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; time antiword some_MS_Word.doc &gt; some_MS_Word.txt

real    0m0.004s
user    0m0.004s
sys     0m0.000s

# took 0.004s!</pre>
</td>
</tr>
</table>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Timing your programs</b></p>
<p>The <em>time</em> prefix above returns the amount of time that the command that followed it used.  There are actually 2 <em>time</em> commands usually available on Linux; the one demo&#8217;ed above is the internal <em>bash</em> timer.  If you want more info, you can use the <em>system</em> timer which is usually <strong>/usr/bin/time</strong> and has to be explicitly called:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; /usr/bin/time antiword some_MS_Word.doc &gt; some_MS_Word.txt
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+16outputs (0major+262minor)pagefaults 0swaps

# or even more verbosely:
bash &gt; /usr/bin/time -v  antiword some_MS_Word.doc &gt; some_MS_Word.txt
        Command being timed: "antiword some_MS_Word.doc &gt; some_MS_Word.txt"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 0
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 263
        Voluntary context switches: 1
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 16
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0</pre>
</td>
</tr>
</table>
</td>
</tr>
</table>
<h4><a name="py_xls2csv"></a>Extracting an Excel spreadsheet</h4>
<p>There&#8217;s an excellent Excel extractor called <strong>py_xls2csv</strong>, part of the free <strong>python-excelerator</strong> package (on Ubuntu).  It works similarly to <em>antiword</em>:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; py_xls2csv BodaciouslyHugeSpreadsheet.xls &gt; BodaciouslyHugeSpreadsheet.csv</pre>
</td>
</tr>
</table>
<p><em>py_xls2csv</em> takes no options and saves output with commas as the only separator, which is generally what you want.</p>
<p>If you want to do further data mangling in Python, the <a href="http://pypi.python.org/pypi/xlrd">xlrd module</a> is a very good Excel reader, but is not an entire utility by itself.</p>
<p>If you are Perl-oriented and have many Excel spreadsheets to manipulate, extract, spindle and mutilate, the Perl Excel-handling modules are also very good.  <a href="http://www.ibm.com/developerworks/linux/library/l-pexcel/">Here is a good description</a> of how to do this.</p>
<hr />
<h2><a name="_viewing_and_manipulating_the_data"></a>Viewing and Manipulating the data</h2>
<p>Whether the file has been saved to a text format, or whether it&#8217;s still in a binary format, you will often want to examine it to determine the columnar layout if you haven&#8217;t already. You can do this either with a commandline tool or via the native application, often a spreadsheet.  If the latter, the OpenOffice spreadsheet app <strong>oocalc</strong> is the most popular and arguably the most capable and compatible Linux spreadsheet application. It will allow you to view Excel data in native format so you can determine which columns are relevant to continued analysis.</p>
<p>If the file is in text format or has been converted to it, it may be easier to use a text-mode utility to view the columns.  There are a few text-mode spreadsheet programs available, but they are overkill for simply viewing the layout.  Instead, consider using either a simple editor or the text mode utilities that can be used to view it.</p>
<h3><a name="editors"></a>Editors for Data</h3>
<p>Common, free GUI editors that are a good choice for viewing such tabular data are <a href="http://www.nedit.org/">nedit</a>, <a href="http://www.jedit.org">jedit</a>, and <a href="http://kate-editor.org">kate</a>.  All have unlimited horizontal scroll and rectangular copy and paste, which makes them useful for copying rectangular chunks of data.  Nedit and jedit are also easily scriptable and can record keystrokes to replay for repeated actions. Nedit has a <em>backlighting</em> feature, sometimes helpful in debugging a data file, that can visually distinguish tabs from spaces.   Jedit is written in Java so it&#8217;s portable across platforms and has tremendous support in the form of various <a href="http://plugins.jedit.org/">plugins</a>.</p>
<p>Also, while it is despised and adored in equal measure, the <a href="http://www.xemacs.org/">xemacs</a> editor can also do just about anything you want to do, if you learn enough about it.  It&#8217;s as much a lifestyle as an editor.</p>
<h3><a name="_text_mode_data_manipulation_utilities"></a>Text-mode data manipulation utilities</h3>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>STDIN, STDOUT, STDERR</b></p>
<p>Automatically available to your programs in Linux and all Unix work-alikes (including MacOSX) are the 3 channels noted above: Standard IN (STDIN, normally the keyboard), Standard OUT (STDOUT, normally the terminal screen), and Standard Error (STDERR, also normally the terminal screen). These channels can be intercepted, redirected, and piped in a variety of ways to further process, separate, aggregate, or terminate the processes that use them.  This is a whole topic by itself and is covered well in <a href="http://swc.scipy.org/lec/shell02.html">this Software Carpentry tutorial</a>.</p>
</td>
</tr>
</table>
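<p>As a rough illustration of those 3 channels (a sketch only, using made-up file names in a throwaway temp directory standing in for $DDIR), the commands below send STDOUT and STDERR to separate files, then pipe them together into another command:</p>

```shell
DDIR=$(mktemp -d)
cd "$DDIR"
echo "some data" > exists.txt

# ls writes the listing to STDOUT and the complaint about the missing
# file to STDERR; send each to its own file.  ls exits non-zero here,
# which is expected, hence the '|| true'.
ls exists.txt no_such_file.txt > out.txt 2> err.txt || true

# A pipe connects one command's STDOUT to the next command's STDIN.
nfound=$(cat out.txt | wc -l)

# 2>&1 folds STDERR into STDOUT so both travel down the same pipe.
nboth=$(ls exists.txt no_such_file.txt 2>&1 | wc -l)

echo "STDOUT lines: $nfound, STDOUT+STDERR lines: $nboth"
```

<p>Keeping STDERR out of your data file this way means an error message never ends up masquerading as a row of data.</p>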
<h4><a name="grep"></a>The grep family</h4>
<p>Possibly the most used utilities in the Unix/Linux world.  These elegant utilities are used to search files for patterns of text called <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expressions</a> (aka regex) and can select or omit a line based on the matching of the regex.  The most popular of these is the basic grep and it, along with some of its brethren, is <a href="http://en.wikipedia.org/wiki/Grep">described well in Wikipedia</a>.  Another variant which behaves similarly but with one big difference is <a href="http://en.wikipedia.org/wiki/Agrep">agrep</a>, or <strong>approximate</strong> grep, which can search for patterns with variable numbers of errors, such as might be expected in a file produced by an optical scan or containing typos.  Baeza-Yates and Navarro&#8217;s <a href="http://www.dcc.uchile.cl/~gnavarro/software/">nrgrep</a> is even faster, if not as flexible (or <em>differently</em> flexible), as agrep.</p>
<p>A grep variant would be used to extract all the lines from a file that had a particular phrase or pattern embedded in it.</p>
<p>For example:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; wc /home/hjm/FF/CyberT_C+E_DataSet
  600  9000 52673 /home/hjm/FF/CyberT_C+E_DataSet
# so the file has 600 lines.  If we wanted only the lines that had the identifier 'mur', followed by anything, we could extract it:

# (passed thru 'scut' (see the 'cut &amp; scut' section below) to trim the extraneous cols.)
bash &gt; grep mur /home/hjm/FF/CyberT_C+E_DataSet | scut --c1='0 1 2 3 4'
b0085   murE    6.3.2.13        0.000193129     0.000204041
b0086   murF+mra        6.3.2.15        0.000154382     0.000168569
b0087   mraY+murX       2.7.8.13        1.41E-05        1.89E-05
b0088   murD    6.3.2.9 0.000117098     0.000113005
b0090   murG    2.4.1.- 0.000239323     0.000247582
b0091   murC    6.3.2.8 0.000245371     0.00024733

# if we wanted only the id's murD and murG:
bash &gt; grep 'mur[DG]' /home/hjm/FF/CyberT_C+E_DataSet | scut --c1='0 1 2 3 4'
b0088   murD    6.3.2.9 0.000117098     0.000113005
b0090   murG    2.4.1.- 0.000239323     0.000247582</pre>
</td>
</tr>
</table>
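<p>A few other grep options come up constantly.  The sketch below is illustrative only: it builds a small made-up file (<em>genes.txt</em>, borrowing a few ids from the output above plus one invented <em>lacZ</em> row) in a throwaway temp directory standing in for $DDIR, then counts, inverts, and case-folds the match.</p>

```shell
DDIR=$(mktemp -d)
cd "$DDIR"
printf 'b0085\tmurE\nb0088\tmurD\nb0090\tmurG\nb0100\tlacZ\n' > genes.txt

# -c counts matching lines instead of printing them.
n_mur=$(grep -c 'mur' genes.txt)

# -v inverts the match: keep only lines that do NOT contain 'mur'.
grep -v 'mur' genes.txt > not_mur.txt

# -i ignores case; quoting the pattern keeps the shell from
# treating [DG] as a filename wildcard.
grep -i 'MUR[DG]' genes.txt > murDG.txt

echo "mur lines: $n_mur"
cat not_mur.txt murDG.txt
```

<p>Quoting the regex is a good habit even when it is not strictly needed; an unquoted pattern can be rewritten by the shell if a matching file name happens to exist in the current directory.</p>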
<h4><a name="cat"></a>cat</h4>
<p><strong>cat</strong> (short for <em>concatenate</em>) is one of the simplest Linux text utilities.  It simply dumps the contents of the named file (or files) to STDOUT, normally the terminal screen.  However, because it <em>dumps</em> the file(s) to STDOUT, it can also be used to concatenate multiple files into one.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; cat greens.txt
turquoise
putting
sea
leafy
emerald
bottle

bash &gt; cat blues.txt
cerulean
sky
murky
democrat
delta

# now concatenate the files
bash &gt; cat greens.txt blues.txt &gt;greensNblues.txt

# and dump the concatenated file
bash &gt; cat greensNblues.txt
turquoise
putting
sea
leafy
emerald
bottle
cerulean
sky
murky
democrat
delta</pre>
</td>
</tr>
</table>
<h4><a name="moreless"></a>more &amp;  less</h4>
<p>These critters are called <em>pagers</em> &#8211; utilities that allow you to page thru text files in the terminal window.  <em>less is more than more</em> in my view, but your mileage may vary.  These pagers allow you to queue up a series of files to view, can scroll sideways, allow search by <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expression</a>, show progression thru a file, spawn editors, and many more things. <a href="http://www.showmedo.com/videos/video?name=940030&amp;fromSeriesID=94">Video example</a></p>
<h4><a name="headtail"></a>head &amp; tail</h4>
<p>These two utilities perform similar functions &#8211; they allow you to view the beginning (<em>head</em>) or end (<em>tail</em>) of a file.  Both can be used to select contiguous ends of a file and pipe it to another file or pager.  <em>tail -f</em> can also be used to view the end of a file continuously (as when you have a program continuously generating output to a file and you want to watch the progress).  These are also described in more detail near the end of the <a href="http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#12">IBM DeveloperWorks tutorial</a>.</p>
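<p>A minimal sketch of the most common uses, assuming a small made-up file (<em>small.txt</em>) in a throwaway temp directory standing in for $DDIR:</p>

```shell
DDIR=$(mktemp -d)
cd "$DDIR"
printf 'header\nrow1\nrow2\nrow3\nrow4\n' > small.txt

first_two=$(head -n 2 small.txt)      # first 2 lines
last_two=$(tail -n 2 small.txt)       # last 2 lines
no_header=$(tail -n +2 small.txt)     # everything from line 2 on

# Pick out lines 2-3 by piping head into tail.
middle=$(head -n 3 small.txt | tail -n 2)

echo "$middle"
```

<p>The <em>tail -n +2</em> form is handy for stripping a header line before handing a file to a tool that expects only data.</p>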
<h4><a name="scut"></a>cut &amp; scut</h4>
<p>These are columnar slicing utilities, which allow you to slice vertical columns of characters or fields out of a file, based on character offset or column delimiters.  <a href="http://lowfatlinux.com/linux-columns-cut.html">cut</a> is on every Linux system and works very quickly, but is fairly primitive in its ability to select and separate data.  <a href="http://forums.nacs.uci.edu/BioBB/viewtopic.php?f=10&amp;t=7">scut</a> is a Perl utility which trades some speed for much more flexibility, allowing you to select data not only by character column and single character delimiters, but also by data fields identified by any delimiter that a <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expression</a> (aka regex) can define.  It can also re-order columns and sync fields to those of another file, much like the <strong>join</strong> utility (<a href="#join">see below</a>).  See <a href="http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html">this link</a> for more info.</p>
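<p>Since <em>cut</em> is the one you can count on being installed, here is a short sketch of its 3 selection modes.  It covers <em>cut</em> only (not <em>scut</em>) and assumes a small made-up tab-delimited file (<em>genes.txt</em>) in a throwaway temp directory standing in for $DDIR.</p>

```shell
DDIR=$(mktemp -d)
cd "$DDIR"
printf 'b0085\tmurE\t6.3.2.13\nb0088\tmurD\t6.3.2.9\n' > genes.txt

# Fields 1 and 3; TAB is cut's default delimiter.
cut -f 1,3 genes.txt > ids_and_ec.txt

# A different single-character delimiter is set with -d.
second=$(echo 'red,green,blue' | cut -d ',' -f 2)

# Slice by character position rather than by field.
prefix=$(echo 'b0085' | cut -c 1-2)

cat ids_and_ec.txt
echo "$second $prefix"
```

<p>Note that <em>cut</em> always emits fields in their original order, whatever order you list them in; re-ordering columns is one of the jobs where a more flexible tool is needed.</p>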
<h4><a name="cols"></a>cols</h4>
<p><em>cols</em> is a very simple, but arguably useful utility that allows you to view the data of a file aligned according to fields.  Especially in conjunction with <em>less</em>, it&#8217;s useful if you&#8217;re manipulating a file that has 10s of columns, especially if those columns are of disparate widths. <a href="http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html">cols is explained fairly well here</a> and the <a href="http://moo.nac.uci.edu/%7ehjm/cols">cols code is available here</a>.</p>
<h4><a name="paste"></a>paste</h4>
<p><em>paste</em> can join 2 files <em>side by side</em> to provide a horizontal concatenation. For example:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>bash &gt; cat file_1
aaa bbb ccc ddd
eee fff ggg hhh
iii jjj kkk lll

bash &gt; cat file_2
111 222 333
444 555 666
777 888 999

bash &gt; paste file_1 file_2
aaa bbb ccc ddd         111 222 333
eee fff ggg hhh         444 555 666
iii jjj kkk lll         777 888 999</pre>
</td>
</tr>
</table>
<p>Note that <em>paste</em> inserted a TAB character between the 2 files, each of which used spaces between each field.  See also the <a href="http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#8">IBM DeveloperWorks tutorial</a>.</p>
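<p>If a TAB is not the separator you want, <em>paste</em> can be told to use another one.  The sketch below assumes 2 small made-up files (<em>letters.txt</em> and <em>numbers.txt</em>) in a throwaway temp directory standing in for $DDIR.</p>

```shell
DDIR=$(mktemp -d)
cd "$DDIR"
printf 'aaa\nbbb\nccc\n' > letters.txt
printf '111\n222\n333\n' > numbers.txt

# -d sets the character placed between the pasted files.
paste -d ',' letters.txt numbers.txt > pasted.csv

# -s pastes each file's lines into a single row instead.
paste -s -d ',' letters.txt > one_row.csv

cat pasted.csv one_row.csv
```

<p>Remember that <em>paste</em> matches lines purely by position, so both files need to be in the same row order for the result to make sense.</p>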
<h4><a name="join"></a>join</h4>
<p><em>join</em> is a more powerful variant of <em>paste</em> that acts as a simple relational <em>join</em> based on common fields.  <em>join</em> needs identical field values to finish the join.  Note that <em>scut</em> (<a href="#scut">described above</a>) can also do this type of operation.  For much more powerful relational operations, <a href="http://www.sqlite.org">SQLite</a> is a fully featured relational database that can do this reasonably easily (<a href="#sqlite">see below</a>).  <a href="http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#9">A good example of <em>join</em> is here</a>.</p>
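<p>A small sketch of how this works in practice, assuming 2 made-up space-delimited files (<em>names.txt</em> and <em>ec.txt</em>, with an invented <em>b0099</em> row) in a throwaway temp directory standing in for $DDIR.  The one thing that trips most people up is that <em>join</em> expects both inputs to be sorted on the join field.</p>

```shell
DDIR=$(mktemp -d)
cd "$DDIR"
printf 'b0088 murD\nb0085 murE\nb0090 murG\n' > names.txt
printf 'b0085 6.3.2.13\nb0090 2.4.1.-\nb0099 9.9.9.9\n' > ec.txt

# join expects both inputs sorted on the join field (field 1 by default).
sort names.txt > names.sorted
sort ec.txt > ec.sorted

# Only ids present in BOTH files survive (an inner join).
join names.sorted ec.sorted > joined.txt

# -a 1 also keeps unpairable lines from the first file.
join -a 1 names.sorted ec.sorted > left_joined.txt

cat joined.txt
```

<p>Here <em>b0088</em> has no partner in the second file and <em>b0099</em> has none in the first, so neither appears in <em>joined.txt</em>; the <em>-a 1</em> variant keeps <em>b0088</em> with its missing field simply absent.</p>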
<h4><a name="_pr"></a>pr</h4>
<p><em>pr</em> is actually a printing utility that is mentioned here because for some tasks especially related to presentation, it can join files together in formats that are impossible to do using any other utility.  For example if you want the width of the printing to expand to a nonstandard width or want to columnize the output in a particular way or modify the width of tab spacing, <em>pr</em> may be able to do what you need.</p>
<hr />
<h2><a name="hdf5"></a>Complex Binary Data storage and tools</h2>
<p>While much data is available in (or can be converted to) text format, some data is so large (typically, &gt;1 GB) that it demands special handling.  Data sets from the following domains are typically packaged in specialized binary formats:</p>
<ul>
<li> Global Climate Modeling </li>
<li> Stock exchange transactions </li>
<li> Confocal images </li>
<li> Satellite and other terrestrial scans </li>
<li> Microarray and other genomic data </li>
</ul>
<p>There are a number of specialized large-data formats, but I&#8217;ll discuss a popular large-scale data format called <a href="http://en.wikipedia.org/wiki/Hierarchical_Data_Format">HDF5</a>, which has now been merged with the <a href="http://en.wikipedia.org/wiki/NetCDF">netCDF</a> data format.  These can be thought of as numeric databases, tho they have significant functional overlap with relational databases. One advantage is that they have no requirement for a database server, an advantage they share with <a href="#sqlite">SQLite, below</a>. As the wiki pages describe in more detail, these Hierarchical Data Formats are self-describing, somewhat like XML files, which enable applications to determine the data structures without external references.  HDF5 and netCDF provide sophisticated, compact, and especially hierarchical data storage, allowing an internal structure much like a modern filesystem.  A single file can provide character data in various encodings (ASCII, UTF, etc), numeric data in various length integer, floating point, and complex representation, geographic coordinates, encoded data such as base+offsets for efficiency, etc.  These files and the protocols for reading them, assure that they can be read and written using common functions without regard to platform or network protocols, in a number of languages.</p>
<p>These files are also useful for very large data as they have parallel Input/Output (I/O) interfaces.  Using HDF5 or netCDF4, you can read and write these formats in parallel, increasing I/O throughput dramatically on parallel filesystems such as <a href="http://en.wikipedia.org/wiki/Lustre_(file_system)">Lustre</a>, <a href="http://www.pvfs.org">PVFS2</a>, and <a href="http://en.wikipedia.org/wiki/Global_File_System">GFS</a>.  Many analytical programs already have interfaces to the HDF5 and netCDF4 formats, among them <a href="http://tinyurl.com/dyclf6">MATLAB</a>, <a href="http://reference.wolfram.com/mathematica/ref/format/HDF5.html">Mathematica</a>, <a href="http://cran.r-project.org/web/packages/hdf5/index.html">R</a>, <a href="https://wci.llnl.gov/codes/visit/FAQ.html#28">VISIT</a>, and <a href="http://www.hdfgroup.org/tools.html">others</a>.</p>
<h3><a name="_tools_for_hdf_and_netcdf"></a>Tools for HDF and netCDF</h3>
<p>As noted above, some applications can open such files directly, obviating the need to use external tools to extract or extend a data set in this format.  However, for those times when you have to subset, extend, or otherwise modify such a dataset, there are a couple of tools that can make that job much easier.  Since this document is aimed at the beginner rather than the expert, I&#8217;ll keep this section brief, but realize that these extremely powerful tools are available (and free!).</p>
<p>Some valuable tools for dealing with these formats:</p>
<p> <a name="nco"></a>
<ul>
<li> <a href="http://nco.sf.net">nco</a>, a suite of tools written in C/C++ mainly by UCI&#8217;s own <a href="http://www.ess.uci.edu/~zender">Charlie Zender</a>.  They were originally written to manipulate netCDF 3.x files but have been updated to support netCDF4 (which uses the HDF5 format).  They are <a href="http://nco.sourceforge.net/#Definition">described in detail here, with examples</a>.  They are portable to all current Unix implementations and, of course, Linux.  They are extremely well-debugged and their development is ongoing. </li>
<li> <a href="http://www.pytables.org/moin">PyTables</a> is another utility that is used to create, inspect, modify, and/or query HDF5 tables to extract data into other HDF or text files.  This project also has very good documentation and even has a couple of video introductions on <a href="http://www.showmedo.com">ShowMeDo</a>, which can be reached <a href="http://www.pytables.org/moin/HowToUse#Videos">from here</a>.  In addition, PyTables has a companion graphical data browser called <a href="http://www.vitables.org">ViTables</a>. </li>
</ul>
<hr />
<h2><a name="pytables"></a>Databases</h2>
<p>Databases (DBs) are data structures and the supporting code that allow you to store data in a particular way.  Some DBs are used to store only numbers (and if this is the case, it might be quicker and more compact to store that data in an <a href="#hdf5">HDF5 file</a>). If you have lots of different kinds of data and you want to query that data on complex criteria (give me the names of all the people who lived in the 92655 ZIP code and have spent more than $500 on toothpicks in the last 10 years), using a DB can be extremely useful; in fact, it may be the only way to extract the data you need in a timely manner.</p>
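<p>As a rough sketch of what the toothpick query above could look like, the following uses Python&#8217;s built-in sqlite3 module with an in-memory DB.  The table names, column names, and sample rows are invented for illustration, and the cutoff year is hard-coded; a real DB would have its own schema and would compute the 10-year window from the current date.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical schema and rows, for illustration only.
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
CREATE TABLE purchases (person_id INTEGER, item TEXT, amount REAL, year INTEGER);
INSERT INTO people VALUES (1, 'Ann', '92655'), (2, 'Bob', '92655'), (3, 'Cy', '90210');
INSERT INTO purchases VALUES
    (1, 'toothpicks', 300.0, 2005), (1, 'toothpicks', 250.0, 2008),
    (1, 'toothpicks', 999.0, 1990), (2, 'toothpicks', 40.0, 2007),
    (3, 'toothpicks', 900.0, 2006);
""")

# Join the two tables, keep recent toothpick purchases in one postal code,
# then keep only the people whose purchases add up to more than the limit.
query = """
SELECT p.name
FROM people AS p JOIN purchases AS b ON b.person_id = p.id
WHERE p.zip = ? AND b.item = ? AND b.year >= ?
GROUP BY p.id
HAVING SUM(b.amount) > ?
ORDER BY p.name
"""
names = [row[0] for row in con.execute(query, ("92655", "toothpicks", 1999, 500))]
print(names)  # prints ['Ann']
```

<p>Answering the same question from flat files would mean joining, filtering, and summing by hand, which is the kind of work a DB engine does for you.</p>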
<p>As implied above, there are a number of different types of DBs, delineated not only by the way they store and retrieve data, but by the way the user interacts with the DB.  The software can be separated into Desktop DBs (which are meant to serve mostly local queries by one person &#8211; for example, a research DB that a scientist might use to store his own data) and DB Servers, which are meant to answer queries from a variety of clients via a network socket.</p>
<p>The latter are typically much more complex and require much more configuration to protect their contents from network mischief.  Desktop DBs typically have much less security and are less concerned about answering queries from other users on other computers, tho they can typically do so.  Microsoft Access and SQLite are examples of Desktop DBs.</p>
<h3><a name="_relational_databases"></a>Relational databases</h3>
<p>Relational DBs are those that can <em>relate</em> data from one table (or data structure) to another.  A relational DB is composed of tables which are internally related and can be joined or related to other tables with more distant information.  For example, a woodworker&#8217;s DB might be composed of DB tables that describe PowerTools, Handtools, Plans, Glues, Fasteners, Finishes, Woods, and Injuries, with interconnecting relationships among the tables.  The <em>Woods</em> table definition might look like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>CREATE TABLE Woods (
    id INTEGER PRIMARY KEY,  -- the entry index
    origin VARCHAR(20),      -- area of native origin
    now_grown VARCHAR(100),  -- areas where now grown
    local BOOLEAN,           -- locally available?
    cost FLOAT,              -- cost per board foot
    density FLOAT,           -- density in lbs/board ft.
    hardness INT,            -- relative hardness scaled 1-10
    sanding_ease INT,        -- relative ease to sand to finish
    color VARCHAR(20),       -- string descriptor of color range
    allergy_propensity INT,  -- relative propensity to cause allergic reaction 1-10
    toxicity INT,            -- relative toxicity scaled 1-10
    strength INT,            -- relative strength scaled 1 (balsa) to 10 (hickory)
    appro_glue VARCHAR(20),  -- string descriptor of best glue
    warnings VARCHAR(200),   -- any other warnings
    notes VARCHAR(1000)      -- notes about preparation, finishing, cutting
    -- &lt;etc&gt;
);</pre>
</td>
</tr>
</table>
<p>See <a href="http://en.wikipedia.org/wiki/Relational_database">the Wikipedia entry on Relational Databases</a> for more.</p>
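<p>To show what <em>relating</em> two tables means in practice, here is a small sketch using SQLite from Python.  It uses a cut-down Woods table plus a Glues table; the <strong>name</strong>, <strong>waterproof</strong>, and <strong>glue_id</strong> columns and all of the sample rows are assumptions made up for this example, not part of the definition above.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Glues (
    id INTEGER PRIMARY KEY,
    name VARCHAR(20),
    waterproof BOOLEAN                      -- hypothetical column
);
CREATE TABLE Woods (
    id INTEGER PRIMARY KEY,
    name VARCHAR(20),                       -- hypothetical column
    hardness INT,                           -- relative hardness scaled 1-10
    glue_id INTEGER REFERENCES Glues(id)    -- hypothetical link to Glues
);
-- Invented sample rows, not real woodworking advice.
INSERT INTO Glues VALUES (1, 'PVA', 0), (2, 'epoxy', 1);
INSERT INTO Woods VALUES (1, 'balsa', 1, 1), (2, 'hickory', 10, 2), (3, 'teak', 7, 2);
""")

# The JOIN follows glue_id from each Woods row to its matching Glues row.
rows = con.execute("""
    SELECT Woods.name, Glues.name
    FROM Woods JOIN Glues ON Woods.glue_id = Glues.id
    WHERE Glues.waterproof = 1
    ORDER BY Woods.name
""").fetchall()
print(rows)  # prints [('hickory', 'epoxy'), ('teak', 'epoxy')]
```

<p>Storing a key into Glues, rather than a text description of the glue in every Woods row, means each glue is described once and can be updated in one place.</p>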
<h4><a name="_desktop"></a>Desktop</h4>
<p> <a name="sqlite"></a>
<ul>
<li> SQLite (<a href="http://www.sqlite.org">website</a>, <a href="http://en.wikipedia.org/wiki/Sqlite">wikipedia</a>) is an amazingly powerful, <a href="http://en.wikipedia.org/wiki/ACID">ACID-compliant</a>, astonishingly tiny DB engine (could fit on a floppy disk) that can do much of what much larger DB engines can do.  It is public domain, has a huge user base, has good documentation (including a <a href="http://www.amazon.com/Definitive-Guide-SQLite-Mike-Owens/dp/1590596730">book</a>) and is well suited for the transition from flat files to a relational database.  It has support for almost every computer language, and several utilities (such as graphical browsers like <a href="http://sqlitebrowser.sourceforge.net/screenshots.html">sqlitebrowser</a> and <a href="http://www.knoda.org/">knoda</a>) to ease its use.  The <strong>sqlite3</strong> program that provides the native commandline interface to the DB system is fairly flexible, allowing generic import of TAB-delimited data into SQLite tables.  Ditto the graphical <strong>sqlitebrowser</strong> program.  Here is <a href="http://moo.nac.uci.edu/%7ehjm/recursive.filestats.sqlite_skel.pl">a well-documented example of how to use SQLite in a Perl script</a> (about 300 lines including comments). </li>
<li> <a href="http://www.openoffice.org">OpenOffice</a> comes with its <a href="http://hsqldb.org/">own DB</a> and <a href="http://dba.openoffice.org/">DB interaction tools</a>, which are extremely powerful, tho they have not been well-documented in the past. The OpenOffice DB tools can be used not only with its own DB, but with many others including MySQL, PostgreSQL, SQLite, MS Access, and any DB that provides an <a href="http://en.wikipedia.org/wiki/Open_Database_Connectivity">ODBC interface</a>. </li>
</ul>
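<p>The move from a TAB-delimited flat file to an SQLite table can also be scripted.  The sketch below does it from Python with the standard csv and sqlite3 modules; the column names and the few data rows are invented, and an in-memory string stands in for a real file so the example runs as-is.</p>

```python
import csv
import io
import sqlite3

# Stand-in for a TAB-delimited flat file (contents invented). With real data
# you would use open("yourfile.tsv", newline="") here; that name is hypothetical.
flat_file = io.StringIO(
    "gene\tsample\texpression\n"
    "abc1\ts1\t2.5\n"
    "abc1\ts2\t3.5\n"
    "xyz9\ts1\t0.25\n"
)

reader = csv.reader(flat_file, delimiter="\t")
header = next(reader)  # first line holds the column names

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE expr (gene TEXT, sample TEXT, expression REAL)")
# Each remaining line becomes one row; the REAL column converts '2.5' to 2.5.
con.executemany("INSERT INTO expr VALUES (?, ?, ?)", reader)
con.commit()

averages = con.execute(
    "SELECT gene, AVG(expression) FROM expr GROUP BY gene ORDER BY gene"
).fetchall()
print(header)
print(averages)  # prints [('abc1', 3.0), ('xyz9', 0.25)]
```

<p>Once the data is in a table, questions like &#8220;average per gene&#8221; become a one-line query instead of a custom script per question.</p>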
<h4><a name="oobase"></a>Server-based</h4>
<p> <a name="mysql"></a>
<ul>
<li> <a href="http://en.wikipedia.org/wiki/MySQL">MySQL</a> is a hugely popular, very fast DB server that provides much of the DB needs of the WWW, both commercial and non-profit.  Some of the largest, most popular web sites in the world use MySQL to keep track of their web interactions, as well as their data.  The <a href="http://genome.ucsc.edu/index.html">UC Santa Cruz Genome DB</a> uses MySQL with a schema of &gt;300 tables to keep one of the most popular Biology web sites in the world running. </li>
<li> <a href="http://en.wikipedia.org/wiki/PostgreSQL">PostgreSQL</a> is similarly popular and has a reputation for being even more robust and full-featured.  It also is formally an <a href="http://en.wikipedia.org/wiki/Object-relational_database_management_system">Object-Relational</a> database, so its internal storage can be used in a more object oriented way.  If your needs require storing Geographic Information, PostgreSQL has a parallel development called PostGIS, which is optimized for storing Geographical Information and has become a de facto standard for GIS DBs.  It supports the popular <a href="http://mapserver.osgeo.org/">Mapserver</a> software. </li>
<li> <a href="http://en.wikipedia.org/wiki/Firebird_(database_server)">Firebird</a> is the Open Source version of a previously commercial DB called Interbase from Borland.  Since its release as OSS, it has undergone a dramatic upswing in popularity and support. </li>
<li> Others.  There are a huge number of very good relational DBs, many of them free or OSS.  <a href="http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems">Wikipedia has a page</a> that names a number of them and briefly describes some differences, altho unless you are bound by some external constraint, you would be foolish not to choose one of SQLite, MySQL, or PostgreSQL due to the vast user-generated support and utilities. </li>
</ul>
<hr />
<h2><a name="firebird"></a>Data visualization</h2>
<p>You almost always want to visualize your data.  It&#8217;s one thing to page thru acres of spreadsheet or numeric data, but nothing can give you a better picture of your data than &#8230; a picture.  This is an area where Linux shines, tho it&#8217;s not without scrapes and scratches.  I&#8217;m going to break this into 2 arbitrary sections: the first is <em>Simple</em>, the second <em>Complex</em>.  &#8220;Simple&#8221; means that both the process and data are relatively simple; &#8220;Complex&#8221; implies that both the visualization process and data are more complex.</p>
<h3><a name="_simple_data_visualization"></a>Simple Data Visualization</h3>
<p> <a name="qtiplot"></a>
<ul>
<li> <a href="http://soft.proindependent.com/qtiplot.html">qtiplot</a> is &#8220;a fully fledged plotting software similar to the OriginLab <a href="http://www.originlab.com">Origin</a> software&#8221;.  It also is multiplatform, so it can run on Windows as well as MacOSX. This is probably what researchers coming from the Windows environment would expect to use when they want a quick look at their data.  It has a fully interactive GUI and while it takes some getting used to, it is fairly intuitive and has a number of useful descriptive statistics features as well. Highly recommended. </li>
<li> <a href="http://quickplot.sourceforge.net/">quickplot</a> is a more primitive graphical plotting program but with a very large capacity if you want to plot large numbers of points (say 10s or 100s of thousands of points).  It can also read data from a pipe so you can place it at the end of a data pipeline to show the result. </li>
<li> <a href="http://www.gnuplot.info/">gnuplot</a> is one of the most popular plotting applications for Linux.  If you spend a few minutes with it, you&#8217;ll wonder why.  If you persist and spend a few hours with it, you&#8217;ll understand.  It&#8217;s not a GUI program, altho <strong>qgfe</strong> provides a primitive GUI (but if you want a GUI, try <strong>qtiplot</strong> above).  gnuplot is really a scripting language for automatically plotting (and replotting) complex data sets.  To see this in action, you may first have to download the demo scripts and then have gnuplot execute all of them with <strong>gnuplot /where/the/demo/data/is/all.dem</strong>.  Pretty impressive (and useful, as all the demo examples ARE the gnuplot scripts).  It&#8217;s an extremely powerful plotting package, but it&#8217;s not for dilettantes. Highly recommended (if you&#8217;re willing to spend the time). </li>
<li> <a href="http://www.pyxplot.org.uk">pyxplot</a> is a graphing package sort of like gnuplot in that it relies on an input script, but instead of the fairly crude output of gnuplot, the output is <a href="http://www.pyxplot.org.uk/examples/firstSteps/03multiAxis/">truly publication quality</a> and the greater the complexity of the plot, the more useful it is.  Recommended if you need the quality or typesetting functionality. </li>
<li> <a href="http://en.wikipedia.org/wiki/Gretl">gretl</a> is not a plotting package <em>per se</em> but a fully graphical statistical package that incorporates gnuplot and as such is worth knowing about.  It is quite intuitive and the statistical functions are an added bonus for when you want to start applying them to your data.  It was originally developed with its own statistics engine but can start and use R as well. Highly recommended. </li>
<li> <a href="http://en.wikipedia.org/wiki/R_(programming_language)">The R statistical language</a> has some <a href="http://www.statmethods.net/graphs/index.html">very good plotting facilities</a> for both low and high dimensional data.  Like the Gretl program immediately above, R combines impressive statistical capabilities with graphics, although R is an interpreted language that uses commands to create the graphs which are then generated in graphics windows.  There are some R packages that are completely graphical and there are moves afoot to put more effort into making a completely graphical version of R, but for the most part, you&#8217;ll be typing, not clicking.  That said, it&#8217;s an amazingly powerful language and <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html">one of the best statistical environments available</a>, commercial or free.  Very highly recommended. </li>
</ul>
<h3><a name="rplot"></a>Visualization of Multivariate Data</h3>
<p> <a name="ggobi"></a>
<ul>
<li> <a href="http://www.ggobi.org">ggobi</a> is on the border of &#8220;simple&#8221; and &#8220;complex&#8221;, but it can be started and used relatively easily and has some very compelling abilities.  I have to admit to being a ggobi fan for many years &#8211; it has some features that I haven&#8217;t seen anywhere else which, if your data is multivariate, really help you to understand it. It can be run by itself, but its real power is obvious when you use it inside of R.  That interface also obviates the grotacious requirement to code all your data into XML (R does it for you automatically).  ggobi has a <a href="http://www.ggobi.org">very good website</a> and documentation, and also has some very compelling demos and videos showing off its abilities.  For an extended example of how to use it and some output, please check out the ggobi part of <a href="http://moo.nac.uci.edu/%7ehjm/AnRCheatsheet.html#ggobi">An R Cheat Sheet</a>.  Highly recommended. </li>
<li> <a href="http://meteora.ucsd.edu/~pierce/ncview_home_page.html">ncview</a> is a viewer for netCDF files.  Since such files are typically used to map variables onto a physical space, this application provides a quick visualization of the data in a netCDF file mapped to the geographical or time coordinates. </li>
<li> <a href="http://www.unidata.ucar.edu/software/idv/">Integrated Data Viewer (IDV)</a> &#8220;from Unidata is a Java&#8482;-based software framework for analyzing and visualizing geoscience data. The IDV brings together the ability to display and work with satellite imagery, gridded data, surface observations, balloon soundings, NWS WSR-88D Level II and Level III radar data, and NOAA National Profiler Network data, all within a unified interface.&#8221;  It supports a fantastically wide set of data and can even be run (slowly) without installing it via Java&#8217;s Web Start.  It can also interactively query remote data servers to provide filtered data from much larger data sets.  It is strongly tailored to mapping geophysical data onto geographic maps, so is more of a GIS tool than a strictly multivariate visualization tool. </li>
<li> <a href="http://sites.google.com/site/ifrithome/">ifrit</a>, named for &#8220;Ionization FRont Interactive Tool&#8221; for that domain, has now become a much more general data visualization tool. It is used to visualize 4 main types of data: <a name="ifrit"></a>
<ul>
<li> Scalar data: several scalar variables in 3D space. </li>
<li> Vector field data: a 3D field of vectors. </li>
<li> Tensor field data: a symmetric 3&#215;3 tensor in 3D space. </li>
<li> Particle data: a set of particles (points) with several optional attributes (numbers that distinguish particles from each other) per particle. In order to supply data to ifrit, you will probably have to do some editing of the data file header to tell ifrit the layout. </li>
</ul>
</li>
<li> <a href="https://wci.llnl.gov/codes/visit/">VISIT</a> is a very sophisticated fully graphical visualization tool that can provide plotting of just about every dataset that you can supply, and do so in very high quality (including making animations).  It supports the HDF and netCDF formats mentioned above as well as <a href="https://wci.llnl.gov/codes/visit/FAQ.html#12">about 40 others</a>. To provide this visualization capability, its interface is fairly complex as well, but if your data visualization needs are very sophisticated, this is the application for you. The web site includes a <a href="https://wci.llnl.gov/codes/visit/screens.html">gallery of visualizations</a> made with VISIT. </li>
</ul>
<hr />
<h2><a name="visit"></a>Programmatic manipulation</h2>
<p>Sometimes your data is of a complexity that you&#8217;ll need to reduce or filter it before it can be used as input for statistical processing or visualization.  If this is the case, you may be headed into the realm of actual programming.  I refer you again to the excellent <a href="http://swc.scipy.org">Software Carpentry tutorials</a> and <a href="http://showmedo.com/videos/carpentry">related videos on ShowMeDo</a>.</p>
<p>For further data manipulation, I would suggest learning a scripting (or interpreted) language, such as Python, Perl, R, or Java for the following reasons.</p>
<h3><a name="python"></a>Python</h3>
<p><a href="http://en.wikipedia.org/wiki/Python_(programming_language)">Python</a>, <a href="http://ipython.scipy.org/moin/">iPython</a> &amp; <a href="http://numpy.scipy.org/">NumPy</a>.  <strong>Python</strong> is a general purpose interpreted programming language and while not the most popular, it is quite suitable for research, especially when coupled with the interactive iPython shell/debugger and the numerical module NumPy.  Like Java (and unlike Perl) it is strongly object-oriented, which can make it easier to understand and extend.  iPython is an interactive commandline debugger (there are also many free GUI debuggers such as <a href="http://eric-ide.python-projects.org/">eric</a>), which has a number of features that make it very easy and powerful.  The NumPy module is Python&#8217;s scientific numeric package which makes it very easy to deal with multidimensional arrays and their manipulation as well as <em>wrapping</em> libraries from other languages to include as Python modules &#8211; as an example, see <a href="http://moo.nac.uci.edu/%7ehjm/HOWTO_Pythonize_FORTRAN.html">HOWTO_Pythonize_FORTRAN.html</a>.  I&#8217;ve written a <a href="http://moo.nac.uci.edu/%7ehjm/linebyline.py">simple skeleton example</a> in Python to show how Python does some useful things.</p>
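<p>As a taste of the reduce-and-filter step described in this section, here is a short sketch in plain Python (it is not the linked skeleton script).  The column layout, the &#8220;ok&#8221; status flag, and the data values are assumptions invented for the example; an in-memory string stands in for a real data file.</p>

```python
import io
import statistics

# Stand-in for a whitespace-delimited data file (contents invented).
data = io.StringIO(
    "# id  temp  status\n"
    "1  20.5  ok\n"
    "2  99.9  bad\n"
    "\n"
    "3  21.5  ok\n"
)

temps = []
for line in data:
    line = line.strip()
    if not line or line.startswith("#"):
        continue                 # skip blank lines and comment lines
    fields = line.split()
    if fields[2] != "ok":
        continue                 # filter out rows flagged as bad
    temps.append(float(fields[1]))

# Two rows survive the filter; their mean is the reduced result.
print(len(temps), statistics.mean(temps))
```

<p>Reading one line at a time like this keeps memory use small even for very large files, since only the values you keep are held in memory.</p>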
<h3><a name="perl"></a>Perl</h3>
<p><a href="http://en.wikipedia.org/wiki/Perl">Perl</a> &amp; the <a href="http://pdl.perl.org/">Perl Data Language (PDL)</a>.  <strong>Perl</strong> is also a powerful general purpose language, with probably the best integration of <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expressions</a>.  If parsing files, substituting strings, and manipulating text is your bag, then Perl is your sewing machine.  The <strong>PDL</strong> is an additional set of routines that provides much the same functionality as <strong>Numpy</strong> does for <strong>Python</strong>.  While both Perl and Python have their own set of self-updating and library-searching routines, Perl&#8217;s <a href="http://www.cpan.org/">CPAN</a> is more mature and better debugged than Python&#8217;s <a href="http://peak.telecommunity.com/DevCenter/EasyInstall">easy_install</a> system, altho both are very good.  Perl and Python are about equally well-supported when it comes to interacting with databases and network activity.</p>
<h3><a name="rprogram"></a>The R Statistical Language</h3>
<p><a href="http://en.wikipedia.org/wiki/R_(programming_language)">R</a>, as alluded to above, is an interpreted language that is designed explicitly for data manipulation and analysis.  It has significant graphing and database capabilities built in and if you are a biologist, the R-based <a href="http://www.bioconductor.org">BioConductor suite</a> is a very powerful set of modules for analyses like microarray data, genomics data mining, and even sequence analysis.  It is not a general purpose programming language but for a researcher (not a web programmer), it would be a very good choice.</p>
<h3><a name="java"></a>Java</h3>
<p><a href="http://en.wikipedia.org/wiki/Java_(programming_language)">Java</a> is yet another general purpose, object-oriented programming language (strictly speaking, it is compiled to bytecode that runs on a virtual machine rather than interpreted), but one which has great support in the area of web development and cross-platform programming.  Most browsers have Java plug-ins which enable them to run Java code, so your applications can <em>run in a browser</em>, which can sometimes be advantageous.  However, the language itself is not as intuitive or concise as Python or Perl and does not have the same strengths in numerical or scientific support, altho that could be my ignorance showing.</p>
<h3><a name="processing"></a>Processing.org</h3>
<p><a href="http://processing.org/">processing.org</a> is a free visualization Java toolkit that makes it quite easy to write code to manipulate data in 3D.  If you have a visualization need that is not met by any of the above tools, you may want to play with this toolkit to see if you can develop it yourself.  The language is quite simple and intuitive and the code examples given are well-documented and are executed in the provided programming environment.  There are some interesting and beautiful examples of what it can do <a href="http://www.openprocessing.org/">shown here</a>.</p>
<hr />
<h2><a name="_version_control_for_code_and_documents"></a>Version Control for Code and Documents</h2>
<p>This is a bit beyond the original scope of the document, but since I brought up writing your own code, <a href="http://en.wikipedia.org/wiki/Revision_control">version control</a> is something you should know about.  This <a href="http://www.swc.scipy.org/lec/version.html">Software Carpentry page</a> describes what it is and why it&#8217;s important.  READ IT!!</p>
<p>Versioning tools can, unsurprisingly, help you keep track of file versions.  The two such OSS systems now in widespread and growing use are <a href="http://git-scm.com">git</a> and <a href="http://subversion.tigris.org">Subversion</a> (aka <strong>svn</strong>).  <strong>git</strong> was written by a major software developer (Linus Torvalds, he of Linux fame and name) for a major software project (tracking the development of Linux itself).  As such, it is very fast and efficient, and was written to encourage branching and merging of versions by a widespread group of developers who are online asynchronously.  This makes it more like a peer-to-peer repository. <strong>svn</strong> was written with the same goals (developed by a team with long experience of CVS, one of the most successful versioning systems before Subversion), but it is harder to branch and merge versions and has some different architectural features that lend it to a server implementation.</p>
<p>Note that these repositories can be used not only for code but for anything based on text &#8211; notes, grants, small amounts of data, email, etc.  I know a few people who use subversion as their entire backup system.  One in particular who uses Linux and Open Source tools for everything (including writing papers in TeX) has very few binary format files to back up so everything goes into svn.</p>
<p>Both the <a href="http://www.swc.scipy.org/lec/version.html">Software Carpentry</a> site and <a href="http://showmedo.com/videos/series?name=bfNi2X3Xg">Showmedo</a> have pages dedicated to version control but mostly <strong>subversion</strong>.  Setting up a subversion repository is not trivial, but learning how to use it can be tremendously valuable if you need or appreciate versioned files. Google Video has a couple of <em>Tech Talks</em> dedicated to <strong>git</strong>, <a href="http://www.youtube.com/watch?v=4XpnKHJAok8">one by Linus Torvalds</a> which is very entertaining but is not much of a <em>HOWTO</em> and <a href="http://video.google.com/videoplay?docid=-3999952944619245780">another by Randal Schwartz</a> which is more useful and describes more of what git can be used for.</p>
<hr />
<h2><a name="_further_reading"></a>Further Reading</h2>
<p><a name="ibmdevworks"></a>There is a similar, if less complete (but possibly more articulate) tutorial called <a href="http://www-128.ibm.com/developerworks/linux/library/l-textutils.html">Simplify data extraction using Linux text utilities</a> from the IBM DeveloperWorks series.</p>
<p>For a quick introduction to more sophisticated data analysis with R that builds on this document, please refer to <a href="http://moo.nac.uci.edu/%7ehjm/AnRCheatsheet.html">An R Cheatsheet</a> and the references therein.</p>
<hr />
<h2><a name="_appendix"></a>Appendix</h2>
<p>List of applications noted here:</p>
<ul>
<li> <a href="#file">file</a> &#8211; what kind of file is this? </li>
<li> <a href="#ls">ls</a>, <a href="#wc">wc</a> &#8211; how big is the file? </li>
<li> <a href="#openoffice">OpenOffice.org</a> &#8211; native GUI MS Office work-alike Word Processor (oowriter), Spreadsheet (oocalc), Database Tools (part of oocalc), Drawing (oodraw), Presentation (ooimpress) apps </li>
<li> <a href="#gnumeric">Gnumeric</a> &#8211; native, GUI Spreadsheet </li>
<li> <a href="#neooffice">NeoOffice</a>  &#8211; native MacOSX fork of OpenOffice </li>
<li> <a href="#antiword">antiword</a>, <a href="#py_xls2csv">py_xls2csv</a> &#8211; extract data from MS Word, Excel binary files. </li>
<li> <a href="#editors">nedit, jedit, kate, xemacs</a> &#8211; GUI editors for data files </li>
<li> <a href="#cat">cat</a> &#8211; dump a file to STDOUT </li>
<li> <a href="#moreless">more, less</a> &#8211; text file pagers </li>
<li> <a href="#headtail">head, tail</a> &#8211; view or slice rows from the top or bottom of a data file </li>
<li> <a href="#scut">cut, scut</a> &#8211; how to slice columns out of a data file </li>
<li> <a href="#cols">cols</a> &#8211; columnize delimited data files </li>
<li> <a href="#paste">paste</a> &#8211; glue 2 files together side by side </li>
<li> <a href="#join">join</a>, <a href="#scut">scut</a> &#8211; merge 2 files based on common keys </li>
<li> <a href="#hdf5">Complex data formats</a> </li>
<li> <a href="#nco">nco</a>, <a href="#pytables">pytables</a> &#8211; slice/dice, extract data from netCDF, HDF files </li>
<li> Relational Databases
<ul>
<li> <a href="#sqlite">SQLite</a> </li>
<li> <a href="#oobase">OpenOffice Base</a> </li>
<li> <a href="#mysql">MySQL</a> </li>
<li> <a href="#postgresql">PostgreSQL</a> </li>
<li> <a href="#firebird">Firebird</a> </li>
</ul>
</li>
<li> Simple Data Plots
<ul>
<li> <a href="#qtiplot">qtiplot</a> </li>
<li> <a href="#quickplot">quickplot</a> </li>
<li> <a href="#gnuplot">gnuplot, qgfe</a> </li>
<li> <a href="#pyxplot">pyxplot</a> </li>
<li> <a href="#gretl">gretl</a> </li>
<li> <a href="#rplot">R</a> </li>
</ul>
</li>
<li> Complex Data Visualization
<ul>
<li> <a href="#ggobi">ggobi &amp; R</a> </li>
<li> <a href="#ncview">ncview</a> </li>
<li> <a href="#idv">IDV</a> </li>
<li> <a href="#ifrit">ifrit</a> </li>
<li> <a href="#visit">VISIT</a> </li>
</ul>
</li>
<li> Programmatic Data Manipulation
<ul>
<li> <a href="#python">Python, iPython, Numpy</a> </li>
<li> <a href="#perl">Perl, Perl Data Language</a> </li>
<li> <a href="#rprogram">R</a> </li>
<li> <a href="#java">Java</a> </li>
<li> <a href="#processing">processing.org</a> </li>
</ul>
</li>
</ul>
<hr />
<h2><a name="_copyright"></a>Copyright</h2>
<p><a href="http://creativecommons.org/licenses/by-sa/3.0/"> <img style="border-width:0;" src="http://wiki.creativecommons.org/images/1/1f/By-sa_plain300.png" alt="http://wiki.creativecommons.org/images/1/1f/By-sa_plain300.png"> </a></p>
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/14/manipulating-data-on-linux/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>

		<media:content url="http://wiki.creativecommons.org/images/1/1f/By-sa_plain300.png" medium="image">
			<media:title type="html">http://wiki.creativecommons.org/images/1/1f/By-sa_plain300.png</media:title>
		</media:content>
	</item>
		<item>
		<title>UCI OSS Backup Evaluation</title>
		<link>http://hjmangalam.wordpress.com/2009/09/13/uci-oss-backup-evaluation/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/13/uci-oss-backup-evaluation/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 11:10:16 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[Linux & Open Source]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=16</guid>
		<description><![CDATA[Introduction and Constraints UCI, like many UC campuses, is facing the dual squeeze of decreasing IT budgets, and increasing licensing fees for our institutional Backup Systems. We also are facing somewhat more requirements from clients as more data is being gathered or generated, analyzed, and archived. In view of these pressures, the Office of Information [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=hjmangalam.wordpress.com&amp;blog=1623151&amp;post=16&amp;subd=hjmangalam&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<hr />
<h2><a name="_introduction_and_constraints"></a>Introduction and Constraints</h2>
<p>UCI, like many UC campuses, is facing the dual squeeze of decreasing IT budgets, and increasing licensing fees for our institutional Backup Systems.  We also are facing somewhat more requirements from clients as more data is being gathered or generated, analyzed, and archived.  In view of these pressures, the Office of Information Technology (OIT) is evaluating what our requirements are, and what Backup solutions can be used to more economically address our needs.  We are evaluating both Proprietary and Open Source Software (OSS) approaches and it may be that the optimal solution is a combination of the 2.</p>
<p><b>Any Backup approach is guided by at least 2 issues:</b></p>
<ol type="1">
<li> The value of the data (or the cost of replacing it). </li>
<li> The cost of backing it up. </li>
</ol>
<p>Much of the most valuable institutional data is stored on high-cost, high-reliability, highly-secured central servers.  This makes backup fairly easy and most such devices have inherent or included data redundancy or data protection, making decisions about what backup system to use much easier (in some cases, there is no choice at all since the proprietary nature of the storage allows <strong>only</strong> the vendor&#8217;s implementation).</p>
<p>The situation is somewhat different for a university, which brings in much of its support $ in the form of overhead on grants.  Such faculty-initiated grants brought in ~$328M in external funding last year for UC Irvine alone.  Those grants were composed on, and still reside substantially on, personal Laptops and Desktops scattered around the campus.  The vast majority of them are backed up sporadically, if at all.  I&#8217;ve had personal experience in trying to rescue at least 5 grants that were lost to disk crashes or accidental deletion days before the submission date.  That is valuable data and it would be exceptionally useful to be able to provide backup services to such users, even disregarding the somewhat less critical primary data from their labs.  At UCI, this includes about 400 faculty who write for external grants on a regular basis.  Any Backup system should minimally cover these people and therefore the scalability and Mac/Windows client compatibility of any such system is quite important.</p>
<p>Below is a summary of our current backup systems, <a href="http://www.emc.com/products/detail/software/networker.htm">EMC Networker</a> and <a href="http://retrospect.com/">EMC Retrospect</a>.</p>
<hr />
<h2><a name="_networker_client_load"></a>Networker Client Load</h2>
<p>We currently use Networker to back up ~230 clients (mostly other servers) to 3 backup servers, with storage requirements as described graphically below: <img style="border-width:0;" src="http://hjmangalam.files.wordpress.com/2009/09/uci_backup_clients_plot.jpg?w=450" alt="UCI_backup_clients_plot.jpg"></p>
<p>and statistically here:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>Sum       109326.2 ........ total GB
Number    227 ............. # clients
Mean      481.61 .......... GB / client
Median    155.9 ...........     "
Min       0.2 ............. GB on smallest client
Max       4651.2 .......... GB on largest client
Range     4651 ............ diff between 2 above
Variance  523643.27 ....... among all clients
Std_Dev   723.63 ..........        "
SEM       48.02 ...........        "
Skew      2.43 ............        "
Std_Skew  14.99 ...........        "
Kurtosis  6.97 ............        "</pre>
</td>
</tr>
</table>
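<p>As a quick sanity check, the summary figures above can be cross-checked against each other.  The sketch below uses only the published summary values (the per-client data is not reproduced here), and the variable names are just illustrative.  The results should agree with the table to within rounding.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>import math

# Values copied from the summary table above.
total_gb = 109326.2   # Sum
n_clients = 227       # Number
std_dev = 723.63      # Std_Dev
variance = 523643.27  # Variance

mean = total_gb / n_clients            # should match Mean
sem = std_dev / math.sqrt(n_clients)   # standard error of the mean

print(f"mean GB/client:        {mean:.2f}")
print(f"std dev from variance: {math.sqrt(variance):.2f}")
print(f"SEM:                   {sem:.2f}")</pre>
</td>
</tr>
</table>
<p>Note that the mean is about 3x the median, which (along with the large skew) says that a few very large clients dominate the storage total.</p>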
<p>We currently use about 1/4 of an FTE to administer the Networker system after setup. The IAT recharge rates for this service are <a href="http://www.nacs.uci.edu/org/nacs-prices.html">listed here</a>, but in summary, it&#8217;s $20/mo/system plus $0.39/GB for storage and a $40 charge for file restores.</p>
<hr />
<h2><a name="_retrospect_client_load"></a>Retrospect Client Load</h2>
<p>Retrospect is a disk-only system.</p>
<p>We currently pay $620/yr for this license and currently have 59 clients (18 Macs, rest Windows).</p>
<p>We currently charge $87.50/user/year with a 50GB limit on storage, with self-service file restores.</p>
<hr />
<h2><a name="_open_source_backup_software_evaluation"></a>Open Source Backup Software Evaluation</h2>
<p>There are perhaps 50 OSS Backup systems, but most of them are too limited in their features or maturity to be considered.  Only 3 seem to rise to the level of possibility, tho for different things:</p>
<ul>
<li> <a href="http://amanda.zmanda.com">Amanda</a> &#8211; A <a href="http://tinyurl.com/mlku4d">good, neutral description</a> from <a href="http://www.backupcentral.com">BackupCentral</a>.  <a href="http://www.zmanda.com">Zmanda</a> provides commercial support and additional code goodies. </li>
<li> <a href="http://www.bacula.org">Bacula</a> &#8211; A <a href="http://tinyurl.com/lp6nel">good neutral description</a> from BackupCentral.  <a href="http://www.baculasystems.com">Bacula Systems</a> provides support. </li>
<li> <a href="http://backuppc.sf.net">BackupPC</a> &#8211; A <a href="http://tinyurl.com/nwgle4">good neutral description</a> from BackupCentral. </li>
</ul>
<p>Amanda, Bacula, and BackupPC share these characteristics:</p>
<ul>
<li> All have Commercial support available now or imminently. </li>
<li> The OSS version is of course, free for unlimited number of servers and clients. </li>
<li> All can backup to disk. </li>
<li> All have Open Source versions (altho Amanda and Bacula have $ versions that provide additional, non-OSS goodies). </li>
<li> All can support MacOSX, Windows, *nix clients via some combination of rsync, samba, or a semi-proprietary protocol (the Bacula protocol is OSS, but only Bacula uses it) </li>
<li> All can support 10s-100s of clients per server. </li>
<li> None currently have support for the <a href="http://www.ndmp.org/info/overview.shtml">NDMP</a> protocol (tho Bacula and Zmanda are planning it) </li>
<li> None support transparent <a href="http://en.wikipedia.org/wiki/Bare-metal_restore">Bare Metal Restore</a> </li>
<li> None support <em>duplicate on backup</em> </li>
<li> None natively support <a href="http://en.wikipedia.org/wiki/Snapshot_(computer_storage)">Snapshots</a> altho all can be implemented on Solaris to provide a number of ZFS features such as: snapshots, filesystem compression, slightly better I/O, etc.  However, while Solaris is certainly stable, there are still aspects of ZFS that seem to be causing problems. (To ease a transition from Linux to Solaris, <a href="http://www.nexenta.org/os">Nexenta</a> is a free distribution of Solaris packaged as Ubuntu). </li>
</ul>
<h3><a name="_amanda_zmanda"></a>Amanda/Zmanda</h3>
<ul>
<li> useful for separate instances of backup servers servicing sets of clients (star config) </li>
<li> store to tape or disk; can use Barcode writers, readers to generate, process labels </li>
<li> Amanda uses std unix text-based config files; seems to be more easily configurable than Bacula, tho less so than BackupPC (configured by Web GUI or text files) </li>
<li> very flexible backup scheduling mechanism </li>
<li> uses open &amp; common storage protocols (tar, dump, gzip, compress, etc) </li>
<li> very secure, with on-wire &amp; on-disk encryption </li>
<li> CLI or GUI available. </li>
<li> mature (16yr), extremely well-reviewed C source code. Zmanda says that they are re-writing it in Perl for easier maintenance and to encourage more external contributions. </li>
<li> very large user base </li>
<li> Zmanda extension for backing up MySQL. </li>
<li> uses own internal datastructures, so no additional DB instance required. </li>
<li> <a href="http://wiki.zmanda.com/images/a/a4/Amanda-calug.pdf">Zmanda-supplied technical PDF</a> about current and near-future options (NDMP, others) </li>
<li> Commercial support costs are <a href="http://www.zmanda.com/pricing.html">as described here</a>. </li>
</ul>
<h3><a name="_bacula"></a>Bacula</h3>
<ul>
<li> Commercial support available </li>
<li> store to tape or disk </li>
<li> allows multiple servers to service multiple clients simultaneously, allowing a much larger single instance </li>
<li> uses SQLite, MySQL or PostgreSQL as DB (all well-supported &amp; understood), but additional complexity, and DB is therefore a critical link. </li>
<li> published storage code, but not as widely used as Amanda </li>
<li> Commercial support costs are <a href="http://www.baculasystems.com/eng/Products/Subscriptions">as described here</a>. </li>
</ul>
<h3><a name="_backuppc"></a>BackupPC</h3>
<ul>
<li> Zmanda will be providing <a href="http://www.zmanda.com/backuppc.html">commercial support for BackupPC</a> very soon, probably ~$10/client for large academic institutions. </li>
<li> store only to disk, <em>not tapes</em>; therefore not useful for long-term archiving, unless willing to buy appropriately large hardware. </li>
<li> support for client restores </li>
<li> UCI-written support for client self-registration. </li>
<li> rsync support depends on Perl implementation of rsync so it lags the most recent rsync features by a few revisions. </li>
<li> supports file de-dupe via hard links, but hard links limit the storage to 1 filesystem (but XFS, ext4 support up to &gt;= 1-8 EB). </li>
<li> uses file tree, not DB, as the data structure; simpler but more primitive. </li>
<li> uses no proprietary or application-specific client software; therefore very simple to implement and robust for restores. </li>
<li> supports encrypted data transfer for MacOSX, Lin, but not (easily) for Windows (encryption requires use of ssh public key from server and ssh remote execution of windows executables). </li>
</ul>
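<p>To make the hard-link de-dupe point above a bit more concrete, here is a minimal Python sketch of the general idea.  It is <em>not</em> BackupPC&#8217;s actual code or pool layout; the function name <em>store_deduped</em>, the pool directory, and the use of a SHA-256 content hash are all assumptions made for illustration.  It also shows why the pool and the backup trees have to live on 1 filesystem: a hard link cannot cross filesystems.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>import hashlib
import os
import shutil

def store_deduped(src_path, pool_dir, dest_path):
    """Place src_path in the backup tree at dest_path, sharing storage
    with any identical file already in pool_dir via a hard link.
    (Hypothetical helper, for illustration only.)"""
    # Hash the file contents so identical files map to one pool entry.
    digest = hashlib.sha256()
    with open(src_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    pooled = os.path.join(pool_dir, digest.hexdigest())

    os.makedirs(pool_dir, exist_ok=True)
    if not os.path.exists(pooled):
        shutil.copyfile(src_path, pooled)   # first copy of this content

    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # A hard link is just another name for the same inode, so no extra
    # data is stored.  os.link raises OSError if pool_dir and dest_path
    # are on different filesystems.
    os.link(pooled, dest_path)
    return pooled</pre>
</td>
</tr>
</table>
<p>Backing up the same file from 2 clients this way stores the data once; the pool entry and both backup paths are 3 names for a single inode.</p>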
<h3><a name="_some_feature_comparisons"></a>Some Feature Comparisons</h3>
<p>Nice <a href="http://wiki.bacula.org/doku.php?id=comparisons">Table of Feature Comparisons</a> among OSS Backup packages and proprietary Backup packages</p>
<h3><a name="_crude_measures_of_popularity"></a>Crude Measures of popularity</h3>
<p>Via Google-linking (a very crude measure; very sensitive to key words)</p>
<ul>
<li> <em>Amanda/Zmanda</em>: 411,000 google links; 460 pages linking to zmanda.com; 252 linking to amanda.zmanda.com </li>
<li> <em>Bacula</em>: 402,000 google links; 468 pages linking to www.bacula.org. </li>
<li> <em>BackupPC</em>: 267,000 google links; 5,070 pages link to backuppc.sourceforge.net; 3,140 pages link to backuppc.sf.net </li>
</ul>
<p>And for some comparison:</p>
<ul>
<li> <em>Veritas</em>: 255,000 for <em>veritas backup software</em>; 5,510 linking to <a href="http://www.symantec.com">http://www.symantec.com</a> (all products) </li>
<li> <em>Networker</em>: 66,500 for <em>EMC networker</em>; 2,370 linking to emc.com (all products) </li>
</ul>
<p><a href="http://trends.google.com">Google Trends</a> indicates that the Search Volume Index is decreasing rapidly for Veritas, and is holding constant for EMC Networker, Bacula and BackupPC, but Bacula is ~3x the value of BackupPC, which is itself 2x EMC Networker. Veritas has decreased to just above BackupPC.  Zmanda only started in 2006, and does not have much of an index built up, but it shows significant spikes in News Reference Volume. &#8220;amanda backup&#8221; has been decreasing from 2004 and remains about 1/2 of BackupPC and 1/6 of Bacula.</p>
<hr />
<h2><a name="_future_projects"></a>Future Projects</h2>
<p>I would like to see the automatic backup of all of our faculty&#8217;s Desktops/Laptops (~1000) to shield against catastrophic loss of recent data, but I&#8217;m not sanguine about the chances for this due to funding issues, unless we go with a pure OSS solution, which would be considerably better than nothing at all.</p>
<h3><a name="_details"></a>Details</h3>
<p>At 10GB per faculty, this would mean a storage server of ~20TB (now a medium-sized file server), and at 1% data changing per day, that means that on the order of 100GB a day would have to be transferred for incremental backups.  At 7 MB/s (a decent transfer rate over 100Mb), transmission time is only ~4 hrs, easily done in a night, in parallel sessions. Measured on an Opteron backup server (doing server-side compression &amp; file deduping via hardlinking), it takes about 25% of a CPU to handle 1 backup session, so a 4-core machine could theoretically handle ~16 simultaneous backups, if the bandwidth can supply it with enough data.  Our test backup server is currently single-homed, but has 5 interfaces, so could easily be multi-homed.  If we use OSS Backup software, it will cost ~$10K for hardware to provide 1000 very valuable PCs or laptops with at least protection against catastrophic loss.</p>
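<p>The arithmetic behind those estimates can be sketched in a few lines of Python.  The inputs are the figures quoted above; the variable names and the 1 GB = 1000 MB conversion are assumptions of the sketch.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10">
<tr>
<td>
<pre>clients = 1000            # faculty Desktops/Laptops
gb_per_client = 10
daily_change = 0.01       # 1% of the data changes per day
rate_mb_per_s = 7         # sustained rate over 100Mb ethernet
cpu_per_session = 0.25    # fraction of 1 core per backup session
cores = 4

total_gb = clients * gb_per_client
daily_gb = total_gb * daily_change
# Time to move one day's changes as a single stream (1 GB = 1000 MB).
hours = daily_gb * 1000 / rate_mb_per_s / 3600
sessions = int(cores / cpu_per_session)

print(f"primary data:          {total_gb / 1000:.0f} TB")
print(f"daily incremental:     {daily_gb:.0f} GB")
print(f"transfer time:         {hours:.1f} hrs at {rate_mb_per_s} MB/s")
print(f"simultaneous sessions: {sessions} on {cores} cores")</pre>
</td>
</tr>
</table>
<p>The primary data alone comes to about 10TB, so a ~20TB server presumably leaves the remainder for incremental history and growth.</p>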
<hr />
<h2><a name="_considerations_for_any_such_decision"></a>Considerations for any such decision</h2>
<p><em>Please feel free to expand on (or critique) these points.</em></p>
<h3><a name="_clients"></a>Clients</h3>
<ul>
<li> what features of current clients are actually used? (Do you need a wiki or calendaring function in your backup software? &#8211; why pay for unused features?) </li>
<li> how many clients currently served?
<ul>
<li> how many would you like to serve? </li>
</ul>
</li>
<li> OS distribution of clients </li>
<li> minimum backup cycle </li>
<li> how much data per client per cycle
<ul>
<li> max data accepted, if such a limit </li>
</ul>
</li>
<li> do clients need to be able to initiate their own backups? </li>
<li> client email notification of problems or email just to admin? </li>
<li> mechanism of backup (rsync or propr.) </li>
<li> compressed on client or server (open or propr.) </li>
<li> client upgrades done from server, or client involvement </li>
<li> preferred client registration (self or admin registration?) </li>
<li> support of mobile client IP&#8217;s or static IP only? </li>
<li> local nets only or internet support (backup/restore in Europe?) </li>
<li> special application backups required?  MS Exchange, RDBMSs, CMSs </li>
<li> does timing of backup have to be client-adjustable or <em>just at night</em> or doesn&#8217;t matter? </li>
<li> typical client ethernet type and number of hops to server </li>
<li> need to support secure backup over wireless? </li>
<li> can open files be backed up? Win/Lin/Mac <sup>+</sup> </li>
<li> do you need to back up encrypted filesystems and if so, how many features will work with such filesystems? </li>
<li> are clients cluster aware?  Can they back up to a cluster or one of a set of distributed servers? <sup>+</sup> </li>
<li> if the software backs up Windows servers, is it SharePoint-aware? Exchange-aware? <sup>+</sup> </li>
<li> if the software backs up Windows servers, can it do hot reinserts into Active Directory?  For example, can it restore a single security group deleted from a domain controller or a single mailbox on an Exchange server? <sup>+</sup> </li>
<li> can users recover files themselves?  If so, are they properly restricted to only recovering files they own? <sup>+</sup> </li>
</ul>
<h3><a name="_server_amp_admin"></a>Server &amp; Admin</h3>
<ul>
<li> what features of the server are actually used? (why pay for unused features?) </li>
<li> how many servers are required per 100 clients? </li>
<li> what are the FTE setup, configuration, and support requirements for the server &amp; each type of client? </li>
<li> do servers stage-to-disk before writing to tape? <sup>+</sup> </li>
<li> if multiple servers, are they synchronized or independent? </li>
<li> type of server required (OS, CPUs, RAM, used for other services?) </li>
<li> is there an institutional inability to deal with a particular platform? </li>
<li> how many net interfaces per server are typically being used? </li>
<li> data format on server (open, propr.) </li>
<li> storage (tape / tape robot, disk, hierarchical) </li>
<li> type of server admin interface used, preferred (native GUI, Web, CLI) </li>
<li> database, other support software required (OSS, propr.) </li>
<li> how are server upgrades done? </li>
<li> does a feature of the backup server require or obviate a specific filesystem type? </li>
<li> can you have (do you need) standby servers? </li>
<li> is the software database server-aware? For example, can it backup Oracle databases without having to take the databases offline? <sup>+</sup> </li>
<li> Does it support direct fiber/fiber-aware backups? <sup>+</sup> </li>
</ul>
<h3><a name="_backup_protocol"></a>Backup Protocol</h3>
<ul>
<li> what kinds of backups are required (disaster recovery, archival, incremental, snapshots) </li>
<li> network protocol used, preferred </li>
<li> encryption requirements </li>
<li> type of data compression used, preferred &#8211; format, block-level de-duping? file-level de-duping?, with HW accel or just software? </li>
<li> is dedupe technology required?  Versus just buying more storage media?  If needed, what level is needed? (compression of individual files, hard links to files, proprietary dedupe technology, <a href="http://www.linux-mag.com/cache/7535/1.html">target or source-based dedupe</a>, etc) </li>
<li> does the software deduplicate to tape? Or does it reconstitute the data, then write it to tape? <sup>+</sup> </li>
<li> does the de-duplication component (if the backup software has one) write both the deduplicated data and the de-duplicate table to physical tape? (This obviates the need to have the same backup server to reconstitute the data.) <sup>+</sup> </li>
<li> if using rsync, which version?  Version 3x has significant advantages over 2x. <sup>=</sup> </li>
</ul>
<table cellpadding="8">
<tr valign="top">
<td>
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:1px solid silver;">
<p><b>Contributions:</b></p>
<p><sup>+</sup> contributed by <a href="mailto:Scott.Talkovic@uci.edu">Scott Talkovic</a></p>
<p><sup>=</sup> contributed by <a href="mailto:sbeardsley@ucdavis.edu">Scott Beardsley</a></p>
</td>
</tr>
</table>
<h3><a name="_cost_cost_efficiency_amp_legal_concerns"></a>Cost, Cost Efficiency, &amp; Legal Concerns</h3>
<ul>
<li> what is the cost per user for the system? </li>
<li> what is the coverage of the various systems at different price points? </li>
<li> what is the tradeoff of moving up one curve and down another? (for example, the cost of using inefficient dedupe technology and compensating with more storage, vs using efficient, but proprietary dedupe technology) </li>
<li> what is the institutional cost of using technology that does not cover all clients equally vs the cost of not covering them at all? </li>
<li> what features in the proprietary software will actually be used?  (Since we have such systems, we should be able to document this at least for Networker and Retrospect) </li>
<li> what is cost and timing of scaling?  If we have to provide backups for a new department suddenly, what is the $ cost and the time lag? </li>
<li> if it&#8217;s possible to go out of compliance, what is the cost of doing so? </li>
<li> what legal recourse is there if the system fails in some way? </li>
</ul>
<hr />
<h2><a name="_feedback_and_suggestions"></a>Feedback and Suggestions</h2>
<p>We are still in a preliminary mode for this evaluation, but if you have suggestions, queries, or would like to be notified of the final result and sent any documentation of the process,    <a href="mailto:harry.mangalam@uci.edu?Subject=UCI%20OSS%20Backup%20Comment">please let me know</a>.</p>
<h3><a name="_contributed_suggestions"></a>Contributed Suggestions</h3>
<table bgcolor="#ffffee" width="100%" cellpadding="15">
<tr>
<td>
<p><em>Recommendations</em></p>
<p>While your open source solutions are good how easy are they for the end user to restore and what about reporting? [Someone] turned me onto <a href="http://www.crashplan.com">Crashplan</a> which is definitely not open source, [but] is a pretty good and robust product. It&#8217;s very simple for the end user. It provides daily reports to the admin &amp; user on backups, super simple for restores, allows for individual or group quotas, incremental backups can occur during certain hours or continuous real-time backups. Stores to disk. Data is encrypted &amp; compressed before sending. Identical files across any directory or client are stored only once. Users can back up data to other sites as well or others using Crashplan software for additional redundancy. Runs on Linux/Mac/Windows platforms. Unsure how well it scales. Supposed to handle thousands of connections and terabytes of data. Server is free while the client is expensive ($70/client [for &lt; 5 clients]) but allows backups from anywhere.</p>
<p>Steve Carlyle &lt;<a href="mailto:Steve.Carlyle@uci.edu">Steve.Carlyle@uci.edu</a>&gt;</p>
</td>
</tr>
</table>
<hr />
<h2><a name="_resources"></a>Resources</h2>
<h3><a name="_books"></a>Books</h3>
<p>The O&#8217;Reilly book <em>Backup &amp; Recovery: Inexpensive Backup Solutions for Open Systems</em> is a good overview of some important considerations for a backup system. UC people can read it in its entirety <a href="http://www.amazon.com/Backup-Recovery-Inexpensive-Solutions-Systems/dp/0596102461">here via O&#8217;Reilly Safari</a></p>
<p>Note that O&#8217;Reilly itself <a href="http://searchdatabackup.techtarget.com/news/article/0,289142,sid187_gci1320375,00.html">uses a combination of commercial and Open Source tools</a>.</p>
<h3><a name="_web_sites"></a>Web sites</h3>
<p>Curtis Preston, the author of the above book, runs a backup-related site called <a href="http://www.backupcentral.com">BackupCentral</a>, which is a very good info clearing house / blog on all things backup.</p>
<p><a href="http://searchdatabackup.techtarget.com/">SearchDataBackup</a> is another backup-related site.</p>
<h3><a name="_whitepapers"></a>Whitepapers</h3>
<p>Of variable quality, some sponsored by vendors.</p>
<p><a href="http://moo.nac.uci.edu/~hjm/BackupOnABudget_6.9.pdf">Backup on a Budget</a> (PDF) &#8211; by the ubiquitous Curtis Preston.  Reiteration of many of his points, especially that for many organizations backup is not rocket science, that people are the most expensive thing you pay for, and that backup to clouds <em>may</em> be useful (but mostly on rare occasions).</p>
<h3><a name="_useful_individual_pages"></a>Useful(?) individual pages</h3>
<h4><a name="_fwbackups"></a>fwbackups</h4>
<p>Via <a href="http://blogs.techrepublic.com.com/10things/?p=895">TechRepublic&#8217;s 10 outstanding Linux backup utilities</a>.  Very short list, with some interesting choices.  <a href="http://www.diffingo.com/oss/fwbackups">fwbackups</a> is a very slick little program writ in Python/GTK that can work on all platforms (but GTK on Windows requires the whole GTK lib).  It&#8217;s not an enterprise system, but if you have a Personal Linux box to back up it&#8217;s pretty straightforward, and can use rsync/ssh to encrypt over the wire.  However, it does require your own shell login and dedicated dir space on a server.</p>
<p><a href="http://www.boxbackup.org">Boxbackup</a> is not in the running for an enterprise backup system, but is interesting for near-realtime backup for a small group of clients.</p>
<hr />
<h2><a name="_latest_version"></a>Latest version</h2>
<p>The latest version of this document should always be <a href="http://moo.nac.uci.edu/%7ehjm/UCI_OSS_Backup_Evaluation.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/13/uci-oss-backup-evaluation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>

		<media:content url="http://hjmangalam.files.wordpress.com/2009/09/uci_backup_clients_plot.jpg" medium="image">
			<media:title type="html">UCI_backup_clients_plot.jpg</media:title>
		</media:content>
	</item>
		<item>
		<title>An IntroBDUCtion</title>
		<link>http://hjmangalam.wordpress.com/2009/09/13/an-introbduction/</link>
		<comments>http://hjmangalam.wordpress.com/2009/09/13/an-introbduction/#comments</comments>
		<pubDate>Sun, 13 Sep 2009 15:44:22 +0000</pubDate>
		<dc:creator>hjmangalam</dc:creator>
				<category><![CDATA[HowTos]]></category>

		<guid isPermaLink="false">http://hjmangalam.wordpress.com/?p=22</guid>
		<description><![CDATA[How to let us know what&#8217;s wrong Since BDUC is a research cluster, it is in perpetual flux as apps, libraries, and modules are added, updated, or modified so sometimes a bug will creep in where none existed before. When you find something missing or a behavior that seems odd, please let us know. You [...]]]></description>
			<content:encoded><![CDATA[<hr />
<h2><a name="_how_to_let_us_know_what_8217_s_wrong"></a>How to let us know what&#8217;s wrong</h2>
<p>Since BDUC is a research cluster, it is in perpetual flux as apps, libraries, and modules are added, updated, or modified so sometimes a bug will creep in where none existed before.  When you find something missing or a behavior that seems odd, please let us know.  You can <a href="mailto:bduc-request@uci.edu?Subject=User%20Comment">email the BDUC admins here</a>.</p>
<p>Note that it will help considerably if you tell us more than <em>It doesn&#8217;t work</em>, or <em>I can&#8217;t log in</em>.  If you want quick resolution of the problem, please send us as much relevant info as possible, including a description of what triggered the misbehavior. If the misbehavior involves an error message, doing a Google search on that error message <strong>verbatim</strong> will often produce the answer.</p>
<p>While many of you are not programmers, you&#8217;re dealing with programs, and if we are to have any hope of debugging the process that caused the failure, the more info the better (usually).  PLEASE READ  <a href="http://www.chiark.greenend.org.uk/~sgtatham/bugs.html">How to Report Bugs Effectively</a> before you report a failure.  At least glance at it.</p>
<p>If you&#8217;re going to spend a lot of time with computers, you should also read Eric Raymond&#8217;s encyclopedic <a href="http://www.catb.org/~esr/faqs/smart-questions.html">How To Ask Questions The Smart Way</a>.   It will be of use thru your life.</p>
<hr />
<h2><a name="_what_is_a_bduc"></a>What is a BDUC?</h2>
<p>The <em>Broadcom Distributed Unified Cluster</em> (BDUC) is, as the name suggests, a distributed group of clusters unified by running under a single <a href="http://moo.nac.uci.edu/~hjm/Sun_Grid_Engine_62_install_and_config.pdf">Sun Grid Engine</a> (SGE) Resource Manager. BDUC consists of subclusters of 2-48core AMD64 Opteron nodes (for a total of about 380 cores) running 64bit Linux.  One group of 40 nodes is in the NACS Academic Data Center and another of 40 nodes is in the ICS data center, for a total of ~125 nodes / ~380 cores. There is another smaller subcluster (see BEAR, below) running Kubuntu.</p>
<p>The nodes are interconnected with 1Gb ethernet and have the MPICH and MPICH2 environments for parallel jobs. All the nodes share a common <strong>/home</strong> which is on a RAID6 system but which is NOT backed up.  If you generate valuable data, you should move it off ASAP.</p>
<hr />
<h2><a name="_what_is_a_bear"></a>What is a BEAR?</h2>
<p>The <em>Broadcom EA Replacement</em> (BEAR) is a Broadcom-supplied subcluster consisting of 7 larger nodes administered  especially for interactive use.  These nodes each have 4-8 Opteron285/2.6GHz cores and 32-64GB RAM.  They run the 64bit Kubuntu (10.04.1) Desktop Edition, so you can have access to the full graphical KDE desktop via VNC, as well as the individual GUI applications and shell utilities.  BEAR is fully integrated with BDUC and shares its <strong>/home</strong> directories, but has a different, larger set of applications.  One of the nodes (claw1) is reserved for interactive use; the others can be used for both interactive and batch runs (currently limited to 48hrs) on the <strong>claws</strong> Q.</p>
<p><em>You can compile and run jobs on all the claw nodes, but don&#8217;t saturate claw1 with multiple serial or parallel jobs.</em></p>
<hr />
<h2><a name="_how_do_i_get_an_account"></a>How do I get an account?</h2>
<p>You request an account by sending a message <strong>including your UCINetID</strong> to &lt;<a href="mailto:bduc-request@uci.edu">bduc-request@uci.edu</a>&gt;. Please let us know in that message if you want to use the SGE batch system to submit long-running or multiple jobs.  You should get an acknowledgement within a few hours and your account should be available then. By default, BDUC &amp; BEAR are open to all postgrad UCI researchers, and are also available to undergrads with faculty sponsorship.</p>
<h3><a name="connect"></a>How do I connect to BDUC?</h3>
<p>You <em>must</em> use <a href="http://en.wikipedia.org/wiki/Secure_Shell">ssh</a>, an encrypted terminal protocol. Be sure to use the <em>-Y</em> or <em>-X</em> options, if you want to view X11 graphics (<a href="#graphics">see below</a>).</p>
<p><strong>On a Mac</strong>, use the <em>Applications &#8594; Utilities &#8594; Terminal</em> app.<br /> <strong>On a WinPC</strong>, use the excellent <a href="http://www.chiark.greenend.org.uk/~sgtatham/putty/">putty</a>. See also <a href="#XonWin">below</a>.<br /> <strong>On Linux</strong>, I assume that you know how to start a Terminal session with one of the bazillion terminal apps (<a href="http://konsole.kde.org/">konsole</a> &amp; <a href="http://software.jessies.org/terminator/">terminator</a> are 2 good ones).</p>
<p><a href="http://en.wikipedia.org/wiki/Telnet">Telnet</a> access is NOT available. Use your UCINetID and associated password to log into the login node (bduc-login.nacs.uci.edu) via <strong>ssh</strong>.</p>
<p>To connect using a Mac or Linux, open the Terminal app and type:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">ssh -Y UCINetID@bduc-login.nacs.uci.edu
# the '-Y' requests that the X11 protocol is tunneled back to you inside of ssh.</pre>
</td>
</tr>
</table>
<p>As of June 15th, 2009, you can also ssh directly to the claw1 node for a 64bit interactive node from anywhere on campus.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">ssh -Y UCINetID@bduc-claw1.nacs.uci.edu</pre>
</td>
</tr>
</table>
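<p>If you connect often, a host alias in the <em>~/.ssh/config</em> file on your own Mac or Linux machine can save some typing.  The fragment below is only a sketch: the alias <em>bduc</em> is an arbitrary name chosen for illustration, and <em>UCINetID</em> is a placeholder for your own login.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">Host bduc
        HostName bduc-login.nacs.uci.edu
        User UCINetID
        # these 2 lines are the config-file equivalent of '-Y'
        ForwardX11 yes
        ForwardX11Trusted yes</pre>
</td>
</tr>
</table>
<p>With that in place, <em>ssh bduc</em> should behave like the longer command shown above.</p>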
<h3><a name="passwordless_ssh"></a>How to set up passwordless ssh</h3>
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>Passwordless ssh setup is now automatic</b></p>
<p>From <strong>Nov. 15th, 2009</strong> onwards, this is set up for you automatically when your account is activated, so you no longer have to do this manually.  However, as a reference for those of you who want to set it up on other machines, I&#8217;ve moved the documentation to the <a href="#HowtoPasswordlessSsh">Appendix</a>. The automatic setup also includes setting the <em>~/.ssh/config</em> file to prevent the &#8220;first time ssh challenge problem&#8221;.</p>
</td>
</tr>
</table>
<h3><a name="ssherrors"></a>ssh errors</h3>
<p>Occasionally you may get the error below when you try to log into BDUC (or more rarely, among the BDUC nodes):</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
93:c1:d0:97:e8:a0:f5:91:13:89:7d:94:6c:aa:9b:8c.
 Please contact your system administrator.
Add correct host key in /Users/joeuser/.ssh/known_hosts to get rid of this message.
Offending key in /Users/joeuser/.ssh/known_hosts:2
RSA host key for bduc.nacs.uci.edu has changed and you have requested strict checking.
 Host key verification failed.</pre>
</td>
</tr>
</table>
<p>The reason for this error is that the computer to which you&#8217;re connecting has changed its identification key.  This might be due to the mentioned <em>man-in-the-middle</em> attack, but it is far more likely that an administrative change has caused the BDUC node to change its ID.  This may be due to a change in hardware, reconfiguration of the node, a reboot, an upgrade, etc.</p>
<p>The fix is buried in the error message itself.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">Offending key in /Users/joeuser/.ssh/known_hosts:2</pre>
</td>
</tr>
</table>
<p>Simply edit that file and delete the line referenced.  When you log in again, there will be a notification that the key has been added to your <em>known_hosts</em> file.</p>
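<p>If you would rather not edit the file by hand, OpenSSH&#8217;s <em>ssh-keygen</em> can remove the stale entry for you.  This is a sketch that assumes an OpenSSH client (Mac or Linux) and uses the hostname from the error message above; substitute whichever host your own message names.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># remove every key stored for this host from ~/.ssh/known_hosts
# (ssh-keygen keeps a copy of the previous file as known_hosts.old)
ssh-keygen -R bduc.nacs.uci.edu</pre>
</td>
</tr>
</table>
<p>Then log in again as usual and accept the new key when ssh asks about it.</p>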
<p>Should you want to be able to log in regardless of this warning, you&#8217;ll have to edit the <em>/etc/ssh/ssh_config</em> file (or your own <em>~/.ssh/config</em>, which doesn&#8217;t require root) and add the 2 lines shown below (Macs, Linux).   There are <a href="http://goo.gl/rCeE">good reasons for not doing this</a>, but it&#8217;s a convenience that many of us use.  Consider it the <em>rolling stop</em> of ssh security.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">Host *
        StrictHostKeyChecking no</pre>
</td>
</tr>
</table>
<p>After you do that, you&#8217;ll still get the warning (which you should investigate) but you&#8217;ll be able to log in.  Note that the default setting, <em>ask</em>, still refuses to connect when a host key has changed, and that even with <em>no</em>, OpenSSH disables password authentication for such a connection, so this skip-around only helps if you log in with ssh keys.</p>
<p>If you&#8217;re using <a href="http://www.chiark.greenend.org.uk/~sgtatham/putty/">putty</a> on Windows, you won&#8217;t be able to effect this security skip-around. <a href="http://goo.gl/rCeE">Read why here</a>.</p>
<h3><a name="_after_you_log_in_8230"></a>After you log in&#8230;</h3>
<p>Logging in to <strong>bduc.nacs.uci.edu</strong> will give you access to a Linux shell (<a href="http://www.gnu.org/software/bash/">bash</a> by default, <a href="http://www.tcsh.org/Home">tcsh</a>, ksh available).</p>
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>Some bash pointers.</b></p>
<p>The default shell (or environment in which you type commands) for your BDUC login is bash.  It looks like the Windows CMD shell, but it is MUCH more powerful.  There&#8217;s a good exposition of some of the things you can do with the shell <a href="http://www.catonmat.net/blog/the-definitive-guide-to-bash-command-line-history/">here</a> and a <a href="http://www.catonmat.net/blog/wp-content/plugins/wp-downloadMonitor/user_uploads/bash-history-cheat-sheet.pdf">good cheatsheet here</a>. If you&#8217;re going to spend some time working on BDUC, it&#8217;s worth your while to learn some of the more advanced commands and tricks.</p>
<p>You can also customize your bash prompt to produce more info than the default <em>user@host</em>. While you&#8217;re waiting for your calculations to finish, check out the definitive <a href="http://tldp.org/HOWTO/Bash-Prompt-HOWTO">bash prompt HOWTO</a> and / or use <a href="http://bashish.sourceforge.net/">bashish</a> to customize your bash environment.</p>
<p><a href="http://www.dirb.info">DirB</a> is a set of bash functions that make it very easy to bookmark and skip back and forth to those bookmarks. Download the file from the URL above, <em>source</em> it early in your <em>.bashrc</em> and then read how to use it via <a href="http://moo.nac.uci.edu/~hjm/DirB.pdf">this link</a>.  It&#8217;s very simple and very effective. Very briefly, <em>s bookmark</em> to set a bookmark, <em>g bookmark</em> to cd to bookmark, <em>sl</em> to list bookmarks.  Recommended if you have deep dir trees and need to keep hopping among the leaves.</p>
</td>
</tr>
</table>
<p>You will also have access to the resources of the BDUC via the SGE commands.  The most frequently used commands for SGE will be <em>qrsh</em> to request an interactive node and <em>qsub</em> to submit a batch job.  You can also check the status of various resources with the <em>qconf</em> command.  See the <a href="http://gridengine.info/files/SGE_Cheat_Sheet.pdf">SGE cheatsheet</a> for more detail.</p>
<p>The login node should be considered your 1st stop in doing real work.  You can copy files to and from your home directory from the login node, but you shouldn&#8217;t run any long (&gt;10m) jobs on the login node.  If you do and we notice, we&#8217;ll kill them off.  To do real work, request a node from the interactive queue, like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># for a 64bit interactive node
hmangala@bduc-login:~ $ qrsh -q int

# wait a few seconds...

hmangala@bduc-amd64-2:~

#or you can ssh directly to one of the claw nodes:

ssh claw1</pre>
</td>
</tr>
</table>
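<p>For work that doesn&#8217;t need your attention, a batch job is submitted with <em>qsub</em>.  The script below is a minimal sketch rather than a BDUC-specific recipe: the job name, the output file and the program being run (<em>my_analysis.sh</em>) are all hypothetical, and no queue is requested because the right one depends on local configuration.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">#!/bin/bash
# lines starting with '#$' are read by SGE as qsub options
# run the job under bash
#$ -S /bin/bash
# job name (hypothetical)
#$ -N myjob
# start in the directory you submitted from
#$ -cwd
# merge stderr into stdout and collect both in one file
#$ -j y
#$ -o myjob.out

# the actual work (hypothetical program and input)
./my_analysis.sh input.dat</pre>
</td>
</tr>
</table>
<p>Save it as, say, <em>myjob.sh</em>, submit it from the login node with <em>qsub myjob.sh</em>, and check on it with <em>qstat</em>.</p>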
<hr />
<h2><a name="datastorageonbduc"></a>Data Storage on BDUC</h2>
<h3><a name="_no_limits_but_no_warnings_either"></a>No limits, but no warnings either</h3>
<p>We have not yet imposed disk quotas on BDUC. We encourage you to use the data storage you need, up to hundreds of GB, but we also warn you that if we detect large directories that have not been used in weeks, we retain the right to clean them out.  The larger the dataset, the more scrutiny it will get. IF YOU HAVE LARGE DATASETS AND ARE NOT USING THEM, THEY MAY DISAPPEAR WITHOUT WARNING. We mean it when we say that if you generate valuable data, it is up to you to back it up elsewhere ASAP.</p>
<p>If you have no idea of how large your data is and how it is distributed, you can find out via the <em>du</em> command (disk usage).</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ cd /home
$ du -sh hmangala   # you would substitute *your* home dir
5.3G    hmangala/</pre>
</td>
</tr>
</table>
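<p>To see which subdirectories account for most of that total, you can combine <em>du</em> with <em>sort</em>.  A small sketch, run from your home directory; the 15-line, 1GB and 30-day thresholds are arbitrary choices for illustration.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">cd
# size of each top-level item in KB, largest last (dotfiles are skipped)
du -sk * | sort -n | tail -15

# files larger than 1GB that have not been read in over 30 days
find . -type f -size +1048576k -atime +30 -exec ls -lh {} \;</pre>
</td>
</tr>
</table>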
<p>To see the distribution of files graphically,</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ cd; ssh -Y claw1 'kdirstat'</pre>
</td>
</tr>
</table>
<p>This will launch <a href="http://kdirstat.sourceforge.net/">kdirstat</a> which will determine the size, type and age of your files and present them in a color-coded map by size.  You can then inspect and hopefully remove the ones least needed.</p>
<h3><a name="filestoandfrom"></a>How do I get my files to and from BDUC?</h3>
<p>This is covered in more detail in the document <a href="http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html">HOWTO_move_data</a>. There are currently a few ways to get your files to and from BDUC.  The most direct, most available way is via <a href="http://en.wikipedia.org/wiki/Secure_copy">scp</a>.  Besides the commandline <strong>scp</strong> utility bundled with all Linux and Mac machines, there are GUI clients for MacOSX and Windows, and of course, Linux.  If you have large collections of files or large individual files that change only partially, you might be interested in using <a href="http://moo.nac.uci.edu/%7ehjm/HOWTO_move_data.html#rsync">rsync</a> as well.</p>
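<p>As a quick illustration of the commandline <strong>scp</strong>, the sketch below is meant to be run on your own Mac or Linux machine.  The file and directory names are hypothetical, and <em>UCINetID</em> stands for your own login.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># copy one file from your machine to your BDUC home dir (note the trailing ':')
scp results.tar.gz UCINetID@bduc-login.nacs.uci.edu:

# copy a whole directory from BDUC back to the current dir on your machine
scp -r UCINetID@bduc-login.nacs.uci.edu:project_dir .</pre>
</td>
</tr>
</table>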
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>Make sure bash knows if this is an interactive login</b></p>
<p>If you have customized your <em>.bashrc</em> to spit out some useful data when you log in (such as the number of jobs you have running), make sure to wrap that command in a test for an interactive shell.  Otherwise, when you try to <em>scp</em> or <em>sftp</em> or <em>rsync</em> data to your BDUC account, your shell will unexpectedly vomit up the same text into the connecting program with unpleasant results.  Wrap those commands with something like this in your <em>.bashrc</em>:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># $- holds the shell's option flags; it contains an 'i' only in an interactive shell
interactive=`echo $- | grep -c i`
if [ "${interactive}" = 1 ] ; then
  # put all your interactive stuff in here:
  # eg tell me what my 22 latest files are
  ls -lt | head -22
fi</pre>
</td>
</tr>
</table>
</td>
</tr>
</table>
<h4><a name="_windows"></a>Windows</h4>
<p>The hands-down, no-question-about-it, go-to utility here is the free <a href="http://www.winscp.net">WinSCP</a>, which gives you a graphical interface for SCP, SFTP and FTP.</p>
<table bgcolor="#ffffee" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<p style="margin-top:0;"><b>Line endings in files from Windows vs Linux/Unix/MacOSX</b></p>
<p>If you are creating data on Windows and saving it as <em>plain text</em> for use on Linux, many Windows applications will save the data with DOS <em>end-of-line</em> characters (a carriage return plus a line feed aka <em>CRLF</em>) as opposed to the Linux/MacOSX newline (a line feed alone aka <em>LF</em>).  This may cause problems on Linux as some applications will detect and automatically correct Windows newlines but others will not.  The same goes for visual editors, which you might expect to give you an indication of this.  Most editors will give you a choice as to which newline you want when you save the file, but in some the choice is not obvious.  In any case, unless you&#8217;re sure of how your data is formatted, you can pass it through the Linux utility <em>dos2unix</em>, which will replace the Windows newline with a Linux newline.  With the <em>-n</em> flag it writes the converted text to a new file and leaves the original untouched:</p>
<pre style="color:gray;padding:.5em;">$ dos2unix -n windows.file linux.file</pre>
<p><a href="http://en.wikipedia.org/wiki/Newline">Read the whole sordid history of the newline here</a></p>
</td>
</tr>
</table>
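<p>If you&#8217;re not sure whether a file has Windows newlines at all, you can count them before converting.  The helper below is a sketch for bash; the function name <em>count_crlf</em> is made up for illustration.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># print the number of lines in a file that end in a carriage return;
# 0 means the file already has Linux newlines
count_crlf() {
    grep -c $'\r$' "$1"
}</pre>
</td>
</tr>
</table>
<p>After pasting that into your shell (or your <em>.bashrc</em>), <em>count_crlf windows.file</em> prints the count.  The standard <em>file</em> utility gives a similar hint: <em>file windows.file</em> mentions CRLF line terminators when it finds them.</p>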
<h4><a name="_macosx"></a>MacOSX</h4>
<p>There may be others but it looks like the winner here is the oddly named, but freely available <a href="http://cyberduck.ch/">Cyberduck</a>, which provides graphical file browsing via FTP, SCP/SFTP, WebDAV, and even Amazon S3(!).</p>
<h4><a name="_linux"></a>Linux</h4>
<p>The full range of high-speed commandline utilities for moving data over the network is described in the above-referenced <a href="http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html">HOWTO_move_data</a>.  For ease of use, however, it may well be easiest to use the built-in capabilities of KDE&#8217;s Swiss Army knife browser <a href="http://www.konqueror.org">Konqueror</a> or the twin-panel file manager <a href="http://www.krusader.org/">Krusader</a>, both of which support the secure file browser <a href="http://www.linux.com/feature/124686">kio-plugin</a> called <a href="http://isdepartment.wordpress.com/2007/04/04/introduction-to-the-kdes-kio-slaves-using-fish">fish</a>.  If you use a <strong>fish URL</strong>, you can connect to the server via shared keys or via password:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">fish://hmangala@bduc.nacs.uci.edu</pre>
</td>
</tr>
</table>
<h3><a name="archivemount"></a>archivemount</h3>
<p>Once you&#8217;ve generated some data on BDUC, you may want to keep it handy for a short time while you&#8217;re further processing it.  In order to keep it both compact and accessible, BDUC supports the <em>archivemount</em> utility on both the <em>login</em> and <em>claw1</em> nodes.  This allows you to mount a compressed archive (tar.gz, tar.bz2, and zip archives) on a mountpoint as a <a href="http://en.wikipedia.org/wiki/Filesystem_in_Userspace">fuse filesystem</a>.  You can <em>cd</em> into the archive, modify files in place, copy files out of the archive, or copy files into the archive.  When you unmount the archive, the changes are saved into the archive. Here&#8217;s an <a href="http://www.linux-mag.com/id/7825">extended article on it from Linux Mag</a>.</p>
<p>Here&#8217;s an example of how to use <em>archivemount</em> with an 84MB zip archive (<em>jksrc.zip</em>) that you want to interact with.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># how big is this thang?
$ ls -lh
total 84M
-rw-r--r-- 1 hmangala hmangala 84M Jun 15 14:55 jksrc.zip

# OK - 84MB, which is fine.  Now let's make a mount point for it.

$ mkdir jk

$ ls
jk/  jksrc.zip

# so now we have a zipfile and a mountpoint.  That's all we need to archivemount
# let's time it just to see how long it takes to unpack and mount this archive:

$ time archivemount jksrc.zip jk

real    0m0.810s  &lt;-  less than a second wall clock time
user    0m0.682s
sys     0m0.112s

$ cd jk      # cd into the top of the file tree.

# let's see what the top of this file tree looks like.  All file utils can work on this data structure
$ tree |head -11
.
`-- kent
    |-- build
    |   |-- build.crontab
    |   |-- dosEolnCheck
    |   |-- kentBuild
    |   |-- kentGetNBuild
    |   `-- makeErrFilter
    |-- java
    |   |-- build
    |   |-- build.xml
&lt;etc&gt;

# and the bottom of the file tree.
$ tree |tail
            |   |-- wabaCrude.h
            |   `-- wabaCrude.sql
            |-- xaShow
            |   |-- makefile
            |   `-- xaShow.c
            `-- xenWorm
                |-- makefile
                `-- xenWorm.c

2286 directories, 12793 files &lt;- lots of files that don't take up anymore 'real' space on the disk.

# how does it show up with 'df'?  See the last line..

$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md2             373484336  11607976 342598364   4% /
/dev/md1               1019144     47180    919356   5% /boot
tmpfs                  8254876         0   8254876   0% /dev/shm
/dev/sdc             12695180544 6467766252 6227414292  51% /data
bduc-sched.nacs.uci.edu:/share/sge62
                      66946520   8335072  55155872  14% /sge62
fuse                 1048576000         0 1048576000   0% /home/hmangala/build/fs/jk

# finally, !!IMPORTANTLY!! un-mount it.

$ cd ..   # cd out of the tree

$ fusermount -u jk    # unmount it with 'fusermount -u'</pre>
</td>
</tr>
</table>
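<p>The example above starts from an archive that already exists.  If you are starting from a directory of your own results, the sketch below shows one way to pack it up first and then mount it; the names <em>results</em> and <em>results_mnt</em> are hypothetical.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># pack a results directory into a compressed tarball
tar -czf results.tar.gz results/

# list the first few entries without mounting anything
tar -tzf results.tar.gz | head

# mount it, look around, then unmount it again
mkdir results_mnt
archivemount results.tar.gz results_mnt
ls results_mnt
fusermount -u results_mnt</pre>
</td>
</tr>
</table>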
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>Don&#8217;t make huge archives if you&#8217;re going to use archivemount</b></p>
<p><em>archivemount</em> has to &#8220;unpack&#8221; the archive before it mounts it, so trying to <em>archivemount</em> an enormous archive will be slow and frustrating.  If you&#8217;re planning on using this approach, please restrict the size of your archives to  ~100MB.</p>
<p>If you need to process huge files, please consider using <a href="http://en.wikipedia.org/wiki/NetCDF">netCDF</a> or <a href="http://en.wikipedia.org/wiki/HDF5">HDF5</a> formatted files and <a href="http://nco.sf.net">nco</a> or <a href="http://www.pytables.org/moin">pytables</a> to process them.  <em>NetCDF</em> and <em>HDF5</em> are highly structured, binary formats that are both extremely compact and extremely fast to parse/process.  BDUC has a number of utilities for processing both types of files including <a href="http://www.r-project.org/">R</a>, <a href="http://nco.sf.net">nco</a>, and <a href="https://wci.llnl.gov/codes/visit/">VISIT</a>.</p>
</td>
</tr>
</table>
<h3><a name="sshfs"></a>sshfs</h3>
<p><a href="http://en.wikipedia.org/wiki/SSHFS">sshfs</a> is a utility that allows you to mount remote directories in your BDUC home dir.  Since it operates in <em>user-mode</em>, you don&#8217;t have to be <em>root</em> or use <em>sudo</em> to use it. It&#8217;s very easy to use and you don&#8217;t have to alert us to use it.</p>
<p>BDUC has to be able to ssh to the machine with which you want to exchange files, typically the desktop or laptop you&#8217;re connecting to BDUC from, so that machine must be running an ssh server (ergo WinPCs cannot do this without much more effort).  For MacOSX and Linux, assume in the example below that I&#8217;m connecting from a laptop named <em>ringo</em> to the BDUC <em>claw1</em> node.  I have a valid BDUC login (<em>hmangala</em>) and my login on ringo is <em>frodo</em>.</p>
<p><em>sshfs</em> works on both the <em>login</em> and <em>claw1</em> nodes.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">frodo@ringo:~ $ ssh bduc-claw1  # from ringo, ssh to BDUC with passwordless ssh

 # &lt;BDUC login stuff deleted&gt;

# make a dir named 'ringo' for the ringo filesystem mountpoint
hmangala@bduc-claw1:~  $ mkdir ringo

# sshfs-attach the remote filesystem to BDUC on ~/ringo
# NOTE: you usually have to provide the FULL PATH to the remote dir, not '~'
# using '~' on the local side (the last arg) is OK.
# ie: this is wrong:
# hmangala@bduc-claw1:~  $ sshfs frodo@ringo.dept.uci.edu:~ ringo
#                                                         ^
# the following is right:
hmangala@bduc-claw1:~  $ sshfs frodo@ringo.dept.uci.edu:/home/frodo ~/ringo

hmangala@bduc-claw1:~  $ ls -l |head
total 4790888
drwxr-xr-x   2 hmangala hmangala          6 Dec 10 14:17 ringo/  # the new mountpoint for ringo
-rw-r--r--   1 hmangala hmangala       3388 Sep 22 16:25 9.2.zip
-rw-r--r--   1 hmangala hmangala       4636 Dec  8 10:18 acct
-rw-r--r--   1 hmangala hmangala        501 Dec  8 10:20 acct.cpu.user
-rwxr-xr-x   1 hmangala hmangala        892 Nov 11 08:55 alias*
-rw-r--r--   1 hmangala hmangala        691 Sep 30 13:21 all3.needs

 &lt;etc&gt;         ^^^^^^^^^^^^^^^^^ note the ownership

# now I cd into the 'ringo' dir
hmangala@bduc-claw1:~  $ cd ringo

hmangala@bduc-claw1:~/ringo  $ ls -lt |head
total 4820212
drwxr-xr-x 1 frodo frodo       20480 2009-12-10 14:43 nacs/
drwxr-xr-x 1 frodo frodo        4096 2009-12-10 14:41 Mail/
-rw------- 1 frodo frodo          61 2009-12-10 12:54 ~Untitled
-rw-r--r-- 1 frodo frodo          42 2009-12-10 12:44 testfromclaw
-rw-r--r-- 1 frodo frodo      627033 2009-12-10 11:22 sun_virtualbox_3.1.pdf

#&lt;etc&gt;       ^^^^^^^^^^^ note the ownership.  Even tho I'm on bduc-claw1, the original ownership is intact

# writing from BDUC to ringo filesystem
hmangala@bduc-claw1:~/ringo  $ echo "testing testing" &gt; test_from_bduc

hmangala@bduc-claw1:~/ringo  $ cat test_from_bduc
testing testing

hmangala@bduc-claw1:~/ringo  $ ls -lt |head
total 4820216
drwxr-xr-x 1 frodo frodo       20480 2009-12-10 14:47 nacs/
-rw-r--r-- 1 frodo frodo          16 2009-12-10 14:46 test_from_bduc
drwxr-xr-x 1 frodo frodo        4096 2009-12-10 14:41 Mail/
#            ^^^^^^^^^^^  even tho I wrote it as 'hmangala' on BDUC, it's owned by 'frodo'

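# and finally, cd out of the mountpoint and unmount the sshfs mounted filesystem.
# (fusermount will refuse while your shell is still inside ~/ringo)
hmangala@bduc-claw1:~/ringo  $ cd
hmangala@bduc-claw1:~  $ fusermount -u ringo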

# get more info on sshfs with 'man sshfs'</pre>
</td>
</tr>
</table>
<hr />
<h2><a name="_you_are_responsible_for_your_data"></a>YOU are responsible for your data</h2>
<p>We <strong>do not</strong> have the resources to provide backups of your data.  If you store valuable data on BDUC, it is <em>ENTIRELY</em> your responsibility to protect it by backing it up elsewhere. You can do so via the mechanisms discussed above, especially with (if using a Mac or Linux) rsync, which will copy only those bytes which have changed, making it extremely efficient.  Using rsync (with examples) <a href="http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html#rsync">is described here</a>.</p>
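<p>As a concrete starting point, the sketch below pulls a directory from BDUC down to your own Mac or Linux machine.  The directory names are hypothetical and <em>UCINetID</em> stands for your own login.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># run this on your own machine, not on BDUC
# -a preserves permissions and timestamps, -v lists files as they are copied
# the trailing '/' on the source means 'the contents of project_dir'
rsync -av UCINetID@bduc-login.nacs.uci.edu:project_dir/ ~/bduc_backup/project_dir/</pre>
</td>
</tr>
</table>
<p>Adding <em>-n</em> makes it a dry run that only reports what would be copied.  Re-running the same command later transfers only what has changed since the last run.</p>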
<hr />
<h2><a name="_how_do_i_do_stuff"></a>How do I do stuff?</h2>
<p>On the login node, you shouldn&#8217;t do anything too strenuous (computationally).  If you run something that takes more than a minute or so to complete, you should run it on an interactive node or submit it to one of the batch queues.</p>
<p><strong>qrsh</strong> given alone will start an <em>ssh -Y</em> session with one of the nodes in the interactive Q.</p>
<h3><a name="_can_i_compile_code"></a>Can I compile code?</h3>
<p>We have the full GNU toolchain available on both the CentOS interactive nodes and on all the Ubuntu/claw nodes, so normal compilation tools such as autoconf, automake,  libtool, make, ant, gcc, g++, gfortran, gdb, ddd, java, python, R, perl, etc are available to you.  We do not yet have any proprietary compilers or debuggers available (ie. the Intel or PGC compilers or the TotalView Debugger).  Please let us know if there are other tools or libraries you need that aren&#8217;t available.</p>
<h4><a name="_compiling_your_own_code"></a>Compiling your own code</h4>
<p>You can always compile your own (or downloaded) code.  Compile it in its own subdir and when you&#8217;ve built the executables, install it rooted from your own home directory.</p>
<p>If the code is well-designed, it should have a <em>configure</em> shell script in the top-level dir.  The <em>./configure --help</em> command should then give you a list of all the parameters it accepts.  Typically, all such scripts will accept the <em>--prefix</em> flag.  You can use this to tell it to install everything in your $HOME dir.</p>
<p>ie</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">./configure --prefix=/home/you ...other options..</pre>
</td>
</tr>
</table>
<p>This command, when it completes successfully, will generate a <em>Makefile</em>. At this point, you can type <em>make</em> (or <em>make -j2</em> to compile on 2 CPUs) and the code will be compiled into whatever kind of executable is called for. Once the code has been compiled successfully (there may be a <em>make test</em> or <em>make check</em> option to run tests to check for this), you can install it in your $HOME directory tree with <em>make install</em>, which will typically populate directories such as:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">/home/you/bin
/home/you/man
/home/you/lib
/home/you/include
/home/you/share
&lt;etc&gt;</pre>
</td>
</tr>
</table>
<p>Then you can run it out of your <em>~/bin</em> dir without interfering with other code.  In order for you to be able to run it transparently, you will have to prepend your <em>~/bin</em> to the <em>PATH</em> environment variable, typically by editing it into the appropriate line in your <em>~/.bashrc</em>.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">export PATH=~/bin:${PATH}</pre>
</td>
</tr>
</table>
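<p>Put together, a typical build looks something like the sketch below.  The package name <em>foo-1.0</em> and the executable <em>foo</em> are hypothetical, and not every package provides a <em>check</em> target.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># unpack the source into its own subdir
tar -xzf foo-1.0.tar.gz
cd foo-1.0

# configure it to install under your home dir, then build, test and install
./configure --prefix=$HOME
make -j2
make check      # only if the package provides this target
make install

# confirm that the shell now finds your copy
which foo</pre>
</td>
</tr>
</table>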
<h3><a name="appsavailable"></a>How do I find out what&#8217;s available?</h3>
<h4><a name="modules"></a>Via the module command</h4>
<p>We use the tcl-based <a href="http://modules.sourceforge.net/">environment module system</a> to wrangle non-standard software versions and subsystems into submission. To find out what modules are available, simply type:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ module avail
------------ /apps/Modules/modulefiles ------------
R/2.10.0               gromacs_s_mpich2/4.0.7 openmpi/1.4.3
R/2.11.0               gromacs_s_ompi/4.0.7   paml/4.4
R/2.12.1               gromacs_s_ompi/4.5.1   petsc/3.1-p8
R/2.8.0                gromacs_s_ompi/4.5.2   pgc/10.6
R/dev                  gromacs_s_ompi/4.5.3   picard/1.45
abyss/1.1.2            gromacs_s_ompi/4.5.4   plink/1.07
abyss/1.2.3            hadoop/0.20.2          python/2.6.1
abyss/1.2.5            haploview/4.1          rapidminer/5.1.001
abyss/1.2.6            hdf5/1.8.4p1           ray/1.3
allpathslg/36681       hdf5/1.8.5.p1          ray/1.4
amber/11               hmmer/3.0              readline/5.2
annovar/2010Jan17      hyphy/2.0              rosetta/3.1
antlr/3.2              igv/1.5.58             samtools/0.1.13
ants/1.9               imagej/1.41            samtools/0.1.7
autodock/4.2.3         interviews/17          scilab/5.1.1
bedtools/2.6.1         java/1.6               scilab/5.3.0
bfast/0.6.3c           loni_pipeline/5.1.4    scilab/5.3.1
blat/3.4               maestro/91207          sge/6.2
boost/1.410            maq/0.7.1              simset/2.9
bowtie/0.10.1          matlab/R2008b          soap/2.20
bowtie/0.12.3          matlab/R2009b          sparsehash/1.6
breakway/0.6           matlab/R2010b          sqlite/3.6.22
bwa/0.5.7              mgltools/1.5.4         ssaha2/2.5.3
cnver/0.7.2            modeller/9v7           stampy/1.0.12
cnver/0.8.1            mosaik/1.0.1388        stampy/1.0.9
cufflinks/0.8.1        mpich/1.2.7            subversion/1.6.9
edena/2.1.1            mpich2/1.1.1p1         sva/1.02
eigensoft/3.0          mpich2/1.2.1p1         tablet/1.11.01.25
enthought_python/6.3-2 msort/20081208         tacg/4.5.1
exonerate/2.2          namd/2.6               taverna/2.2.0
freesurfer/4.5.1Dev    namd/2.7b1             tcl/8.5.5
freesurfer/5.0.0       namd/2.8b1             tcl/8.5.9
fsl/4.1                ncl/5.1.1              tinker/5.1.09
fsl/4.1.6              nco/4.0.4              tk/8.5.5
gamess/2010R1          netcdf/3.6.3           tophat/1.0.13
gapcloser/20100125     netcdf/4.1.1           tophat/1.2.0
gatk/1.0.5336          neuron/7.0             triton/4.0.0
gaussian/3.0           nmica/0.8.0            velvet/1.0.13
gnu_parallel/20101202  nwchem/6.0             velvet/1.0.19
gnuplot/4.2.4          octave/3.0.1           velvet/1.1.02
gnuplot/4.5p1          octave/3.2.0           visit/1.11.2
gpu/1.0                open64/4.2.3           vmd/1.8.7
gromacs_d/4.0.7        openmpi/1.4            zdock/3.0.1
gromacs_s/4.0.7        openmpi/1.4.2

(current as of June 8, 2011)</pre>
</td>
</tr>
</table>
<p>To load a particular module, use the <em>module load &lt;module/version&gt;</em> command:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ module load imagej/1.41  # for example</pre>
</td>
</tr>
</table>
<p>If a module has a dependency, it should set it up for you automatically. Let us know if it doesn&#8217;t.  If you note that a module has an update that we should install, tell us.</p>
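<p>A few other <em>module</em> subcommands are handy once you have loaded something.  The sketch below reuses the <em>imagej/1.41</em> example from above.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ module list                 # show the modules currently loaded
$ module whatis imagej/1.41   # one-line description, if the module provides one
$ module unload imagej/1.41   # remove it from your environment again</pre>
</td>
</tr>
</table>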
<h4><a name="_via_the_shell"></a>Via the shell</h4>
<p>This is a bit tricky.  There are literally thousands of applications that are available and many of them have names that are entirely unrelated to their function.  In order to determine whether a well-known application is already on the system, you can simply try typing its name.  If it&#8217;s NOT installed or not on your PATH, the shell will return <strong>command not found</strong>.</p>
<p>All the interactive nodes have <strong>TAB completion</strong> enabled at least in the bash shell.  This means that if you type a few characters of the name and hit &lt;TAB&gt; twice, the system will try to complete the command for you.  If there are multiple executables that match those characters, the shell will present all the alternatives to you. ie</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ jo&lt;TAB&gt;&lt;TAB&gt;
jobs        jockey-kde  joe         join</pre>
</td>
</tr>
</table>
<p>You can then complete the command or enter enough characters to make the command unique and hit &lt;TAB&gt; again and the command will complete.</p>
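<p>To check for several applications at once without actually running them, you can ask the shell where it would find each one.  The function below is a sketch for bash; the name <em>check_cmds</em> and the application names passed to it are only examples.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># report where each named command lives, or that it is not on your PATH
check_cmds() {
    local app where
    for app in "$@"; do
        if where=$(command -v "$app"); then
            echo "$app: $where"
        else
            echo "$app: not found"
        fi
    done
}

check_cmds R python tacg</pre>
</td>
</tr>
</table>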
<h4><a name="_via_the_installer_database"></a>Via the installer Database</h4>
<p>The 2 installer databases (one for Ubuntu&#8217;s <strong>apt-get</strong> on the claw nodes, one for CentOS&#8217;s <strong>yum</strong> on the rest) will let you search all the applications that HAVE been installed and all those that CAN be installed.</p>
<p>To search for the ones that CAN be installed on the BEAR (claw1-4) nodes, use the command <strong>asrch</strong> (an alias for <strong>apt-cache search</strong>).  This searches thru all the application names and descriptions in a case-insensitive search to find a wide variety of names that match the pattern you give it.  For example:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ asrch biology
avida-base - Auto-adaptive genetic system for Artificial Life research
biomode - [Biology] An Emacs mode to edit genetic data
bioperl - Perl tools for computational molecular biology
   &lt;41 lines deleted&gt;
molphy - [Biology] Program Package for MOLecular PHYlogenetics
phylip - [Biology] A package of programs for inferring phylogenies
phylip-doc - [Biology] A package of programs for inferring phylogenies
treetool - [Biology] An interactive tool for displaying trees
tacg - [Biology] a sophisticated 'grep' for nucleic acid strings</pre>
</td>
</tr>
</table>
<p>To see a more detailed description of the application, use <strong>ashow</strong> (an alias for <strong>apt-cache show</strong>), which will provide a few lines or paragraphs of text about the application:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ ashow phylip
Package: phylip
Priority: optional
Section: multiverse/science
Installed-Size: 5792
Maintainer: Ubuntu MOTU Developers &lt;ubuntu-motu@lists.ubuntu.com&gt;
Original-Maintainer: Debian-Med Packaging Team &lt;debian-med-packaging@lists.alioth.debian.org&gt;
Architecture: amd64
Version: 1:3.67-2
Depends: libc6 (&gt;= 2.4), libx11-6, libxaw7, libxt6
Suggests: phylip-doc
Filename: pool/multiverse/p/phylip/phylip_3.67-2_amd64.deb
Size: 2520650
MD5sum: eacef9de8503a21b90a05bfabea9fbca
SHA1: 61a2ec92c1b0699db07ea08196848e2f41f79a6c
SHA256: 3453f9b3bc9d418bf0c4941eb722e807a96ec32ac3a041df34ee569929bd19dc
Description: [Biology] A package of programs for inferring phylogenies
 The PHYLogeny Inference Package is a package of programs for inferring
 phylogenies (evolutionary trees) from sequences.
 Methods that are available in the package include parsimony, distance
 matrix, and likelihood methods, including bootstrapping and consensus
 trees. Data types that can be handled include molecular sequences, gene
 frequencies, restriction sites, distance matrices, and 0/1 discrete
 characters.
Homepage: http://evolution.genetics.washington.edu/phylip.html
Bugs: mailto:ubuntu-users@lists.ubuntu.com
Origin: Ubuntu</pre>
</td>
</tr>
</table>
<p><strong>HOWEVER</strong>, this only tells you that the application or library is available, not whether it&#8217;s installed.  To find out whether it&#8217;s installed, you use <strong>dpkg</strong>.  <strong>dpkg -S pattern</strong> searches the files of installed packages for that pattern and tells you which package provided each match, e.g.:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ dpkg -S ifconfig
net-tools: /sbin/ifconfig
net-tools: /usr/share/man/man8/ifconfig.8.gz</pre>
</td>
</tr>
</table>
<p>The <em>-l</em> flag has a different meaning (it lists the packages that dpkg knows about, along with their install status), but can also be useful:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">dpkg -l |grep -i python |less
 &lt;lots of output - try it&gt;</pre>
</td>
</tr>
</table>
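<p>As a rough sketch of how the <em>dpkg -l</em> listing can be narrowed to packages that are actually installed: in that listing, fully installed packages carry the status <em>ii</em> in the first column.  The function name <em>installed_only</em> is made up for this illustration and is not part of dpkg.</p>

```shell
#!/bin/bash
# installed_only: hypothetical helper (not part of dpkg).
# Reads 'dpkg -l' output on stdin and prints "name version" for every
# package whose status column is 'ii' (installed).
installed_only() {
    awk '$1 == "ii" { print $2, $3 }'
}

# Typical use on a claw node (commented out so the sketch has no side effects):
# dpkg -l | installed_only | grep -i python | less
```

<p>Filtering on the status column avoids counting packages that were removed but left their config files behind (status <em>rc</em>).</p>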
<p>There is a GUI application called <strong>synaptic</strong> that provides a more pointyclicky interface but <strong>asrch</strong> and <strong>dpkg</strong> are much faster via the commandline.</p>
<p>To search for all possible applications and libraries on the BBUC nodes using <strong>yum</strong>, it&#8217;s similar:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ yum search lapack
Loading "downloadonly" plugin
Loading "fastestmirror" plugin
Loading mirror speeds from cached hostfile
 * epel: mirror.hmc.edu
 * dag: apt.sw.be
 * atrpms: dl.atrpms.net
 * rpmforge: ftp-stud.fht-esslingen.de
 * base: centos.cogentcloud.com
 * updates: mirrors.usc.edu
 * lscsoft: www.lsc-group.phys.uwm.edu
 * addons: mirror.stanford.edu
 * extras: centos.promopeddler.com
lapack-devel.i386 : LAPACK development libraries
blas-devel.i386 : LAPACK development libraries
lapack.i386 : The LAPACK libraries for numerical linear algebra
blas-devel.i386 : LAPACK development libraries
   &lt;10 lines deleted&gt;
blas.i386 : The BLAS (Basic Linear Algebra Subprograms) library.
lapack-devel.i386 : LAPACK development libraries
blas.i386 : The BLAS (Basic Linear Algebra Subprograms) library.
R-RScaLAPACK.i386 : An interface to perform parallel computation on linear algebra problems using ScaLAPACK</pre>
</td>
</tr>
</table>
<p>To find out if a package has been installed:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ yum list lapack-devel.i386
Loading "downloadonly" plugin
Loading "fastestmirror" plugin
Loading mirror speeds from cached hostfile
 * epel: mirror.hmc.edu
 * dag: apt.sw.be
 * atrpms: dl.atrpms.net
 * rpmforge: ftp-stud.fht-esslingen.de
 * base: mirrors.xmission.com
 * updates: mirrors.usc.edu
 * lscsoft: www.lsc-group.phys.uwm.edu
 * addons: centos.cogentcloud.com
 * extras: mirror.hmc.edu
Installed Packages
lapack-devel.i386                        3.1.1-1.el5.rf         installed</pre>
</td>
</tr>
</table>
<h4><a name="_via_the_internet"></a>Via the Internet</h4>
<p>Obviously, a much wider ocean to search.  My first approach is to use a Google search constructed of the platform, application name, and/or function of the software.  Something like</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">linux image photography hdr "high dynamic range"  # "" enforce exact phrase</pre>
</td>
</tr>
</table>
<p>which yields <a href="http://tinyurl.com/nf5qrn">this page of results.</a></p>
<p>Also, don&#8217;t be afraid to try <a href="http://www.google.com/advanced_search?hl=en">Google&#8217;s Advanced Search</a> or even <a href="http://www.google.com/linux">Google&#8217;s Linux Search</a>.</p>
<p>After evaluating the results, you&#8217;ll come to a package that seems to be what you&#8217;re after, pfstools, for example.  If you didn&#8217;t find this in the previous searches of the application databases, you can look again, searching explicitly:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ ashow pfstools
 ...
Description: command line HDR manipulation programs
 pfstools is a set of command line (and two GUI) programs for reading,
 writing, manipulating and viewing high-dynamic range (HDR) images and video
 frames. All programs in this package exchange data using a simple generic
 file format (pfs) for HDR data. It is an attempt to integrate existing file
 formats by providing a simple data format that can be used to exchange data
 between applications.
 ...</pre>
</td>
</tr>
</table>
<p>and then you can ask an admin to install it for you.  Typically the apps found in the application repositories lag the latest releases by a few point versions, so if you really need the latest version, you&#8217;ll have to download the source code or binary package and install it from that package.  You can compile your own version as a private package, but to install it as a system binary, you&#8217;ll have to ask one of the admins.</p>
<h3><a name="_interactive_use"></a>Interactive Use</h3>
<p>Logging on to an interactive node may be all that you need.  If you want to slice &amp; dice data interactively, either with a graphical app like <a href="http://www.mathworks.com/products/matlab/description1.html">MATLAB</a>, <a href="https://wci.llnl.gov/codes/visit/">VISIT</a>, <a href="http://jmp.com/">JMP</a>, or <a href="http://www.clustal.org/">clustalx</a>, or a commandline app like <a href="http://nco.sf.net">nco</a> or <a href="http://forums.nacs.uci.edu/BioBB/viewtopic.php?f=10&amp;t=7">scut</a> or even hybrids like <a href="http://gnuplot.info/">gnuplot</a> or <a href="http://www.r-project.org/">R</a>, you can run them from any of the interactive nodes, read, analyze and save data to your <em>/home</em> directory.  As long as you satisfy the <a href="#graphics">graphics</a> requirements, you can view the output of the X11 graphics programs as well.</p>
<h3><a name="_byobu_and_screen_keeping_a_session_alive_between_logins"></a>byobu and screen: keeping a session alive between logins</h3>
<p>In most cases, when you log out of an interactive session, the processes associated with that login will also be killed off, even if you&#8217;ve put them in the background (by appending <em>&amp;</em> to the starting command).  If you regularly need a process to continue after you&#8217;ve logged out, you should submit it to the SGE scheduler with <em>qsub</em> (<a href="#SGE_batch_jobs">see immediately below</a>).</p>
<p>However, sometimes it is convenient to continue a long-running process when you have to log out (as when you have to shut down your network connection to take your laptop home).   In this case, you can use the  underappreciated <em>screen</em> program, which establishes a long-running proxy connection on the remote machine that you can detach from and then re-attach to without losing the connection.  As far as the remote machine is concerned, you&#8217;ve never logged off, so your running processes aren&#8217;t killed off.  When you re-establish the connection by logging in again, you can re-attach to the screen proxy and take up as if you&#8217;ve never been away.</p>
<p>You can also use <em>screen</em> as a terminal multiplexer, allowing multiple terminal sessions to be used from one login, especially useful if you&#8217;re using Windows with PuTTY that doesn&#8217;t have a multiple terminal function built into it.</p>
<p>For these reasons, <em>screen</em> by itself is a very powerful and useful utility, but it is admittedly hard to use.  To the rescue comes a <em>screen</em> wrapper called <em>byobu</em> which provides a much easier-to-use interface to the <em>screen</em> utility.  <em>byobu</em> has been installed on all the interactive nodes on BDUC and can be started by typing:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ byobu</pre>
</td>
</tr>
</table>
<p>There will be a momentary screen flash as it refreshes and re-displays the login, and then the screen will look similar, except for 2 lines along the bottom that show the screen status.  In the images below, the one at left is <em>without byobu</em>; at right is <em>with byobu</em>.  The <em>byobu</em> screen shows 3 active sessions: <em>login</em>, <em>claw_1</em>, and <em>bowtie</em>.  The graphical tabs at the bottom are part of the KDE application <a href="http://konsole.kde.org/">konsole</a> which also supports multiplexed sessions, allowing you to multi-multiplex sessions (polyplex?).</p>
<p><img src="http://hjmangalam.files.wordpress.com/2011/06/without_byobu_s.jpg?w=450" style="border-width:0;" alt="without byobu">  <img src="http://hjmangalam.files.wordpress.com/2011/06/with_byobu_s.jpg?w=450" style="border-width:0;" alt="with byobu"></p>
<p>The help screen, shown below, can always be gotten to by hitting the <em>&lt;F9&gt;</em> key, followed by the <em>&lt;Enter&gt;</em> key.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">Byobu 2.57 is an enhancement to GNU Screen, a command line
tool providing live system status, dynamic window management,
and some convenient keybindings:

F2    Create a new window    |  F6    Detach from the session
F3    Go to the prev window  |  F7    Enter scrollback mode
F4    Go to the next window  |  F8    Re-title a window
F5    Reload profile         |  F9    Configuration
                             |  F12   Lock this terminal
'screen -r'  - reattach      |  &lt;ctrl-A&gt; Escape sequence
'man screen' - screen's help | 'man byobu'  - byobu's help</pre>
</td>
</tr>
</table>
<p>Most usefully, you can create new sessions with the <em>F2</em> key, switch between them with <em>F3/F4</em> and detach from the screen session with <em>F6</em>.</p>
<p>Note that you must have started a <em>screen</em> session before you can detach, so to make sure you&#8217;re always in a screen session, you can have it start automatically on login by changing the state of the <strong>Byobu currently launches at login</strong> flag (at bottom of screen after the 1st <em>F9</em>).</p>
<p>When you log back in after having detached, type <em>byobu</em> again to re-attach to all your running processes.  If you set <em>byobu</em> to start automatically on login, there will be no need of this, of course, as it will have started.</p>
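<p>If you end up on a machine without <em>byobu</em>, the same detach/re-attach cycle can be done with plain <em>screen</em>.  The following is a minimal sketch, not a BDUC-supplied tool: the helper names <em>has_session</em> and <em>attach_or_start</em> and the session name <em>bowtie</em> are assumptions for illustration; only <em>screen -ls</em>, <em>screen -r</em> and <em>screen -S</em> are real <em>screen</em> options.</p>

```shell
#!/bin/bash
# Hypothetical helpers around plain GNU screen (no byobu required).
# The session names are whatever you choose; 'bowtie' below is just an example.

# has_session NAME: succeeds if 'screen -ls' lists a session called NAME.
has_session() {
    screen -ls | grep -Eq "^[[:space:]]+[0-9]+\.$1[[:space:]]"
}

# attach_or_start NAME: re-attach to NAME if it exists, otherwise create it.
attach_or_start() {
    if has_session "$1"; then
        screen -r "$1"      # re-attach to the detached session
    else
        screen -S "$1"      # start a new session with that name
    fi
}

# Example (commented out so that sourcing this file starts nothing):
# attach_or_start bowtie    # detach again with <ctrl-A> d
```

<p>Naming the session makes it easier to find again when you have several detached sessions on the same node.</p>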
<hr />
<h2><a name="SGE_batch_jobs"></a>SGE Batch Submission &amp; Queues</h2>
<p>If you have jobs that are very long or require multiple nodes to run, you&#8217;ll have to <em>submit</em> jobs to an SGE Queue (aka Q).</p>
<p><strong>qsub job_name.sh</strong> will submit the job described by <em>job_name.sh</em> to SGE, which will look for an appropriate Q and then start the job running via that Q.  For example, if you need a long running Q, you can request it explicitly: <em>qsub -q long job_name.sh</em> , which will try to run it on the least loaded machine.</p>
<p>Once you log into the login node (via <em>ssh -Y &lt;your_UCINetID&gt;@bduc-login.nacs.uci.edu</em>), you can get an idea of the hosts that are currently up by issuing the <strong>qhost</strong> command. <strong>qstat</strong> alone will tell you the status of <strong>your</strong> jobs, while</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">qstat -u '*'</pre>
</td>
</tr>
</table>
<p>will tell you the status of all jobs currently queued or running.  A very useful PDF cheatsheet for the SGE <em>q</em> commands <a href="http://gridengine.info/files/SGE_Cheat_Sheet.pdf">is here</a>.</p>
<p>To get an overall idea of the status of the entire cluster, type <em>bduc_status</em>, which will dump a listing of:</p>
<ul>
<li> who&#8217;s logged into the node </li>
<li> the top 100 jobs currently running </li>
<li> nodes/Qs in error state </li>
<li> overall cluster node usage by Q. </li>
</ul>
<h3><a name="_qsub_scripts"></a>qsub scripts</h3>
<p>The shell script that you submit (<em>job_name.sh</em> above) should be written in <em>bash</em> and should completely describe the job, including where the inputs and outputs are to be written (if not specified, the default is your home directory).  The following is a simple shell script that defines <em>bash</em> as the job environment, calls <em>date</em>, waits 20s and then calls it again.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">#!/bin/bash
# (c) 2008 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.
# This is a simple example of a SGE batch script

# request Bourne shell as shell for job
#$ -S /bin/bash

# print date and time
date
# Sleep for 20 seconds
sleep 20
# print date and time again
date</pre>
</td>
</tr>
</table>
<p>Note that your script has to include (usually at the end) at least one line that executes something &#8211; generally a compiled program but it could also be a Perl or Python script (which could also invoke a number of other programs). Otherwise your SGE job won&#8217;t do anything.</p>
<h4><a name="keepdatalocal"></a>Using qsub scripts to keep data local</h4>
<p>BDUC depends on a network-shared <em>/home</em> filesystem.  The actual disks are in the bduc-login node so users are local to the data when they log in.  However, when you submit an SGE job, unless otherwise specified, the nodes have to read the data over the network and write it back across the network.  This is fine when the total data involved is a few MB, such as is often the case with molecular dynamics runs &#8211; small data in, lots of computation, small data out.  However, if your jobs involve 100s or 1000s of MB, the network traffic can grind the entire cluster to a halt.</p>
<p>To prevent this network armageddon, there is a <em>/scratch</em> directory on each node which is writable by all users, but is <em>sticky</em> &#8211; the files written can only be deleted by the user who wrote them.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ ls -ld /scratch
drwxrwxrwt 6 root root 4096 Oct 29 18:20 /scratch/
         ^
         + the 't' indicates 'stickiness'</pre>
</td>
</tr>
</table>
<p>If there is a chance that your job will consume or emit lots of data, please use the local /scratch dir to stage your input data, and especially write your output.</p>
<p>This is dirt simple to do.  Since your qsub script executes on each node, it should copy the data from your <em>$HOME dir</em> to <em>/scratch/$USER/input</em> to stage the data, then specify <em>/scratch/$USER/input</em> as input, with your application writing to <em>/scratch/$USER/output_node#</em>. When the application has finished, copy the output files back to your <em>$HOME dir</em> and finally clean up <em>/scratch/$USER/whatever</em> afterwards.</p>
<p>Here&#8217;s <a href="https://wiki.duke.edu/display/SCSC/Scratch+Disk+Space">another page of information</a> on using scratch space.</p>
<p>An <a href="http://moo.nac.uci.edu/~hjm/bduc/scratchjob.sh">example script</a> that does this data copying.</p>
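<p>The stage-in, compute, copy-back and clean-up steps described above can be sketched roughly as follows.  This is an illustration under stated assumptions, not the linked BDUC script: the input file name, the results directory, the <em>SCRATCH_ROOT</em> variable and the use of <em>sort</em> as a stand-in for your real application are all made up.  <em>JOB_ID</em> is the job number that SGE sets for a running job.</p>

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -N stage_demo
#$ -cwd

# A sketch only. Assumptions (not from the BDUC docs): the input file name,
# the results directory, and 'sort' standing in for your real application.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch}"

# run_staged_job INPUT_FILE RESULTS_DIR
run_staged_job() {
    local src="$1" results="$2"
    local me="${USER:-$(id -un)}"
    # JOB_ID is set by SGE; fall back to the shell's PID outside SGE
    local work="$SCRATCH_ROOT/$me/job_${JOB_ID:-$$}"
    local rc=0

    mkdir -p "$work/input" "$work/output" || return 1
    # 1. stage the input from the shared filesystem to node-local disk
    cp "$src" "$work/input/" || rc=1
    # 2. run the application against the local copy, writing locally
    if [ "$rc" -eq 0 ]; then
        sort "$work/input/$(basename "$src")" > "$work/output/result.txt" || rc=1
    fi
    # 3. copy the results back to the shared filesystem
    if [ "$rc" -eq 0 ]; then
        mkdir -p "$results" && cp "$work/output/result.txt" "$results/" || rc=1
    fi
    # 4. always clean up the scratch area, even if a step failed
    rm -rf "$work"
    return "$rc"
}

# Only run when started by SGE (which sets JOB_ID)
if [ -n "${JOB_ID:-}" ]; then
    run_staged_job "$HOME/my_input.dat" "$HOME/results"
fi
```

<p>Putting each job in its own <em>job_$JOB_ID</em> subdirectory keeps several of your jobs on the same node from overwriting each other&#8217;s files, and makes the final clean-up a single <em>rm -rf</em>.</p>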
<p> <a name="stagingdata"></a><br />
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>Staging data &#8211; some important caveats</b></p>
<p>Staging data to the remote node makes sense when you have large input data and it has to be repeatedly parsed.  It makes less sense when a lot of data has to be read <strong>once</strong> and is then ignored. (If the data is only read once, why copy it?  Just read it in the script.)  If you stage it to <em>/scratch</em>, it is still traversing the network once so there is little advantage. (If you have significant data to be re-read on an ongoing basis, contact me and, depending on circumstances, we may be able to let you leave it on the <em>/scratch</em> system of a set of nodes for an extended period of time.  Otherwise, we expect that all <em>/scratch</em> data will be cleaned up post-job.)</p>
<p>If it does make sense to stage your data, please try to follow the guidelines below.  If the cluster locks up, offending jobs will be deleted without warning so ask me if you have questions.</p>
<p><strong>Limit your staging bandwidth</strong><br /> If your job(s) are going to require a mass copy (for example, if you submit 20 jobs that each have to copy 1GB), then throttle your job appropriately by using a bandwidth-limiting protocol like <em>scp -C -l 2000</em> instead of <em>cp</em>.  This <em>scp</em> command compresses the data and also limits the bandwidth to ~250KB/s in the above case (<em>2000</em> refers to KiloBITS, not KiloBYTES).  <em>scp</em> will work without requiring passwords, just like <em>ssh</em> within the cluster.  The syntax is slightly different tho.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># use scp to copy from bduc-login to a local node dir as would be required in a qsub script
scp -C -l 2000 bduc-login:~/my_file /scratch/hmangala</pre>
</td>
</tr>
</table>
<p>This prevents a few bandwidth-unlimited jobs from causing the available cluster bandwidth to drop to zero, locking up all users. If you have <em>a single job</em> that will copy a single 100MB file, then don&#8217;t worry about it; just copy it directly.</p>
<p>Assume the aggregate bandwidth of the cluster is about <em>30 MB/s</em>.  No set of jobs should exceed half of that, so if you&#8217;re submitting 50 jobs, the total bandwidth should be set to no more than 15 MB/s, or 0.3 MB/s per job, or in scp terms <em>-l 2400</em>.</p>
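<p>The per-job limit is simple arithmetic: take half of the aggregate bandwidth, divide by the number of jobs, and convert MB/s to the Kbit/s that <em>scp -l</em> expects (1 MB/s = 8000 Kbit/s).  A small sketch, where the function name <em>scp_limit_kbit</em> is hypothetical:</p>

```shell
#!/bin/bash
# scp_limit_kbit TOTAL_MB_PER_S NJOBS
# Hypothetical helper: prints a per-job value for 'scp -l' (Kbit/s) so that
# NJOBS simultaneous copies together use at most half of TOTAL_MB_PER_S.
# Uses 1 MB/s = 8000 Kbit/s and integer arithmetic (rounds down).
scp_limit_kbit() {
    local total_mb="$1" njobs="$2"
    echo $(( total_mb * 8000 / 2 / njobs ))
}

# Example use inside a qsub script (commented out; the file name is made up):
# scp -C -l "$(scp_limit_kbit 30 50)" bduc-login:~/my_file /scratch/$USER/
```

<p>Computing the value in the script means you only have to change the job count in one place when you resubmit a larger or smaller batch.</p>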
<p><strong>Check the network before you submit a job</strong><br /> While there&#8217;s no way to predict the cluster environment after you submit a job, there&#8217;s no reason to make an existing BAD situation worse.  If the cluster is exhibiting network congestion, don&#8217;t add to it by submitting 100 staging jobs (and if it does appear to be lagging, <a href="mailto:harry.mangalam@uci.edu">please let me know</a>).</p>
<p><strong>How to check for cluster congestion</strong><br /> On the login node, you can use a number of tools to see what the status is.</p>
<p><em>bduc_status</em> will dump a long description detailing who&#8217;s logged in, the SGE Q status (including the 1st 100 jobs), any Qs in error state, the Queue Summary, the hosts that are down, and the overall cluster load.</p>
<p><em>top</em> gives you an updating summary of the top CPU-using processes on the node.  If the top processes include <em>nfsd</em>, and the load average is above ~4 with no user processes exceeding 100%, then the cluster can be considered congested. For those that don&#8217;t have the fancy prompt that shows the load average, you can add it by inserting the following line into your <em>~/.profile</em> or <em>~/.bashrc</em>.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">PS1="\n\[\033[01;34m\]\d \t \[\033[00;33m\][\$(cat /proc/loadavg | cut -f1,2,3 -d' ')] \
\[\033[01;32m\]\u@\[\033[01;31m\]\h:\[\033[01;33m\]\w\n\! \$ \[\033[00m\]"</pre>
</td>
</tr>
</table>
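<p>If you would rather test the load average in a script than read it off the prompt, something like the following could work.  It is a sketch that covers only the load-average half of the rule of thumb above (it does not look for <em>nfsd</em>), and the function name <em>load_is_high</em> is an assumption.</p>

```shell
#!/bin/bash
# load_is_high [THRESHOLD] [LOADAVG_FILE]
# Hypothetical helper: succeeds (exit 0) if the 1-minute load average is above
# THRESHOLD (default 4). The file argument exists only so the function can be
# tried against a saved copy of /proc/loadavg.
load_is_high() {
    local threshold="${1:-4}" file="${2:-/proc/loadavg}"
    awk -v t="$threshold" 'NR == 1 { high = ($1 > t) } END { exit !high }' "$file"
}

# Example: hold off on submitting staging jobs while the login node is loaded
# if load_is_high; then echo "cluster looks busy; wait before submitting"; fi
```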
<p><em>ifstat</em> will produce a continuous, instantaneous chart of network interface output.</p>
<p><em>dstat</em> will produce a similar readout of many system parameters including CPU, memory usage, network, and storage activity.</p>
<p><em>htop</em> produces a colored, top-like output that is multiply sortable to debug what&#8217;s happening with the system.</p>
<p><em>atop</em> produces yet another top-like output but highlights saturated systems.  It provides more info to the root user, but is also useful for regular users.</p>
<p><em>iftop</em> produces a very useful (but only available to root) text-based, updating diagram of network bandwidth by endpoints.  Mentioned as it might be useful to users on their own machines.</p>
</td>
</tr>
</table>
<h4><a name="_debugging_why_your_job_isn_8217_t_running"></a>Debugging why your job isn&#8217;t running</h4>
<p>You can (at least partially) diagnose your own SGE problems. It may well be that the Qs are set up sub-optimally (and if so, we&#8217;ll try to work with you to optimize them), but you can see very quickly if that&#8217;s the case or if it&#8217;s due to a more mundane problem:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ qstat                  # will give you a list of your SGE jobs
$ qstat -j &lt;job number&gt;  # will give you an exhaustive list of reasons
                         #  that your job is not executing</pre>
</td>
</tr>
</table>
<h4><a name="_more_example_qsub_scripts"></a>More example qsub scripts</h4>
<ul>
<li> <a href="http://moo.nac.uci.edu/~hjm/bduc/sleeper1.sh">sleeper1.sh</a> is a slightly more elaborate one. </li>
<li> <a href="http://moo.nac.uci.edu/~hjm/bduc/fsl_sub">fsl_sub</a> is a longer, much more elaborate one that uses a variety of parameters and tests to set up the run. </li>
<li> <a href="http://moo.nac.uci.edu/~hjm/bduc/array_job.sh">array_job.sh</a> is a qsub script that implements an array job &#8211; it uses SGE&#8217;s internal counter to vary the parameters to a command.  This example also uses some primitive bash arithmetic to calculate the parameters. </li>
<li> <a href="http://moo.nac.uci.edu/~hjm/bduc/qsub_generate.py">qsub_generate.py</a> is a Python script for generating serial qsubs, in a manner similar to the SGE array jobs.  However, if you need more control over your inputs &amp; outputs and /or are more familiar with Python, it may be useful. </li>
</ul>
<h3><a name="_current_queue_organization"></a>Current Queue Organization</h3>
<p>The batch queues have been reorganized for clarity.  They are now organized as follows:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">Queue        time*      total CPUs   Type
===============================================
int            2hr          4        interactive (*)
long-ics     240hr         78        batch
long-adc     240hr         64        batch
long-quad    240hr        124        batch (all 4core motherboards)
long         240hr        191        batch

* for the 'int' Q, you have 2 hr of aggregate CPU time (not
    wallclock time).</pre>
</td>
</tr>
</table>
<p>To submit short jobs (&lt;12hr), you can most easily <strong>not</strong> specify a Q &#8211; it will go on any batch Q.  To run on a longxxx Q, either specify the estimated runtime in the submission script by including the <strong>-l h_rt</strong> parameter</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">#$ -l h_rt=00:30:00 #30 min run</pre>
</td>
</tr>
</table>
<p>(also see below)</p>
<p>or submit specifically to one of the long Qs.</p>
<p>ie:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ qsub -q long-ics yourshellname.sh

# or include the Q spec in the script:

#$ -q long-ics</pre>
</td>
</tr>
</table>
<h3><a name="_fixing_qsub_errors"></a>Fixing qsub errors</h3>
<p>Occasionally, a script will hiccup and put your job into an error state.  This can be seen by the qstat <strong>state</strong> output:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ qstat -u '*'

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
   6868 0.62500 simple.sh  hmangala     E     06/08/2009 11:29:02 claws@claw3.bduc                   1
                                       ^^^</pre>
</td>
</tr>
</table>
<p>The <strong>E</strong> (marked with <em>^^^</em>) means that the job is in an <strong>ERROR</strong> state.  You can either delete the job with <strong>qdel</strong>:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">qdel &lt;Job ID&gt; # deletes the job</pre>
</td>
</tr>
</table>
<p>or often change its status with the <strong>qmod</strong> command.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">qmod -cj &lt;Job ID&gt; # clears the error state of the job</pre>
</td>
</tr>
</table>
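<p>When you have many jobs, picking the errored ones out of the <em>qstat</em> listing by eye gets tedious.  A minimal sketch of a filter, assuming the column layout shown above (job-ID in the 1st field, state in the 5th); the name <em>error_jobs</em> is made up:</p>

```shell
#!/bin/bash
# error_jobs: hypothetical filter. Reads 'qstat' output on stdin and prints the
# job-ID of every job whose state column (5th field) contains an 'E'.
# Header and separator lines are skipped because their 1st field is not a number.
error_jobs() {
    awk '$1 ~ /^[0-9]+$/ && $5 ~ /E/ { print $1 }'
}

# Example (commented out): inspect each errored job before clearing or deleting it
# for id in $(qstat | error_jobs); do qstat -j "$id" | less; done
```

<p>Looking at <em>qstat -j</em> for each job first is deliberate: clearing the error state with <em>qmod -cj</em> does not help if the underlying cause (a bad path, a missing file) is still there.</p>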
<h3><a name="SGE_script_params"></a>Some useful SGE script parameters</h3>
<p>When you submit an SGE script, it is processed by <em>both bash and SGE</em>. In order to protect the SGE directives from being misinterpreted by <em>bash</em>, they are prefixed by <em>#$</em>.  This prefix causes bash to ignore the rest of the line (it considers it a comment), but allows SGE to process the directive correctly.</p>
<p>So, the rules are:</p>
<ul>
<li> If it&#8217;s a bash command, don&#8217;t prefix it at all. </li>
<li> If it&#8217;s an SGE directive, prefix it with both characters (<em>#$</em>). </li>
<li> If it&#8217;s a comment, prefix it only with a <em>#</em>. </li>
</ul>
<p>Here are some of the most frequently used directives:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">#$ -N job_name     # this name shows in qstat
#$ -S /bin/bash    # run with this shell
#$ -q long-ics     # run in this Q
#$ -l h_rt=50:00:00  # need 50 hour runtime
#$ -l mem_free=2G  # need 2GB free RAM
#$ -l scr_free=XG  # need X GB scratch space
#$ -pe mpich 4     # define parallel env
#$ -cwd            # run the job out of the current directory
                   # (the one from which you ran the script)
#$ -o job_name.out # the name of the output file
#$ -e job_name.err # the name of the error file
#  or
#$ -o job_name.out -j y            # '-j y' merges stdout and stderr

#$ -t 0-10:2       # task index range (for looping); generates 0 2 4..10
#                    Each task reads $SGE_TASK_ID to find out whether it is
#                    task 0, 2, 4, 6, 8 or 10

#$ -notify         # send the job a warning signal before it is suspended or killed
#$ -M &lt;email&gt;      # send mail about this job to the given email address
#$ -m beas         # send mail to the owner when the job
#                    begins (b), ends (e), is aborted (a),
#                    or is suspended (s)</pre>
</td>
</tr>
</table>
<p>When a job starts, a number of SGE environment variables are set and are available to the job script.  Read about all of them <a href="http://wikis.sun.com/display/GridEngine/Submitting+Batch+Jobs">here (bottom of page).</a></p>
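<p>As an illustration of using one of those variables, here is a minimal array-job sketch in which each task uses <em>$SGE_TASK_ID</em> to pick its own input file.  The <em>sample_N.txt</em> naming scheme, the job name and the helper <em>input_for_task</em> are assumptions for this example; the task range here starts at 1.</p>

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -N array_demo
#$ -cwd
#$ -t 1-5

# input_for_task N: map a task index to an input file name.
# The sample_N.txt naming is an assumption made up for this sketch.
input_for_task() {
    printf 'sample_%d.txt\n' "$1"
}

# SGE sets SGE_TASK_ID to the task index for array jobs and to the string
# "undefined" for ordinary jobs, so only act when it holds a task number.
case "${SGE_TASK_ID:-undefined}" in
    undefined) ;;   # not running as an array task: do nothing
    *)
        infile=$(input_for_task "$SGE_TASK_ID")
        echo "task $SGE_TASK_ID on $HOSTNAME would process $infile"
        ;;
esac
```

<p>Replace the <em>echo</em> with the real command once the file names print as expected; a single <em>qsub</em> of this script then queues all 5 tasks.</p>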
<p>For more on SGE shell scripts, <a href="http://nbcr.sdsc.edu/pub/wiki/index.php?title=Sample_SGE_Script">see here</a>.</p>
<p>For a sample SGE script that uses mpich2, <a href="#mpich2script">see below</a></p>
<h3><a name="_where_do_i_get_more_info_on_sge"></a>Where do I get more info on SGE?</h3>
<p>Oracle&#8217;s purchase of Sun has resulted in a major disorganization of SGE (now OGE) documentation.  If a link doesn&#8217;t work, it may be because of this kerfuffle.  Tell me if a link doesn&#8217;t work anymore and I&#8217;ll try to fix it.</p>
<ul>
<li> The ROCKS group has a <a href="http://www.rocksclusters.org/rocksapalooza/2006/lab-sge.pdf">very good SGE Introduction</a> from the User&#8217;s perspective.  Ignore the ROCKS-specific bits. </li>
<li> <a href="http://www.google.com/search?hl=en&amp;q=Sun+Grid+Engine&amp;btnG=Search">Google Sun Grid Engine</a> is a good, easy start.  Maybe you&#8217;ll be lucky.. <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  </li>
<li> <a href="http://gridengine.info/">Chris Dagdigian&#8217;s SGE site</a> is very good and has an <a href="http://wiki.gridengine.info/wiki/index.php?Main_Page">excellent wiki</a> </li>
<li> The official <a href="http://www.oracle.com/technetwork/oem/grid-engine-166852.html">Sun (now Oracle) Grid Engine site</a> has a lot of good links. </li>
<li> The <a href="http://wikis.sun.com/display/sungridengine/Home">SGE docs</a> are the final word, but there are a lot of pages to cover. </li>
</ul>
<p>If you need to run an MPI parallel job, you can request the needed resources by Q as well by specifying the resources inside the shell script (more on this later) or externally via the -q and -pe flags (type <em>man sge_pe</em> on one of the BDUC nodes).</p>
<hr />
<h2><a name="_special_cases"></a>Special cases</h2>
<h3><a name="_editing_huge_files"></a>Editing Huge Files</h3>
<p>In a word, <strong>don&#8217;t</strong>.  Many research domains generate or use multi-GB text files. Prime offenders are log files and High-Thruput Sequencing files such as those from Illumina. These are meant to be processed programmatically, not with an interactive editor. When you use such an editor, it typically tries to load the entire thing into memory and generates various cache files.  (If you know of a text editor that handles such files without doing this, please let me know.)</p>
<p>Otherwise, use the utilities <a href="http://goo.gl/6kBwR">head</a>, <a href="http://goo.gl/ISdl2">tail</a>, <a href="http://goo.gl/3vB04">grep</a>, <a href="http://goo.gl/PQY80">split</a>, <a href="http://goo.gl/nDbu">less</a>, <a href="http://goo.gl/nZwOX">sed</a>, and <a href="http://goo.gl/r8YOc">tr</a>, possibly in combination with <a href="http://goo.gl/TkFSc">Perl</a>/<a href="http://goo.gl/Vjqc">Python</a>, to peek into such files and/or change them.</p>
<p><a href="http://en.wikipedia.org/wiki/Grep">grep</a> especially is one of the most useful tools for text processing you&#8217;ll ever use.</p>
<p>For example, the following command starts at line 2,000,000 of a file, stops at line 2,500,000, and shows that range in the <em>less</em> pager.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">$ perl -n -e 'print if ( 2000000 .. 2500000)' humongo.txt | less</pre>
</td>
</tr>
</table>
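<p>A similar line-range extraction can be done with <em>sed</em>, which can be told to quit as soon as the last wanted line has been printed rather than reading on to the end of a multi-GB file.  A sketch, with the helper name <em>show_lines</em> made up for illustration:</p>

```shell
#!/bin/bash
# show_lines START END FILE
# Hypothetical helper: print lines START..END of FILE, then quit immediately
# so the rest of a multi-GB file is never read.
show_lines() {
    sed -n -e "${1},${2}p" -e "${2}q" "$3"
}

# Example (commented out):
# show_lines 2000000 2500000 humongo.txt | less
```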
<p>In addition, please use the commandline utilities <a href="http://goo.gl/WQGhy">gzip/gunzip</a>, <a href="http://goo.gl/baoIB">bzip2</a>, <a href="http://goo.gl/VpiyQ">zip</a>, <a href="http://goo.gl/7sdXN">zcat</a>, etc instead of the <a href="http://goo.gl/b2828">ark</a> graphical utility on such files. <em>ark</em> apparently tries to store everything in RAM before dumping it.</p>
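<p>For instance, to look at the start of a compressed file without unpacking it to disk, the decompressor can stream into <em>head</em>.  A minimal sketch; the function name <em>peek_gz</em> and the example file name are hypothetical:</p>

```shell
#!/bin/bash
# peek_gz FILE [N]
# Hypothetical helper: show the first N lines (default 10) of a gzip-compressed
# text file without writing an uncompressed copy to disk.
peek_gz() {
    gzip -dc "$1" | head -n "${2:-10}"
}

# Example (commented out; the file name is made up):
# peek_gz humongo.txt.gz 20
```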
<h3><a name="_namd_scripts"></a>NAMD scripts</h3>
<p><a href="http://www.ks.uiuc.edu/Research/namd/">namd</a> is a molecular dynamics application that interfaces well with <a href="http://www.ks.uiuc.edu/Research/vmd/">VMD</a>. Both of these are available on BDUC &#8211; see the output of the <em>module avail</em> command. The scripts to submit a <em>namd</em> run to the SGE Q&#8217;ing system are a bit tricky due to the way <em>namd</em> is compiled and run &#8211; it uses an MPI-like parallelization approach without explicitly requiring an external MPI library &#8211; those functions are linked into the <em>charmrun</em> executable supplied with the <em>namd</em> package.  However, SGE requires an MPI parallel environment to reserve the cores necessary for such a run.</p>
<p><a href="http://moo.nac.uci.edu/~hjm/bduc/namd_sge_submit.sh">namd_sge_submit.sh</a> is an SGE submission script that runs successfully on BDUC if given a valid <em>namd</em> input file. It must be submitted to SGE as follows:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">qsub -pe openmpi &lt;#cores&gt; &lt;name_of_script&gt;
# or explicitly, for an 8core job
qsub -pe openmpi 8 namd_sge_submit.sh</pre>
</td>
</tr>
</table>
<p><em>(thanks to Chad Cantwell for the hints and pointer to the <a href="http://www.ks.uiuc.edu/Training/Workshop/Cluster/files/using_rocks.html">original page</a>)</em></p>
<h3><a name="_sate"></a>SATe</h3>
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>SATe is ONLY available on Claw nodes</b></p>
<p>Until we get a better sense of SATe popularity, it and its requisite tools are only available on the Claw nodes.  You can log into Claw1 directly (<em>ssh -Y &lt;your_UCINetID&gt;@bduc-claw1.nacs.uci.edu</em>) and from there, to any of the other Claw nodes. If you are going to run a job that will take more than 10 minutes, we <em>INSIST</em> that you run it under SGE so that the nodes don&#8217;t get oversubscribed.  How to write and submit an SGE script <a href="#SGE_batch_jobs">is described here</a>.</p>
<p>Note that you will have to run on the <em>claws</em> queue, i.e. your qsub script will have to include the SGE directive:</p>
<p><strong>#$ -q claws</strong></p>
</td>
</tr>
</table>
<p><a href="http://phylo.bio.ku.edu/software/sate/sate.html">SATe</a> is a Python wrapper around a number of phylogenetic tools.  It, along with its requisite tools (<a href="ftp://ftp.ebi.ac.uk/pub/software/clustalw2/">ClustalW2</a>, <a href="http://align.bmr.kyushu-u.ac.jp/mafft/software/">MAFFT</a>, <a href="http://www.drive5.com/muscle/">MUSCLE</a>, <a href="http://opal.cs.arizona.edu/">OPAL</a>, <a href="http://www.ebi.ac.uk/goldman-srv/prank/prank/">PRANK</a>, <a href="http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm">RAxML</a>), is installed in the shared <em>/usr/local/bin</em> directory of the Claw nodes.</p>
<p>The test cases work with the default settings, but if you want to change any parameters, you have to edit the configuration file and feed it to <em>run_sate.py</em> with <em>-c</em> as shown below.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">export SH=/usr/local/bin # &lt;- shortens the following lines considerably

run_sate.py -c sate.cfg  -i $SH/sate_data/small.fasta -t $SH/sate_data/small.tree -j test
            ^^^^^^^^^^^</pre>
</td>
</tr>
</table>
<p>You can name the configuration file anything you like, but it has a specific format.  In particular, do <em>NOT</em> try to start comments anywhere except the 1st character of a line, and then only beginning with <em>#</em>.</p>
<p>Here is <a href="http://moo.nac.uci.edu/~hjm/sate.cfg">a good SATe configuration file</a> to start from.</p>
<p>Here is <a href="http://moo.nac.uci.edu/~hjm/sate.bad.cfg">the same SATE configuration file with some bad comments</a> (marked as such with <em>BAD COMMENT</em> in the offending line.)</p>
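<p>A quick way to check a configuration file against the comment rule is to search for any <em>#</em> that is not the 1st character of its line. This is a minimal sketch, not part of SATe: the function name <em>find_bad_comments</em> is made up here, and it assumes that no legitimate setting value contains a <em>#</em> (if one does, it will be reported as a false alarm).</p>

```shell
# find_bad_comments FILE
# List (with line numbers) every line that contains a '#' anywhere
# other than as its very first character, e.g. a trailing comment
# after a setting, or a comment that has been indented.
find_bad_comments() {
    # first character is not '#', and a '#' appears later on the line
    grep -n '^[^#].*#' "$1"
}

# Usage:
#   find_bad_comments sate.cfg
```

<p>No output means that no suspicious lines were found.</p>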
<p>Here is <a href="http://moo.nac.uci.edu/~hjm/SATE.sh">an example qsub submission script for SATe</a>.  Submit to SGE as:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">qsub SATE.sh</pre>
</td>
</tr>
</table>
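<p>If you would rather start from a bare skeleton, the sketch below writes a minimal job script that combines the <em>claws</em> queue directive with the test command shown earlier. It is not the linked <em>SATE.sh</em>: the function name <em>write_sate_job</em>, the job name <em>sate_test</em>, and the use of <em>-cwd</em> are assumptions made for this illustration.</p>

```shell
# write_sate_job OUTFILE
# Write a minimal SGE job script for the SATe test case to OUTFILE.
# The quoted 'EOF' keeps $SH from being expanded while writing.
write_sate_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
# SATe runs only on the claw nodes, so the claws queue is required
#$ -q claws
# run the job under bash
#$ -S /bin/bash
# job name shown by qstat (arbitrary)
#$ -N sate_test
# start in the directory the job was submitted from
#$ -cwd
export SH=/usr/local/bin
run_sate.py -c sate.cfg -i $SH/sate_data/small.fasta -t $SH/sate_data/small.tree -j test
EOF
}

# Usage:
#   write_sate_job my_sate.sh && qsub my_sate.sh
```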
<h3><a name="_r_on_bduc"></a>R on BDUC</h3>
<p><a href="http://www.r-project.org">R</a> is an object-oriented language for statistical computing, like SAS (see below).  It is becoming increasingly popular among both academic and commercial users to the extent that it was <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html">noted in the New York Times</a> in early 2009.  For a very simple overview with links to other, better resources, see <a href="http://moo.nac.uci.edu/~hjm/AnRCheatsheet.html">this link</a>.</p>
<p>There are multiple versions of R on BDUC, and they do not all behave identically.  We have a split cluster: most nodes (~80; ~160 cores) run <a href="http://www.centos.org">CentOS</a> (<a href="http://www.redhat.com">RedHat</a>-based); the 4 claw nodes (16 cores) run a version of <a href="http://www.ubuntu.org">Ubuntu</a> (<a href="http://www.debian.org">Debian</a>-based).  Because of slightly different library structures and versions, some R add-ons don&#8217;t work across the subclusters, so in those situations, we concentrate on getting the <em>standard</em> approach working on the CentOS nodes, and provide work-arounds on the claw nodes.</p>
<p>The module system provides R versions <em>2.10.0</em> and <em>2.8.0</em> for all nodes.  Additionally, the claw nodes provide version <em>2.9.0</em> because it is the default version there.  Finally, I&#8217;ve added the <em>R development</em> version which is automatically downloaded, compiled, and re-installed every night from the R archives.  This is the VERY LATEST version, so new that it (infrequently) fails.  However, if you need the latest and greatest version, it&#8217;s available.  To load any of these versions, inquire what the available versions are with <em>module avail</em> and then use the appropriate <em>module load R/&lt;version&gt;</em> to set up the paths.</p>
<p>For most things, everything works identically.  The things that don&#8217;t usually have to do with parallel processing in R and the underlying <a href="http://en.wikipedia.org/wiki/Message_Passing_Interface">Message Passing Interface</a> (MPI) technology:</p>
<ul>
<li> <a href="http://cran.r-project.org/web/packages/Rmpi/index.html">Rmpi</a> should work on all CentOS nodes with version 2.10.0.  The claw nodes will not work with the 2.10.0 version as it has a complicated lib dependency that leads into some very bushy areas.  Rmpi DOES work on the claw nodes, but only under R 2.9.0 (the default). </li>
<li> <a href="http://cran.r-project.org/web/packages/rsprng">rsprng</a> (R&#8217;s wrapping of <a href="http://sprng.cs.fsu.edu">SPRNG</a>) is available on all the CentOS nodes for R/2.10.0 and on the default 2.9.0 version on the claw nodes. </li>
<li> <a href="http://cran.r-project.org/web/packages/snow/">snow</a> and <a href="http://cran.r-project.org/web/packages/snowfall/">snowfall</a> are available on the CentOS nodes with version 2.10.0 and on the claw nodes with the default 2.9.0 version. </li>
</ul>
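<p>The list above boils down to one R version per subcluster for parallel work. As a small sketch (the function name <em>r_version_for_parallel</em> and the labels <em>centos</em> and <em>claw</em> are invented for this example), a job script could look the version up like this:</p>

```shell
# r_version_for_parallel SUBCLUSTER
# Print the R version on which Rmpi, rsprng, snow, and snowfall are
# supported for the given subcluster ("centos" or "claw").
r_version_for_parallel() {
    case "$1" in
        centos) echo "2.10.0" ;;   # loaded via the module system
        claw)   echo "2.9.0" ;;    # the default R on the claw nodes
        *)      echo "unknown subcluster: $1" >&2; return 1 ;;
    esac
}

# Usage on a CentOS node:
#   module load R/"$(r_version_for_parallel centos)"
```

<p>On the claw nodes 2.9.0 is the default R, so the value is useful there mainly as a check against what R itself reports.</p>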
<h3><a name="_sas_9_2_for_linux"></a>SAS 9.2 for Linux</h3>
<p>We have a single node-locked license for SAS 9.2 on claw1, a 4core Opteron node with 32GB RAM.  While the license is for that node only, as many instances of SAS can be run as there is RAM for it.</p>
<p>To start SAS on claw1, type:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># cd to the directory where your data is
cd /dir/holding/data

# and start SAS
sas</pre>
</td>
</tr>
</table>
<p>This will start an X11 SAS session, opening several windows on your monitor (as long as you have an active X11 server running).  If you&#8217;re connecting from Mac or Windows, <a href="#graphics">please see this link</a>.</p>
<p>You can use the SAS program editor (one of the windows that opens automatically), or use any other editor you want and paste or import that code into SAS.  <a href="http://www.gnu.org/software/emacs/">emacs</a> and <a href="http://ess.r-project.org/">ESS (Emacs Speaks Statistics)</a> make a very powerful combination.  ESS is mostly targeted to the R language, but it also supports SAS and Stata.</p>
<p><a href="http://www.nedit.org">Nedit</a> also has a <a href="http://www.nedit.org/ftp/contrib/highlighting/sas.1.0.pats">template file for SAS</a>.</p>
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>To use Java (ods graphics)</b></p>
<p>SAS 9.2 uses Java for at least some of its plotting routines (the <em>ods graphics</em>).</p>
<p>The 64b version of SAS that we use on claw1 still uses the 32b version of Java which needs the environment vars set to tell SAS where to find things, so if you are going to use SAS on claw1, please add the following to your <em>~/.bashrc</em> file:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># convenience shortcut
export SASPATH=/home/apps/SAS-x86_64/9.2

# following is required to allow 32bit java to find its libs
export LD_LIBRARY_PATH=${SASPATH}/jre1.5.0_21/lib/i386:\
${SASPATH}/jre1.5.0_21/lib/i386/server:${LD_LIBRARY_PATH}

# Need to set JAVAHOME to the JRE root so when SAS calls java, the right executable is called.
export JAVAHOME=${SASPATH}/jre1.5.0_21/</pre>
</td>
</tr>
</table>
</td>
</tr>
</table>
<h3><a name="_parallel_jobs"></a>Parallel jobs</h3>
<p>BDUC supports several <a href="http://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a> variants.</p>
<h4><a name="_mpich2"></a>MPICH2</h4>
<p>BDUC is running MPICH2 version 1.1.1p1.  Using it is not hard, but requires a few things:</p>
<ul>
<li> To compile MPI binaries, you&#8217;ll have to <a href="#modules">module load</a> the MPICH2 environment: </li>
</ul>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">module load mpich2</pre>
</td>
</tr>
</table>
<ul>
<li> You need to set up <a href="#passwordless_ssh">passwordless ssh</a> so that you can ssh to any BDUC node without entering a password, including editing your <strong>~/.ssh/config</strong> file to prevent 1st-time connection warnings from interrupting your jobs. </li>
<li> You need to create the file <strong>~/.mpd.conf</strong>, as below: </li>
</ul>
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p>From <strong>Dec. 15th, 2009</strong> onwards, the <em>.mpd.conf</em> is set up for you automatically when your account is activated, so you no longer have to do this manually.  However, as a reference for those of you who want to set it up on other machines, I&#8217;ll leave the documentation in place.</p>
</td>
</tr>
</table>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">cd
# replace 'thisismysecretpassword' with something random.
# You won't have to remember it.
echo "MPD_SECRETWORD=thisismysecretpassword" &gt;.mpd.conf
chmod og-rw .mpd.conf</pre>
</td>
</tr>
</table>
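<p>If you would rather not invent the secret word yourself, it can be generated. The following is a sketch under stated assumptions: the function name <em>make_mpd_conf</em> is made up, the 32-character hex format of the secret is just a convenient choice, and the permissions are set to owner-only read/write, which is what the <em>chmod</em> above results in for a freshly created file. Note that it overwrites an existing file.</p>

```shell
# make_mpd_conf [FILE]
# Create an mpd configuration file (default: ~/.mpd.conf) holding a
# random secret word, readable and writable only by its owner.
# An existing FILE is overwritten.
make_mpd_conf() {
    local conf secret
    conf=${1:-$HOME/.mpd.conf}
    # 16 random bytes from the kernel, printed as 32 hex characters
    secret=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
    # write the file inside a subshell with a restrictive umask so that
    # a newly created file is never readable by group or others
    ( umask 077 && echo "MPD_SECRETWORD=${secret}" > "$conf" )
    # also tighten the permissions if the file already existed
    chmod 600 "$conf"
}

# Usage:
#   make_mpd_conf            # writes ~/.mpd.conf
```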
<ul>
<li> Your mpich2 qsub scripts have to include the 2 following lines in order to allow SGE to find the PATHS to executables and libraries: </li>
</ul>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">module load mpich2
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"</pre>
</td>
</tr>
</table>
<p><a name="mpich2script"></a>A full MPICH2 script is shown below.  Note the <em>#$ -pe mpich2 8</em> line which sets up the MPICH2 parallel environment for SGE and requests 8 slots (CPUs). (see <a href="#SGE_script_params">above</a> for more SGE script parameters)</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">#!/bin/bash
# good idea to be explicit about using /bin/bash (NOT /bin/sh).
# Some Linux distros symlink sh -&gt; dash for a lighter weight
# shell, which works 99% of the time but causes unimaginable pain
# in those 1% occasions.

# Note that SGE directives are prefixed by '#$' and plain comments are prefixed by '#'.
# Text after the '&lt;-' should be removed before executing.

#$ -q long    &lt;- the name of the Q you want to submit to
#$ -pe mpich2 8    &lt;- load the mpich2 parallel env and ask for 8 slots
#$ -S /bin/bash    &lt;- run the job under bash
#$ -M harry.mangalam@uci.edu &lt;- mail this guy ..
#$ -m bea          &lt;- .. when the script (b)egins, (e)nds, or (a)borts
#$ -N cells500     &lt;- name of the job in the qstat output
#$ -o cells500.out &lt;- name of the output file.
#
module load mpich2              &lt;- load the mpich2 environment
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID" &lt;- this is REQUIRED for SGE to set it up.
module load neuron              &lt;- load another env (specific for 'neuron')
export NRNHOME=/apps/neuron/7.0 &lt;- ditto
cd /home/hmangala/newmodel      &lt;- cd to this dir before executing
echo "calling mpiexec now"      &lt;- some debugging text
mpiexec -np 8 nrniv -mpi -nobanner -nogui /home/hmangala/newmodel/model-2.1.hoc
# above, start the job with 'mpiexec -np 8', followed by the executable command.</pre>
</td>
</tr>
</table>
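<p>Since the <em>&lt;-</em> annotations in the script above have to be removed before the script will run, here is a small filter that does it. This is only a sketch: the function name <em>strip_annotations</em> and the filenames in the usage note are made up, and the filter assumes that <em>&lt;-</em> never appears in the real commands of your script (it would be cut there too).</p>

```shell
# strip_annotations
# Filter stdin to stdout, deleting each '<-' annotation together with
# the whitespace in front of it.  Lines without '<-' pass through unchanged.
strip_annotations() {
    sed 's/[[:space:]]*<-.*$//'
}

# Usage (filenames are placeholders):
#   strip_annotations < annotated_job.sh > job.sh
#   qsub job.sh
```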
<h3><a name="_matlab"></a>MATLAB</h3>
<p>MATLAB can be started from the login node by typing <em>matlab</em>.  This will log you into a 64bit interactive node and start the MATLAB Desktop. <em>matlabbig</em> will start an interactive session on one of the claw nodes (32GB RAM).</p>
<p>We have 3 licenses for interactive MATLAB on the BDUC cluster.  Those licenses are decremented from the campus MATLAB license pool.  They are meant for running interactive, relatively short-term MATLAB jobs, typically less than a couple hours.  If they go longer than that, or I see that you&#8217;ve launched several MATLAB jobs, they are liable to be killed off.</p>
<p>If you want to run long jobs using MATLAB code, the accepted practice is to compile your MATLAB <em>.m</em> code to a native executable using the MATLAB compiler <em>mcc</em> and then submit that executable, along with your data, to a batch Q (see above for submitting batch jobs).  This approach does not require a MATLAB license, so you can run as many instances of this compiled code for as long as you want without impacting the campus licenses.</p>
<p>The official mechanics of doing this <a href="http://tinyurl.com/nebw3e">are described here</a>.</p>
<p>Some additional notes from someone who has done this <a href="#matlabcompiler">are in the Appendix</a>.</p>
<h3><a name="_matlab_alternatives"></a>MATLAB Alternatives</h3>
<p>There are a number of MATLAB alternatives, the most popular of which are available on BDUC.  Since these are Open Source, they aren&#8217;t limited in the number of simultaneous uses, altho you should always try to run batch jobs in the SGE queue if possible. <a href="http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html#MathModel">See this doc for an overview of them and further links</a>.</p>
<h3><a name="_hadoop"></a>Hadoop</h3>
<p><a href="http://hadoop.apache.org/">Hadoop</a> is a Java-based framework for running large-grained, parallel jobs on clusters.  It now encompasses a large number of subprojects, but it is usually used with the <a href="http://hadoop.apache.org/mapreduce/">MapReduce</a> approach. It scales very well, but it is complex to run on BDUC because it requires its own filesystem and scheduler.  Since on BDUC (and other general-purpose clusters which are not dedicated to hadoop full-time) the job scheduling is more general-purpose, we have to run it as a meta-job. That is, you submit a request to SGE to allocate a number of nodes on which to run hadoop; SGE allocates them to hadoop; hadoop sets up the logical structures it needs on those allocated nodes and everyone&#8217;s happy.  We run hadoop under <a href="http://myhadoop.sourceforge.net/">myHadoop</a>, a small bit of middleware designed to handle the interactions between SGE and Hadoop.</p>
<p>Note that Hadoop initializes its own filesystem on the existing /scratch directory for each node.  Because Hadoop can end up storing many GBs of data, we have set up a dedicated hadoop SGE Q named (surprise) <em>hadoop</em>.  All the nodes in this Q should have &gt;300GB available and you can test this by running the following command on the login node:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">cf --config=/usr/local/bin/cfrc --target=HADOOP 'df -h |grep sda3 | scut --c1=3'</pre>
</td>
</tr>
</table>
<p>This will start an interactive script that creates a subdir in the current directory containing one file named for each hadoop node, each of which lists the free diskspace on that node&#8217;s /scratch.  You can see the results by doing this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">cd REMOTE_CMD-df--h--grep-sda3---s-&lt;timestamp&gt; # timestamp changes obviously
grep G *
a64-141:334G
a64-142:337G
..
..</pre>
</td>
</tr>
</table>
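<p>With more than a handful of nodes it is easier to let <em>awk</em> pick out the ones that fall short. The filter below is a sketch that assumes the <em>node:NNNG</em> line format shown above; the function name <em>low_scratch_nodes</em> is made up, and only sizes reported in G are examined (anything else is skipped, just as the <em>grep G</em> above would skip it).</p>

```shell
# low_scratch_nodes MIN_GB
# Read "node:NNNG" lines on stdin and print those reporting fewer
# than MIN_GB gigabytes free.  Lines not ending in G are ignored.
low_scratch_nodes() {
    awk -F: -v min="$1" '
        $2 ~ /^[0-9.]+G$/ {
            size = $2
            sub(/G$/, "", size)              # drop the unit suffix
            if (size + 0 < min + 0) print    # compare numerically
        }'
}

# Usage, from inside the REMOTE_CMD-... subdir:
#   grep G * | low_scratch_nodes 300
```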
<p>An SGE submit script for Hadoop  <a href="http://moo.nac.uci.edu/~hjm/bduc/hadoop_example_qsub.sh">is here</a>.</p>
<p>The usual way to exploit Hadoop is to write your application in Java, typically wrapping it into a <em>jarfile</em>.  This Java requirement is not absolute tho. You can also write your application in <a href="http://www.jython.org/">Jython</a> (Python written in Java) and therefore essentially write Java using Python.  You can also write your hadoop app in pure Python (or even in C++).  <a href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">Here is a page describing that approach.</a></p>
<p>The hadoop installation can be found at $HADOOP_HOME. By referring to this variable, you can, for example, add standard libraries to the class path of your java application (e.g. Class-Path: $HADOOP_HOME/lib/hadoop-0.20.2-tools.jar).</p>
<p>(Thanks to Fabian Lindenberg for helping to set up and debug Hadoop on BDUC.)</p>
<h3><a name="_gpus"></a>GPUs</h3>
<p>Thanks to Dr. Steve Jenks, the <em>claw8</em> node contains 2 Nvidia C1060 Graphics Processing Units (<a href="http://en.wikipedia.org/wiki/GPU">GPUs</a>), each with 240 cores.  These cores are specialized to do <a href="http://en.wikipedia.org/wiki/SIMD">SIMD</a> tasks very fast. For example, if your code supports that kind of processing, you can get 10-100X speedup for those parts of the code.  You can learn more about programming these GPUs with <a href="http://en.wikipedia.org/wiki/CUDA">CUDA</a> at the local docs or via the <a href="http://developer.nvidia.com/cuda-toolkit-40">more up-to-date docs at NVIDIA</a>.</p>
<p>In order to use the GPUs, you will have to be registered to use the GPU SGE Queue (contact <a href="mailto:harry.mangalam@uci.edu">harry.mangalam@uci.edu</a>) and of course will have to provide your own code, altho the entire Nvidia SDK examples are compiled and available locally: source code at <em>/apps/gpu/1.0/NVIDIA_GPU_Computing_SDK/C/src/</em>, compiled binaries at <em>/apps/gpu/1.0/NVIDIA_GPU_Computing_SDK/C/bin/</em>.  Of course, since only <em>claw8</em> has the GPUs installed, you&#8217;ll only be able to run them there.</p>
<p>In order to run the compiled GPU code for long runs, you&#8217;ll have to submit them thru the SGE scheduler using the <em>gpu</em> queue which is initialized by using the <strong>module load gpu</strong> directive in your qsub script.  We expect that you&#8217;ll do your debugging on your own machine altho you can do it on claw8 after you have registered and been added to the <em>gpu</em> group.</p>
<p>We currently have the CUDA Toolkit 4.0 installed and will try to remain current with NVIDIA upgrades.</p>
<hr />
<h2><a name="graphics"></a>Graphics</h2>
<p>All the interactive nodes will have the full set of X11 graphical tools and libraries. However, since you&#8217;ll be running remotely, any application that requires OpenGL, while it will probably run, will run so slowly that you won&#8217;t want to run it for long.  If you have an application that requires OpenGL, you&#8217;ll be much better off downloading the processed data to your own desktop and running the application locally.</p>
<h3><a name="_if_you_connect_using_linux"></a>If you connect using Linux</h3>
<p>In order to have access to these X11 tools via Linux, your local Linux must have the X11 libraries available. Unless you have explicitly excluded them, all modern Linux distros include X11 runtime libraries.  Don&#8217;t forget to use the <em>-Y</em> flag when you connect using ssh to tunnel the X11 display back to your machine:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">ssh -Y your_UCINetID@bduc.nacs.uci.edu</pre>
</td>
</tr>
</table>
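<p>Once logged in, a quick sanity check is to see whether the remote shell has a <em>DISPLAY</em> set, which ssh does when X11 forwarding has been set up. A minimal sketch (the function name <em>x11_forwarding_ok</em> is made up for this example):</p>

```shell
# x11_forwarding_ok
# Succeed and report the display if DISPLAY is set in this shell;
# otherwise complain on stderr and return 1.
x11_forwarding_ok() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X11 display is $DISPLAY"
    else
        echo "DISPLAY is not set; reconnect with ssh -Y" >&2
        return 1
    fi
}

# Usage, on the remote node before starting a graphical program:
#   x11_forwarding_ok && xterm &
```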
<h3><a name="_if_you_connect_using_macosx"></a>If you connect using MacOSX</h3>
<p>The MacOSX installation DVDs come with a free, Apple-certified X11 installation. On Leopard it&#8217;s in <strong>Optional Installs &#8594; Optional Installs.mpkg</strong>.  All you have to do is install it and start it running in the background to accept the X11 windows (<strong>Applications &#8594; Utilities &#8594; X11</strong>). Ditto the <em>-Y</em> ssh flag as above.</p>
<h3><a name="XonWin"></a>If you connect using Windows</h3>
<p>There are quite a few ways to use a Linux system besides logging into it directly from the console.</p>
<ul>
<li> remote shell access, using <a href="http://www.chiark.greenend.org.uk/~sgtatham/putty/">PuTTY</a>, a free ssh client, which even allows X11 forwarding so that you can use it with Xming (below) to view Graphical apps from BDUC. <em>PuTTY</em> is a straight ssh terminal connection that allows you to securely connect to the Linux server and interact with it on a purely text-based basis. For shell/terminal cognoscenti, it&#8217;s considerably less capable than any of the terminal apps (konsole, eterm, gnome-terminal, etc) that come with Linux, but it&#8217;s fine for establishing the 1st connection to the Linux server. If you&#8217;re going to run anything that requires an X11 GUI, you&#8217;ll need to set PuTTY to do X11 forwarding.  To enable this, double-click the PuTTY icon to bring up the PuTTY configuration window. On the left Pane, follow the clickpath: <em>Connection &#8594; SSH &#8594; X11 &#8594; set the Enable X11 Forwarding</em>. After setting this, click on Session at the top of the pane, set a name in <em>Saved Sessions</em> on the lower right pane, and click the [Save] button to save the connection information so that the next time you need to connect, the correct setting will already be set. </li>
<li> <a href="http://sourceforge.net/projects/xming/">Xming</a>, a lightweight and free X11 server (client, in normal terminology). Xming provides <em>only the X server</em>, as opposed to <em>Cygwin/X</em> below.  Xming provides the X server that displays the X11 GUI information that comes from the Linux machine. When started, it looks like it has done nothing, but it has started a hidden X11 window (note the Xming icon in the toolbar). When you start an X application on the Linux server (after logging in with PuTTY as described above), it will accept a connection from the Linux machine and display the X11 app as a single window that looks very much like a normal MS WinXP window. You&#8217;ll be able to move it around, minimize it, maximize it and close it by clicking on the appropriate button in the title bar. There may be a slight lag in response in that window, but over the University network, it should be acceptable. </li>
<li> if you have trouble setting up Putty and Xming, please see  <a href="http://www.math.umn.edu/systems_guide/putty_xwin32.html">this page which describes it in more detail, with screenshots</a> </li>
<li> <a href="http://x.cygwin.com/">Cygwin/X</a>, another free, but much larger and capable X server (combined with an entire Linux-on-Windows implementation). Provides much more power and requires much more user configuration than Xming.  Cygwin/X provides not only a free Xserver but nearly the entire Linux experience to Windows. This is more than what most normal users want (both in diskspace and configuration), especially if you have a real Linux server to use. The X11 server is very good tho, as you might expect. </li>
<li> <a href="http://www.realvnc.com/">VNC server and client</a>. Run the server on the Linux machine and connect to it with the client running on your Windows Desktop. Can provide the entire Linux Desktop experience on your Windows machine, altho with less graphics performance (it&#8217;s fine to connect to a machine on the university network, but slow across the Internet).  VNC is a mechanism that can present the entire Linux Desktop to the user, including not only the application windows, but the Desktop itself, with all the bells and whistles that that metaphor provides. The RealVNC package for Windows provides both the Viewer and the Server, so you can provide remote access to your Windows Desktop as well.  This can be especially useful if you&#8217;re trying to demo a Desktop application to others &#8211; you can configure the VNC server to allow multiple read-only clients (they can&#8217;t take control of your desktop) to watch you run the app. Combined with the multiplatform VOIP application Skype which can run on the same machine, you have a very cheap tele-screensharing setup good for demo&#8217;ing applications. The Windows VNC server is efficient enough to support at least 10 viewing clients and the refresh rate is good enough for a mostly 2D demo across UC. </li>
<li> <a href="http://nomachine.com/">NoMachine</a> <a href="http://www.nomachine.com/download.php">Server and Clients</a>, a system much like the VNC system but much more efficient and therefore better performing. It is also more complicated to set up. Please read <a href="http://www.linux.com/archive/feature/116354">this review</a> for an overview of what is required and how to install it.   For personal use, there is a free server and client. For more connections, you&#8217;ll need a commercial license.  There are also 2 free NX servers: <a href="http://freenx.berlios.de/">FreeNX</a> and Google&#8217;s recently released <a href="http://code.google.com/p/neatx/">NeatX</a>, both of which are fairly easy to install and allow unlimited connections.  BDUC uses both of the free versions; they are described in more detail immediately below. </li>
</ul>
<h3><a name="nomachine"></a>NoMachine NX connections</h3>
<p>We&#8217;ve added GPL&#8217;ed NXservers to both the <em>login</em> and the <em>claw1</em> nodes. Both have direct external connections.  With the <em>nxclient</em> software installed on your machine, you can run the entire Linux Desktop (<em>Gnome</em> on login, <em>KDE</em> on claw1) with remarkable speed.  Here&#8217;s a screenshot of my laptop screen with the nxclient Desktop from claw1 running SAS, matlab, and tablet (a genomics assembly viewer):</p>
<p><img src="nxclient_desktop_ss.png" style="border-width:0;" alt="nxclient desktop shot"></p>
<p>Get the appropriate client software for your platform <a href="http://www.nomachine.com/download.php">here</a>.</p>
<h4><a name="_configuring_the_nxclient"></a>Configuring the nxclient</h4>
<p>The configuration is fairly simple.  The initial pane allows you to set your <em>Login</em> (your UCINetID) and <em>Password</em> (your UCINetID password) and name the session anything you like.</p>
<p><img src="nxclient_screen1.png" style="border-width:0;" alt="nxclient screen1"></p>
<p>Clicking the <em>Configure&#8230;</em> button takes you to a set of tabbed configuration pages.  The only one that needs to be modified is the 1st one <em>General</em>:</p>
<p><img src="nxclient_config_general_kde.png" style="border-width:0;" alt="nxclient general kde config"></p>
<p>The screenshot above shows the setup for logging into the claw1 node (which supports the KDE Desktop).  If you want to use it on the login node which supports the Gnome Desktop, see below:</p>
<p><img src="nxclient_config_general_gnome.png" style="border-width:0;" alt="nxclient general gnome config"></p>
<p><em>DO NOT</em> change the default Key unless you have problems logging into the <em>login</em> node.</p>
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>Key Changes with <em>bduc-login</em></b></p>
<p>When bduc-login was upgraded, the ssh keys used to validate the <em>nx</em> user (who inits nxserver) changed.  If you use the nxclient with bduc-login, you&#8217;ll have to change the nxclient key to this one: (also on the bduc-login node in: <em>/usr/local/share/nxclient_dsa_key</em>)</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">-----BEGIN DSA PRIVATE KEY-----
MIIBuwIBAAKBgQCXv9AzQXjxvXWC1qu3CdEqskX9YomTfyG865gb4D02ZwWuRU/9
C3I9/bEWLdaWgJYXIcFJsMCIkmWjjeSZyTmeoypI1iLifTHUxn3b7WNWi8AzKcVF
aBsBGiljsop9NiD1mEpA0G+nHHrhvTXz7pUvYrsrXcdMyM6rxqn77nbbnwIVALCi
xFdHZADw5KAVZI7r6QatEkqLAoGBAI4L1TQGFkq5xQ/nIIciW8setAAIyrcWdK/z
5/ZPeELdq70KDJxoLf81NL/8uIc4PoNyTRJjtT3R4f8Az1TsZWeh2+ReCEJxDWgG
fbk2YhRqoQTtXPFsI4qvzBWct42WonWqyyb1bPBHk+JmXFscJu5yFQ+JUVNsENpY
+Gkz3HqTAoGANlgcCuA4wrC+3Cic9CFkqiwO/Rn1vk8dvGuEQqFJ6f6LVfPfRTfa
QU7TGVLk2CzY4dasrwxJ1f6FsT8DHTNGnxELPKRuLstGrFY/PR7KeafeFZDf+fJ3
mbX5nxrld3wi5titTnX+8s4IKv29HJguPvOK/SI7cjzA+SqNfD7qEo8CFDIm1xRf
8xAPsSKs6yZ6j1FNklfu
-----END DSA PRIVATE KEY-----</pre>
</td>
</tr>
</table>
<p>On your nxclient, the click-path is: [Configure] &#8594; [Key] &#8594; (delete current key) &#8594; (paste in the contents of the above text box including the BEGIN &amp; END lines.)</p>
<p>Then [Save] &#8594; [Save] &#8594; [OK] &#8594; [Login]</p>
</td>
</tr>
</table>
<p>Unlike the commercial NoMachine NXserver, these servers allow any number of connections. There may be a fairly long wait (up to a minute) before the session is initially validated and the screen comes up (the Desktop is loading on the server), but after that, the interaction is very fast.</p>
<h4><a name="_terminating_the_session"></a>Terminating the session</h4>
<p>Note that when you close the session, you have 2 options &#8211; to <em>Disconnect</em> (closes the client but leaves the session running so you can reconnect to the same session you left) or <em>Terminate</em> (closes the client and kills the session, so you&#8217;ll start from a new Desktop instance).</p>
<p>Unless there is good reason to keep it running, please <strong>Terminate the session</strong> to free up resources.</p>
<p>I have run into a situation whereby if the session is ended oddly (e.g. you kill the nxclient by killing the shell from which it was started), it will leave a <em>ghost session</em>.  When you next start up the nxclient, it will offer to let you re-connect to an existing session, but then be unable to reconnect.  If this happens, you should still be able to start a new session, but please call me to address that situation &#8211; I have to manually remove the ghost session credentials in <em>/usr/local/var/lib/neatx/sessions/</em>.</p>
<table style="margin:.2em 0;">
<tr valign="top">
<td style="padding:.5em;">
<p><b><u>Note</u></b></p>
</td>
<td style="border-left:3px solid #e8e8e8;padding:.5em;">
<p><b>Further oddities with <em>nxclient</em></b></p>
<p>In the myriad configurations possible, there are a few other notes that may be useful.</p>
<p>If you can use the nxclient to log into a system and it appears to accept mouse input but no longer accepts keyboard input, you may need to define the <em>XKEYSYMDB</em> environment variable to explicitly point to the bduc node&#8217;s <em>XKeysymDB</em> file.  Add the following line to your <em>~/.bashrc</em> file:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">export XKEYSYMDB=/usr/share/X11/XKeysymDB</pre>
</td>
</tr>
</table>
<p>Then kill off all your nxclient sessions and start the nxclient again.  Keyboard input should work. This also addresses similar errors in older applications that use <em>Motif</em> widgets (nedit, others).</p>
<p>If it still does not work, you may have the wrong keyboard selected.  On bduc-login (GNOME Desktop), the way to select the correct keyboard is shown in the following images.   First, click to the Keyboard selection option (below left):</p>
<p><img src="images/keyboard_select.png" style="border-width:0;" alt="Keyboard Option Path"> <img src="images/keyboard_select2.png" style="border-width:0;" alt="Keyboard selection"></p>
<p>and then select the closest keyboard using the scrolling list (above, right).</p>
<p>If you start getting nx errors that result in a blank/black screen with a pop-up box that claims to be unable to complete: <em>/bin/bash -c &#8220;/etc/X11/xinit/Xsession gnome-session&#8221;</em>, you may have damaged your <em>~/.Xauthority</em> and/or <em>~/.ICEauthority</em> files.  The simple fix to this problem is to move them out of the way and then ssh into the node to re-create them.</p>
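<p>The move-them-out-of-the-way step can be sketched as a small bash function. This is only a sketch: the function name <em>reset_x_auth</em> and the <em>.bad.TIMESTAMP</em> suffix are assumptions made for illustration, not part of any BDUC tooling.</p>

```shell
# Hypothetical helper: move possibly damaged X authority files aside
# (with a timestamp suffix) so they get re-created at the next login.
# Takes an optional home directory; defaults to $HOME.
reset_x_auth() {
  local home_dir="${1:-$HOME}"
  local stamp
  stamp="$(date +%Y%m%d%H%M%S)"
  local f
  for f in .Xauthority .ICEauthority; do
    if [ -e "${home_dir}/${f}" ]; then
      # keep the old file instead of deleting it, in case it is needed later
      mv "${home_dir}/${f}" "${home_dir}/${f}.bad.${stamp}"
      echo "moved ${f} to ${f}.bad.${stamp}"
    fi
  done
}
```

<p>After running <em>reset_x_auth</em> on the node, log out and ssh back in so the files are re-created, then retry the nxclient.</p>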
</td>
</tr>
</table>
<hr />
<h2><a name="_how_to_manipulate_data_on_linux"></a>How to Manipulate Data on Linux</h2>
<p>This is a topic for a whole &#8216;nother document named  <a href="http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html">Manipulating Data on Linux</a> and the documents and sites referred to therein.</p>
<hr />
<h2><a name="_frequently_asked_questions"></a>Frequently Asked Questions</h2>
<p>OK, maybe not frequently, but cogently, and CAQ just doesn&#8217;t have the same ring. If you have other questions, please ask them.  If they address a frequent theme, I&#8217;ll add them here.  In any case, I&#8217;ll try to answer them.</p>
<p><b>Q&amp;A</b></p>
<ol>
<li> <em> What&#8217;s a node?  Is it the same as a processor? </em> A node refers to a self-contained chassis that has its own power supply, motherboard (containing RAM, CPU, controllers, IO slots and devices (like ethernet ports), various wires and unidentifiable electrogrunge).  It usually contains a disk, altho this is not necessary with boot-over-the-network.  It&#8217;s not the same as a processor.  Typical BDUC nodes (from the Jurassic period) have 2-4 CPU cores per node.  Modern nodes have 8 to &gt;100 cores. </li>
<li> <em> When I submit a .sh script with qsub, does the following line refer to 10 processors or 10 nodes? <strong>#$ -pe openmpi 10</strong> </em> 10 processor cores.  Most modern physical CPUs (the thing that plugs into the motherboard socket) have multiple processor cores internally these days. </li>
<li> <em> What about the call to mpiexec? <strong>mpiexec -np 10 nrniv -mpi -nobanner -nogui modelbal.hoc</strong> </em> Same thing.  That&#8217;s why they should be the same number. </li>
<li> <em> Is it possible for the processors on one node to be working on different jobs? </em> Yes, altho the scheduler can be told to try to keep the jobs on 1 node (better for sharing memory objects like libs, but worse if there&#8217;s significant contention for other resources like disk &amp; network IO).  Most of the MPI environments on BDUC are currently set to spread out the jobs rather than bunch them together on as few nodes as possible. </li>
<li> <em> If processor 1 (working on Job A) fails, does it bring down  processor 2 (working on Job B) as well? </em> No, and in fact it doesn&#8217;t typically work that way. A job does not run on a particular CPU; on a multi-core node, different threads of the same job can hop among CPU cores.  The kernel allocates threads and processes to whatever resources it has to optimize the job. </li>
<li> <em> Is the performance of processor 1 dependent on whether processor 2 is engaged in the same or different job? </em> It depends. The computational bits of a thread, when they are being executed on a CPU, don&#8217;t interfere much with the other processor. They do share memory, interrupts, and IO so if they&#8217;re doing roughly the same thing at roughly the same time, they&#8217;ll typically want to read and write at the same time and thus compete for those resources.  That was the rationale for <em>spreading out</em> the MPI jobs rather than <em>filling up</em> nodes. </li>
<li> <em> Is it possible for one processor to use more than its &#8220;share&#8221; of the memory available to the node, i.e., is it wrong for me to count on having a certain amount of memory just because I&#8217;ve specified a certain number of processors (nodes?) for my job? </em> The CPU running prog1 will request the RAM that it needs independent of other CPUs running prog1 or prog2, prog3, etc.  If the node gets close to running out of real RAM, it will start to swap idle (haven&#8217;t-been-accessed-recently) pages of RAM to the disk, freeing up more RAM for active programs.  If the computer runs out of both RAM and swap, it will hopefully kill off the offending programs until it regains enough RAM to function and then it will continue until it happens again.  This is why you should try to estimate the amount of RAM your prog will use and indicate that to the scheduler with the <em>-l mem_free</em> directive.  See <a href="#SGE_script_params">the section above.</a> </li>
<li> <em> I can ssh to BDUC but I can&#8217;t scp files to it.  Why? </em> Probably because you edited your <em>.bashrc</em> (or <em>.zshrc</em> or <em>.tcshrc</em>) to emit something useful when you log in.  (Both scp and ssh have a useful <em>-v</em> option that puts them into <em>verbose</em> mode, which tells you much more about what the process is doing and why it fails.) You need to mask this output from non-interactive logins like <em>scp</em> and remote <em>ssh</em> execution by placing such commands inside a <strong>test for an interactive shell</strong>. When using bash, you would typically do something like this: </li>
</ol>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># $- holds the shell's option flags; it contains 'i' only in an interactive shell
interactive=$(echo $- | grep -c i)
if [ "${interactive}" = 1 ] ; then
  # tell me what my 22 latest files are
  ls -lt | head -22
fi</pre>
</td>
</tr>
</table>
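<p>To check whether a startup file is safe for scp, one option is to source it in a non-interactive bash and see whether anything reaches stdout. The helper below is a sketch under that assumption; the name <em>rc_is_quiet</em> is hypothetical, and it only approximates what happens during a real scp login.</p>

```shell
# Hypothetical helper: returns success (exit 0) if sourcing the given startup
# file in a NON-interactive bash prints nothing to stdout.
# Pass a full path (e.g. ~/.bashrc) so that '.' does not search $PATH.
rc_is_quiet() {
  local rcfile="$1"
  local out
  # --norc/--noprofile: start from a clean shell and source only the file under test
  out="$(bash --norc --noprofile -c '. "$1"' _ "${rcfile}")"
  [ -z "${out}" ]
}
```

<p>For example, <em>rc_is_quiet ~/.bashrc || echo "your .bashrc prints output in non-interactive shells"</em> flags a file that is likely to break scp.</p>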
<hr />
<h2><a name="_appendix"></a>Appendix</h2>
<h3><a name="HowtoPasswordlessSsh"></a>HOWTO: Passwordless ssh</h3>
<p><em>Passwordless ssh</em> will allow you to ssh/scp to frequently used hosts without entering a passphrase each time.  <strong>The process below works on Linux and Mac only</strong>. Windows clients can do it as well, but it&#8217;s a different procedure.  However, regardless of your desktop machine, you can use passwordless ssh to log in to all the nodes of the BDUC cluster once you&#8217;ve logged into the login node.</p>
<table bgcolor="#ffffee" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<p style="margin-top:0;"><b>Note for BDUC Parallel / MPICH2 Users</b></p>
<p>If you&#8217;re going to be using some variant of MPI (MPICH, MPICH2, OpenMPI) or another parallel toolkit, you almost certainly will have to set this up to work on BDUC so you (or your scripts) can passwordlessly ssh to other nodes.  For BDUC users using only serial programs it can still be useful as it cuts down on the amount of typing of passwords you&#8217;ll have to do.</p>
<p>And it&#8217;s dead simple.</p>
</td>
</tr>
</table>
<p>In a terminal on your Mac or Linux machine, type:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
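<pre style="margin:0;padding:0;"># for no passphrase, use
# (-t rsa is given explicitly so the key files are named id_rsa / id_rsa.pub,
#  the names used in the rest of this section)
ssh-keygen -t rsa -b 2048 -N ""

# if you want to use a passphrase:
ssh-keygen -t rsa -b 2048 -N "your passphrase"
# but you probably /don't/ want a passphrase - else why would you be going thru this?</pre>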
</td>
</tr>
</table>
<p>Save the keys to the default locations.</p>
<p><strong>For the BDUC cluster case:</strong> Since all cluster nodes share a common <strong>/home</strong>, all you have to do is copy the public key file (normally <strong>id_rsa.pub</strong> in your ~/.ssh dir) to <strong>authorized_keys</strong> in the same dir (or append it, if that file already exists).</p>
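<p>For the shared-/home cluster case, a minimal sketch of that step follows. It appends the key rather than replacing the file, so any entries already in <strong>authorized_keys</strong> survive, and it tightens the permissions that sshd usually insists on. The function name <em>authorize_own_key</em> is hypothetical, and the key is assumed to be <strong>id_rsa.pub</strong>.</p>

```shell
# Hypothetical helper: append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys
# unless that exact key line is already present.
# Takes an optional home directory; defaults to $HOME.
authorize_own_key() {
  local home_dir="${1:-$HOME}"
  local ssh_dir="${home_dir}/.ssh"
  local pubkey="${ssh_dir}/id_rsa.pub"
  local auth="${ssh_dir}/authorized_keys"

  [ -f "${pubkey}" ] || { echo "no public key at ${pubkey}"; return 1; }

  touch "${auth}"
  # -f: read the pattern from the key file, -x: whole-line match,
  # -F: fixed string, -q: quiet
  if ! grep -qxF -f "${pubkey}" "${auth}"; then
    cat "${pubkey}" >> "${auth}"
  fi
  # sshd typically ignores key files that others can write to
  chmod 700 "${ssh_dir}"
  chmod 600 "${auth}"
}
```

<p>Run <em>authorize_own_key</em> once on the login node; running it again is harmless because the key is only appended when it is missing.</p>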
<p><strong>For unrelated (non-cluster) hosts:</strong> <em>Linux users</em>, use the <em>ssh-copy-id</em> command, included as part of your ssh distribution. (<em>Mac users</em> will have to do it manually, described just below.) <em>ssh-copy-id</em> does all the copying in one shot, using your <strong>~/.ssh/id_rsa.pub</strong> key (by default; use the -i option to specify another identity file, say <strong>~/.ssh/id_dsa.pub</strong> if you&#8217;re using DSA keys).</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">ssh-copy-id  your_bduc_login@bduc.nacs.uci.edu
# you'll have to enter your password one last time to get it there.</pre>
</td>
</tr>
</table>
<p>What this does is scp <strong>id_rsa.pub</strong> to the remote host (the ssh server you&#8217;re trying to log into) and append that key to the remote file <strong>~/.ssh/authorized_keys</strong>.  If things don&#8217;t work, check that the <strong>id_rsa.pub</strong> file has been appended correctly.</p>
<p>Then verify that it&#8217;s worked by ssh&#8217;ing to BDUC.  You shouldn&#8217;t have to enter a password anymore.</p>
<p><strong>For Mac users</strong>, scp your public key to the remote host and append it to the remote <strong>~/.ssh/authorized_keys</strong>.  The commands are below.  Just modify the UCINETID value and mouse them into the <strong>Terminal</strong> window on your local Mac.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">bash  # starts the bash shell just to make sure the rest of the commands work
cd    # makes sure you're in your local home dir
export UCINETID=""  # fill in the empty quotes with *your UCINETID*

# (you'll need to enter the password manually for the next 2 commands)

scp ~/.ssh/id_rsa.pub ${UCINETID}@bduc-login.nacs.uci.edu:~/.ssh/id_rsa.pub
ssh ${UCINETID}@bduc-login.nacs.uci.edu 'cat ~/.ssh/id_rsa.pub &gt;&gt; ~/.ssh/authorized_keys'

# and now you should be able to ssh in without a password
ssh ${UCINETID}@bduc-login.nacs.uci.edu</pre>
</td>
</tr>
</table>
<table bgcolor="#ffffee" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<p style="margin-top:0;"><b>First time challenge from ssh</b></p>
<p>If this is the 1st time you&#8217;re connecting to BDUC from your Mac (or PC), you&#8217;ll get a challenge like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">The authenticity of host 'bduc-login.nacs.uci.edu (128.200.15.20)' can't be established.
RSA key fingerprint is 57:70:23:8e:e1:15:8c:51:b0:52:ca:c7:a8:e9:26:9b.
Are you sure you want to continue connecting (yes/no)?</pre>
</td>
</tr>
</table>
<p>and you have to type <em>yes</em>.</p>
<p>For MPI / Parallel users, you should set up a local <strong>~/.ssh/config</strong> file to tell ssh to ignore such requests.  The file should contain:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">Host *
   StrictHostKeyChecking no</pre>
</td>
</tr>
</table>
<p>and must be chmod&#8217;ed to be readable only by you.  ie</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">chmod go-rw ~/.ssh/config</pre>
</td>
</tr>
</table>
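<p>The two steps above (write the stanza, then restrict the permissions) can be combined into one repeatable step. The sketch below assumes bash; the function name <em>ensure_ssh_stanza</em> is hypothetical. It leaves an existing <em>StrictHostKeyChecking</em> setting alone rather than adding a second one.</p>

```shell
# Hypothetical helper: make sure ~/.ssh/config contains the
# 'StrictHostKeyChecking no' stanza and is readable only by its owner.
# Takes an optional home directory; defaults to $HOME.
ensure_ssh_stanza() {
  local home_dir="${1:-$HOME}"
  local cfg="${home_dir}/.ssh/config"

  mkdir -p "${home_dir}/.ssh"
  touch "${cfg}"
  # append the stanza only if no StrictHostKeyChecking line exists yet
  if ! grep -q '^[[:space:]]*StrictHostKeyChecking' "${cfg}"; then
    printf 'Host *\n   StrictHostKeyChecking no\n' >> "${cfg}"
  fi
  chmod 600 "${cfg}"
}
```

<p>Calling <em>ensure_ssh_stanza</em> with no argument acts on your own home directory.</p>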
</td>
</tr>
</table>
<h3><a name="matlabcompiler"></a>Notes on using the MATLAB compiler on the BDUC cluster</h3>
<p>(Thanks to <em>Michael Vershinin</em> and <em>Fan Wang</em> for their help and patience in debugging this procedure).</p>
<p>As noted, the official docs for compiling your MATLAB code are <a href="http://tinyurl.com/nebw3e">described here</a>.  Before you start hurling your <em>.m</em> code at the compiler, please read the following for some hints.</p>
<p>The following is a simple case where all the MATLAB code is in a single file, say <em>test.m</em>. Note that for the easiest path, you should write your MATLAB code to compile as a function. This means that the keyword <em>function</em> has to be used to define the MATLAB code (<a href="#matlab_compile_example">see example below</a>). If you want to pass parameters to the function, you have to include a function parameter indicating this.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;"># Before you use any MATLAB utilities, you will have to load the
# MATLAB environment via the 'module' command

module load matlab/R2009b

# for a C file dependency, you compile it with 'mex'.  Note that mex doesn't like
# C++ style comments (//), so you'll have to change them to the C style /* comment */

mex some_C_code.c    # -&gt; produces 'some_C_code.mexa64'

# then compile the MATLAB code for a standalone application.
# (type mcc -? for all mcc options)

# If the m-code has a C file dependency which has already been mex-compiled,
# mcc will detect the requirement and link the '.mexa64' file automatically.

mcc -m test.m  # -&gt; 'test'  (can take a minute or more)

# !! if you have additional files that are dependencies, you may have to define
# !! them via the '-I /path/to/dir' flags to describe the dirs where your
# !! additional m code resides.

# for a _C_ shared lib (named libmymatlib.so) with multiple input .m files

mcc -B csharedlib:libmymatlib file1.m file2.m file3.m

# for a _C++_ shared lib (named libmymatlib.so) with multiple input .m files

mcc -B cpplib:libmymatlib file1.m file2.m file3.m</pre>
</td>
</tr>
</table>
<p>In the <em>standalone</em> case which will probably be the most popular approach on BDUC, the mcc compilation will generate a number of files:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">readme.txt  ...............  autogen'd description of the process
test   ....................  the 'semi-executable'
test.m  ...................  original 'm code'
test_main.c  ..............  C code wrapper for the converted m code
test_mcc_component_data.c .  m code translated into C code
run_test.sh  ..............  the script that wraps and runs the executable
test.prj  .................  XML description of the entire compilation
                               dependencies (Project file)</pre>
</td>
</tr>
</table>
<p>In order to now run the executable, you often can&#8217;t submit the auto-generated  <em>run_test.sh</em> directly in the SGE Q. You have to submit it wrapped in an SGE script that in turn calls the <em>run_test.sh</em> script, which sets up all the necessary environment variables and paths to run the executable.</p>
<p>You can test it for a few minutes like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">./run_test.sh [matlab_root] ./test

# where the [matlab_root] would be '/apps/matlab/r2009b' for the
# matlab version that supports the compiler</pre>
</td>
</tr>
</table>
<p>Note that if you have already loaded the MATLAB module, you can usually run the compiled executable alone from the commandline.</p>
<p>However, for long/production runs, you will have to create a bash script (call it <em>runmycode.sh</em>) like this:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">#!/bin/bash

#$ -S /bin/bash          # run with this shell

#$ -N comp_matlab_run    # this name shows in qstat
#$ -q long               # run in this Q
#$ -l h_rt=50:00:00      # need 50 hour runtime
#$ -l mem_free=2G        # need 2GB free RAM
#$ -l scr_free=1G        # need 1 GB scratch space
#$ -cwd            # run the job out of the current directory
                   # (the one from which you ran the script)

#$ -notify
#$ -M &lt;email&gt;      # send mail about this job to the given email address
#$ -m beas          # send a mail to owner when the job
#                       begins (b), ends (e), aborted (a),
#                       and suspended(s).

./run_test.sh  /apps/matlab/r2009b ./test</pre>
</td>
</tr>
</table>
<p>and qsub it to SGE:</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">qsub runmycode.sh</pre>
</td>
</tr>
</table>
<h4><a name="matlab_compile_example"></a>MATLAB Compilation Example</h4>
<p>Below is a very simple example showing how to compile and execute some MATLAB code. Save the following code to a file named <em>average.m</em>.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">function y = average(x)
% AVERAGE Mean of vector elements.
% AVERAGE(X) is the mean of vector, where X is a vector of
% elements. Nonvector input results in an error.
if ischar(x)
    x = str2num(x);        % a compiled executable receives its arguments as strings, e.g. '1:99'
end
[m,n] = size(x);
if (~((m == 1) | (n == 1)) | (m == 1 &amp; n == 1))
    error('Input must be a vector')
end
y = sum(x)/length(x);      % Actual computation
y</pre>
</td>
</tr>
</table>
<p>Once the code is saved as <em>average.m</em>, compile by copying and pasting into a terminal window.</p>
<table border="0" bgcolor="#e8e8e8" width="100%" style="margin:.2em 0;">
<tr>
<td style="padding:.5em;">
<pre style="margin:0;padding:0;">module load matlab/R2009b   # load the MATLAB environment
mcc -m average.m;           # compile the code (takes many seconds)
z="1:99"                    # store the range expression in a shell variable
./average "$z"              # call the executable; the argument arrives as the string '1:99'</pre>
</td>
</tr>
</table>
<p>Note also that if you&#8217;re going to run this under SGE as multiple instances, each instance will have to run with the appropriate MATLAB environment so you will have to preface each exec with the <em>module load matlab/R2009b</em> directive.</p>
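<p>One way to run multiple instances is an SGE array job, where each task loads the module itself before calling the compiled executable. The script below is a sketch, not a tested BDUC recipe: the job name, the task count in <em>-t 1-4</em>, and the <em>task_range</em> mapping from task number to input range are all assumptions made for illustration. <em>SGE_TASK_ID</em> is the variable SGE sets for each array task.</p>

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -N average_array
#$ -cwd
#$ -t 1-4

# Hypothetical mapping from task number to a MATLAB range string,
# e.g. task 3 gives "1:30". Replace with whatever your runs need.
task_range() {
  local task_id="${1:-${SGE_TASK_ID:-1}}"
  echo "1:$(( task_id * 10 ))"
}

# Only run when SGE has assigned a task number, i.e. inside the array job.
if [ -n "${SGE_TASK_ID:-}" ]; then
  # each instance needs the MATLAB environment, so load it per task
  module load matlab/R2009b
  ./average "$(task_range)"
fi
```

<p>Submitted with qsub, this would start four tasks, each computing the average over its own range.</p>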
<hr />
<h2><a name="_release_information_amp_latest_version"></a>Release information &amp; Latest version</h2>
<p>The latest version of this document should always be available <a href="http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.html">here</a>.  The <a href="http://www.methods.co.nz/asciidoc/">asciidoc</a> source is available <a href="http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.txt">here</a>.</p>
<p>This document is released under the <a href="http://www.gnu.org/licenses/fdl.txt">GNU Free Documentation License</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://hjmangalam.wordpress.com/2009/09/13/an-introbduction/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/255884f089123f544bb5e036ae3a89b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">hjmangalam</media:title>
		</media:content>

		<media:content url="http://hjmangalam.files.wordpress.com/2011/06/without_byobu_s.jpg" medium="image">
			<media:title type="html">without byobu</media:title>
		</media:content>

		<media:content url="http://hjmangalam.files.wordpress.com/2011/06/with_byobu_s.jpg" medium="image">
			<media:title type="html">with byobu</media:title>
		</media:content>
	</item>
	</channel>
</rss>
