Thursday, February 3, 2011

simple statistical analysis from shell level

hi,

i'm looking for a handy program or script that i can pipe data into via stdin and that prints some basic statistics about the input. for instance, given a set of values separated by newlines, i would like to get:

  • average for all values
  • average for data except 5% smallest and 5% largest values
  • standard deviation

yes - i know, can be done with bash or awk, but maybe you already know something handy?

ps.

i'm perfectly aware of 'big cannons' like Octave, R and some others, but i need something much simpler.

thanks

  • You could try something along the lines of:

    perl -e 'use List::Util qw(sum); chomp(my @values = <>); print sum(@values) / @values, "\n" if @values'
    

    to get the average. You could also install the Statistics::Descriptive package http://search.cpan.org/~colink/Statistics-Descriptive-2.6/Descriptive.pm

    to cover the other requirements. The standard deviation is probably easy; the trimmed average would take a few more lines to sort and filter (no doubt it's possible to do in a single line... ;-)

    pQd : thanks - yeah - i can script it too without problem, but i was wondering if there is some standardized tool...
  • This little AWK snippet will do part of what you're looking for:

    awk '{sum += $0; count++; vals[$0]++} END {mean = sum / count; print "Total:", sum; print "Mean:", mean; for (i in vals){ s += vals[i] * ((i - mean) ^ 2) }; print "Standard Dev:", sqrt(s/count)}' datafile
    

    The drop 5% part would be a little more complicated and depend on exactly how you mean it.
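    As a rough sketch of one possible reading of the "drop 5%" requirement: sort the values, discard the lowest and highest 5% (rounded down to a whole number of values), and average the rest. The function name `trimmed_mean` and the `pct` variable are made up for illustration, and the input is assumed to be one number per line with no blank lines.

```shell
# trimmed_mean: hypothetical helper, reads one number per line on stdin.
# Assumption: "except 5%" means dropping floor(pct% of n) values from EACH end.
trimmed_mean() {
    sort -n | awk -v pct=5 '
        { v[NR] = $1 }                      # input arrives sorted; keep it all in memory
        END {
            if (NR == 0) exit 1             # nothing to average
            k = int(NR * pct / 100)         # number of values to drop at each end
            for (i = k + 1; i <= NR - k; i++) { sum += v[i]; n++ }
            print sum / n
        }'
}
```

    For example, `seq 1 20 | trimmed_mean` drops 1 and 20 and prints 10.5. Note that with fewer than 20 values nothing is dropped under this rounding rule; a different interpretation of the 5% would need a different `k`.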

    I know you're looking for something canned, but short of using R, Octave, SAS or SPSS, I don't know of anything.

    Edit: Corrected formula

    pQd : thanks.. point is - to pipe from the command line daily/monthly performance logs grepped by custom criteria and get result on the spot.
    Dennis Williamson : @pQd: The AWK snippet will do that, as you know, and I imagine that R and Octave could be made to do it, too. I'm not so sure about the others (plus they're not cheap).
    Slartibartfast : You're using a very odd definition of Std. Deviation. I don't think that it is the square root of the mean.
    Dennis Williamson : @slartibartfast: Well I'm no statistician, but I believe that's correct for population std. dev.
    Dennis Williamson : @slartibartfast: Oops, you were right, I left out part of it. I'll post a correction.
  • R might be exactly what you are looking for, or it may be total overkill for your purpose. Hard to tell from your question.

    Anyway, check it out: http://en.wikipedia.org/wiki/R_(programming_language)

    pQd : thx, although i vote here for 'overkill' option :)
    Marcin : "overkill is underrated" (from the new A-Team movie)
  • The first and last items are do-able (I've done them a couple of times) without maintaining the entire set of data in memory and without knowing the total number of items in advance. The middle item (dropping the outliers) is more challenging and requires maintaining the entire list in RAM or at least knowing the total number of items in advance.

    I don't know of any simple pre-built tools to do any of those (though Octave and R sound as though they might be such).
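    To illustrate the streaming approach for the first and last items, here is a minimal sketch using Welford's running update, which is one common way to do it and not necessarily the method this answer had in mind. The function name `stream_stats` is made up, the input is assumed to be one number per line, and the standard deviation is the population form (dividing by n).

```shell
# stream_stats: hypothetical helper, reads one number per line on stdin.
# Welford's update keeps only n, mean and m2, so memory use does not grow
# with the input and the count need not be known in advance.
stream_stats() {
    awk '
        {
            n++
            delta = $1 - mean
            mean += delta / n               # running mean
            m2 += delta * ($1 - mean)       # running sum of squared deviations
        }
        END {
            if (n == 0) exit 1              # no input
            printf "mean %g\n", mean
            printf "stddev %g\n", sqrt(m2 / n)   # population standard deviation
        }'
}
```

    For example, `printf '%s\n' 2 4 4 4 5 5 7 9 | stream_stats` prints `mean 5` and `stddev 2`.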

    pQd : i know that. thanks..
