  From: Borries Demeler <demeler@bioc02.uthscsa.edu>
  To  : rasmb@bbri.harvard.edu
  Date: Wed, 9 Apr 1997 10:25:24 -0500 (CDT)

Data storage format (fwd)

> Dear Rasmbers (and particularly the software gurus),

Allow me to second Jo's comments. I already raised the same points in a private
message to John Philo, and I would now like to raise them with the general
RASMB audience.

I believe that Beckman should stick with the previous method of recording data
in ASCII format, and for me the biggest reason (besides the ones raised by Jo)
is hardware/software portability.

Some of us may prefer to analyze data on a non-PC machine (Macintosh, DEC/VAX,
various UNIX flavors, etc.), and we may want to use different software to
analyze the data. Binary data can present complications, since not all hardware
and software use the same internal data representations (byte order,
floating-point layout, and so on); ASCII text is the portable exception.
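
To make the portability point concrete, here is a minimal C sketch (my own
illustration, not anything specific to Beckman's files or to any existing
package) of one such incompatibility, byte order: the same integer is laid out
differently in memory on little-endian and big-endian machines, while its
ASCII representation is identical on both.

#include <stdio.h>

int main(void)
{
    unsigned int value = 0x01020304;                  /* same value everywhere */
    unsigned char *bytes = (unsigned char *) &value;

    /* Little-endian machines (e.g. a PC) print 04 03 02 01 here, while
     * big-endian machines (e.g. many UNIX workstations) print 01 02 03 04,
     * so a binary file written on one cannot be read naively on the other. */
    printf("in-memory byte order: %02x %02x %02x %02x\n",
           bytes[0], bytes[1], bytes[2], bytes[3]);

    /* The ASCII representation is the same on both architectures. */
    printf("ascii representation: %u\n", value);
    return 0;
}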

As a long-time software developer, I solved this problem from the beginning by
including an editing routine in my software package (UltraScan) that
automatically creates a binary copy of the data the first time you analyze a
particular dataset. This binary copy is used from then on by the various
analysis methods, which benefit from the much shorter loading time of the
binary dataset.
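
For anyone curious what such a caching routine might look like, here is a
minimal sketch in C. This is not the actual UltraScan code; the two-column
scan layout, the file names, and the array size are assumptions made purely
for illustration.

#include <stdio.h>
#include <stdlib.h>

#define MAX_POINTS 2000

/* Parse an ASCII scan file of "radius  absorbance" pairs (assumed layout). */
static long read_ascii_scan(const char *path, double *r, double *a)
{
    FILE *fp = fopen(path, "r");
    long n = 0;
    if (!fp) return -1;
    while (n < MAX_POINTS && fscanf(fp, "%lf %lf", &r[n], &a[n]) == 2)
        n++;
    fclose(fp);
    return n;
}

/* Load a scan, using a binary copy created on the first access. */
long load_scan(const char *ascii_path, const char *bin_path,
               double *r, double *a)
{
    FILE *fp = fopen(bin_path, "rb");
    long n;

    if (fp) {                                  /* binary copy exists: fast path */
        fread(&n, sizeof n, 1, fp);
        fread(r, sizeof *r, n, fp);
        fread(a, sizeof *a, n, fp);
        fclose(fp);
        return n;
    }

    n = read_ascii_scan(ascii_path, r, a);     /* first access: parse the ASCII */
    if (n < 0) return -1;

    fp = fopen(bin_path, "wb");                /* write the binary copy used by */
    if (fp) {                                  /* all subsequent analyses       */
        fwrite(&n, sizeof n, 1, fp);
        fwrite(r, sizeof *r, n, fp);
        fwrite(a, sizeof *a, n, fp);
        fclose(fp);
    }
    return n;
}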

I believe all software developers could use such an approach to get around the
slowness of reading ASCII data. However, I think the question

"Do we software developers want to create a single binary data format that can
be shared by all of our various analysis methods?"

should be discussed among ourselves and isn't really worth taking up in this
wider forum.

Lastly, let me remind you all that in my experience it has always been a slow
process to get Beckman to change anything in their software or hardware design,
and it is much quicker to handle such matters locally in my own software. There
have been enough changes in file formats already; let's keep the ASCII format
as it is right now. I think it is a good compromise for everyone, and the
slowness in reading the data is bound to be alleviated by the faster computers
that keep emerging so quickly.

Finally, how long it takes to read a set of data files is also a question of
programming language, compiler, optimization level and algorithm design. For
example, I found big differences between different ways of reading the files in
C and FORTRAN. The best algorithm I found lets me read 50 raw data scans of
approximately 800 data points each in a little less than 4 seconds (on a 66 MHz
486 with a 500 MB Western Digital hard drive, using C code compiled with gcc
under Linux at optimization level -O3). On the same system, it takes about 12
seconds to read the same files with a FORTRAN routine compiled with an old
Microsoft FORTRAN compiler under MS-DOS. I am sure I could come up with even
larger differences if I looked.
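
To show concretely where such algorithmic differences can come from, here is
one fast approach sketched in C; this is only an illustration of the point,
not the routine that produced the timings above. It reads the whole file into
memory in a single call and parses it with strtod(), rather than issuing one
formatted read per value.

#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into memory and parse every number it contains.
 * Returns the number of values parsed, or -1 on error. */
long parse_all_numbers(const char *path, double *out, long max)
{
    FILE *fp = fopen(path, "rb");
    long size, n = 0;
    char *buf, *p, *end;

    if (!fp) return -1;
    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    rewind(fp);

    buf = malloc(size + 1);
    if (!buf) { fclose(fp); return -1; }
    if (fread(buf, 1, size, fp) != (size_t) size) {
        free(buf); fclose(fp); return -1;
    }
    buf[size] = '\0';
    fclose(fp);

    /* strtod() skips leading whitespace, so a single pass over the buffer
     * picks up every numeric field regardless of line layout.  This assumes
     * a purely numeric file; a stray non-numeric token stops the scan. */
    p = buf;
    while (n < max) {
        double v = strtod(p, &end);
        if (end == p) break;                   /* no more numbers */
        out[n++] = v;
        p = end;
    }
    free(buf);
    return n;
}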

So much for my $0.02 worth...

Regards to all, -Borries
*******************************************************************************
* Borries Demeler                                                             *
* The University of Texas Health Science Center at San Antonio                *
* Dept. of Biochemistry, 7703 Floyd Curl Drive, San Antonio, Texas 78284-7760 *
* Voice: (210) 567-6592 Fax: (210) 567-6595 Email: demeler@bioc02.uthscsa.edu *
*******************************************************************************
