Exercise 5: Data IO



Introduction:

Data input and output (IO) represents a very small fraction of most programs, but the effect of IO is very important for the execution of the program. IO is the mean to get a large or small amount of data channeled in to the program and after doing whatever required with these data, store the final result on disk or present them on the display. We have already seen a few ways to do this, using the read and write statements in connection with getting information from standard input and sending data both to standard output and data files. When the program needs to communicate larger amounts of data, it is more convenient to let the IO go directly to files by reading and writing data from the file system. You have already seen this done in some of the previous exercises. Until now this part of the program has been handed to you as a "black box". This time we are going to work with this in some detail, making you able to decide which type of data you require to read/write and how to handle this in the code. For this purpose there are a number of examples that illustrate the different types of IO on simple data sets. Finally there will be a series of dataset which you will have to analyze making a Fourier decomposition to find the typical frequencies present in the signal and compare timings between two different methods.



Getting the required files:

Download the tar file to your ~/Dat_F directory and unpack it:

$ cd ~/Dat_F
$ tar -zxvf fortran5.tgz
$ rm fortran5.tgz


This creates a new directory "Experiment_5" that contains the files needed for the exercises.



Simple IO examples:

There are a few minor assignments in this exercise. They represent different aspects of data formats in disk files and the way these are accessed.

Until now you have dealt with data files written in ASCII format. This format has the advantage that it can be read using a normal text editor. This allows us to check the data before they are read into the program or after they have been created by the program. There are two different way to perform the data IO on ASCII data. The simplest is to use the free format, where the program at runtime decides how the data are read or written according to the definition of the variables involved in the process. The second method is formated IO, you have seen this in one case, namely in the program orbit in the previous exercise were the data output was required to have a specific format for gnuplot to read it. The advantage with the formated IO is that the user can specify exactly the way the data is handled. For doing this one must specify the format of each of the variables that has to be read from or written to a disk file. For instance, we want all floating point number to be represented with only one decimals accuracy and write with a minimum of, say, two blanks between each number. In the following we are going to look at both methods.

ASCII unformatted IO:
This is the simplest IO format as we just have to specify the variables to be handled and the program itself performs the IO accordingly. Following this there are a few minor tasks.
In another case it may be an advantage to instead write the arrays one after the other in big blocks.
Compare the data structure of the two files

$ head file1 file2

What is the difference in the number of lines between the two files? Credits: 2/-1


As you can see, it is not possible to use the second file directly as input to gnuplot. The use of the two different ways of writing the same data to a file depend on how the data are to be subsequently used. You may also have notices that in the second file, the numbers are written in 5 columns. This gives a problem, namely if we in the first case wanted to write more than 5 real number on a single line --- try doing it to see what happens! You may already have guessed.

ASCII formatted IO:

Formated IO provides a possibility to write data in exactly the way we want it. All it requires is an additional format statement that determines the structure of the format:
......
open(10,file='data.dat')
write(10,100), 'x=',x(i), f1(i),f2(i)
100 format(2x,a2,2x,f8.3,2x,e10.3,2x,g9.3)
......

In the format statement the different entries have different effects:
There are further formats to be used in connection with integer and complex variables, they are all described in the book.

Make a new program to handle formatted data IO. Use the few lines from above for writing the same data as in the previous program. Try to play with the numbers for the different formats and compare the output.

Can you get it to show strange characters in the data file? Yes
No
Credits: 2/-1

When does fortran print strange characters in the data file? Credits: 2/-1


Notice, when reading this type of data from a file the program will all the time have to start from the top of the file to find exactly the dataset you need. This is inconvenient when dealing with large datasets.

Binary IO:

There are two disadvantages with the ASCII data format, the first was mentioned just above and the second is that ASCII files take up more disk space than needed. The ASCII format is therefor fine for small datasets, but as the amount of data increases it is much more convenient to use a binary format. This is the format the computer likes to write numbers in and it is therefore much faster to read and write. To access a random line of data from the ASCII file, you have to read the file from the beginning and all the way down to the entry you need to access. In binary data files you have this possibility, but it comes with a different down side --- you have to specify the size of the data blocks you want to read/write with. Therefore, it requires the data to have a minimum of uniformity.

In the ASCII case you could simply open a file and start reading/writing from/to it, simply by specifying the file name and a unit number. For binary files more information has to be provided.

open(10,file='test.dat',status='old',access='direct',form='unformatted',recl=1*nwords,err=100)
read(10,rec=1) a
Write a small program that takes the same data as in the previous cases and write it in binary format to a single file. As earlier, there are two ways to do this. Either in the simple format that gnuplot reads, or as single blocks of data one after the other. You will have to do both here! --- Notice that nwords is different between the two cases!

How much smaller is the binary data file written with 3 large blocks than the same ASCII file? Credits: 2/-1

As you can see from the expression read(10,rec=1), then the presence of the keyword rec, makes it possible to both read and write to and from a specific data block in the file without having to start from the top of the file each time. This is a significant advantage when one wants to access a particular data set in a large file and you already know which data record contains the data.

To see that the unformatted binary data can be used to more complicated data sets, the next task is to write a program that converts the ASCII data in "catalog.txt" to binary format. The data here represents information about a selection of starts. The numbers represent the stars position on in the sky, their distance, their change in position per year, a measure of their visual magnitude and an indication of their surface temperature, but none of that is important for this task. What is important is how much space needs to be allocated to each variable.

After converting the data to binary format, make a program that can extract a single entry from the binary file and print it to standard output. Here you should use formated output to make a consistent appearance of the output.

What is the 5th number in the 219th star entry in catalog2.bin? Credits: 2/-1


Namelist:

The final example on simple IO represents the usage of namelists. You only have to write a very short program that handles the requirements given here:

Here is my working namelist program Credits: 2/-1





Measurements of the gravity constant g:

In the basement of the Rockefeller building the geologist have an instrument that is designed to measure changes in gravity as a function of time. Fluctuations in g occurs for many different reasons. For one, the slowly orbiting moon provides one of the major impacts on the variations. For this example, and many other cases, the effects are found to be periodic or nearly periodic. To identify the nature of the variations and their impact, it is required to identify the characteristic periods and amplitudes in the data. The rest of this exercise will be concerned with this problem, letting you write a small program to identify the characteristic periods and amplitudes in a dataset.

To analyze the data, extracting various interesting time variations, we will make use of Fourier transforms. As you hopefully know from a previous course, there is a correspondence between time series measured in the time domain, h(t), and in the frequency domain, H(f). This correspondence, for continues functions, is expressed as follows;

(1)
and
   (2)

where t is in seconds and f is in cycles per second. In a measured dataset the function h(t) is not continues, but represents the value of the function at discreet and here uniform time intervals, . The function in time is then represented bywhere n goes from 0 to N. With such a representation there is a maximum frequency that can be represented, namely , representing the Nyquist frequency. This implies that the shortest wavelength is represented by only two data points. For frequencies higher than this there is no unambiguous representation. Instead these will be seen as waves with longer wavelength and therefore be responsible for giving a wrong Fourier transform. This phenomena is knows as aliasing and is shown in the image below:



This image also shows how different samplings of a sinusoidal function looks, and how a too high frequency wave gives rise to a false low frequency wave representation. This implies that if the data are limited in frequency such that highest frequency is less than the Nyquist frequency, then the discreet Fourier representation is exact. When this is not the case, then additional power to various frequencies are obtained through the aliasing seen above and a wrong answer is obtained. From a frequency point of view, these are the "allowed" frequencies that can be obtained:
, where n goes from -N/2 to N/2.

Under these conditions it is possible to rewrite the expressions in eq.(1) and eq.(2) as

(3)
and
(4)
Here we are interested in eq.(4). From this it is seen that to get the full information of the frequency, we have to make N² operations, N operations for each frequency. This is rather expensive and more elaborate methods have been developed. These take into account symmetric and scaling relations for the Fourier transform and constitutes the basis for the Fast Fourier Transform (FFT). It is not the purpose of this course to go into details with this issue, but it has to be mentioned that the scaling of FFT only requires Nlog(N) operations (See for instance Numerical recipes chapter 12 for a more detailed description of FFTs and their implementation) which makes FFTs much faster then FT on large data sets.

In this part of the exercise, you have to write a code that:


The figure shows the amplitudes of the wavelengths that are included in the data.dat representation. Plotting the full data set, you will only see a single value different from zero, namely entry 513 that represents the mean value of g. This plot shows that only three different wavelengths are represented in the data set.

Use your program and data.dat to reproduce the graph above.

I can reproduce the plot! and have called it ft.png Credits: 2/-1

For large data set it can take quit some time to compute the Fourier transform. Just to show you that FFT is much faster, and provides you with the same result, you are going to change the previous simple FT method to the FFT method given in Numerical recipes. It will not take long as it is all prepared for in the Makefile. There are a few issues that have to be changed relative your previous program:

Compare the time between the FT and the FFT version by using the Unix command time.

$ time a.out < in.dat

where in.dat contains the input information to the program you would otherwise type in by hand (use man to check the information time provides).

To easily see the difference, there are a number of data files in the directory grav. They are called "dat.*.fft" where * goes from 0-10. The size of the data increases with decreasing number. Try to time both the ft and the fft version of the code using these data, starting with the small data sets. As you will very soon realise, the ft version takes to long time, so DO NOT set it running for more than a few minutes before you stop increasing the data size... Save the effective run times for the two methods as a function of N (first entry in the files) and plot the result with gnuplot, call the image timing.png (use log scales). To test that the scaling I hinted earlier is not totally out of the blue, include two well fitting lines that represernts the scaling of the computation time for the two cases.

The timing plot is here! Credits: 2/-1


$Id: index.php,v 1.17 2009/10/06 07:29:58 kg Exp $ kg@astro.ku.dk