For Programmers: Free Programming Magazines  


Home > Archive > Unix Programming > September 2004 > Q) Accessing records in files









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Q) Accessing records in files
Nimmi Srivastav

2004-09-21, 4:00 am

I am trying to understand how records may be retrieved from a file,
based on some key. From the literature that I have gathered, it
appears to me that almost all approaches (certainly hashed files and
indexed files, not sure about B-trees) work on the following principle
(paraphrased from a book):
The operating system divides secondary memory into equal-sized blocks.
A file is stored as a linked list of blocks. Through some efficient
mechanism locate the address of block that has the record, and then
sequentially search the block for the record with the given index.

Am I on the right track so far?

If I am on the right track, my question is that when I use
record-based access, why do I have to be worried about the physical
address of the block, whereas for stream I/O model I am only worried
about file pointer location?

Some follow up questions: Is a file stored as a linked list of blocks
in secondary memory, or is a file actually stored in contiguous
regions of secondary memory? How are calls like fread() and fwrite()
able to work correctly across block boundaries? Is the "block list"
information maintained in the file itself? How do implementations of
hash files and index files obtain the block pointer information that
the books seem to talk about?

Regards,
Nimmi


[My apologies if this is an off-topic posting]
Sean Burke

2004-09-21, 4:00 am


nimmi_srivastav@yahoo.com (Nimmi Srivastav) writes:

> I am trying to understand how records may be retrieved from a file,
> based on some key. From the literature that I have gathered, it
> appears to me that almost all approaches (certainly hashed files and
> indexed files, not sure about B-trees) work on the following principle
> (paraphrased from a book):
> The operating system divides secondary memory into equal-sized blocks.
> A file is stored as a linked list of blocks. Through some efficient
> mechanism locate the address of block that has the record, and then
> sequentially search the block for the record with the given index.
>
> Am I on the right track so far?


I think that you may be confusing structure at two different
layers of abstraction. Unix file systems generally arrange
data in equal-sized blocks that are linked together in lists
or trees or variations on these. This is dependent on the file
system, and different file systems (FAT, NTFS, UFS, ext2, etc.)
use completely different arrangements, as long as they allow
the application to see the file as a simple array of bytes.

Database applications may arrange records in files using B-trees
or lists or hashes or other arrangements, but this structure is
in addition to, and layered on top of, the block structure that
the file system uses.

It doesn't have to be this way. Databases can keep their records
in "raw partitions". Raw partitions still have blocks, but that
is because disks are read from and written to in units of blocks.

Contrariwise, file systems can offer more options to application
code than the simple array of bytes. On VAX/VMS, for example,
the operating systems offers various record-structured file types
that applications can take advantage of.

> If I am on the right track, my question is that when I use
> record-based access, why do I have to be worried about the physical
> address of the block,


When you do an ls(), you are not providing the "physical address"
of a block. You are only specifying an offset from the beginning of
the file. It is up to the file system to translate that to a disk
block.

> whereas for stream I/O model I am only worried
> about file pointer location?


Because streams maintain an internal offset that is updated as
you read from them.

> Some follow up questions: Is a file stored as a linked list of blocks
> in secondary memory, or is a file actually stored in contiguous
> regions of secondary memory?


That is hidden from application code. Each file system has its
own methods for doing this.

> How are calls like fread() and fwrite()
> able to work correctly across block boundaries?


That is a problem for the kernel and file system.

> Is the "block list"
> information maintained in the file itself?


No.

> How do implementations of
> hash files and index files obtain the block pointer information that
> the books seem to talk about?


They don't. They just specify offsets to ls(), pread() and
pwrite(). The file system code, which executes within the kernel,
uses the block list data to map the offset to a disk block.

-SEan



SM Ryan

2004-09-21, 9:04 am

nimmi_srivastav@yahoo.com (Nimmi Srivastav) wrote:
# I am trying to understand how records may be retrieved from a file,
# based on some key. From the literature that I have gathered, it

The only view the kernel supports is a flexible array of bytes. Anything
beyond that is defined by user libraries. For example, the standard C
library supports bother variable length records terminated by a line feed
and fixed length records. There are other libraries that provide B-trees,
hash index, etc.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Title does not dictate behaviour.
Jean-David Beyer

2004-09-21, 3:58 pm

Nimmi Srivastav wrote:
> I am trying to understand how records may be retrieved from a file,
> based on some key.


In UNIX and Linux, to a programmer, a file is just a long string of bytes.
At the kernel level, things can be a bit more complex, depending on
whether you are really using normal file systems or raw file systems.

The concept of "records" does not really exist in the UNIX-Linux file
system, though records may be introduced at a higher level. One (lousy)
approach is to assume all records are fixed length. Then you can use
ls(record-number * record-size) to find the record.

Normally, records tend to be of variable length, so that approach does not
work well. One approach that does work, if you know the statistics of the
record sizes, is to do the IO in terms of fixed size pages, and write a
number of records on a page. Then use divide the record number by the
number of records/page to get the page number (quotient) and use the
remainder to get the item on the page (perhaps with a small table of
contents on each page).

If you can use raw file systems, you can divide the space up as you see
fit. Records can be contiguous, so no rummaging around in i-nodes and
trees is required to find a record once you know its record number.

Now if you do not know the record number, but have keys (e.g., primary
keys in a dbms context), you need some indexing scheme to convert keys to
record numbers. There are many options, sorted indices, b-trees, hash
tables, superimposed codeword indices, etc. Which one(s) you use depend a
lot on the application.

> From the literature that I have gathered, it
> appears to me that almost all approaches (certainly hashed files and
> indexed files, not sure about B-trees) work on the following principle
> (paraphrased from a book):
> The operating system divides secondary memory into equal-sized blocks.
> A file is stored as a linked list of blocks.


One of DEC's earlier operating systems for the PDP-11 did that. It stinks
because if one record goes bad, the rest of the file is lost, and it is
difficult to even delete the bad file. It is really a disaster.
Furthermore, if all you want to do is append a few bytes, you must
sequentially search all the way through the file to find the end. This was
tolerable, perhaps, when files could not exceed 65536 bytes in length, but
not today, where they sometimes exceed a Gigabyte.

> Through some efficient
> mechanism locate the address of block that has the record, and then
> sequentially search the block for the record with the given index.
>
> Am I on the right track so far?


Yes, for 1965 or so.
>
> If I am on the right track, my question is that when I use
> record-based access, why do I have to be worried about the physical
> address of the block, whereas for stream I/O model I am only worried
> about file pointer location?


It depends on what you are doing. For a student project, for example,
running on Linux or UNIX, assume the file is just a long sequence of bytes
and do not worry about the underlying implementation. That is the job of
the OS: to protect you from needing to know that. It permits the OS to
change the implementation and your application need know nothing about it.
It will just keep working.
>
> Some follow up questions: Is a file stored as a linked list of blocks
> in secondary memory,


Maybe Microsoft still do that, but I doubt anyone else does.

> or is a file actually stored in contiguous
> regions of secondary memory?


Let me say that in normal UNIX-Linux file systems, the file tends to be
stored in contiguous extents (several blocks), but that there can be many
extents that might be used.

> How are calls like fread() and fwrite()
> able to work correctly across block boundaries? Is the "block list"
> information maintained in the file itself? How do implementations of
> hash files and index files obtain the block pointer information that
> the books seem to talk about?
>
> Regards,
> Nimmi
>
>
> [My apologies if this is an off-topic posting]



--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org
^^-^^ 10:15:01 up 2 days, 6:40, 3 users, load average: 4.36, 4.26, 4.20

Shmuel (Seymour J.) Metz

2004-09-23, 3:58 pm

In <8b0c42d.0409202104.5e6e3fe6@posting.google.com>, on 09/20/2004
at 10:04 PM, nimmi_srivastav@yahoo.com (Nimmi Srivastav) said:

>I am trying to understand how records may be retrieved from a file,
>based on some key.


There are many different libraries for indexed file. You have to write
calls in accordance with the API of the library that you are using.

>The operating system divides secondary memory into equal-sized
>blocks. A file is stored as a linked list of blocks. Through some
>efficient mechanism locate the address of block that has the record,
>and then sequentially search the block for the record with the given
>index.


Yes. But if you use an existing library then you don't need to deal
with that level.

>If I am on the right track, my question is that when I use
>record-based access, why do I have to be worried about the physical
>address of the block,


You don't. You only need to deal with physical addresses if you are
doing low level I/O. If you are using an existing library, it does
everything for you. Even if it bypasses the file system, that's not
something that you need to deal with directly.

>Some follow up questions: Is a file stored as a linked list of
>blocks in secondary memory, or is a file actually stored in
>contiguous regions of secondary memory?


That depends on the file system. Except for a temporary file system,
you must store the data on DASD. Whether the blocks are linked or are
pointed to by something else is an implementation detail that need
not be the same for two different file systems.

>How do implementations of hash files and index files obtain the block
>pointer information that the books seem to talk about?


They generally use standard file system calls and let the file system
translate the offsets into addresses under the covers. Data base
packages with efficiency concerns might use other approaches.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spamtrap@library.lspace.org

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com