Code Comments
Programming Forum and web based access to our favorite programming groups.I am trying to understand how records may be retrieved from a file, based on some key. From the literature that I have gathered, it appears to me that almost all approaches (certainly hashed files and indexed files, not sure about B-trees) work on the following principle (paraphrased from a book): The operating system divides secondary memory into equal-sized blocks. A file is stored as a linked list of blocks. Through some efficient mechanism locate the address of block that has the record, and then sequentially search the block for the record with the given index. Am I on the right track so far? If I am on the right track, my question is that when I use record-based access, why do I have to be worried about the physical address of the block, whereas for stream I/O model I am only worried about file pointer location? Some follow up questions: Is a file stored as a linked list of blocks in secondary memory, or is a file actually stored in contiguous regions of secondary memory? How are calls like fread() and fwrite() able to work correctly across block boundaries? Is the "block list" information maintained in the file itself? How do implementations of hash files and index files obtain the block pointer information that the books seem to talk about? Regards, Nimmi [My apologies if this is an off-topic posting]
Post Follow-up to this messagenimmi_srivastav@yahoo.com (Nimmi Srivastav) writes: > I am trying to understand how records may be retrieved from a file, > based on some key. From the literature that I have gathered, it > appears to me that almost all approaches (certainly hashed files and > indexed files, not sure about B-trees) work on the following principle > (paraphrased from a book): > The operating system divides secondary memory into equal-sized blocks. > A file is stored as a linked list of blocks. Through some efficient > mechanism locate the address of block that has the record, and then > sequentially search the block for the record with the given index. > > Am I on the right track so far? I think that you may be confusing structure at two different layers of abstraction. Unix file systems generally arrange data in equal-sized blocks that are linked together in lists or trees or variations on these. This is dependent on the file system, and different file systems (FAT, NTFS, UFS, ext2, etc.) use completely different arrangements, as long as they allow the application to see the file as a simple array of bytes. Database applications may arrange records in files using B-trees or lists or hashes or other arrangements, but this structure is in addition to, and layered on top of, the block structure that the file system uses. It doesn't have to be this way. Databases can keep their records in "raw partitions". Raw partitions still have blocks, but that is because disks are read from and written to in units of blocks. Contrariwise, file systems can offer more options to application code than the simple array of bytes. On VAX/VMS, for example, the operating systems offers various record-structured file types that applications can take advantage of. > If I am on the right track, my question is that when I use > record-based access, why do I have to be worried about the physical > address of the block, When you do an ls(), you are not providing the "physical address" of a block. You are only specifying an offset from the beginning of the file. It is up to the file system to translate that to a disk block. > whereas for stream I/O model I am only worried > about file pointer location? Because streams maintain an internal offset that is updated as you read from them. > Some follow up questions: Is a file stored as a linked list of blocks > in secondary memory, or is a file actually stored in contiguous > regions of secondary memory? That is hidden from application code. Each file system has its own methods for doing this. > How are calls like fread() and fwrite() > able to work correctly across block boundaries? That is a problem for the kernel and file system. > Is the "block list" > information maintained in the file itself? No. > How do implementations of > hash files and index files obtain the block pointer information that > the books seem to talk about? They don't. They just specify offsets to ls
(), pread() and pwrite(). The file system code, which executes within the kernel, uses the block list data to map the offset to a disk block. -SEan
Post Follow-up to this messagenimmi_srivastav@yahoo.com (Nimmi Srivastav) wrote: # I am trying to understand how records may be retrieved from a file, # based on some key. From the literature that I have gathered, it The only view the kernel supports is a flexible array of bytes. Anything beyond that is defined by user libraries. For example, the standard C library supports bother variable length records terminated by a line feed and fixed length records. There are other libraries that provide B-trees, hash index, etc. -- SM Ryan http://www.rawbw.com/~wyrmwif/ Title does not dictate behaviour.
Post Follow-up to this messageNimmi Srivastav wrote: > I am trying to understand how records may be retrieved from a file, > based on some key. In UNIX and Linux, to a programmer, a file is just a long string of bytes. At the kernel level, things can be a bit more complex, depending on whether you are really using normal file systems or raw file systems. The concept of "records" does not really exist in the UNIX-Linux file system, though records may be introduced at a higher level. One (lousy) approach is to assume all records are fixed length. Then you can use ls(record-number * record-size) to find the record. Normally, records tend to be of variable length, so that approach does not work well. One approach that does work, if you know the statistics of the record sizes, is to do the IO in terms of fixed size pages, and write a number of records on a page. Then use divide the record number by the number of records/page to get the page number (quotient) and use the remainder to get the item on the page (perhaps with a small table of contents on each page). If you can use raw file systems, you can divide the space up as you see fit. Records can be contiguous, so no rummaging around in i-nodes and trees is required to find a record once you know its record number. Now if you do not know the record number, but have keys (e.g., primary keys in a dbms context), you need some indexing scheme to convert keys to record numbers. There are many options, sorted indices, b-trees, hash tables, superimposed codeword indices, etc. Which one(s) you use depend a lot on the application. > From the literature that I have gathered, it > appears to me that almost all approaches (certainly hashed files and > indexed files, not sure about B-trees) work on the following principle > (paraphrased from a book): > The operating system divides secondary memory into equal-sized blocks. > A file is stored as a linked list of blocks. One of DEC's earlier operating systems for the PDP-11 did that. It stinks because if one record goes bad, the rest of the file is lost, and it is difficult to even delete the bad file. It is really a disaster. Furthermore, if all you want to do is append a few bytes, you must sequentially search all the way through the file to find the end. This was tolerable, perhaps, when files could not exceed 65536 bytes in length, but not today, where they sometimes exceed a Gigabyte. > Through some efficient > mechanism locate the address of block that has the record, and then > sequentially search the block for the record with the given index. > > Am I on the right track so far? Yes, for 1965 or so. > > If I am on the right track, my question is that when I use > record-based access, why do I have to be worried about the physical > address of the block, whereas for stream I/O model I am only worried > about file pointer location? It depends on what you are doing. For a student project, for example, running on Linux or UNIX, assume the file is just a long sequence of bytes and do not worry about the underlying implementation. That is the job of the OS: to protect you from needing to know that. It permits the OS to change the implementation and your application need know nothing about it. It will just keep working. > > Some follow up questions: Is a file stored as a linked list of blocks > in secondary memory, Maybe Microsoft still do that, but I doubt anyone else does. > or is a file actually stored in contiguous > regions of secondary memory? Let me say that in normal UNIX-Linux file systems, the file tends to be stored in contiguous extents (several blocks), but that there can be many extents that might be used. > How are calls like fread() and fwrite() > able to work correctly across block boundaries? Is the "block list" > information maintained in the file itself? How do implementations of > hash files and index files obtain the block pointer information that > the books seem to talk about? > > Regards, > Nimmi > > > [My apologies if this is an off-topic posting] -- .~. Jean-David Beyer Registered Linux User 85642. /V\ Registered Machine 241939. /( )\ Shrewsbury, New Jersey http://counter.li.org ^^-^^ 10:15:01 up 2 days, 6:40, 3 users, load average: 4.36, 4.26, 4.20
Post Follow-up to this messageIn <8b0c42d.0409202104.5e6e3fe6@posting.google.com>, on 09/20/2004 at 10:04 PM, nimmi_srivastav@yahoo.com (Nimmi Srivastav) said: >I am trying to understand how records may be retrieved from a file, >based on some key. There are many different libraries for indexed file. You have to write calls in accordance with the API of the library that you are using. >The operating system divides secondary memory into equal-sized >blocks. A file is stored as a linked list of blocks. Through some >efficient mechanism locate the address of block that has the record, >and then sequentially search the block for the record with the given >index. Yes. But if you use an existing library then you don't need to deal with that level. >If I am on the right track, my question is that when I use >record-based access, why do I have to be worried about the physical >address of the block, You don't. You only need to deal with physical addresses if you are doing low level I/O. If you are using an existing library, it does everything for you. Even if it bypasses the file system, that's not something that you need to deal with directly. >Some follow up questions: Is a file stored as a linked list of >blocks in secondary memory, or is a file actually stored in >contiguous regions of secondary memory? That depends on the file system. Except for a temporary file system, you must store the data on DASD. Whether the blocks are linked or are pointed to by something else is an implementation detail that need not be the same for two different file systems. >How do implementations of hash files and index files obtain the block >pointer information that the books seem to talk about? They generally use standard file system calls and let the file system translate the offsets into addresses under the covers. Data base packages with efficiency concerns might use other approaches. -- Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel> Unsolicited bulk E-mail subject to legal action. I reserve the right to publicly post or ridicule any abusive E-mail. Reply to domain Patriot dot net user shmuel+news to contact me. Do not reply to spamtrap@library.lspace.org
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.