Tuesday, March 14, 2017

To fragment or not to fragment, that is the question

First there were punch cards, and they were good. They were a nice, neat representation of data. One record on one card -- what could be easier?

Except that record sizes were limited to 80 bytes. And if you dropped a stack, the cards got out of sequence.

Then there were magtapes, and they were good too. Better than cards, because records could be larger than 80 bytes. Also, if you dropped a tape, the data stayed in sequence. But like cards, data on a magtape was simply a series of records.

At first, there was one "file" on a tape: you started at the beginning, you read the records until the "end-of-file" mark, and you stopped. Later, we figured out that a single tape could hold multiple files, one after the other.

Except that files were always contiguous data. They could not be expanded on a single tape, since the expanded file would write over a portion of the next file. (Also, reading and writing to the same tape was not possible on many systems.)

So we invented magnetic disks and magnetic drums, and they were good too. Magtapes permitted sequential access, which meant reading the entire file and processing it. Disks and drums allowed direct access, which meant you could jump to a position in the file, read or write a record, and then jump somewhere else in the file. We eventually moved away from drums and stayed with disks, for a number of reasons.

Early disks allocated space much like tapes: a disk could contain several files but data for each file was contiguous. Programmers and system operators had to manage disk space, allocating space for files in advance. Like files on magtapes, files on disks were contiguous and could not be expanded, as the expansion would write over the next file.

And then we invented filesystems. (On DEC systems, they were called "directory structures".) Filesystems managed disk space, which meant that programmers and operators didn't have to.

Filesystems store files not as one long run of disk space but as collections of blocks, each block holding a fixed number of bytes. Blocks added to a file can come from any area of the disk, not necessarily adjacent (or even close) to the original set of blocks. By adding or removing blocks, files can grow or shrink as necessary. The dynamic allocation of disk space was great!
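The idea can be shown with a toy sketch (this is not any real filesystem; the class and its behavior are invented for illustration): a file is just a list of block numbers, and a growing file takes the next free block, wherever it happens to sit on the disk.

```python
# Toy model of block-based allocation (hypothetical, not a real filesystem).
# Each file is a list of block numbers; growing a file grabs any free block.

class ToyFilesystem:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # free-block list
        self.files = {}                        # file name -> list of block numbers

    def append_block(self, name):
        block = self.free.pop(0)               # take the first free block, wherever it is
        self.files.setdefault(name, []).append(block)
        return block

fs = ToyFilesystem(total_blocks=10)
for _ in range(3):
    fs.append_block("a.dat")    # a.dat gets blocks 0, 1, 2
fs.append_block("b.dat")        # b.dat gets block 3
fs.append_block("a.dat")        # a.dat grows past b.dat: blocks 0, 1, 2, 4
```

Once `b.dat` claims block 3, the next block of `a.dat` has to come from somewhere else, and `a.dat` is fragmented. No file had to be pre-sized, and nothing was overwritten; the cost is that the blocks are no longer contiguous.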

Except that files were not contiguous.

When processing a file sequentially, it is faster to access a contiguous file than a non-contiguous file. Each block of data follows its predecessor, so the disk's read/write heads move little. For a non-contiguous file, with blocks of data scattered about the disk, the read/write heads must move from track to track to read each set of blocks. The action of moving the read/write heads takes time, and is therefore considered expensive.
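A back-of-the-envelope model makes the point. The numbers below are assumptions chosen for illustration (they are not measurements of any particular drive): a fixed transfer time per block, plus a seek penalty whenever the next block is not adjacent to the previous one.

```python
# Rough cost model for sequential reads (made-up numbers, not a benchmark).
TRANSFER_MS = 0.1   # assumed time to read one block
SEEK_MS = 9.0       # assumed time to move the read/write heads

def read_time_ms(blocks):
    time = TRANSFER_MS * len(blocks)
    for prev, curr in zip(blocks, blocks[1:]):
        if curr != prev + 1:        # non-adjacent block: the heads must move
            time += SEEK_MS
    return time

contiguous = list(range(100))       # blocks 0..99, in order
scattered = list(range(0, 200, 2))  # 100 blocks, every other one

print(read_time_ms(contiguous))     # about 10 ms: no seeks at all
print(read_time_ms(scattered))      # about 900 ms: a seek before nearly every block
```

With these assumed figures, the scattered file takes roughly ninety times as long to read, even though both files contain the same 100 blocks — the seeks, not the transfers, dominate.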

Veteran PC users may remember utility programs which had the specific purpose of defragmenting a disk. They were popular in the 1990s.

Now, Windows defragments disks as an internal task. No third-party software is needed. No action by the user is needed.

To review: We started with punch cards, which were contiguous. Then we moved to magtapes, and files were still contiguous. Then we switched to disks, at first with contiguous files and then with non-contiguous files.

Then we created utility programs to make the non-contiguous files contiguous again.

Now we have SSDs (Solid-State Drives), which are really large chunks of memory with extra logic to hold values when the power is off. But they are still memory, and the cost of non-contiguous data is low. There are no read/write heads to move across a platter (indeed, there is no platter).

So the effort expended by Windows to defragment files (on an SSD) is not buying us better performance. It may be costing us, as the "defrag" process does consume CPU and does write to the SSD, and SSDs have a limited number of write operations in their lifespan.

So now, perhaps, we're going back to non-contiguous.

Tennis, anyone?
