
Tuesday, March 14, 2017

To fragment or not fragment, that is the question

First there were punch cards, and they were good. They were a nice, neat representation of data. One record on one card -- what could be easier?

Except that record sizes were limited to 80 bytes. And if you dropped a stack, the cards got out of sequence.

Then there were magtapes, and they were good too. Better than cards, because record sizes could be larger than 80 bytes. Also, if you dropped a tape the data stayed in sequence. But much like cards, data on magtapes was simply a series of records.

At first, there was one "file" on a tape: you started at the beginning, you read the records until the "end-of-file" mark, and you stopped. Later, we figured out that a single tape could hold multiple files, one after the other.

Except that files were always contiguous data. They could not be expanded on a single tape, since the expanded file would write over a portion of the next file. (Also, reading and writing to the same tape was not possible on many systems.)

So we invented magnetic disks and magnetic drums, and they were good too. Magtapes permitted sequential access, which meant reading the entire file and processing it. Disks and drums allowed for direct access which meant you could jump to a position in the file, read or write a record, and then jump somewhere else in the file. We eventually moved away from drums and stayed with disks, for a number of reasons.

Early disks allocated space much like tapes: a disk could contain several files but data for each file was contiguous. Programmers and system operators had to manage disk space, allocating space for files in advance. Like files on magtapes, files on disks were contiguous and could not be expanded, as the expansion would write over the next file.

And then we invented filesystems. (On DEC systems, they were called "directory structures".) Filesystems managed disk space, which meant that programmers and operators didn't have to.

Filesystems store files not as a long sequence of disk space but as collections of blocks, each block holding a number of bytes. Blocks added to a file could come from any area of the disk, not necessarily adjacent (or even close) to the original set of blocks. By adding or removing blocks, files could grow or shrink as necessary. The dynamic allocation of disk space was great!

Except that files were not contiguous.

When processing a file sequentially, it is faster to access a contiguous file than a non-contiguous file. Each block of data follows its predecessor, so the disk's read/write heads move little. For a non-contiguous file, with blocks of data scattered about the disk, the read/write heads must move from track to track to read each set of blocks. The action of moving the read/write heads takes time, and is therefore considered expensive.
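The cost difference can be sketched with a toy model: count the tracks the read/write head must cross to visit each block of the file in order. (The numbers below are made up for illustration; real drives add rotational delay, caching, and smarter scheduling.)

```python
# Toy model of disk head movement: the cost of reading a file is the
# total distance the head travels across tracks, block to block.

def seek_distance(tracks):
    """Total head movement to visit each block's track in file order."""
    return sum(abs(b - a) for a, b in zip(tracks, tracks[1:]))

# A contiguous file: ten blocks on consecutive tracks.
contiguous = list(range(100, 110))

# A fragmented file: the same ten blocks scattered across the disk.
fragmented = [100, 517, 32, 860, 105, 411, 9, 733, 106, 290]

print(seek_distance(contiguous))   # small: the head barely moves
print(seek_distance(fragmented))   # large: the head sweeps back and forth
```

In this toy model the fragmented layout costs hundreds of times more head movement than the contiguous one, which is why defragmenting paid off on spinning disks.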

Veteran PC users may remember utility programs which had the specific purpose of defragmenting a disk. They were popular in the 1990s.

Now, Windows defragments disks as an internal task. No third-party software is needed. No action by the user is needed.

To review: We started with punch cards, which were contiguous. Then we moved to magtapes, and files were still contiguous. Then we switched to disks, at first with contiguous files and then with non-contiguous files.

Then we created utility programs to make the non-contiguous files contiguous again.

Now we have SSDs (Solid-State Disks), which are really large chunks of memory with extra logic to hold values when the power is off. But they are still memory, and the cost of non-contiguous data is low. There are no read/write heads to move across a platter (indeed, there is no platter).

So the effort expended by Windows to defragment files (on an SSD) is not buying us better performance. It may be costing us, as the "defrag" process does consume CPU and does write to the SSD, and SSDs have a limited number of write operations in their lifespan.

So now, perhaps, we're going back to non-contiguous.

Tennis, anyone?

Tuesday, April 29, 2014

Cloud vendor lock-in is different from PC vendor lock-in

The IBM PC hardware was open, yet the business model of the PC software market was a fragmented one, with each vendor providing a mostly closed system.

The emerging model of cloud-based computing is a fragmented market, with each vendor providing a mostly closed system, but the nature of the closed-ness is quite different from the PC market.

In the PC market, the strategy was to allow data on any platform but use proprietary formats to tie data to an application. Microsoft Office applications used complex formats for their files, formats that made it difficult to use the files with any other application.

For cloud-based systems the fragmentation will be around data storage, not applications or the format of data. Vendors will strive to keep data in their storage system (Google Docs, Microsoft OneDrive) and push apps onto multiple platforms (Windows, iOS, Android).

Google Docs, Microsoft Office 365, and other cloud-based services use private storage systems. Create a document in Google Docs and it is stored in Google Drive. Create a spreadsheet in Office 365 and it is stored in Microsoft's OneDrive. These are private storage systems -- only the Google Docs apps can access Google Drive, and only Office 365 can access OneDrive.

We have limited visibility into these private storage systems. We cannot "see" our data, other than through the API and UI offered by the vendor. We cannot directly access our data. This allows the vendor to store the data in any convenient format: as a file, in a relational database, or in some other form.

Accessibility is what allows you to change from one office suite to another and still read your old documents. The new office suite must be able to read the format, of course, but such operations are possible. Microsoft used this trick to convert users from WordPerfect and Lotus 1-2-3 to Microsoft Word and Excel. OpenOffice uses this trick to read .DOC and .XLS files.

Cloud-based offerings don't allow such tricks. One cannot use Office 365 to read a document stored in Google Drive. (Not because the format is different, but because Office 365 cannot reach into Google Drive. Google Docs cannot reach into OneDrive, either.)

Cloud-based systems do allow one to download documents to your PC. When you do, they are stored in files (that's what PCs use). You can then upload the document to a different cloud-based system. But keep in mind: this download/upload trick works only while the cloud-based systems allow you to download a document to your PC. The owners of the cloud-based system can change or remove that capability at any time.

Switching from one cloud-based system to another may be difficult, and perhaps impossible. If a cloud vendor offers no way to get data "out", then the data, once entered into the system, remains there.

Vendors want to lock customers into their systems. The strategy for PC software was to use storage formats that tied data to an application. The strategy for cloud-based systems is not the format but the storage location. Look for Microsoft, Google, and others to offer convenient ways to transfer your data from your PC into their cloud.

And also look for convenient ways to get the data out of that cloud.

Saturday, May 25, 2013

Best practices are not best forever

Technology changes quickly. And with changes in technology, our views of technology change, and these views affect our decisions on system design. Best practices in one decade may be inefficient in another.

A recent trip to the local car dealer made this apparent. I had brought my car in for routine service, and the mechanic and I reviewed the car's maintenance history. The dealer has a nice, automated system to record all maintenance on vehicles. It has an on-line display and prints nicely-formatted maintenance summaries. A "modern" computer system, probably designed in the 1980s and updated over the years. (I put the word "modern" in quotes because it clearly runs on a networked PC with a back end database, but it does not have tablet or phone apps.)

One aspect of this system is the management of data. After some amount of time (it looks like a few years), maintenance records are removed from the system.

Proper system design once included the task of storage management. A "properly" designed system (one that followed "best practices") would manage data for the users. Data would be retained for a period of time but not forever. One had to erase information, because the total available space was fixed (or additional space was prohibitively expensive) and programming the system to manage space was more effective than asking people to erase the right data at the right time. (People tend to wait until all free storage is used and then binge-erase more data than necessary.)

That was the best practice -- at the time.

Over time, the cost of storage dropped. And over time, our perception of the cost of storage dropped.

Google has a big role in our new perception. With the introduction of GMail, Google gave each account holder a full gigabyte of storage. A full gigabyte! The announcement shocked the industry. Today, it is a poor e-mail service that cannot promise a gigabyte of storage.

Now, Flickr is giving each account holder a full terabyte of storage. A full terabyte! Even I am surprised at the decision. (I also think that it is a good marketing move.)

Let's return to the maintenance tracking system used by the car dealer.

Such quantities of storage vastly surpass the meager storage used by a few maintenance records. Maintenance records each take a few kilobytes of data (it's all text, and only a few pages). A full megabyte of data would hold all maintenance records for several hundred repairs and check-ups. If the auto dealer assigned a full gigabyte to each customer, they could easily hold all maintenance records for the customer, even if the customer brought the car for repairs every month for an extended car-life of twenty years!
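A back-of-the-envelope calculation makes the point. (The record size here is an assumption, chosen on the generous side of "a few kilobytes of text.")

```python
# Back-of-the-envelope arithmetic for the dealer's maintenance records.
record_size = 4 * 1024          # assume a few KB of text per record
visits = 12 * 20                # one visit per month for twenty years
total = record_size * visits    # bytes for one very diligent customer

print(total)                    # 983040 bytes: still under one megabyte
print(total / (1024 ** 3))      # a vanishing fraction of a gigabyte
```

Even the most diligent customer imaginable never fills a single megabyte, let alone the gigabyte the dealer could afford to assign.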

Technology has changed. Storage has become inexpensive. Today, it would be a poor practice to design a system that auto-purges records. You spend more on the code and the tests than you save on the reduction in storage costs. You lose older customer data, preventing you from analyzing trends over time.

The new best practices of big data, data science, and analytics require data. Old data has value, and the value is more than the cost of storage.

Best practices change over time. Be prepared for changes.




Wednesday, March 2, 2011

The disappearing notion of storage

In the beginning was the processor. And it was good. But the processor needed data to act upon, and lo! there was memory (in the form of mercury delay lines). But the processor and the memory worked only while powered, and we needed a way to store the data in the long term (that is, when the machine was not powered) and so there came storage devices.

Storage devices evolved along a tortuous path. From punch cards to magnetic tape, from paper tape to DECtapes (which were not quite the same as plain magtapes), and then to magnetic platters that were eventually called "discs".

The path for microcomputers was slightly different. Microcomputers started with nothing, had a short stint with paper tape and cassette tape, took off with floppy disks, then hard disks, and finally (eons later) flash thumb drives and solid-state disks (SSDs).

A computer needs storage because everything that we want to store is bigger than memory. The company's accounts would not fit in the 4K of memory (especially with the general ledger program and tape libraries sitting in memory too) so the data had to live somewhere. Removable media (paper tape, magtape, disks) made for an auxiliary memory of virtually infinite capacity.

But what happens when a computer's main memory becomes large enough to hold everything?

I've made the argument (in another forum) that virtual memory is sensible only when the CPU's addressable memory space is larger than physical memory. If the processor can address 16MB of memory (including all virtual pages) and the computer contains 4MB of memory, then virtual memory works. You can allocate memory up to the 4MB limit, and then swap out memory and effectively use 16MB of memory. But give that processor 16MB of memory, and there is no need for virtual memory -- in fact it is not possible to use virtual memory, since you can never have a page fault. (I'm ignoring the CPU designs that reserve address bits for virtual memory. Assume that every address bit can be used to address real memory.)
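Here is that argument in numbers, assuming a 4KB page size (the page size and memory sizes are illustrative, matching the figures above):

```python
# Virtual memory only helps when the addressable space exceeds physical RAM.
PAGE = 4 * 1024                  # assumed page size

def pages(nbytes):
    """How many pages fit in a given number of bytes."""
    return nbytes // PAGE

addressable = 16 * 1024 * 1024   # what the CPU can address
physical = 4 * 1024 * 1024       # what is actually installed

# With 4MB installed, only a quarter of the address space can be resident;
# touching the rest causes page faults, and the swap file earns its keep.
resident_fraction = physical / addressable
print(resident_fraction)         # 0.25

# Install the full 16MB and every addressable page can be resident:
physical = addressable
print(pages(addressable) - pages(physical))  # 0 pages left to fault on
```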

Computer memory has been limited in size due to a number of factors, one being manufacturing capabilities and costs. It wasn't possible (read 'feasible with the budget') to obtain more than a paltry few kilobytes of memory. External storage was slow but cheap.

The cost factor has just about disappeared. With solid-state disks, we now have lots and lots of bits. They happen to be organized into an entity that pretends to be a storage device, but let's think about this. Why pretend to be a storage device when the CPU can address memory directly?

Here's what I see happening: CPUs will change over time, and will be equipped with larger and larger addressing capabilities. This will require a change to physical architecture and instruction sets, so I expect the change will occur over decades. But in the end, I expect a computer to consist of a CPU, memory, video, and a network port. No USB ports. No memory sticks. No serial port. No keyboard port. And no hard disk drive. You will load applications and data onto your computer. They will be stored in memory. No muss, no fuss!

Such changes will mean changes for programs. We won't need the elaborate caching schemes. We won't need "save file" dialogs. We won't need disk de-blocking routines or special drivers for spinning metal platters.

Not everything changes. We will still need backup copies of data. We will still need transactions, to ensure that a complete set of changes is applied. We will still need names for data sets (we call them 'file names' today) and we will still need to grant permission to use selected parts of our data store.

So I wouldn't toss the concept of file systems just yet. But be prepared for changes in hardware to rock the boat.