Could Microsoft discontinue Windows? Is it possible that the future will see no Windows PCs?
Windows has been the key product for Microsoft. Or it was, in the 1990s and 2000s. Back then, Microsoft's strategy was to provide everything, and to have all Microsoft products connect. Windows was the platform. The Office suite was the set of common tools. Active Directory provided directory and authentication services; Windows used it to log in users, and Exchange used it to route e-mail and hold the organization's mailboxes and calendars, with Outlook as the client. SQL Server was the database, and it was also used by other tools such as Visual Studio and TFS.
That strategy worked in the days of networked computers. It doesn't work today, and Microsoft has changed its strategy. They moved away from the "Windows and everything for every user" strategy. They now offer cloud services and cloud-based applications. Microsoft is happy to sell virtual Windows servers from the cloud. They are also happy to sell virtual Linux servers.
Windows is an expensive effort. It requires a lot of design, development, testing, and support. Microsoft gets revenue from Windows licenses. Is the revenue from Windows worth the expense? Today, the answer is most likely "yes". But five years from now? Ten years from now? Will the revenue cover the expense?
I think of Windows as two similar but distinct products. The first is the Windows we know: an operating system that runs on PCs. This version of Windows not only hosts applications but also handles all of the hardware of the PC: memory, video, sound, USB ports, clocks and timers, keyboard, mouse, ... everything. Windows has a hardware abstraction layer with drivers for specific devices. The upper layers of Windows (memory management, task switching, processes, and application programs) don't worry much about the lower levels. Upper-level functions use the hardware abstraction layer as an API to lower-level functions.
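As a rough illustration (a hypothetical sketch of my own, not actual Windows code), the hardware abstraction layer can be thought of as an interface that the upper layers call without knowing which driver sits underneath:

#include <cstddef>
#include <cstdint>

// Hypothetical driver interface: upper layers call these functions, and each
// device-specific driver supplies its own implementation.
struct BlockDevice {
    virtual ~BlockDevice() = default;
    virtual bool read(std::uint64_t sector, std::uint8_t *buffer, std::size_t count) = 0;
    virtual bool write(std::uint64_t sector, const std::uint8_t *buffer, std::size_t count) = 0;
};

// Upper-level code (a file system, say) is written against BlockDevice only.
// Whether the bytes come from a SATA disk, an NVMe drive, or a virtual disk in
// the cloud is invisible at this level.
bool copy_sector(BlockDevice &device, std::uint64_t from, std::uint64_t to) {
    std::uint8_t buffer[512];
    return device.read(from, buffer, 1) && device.write(to, buffer, 1);
}

The sketch uses a C++ abstract class; real driver models differ, but the idea is the same: a fixed interface with many interchangeable implementations underneath.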
That set of lower-level functions is complex, and expensive to develop and maintain. (The upper-level functions are expensive, too.)
There is another Windows product, one that isn't used on PCs. Microsoft offers Windows in the cloud -- virtual servers and workstations. This Windows product runs the same applications as the "real" Windows that runs on PCs, but it has a very different lower level.
The lower-level functions for "real" Windows (the one on your PC) must contain code for thousands of devices, and perhaps more. There are many models of PCs, with different chip sets for audio, USB, video, and all of the other hardware bits of a computer. Microsoft does a pretty good job at making Windows work with current and old PCs. (I recently installed Windows 10 on an old laptop that initially had Windows Vista, and everything works except for the wi-fi connection.)
The story is different for the "virtual" Windows (the one that runs in the cloud). That version of Windows has the same upper-level functions, but the logic below the hardware abstraction layer is just about non-existent. Data to and from devices doesn't go to the server; instead it gets routed to the client computer that is using the virtual computer. This second computer has the actual hardware -- the keyboard, the mouse, USB devices, audio devices, etc. Getting data to those devices is the responsibility of the client PC's operating system.
If the virtual PC is running Windows and the client PC is also running Windows, then some copy of Windows must still talk to all of those devices -- if not the Windows on the virtual PC, then the Windows on the client PC.
But is it necessary for the client to be Windows?
I think Microsoft is asking itself this very question.
Suppose that Microsoft were to drop the client part and focus on the virtual part of that configuration. Suppose Microsoft simply let the PC version of Windows die. What would happen?
For starters, Microsoft would lose the revenue from Windows licenses for "real" PCs. But it would also lose the expense of developing drivers and low-level functions. Keep in mind that there are lots of devices, and Microsoft has been very good at supporting them.
(Microsoft does a very good job of support, but they do not support every device that was ever made. Please don't complain to me because Microsoft doesn't support your ancient video card, or unusual disk drive, or odd-ball USB device.)
Why would Microsoft do that? Why would Microsoft walk away from all of that revenue? The answer is simple economics: Does the PC version of Windows pay for itself? Does the revenue from PC Windows cover the expense of hardware management? If it does, then there is an argument to keep providing Windows. If the revenue does not cover the expense... then the argument is to abandon that market.
It may be that Microsoft will keep support for real PCs, but a limited subset of hardware. (Perhaps only Microsoft's Surface tablets and laptops, or a limited subset of certified computers.) Such a world would be very different from today's land of PCs running Windows. Linux may step in and provide the client base for access to virtual PCs (be those virtual PCs Linux or Windows). Vendors such as Dell and Lenovo may develop their own operating systems, possibly based on Linux, and tuned to their hardware.
If Microsoft can successfully migrate a large percentage of application users (users of Office, Project, SQL Server, etc.) to their cloud-based virtual Windows desktops, then we might just see that new world.
Tuesday, October 1, 2019
Thursday, September 19, 2019
The PC Reverse Cambrian Explosion
The Cambrian Explosion is a term from paleontology. It describes a massive increase in the diversity of life that occurred half a billion years ago. Life on earth went from a measly few thousand species to hundreds of millions of species in the blink of a geologic eye.
Personal Computers have what I call a "PC Reverse Cambrian Explosion" or PC-RCE. It occurred in the mid-1980s, which some might consider to be half a billion years ago. In the PC-RCE, computers went from hundreds of different designs to one: the IBM PC compatible.
In the late 1970s and very early 1980s, there were lots of designs for small computers. These included the Apple II, the Radio Shack TRS-80, the Commodore PET and CBM machines, and others. There was a great diversity of hardware and software, including processors and operating systems. Some computers had floppy disks, although most did not. Many computers used cassette tape for storage, and some had neither cassette nor floppy disk. Some computers had built-in displays, and others required that you get your own terminal.
By the mid 1980s, that diversity was gone. The IBM PC was the winning design, and the market wanted that design and only that design. (Except for a few stubborn holdouts.)
One might think that the IBM PC caused the PC-RCE, but I think it was something else.
While the IBM PC was popular, other manufacturers could not simply start making compatible machines (or "clones" as they were later called). The hardware for the IBM PC was "open" in that the connectors and bus specification were documented, and this allowed manufacturers to make accessories for IBM PCs. But the software (the operating system and, importantly, the ROM BIOS) was not open. While both had documentation for their interfaces, they could not be copied without running afoul of copyright law.
Other computer manufacturers could not make IBM PC clones. Their choices were limited to 1) sell non-compatible PCs in a market that did not want them, or 2) go into another business.
Yet we now have many vendors of PCs. What happened?
The first part of the PC-RCE was the weakening of the non-IBM manufacturers. Most went out of business. (Apple survived, by offering compelling alternate designs and focusing on the education market.)
The second part was Microsoft's ability to sell MS-DOS to other manufacturers. Microsoft made custom versions for non-compatible hardware from Tandy, Victor, Zenith, and others. While "compatible with MS-DOS" wasn't the same as "compatible with the IBM PC", it allowed other manufacturers to use MS-DOS.
A near-empty market allowed upstart Compaq to introduce its Compaq portable, which was the first system not made by IBM and yet compatible with the IBM PC. It showed that there was a way to build IBM PC "compatibles" legally and profitably. Compaq was successful because it offered a product not available from IBM (a portable computer) that was also compatible (it ran popular software) and used premium components and designs to justify a hefty price tag. (Several thousand dollars at the time.)
The final piece was the Phoenix BIOS. This was the technology that allowed other manufacturers to build compatible PCs at low prices. Compaq had built their own BIOS, making it compatible with the API specified in IBM's documents, but it was an expensive investment. The Phoenix BIOS was available to all manufacturers, which let Phoenix amortize the cost over a larger number of PCs, for a lower per-unit cost.
The market maintained demand for the IBM PC design, but it wasn't fussy about the manufacturer. Customers bought "IBM compatible PCs" with delight. (Especially if the price was lower than IBM's.)
Those events (weakened suppliers, an operating system, a legal path forward, and the technology to execute it) made the PC the one and only design, and killed off the remaining designs. (Again, except for Apple. And Apple came close to extinction on several occasions.)
Now, this is all nice history, but what does it have to do with us folks living today?
The PC-RCE gave us a single design for PCs. That design has evolved over the decades, and just about every piece of the original IBM PC has mutated into something else, but the marketed PCs have remained uniform. At first, IBM specified the design, with the IBM PC, the IBM PC XT, and the IBM PC AT. Later, Microsoft specified the design with its "platform specification" for Windows. Microsoft could do this, due to its dominance of the market for operating systems and office software.
Today, the PC design is governed by various committees and standards organizations. They specify the design for things like the BIOS (or its replacement the UEFI), the power supply, and connectors for accessories. Individual companies have sway; Intel designs processors and support circuitry used in all PCs. Together, these organizations provide a single design which allows for modest variation among manufacturers.
That uniformity is starting to fracture.
Apple's computers joined the PC design in the mid-2000s. The "white MacBook" with an Intel processor was a PC design -- so much so that Windows and Linux can run on it. Yet today, Apple is moving their Macs and MacBooks in a direction different from the mainstream market. Apple-designed chips control certain parts of their computers, and these chips are not provided to other manufacturers. (Apple's iPhones and iPads are unique designs, with no connection to the PC design.)
Google is designing its Chromebooks and slowly moving them away from the "standard" PC design.
Microsoft is building Surface tablets and laptops with its proprietary designs, close to PCs yet not quite identical.
We are approaching a time when we won't think of PCs as completely interchangeable. Instead, we will think of them in terms of manufacturers: Apple PCs, Microsoft PCs, Google PCs, etc. There will still be a mainstream design; Dell and Lenovo and HP want to sell PCs.
The "design your own PC" game is for serious players. It requires a significant investment not only in hardware design but also in software. Apple has been playing that game all along. Microsoft and Google are big enough that they can join. Other companies may get involved, using Linux (or NetBSD as Apple did) as a base for their operating systems.
The market for PCs is fragmenting. In the future, I see a modest number of designs, not the hundreds that we had in 1980. The designs will be similar but not identical, and more importantly, not compatible -- at least for hardware.
A future with multiple hardware platforms will be a very different place. We have enjoyed a single (evolving) platform for the past four decades. A world with multiple, incompatible platforms will be a new experience for many. It will affect not only hardware designers, but everyone involved with PCs, from programmers to network administrators to purchasing agents. Software may follow the fragmentation, and we could see applications that run on one platform and not others.
A fragmented market will hold challenges. Once committed to one platform, it is hard to move to a different platform. (Just as it is difficult today to move from one operating system to another.) Instead of just the operating system, one will have to change the hardware, operating system, and possibly applications.
It may also be a challenge for Linux and open source software. They have used the common platform as a means of expansion. Will we see specific versions of Linux for specific platforms? Will Linux avoid some platforms as "too difficult" to implement? (The Apple MacBooks, with their extra chips for security, may be a challenge for Linux.)
The fragmentation I describe is a possible future -- it's not here today. I wouldn't panic, but I wouldn't ignore it, either. Keep buying PCs, but keep your eyes on them.
Friday, September 13, 2019
Apple hardware has nowhere to go
Apple has long sold products based on advanced design.
But now Apple has a problem: there is very little space to advance. The iPhone is "done" -- it is as good as it is going to get. This week's announcements about new iPhones were, in brief, all about the cameras. There were some mentions about higher-resolution screens (better than "retina" resolution, which itself was as good as the human eye could resolve), longer battery life, and a new color (green).
The iPhone is not the only product that has little "runway".
The MacBook also has little room to grow. It is as good as a laptop can get, and competitors are just as good -- at least in terms of hardware. There is little advantage to the MacBook.
The Mac (the desktop) is a pricey device for the upper end of developers, and a far cry from a workstation for "the rest of us". But it, like the MacBook, is comparable to competing desktops. There is little advantage to the Mac.
Apple knows this. Their recent move into services (television and music) shows that they believe it is better to invest in areas other than hardware.
But how to keep demand for those pricey Apple devices? Those shiny devices are how Apple makes money.
It is quite possible that Apple will limit their services to Apple devices. They may also limit the development tools for services to Apple devices (Macs and MacBooks with macOS). Consumers of Apple services (music, television) will have to use Apple devices. Developers of services for the Apple platform will have to use Apple devices.
Why would Apple do that? For the simple reason that they can charge premium prices for their hardware. Anyone who wants "in" to the Apple set of services will have to pay the entry fee.
It also separates Apple from the rest of computing, which carries its own risks. The mainstream platforms could move in a direction without Apple.
Apple has always maintained some distance from mainstream computing, which gives Apple the cachet of "different". Some distance is good, but too much distance puts Apple on their own "island of computing".
Being on one's own island is nice, while the tourists come. If the tourists stop visiting, then the island becomes a lonely place.
Wednesday, September 4, 2019
Don't shoot me, I'm only the OO programming language!
There has been a lot of hate for object-oriented programming of late. I use the word "hate" with care, as others have described their emotions as such. After decades of success with object-oriented programming, now people are writing articles with titles like "I hate object-oriented programming".
Why such animosity towards object-oriented programming? And why now? I have some ideas.
First, we have the age of object-oriented programming (OOP) as the primary paradigm for programming. I put the acceptance of OOP somewhere after the introduction of Java (in 1995) and before Microsoft's C# and .NET initiative (around 2000), which makes OOP about 25 years old -- or one generation of programmers.
(I know that object-oriented programming was around much earlier than C# and Java, and I don't mean to imply that Java was the first object-oriented language. But Java was the first popular OOP language, the first OOP language that was widely accepted in the programming community.)
So it may be that the rejection of OOP is driven by generational forces. Object-oriented programming, for new programmers, has been around "forever" and is an old way of looking at code. OOP is not the shiny new thing; it is the dusty old thing.
Which leads to my second idea: What is the shiny new thing that replaces object-oriented programming? To answer that question, we have to answer another: what does OOP do for us?
Object-oriented programming, in brief, helps developers organize code. It is one of several techniques to organize code. Others include Structured Programming, subroutines, and functions.
Subroutines are possibly the oldest technique for organizing code. They date back to the days of assembly language, when code that was executed more than once was called with a "branch" or "call" or "jump subroutine" opcode. Instead of repeating code (and using precious memory), common code could be stored once and invoked as often as needed.
Functions date back to at least Fortran, consolidating common code that returns a value.
For two decades (from the mid-1950s to the mid-1970s), subroutines and functions were the only way to organize code. In the mid-1970s, the structured programming movement introduced an additional way to organize code, with IF/THEN/ELSE and WHILE statements (and an avoidance of GOTO). These techniques worked at a more granular level than subroutines and functions. Structured programming organized code "in the small" and subroutines and functions organized code "in the medium". Notice that we had no way (at the time) to organize code "in the large".
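As a small illustration of organizing code "in the small" (a sketch of my own, in C-style code, not from any particular system), here is the same logic written first with GOTO-style control flow and then in structured form:

// Unstructured: control flow managed by hand, with labels and gotos.
int sum_unstructured(const int *values, int count) {
    int total = 0;
    int i = 0;
top:
    if (i >= count) goto done;
    total += values[i];
    i++;
    goto top;
done:
    return total;
}

// Structured: the same logic expressed with a while loop.
int sum_structured(const int *values, int count) {
    int total = 0;
    int i = 0;
    while (i < count) {
        total += values[i];
        i++;
    }
    return total;
}

Both functions compute the same sum; the structured version simply makes the shape of the loop visible at a glance.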
Techniques to organize code "in the large" did come. One attempt was dynamic-link libraries (DLLs), introduced with Microsoft Windows but also used by earlier operating systems. Another was Microsoft's COM, which organized the DLLs. Neither was particularly effective at organizing code.
Object-oriented programming was effective at organizing code at a level higher than procedures and functions. And it has been successful for the past two-plus decades. OOP let programmers build large systems, sometimes with thousands of classes and millions of lines of code.
So what technique has arrived that displaces object-oriented programming? How has the computer world changed, that object-oriented programming would become despised?
I think it is cloud programming and web services, and specifically, microservices.
OOP lets us organize a large code base into classes (and namespaces which contain classes). The concept of a web service also lets us organize our code, in a level higher than procedures and functions. A web service can be a large thing, using OOP to organize its innards.
But a microservice is different from a large web service. A microservice is, by definition, small. A large system can be composed of multiple microservices, but each microservice must be a small component.
Microservices are small enough that they can be handled by a simple script (perhaps in Python or Ruby) that performs a few specific tasks and then exits. Small programs don't need classes and object-oriented programming. Object-oriented programming adds cost to simple programs with no corresponding benefit.
Programmers building microservices in languages such as Java or C# may feel that object-oriented programming is being forced upon them. Both Java and C# are object-oriented languages, and they mandate classes in your program. A simple "Hello, world!" program requires the definition of at least one class, with at least one static method.
Perhaps languages that do not force object-oriented structure are better for microservices: languages such as Python, Ruby, or even Perl. If performance is a concern, the compiled languages C and Go are available. (It might be that the recent interest in C is driven by the development of cloud applications and microservices for them.)
Object-oriented programming was (and still is) an effective way to manage code for large systems. With the advent of microservices, it is not the only way. Using object-oriented programming for microservices is overkill. OOP requires overhead that is not helpful for small programs; if your microservice is large enough to require OOP, then it isn't a microservice.
I think this is the reason for the recent animosity towards object-oriented programming. Programmers have figured out that OOP doesn't mix with microservices -- but they don't know why. They feel that something is wrong (which it is) but they don't have the ability to shake off the established programming practices and technologies (perhaps because they don't have the authority).
If you are working on a large system, and using microservices, give some thought to your programming language.
Wednesday, August 28, 2019
Show me the optimizations!
Compilers have gotten good at optimizing code. So good, in fact, that we programmers take optimizations for granted. I don't object to optimizations, but I do think we need to re-think their opaqueness.
Optimizations are, in general, good things. They are changes to the code to make it faster and more efficient, while keeping the same functionality. Many times, they are changes that seem small.
For example, the code:
a = f(r * 4 + k)
b = g(r * 4 + k)
can be optimized to
t1 = r * 4 + k
a = f(t1)
b = g(t1)
The common expression r * 4 + k can be performed once, not twice, which reduces the time to execute. (It does require space to store the result, so this optimization is really a trade-off between time and space. Also, it assumes that r and k remain unchanged between calls to f() and g().)
Another example:
for i = 1 to 40
a[i] = r * 4 + k
which can be optimized to:
t1 = r * 4 + k
for i = 1 to 40
a[i] = t1
In this example, the operation r * 4 + k is repeated inside the loop, yet it does not change from iteration to iteration. (It is invariant during the loop.) The optimization moves the calculation outside the loop, which means it is calculated only once.
These are two simple examples. Compilers have made these optimizations for years, if not decades. Today's compilers are much better at optimizing code.
I am less concerned with the number of optimizations, and the types of optimizations, and more concerned with the optimizations themselves.
I would like to see the optimizations.
Optimizations are changes, and I would like to see the changes that the compiler makes to the code.
I would like to see how my code is revised to improve performance.
I know of no compiler that reports the optimizations it makes. Not Microsoft's compilers, not Intel's, not open source. None. And I am not satisfied with that. I want to see the optimizations.
Why do I want to see the optimizations?
First, I want to see how to improve my code. The above examples are trivial, yet instructive. (And I have, at times, written the un-optimized versions of those programs, although on a larger scale and with more variables and calculations to worry about.) Seeing the improvements to the code helps me become a better developer.
Second, I want to see what the compiler is doing. It may be making assumptions that are not true, possibly due to my failure to annotate variables and functions properly. I want to correct those failures.
Third, when reviewing code with other developers, I think we should review not only the original code but also the optimized code. The optimizations may give us insight into our code and data.
It is quite possible that future compilers will provide information about their optimizations. Compilers are sophisticated tools, and they do more than simply convert source code into executable bytes. It is time for them to provide more information to us, the programmers.
Monday, August 19, 2019
The Museum Principle for programming
Programming languages have, as one of their features, variables. A variable is a thing that holds a value and that value can vary over time. A simple example:
The statement
a = 1
defines a variable named 'a' and assigns a value of 1. Later, the program may contain the statement
a = 2
which changes the value from 1 to 2.
The exact operations vary from language to language. In C and C++, the name is closely associated with the underlying memory for the value. Python and Ruby separate the name from the underlying memory, which means that the name can be re-assigned to point to a different underlying value. In C and C++, the names cannot be changed in that manner. But that distinction has little to do with this discussion. Read on.
Some languages have the notion of constants. A constant is a thing that holds a value and that value cannot change over time. It remains constant. C, C++, and Pascal have this notion. In C, a program can contain the statement
const int a = 1;
A later statement that attempts to change the value of 'a' will cause a compiler error. Python and Ruby have no such notion.
Note that I am referring to constants, not literals such as '1' or '3.14' that appear in the code. These are truly constant and cannot be assigned new values. Some early language implementations did allow such behavior. It was never popular.
The notion of 'constness' is useful. It allows the compiler to optimize the code for certain operations. When applied to a parameter of a function, it informs the programmer that he cannot change the value. In C++, a function of a class can be declared 'const' and then that function cannot modify member variables. (I find this capability helpful to organize code and separate functions that change an object from functions that do not.)
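A brief C++ sketch of those last two points (the class and names are my own, invented for illustration):

class Account {
    double balance_ = 0.0;
public:
    void deposit(double amount) { balance_ += amount; }   // may modify members
    double balance() const { return balance_; }           // const: cannot modify members
};

// A const parameter tells the reader (and the compiler) that the function
// will not change the object it is given.
double fee(const Account &account) {
    return account.balance() * 0.01;   // only const member functions may be called here
    // account.deposit(1.0);           // would not compile: account is const
}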
The notion of 'constness' is a specific form of a more general concept, one that we programmers tend to not think about. That concept is 'read but don't write', or 'look but don't touch'. Or as I like to think of it, the "Museum Principle".
The Museum Principle states that you can observe the value of a variable, but you cannot change it. This principle is different from 'constness', which states that the value of a variable cannot (and will not) change. The two are close but not identical. The Museum Principle allows the variable to change; but you (or your code) are not making the change.
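To make the difference concrete, here is a C++ sketch (the class is my own invention): the counter's value changes over time, but code that holds only the const reference can observe the value and cannot modify it.

#include <iostream>

class Ticker {
    int count_ = 0;
public:
    void tick() { ++count_; }                     // the owner changes the value
    const int &count() const { return count_; }   // observers may look, but not touch
};

int main() {
    Ticker ticker;
    const int &view = ticker.count();   // a "museum" view of the value
    ticker.tick();
    ticker.tick();
    std::cout << view << "\n";          // prints 2: the value changed, just not through 'view'
    // view = 5;                        // would not compile: cannot assign through a const reference
    return 0;
}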
It may surprise readers to learn that the Museum Principle has been used already, and for quite a long time.
The idea of "look but don't touch" is implemented in Fortran and Pascal, in loop constructs. In these languages, a loop has an index value. The index value is set to an initial value and later modified for each iteration of the loop. Here are some examples that print the numbers from 1 to 10:
An example in Fortran:
do 100 i = 1, 10
write(*,*) 'i =', i
100 continue
An example in Pascal:
for i:= 1 to 10 do
begin
writeln('i =', i)
end;
In both of these loops, the variable i is initialized to the value 1 and incremented by 1 until it reaches the value 10. The body of each loop prints the value of i.
Now here is where the Museum Principle comes into play: In both Fortran and Pascal, you cannot change the value of i within the loop.
That is, the following code is illegal and will not compile:
In Fortran:
do 100 i = 1, 10
i = 20
write(*,*) 'i =', i
100 continue
In Pascal:
for i:= 1 to 10 do
begin
i := 20;
writeln('i =', i)
end;
The lines that assign 20 to i are not permitted. It is part of the specification for both Fortran and Pascal that the loop index is not to be assigned. (Early versions of Fortran and Pascal guaranteed this behavior. Later versions of the languages, which allowed aliases via pointers, could not.)
Compare this to a similar loop in C or C++:
for (unsigned int i = 1; i <= 10; i++)
{
printf("%d\n", i);
}
The specifications for the C and C++ languages have no such restriction on loop indexes. (In fact, C and C++ do not have the notion of a loop index; they merely allow a variable to be declared and assigned at the beginning of the loop.)
The following code is legal in C and C++ (and does what you expect):
for (unsigned int i = 1; i <= 10; i++)
{
i = 20;
printf("%d\n", i);
}
My point here is not to say that Fortran and Pascal are superior to C and C++ (or that C and C++ are superior to Fortran and Pascal). My point is to show that the Museum Principle is useful.
Preventing changes to a loop index variable is the Museum Principle. The programmer can see the value of the variable, and the value does change, but the programmer cannot change the value. The programmer is constrained.
Some might chafe at the idea of such a restraint. Many have complained about the restrictions of Pascal and lauded the freedom of C. Yet over time, modern languages have implemented the restraints of Pascal, such as bounds-checking and type conversion.
Modern languages often eliminate loop index variables, by providing "for-each" loops that iterate over a collection. This feature is a stronger form of the "look but don't touch" restriction on loop index variables. One cannot complain about Fortran's limitations on loop index variables unless one also dislikes the 'for-each' construct. A for-each iterator has a loop index, invisible (and untouchable!) inside.
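A short C++ sketch of that point: the range-based for loop below has no visible index at all; the iterator that drives it is created and advanced by the compiler, out of the programmer's reach.

#include <iostream>
#include <vector>

int main() {
    std::vector<int> values = {10, 20, 30};
    for (int value : values) {          // no index variable to see, and none to modify
        std::cout << value << "\n";     // value is a copy of the current element
    }
    return 0;
}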
For the "normal" loop (in which the index variable is not modified), there is no benefit from a prohibition of change to the index variable. (The programmer makes no attempt to change it.) It is the unusual loops, the loops which have extra logic for special cases, that benefit. Changing the loop index value is a shortcut, often serving a purpose that is not clear (and many times not documented). Preventing that short-cut forces the programmer to use code that is more explicit. A hassle in the short term, but better in the long term.
Constraints -- the right type of constraints -- are useful to programmers. The "structured programming" method was all about constraints for control structures (loops and conditionals) and the prohibition of "goto" operations. Programmers at the time complained, but looking back we can see that it was the right thing to do.
Constraints on loop index variables are also the right thing to do. Applying the Museum Principle to loop index variables will improve code and reduce errors.
Labels: constraints, Fortran, loops, Museum Principle, Pascal, structured code
Wednesday, July 31, 2019
Programming languages, structured or not, immediate or not
I had some spare time on my hands, and any of my friends will tell you that when I have spare time, I think about things. This time, I thought about programming languages.
That's not a surprise. I often think about programming languages. This time I thought about two aspects of programming languages that I call structuredness and immediacy. Immediacy is simply how quickly a programmer can get a response from the code. The languages Perl, Python, and Ruby all have high immediacy, as one can start a REPL (read-evaluate-print loop) that takes input and provides the result right away. (In contrast, programs in the languages C#, Java, Go, and Rust must be compiled, so there is an extra step to get a response.)
Structuredness, in a language, is how much organization the language encourages. I say "encourages" because many languages will allow unstructured code. Some languages do require careful thought and organization prior to coding. Functional programming languages require a great deal of thought. Object-oriented languages such as C++, C#, and Java provide some structure. Old-school BASIC did not provide structure at all, with only a GOTO and a simple IF statement to organize your code. (Visual Basic has much more structure than old-school BASIC, and it is closer to C# and Java, although it has a bit more immediacy than those languages.)
My thoughts on structuredness and immediacy led me to think about the combination of the two. Some languages are high in one aspect, and some languages mix the two aspects. Was there an overall pattern?
I built a simple grid with structure on one axis and immediacy on the other. Structure was on the vertical axis: languages with high structure were higher on the chart, languages with less structure were lower. Immediacy was on the horizontal axis, with languages with high immediacy to the right and languages that provided slower response were to the left.
Here's the grid:
structured
^
Go C++ Objective-C Swift |
C# Java VB.NET |
| Python Ruby
(Pascal) | Matlab
| Visual Basic
C | SQL Perl
COBOL Fortran | JavaScript (Forth)
slow <------------------------------------------------> immediate
(FORTRAN) | R
| (BASIC)
|
|
|
| spreadsheet
v
unstructured
Some notes on the grid:
- Languages in parentheses are older, less-used languages.
- Fortran appears twice: "Fortran" is the modern version and "(FORTRAN)" is the 1960s version.
- I have included "spreadsheet" as a programming language.
Compiled languages appear on the left (slow) side. This is not related to the performance of programs written in these languages, but to the development experience. When programming in a compiled language, one must edit the code, stop to compile, and then run the program. Languages on the right-hand side (the "immediate" side) do not need the compile step and provide feedback faster.
Notice that, aside from the elder FORTRAN, there are no slow, unstructured languages. Also notice that the structured immediate languages (Python, Ruby, et al.) cluster away from the extreme corner of structured and immediate. They are closer to the center.
The result is (roughly) a "main sequence" of programming languages, similar to the main sequence astronomers see in the types of stars. Programming languages tend to a moderate zone, where trade-offs are made between structure and immediacy.
The unusual entry was the spreadsheet, which I consider a programming language for this exercise. It appears in the extreme corner for unstructured and immediate. The spreadsheet, as a programming environment, is the fastest thing we have. Enter a value or a formula in a cell and the change "goes live" immediately. ("Before your finger is off the ENTER key", as a colleague would say.) This is faster than any IDE or compiler or interpreter for any other language.
Spreadsheets are also unstructured. There are no structures in spreadsheets, other than multiple sheets for different sets of data. While it is possible to carefully organize data in a spreadsheet, there is nothing that mandates the organization or even encourages it. (I'm thinking about the formulas in cells. A sophisticated macro programming language is a different thing.)
I think spreadsheets took over a specific type of computing. They became the master of immediate, unstructured programming. BASIC and Forth could not compete with them, and no language since has tried to compete with the spreadsheet. The spreadsheet is the most effective form of this kind of computing, and I see nothing that will replace it.
Therefore, we can predict that spreadsheets will stay with us for some time. It may not be Microsoft Excel, but it will be a spreadsheet.
We can also predict that programming languages will stay within the main sequence of compromise between structure and immediacy.
In other words, BASIC is not going to make a comeback. Nor will Forth, regrettably.
Tuesday, July 16, 2019
Across and down
All programming languages have rules. These rules define what can be done and what cannot be done in a valid program. Some languages even have rules for certain things that must be done. (COBOL, for example, requires the four 'DIVISION' sections in each program.)
Beyond rules, there are styles. Styles are different from rules. Rules are firm. Styles are soft. Styles are guidelines: good to follow, but break them when necessary.
Different languages have different styles. Some style guidelines are common: Many languages have guidelines for indentation and the naming of classes, functions, and variables. Some style guidelines are unique to languages.
The Python programming language has a style which limits line length. (To 79 characters, per the PEP 8 style guide, if you are interested.)
Ruby has a style for line length, too. (That is, if you use Rubocop with its default configuration.)
They are not the first languages to care about line length. COBOL and FORTRAN limited lines to 72 characters of code. These were rules, not guidelines. The origin was in punch cards: the language standards specified the column layout and set column 72 as the limit for code. Compilers ignored anything past column 72, and woe to the programmer who let a line exceed that length.
The limit in Python is a guideline. One is free to write Python with lines that exceed 80 characters, and the Python interpreter will run the code. Similarly, Ruby's style checker, Rubocop, can be configured to warn about any line length. Ruby itself will run the long lines of code. But limits on line length make for code that is more readable.
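The guideline shapes how code is written, not just how it is judged after the fact. A hypothetical Python example (the data and names are invented): a statement that runs past the limit is usually reworked into a wrapped form like the second version below.

    measurements = {"temperature": 21.5, "humidity": 0.43, "pressure": 101.3}

    # Over the limit: the whole summary built on one long line.
    summary = "temp=" + str(measurements["temperature"]) + " hum=" + str(measurements["humidity"]) + " pres=" + str(measurements["pressure"])

    # Within the limit: the same expression, wrapped inside parentheses.
    summary = (
        "temp=" + str(measurements["temperature"])
        + " hum=" + str(measurements["humidity"])
        + " pres=" + str(measurements["pressure"])
    )
    print(summary)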
Programs exist in two dimensions. Not just across, but also down. Code consists of lines of text.
While some languages limit the width of the code (the number of columns), no language limits the "height" of the code -- the number of lines in a program, or a module, or a class.
Some implementations of languages impose a limit on the number of lines. Microsoft BASIC, for example, limited line numbers to four digits, and since each line had to have a unique line number, that imposed an upper bound of 10,000 lines. Some compilers can handle as many lines as will fit in memory -- and no more. But these are limits imposed by the implementation. I am free, for example, to create an interpreter for BASIC that can handle more than 10,000 lines. (Or fewer, stopping at 1,000.) The language does not dictate the limit.
I don't want the harshly-enforced and unconfigurable limits of the days of early computing. But I think we could do with some guidelines for code length. Rubocop, to its credit, does warn about functions that exceed a configurable limit. There are tools for other languages that warn about the complexity of functions and classes. The idea of "the code is too long" has been bubbling in the development community for decades.
Perhaps it is time we gave it some serious thought.
One creative idea (I do not remember who posed it) was to use the IDE (or the editor) to limit program size. The idea was this: Don't allow scrolling in the window that holds the code. Instead of scrolling, as a programmer increased the length of a function, the editor reduced the font size. (The idea was to keep the entire function visible.) As the code grows in size, the text shrinks. Eventually, one reaches a point when the code becomes unreadable.
The idea of shrinking code on the screen is amusing, but the idea of limiting code size may have merit. Could we set style limits for the length of functions and classes? (Such limits and warnings already exist in Rubocop, so the answer is clearly 'yes'.)
The better question is: How do limits on code length (number of lines) help stakeholders? How do they help developers, and how do they help users?
The obvious response is that shorter functions (and shorter classes) are easier to read and comprehend, perform fewer tasks, and are easier to verify (and to correct). At least, that is what I want the answer to be -- I don't know that we have hard observations that confirm that point of view. I can say that my experience confirms this opinion; I have worked on several systems, in different languages, splitting large functions and classes into smaller ones, with the result being that the re-designed code is easier to maintain. Smaller functions are easier to read.
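As a tiny, hypothetical sketch of that kind of split (in Python; the report and its data are invented): the first version parses, filters, and formats in one block, while the second divides the same work into small functions that can each be read at a glance.

    # Before: one function that parses, filters, and formats.
    def report_before(lines):
        rows = []
        for line in lines:
            fields = line.strip().split(",")
            if len(fields) == 2 and fields[1].isdigit():
                rows.append(fields)
        return "\n".join(name + ": " + count for name, count in rows)

    # After: the same work, split into small functions with one job each.
    def parse(line):
        return line.strip().split(",")

    def is_valid(fields):
        return len(fields) == 2 and fields[1].isdigit()

    def report_after(lines):
        rows = [fields for fields in map(parse, lines) if is_valid(fields)]
        return "\n".join(name + ": " + count for name, count in rows)

    print(report_after(["apples,3", "bad line", "pears,7"]))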
I believe that code should consist of small classes and small functions. Guidelines and tools that help us keep functions short and classes small will improve our code. Remember that code exists in two dimensions (across and down) and that it should be moderate in both.
Monday, July 8, 2019
Lots of (obsolete) Chromebooks
We users of PCs are used to upgrades, for both hardware and software. We comfortably expect this year's PC to be faster than last year's PC, and this year's Windows (or macOS, or Linux) to be better than last year's Windows.
We're also used to obsolescence with hardware and software. Very few people use Windows XP these days, and the number of people using Windows 3.1 (or MS-DOS) is vanishingly small. The modern PC uses an Intel or AMD 64-bit processor.
Hardware and software both follow a pattern of introduction, acceptance, popularity, and eventual replacement. It should not surprise us that Chromebooks follow the same pattern. Google specifies hardware platforms and manufacturers build those platforms and install Chrome OS. After some time, Google drops support for a platform. (That period of time is a little over six years.)
For obsolete PCs (those not supported by Windows) and MacBooks (those not supported by macOS) the usual "upgrade" is to install Linux. There are several Linux distros that are suitable for older hardware. (I myself am running Ubuntu 16.04 on an old 32-bit Intel-based MacBook.)
Back to Chromebooks. What will happen with all of those Chromebooks that are marked as "obsolete" by Google?
There are a few paths forward.
The first (and least effort) path is to simply continue using the Chromebook and its version of Chrome. Chrome OS should continue to run, and Chrome should continue to run. The Chromebook won't receive updates, so Chrome will be "frozen in time" and gradually become older, compared to other browsers. There may come a time when its certificates expire, and it will be unable to initiate secure sessions with servers. At that point, Chrome (and the Chromebook) will have very few uses.
Another obvious path is to replace it. Chromebooks are typically less expensive than PCs, and one could easily buy a new Chromebook. (And since the Chromebook model of computing is to store everything on the server and nothing on the Chromebook, there is no data to migrate from the old Chromebook to the new one.)
Yet there is another option between "continue as is" and "replace".
One could replace the operating system (and the browser). The Chromebook is a PC, effectively, and there are ways to replace its operating system. Microsoft has instructions for installing Windows 10 on a Chromebook, and there are many sites that explain how to install Linux on a Chromebook.
Old Chromebooks will be fertile ground for tinkerers and hobbyists. Tinkerers and hobbyists are willing to open laptops (Chromebooks included), adjust hardware, and install operating systems. When Google drops support for a specific model of Chromebook, there is little to lose in replacing Chrome OS with something like Linux. (Windows 10 on a Chromebook is tempting, but many Chromebooks have minimal hardware, and Linux may be the better fit.)
I expect to see lots of Chromebooks on the used market, in stores and online, and lots of people experimenting with them. They are low-cost PCs suitable for small applications. The initial uses will be as web browsers or remote terminals to server-based applications (because that is what we use Chromebooks for now). But tinkerers and hobbyists are clever and imaginative, and we may see new uses, such as low-end games or portable word processors.
Perhaps a new operating system will emerge, one that is specialized for low-end hardware. There are already Linux distros which support low-end PCs (Puppy Linux, for one) and we may see more interest in those.
Those Chromebooks that are converted to Linux will probably end up running a browser. It may be Firefox, or, in an ironic twist, they may run Chromium -- or even Chrome! The machine that Google says is "not good enough" may be just good enough to run Google's browser.
Friday, June 21, 2019
The complexity of programming languages
A recent project saw me examining and tokenizing code for different programming languages. The languages ranged from old languages (COBOL and FORTRAN, among others) to modern languages (Python and Go, among others). It was an interesting project, and I learned quite a bit about many different languages. (By 'tokenize', I mean to identify the type of each item in a program: Variables, identifiers, function names, operators, etc. I was not parsing the code, or building an abstract syntax tree, or compiling the code into op-codes. Tokenizing is the first step of compiling, but a far cry from actually compiling the code.)
One surprising result: newer languages are easier to tokenize than older languages. Python is easier to tokenize than COBOL, and Go is easier to tokenize than FORTRAN.
This is counterintuitive. One would think that older languages would be primitive (and therefore easy to tokenize) and modern languages sophisticated (and therefore difficult to tokenize). Yet my experience shows the opposite.
Why would this be? I can think of two -- no, three -- reasons.
First, the old languages (COBOL, FORTRAN, and PL/I) were designed in the age of punch cards, and punch cards impose limits on source code. COBOL, FORTRAN, and PL/I have few things in common, but one thing they do have in common is a fixed line layout and an 'identification' field in columns 73 through 80.
When your program is stored on punch cards, a risk is that someone will drop the deck of cards and the cards will become out of order. Such a thing cannot happen with programs stored in disk files, but with punch cards such an event is a real risk. To recover from that event, the right-most columns were reserved for identification: a code, unique to each line, that would let a card sorter machine (there were such things) put the cards back into their proper order.
The need for an identification field is tied to the punch card medium, yet it became part of each language standard. The COBOL, FORTRAN, and PL/I standards all reserve columns 73 through 80 for identification; they could not be used for "real" source code. Programs transferred from punch cards to disk files (when disks became available to programmers) kept the rule for the identification field -- probably to make conversion easy. Later versions of the languages did drop the rule, but the damage had been done. The identification field was part of the language specification.
Because the identification field was part of the language specification, I had to handle it when tokenizing. Mostly the identification numbers were not a problem -- just another "thing" to tokenize -- but sometimes they landed in the middle of a string literal or a comment, which made for awkward situations.
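Here is a sketch of the kind of handling that requires, in Python (illustrative, not the actual project code): before tokenizing a fixed-form line, slice off the identification field so it cannot masquerade as part of a literal or a comment.

    def split_fixed_form(line):
        # Columns 73-80 (string indexes 72-79) hold the identification or
        # sequence number; columns 1-72 hold the code the tokenizer should see.
        padded = line.rstrip("\n").ljust(80)
        return padded[:72], padded[72:80]

    statement = "       TOTAL = TOTAL + AMOUNT"
    card = statement.ljust(72) + "WC001230"      # a made-up card image
    code, ident = split_fixed_form(card)
    print(repr(code.rstrip()))   # the statement, ready for tokenizing
    print(repr(ident))           # the sequence field, 'WC001230'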
Anyway, the tokenization of old languages has its challenges.
New languages don't suffer from such problems. Their source code was never stored on punch cards, and they never had identification fields -- inside string literals or anywhere else.
The tokenization of modern languages is easier in another way, too. Each language has its own set of token types, but older languages have a larger and more varied set. Most languages have identifiers, numeric literals, and operators; COBOL also has picture values and level indicators, and PL/I has attributes and conditions (among other token types).
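To show how little machinery a modern language needs, here is a toy tokenizer in Python -- a rough sketch, nowhere near a production tokenizer, and the token categories are my own simplification -- that recognizes identifiers, numeric literals, string literals, and a few operators with a single regular expression.

    import re

    # A toy tokenizer: each alternative in the pattern names one token type.
    TOKEN_PATTERN = re.compile(r"""
        (?P<number>     \d+(?:\.\d+)?  )
      | (?P<string>     "[^"]*"        )
      | (?P<identifier> [A-Za-z_]\w*   )
      | (?P<operator>   [+\-*/=<>()]   )
      | (?P<skip>       \s+            )
    """, re.VERBOSE)

    def tokenize(source):
        for match in TOKEN_PATTERN.finditer(source):
            if match.lastgroup != "skip":
                yield match.lastgroup, match.group()

    print(list(tokenize('total = price * 1.07 + "tax"')))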
Which brings me to the second reason for modern languages to have simpler tokenizing requirements: The languages are designed to be easy to tokenize.
It seems to me that, intentionally or not, the designers of modern languages have made design choices that reduce the work for tokenizers. They have built languages that are easy to tokenize, and therefore have simple logic for tokenizers. (All compilers and interpreters have tokenizers; it is a step in converting the source to executable bytes.)
So maybe the simplicity of language tokenization is the result of the "laziness" of language designers.
But I have a third reason, one that I believe is the true reason for the simplicity of modern language tokenizers.
Modern languages are easy to tokenize because they are easy to read (by humans).
A language that is easy to read (for a human) is also easy to tokenize. Language designers have been consciously designing languages to be easy to read. (Python is the leading example, but all designers claim their language is "easy to read".)
Languages that are easy to read are easy to tokenize. It's that simple. We've been designing languages for humans, and as a side effect we have made them easy for computers.
I, for one, welcome the change. Not only does it make my job easier (tokenizing all of those languages) but it makes every developer's job easier (reading code from other developers and writing new code).
So I say three cheers for simple* programming languages!
* Simple does not imply weak. A simple programming language may be easy to understand, yet it may also be powerful. The combination of the two is the real benefit here.
Friday, May 17, 2019
Procedures and functions are two different things
The programming language Pascal had many good ideas. Many of those ideas have been adopted by modern programming languages. One idea that hasn't been adopted was the separation of functions and procedures.
Some definitions are in order. In Pascal, a function is a subroutine that accepts input parameters, can access variables in its scope (and containing scopes), performs some computations, and returns a value. A procedure is similar: it accepts input parameters, can access variables in its scope and containing scopes, performs calculations, and ... does not return a value.
Pascal has the notion of functions and a separate notion of procedures. A function is a function and a procedure is a procedure, and the two are different. A function can be used (in early Pascal, must be used) in an expression. It cannot stand alone.
A procedure, in contrast, is a computational step in a program. It cannot be part of an expression. It is a single statement, although it can be part of an 'if' or 'while' statement block.
Functions and procedures have different purposes, and I believe that the creators of Pascal envisioned functions to be unable to change variables outside of themselves. Procedures, I believe, were intended to change variables outside of their immediate scope. In C++, a Pascal-style function would be a function that is declared 'const', and a procedure would be a function that returns 'void'.
This arrangement is different from the C idea of functions. C combines the idea of function and procedure into a single 'function' construct. A function may be designed to return a value, or it may be designed to return nothing. A function may change variables outside of its scope, but it doesn't have to. (It may or may not have "side effects".)
In the competition among programming languages, C won big early on, and Pascal -- or rather, the ideas in Pascal -- gained acceptance slowly. The C notion of function has been carried forward by other popular languages: C++, Java, C#, Python, Ruby, and even Go.
I remember quite clearly learning about Pascal (many years ago) and feeling that C was superior to Pascal due to its single approach. I sneered (mentally) at Pascal's split between functions and procedures.
I have come to regret those feelings, and now see the benefit of separating functions and procedures. When building (or maintaining) large-ish systems in modern languages (C++, C#, Java, Python), I have created functions that follow the function/procedure split. These languages force one to write functions -- there is no construct for a procedure -- yet I designed some functions to return values and others to not return values. The value-returning functions I made 'const' when possible, and avoided side effects. The functions with side effects I designed to not return values. In sum, I built functions and procedures, although the compiler uses only the 'function' construct.
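A sketch of that convention in Python (my own example, not Pascal and not the systems mentioned above): the "function" computes and returns a value without touching anything outside itself, and the "procedure" changes state and deliberately returns nothing.

    # A Pascal-style "function": pure computation, returns a value, no side effects.
    def monthly_payment(principal, annual_rate, months):
        rate = annual_rate / 12
        return principal * rate / (1 - (1 + rate) ** -months)

    # A Pascal-style "procedure": changes state (the ledger), returns nothing.
    def record_payment(ledger, amount):
        ledger.append(amount)

    ledger = []
    payment = monthly_payment(10000, 0.06, 24)   # a function belongs in an expression
    record_payment(ledger, payment)              # a procedure stands alone as a statement
    print(round(payment, 2), ledger)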
The future may hold programming languages that provide functions and procedures as separate constructs. I'm confident that we will see languages that have these two ideas. Here's why:
First, there is a new class of programming languages called "functional languages". These include ML, Erlang, Haskell, and F#, to name a few. These functional languages use Pascal's original idea of functions as code blocks that perform a calculation with no side effects and return a value. Language designers have already re-discovered the idea of the "pure function".
Second, most ideas from Pascal have been implemented in modern languages. Bounds-checking for arrays. Structured programming. Limited conversion of values from one type to another. The separation of functions and procedures is one more of these ideas.
The distinction between functions and procedures is one more concept that Pascal got right. I expect to see it in newer languages, perhaps over the next decade. The enthusiasts of functional programming will realize that pure functions are not sufficient and that they need procedures. We'll then see variants of functional languages that include procedures, with purists holding on to procedure-less languages. I'm looking forward to the division of labor between functions and procedures; it has worked well for me in my efforts and a formal recognition will help me convey this division to other programmers.
Labels:
C,
functional programming,
new programming languages,
Pascal
Tuesday, April 23, 2019
Full-stack developers and the split between development and system administration
The notion of a "full stack" developer has been with us for a while, Some say it is a better way to develop and deploy systems, others take the view that it is a way for a company to build systems at lower cost. Despite their differing opinions on the value of a full stack engineer, everyone agrees on the definition: A "full stack" developer (or engineer) is a person who can "do it all" from analysis to development and testing (automated testing), from database design to web site deployment.
But here is a question: Why was there a split in functions? Why did we have separate roles for developers and system administrators? Why didn't we have combined roles from the beginning?
Well, at the very beginning of the modern computing era, we did have a single role. But things became complicated, and specialization was profitable for the providers of computers. Let's go back in time.
We're going way back in time, back before the current cloud-based, container-driven age. Back before the "old school" web age. Before the age of networked (but not internet-connected) PCs, and even before the PC era. We're going further back, before minicomputers and before commercial mainframes such as the IBM System/360.
We're going back to the dawn of modern electronic computing. This was a time before the operating system, and individuals who wanted to use a computer had to write their own code (machine code, not a high-level language such as COBOL) and those programs managed memory and manipulated input-output devices such as card readers and line printers. A program had total control of the computer -- there was no multiprocessing -- and it ran until it finished. When one programmer was finished with the computer, a second programmer could use it.
In this age, the programmer was a "full stack" developer, handling memory allocation, data structures, input and output routines, and business logic. There were no databases, no web servers, and no authentication protocols, but the programmer "did it all", including scheduling time on the computer with other programmers.
Once organizations developed programs that they found useful, especially programs that had to be run on a regular basis, they dedicated a person to the scheduling and running of those tasks. That person's job was to ensure that the important programs were run on the right day, at the right time, with the right resources (card decks and magnetic tapes).
Computer manufacturers provided people for those roles, and also provided training for client employees to learn the skills of the "system operator". There was a profit for the manufacturer -- and a cost to be avoided (or at least minimized) by the client. Hence, only a few people were given the training.
Of the five "waves" of computing technology (mainframe, minicomputers, personal computers, networked PCs, and web servers) most started with a brief period of "one person does it all" and then shifted to a model that divided labor among specialists. Mainframes specialized with programmers and system operators (and later, database administrators). Personal computers, by their very nature, had one person but later specialists for word processing, databases, and desktop publishing. Networked PCs saw specialization with enterprise administrators (such as Windows domain administrators) and programmers each learning different skills.
It was the first specialization of tasks, in the early mainframe era, that set the tone for later specializations.
Today, we're moving away from specialization. I suspect that the "full stack" engineer is desired by managers who have tired of the arguments between specialists. Companies don't want to hear sysadmins and programmers bickering about who is at fault when an error occurs; they want solutions. Forcing sysadmins and programmers to "wear the same hat" eliminates the arguments. (Or so managers hope.)
The specialization of tasks on the different computing platforms happened because it was more efficient. The different jobs required different skills, and it was easier (and cheaper) to train some individuals for some tasks and other individuals for other tasks, and manage the two groups.
Perhaps the relative costs have changed. Perhaps, with our current technology, it is more difficult (and more expensive) to manage groups of specialists, and it is cheaper to train full-stack developers. That may say more about management skills than it does about technical skills.
Wednesday, April 10, 2019
Program language and program size
Can programs be "too big"? Does it depend on the language?
In the 1990s, the two popular programming languages from Microsoft were Visual Basic and Visual C++. (Microsoft also offered Fortran and an assembler, and I think COBOL, but they were used rarely.)
I used both Visual Basic and Visual C++. With Visual Basic it was easy to create a Windows application, but the applications in Visual Basic were limited. You could not, for example, launch a modal dialog from within a modal dialog. Visual C++ was much more capable; you had the entire Windows API available to you. But the construction of Visual C++ applications took more time and effort. A simple Visual Basic application could be "up and running" in a minute. The simplest Visual C++ application took at least twenty minutes. Applications with dialogs took quite a bit of time in Visual C++.
Visual Basic was better for small applications. They could be written quickly, and changed quickly. Visual C++ was better for large applications. Larger applications required more design and coding (and more testing) but could handle more complex tasks. Also, the performance benefits of C++ were only obtained for large applications.
(I will note that Microsoft has improved the experience since those early days of Windows programming. The .NET framework has made a large difference. Microsoft has also improved the dialog editors and other tools in what is now called Visual Studio.)
That early Windows experience got me thinking: are some languages better at small programs, and other languages better at large programs? Small programs written in languages that require a lot of code (verbose languages) have a disadvantage because of the extra work. Visual C++ was a verbose language; Visual Basic was not -- or was less verbose. Other languages weigh in at different points on the scale of verbosity.
Consider a "word count" program. (That is, a program to count the words in a file.) Different languages require different amounts of code. At the small-program end of the scale we have languages such as AWK and Perl. At the large-end of the scale we have COBOL.
(I am considering lines of code here, and not executable size or the size of libraries. I don't count run-time environments or byte-code engines.)
I would much rather write (and maintain) the word-count program in AWK or Perl (or Ruby or Python). Not because these languages are modern, but because the program itself is small. (Trivial, actually.) The program in COBOL is large; COBOL has some string-handling functions (but not many) and it requires a fair amount of overhead to define the program. A COBOL program is long, by design. COBOL is a verbose language.
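Here, for the record, is one reasonable version of the word-count program in Python ("word" here meaning whitespace-separated token; the file name comes from the command line):

    import sys

    def count_words(path):
        with open(path) as f:
            return sum(len(line.split()) for line in f)

    if __name__ == "__main__":
        print(count_words(sys.argv[1]))

A handful of lines of working logic; a COBOL version spends roughly that much on its DIVISION headers before any logic appears.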
Thus, there is an incentive to build small programs in certain languages. (I should probably say that there is an incentive to build certain programs in certain languages.)
But that is on the small end of the scale of programs. What about the other end? Is there an incentive to build large programs in certain languages?
I believe that the answer is yes. Just as some languages are good for small programs, other languages are good for large programs. The languages that are good for large programs have structures and constructs which help us humans manage and understand the code in large scale.
Over the years, we have developed several techniques to manage source code. They include:
- subroutines and functions
- modules
- classes
- namespaces and other contexts
These techniques help us by partitioning the code. We can "lump" and "split" the code into different subroutines, functions, modules, classes, and contexts. We can define rules to limit the information that is allowed to flow between the multiple "lumps" of a system. Limiting the flow of information simplifies the task of programming (or debugging, or documenting) a system.
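A minimal sketch of that idea in Python (the names are invented for illustration): one "lump" of the system exposes a small interface and keeps its internal data out of reach, so the rest of the code cannot depend on the details.

    class RateTable:
        # One "lump" of the system; callers see two methods, not the data.
        def __init__(self):
            self._rates = {}          # internal detail, not part of the interface

        def set_rate(self, region, rate):
            self._rates[region] = rate

        def lookup(self, region):
            return self._rates.get(region, 0.0)

    # The rest of the system talks to this lump only through its interface.
    table = RateTable()
    table.set_rate("north", 0.07)
    print(table.lookup("north"))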
Is there a point when a program is simply "too big" for a language?
I think there are two concepts lurking in that question. The first is a relative answer, and the second is an absolute answer.
Let's start with a hypothetical example. A mind experiment, if you will.
Let's imagine a program. It can be any program, but it is small and simple. (Perhaps it is "Hello, world!") Let's pick a language for our program. As the program is small, let's pick a language that is good for small programs. (It could be Visual Basic or AWK.)
Let's continue our experiment by increasing the size of our program. As this was a hypothetical program, we can easily expand it. (We don't have to write the actual code -- we simply expand the code in our mind.)
Now, keeping our program in mind, and remembering our initial choice of a programming language, let us consider other languages. Is there a point when we would like to switch from our chosen programming language to another language?
The relative answer applies to a language when compared to a different language. In my earlier example, I compared Visual Basic with Visual C++. Visual Basic was better for small programs, Visual C++ for large programs.
The exact point of change is not clear. It wasn't clear in the early days of Windows programming, either. But there must be a crossover point, where the situation changes from "better in Visual Basic" to "better in Visual C++".
The two languages don't have to be Visual Basic and Visual C++. They could be any pair. One could compare COBOL and assembler, or Java and Perl, or Go and Ruby. Each pair has its own crossover point, but the crossover point is there. Each pair of languages has a point in which it is better to select the more verbose language, because of its capabilities at managing large code.
That's the relative case, which considers two languages and picks the better of the two. Then there is the absolute case, which considers only one language.
For the absolute case, the question is not "Which is the better language for a given program?", but "Should we write a program in a given language?". That is, there may be some programs which are too large, too complex, too difficult to write in a specific programming language.
Well-informed readers will be aware that a program written in a language that is "Turing complete" can be translated into any other programming language that is also "Turing complete". That is not the point. The question is not "Can this program be written in a given language?" but "Should this program be written in a given language?".
That is a much subtler question, and much more subjective. I may consider a program "too big" for language X while another might consider it within bounds. I don't have metrics for such a decision -- and even if I did, one could argue that my cutoff point (a complexity value of 2000, say) is arbitrary and the better value is somewhat higher (perhaps 2750). One might argue that a more talented team can handle programs that are larger and more complex.
Someday we may have agreed-upon metrics, and someday we may have agreed-upon cutoff values. Someday we may be able to run our program through a tool for analysis, one that computes the complexity and compares the result to our cut-off values. Such a tool would be an impartial judge for the suitability of the programming language for our task. (Assuming that we write programs that are efficient and correct in the given programming language.)
Someday we may have all of that, and the discipline to discard (or re-design) programs that exceed the boundaries.
But we don't have that today.
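If such a tool existed, its outline might be no more complicated than the toy sketch below. (Everything here is invented for illustration: the keyword-counting "metric" is far too crude to be a real measure, and the cutoff of 2000 is the arbitrary value mentioned above.)

#include <iostream>
#include <regex>
#include <string>

// A toy "complexity" metric: count branching keywords in the source text.
// A real tool would use a proper metric and an agreed-upon cutoff.
int complexity(const std::string& source)
{
    static const std::regex branch(R"(\b(if|else|for|while|case|catch)\b)");
    return static_cast<int>(std::distance(
        std::sregex_iterator(source.begin(), source.end(), branch),
        std::sregex_iterator()));
}

int main()
{
    const std::string source = "if (x) { while (y) { /* ... */ } }";  // program text under review
    const int cutoff = 2000;                                          // arbitrary, as noted above
    std::cout << (complexity(source) > cutoff
                      ? "too big for this language"
                      : "within bounds")
              << '\n';
}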
Thursday, March 28, 2019
Spring cleaning
Spring is in the air! Time for a general cleaning.
An IT shop of any significant size will have old technologies. Some folks will call them "legacy applications". Other folks try not to think about them. But a responsible manager will take inventory of the technology in his (or her) shop and winnow out those that are not serving their purpose (or are posing threats).
Here are some ideas for tech to get rid of:
Perl: I have used Perl. When the alternatives were C++ and Java, Perl was great. We could write programs quickly, and they tended to be small and easy to read. (Well, sort of easy to read.)
Actually, Perl programs were often difficult to read. And they still are difficult to read.
With languages such as Python and Ruby, I'm not sure that we need Perl. (Yes, there may be a module or library that works only with Perl. But they are few.)
Recommendation: If you have no compelling reason to stay with Perl, move to Python.
Visual Basic and VB.NET: Visual Basic (the non-.NET version) is old and difficult to support. It will only become older and more difficult to support. It does not fit in with web development -- much less cloud development. VB.NET has always been second chair to C#.
Recommendation: Migrate from VB.NET to C#. Migrate from Visual Basic to anything (except Perl).
Any version of Windows other than Windows 10: Windows 10 has been with us for years. There is no reason to hold on to Windows 8 or Windows 7 (or Windows Vista).
If you have applications that can run only on Windows 7 or Windows 8, you have an application that will eventually die.
You don't have to move to Windows 10. You can move some applications to Linux, for example. If people are using only web-based applications, you can issue them Chromebooks or low-end Windows laptops.
Recommendation: Replace older versions of Windows with Windows 10, Linux, or Chrome OS.
CVS and Subversion: Centralized version control systems require administration, which translates into expense. Distributed version control systems often cost less to administer, once you teach people how to use them. (The transition is not always easy, and the conversion costs are not zero, but in the long run the distributed systems will cost you less.)
Recommendation: Move to git.
Everyone has old technology. The wise manager knows about it and decides when to replace it. The foolish manager ignores the old technology, and often replaces it when forced to by external events, and not at a time of his choosing.
Be a wise manager. Take inventory of your technology, assess risk, and build a plan for replacements and upgrades.
Wednesday, March 27, 2019
Something new for programming
One of my recent side projects involved R and R Studio. R is a programming language, an interpreted language with powerful data manipulation capabilities.
I am not impressed with R and I am quite disappointed with R Studio. I have ranted about them in a previous post. But in my ... excitement ... over R and R Studio, I missed the fact that we have something new in programming.
That something new is a new form of IDE, one that has several features:
- on-line (cloud-based)
- mixes code and documentation
- immediate display of output
- can share the code, document, and results
R Studio has a desktop version, which you install and run locally. It also has a cloud-based version -- all you need is a browser, an internet connection, and an account. The online version looks exactly like the desktop version -- something that I think will change as the folks at R Studio add features.
R Studio puts code and documentation into the same file. R Studio uses a variant of Markdown (named 'Rmd').
The concept of comments in code is not new. Comments are usually short text items that are denoted with special markers ('//' in C++ and '#' in many languages). The model has always been: code contains comments and the comments are denoted by specific characters or sequences.
Rmd inverts that model: You write a document and denote the code with special markers ('$$' for TeX and '```' for code). Instead of comments (small documents) in your code, you have code in your document.
R Studio runs your code -- all of it or a section that you specify -- and displays the results as part of your document. It is smart enough to pick through the document and identify the code.
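A rough sketch of an Rmd file shows the inversion. (The 'summary(cars)' chunk is the stock example from R Studio's default template; the title and the prose are made up.) The prose is the document; the fenced chunk is the code that R Studio runs, showing its output just below the chunk:

---
title: "A small analysis"
output: html_document
---

Some explanatory prose, written in Markdown, with *emphasis* where needed.

```{r}
summary(cars)   # R Studio runs this chunk and shows its output below
```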
The concept of code and documentation in one file is not exclusive to R Studio. There are other tools that do the same thing: Jupyter notebooks, Mozilla's Iodide, and Mathematica (possibly the oldest of the lot). Each allows for text and code, with output. Each also allows for sharing.
At a high level, these online IDEs do the same thing: Create a document, add code, see the results, and share.
Over the years, we've shared code through various means: physical media (punch cards, paper tape, magnetic tape, floppy disks), shared storage locations (network disks), and version-control repositories (CVS, Subversion, GitHub). All of these methods required some effort. The new online IDEs reduce that effort; there is no need to attach files to e-mail -- just send a link.
There are a few major inflection points in software development, and I believe that this is one of them. I expect the concept of mixing text and code and results will become popular. I expect the notion of sharing projects (the text, the code, the results) will become popular.
I don't expect all programs (or all programmers) to move to this model. Large systems, especially those with hard performance requirements, will stay in the traditional compile-deploy-run model with separate documentation.
I see this new model of document-code-results as a new form of programming, one that will open new areas. The document-code-results combination is a good match for sharing work and results, and is close in concept to academic and scientific journals (which contain text, analysis, and results of that analysis).
Programming languages have become powerful, and that supports this new model. A Fortran program for simulating water in a pipe required eight to ten pages; the Matlab language can perform the same work in roughly half a page. Modern languages are more concise and can present their ideas without the overhead of earlier computer languages. A small snippet of code is enough to convey a complex study. This makes them suitable for analysis and especially suitable for sharing code.
It won't be traditional programmers who flock to the document-code-results-share model. Instead it will be non-programmers who can use the analysis in their regular jobs.
The online IDE supports a project with these characteristics:
- small code
- multiple people
- output is easily visualizable
- sharing and enlightenment, not production
Tuesday, March 19, 2019
C++ gets serious
I'm worried that C++ is getting too ... complicated.
I am not worried that C++ is a dead language. It is not. The C++ standards committee has adopted several changes over the years, releasing new C++ standards. C++11. C++14. C++17 is the most recent. C++20 is in progress. Compiler vendors are implementing the new standards. (Microsoft has done an admirable job in the latest versions of its C++ compiler.)
But the changes are impressive -- and intimidating. Even the names of the changes are daunting:
- contracts, with preconditions and postconditions
- concepts
- transactional memory
- ranges
- networking
- modules
- concurrency
- coroutines
- reflection
- spaceship operator
Here is an example of range, which simplifies the common "iterate over a collection" loop:
int array[5] = { 1, 2, 3, 4, 5 };
for (int& x : array)
x *= 2;
This is a nice improvement. Notice that it does not use STL iterators; this is pure C++ code.
Somewhat more complex is an implementation of the spaceship operator:
template <typename T, typename U>
struct pair {
    T t;
    U u;
    // member-wise three-way comparison: compare t first, then u
    auto operator<=> (pair const& rhs) const
        -> std::common_comparison_category_t<
               decltype(std::compare_3way(t, rhs.t)),
               decltype(std::compare_3way(u, rhs.u))>
    {
        if (auto cmp = std::compare_3way(t, rhs.t); cmp != 0)
            return cmp;
        return std::compare_3way(u, rhs.u);
    }
};
That code seems... not so obvious.
The non-obviousness of code doesn't end there.
Look at two functions, one for value types and one for all types (value and reference types):
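The functions themselves are not important; assume something ordinary. (The declarations below are invented for illustration -- any overload set named foo, with one overload taking its argument by value and another by reference, fits the discussion, and vi is just a container to iterate over.)

#include <string>
#include <vector>

bool foo(int x);                  // overload for a simple value type
bool foo(const std::string& s);   // overload taking its argument by reference
std::vector<int> vi = { 1, 2, 3, 4, 5 };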
For simple value types, for our two functions, we can write the following code:
std::for_each(vi.begin(), vi.end(), [](auto x) { return foo(x); });
The most generic form:
#define LIFT(foo) \
    [](auto&&... x) \
        noexcept(noexcept(foo(std::forward<decltype(x)>(x)...))) \
        -> decltype(foo(std::forward<decltype(x)>(x)...)) \
    { return foo(std::forward<decltype(x)>(x)...); }
I will let you ponder that bit of "trivial" code.
Notice that the last example uses the #define macro to do its work, with '\' characters to continue the macro across multiple lines.
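The payoff, presumably, is that the macro can be dropped in wherever the hand-written lambda appeared. A call might look like this (a sketch, not code from the standards papers):

std::for_each(vi.begin(), vi.end(), LIFT(foo));   // wraps the entire overload set of foo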
* * *
I have been pondering that code (and more) for some time.
- C++ is becoming more capable, but also more complex. It is now far from the "C with Classes" that was the start of C++.
- C++ is not obsolete, but it is now a language for applications with specific needs. C++ does offer fine control over memory management and can provide predictable run-time performance, which are advantages for embedded applications. But if you don't need the specific advantages of C++, I see little reason to invest the extra effort to learn and maintain it.
- Development work will favor other languages, mostly Java, C#, Python, JavaScript, and Go. Java and C# have become the "first choice" languages for business applications; Python has become the "first choice" for one's first language. The new features of C++, while useful for specific applications, will probably discourage the average programmer. I'm not expecting schools to teach C++ as a first language again -- ever.
- There will remain a lot of C++ code, but C++'s share of "the world of code" will become smaller. Some of this is due to systems being written in other languages. But I'm willing to bet that the total lines of code for C++ (if we could measure it) is shrinking in absolute numbers.
All of this means that C++ development will become more expensive.
There will be fewer C++ programmers. C++ is not the language taught in schools (usually) and it is not the language taught in the "intro to programming" courses. People will not learn C++ as a matter of course; only those who really want to learn it will make the effort.
C++ will be limited to the projects that need the features of C++, projects which are larger and more complex. Projects that are "simple" and "average" will use other languages. It will be the complicated projects, the projects that need high performance, the projects that need well-defined (and predictable) memory management which will use C++.
C++ will continue as a language. It will be used on the high end projects, with specific requirements and high performance. The programmers who know C++ will have to know how to work on those projects -- amateurs and dabblers will not be welcome. If you are managing projects, and you want to stay with C++, be prepared to hunt for talent and be prepared to pay.
Thursday, March 14, 2019
A Relaxed Waterfall
We're familiar with the two development methods: Waterfall and Agile. Waterfall operates in a sequence of large steps: gather requirements, design the system, build the system, test the system, and deploy the system; each step must wait for the prior step to complete before it starts. Agile uses a series of iterations that each involve specifying, implementing and testing a new feature.
Waterfall's advantage is that it promises delivery on a specific date. Agile makes no such promise, but instead promises that you can always ship whatever you have built.
Suppose there was a third method?
How about a modified version of Waterfall: the normal Waterfall but no due date -- no schedule.
This may seem a bit odd, and even nonsensical. After all, the reason people like Waterfall is the big promise of delivery on a specific date. Bear with me.
If we change Waterfall to remove the due date, we can build a very different process. The typical Waterfall project runs through a number of phases (analysis, design, coding, etc.), and there is pressure, once a phase has been completed, never to go back. One cannot go back; the schedule demands that the next phase begin. Going back from coding, say, because you find ambiguities in the requirements, means spending more time in the analysis phase, and that will (most likely) delay the coding phase, which will then delay the testing phase, ... and the delays reach all the way to the delivery date.
But if we remove the delivery date, then there is no pressure of missing the delivery date! We can move back from coding to analysis, or from testing to coding, with no risk. What would that give us?
For starters, the process would be more like Agile development. Agile makes no promise about a specific delivery date, and neither does what I call the "Relaxed Waterfall" method.
A second effect is that we can now move backwards in the cycle. If we complete the first phase (Analysis) and start the second phase (Design) and then find errors or inconsistencies, we can move back to the first phase. We are under no pressure to complete the Design phase "on schedule" so we can restart the analysis and get better information.
The same holds for the shift from Design to the third phase (Coding). If we start coding and find ambiguities, we can easily jump back to Design (or even Analysis) to resolve questions and ensure a complete specification.
While Relaxed Waterfall may sound exactly like Agile, it has differences. We can divide the work into different teams, one team handling each phase. You can have a team that specializes in analysis and the documentation of requirements, a second team that specializes in design, a third team for coding, and a fourth team for testing. The advantage is that people can specialize; Agile requires that all team members know how to design, code, test, and deploy a product. For large projects the latter approach may be infeasible.
This is all speculation. I have not tried to manage a project with Relaxed Waterfall techniques. I suspect that my first attempt might fail. (But then, early attempts with traditional Waterfall failed, too. We would need practice.) And there is no proof that a project run with Relaxed Waterfall would yield a better result.
It was merely an interesting musing.
But maybe it could work.
Monday, March 4, 2019
There is no Linux desktop
Every year, Linux enthusiasts hope that the new year will be the "year of the Linux desktop", the year that Linux dethrones Microsoft Windows as the chief desktop operating system.
I have bad news for the Linux enthusiasts.
There is no Linux desktop.
More specifically, there is not one Linux desktop. Instead, there is a multitude. There are multiple Linux distributions ("distros" in jargon) and it seems that each has its own ideas about the desktop. Some emulate Microsoft Windows, in an attempt to make it easy for people to convert from Windows to Linux. Other distros do things their own (and presumably better) way. Some distros focus on low-end hardware, others focus on privacy. Some focus on forensics, and others are tailored for tinkerers.
Distributions include: Debian, Ubuntu, Mint, SuSE, Red Hat, Fedora, Arch Linux, Elementary, Tails, Kubuntu, CentOS, and more.
The plethora of distributions splits the market. No one distribution is the "gold standard". No one distribution is the leader.
Here's what I consider the big problem for Linux: The split market discourages some software vendors from entering it. If you have a new application, do you support all of the distros or just some? Which ones? How do you test all of the distros that you support? What do you do with customers who use distros that you don't support?
Compared to Linux, the choice of releasing for Windows and macOS is rather simple. Either you support Windows or you don't. (And by "Windows" I mean "Windows 10".) Either you support macOS or you don't. (The latest version of macOS.) Windows and macOS each provide a single platform, with a single installation method, and a single API. (Yes, I am simplifying here. Windows has multiple ways to install an application, but it is clear that Microsoft is transitioning to the Universal app.)
I see nothing to reduce the number of Linux distros, so this condition will continue. We will continue to enjoy the benefits of multiple Linux distributions, and I believe that to be good for Linux.
But it does mean that the Evil Plan to take over all desktops will have to wait.
Wednesday, February 27, 2019
R shows that open source permits poor quality
We like to think that open source projects are better than closed source projects. Not just cheaper, but better -- higher quality, more reliable, and easier to use. But while high quality and reliability and usability may be the result of some open source projects, they are not guaranteed.
Consider the R tool chain, which includes the R interpreter, the Rmd markdown language, the Rstudio IDE, and commonly used models built in R. All of these are open source, and all have significant problems.
The R interpreter is a large and complicated program. It is implemented in multiple programming languages: C, C++, Fortran -- and Java seems to be part of the build as well. To build R you need compilers for all of these languages, and you also need lots of libraries. The source code is not trivial; it takes quite a bit of time to compile the R source and get a working executable.
The time for building R concerns me less than the mix of languages and the number of libraries. R sits on top of a large stack of technologies, and a problem in any piece can percolate up and become a problem in R. If one is lucky, R will fail to run; if not, R will run and use whatever data happens to be available after the failure.
The R language itself has problems. It uses one-letter names for common functions ('t' to transpose a matrix, 'c' to combine values into a list) which means that these letters are not available for "normal" variables. (Or perhaps they are, if R keeps variables and functions in separate namespaces. But even then, a program would be confusing to read.)
R also suffers from too many data containers. One can have a list, which is different from a vector, which is different from a matrix, which is different from a data frame. The built-in libraries all expect data of some type, but woe to the programmer who uses one structure when a different one is expected. (Some functions do the right thing, and others complain about a type mismatch.)
Problems are not confined to the R language. The Rmd markdown language is another problem area. Rmd is based on Markdown, which has problems of its own. Rmd inherits these problems and adds more. A document in Rmd can contain plain text, markdown for text effects such as bold and underline, blocks of R code, and blocks of TeX. Rmd is processed into regular Markdown, which is then processed into the output form of your choice (PDF, HTML, MS-Word, and a boatload of other formats).
Markdown allows you to specify line breaks by typing two space characters at the end of a line. (Invisible markup at the end of a line! I thought 'make' had poor design with TAB characters at the front of lines.) Markdown also allows you to force a line break with a backslash at the end of a line, which is at least visible -- but Rmd removes this capability and requires the invisible characters at the end of a line.
The Rstudio IDE is perhaps the best of the different components of R, yet it too has problems. It adds packages as needed, but when it does it displays status messages in red, a color usually associated with errors or warnings. It allows one to create documents in R or Rmd format, asking for a name. But for Rmd documents, the name you enter is not the name of the file; it is inserted into the file as part of a template. (Rmd documents contain metadata in the document, and a title is inserted into the metadata.) When creating an Rmd document in Rstudio, you have to start the process, enter a name to satisfy the metadata, see the file displayed, and then save the file -- Rstudio then asks you (again) for a name -- but this time it is for the file, not the metadata.
The commonly used models (small or not-so-small programs written in R or a mix of languages) are probably the worst area of the R ecosystem. The models can perform all sorts of calculations, and the quality of the models ranges from good to bad. Some models, such as those for linear programming, use variables and formulas to specify the problem you want solved. But the variables of the model are not variables in R; do not confuse the two separate things with the same name. There are two namespaces (one for R and one for the model), and each namespace holds variables. The programmer must mentally keep the variables sorted. Referring to a variable in the wrong namespace yields the expected "variable not found" error.
Some models have good error messages, others do not. One popular model for linear programming, upon finding a variable name that has not been specified for the model's namespace, simply reports "A variable was specified that is not part of the model." (That's the entire message. It does not report the offending name, nor even a program line number. You have to hunt through your code to find the problem name.)
Given the complexity of R, the mix of languages in Rmd, the foibles of Rstudio, and the mediocre quality of commonly used extensions to R, I can say that R is a problem. The question is, how did this situation arise? All of the components are open source. How did open source "allow" such poor quality?
It's a good question, and I don't have a definite answer. But I do have an idea.
We've been conditioned to think of open source as a way to develop quality software, based on the commonly known successful open source projects: Linux, Perl, Python, Ruby, and LibreOffice. These projects are well-respected and popular with many users. They have endured; each is over ten years old. (There are other well-respected open source projects, too.)
When we think of these projects, we see success and quality. Success and quality are the result of hard work, dedication, and a bit of luck. These projects had all of those elements. Open source, by itself, is not enough to force a result of high quality.
These successful projects have been run by developers, and more importantly, for developers. That is certainly true of the early, formative years of Linux, and true for any open source programming language. I suspect that the people working on LibreOffice are primarily developers.
I believe that this second concept does not hold for the R ecosystem. I suspect that the people working on the R language, the Rmd markdown format, and especially the people building the commonly used models are first and foremost data scientists and analysts, and developers second. They are building R for their needs as data scientists.
(I omit Rstudio from this list. It appears that Rstudio is built to be a commercial endeavor, which means that their developers are paid to be developers. It makes status messages in red even more embarrassing.)
I will note that the successful open source projects have had an individual as a strong leader for the project. (Linux has Linus Torvalds, Perl has Larry Wall, etc.) I don't see a strong individual leading the R ecosystem or any component. These projects are -- apparently -- run by committee, or built by separate teams. It is a very Jeffersonian approach to software development, one which may have an effect on the quality of the result. (Again, this is all an idea. I have not confirmed this.)
Where does this leave us?
First, I am reluctant to trust any important work to R. There are too many "moving pieces" for my taste -- too many technologies, too many impediments to good code, too many things that can go wrong. The risks outweigh the benefits.
Second, in the long term, we may move away from R. The popularity of R is due not to the R language itself but to the ability to create (or use) linear programming models. Someone will create a platform for analysis, with the ability to define and run linear programming models. It will "just work", to borrow a phrase from Apple.
Moving away from R and the current toolchain is not guaranteed. The current tools may have achieved a "critical mass" of acceptance in the data science community, and the cost of moving to a different toolchain may be viewed as unacceptable. In that case, the data science community can look forward to decades of struggles with the tools.
The real lesson is that open source does not guarantee quality software. R and the models are open source, but the quality is... mediocre. High quality requires more than just open source. Open source may be, as the mathematicians say, "necessary but not sufficient". We should consider this when managing any of our projects. Starting a project as open source will not guarantee success, nor will converting an existing project to open source. A successful project needs more: capable management, good people, and a vision and definition of success.