Showing posts with label big data. Show all posts

Wednesday, December 14, 2016

Steps to AI

The phrase "Artificial Intelligence" (AI) has been used to describe computer programs that can perform sophisticated, autonomous operations, and it has been used for decades. (One wag puts it as "artificial intelligence is twenty years away... always".)

Along with AI we have the term "Machine Learning" (ML). Are they different? Yes, but the popular usages make no distinction. And for this post, I will consider them the same.

Use of the term waxes and wanes. The AI term was popular in the 1980s and it is popular now. One difference between the 1980s and now: we may have enough computing power to actually pull it off.

Should anyone jump into AI? My guess is no. AI has preconditions, things you should be doing before you start with a serious commitment to AI.

First, you need a significant amount of computing power. Second, you need a significant amount of human intelligence. With AI and ML, you are teaching the computer to make decisions. Anyone who has programmed a computer can tell you that this is not trivial.

It strikes me that the necessary elements for AI are very similar to the necessary elements for analytics. Analytics is almost the same as AI - analyzing large quantities of data - except it uses humans to interpret the data, not computers. Analytics is the predecessor to AI. If you're successful at analytics, then you are ready to move on to AI. If you haven't succeeded at analytics (or even attempted it), you're not ready for AI.

Of course, one cannot simply jump into analytics and expect to be successful. Analytics has its own prerequisites. Analytics needs data, the tools to analyze the data and render it for humans, and smart humans to interpret the data. If you don't have the data, the tools, and the clever humans, you're not ready for analytics.

But we're not done with levels of prerequisites! The data for analytics (and eventually AI) has its own set of preconditions. You have to collect the data, store the data, and be able to retrieve the data. You have to understand the data, know its origin (including the origin date and time), and know its expiration date (if it has one). You have to understand the quality of your data.

The steps to artificial intelligence are through data collection, metadata, and analytics. Each step has to be completed before you can advance to the next level. (Much like the Capability Maturity Model.) Don't make the mistake of starting a project without the proper experience in place.

Tuesday, December 15, 2015

The new real time

In the 1970s, people in IT pursued the elusive goal of "real time" computing. It was elusive because the term was poorly defined. With no clear objective, any system could be marketed as "real time", and marketing folks, recognizing the popularity of the term, did exactly that.

But most people didn't need (or want) "real time" computing. They wanted "fast enough" computing, which generally meant interactive computing (not batch processing) that responded to requests quickly enough that clerks and bank tellers could answer customers' questions in a single conversation. Once we had interactive computing, we didn't look for "real time" and interest in the term waned.

To be fair to "real time", there *is* a definition of it, one that specifies the criteria for a real-time system. But very few systems actually fall under those criteria, and only a few people in the industry actually care about the term "real time". (Those that do care about the term really do care, though.)

Today, we're pursuing something equally nebulous: "big data".

Lots of people are interested in big data. Lots of marketers are getting involved, too. But do we have a clear understanding of the term?

I suspect that the usage of the term "big data" will follow an arc similar to that of "real time", because the forces driving the interest are similar. Both "real time" and "big data" are poorly defined yet sound cool. Further, I suspect that, like "real time", most people looking for "big data" are really looking for something else. Perhaps they want better analytics ("better" meaning faster, more frequent, more interaction and drill-down capabilities, or merely prettier graphics) for business analysis. Perhaps they want cheaper data storage. Perhaps they want faster development times and fewer challenges with database management.

Whatever the reason, in a few years (I think less than a decade) we will not be using the term "big data" -- except for a few folks who really need it and who really care about it.

Tuesday, March 4, 2014

After Big Data comes Big Computing

The history of computing is a history of tinkering and revision. We excel at developing techniques to handle new challenges.

Consider the history of programming:

Tabulating machines

  • plug-boards with wires

Von Neumann architecture (mainframes)

  • machine language
  • assembly language
  • compilers (FORTRAN and COBOL)
  • interpreters (BASIC) and timeshare systems

The PC revolution (the IBM PC)

  • assembly language
  • Microsoft BASIC

The Windows age

  • Object-oriented programming
  • Event-driven programming
  • Visual Basic

Virtual machines

  • UCSD p-System
  • Java and the JVM

Dynamic languages

  • Perl
  • Python
  • Ruby
  • Javascript

This (severely abridged) list of hardware and programming styles shows how we change our technology. Our progress is not a smooth advance from one level to the present, but a series of jumps, some of them quite large. It was a large jump from plug-boards to memory-resident programs. It was another large jump to an assembler. One can argue that later jumps were larger or smaller, but those arguments are not important to the basic idea.

Notice that we do not know where things are going. We do not see the entire chain up front. In the 1950s, we did not know that we would end up here (in 2014) with dynamic languages and cloud computing. Often we cannot see the next step until it is upon us and only the best of visionaries can see past it.

Big Data is such a jump, enabled by cheap storage and cloud computing. That change in technology is upon us.

Big Data is the acquisition and storage (and use) of large quantities of data. Not just "lots of data" but mind-boggling quantities of data. Data that makes our current "very large" databases look small and puny. Data that contains not only financial transactions but server logs, e-mails, security videos, medical records, and sensor readings from just about any kind of device. (The sensor readings may be from building sensors for temperature, from vehicles for position and speed and engine performance, from packages in transit, from assembly lines, from gardens and parks for temperature and humidity, ... the list is endless.)

But what happens once we acquire and store these mind-boggling heaps of data?

The obvious solution is to do something with it. And we are doing something with it; we use tools like Hadoop to process and analyze and visualize it.

I think Hadoop (and its brethren) are a good start. We're at the dawn of the "Big Data Age", and we don't really know what we want -- in terms of analyses and tools. We have some tools, and they seem okay.

But this is just the dawn of the "Big Data Age". I think we will develop new techniques and tools to analyze our data. And, I suspect those tools and techniques will require lots of computation. So much computation that someone will coin the term "Big Computing" to represent the use of mind-boggling amounts of computing power.

Big Computing seems a natural follow-on to Big Data. And just as we have developed languages to handle new programming challenges, we will develop new languages for Big Computing.

We have two hints for programming in the era of Big Computing. One hint is cloud computing, with its ability to scale up as we need more power. We've already seen that programs for the cloud have a different organization than "classic" programs. Cloud programs use small modules connected by message queues. The modules hold no state, which allows the system to route transactions to any available module.
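
The shape described above can be sketched in a few lines. This is a minimal illustration using Python's standard library; in a real cloud system the queue would be a managed service, and all names here are invented for the example.

```python
import queue
import threading

# A shared message queue connects the modules.
messages = queue.Queue()
results = []

def handle(message):
    # A stateless module: everything it needs arrives in the message,
    # so any available worker can process any transaction.
    return {"id": message["id"], "total": sum(message["items"])}

def worker():
    while True:
        msg = messages.get()
        if msg is None:          # sentinel: shut this worker down
            break
        results.append(handle(msg))
        messages.task_done()

# Scale up by adding workers; none of them hold state between messages.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(5):
    messages.put({"id": i, "items": [i, i + 1]})

messages.join()                  # wait until every message is processed
for _ in threads:
    messages.put(None)
for t in threads:
    t.join()

print(sorted(r["id"] for r in results))   # [0, 1, 2, 3, 4]
```

Because the workers hold no state, the system could route any message to any worker, and adding a fourth worker would require no changes to the handler.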

The other hint is at the small end of the computing world, at the chip level. Here we see advances in processor design: more cores, more caching, more processing. The GreenArrays GA144 is a chip that contains 144 computers -- not cores, but computers. This is another contender for Big Computing.

I'm not sure what "Big Computing" and its programming will look like, but I am confident that they will be interesting!

Wednesday, January 29, 2014

The PC revolution was about infrastructure

Those of us who lived through the PC revolution like to think that PCs were significant advances in technology. They were advances, but in retrospect they were simply infrastructure.

Let's review the advances in PC technology:

Stand-alone PCs: The original PCs were brought in as replacements for typewriters and calculators. This was a tactical use of PCs, one that improved the efficiency of the company but did not change the internal organization or the products and services offered by the company. The PC, with only word processors and spreadsheets, is not strong enough to make a strategic difference for a company.

Databases: After some time, people figured out that PCs could be more than typewriters and calculators. There were custom PC applications, but more importantly there were the early databases (dBase II, dBase III, R:Base) and database languages (Clipper, Paradox) that let people store and retrieve data. These databases were single-user and stand-alone.

Networks: The original PC networks (Novell, Banyan, Corvus) were introduced to share resources such as disks and printers. Printers (especially letter-quality printers) were expensive. Disks (large disks, say 40 MB) were also expensive. Sharing a common resource made economic sense. But the early networks were LANs (Local Area Networks) and confined to a single building.

Servers: Initially part of "client/server systems", servers were database engines that handled requests from multiple clients. Client/server systems gave networks a significant purpose for existing: the ability to update a single database from multiple locations made it possible to migrate mainframe applications onto the cheaper PC platform.

The Internet: Connecting networks made it possible for businesses to exchange information. The first big use of internet connections was e-mail; calendars followed quickly. Strictly speaking, the Internet is not a PC technology -- it was built mostly with minicomputers and Unix. The sockets libraries (WinSock) for PCs made the Internet accessible.

Web servers: Built on these previous layers, the web is (now) a combination of PCs, minicomputers, mainframes, and rack-mounted servers. It is this layer that enables strategic as well as tactical advantages. Companies can provide self-service web pages (more of a tactical change, I think) and new services (strategic). New companies can form (Facebook, Twitter).

Virtualization: The true advantage of virtualization is not consolidation of servers, but the ability to create or destroy machines quickly.

Cloud computing: Once virtual machines were available and cheap, we created the cloud paradigm. Using an array of virtual computers, we can design applications that are distributed across multiple servers and are resistant to failure of any one of those servers. The distribution of work allows for scaling (up or down) as needed, adding or removing servers to handle the current load.

All of these technologies are now infrastructure. They are well-understood and easily available.

New technologies are plugging in to this infrastructure. Smartphones, tablets, and big data are all sitting on top of this (impressive) stack of technology. Smartphone and tablet apps use low-wattage user interfaces and connect to cloud computing systems for processing. Big data systems use a similar design, with cloud computing engines providing the data for visualization software on PCs (or in web browsers).

When we built the first microcomputers, when we installed DOS on PCs, when we used modems to connect to bulletin-board systems, we thought we were creating the crest of technology. We thought we were building the top dog. But it didn't turn out that way. The PC and its later technologies let us build a significant computing stack.

Now that we have that stack, I think we can discard the traditional PC. The next decade should see the replacement of PCs. Not all at once, and at different rates in different environments. I expect PCs to exist in businesses for quite some time.

But disappear they shall.

Thursday, December 26, 2013

Big data is about action, and leadership

Big data. One of the trends of the year. A new technology that brings new opportunities for businesses and organizations that use it.

Big data also brings challenges: the resources to collect and store large quantities of data, the tools to analyze and present large quantities of data, and the ability to act on that data. That last item is the most important.

Collecting data, storing it, and analyzing it are tasks that mostly consist of technology, and technology is easily available. Data storage (through SAN or NAS or cloud-based storage) is a matter of money and equipment. Collecting data may be a little harder, since you must decide on what to collect and then you must make the programming changes to perform the actual collection -- but those are not that hard.

Analyzing data is also mostly a matter of technology. We have the computing hardware and the analytic software to "slice and dice" data and serve it up in graphs and visualizations.

The hard part of big data is none of the above. The hard part of big data is deciding a course of action and executing it. Big data gives you information. It gives you insight. And I suspect that it gives you those things faster than your current systems.

It's one thing to collect the data. It's another thing to change your procedures and maybe even your business. Collecting the data is primarily a technology issue. Changing procedures is often a political one. People are (often) reluctant to change. Changes to business plans may shift the balance of power within an organization. Your co-workers may be unwilling to give up some of that power. (Of course, others may be more than happy to gain power.)

The challenge of big data is not in the technology but in the changes driven by big data and the leadership for those changes. Interpreting data, deciding on changes, executing those changes, and repeating that cycle (possibly more frequently than before) is the payoff of big data.

Thursday, October 10, 2013

Hadoop shows us a possible future of computing

Computing has traditionally been processor-centric. The classic model of computing has a "central processing unit" which performs computations. The data is provided by "peripheral devices", processed by the central unit, and then routed back to peripheral devices (the same as the original devices or possibly others). Mainframes, minicomputers, and PCs all use this model. Even web applications use this model.

Hadoop changes this model. It is designed for Big Data, and the size of data requires a new model. Hadoop stores your data in segments across a number of servers -- with redundancy to prevent loss -- with each segment being 64MB to 2GB. If your data is smaller than 64MB, moving to Hadoop will gain you little. But that's not important here.

What is important is Hadoop's model. Hadoop moves away from the traditional computing model. Instead of a central processor that performs all calculations, Hadoop leverages servers that can hold data and also perform calculations.

Hadoop makes several assumptions:

  • The code is smaller than the data (or a segment of data)
  • Code is transported more easily than data (because of size)
  • Code can run on servers

With these assumptions, Hadoop builds a new model of computing. (To be fair, Hadoop may not be the only package that builds this new model of distributed processing -- or even the first. But it has a lot of interest, so I will use it as the example.)

All very interesting. But here is what I find more interesting: the distributed processing model of Hadoop can be applied to other systems. Hadoop's model makes sense for Big Data, and systems with Little (that is, not Big) data should not use Hadoop.

But perhaps smaller systems can use the model of distributed processing. Instead of moving data to the processor, we can store data with processors and move code to the data. A system could be constructed from servers holding data, connected with a network, and mobile code that can execute anywhere. The chief tasks then become identifying the need for code and moving code to the correct location.
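
The idea of moving code to the data can be sketched in miniature. This is an illustration of the model, not Hadoop's actual API; the class and function names are invented.

```python
# A toy sketch of "moving code to the data": each node holds its own
# segment of the data, and we ship a small function to every node
# instead of shipping the data to one central processor.

class DataNode:
    def __init__(self, segment):
        self.segment = segment          # this node's slice of the data

    def run(self, code):
        # "Receive" the code and execute it locally, next to the data.
        return code(self.segment)

# The data stays distributed across the nodes...
nodes = [DataNode([1, 2, 3]), DataNode([4, 5]), DataNode([6, 7, 8, 9])]

# ...and only the code travels: here, a per-segment sum.
partial_sums = [node.run(lambda segment: sum(segment)) for node in nodes]

# A final, small combining step runs wherever the partial results land.
print(sum(partial_sums))   # 45
```

The interesting engineering problems are exactly the ones named above: deciding which code is needed where, and moving it there.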

That would give us a very different approach to system design.

Wednesday, September 18, 2013

Big Data proves the value of open source

Something significant happened with open source software in the past two years. An event that future historians may point to and say "this is when open source software became a force".

That event is Big Data.

Open source has been with us for decades. Yet for all the technologies we have, from the first plug-board computers to smart phones, from the earliest assemblers to the latest language compilers, from the first IDE to Visual Studio, open source software has always copied the proprietary tools. Open source tools have always been implementations of existing ideas. Linux is a functional copy of Unix. The open source compilers and interpreters are for existing languages (C, C++, Fortran, Java). LibreOffice and Open Office are clones of Microsoft Office. Eclipse is an open source IDE, an idea that predates the IBM PC.

Yes, the open source versions of these tools have their own features and advantages. But the ideas behind these tools, the big concepts, are not new.

Big Data is different. Big Data is a new concept, a new entry in the technology toolkit, and its tools are (just about) all open source. Hadoop, NoSQL databases, and many analytics tools are open source. Commercial entities like Oracle and SAS may claim to support Big Data, but their support seems less "Big Data" and more "our product can do that too".

A few technologies came close to being completely open source. Web servers are mostly open source, with stiff competition from Microsoft's (closed source) IIS. The scripting languages (Perl, Python, and Ruby) are all open source, but they are extensions of languages like AWK and the C Shell, which were not initially open source.

Big Data, from what I can see, is the first "new concept" technology that has a clear origin in open source. It is the proof that open source can not only copy existing concepts, but introduce new ideas to the world.

And that is a milestone that deserves recognition.

Sunday, September 8, 2013

The coming problem of legacy Big Data

With all the fuss about Big Data, we seem to have forgotten about the problems of legacy Big Data.

You may think that Big Data is too new to have legacy problems. Legacy problems affect old systems, systems that were designed and built by Those Who Came Before And Did Not Know How To Plan For The Future. Big Data cannot possibly have those kinds of problems, because 1) the systems are new, and 2) they have been built by us.

Big Data systems are new, which is why I say that the problems are coming. The problems are not here now. But they will arrive, in a few years.

What kind of problems? I can think of several.

Data formats: Newer tools (or newer versions of existing tools) change the formats of data and cannot read old formats. (For example, Microsoft Excel, which cannot read Lotus 1-2-3 files.)

Data value codes: Values used in data to encode specific ideas (account codes, product categories, status codes) change over time. The problem is not that you cannot read the files, but that the values mean things other than what you think.

Missing or lost data: Non-Big Data (should that be "Small Data"?) can be easily stored in version control systems or other archiving systems. Big Data, by its nature, doesn't fit well in these systems. Without an easy way to back up or archive Big Data, many shops will take the easy way and simply not make copies.

Inconsistent data: Data sets of any size can hold inconsistencies. Keeping traditional data sets consistent requires discipline and proper tools. Finding inconsistencies in larger data sets is a larger problem, requiring the same discipline and mindset but perhaps more capable tools.

In short, the problems of legacy Big Data are the same problems as legacy Small Data.

The savvy shops will be prepared for these problems. They will put the proper checks in place to identify inconsistencies. They will plan for changes to formats. They will ensure that data is protected with backup and archive copies.
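
One such check, for the value-code problem, is simple to sketch: validate incoming records against the currently known codes, so that drifting codes are caught when they appear rather than years later. The field names and codes below are invented for illustration.

```python
# The set of status codes this shop currently recognizes.
KNOWN_STATUS_CODES = {"A", "C", "P"}   # active, closed, pending

def find_unknown_codes(records):
    """Return the records whose status code is not in the known set."""
    return [r for r in records if r["status"] not in KNOWN_STATUS_CODES]

records = [
    {"id": 1, "status": "A"},
    {"id": 2, "status": "X"},   # a code nobody remembers defining
    {"id": 3, "status": "P"},
]

print([r["id"] for r in find_unknown_codes(records)])   # [2]
```

Run against each incoming batch, a check like this turns a silent legacy-data problem into a visible, same-day one.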

In short, the solutions to the problems of legacy Big Data are the same solutions to the problems of legacy Small Data.

Monday, May 6, 2013

A Risk of Big Data: Armchair Statisticians

In the mid-1980s, laser printers became affordable, word processor software became more capable, and many people found that they were able to publish their own documents. They proceeded to do so. Some showed restraint in the use of fonts; others created documents that were garish.

In the mid-1990s, web pages became affordable, web page design software became more capable, and many people found that they were able to create their own web sites. They proceeded to do so. Some showed restraint in the use of fonts, colors, and the blink tag; others created web sites that were hideous.

In the mid-2010s, storage became cheap, data became collectable, analysis tools became capable, and I suspect many people will find that they are able to collect and analyze large quantities of data. I further predict that many will do so. Some will show restraint in their analyses; others will collect some (almost) random data and create results that are less than correct.

The biggest risk of Big Data may be the amateur. Professional statisticians understand the data, understand the methods used to analyze the data, and understand the limits of those analyses. Armchair statisticians know enough to analyze the data but not enough to criticize the analysis. This is a problem because it is easy to misinterpret the results.

Typical errors are:

  • Omitting relevant data (or including irrelevant data) due to incorrect "select" operations.
  • Identifying correlation as causation. (In an economic downturn, the unemployment rate increases, as do payments for unemployment insurance. But the UI payments do not cause the unemployment rate; both are driven by the economy.)
  • Identifying the reverse of a causal relationship (Umbrellas do not cause rain.)
  • Improper summary operations (Such as calculating an average of a quantized value like processor speed. You most likely want either the median or the mode.)

It is easy to make these errors, which is why the professionals take such pains to evaluate their work. Note that none of these are obvious in the results.
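
The last error in the list is easy to demonstrate with Python's standard library. The processor speeds below are invented figures for illustration.

```python
from statistics import mean, median, mode

# Invented processor speeds (GHz) for a small fleet of machines.
# The values are quantized: only a few distinct speeds exist.
speeds = [2.4, 2.4, 2.4, 2.4, 3.2, 3.2, 4.0]

print(round(mean(speeds), 3))   # 2.857 -- a speed no machine actually has
print(median(speeds))           # 2.4
print(mode(speeds))             # 2.4
```

The mean is a perfectly computable number that describes no machine in the fleet; the median and mode describe the typical machine. Nothing in the pretty output of the mean warns you of this.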

When the cost of performing these analyses was high, only the professionals could play. The cost of such analyses is dropping, which means that amateurs can play. And their results will look (at first glance) just as pretty as the professionals'.

In desktop publishing and web page design, it was easy to separate the professionals and the amateurs. The visual aspects of the finished product were obvious.

With big data, it is hard to separate the two. The visual aspects of the final product do not show the workmanship of the analysis. (They show the workmanship of the presentation tool.)

Be prepared for the coming flood of presentations. And be prepared to ask some hard questions about the data and the analyses. It is the only way you will be able to tell the wheat from the chaff.

Sunday, February 17, 2013

Losing data in the cloud of big data

NoSQL databases have several advantages over traditional SQL databases -- in certain situations. I think most folks agree that NoSQL databases are better for some tasks, and SQL databases are better in others. And most discussions about Big Data agree that NoSQL is the tool for Big Data databases.

One aspect that I have not seen discussed is auditing. That is, knowing that we have all of the data we expect to have. Traditional data processing systems (accounting, insurance, banking, etc.) have lots of checks in place to ensure that all transactions are processed and none are lost.

These checks and audits were put in place over a long time. I suspect that each error, when detected, was reviewed and a check was added to prevent such errors, or at least detect them early.

Do we have these checks in our Big Data databases? Is it even possible to build the checks for accountability? Big Data is, by definition, big. Bigger than normal, and bigger than one can conveniently inventory. Big Data can also contain things that are not always auditable. We have the techniques to check bank accounts, but how can we check something non-numeric such as photographs, tweets, and Facebook posts?

On the other hand, there may be risks from losing data, or subsets of data. Incomplete datasets may contain bias, a problem for sampling and projections. How can you trust your data if you don't have the checks in place?

Tuesday, July 17, 2012

How big is "big"?

A recent topic of interest in IT has been "big data", sometimes spelled with capitals: "Big Data". We have no hard and fast definition of big data, no specific threshold to cross from "data" to "big data". Does one terabyte constitute "big data"? If not, what about one petabyte?

This puzzle is similar to the question of "real time". Some systems must perform actions in "real time", yet we do not have a truly standard definition of them. If I design a dashboard system for an automobile and equip the automobile with sensors that report data every two seconds, then a real-time dashboard system must process all of the incoming data, by definition. Should I replace the sensors with units that report data every 1/2 second and the dashboard cannot keep up with the faster rate, then the system is not "real time".

But this means that the definition of "real time" depends not only on the design of the processing unit, but also the devices to which it communicates. The system may be considered "real time" until we change a component, then it is not.
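
The dashboard example reduces to a single comparison, which makes the dependence on the peripheral devices explicit. This is a sketch of the reasoning above, not a standard definition; the numbers echo the example's two-second and half-second sensors.

```python
def is_real_time(processing_seconds, sensor_interval_seconds):
    # A system is "real time" for a given configuration only if it can
    # finish processing one reading before the next reading arrives.
    return processing_seconds <= sensor_interval_seconds

# The same dashboard code, judged against two sensor configurations:
print(is_real_time(0.9, 2.0))   # True: it keeps up with 2-second sensors
print(is_real_time(0.9, 0.5))   # False: half-second sensors outrun it
```

Nothing about the processing code changed between the two calls; only the peripheral changed, and with it the verdict.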

I think that the same logic holds for "big data" systems. Today, we consider multiple petabytes to be "big data". Yet in 1990, when PCs had disks of 30 megabytes, a data set of one gigabyte would be considered "big data". And in the 1960s, a data set of one megabyte would be "big data".

I think that, in the end, the best we can say is that "big" is as big as we want to define it, and "real time" is as fast as we want to define it. "Big data" will always be larger than the average organization can comfortably handle, and "real time" will always be fast enough to process the incoming transactions.

Which means that we will always have some systems that handle big data (and some that do not), and some systems that run in real time (and some that do not). Using the terms properly will rely not on the capabilities of the core components alone, but on our knowledge of the core and peripheral components. We must understand the whole system to declare it to be "big data" or "real time".