Tuesday, November 28, 2023

Today's AI means QA for data

Some time ago, I experimented with n-grams. N-gram text generation is a technique that reads an existing text and produces a second text that is similar but not the same. It splits the original text into overlapping pieces: 2-grams are two letters long, 3-grams are three letters long, and so on. It counts how often each combination of letters occurs and then generates new text, choosing each letter according to how often it followed the preceding letters in the original.

For 2-grams, the word 'every' is split into 'ev', 've', 'er', and 'ry'. When generating text, the program sees that 'e' is followed by either 'v' or 'r' and builds text with that same pattern. That's with an input of one word. With a larger input, the letter 'e' is followed by many different letters, each with its own frequency.
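The 2-gram example can be sketched in a few lines of Python. This is a minimal illustration, not the original program (which I believe was in C); the function names `build_followers` and `generate` are made up for this sketch.

```python
import random
from collections import Counter, defaultdict

def build_followers(text, n=2):
    """Count which letter follows each (n-1)-letter prefix in the text."""
    followers = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        prefix = text[i:i + n - 1]
        next_letter = text[i + n - 1]
        followers[prefix][next_letter] += 1
    return followers

def generate(followers, start, length, n=2, rng=random):
    """Extend `start` one letter at a time, weighted by observed frequency."""
    out = start
    while len(out) < length:
        counts = followers.get(out[-(n - 1):])
        if not counts:
            break  # this prefix was never followed by anything in the input
        letters = list(counts)
        weights = [counts[c] for c in letters]
        out += rng.choices(letters, weights=weights)[0]
    return out

table = build_followers("every")
print(dict(table["e"]))  # {'v': 1, 'r': 1}
print(generate(table, "e", 10))
```

With only the word 'every' as input, the generated text can contain nothing but the pairs 'ev', 've', 'er', and 'ry', and it stops as soon as it reaches 'y', because nothing follows 'y' in the input.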

Using a program (in C, I believe) that read text, split it into n-grams, and generated new text, I experimented with names of friends. I gave the program a list of names and the program produced a list of names that were recognizable as names, but not the names of the original list. I was impressed, and considered it pretty close to magic.

It strikes me that the AI model ChatGPT uses a similar technique, but with words instead of individual letters. Given a large input, or rather a condensation of the frequencies of words in that input (the 'weights'), it can generate text using the frequencies of words that follow other words.

There is more to ChatGPT, of course, as the output is not simply random text but text about a specified topic. But let's focus on the input data, the "training text". That text is half of what makes ChatGPT possible. (The other half being the code.)

The training text enables, and also limits, the text generated by ChatGPT. If the training text (used to create the weights) were limited to Shakespeare's plays and sonnets, for example, any output from ChatGPT would strongly resemble Shakespeare's work. Or if the training text were limited to the Christian Bible, then the output would be in the style of the Bible. Or if the training text were limited to lyrics of modern songs, then the output would be... you get the idea.
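A word-level version of the n-gram idea shows the limit in miniature. This is a toy sketch of the analogy only; it is not how ChatGPT is implemented (ChatGPT is a neural network, not a frequency table), and the names `train` and `generate_words` are invented for the example.

```python
import random
from collections import Counter, defaultdict

def train(words):
    """Count which word follows each word in the training text."""
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def generate_words(table, start, max_words, seed=0):
    """Build a word sequence, picking each next word by observed frequency."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_words:
        counts = table.get(out[-1])
        if not counts:
            break  # the last word was never followed by anything
        candidates = list(counts)
        weights = [counts[w] for w in candidates]
        out.append(rng.choices(candidates, weights=weights)[0])
    return out

training_text = "to be or not to be that is the question".split()
table = train(training_text)
output = generate_words(table, "to", 8)
print(" ".join(output))

# Every generated word comes from the training text; nothing else is possible.
assert set(output) <= set(training_text)
```

No matter how long it runs, this generator cannot produce a word, or a pair of adjacent words, that is absent from its training text.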

The key point is this: The output of ChatGPT (or any current text-based AI engine) is defined by the training text.

Therefore, any user of text-based AI should understand the training text for the AI engine. And this presents a new aspect of quality assurance.

For the entire age of automated data processing, quality assurance has focussed on code. The subject of scrutiny has been the program. The input data has been important, but generally obtained from within the organization or from reputable sources. It was well understood and considered trustworthy.

And for the entire age of automated data processing, the tests have been pointed at the program and the data that it produces. All of the procedures for tests have been designed for the program and its output. There was little consideration given to the input data, and almost no tests for it. (With the possible exception of completeness of input data, and input sets for unusual cases.)

I think that this mindset must change. We must now understand and evaluate the data that is used to train AI models. Is the data appropriate for our needs? Is the data correct? Is it marked with the correct metadata?

With a generally-available model such as ChatGPT, where one neither controls the training data nor has visibility into it, such analyses are not possible. We have to trust that the administrators of ChatGPT have the right data.

Even with self-hosted AI engines, where we control the training data, the effort is significant. The work includes collecting the data, verifying its provenance, marking it with the right metadata, updating it over time, and removing it when it is no longer appropriate.
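Checks of this kind could be automated in the same way that we automate tests for code. The following is a rough sketch under stated assumptions: the `TrainingDocument` fields, the `review_by` date, and the lists of trusted sources and allowed topics are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TrainingDocument:
    text: str
    source: str      # provenance: where the text came from (hypothetical field)
    topic: str       # metadata label (hypothetical field)
    added: date      # when it entered the collection
    review_by: date  # when it should be re-evaluated or removed

def check(doc, trusted_sources, allowed_topics, today):
    """Return a list of problems found in one training document."""
    problems = []
    if not doc.text.strip():
        problems.append("empty text")
    if doc.source not in trusted_sources:
        problems.append("unverified source")
    if doc.topic not in allowed_topics:
        problems.append("unknown topic label")
    if doc.review_by < today:
        problems.append("past review date")
    return problems

doc = TrainingDocument(
    text="Shall I compare thee to a summer's day?",
    source="unknown-website",
    topic="poetry",
    added=date(2020, 1, 15),
    review_by=date(2022, 1, 15),
)
problems = check(doc, {"in-house-archive"}, {"poetry", "drama"}, date(2023, 11, 28))
print(problems)  # ['unverified source', 'past review date']
```

A real collection would need richer checks than these, but the shape is the same as a unit test: a record goes in, and a list of problems comes out.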

It strikes me that the work is somewhat similar to that of a librarian managing the books in a library. New books must be added (and catalogued), and old books must be removed.

Perhaps we will see "Data Librarian" as a new job title.