Friday, January 09, 2009

Prototype, Proof-of-Concept, and Pilot. Oh my!

These terms are often used in IT contexts without much consideration of their nuances. Whatever you call it, the first important thing is that you understand and state your objective. The second is that you meet it.

My suggested definitions.

Pilot:
This is an implementation of a system that is often functionally complete. It is typically deployed in production, but usually constrained to a small number of users. Although we hope everything is perfect, there is an expectation that there will be faults that require rework - otherwise we could have just gone to full production. The fault may be in deployment, code, design, or user experience. A pilot is typically time-boxed. A pilot is usually fully productionalized from an operational perspective.

Beta:
Similar in many aspects to Pilot. There is a lesser expectation that it is functionally complete, but it typically is. With a Beta, there is a much more explicit understanding that it is not final; it will change in the final release. A Beta is often supported, if at all, by a different organization than a production instance, typically the developers. Traditionally a beta was not to be used for production; however, some companies are making it part of their normal process - the never-ending beta, something that Google has done many times.

Release Candidate:
A software build that might be viewed as sitting between Beta and Pilot. A release candidate may be promoted to pilot or production.

Proof of Concept:
I view this as a very narrow and well-defined activity. There is a well-defined concept. The objective of the activity is to prove that the concept is viable in some aspect. Functionally it is only complete enough to meet the objective. The resulting code is not intended to be used for anything else - although most programmers will harvest some aspects for other things. Agile methods talk of 'early pain'. Significant projects have technology aspects that will be challenging and a source of risk. It is desirable to execute on those aspects first; if there is a problem, you want to know about it early so that you can change your plans or maybe cut your losses. This is what a proof of concept is about: if you are going to fail, fail early. Many times I have seen people propose a 'proof of concept' with no idea what concept they wish to prove; often what they mean is that they want to start coding. PoCs are not implemented in production.

Prototype:
Perhaps this one has the most varied definitions: Experimental Prototype, Engineering Prototype, etc. In software development, a prototype is a rudimentary working model of a product or information system, usually built for demonstration purposes or as part of the development process. As part of an SDLC approach, a simple version of the system is built, tested, and then reworked iteratively until ready for use. Prototypes are not usually implemented in production. Go read the Wikipedia article: http://en.wikipedia.org/wiki/Prototype

Do you agree? Are there any characteristics of these terms that you think would help define them?

Monday, January 05, 2009

Unstructured, Semi-Structured, and Structured Data

I originally wrote this over two years ago and have intended to post it ever since. Not too late.

Often in the IT world we hear or even use these terms. But what do they really mean? Here is my view.

All the bits and bytes that we deal with in the IT world we consider to be data (at least). It all has some form of syntax and structure, so what do we really mean, and why is it useful to distinguish between them?

These three classifications form a continuum, from unstructured to structured, that describes the degree to which the data's semantic model (meaning) matches our processing requirements. In general, what we are trying to describe is the readiness of the data to be processed in a particular business context.

For example, if the data in question is the raw audio recordings from the call centre, and the business context is that we need to review all verbal instructions spoken by customer "Joe Smith" last year over the phone, we may consider those recordings to be unstructured. We have no easy way to process the request.

If we have augmented those recordings with additional data from other systems and have added customer number and call timestamp to the recordings (or an index) then we would consider that to be semi-structured data. Although we could quickly sift through the millions of minutes of recordings to get Joe's subset, somebody would still have to listen to the recordings to find the things that Joe said.

The structured data, in this case, could be represented by the actual transaction records that the call centre agent created in response to Joe's instructions.
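
To make the semi-structured step concrete, here is a minimal sketch of the kind of index described above. Everything in it is an assumption for illustration: the `RecordingIndexEntry` type, the customer number "C-1001" standing in for Joe, and the file paths are all hypothetical. The point is only that the index narrows the search; the audio itself is still opaque to the program.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RecordingIndexEntry:
    """One row of a hypothetical index laid over the raw recordings."""
    customer_number: str  # added from another system
    call_start: datetime  # call timestamp
    audio_path: str       # the recording itself stays unstructured


def recordings_for(index, customer_number, year):
    """Return the index entries for one customer in one calendar year."""
    return [
        entry for entry in index
        if entry.customer_number == customer_number
        and entry.call_start.year == year
    ]


# Made-up sample data; "C-1001" plays the part of Joe Smith.
index = [
    RecordingIndexEntry("C-1001", datetime(2008, 3, 14, 9, 30), "calls/a.wav"),
    RecordingIndexEntry("C-2002", datetime(2008, 3, 14, 9, 45), "calls/b.wav"),
    RecordingIndexEntry("C-1001", datetime(2007, 11, 2, 16, 5), "calls/c.wav"),
]

joes_calls = recordings_for(index, "C-1001", 2008)
for entry in joes_calls:
    # Somebody still has to listen to each of these files.
    print(entry.call_start, entry.audio_path)
```

The filter finds Joe's calls quickly, but nothing here can tell us what Joe actually said, which is why I would still call this semi-structured.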

Likewise, a TIFF image might be considered structured data within the context of a GIS application (geographic information system), but might be considered unstructured within a mortgage appraisal application. (Perhaps even GIS would consider it unstructured, since they might ideally wish to run queries over an image set to find all lakes larger than a certain size. That would be hard on untagged TIFF images.)

All the data we typically deal with has a known syntax, even if that syntax is only really understood by MS Word. And although a Word document may have semantic meaning to a human, that semantic meaning is not easily extracted by a computer. We consider a Word document to be unstructured (in most cases).

An Excel spreadsheet may have a well-defined layout of rows and columns. Although Excel may find it easy to 'understand' its content, other programs may or may not. If the layout is regular and complete, programs other than Excel may be able to extract that data from the spreadsheet and do useful things with it. We would consider that to be semi-structured. I suggest that the 'semi' aspect of the term introduces the concept of a degree of uncertainty. Perhaps this is because its source is not well controlled and the form (layout) may change, at which point it suddenly becomes unstructured in our context.
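
As a rough illustration of that uncertainty, here is a small sketch that reads a spreadsheet exported as CSV text. The column names `customer` and `amount`, the function `extract_amounts`, and the sample data are all my assumptions, not part of any real system. It works while the layout holds, reports the rows it cannot interpret, and gives up when the layout changes.

```python
import csv
import io

# Hypothetical layout we are assuming the spreadsheet follows.
EXPECTED_COLUMNS = {"customer", "amount"}


def extract_amounts(csv_text):
    """Return (rows, problems) for a spreadsheet exported as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        # The layout changed: in our context this is now unstructured.
        return [], [f"missing columns: {sorted(missing)}"]

    rows, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            rows.append((row["customer"], float(row["amount"])))
        except (TypeError, ValueError):
            # A human may need to look at this one.
            problems.append(f"line {line_no}: cannot read amount {row['amount']!r}")
    return rows, problems


regular = "customer,amount\nJoe Smith,120.50\nAnn Lee,n/a\n"
changed = "name,total\nJoe Smith,120.50\n"

rows, problems = extract_amounts(regular)
changed_rows, changed_problems = extract_amounts(changed)
```

With the regular layout we get one usable row and one problem to hand to a person; with the changed layout we get nothing usable at all.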

Structured data has an aspect of surety about it. We know that there are 'fields', we know where they are, and we know what values to expect. We know how to understand it. We expect there to be some kind of formal model which defines this structure, and we expect that there will be controls in place that enforce our expectations. We may often visualise such data as being in relational form, stored in an RDBMS. But that is not a requirement.

All that said, here are my definitions:

Unstructured Data: Data which does not have the appropriate semantic structure to allow for computer processing within a particular business context.

Semi-structured Data: Data which has some form of semantic structure that would allow for a degree of computer processing within a particular business context, but which may need some human assistance. The processing may apply some heuristics, but it may fail due to volatility of the structure or incorrect assumptions about the structure.

Structured Data: Data which is well positioned to be reliably processed by computer within a particular business context. It has a well-defined and rigorously controlled syntactic and semantic structure. The elements of the data have a well-defined datatype and rules about valid values and ranges. The meaning of these data elements is well understood, both in isolation and in their relationships to other elements. Elements are also traceable to their originating sources, and that path is verifiable.
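
One way to picture the structured end of the continuum is a record type that enforces its own rules. The sketch below is only an illustration under my own assumptions: the `Transaction` type, its fields, the allowed `kind` values, and the `source_system` field used for traceability are all hypothetical, and a real system would more likely rely on a database schema and its constraints.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Hypothetical set of valid values for the 'kind' field.
VALID_KINDS = {"deposit", "withdrawal"}


@dataclass(frozen=True)
class Transaction:
    customer_number: str
    kind: str
    amount: Decimal
    booked_on: date
    source_system: str  # where the record came from (traceability)

    def __post_init__(self):
        # Datatype rule
        if not isinstance(self.amount, Decimal):
            raise TypeError("amount must be a Decimal")
        # Valid-value rule
        if self.kind not in VALID_KINDS:
            raise ValueError(f"unknown kind: {self.kind!r}")
        # Range rule
        if self.amount <= 0:
            raise ValueError("amount must be positive")
        # Traceability rule
        if not self.source_system:
            raise ValueError("source_system is required")


ok = Transaction("C-1001", "deposit", Decimal("250.00"),
                 date(2008, 3, 14), "call-centre")

try:
    Transaction("C-1001", "refund", Decimal("10.00"),
                date(2008, 3, 14), "call-centre")
except ValueError as err:
    print("rejected:", err)
```

The controls live with the data, so any program that receives a `Transaction` can rely on the fields being present, typed, and within range.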