Monday, January 05, 2009

Unstructured, Semi-Structured, and Structured Data

I originally wrote this over two years ago and have intended to post it ever since. Not too late.

Often in the IT world we hear or even use these terms. But what do they really mean? Here is my view.

All the bits and bytes that we deal with in the IT world we consider to be data (at least). It all has some form of syntax and structure, so what do we really mean by these terms, and why is it useful to distinguish between them?

These three classifications represent a continuum, from unstructured to structured, that describes the degree to which the data's semantic model (its meaning) matches our processing requirements. In general, what we are trying to describe is the readiness of the data to be processed in a particular business context.

For example, if the data in question is the raw audio recordings from the call centre, and the business context is that we need to review all verbal instructions spoken over the phone by customer "Joe Smith" last year, we may consider those recordings to be unstructured. We have no easy way to process the request.

If we have augmented those recordings with additional data from other systems, adding the customer number and call timestamp to the recordings (or to an index), then we would consider that to be semi-structured data. Although we could quickly sift through the millions of minutes of recordings to get Joe's subset, somebody would still have to listen to the recordings to find the things that Joe said.
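To make the "index" idea concrete, here is a minimal sketch in Python. The names (RecordingIndexEntry, recordings_for), the customer numbers, and the file paths are all hypothetical and invented for illustration. The index lets a program narrow the search to Joe's calls, but the audio behind each path is still opaque to it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RecordingIndexEntry:
    """One hypothetical index row describing a call recording."""
    customer_number: str
    call_started: datetime
    audio_path: str  # the audio itself remains opaque to the program


def recordings_for(index, customer_number, year):
    """Return the index entries for one customer in one calendar year."""
    return [
        entry for entry in index
        if entry.customer_number == customer_number
        and entry.call_started.year == year
    ]


# Made-up sample data; assume "C-1001" is Joe Smith's customer number.
index = [
    RecordingIndexEntry("C-1001", datetime(2008, 3, 14, 9, 30), "calls/a.wav"),
    RecordingIndexEntry("C-2002", datetime(2008, 3, 14, 9, 45), "calls/b.wav"),
    RecordingIndexEntry("C-1001", datetime(2007, 11, 2, 16, 5), "calls/c.wav"),
]

joes_calls = recordings_for(index, "C-1001", 2008)
```

The filter gets us from millions of minutes down to a short list, which is the "semi" part. Someone still has to listen to each file in joes_calls.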

The structured data, in this case, could be represented by the actual transaction records that the call centre agent created in response to Joe's instructions.

Likewise, a TIFF image might be considered structured data within the context of a GIS (geographic information system) application, but might be considered unstructured within a mortgage appraisal application. (Perhaps even a GIS would consider it unstructured, since its users might ideally wish to run queries over an image set to find all lakes larger than a certain size. That would be hard on untagged TIFF images.)

All the data we typically deal with has a known syntax, even if that syntax is only really understood by MS Word. And although a Word document may have semantic meaning to a human, that meaning is not easily extracted by a computer. We consider a Word document to be unstructured (in most cases).

An Excel spreadsheet may have a well-defined layout of rows and columns. Although Excel may find it easy to 'understand' its content, other programs may or may not. If the layout is regular and complete, programs other than Excel may be able to extract the data from the spreadsheet and do useful things with it. We would consider that to be semi-structured. I suggest that the 'semi' aspect of the term introduces the concept of a degree of uncertainty. Perhaps this is because its source is not well controlled: the form (layout) may change, and the data suddenly becomes unstructured in our context.
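The uncertainty can be sketched in a few lines of Python. This assumes the spreadsheet has been exported as CSV text and that we expect a particular header row. The column names and the extract_rows function are hypothetical, chosen only to illustrate the point.

```python
import csv
import io

# Assumed layout: the columns we expect, in the order we expect them.
EXPECTED_HEADER = ["customer_number", "call_date", "amount"]


def extract_rows(sheet_text):
    """Extract rows from a CSV export, but only if the layout matches."""
    reader = csv.reader(io.StringIO(sheet_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        # The layout changed: in our context the data is now unstructured.
        raise ValueError(f"unexpected layout: {header!r}")
    # Rows with the wrong number of cells are skipped rather than guessed at.
    return [dict(zip(header, row)) for row in reader if len(row) == len(header)]


good = "customer_number,call_date,amount\nC-1001,2008-03-14,250.00\n"
changed = "Customer,Amount,Date\nC-1001,250.00,2008-03-14\n"

rows = extract_rows(good)

try:
    extract_rows(changed)
    layout_ok = True
except ValueError:
    layout_ok = False
```

The same data with reordered or renamed columns is rejected. Nothing about the content changed, only our ability to rely on its form.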

Structured data has an aspect of surety about it. We know that there are 'fields', we know where they are, and we know what values to expect. We know how to understand it. We expect there to be some kind of formal model which defines this structure, and we expect that there will be controls in place that enforce our expectations. We may often visualise such data as being in relational form, stored in an RDBMS. But that is not a requirement.

All that said, here are my definitions:

Unstructured Data: Data which does not have the semantic structure needed to allow computer processing within a particular business context.

Semi-structured Data: Data which has some form of semantic structure which would allow for a degree of computer processing within a particular business context, but may need some human assistance. The processing may apply some heuristics, but it may fail due to volatility of the structure or incorrect assumptions about the structure.

Structured Data: Data which is well positioned to be reliably processed by computer within a particular business context. It has a well-defined and rigorously controlled syntactic and semantic structure. The elements of the data have a well-defined datatype and rules about valid values and ranges. The meaning of these data elements is well understood, both in isolation and in their relationships to other elements. Elements are also traceable to their originating sources, and that path is verifiable.
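As a rough illustration of the structured end of the continuum, here is a minimal Python sketch of a record with declared datatypes, rules about valid values, and a reference back to its source. The TransactionRecord class, its fields, and the set of valid instructions are assumptions made up for this example. A real system would more likely enforce these rules in a database schema or a similar control.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Assumed set of valid values, for illustration only.
VALID_INSTRUCTIONS = {"TRANSFER", "PAYMENT", "ADDRESS_CHANGE"}


@dataclass(frozen=True)
class TransactionRecord:
    """A hypothetical transaction created by a call centre agent."""
    customer_number: str
    instruction: str
    amount: Decimal
    recorded_on: date
    source_recording: str  # traceability back to the originating call

    def __post_init__(self):
        # Controls that enforce our expectations about types and values.
        if self.instruction not in VALID_INSTRUCTIONS:
            raise ValueError(f"invalid instruction: {self.instruction!r}")
        if not isinstance(self.amount, Decimal) or self.amount < 0:
            raise ValueError(f"invalid amount: {self.amount!r}")
        if not isinstance(self.recorded_on, date):
            raise ValueError(f"invalid date: {self.recorded_on!r}")


record = TransactionRecord(
    customer_number="C-1001",
    instruction="TRANSFER",
    amount=Decimal("250.00"),
    recorded_on=date(2008, 3, 14),
    source_recording="calls/a.wav",
)

try:
    TransactionRecord("C-1001", "SING_A_SONG", Decimal("1"), date(2008, 3, 14), "calls/a.wav")
    rejected = False
except ValueError:
    rejected = True
```

A record that breaks the rules never comes into existence, which is what gives the data its "aspect of surety".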

1 comment:

Don said...

Does XML fit your definition of structured data?