Wednesday, October 07, 2009

International Kindle not shipping until Oct 19th

See the Bloomberg article: http://www.bloomberg.com/apps/news?pid=20601087&sid=aiywhz5s9L7g
Given that info, it is not surprising that Canadian orders are being declined, although Amazon should have thought about accepting pre-orders.

in reference to: Canada snubbed as Kindle goes global - The Globe and Mail (view on Google Sidewiki)

Friday, September 25, 2009

Sidewiki - kind of cool

Since I like recursion, I'll make a sidewiki entry about my first sidewiki usage. Of course I'll post that to my blog as well.

I wonder if they will tie this into Google Wave?

in reference to: Terry's Technology Topics: My first test of sidewiki (view on Google Sidewiki)

My first test of sidewiki

Not that this is a race, but I did need to beat Tom to the punch on sidewiki. Just because. It may be an interesting and useful tool. Let's see how it goes.

For this first test, I'll share the entry to my blog as well as Twitter.

in reference to: Google Sidewiki (view on Google Sidewiki)

Friday, July 10, 2009

Brilliant Attacks

Brilliant attacks (especially the laser one!):
- Monitor the power line to detect and decode keystrokes from 15m away.
- Monitor vibrations on a desktop object via laser reflections and decode keystrokes.

The more secure we think our systems are, the more we must remind ourselves that there are styles of attacks that we cannot conceive.

If anybody tells me ‘this is completely secure, it is unhackable’, then the one thing I know for sure about the speaker is that they are NOT very imaginative. They might not be thinking very hard about possible attacks. To me this makes products from vendors who make such claims less secure: they will be blind-sided by some hacker with more imagination.

Tuesday, July 07, 2009

What is Scalability?

In computer systems we usually mean "linear scalability", which describes a relation between two measures. It is typically a reference to capacity. Two common measures are CPU capacity and number of users. If an application supports 10 users with 1 CPU, 20 users with 2 CPUs, and so on until the numbers get quite large, we would say that it is scalable, because "number of CPUs" = 0.1 * "number of users".

It is a bit more complicated than that, because we often have other constraints such as "user response time remains constant", and memory use must scale as well. My first example also ignored the possibility of a constant offset, which would represent some fixed overhead; perhaps one-half of a CPU is required even if there are no users. For those who remember their math, y = mx + b is the equation which describes a line.
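As a rough sketch of that y = mx + b relationship (the 0.1 CPUs-per-user slope and the half-CPU offset are just the illustrative numbers from above, not measurements from a real system):

def cpus_required(users, cpus_per_user=0.1, fixed_overhead=0.5):
    # Linear capacity model: y = m*x + b, with illustrative constants.
    return cpus_per_user * users + fixed_overhead

for users in (0, 10, 20, 100):
    print(users, "users ->", cpus_required(users), "CPUs")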

In my first paragraph I mentioned a possible exception: "until the numbers get quite large". How large is large? That depends on the situation the solution is in. We typically consider sizes that are significantly larger than what we expect, but still within limits. These limits can vary by the nature of the solution. If a workload is driven by internet stock trades, the degree of growth we could expect is much more volatile than if it is driven by the number of Canadian branch locations for a bank (which is pretty much already saturated).

It would be false to assume that the linear scale continues without bound. Linear scaling without bound would be a truly rare situation. At various points as workload grows you will run into "walls". They are called that because when you look at the graph of this situation, your resource usage grows much faster than your workload - as if it hits a wall.

These wall situations usually occur because some other resource becomes saturated. For example, perhaps your database server "maxes out". In that case adding more CPUs to your application server won't help. But perhaps re-engineering the database server will remove that constraint and allow for further growth. Sometimes these walls are "hard"; that is, re-engineering won't alleviate the constraint.
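A minimal sketch of that kind of wall, with made-up capacities: the application tier scales linearly with CPUs, but overall throughput is capped by a fixed database limit, so adding application CPUs past the wall buys nothing.

def users_supported(app_cpus, users_per_app_cpu=10, db_max_users=80):
    # Linear in application CPUs until the shared database saturates.
    return min(users_per_app_cpu * app_cpus, db_max_users)

for cpus in (2, 4, 8, 16):
    print(cpus, "app CPUs ->", users_supported(cpus), "users")
# Beyond 8 CPUs the curve goes flat: we have hit the database wall.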

There are two sources of these hard walls: coding and architecture. Some may argue that "coding" is just a different type of engineering constraint. I won't argue that, but in my company we have an engineering department that specializes in server sizing and configuration, and development departments that do the coding, so we classify them as two distinct problem types. I have another reason for that differentiation as well. Engineering constraints can usually be quickly fixed with the addition of more resources (server, CPU, memory, etc.) or reconfiguration/reallocation of existing resources (add more threads, connections, heap). Coding problems, on the other hand, usually take much longer to diagnose, recode, retest, and redeploy. A trivial example of this would be the replacement of a linear search with a hash table lookup.
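As a toy illustration of that last point, with hypothetical account records:

# Hypothetical data purely for illustration.
accounts = [{"id": i, "name": "customer-%d" % i} for i in range(100000)]

# Linear search: cost grows with the number of accounts (O(n) per lookup).
def find_linear(account_id):
    for account in accounts:
        if account["id"] == account_id:
            return account
    return None

# Hash table lookup: build the index once, then each lookup is O(1).
accounts_by_id = {account["id"]: account for account in accounts}

def find_hashed(account_id):
    return accounts_by_id.get(account_id)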

Architectural constraints are more fundamental design decisions which cannot easily be altered. For example, a design decision that requires an application to execute completely within a single server. This might be a simple design that performs well - as long as you can buy a larger server. Whether this is a good decision or not depends greatly on how reasonable your assumptions about the potential for growth may be.

Is scalability always a good thing? Perhaps not. It depends on what you are measuring. I recently read a product evaluation that said (incorrectly) that the product's license model was not scalable. The truth is that it is highly scalable. The more of the product we used, the more we paid. It wasn't linear though, because volume discounts meant that unit costs dropped as volume increased (and that is a good thing!). What the author really meant was that they wanted NON-scalable pricing; they wanted a price ceiling (somewhat like a wall, except on the other dimension). At a certain point of volume growth, they didn't want to pay any more. A desirable feature for the buyer, but maybe not the seller.
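A hedged sketch of the two pricing shapes in question; the unit cost, discount tiers, and ceiling are invented purely for illustration.

def tiered_price(units, unit_cost=100.0):
    # Scalable pricing: total cost keeps growing, but the effective
    # unit cost drops at higher (invented) volume tiers.
    if units >= 1000:
        return units * unit_cost * 0.8
    if units >= 100:
        return units * unit_cost * 0.9
    return units * unit_cost

def capped_price(units, unit_cost=100.0, ceiling=50000.0):
    # What the evaluation's author really wanted: a price ceiling.
    return min(units * unit_cost, ceiling)

for units in (50, 500, 5000):
    print(units, "units:", tiered_price(units), "vs capped:", capped_price(units))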

There is much more that could be written: horizontal versus vertical scaling, 'knees in the curve', etc. But until then you might like to read the Wikipedia article.

Monday, April 20, 2009

8,000 US Banks

"the U.S. still has over 8,000 banking companies": Anybody who wants to understand the US banking industry need to understand this point. The referenced article gives just one viewpoint on the degree of diversity that is 'The US Banking Market'.

Friday, February 13, 2009

Wacky pricing... paper, ebook, and audiobook

Something is just plain wrong here. I have been listening to Taleb's "The Black Swan: The Impact of the Highly Improbable". A very interesting book, by the way. Recommended.

Anyhow, this is about price. I bought the audiobook from Audible.com so I could listen to it during transit times, etc. I paid C$15 for it. Expensive for a download, I thought, but Amazon wanted US$21 for their audio download. The audio CD? US$26!!

I decided to check out what a paper copy of the book would cost. I wasn't surprised to find it on amazon.ca for more money, but not too bad at $20 for the paperback. But I don't have any more space on my bookshelf.

So I looked for an ebook. I was shocked to find the price to be US$27 at several sites. That is insane. Finally, what about amazon/kindle? US$12. Much more reasonable, but you have to buy a Kindle. Sony's price is the same.

I think there is a bit of room here for price competition. I can't see any reason why the electronic download should not be universally half-price compared to the original media. This is somewhat the way it is for music. Perhaps libraries provide price competition on the paper?

One other thing I noticed: it doesn't seem to be general practice for a retailer to offer multiple media formats for an item - the one exception being the Kindle view on amazon. Surprisingly, the reverse was not the case: the regular amazon entry for the hardcover book did not reveal the other options available. A missed opportunity.

Friday, January 09, 2009

Prototype, Proof-of-Concept, and Pilot. Oh my!

These terms are often used in IT contexts without much consideration of their nuances. Whatever you call it, the first important thing is that you understand and state your objective. The second is that you meet it.

My suggested definitions:

Pilot:
This is an implementation of a system that is often functionally complete. It is typically deployed in production, but usually constrained to a small number of users. Although we hope everything is perfect, there is an expectation that there will be faults that require rework - otherwise we could have just gone to full production. The fault may be in deployment, code, design, or in the user experience. A pilot is typically time-boxed. A pilot is usually fully productionalized from an operational perspective.

Beta:
Similar in many respects to a Pilot. There is a lesser expectation that it is functionally complete, though it typically is. With a Beta, there is a much more explicit understanding that it is not final; it will change in the final release. A Beta is often supported, if at all, by a different organization than a production instance - typically the developers. Traditionally a beta was not to be used for production; however, some companies are making it part of their normal process - the never-ending beta. Something that Google has done many times.

Release Candidate:
A software build that might be viewed as sitting between a Beta and a Pilot. A release candidate may be promoted to pilot or production.

Proof of Concept:
I view this as a very narrow and well-defined activity. There is a well-defined concept, and the objective of the activity is to prove that the concept is viable in some aspect. Functionally it is only complete enough to meet the objective. The resulting code is not intended to be used for anything else - although most programmers will harvest some aspects for other things. Agile methods talk of 'early pain'. Significant projects have technology aspects that will be challenging and will be a source of risk. It is desirable to execute on those aspects first; if there is a problem, you want to know about it early so that you can change your plans or maybe cut your losses. That is what a proof of concept is about: if you are going to fail, fail early. Many times I have seen people propose a 'proof of concept' with no idea what concept they wish to prove; often what they mean is that they want to start coding. PoCs are not implemented in production.

Prototype:
Perhaps this one has the most varied definitions: experimental prototype, engineering prototype, etc. In software development, a prototype is a rudimentary working model of a product or information system, usually built for demonstration purposes or as part of the development process. As part of an SDLC approach, a simple version of the system is built, tested, and then reworked iteratively until it is ready for use. Prototypes are not usually implemented in production. Go read the wikipedia article: http://en.wikipedia.org/wiki/Prototype

Do you agree? Are there any characteristics of any of these terms that you think would help define them?

Monday, January 05, 2009

Unstructured, Semi-Structured, and Structured Data

I originally wrote this over two years ago and have intended to post it ever since. Never too late.

Often in the IT world we hear or even use these terms. But what do they really mean? Here is my view.

All the bits and bytes that we deal with in the IT world are considered to be data (at least). It all has some form of syntax and structure, so what do we really mean, and why is it useful to distinguish between them?

These three classifications represent a continuum, spanning from unstructured to structured data, that reflects the degree to which the data's semantic model (meaning) matches our processing requirements. In general, what we are trying to describe is the readiness of the data to be processed in a particular business context.

For example, if the data in question is the raw audio recordings from the call centre, and the business context is that we need to review all verbal instructions spoken by customer "Joe Smith" over the phone last year, we may consider those recordings to be unstructured. We have no easy way to process the request.

If we have augmented those recordings with additional data from other systems and have added the customer number and call timestamp to the recordings (or to an index), then we would consider that to be semi-structured data. Although we could quickly sift through the millions of minutes of recordings to get Joe's subset, somebody would still have to listen to the recordings to find the things that Joe said.
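A minimal sketch of what that augmentation buys us, with invented file names and metadata fields (nothing from a real call-centre system):

from datetime import datetime

# Hypothetical index built alongside the raw recordings.
recordings_index = [
    {"file": "call_0001.wav", "customer_id": "JS-1001",
     "timestamp": datetime(2008, 3, 14, 10, 22)},
    {"file": "call_0002.wav", "customer_id": "AB-2002",
     "timestamp": datetime(2008, 5, 2, 15, 5)},
]

def calls_for_customer(customer_id, year):
    # The metadata narrows millions of minutes down to Joe's calls,
    # but a human still has to listen to find what he actually said.
    return [r["file"] for r in recordings_index
            if r["customer_id"] == customer_id and r["timestamp"].year == year]

print(calls_for_customer("JS-1001", 2008))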

The structured data, in this case, could be represented by the actual transaction records that the call centre agent created in response to Joe's instructions.

Likewise, a TIFF image might be considered structured data within the context of a GIS (geographic information system) application, but might be considered unstructured within a mortgage appraisal application. (Perhaps even the GIS would consider it unstructured, since they might ideally wish to run queries over an image set to find all lakes larger than a certain size. That would be hard on untagged TIFF images.)

All the data we typically deal with has a known syntax, even if that syntax is only really understood by MS Word. And although a Word document may have semantic meaning to a human, that semantic meaning is not easily extracted by a computer. We consider a Word document to be unstructured (in most cases).

An Excel spreadsheet may have a well-defined layout of rows and columns. Although Excel may find it easy to 'understand' its content, other programs may or may not. If the layout is regular and complete, programs other than Excel may be able to extract that data from the spreadsheet and do useful things with it. We would consider that to be semi-structured. I suggest that the 'semi' aspect of the term introduces the concept of a degree of uncertainty. Perhaps this is because its source is not well controlled, and the form (layout) may change, at which point it suddenly becomes unstructured in our context.
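A small sketch of that uncertainty, using a hypothetical CSV export in place of the spreadsheet: the extraction leans on a layout assumption (a heuristic) and fails as soon as the layout shifts.

import csv
import io

EXPECTED_COLUMNS = {"account", "balance"}  # our layout assumption

def extract_balances(csv_text):
    # Works while the layout matches our expectations; raises when the
    # source's form changes underneath us.
    reader = csv.DictReader(io.StringIO(csv_text))
    if not EXPECTED_COLUMNS.issubset(reader.fieldnames or []):
        raise ValueError("layout changed: got columns %s" % reader.fieldnames)
    return {row["account"]: float(row["balance"]) for row in reader}

print(extract_balances("account,balance\nA-100,250.00\nA-200,99.50\n"))
# A renamed column ("bal" instead of "balance") would trip the check above.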

Structured data has an aspect of surety about it. We know that there are 'fields', we know where they are, and we know what values to expect. We know how to understand it. We expect there to be some kind of formal model which defines this structure, and we expect that there will be controls in place that enforce our expectations. We may often visualise such data as being in relational form stored in an RDBMS, but that is not a requirement.
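A brief sketch of what such a formal model and its controls might look like in code (not an RDBMS), using an invented 'trade' record; the field names, currencies, and rules are illustrative only:

from dataclasses import dataclass
from datetime import date

VALID_CURRENCIES = {"CAD", "USD"}  # illustrative controlled vocabulary

@dataclass
class Trade:
    # Every field has a known type, and __post_init__ enforces our
    # expectations about valid values and ranges.
    account: str
    currency: str
    amount: float
    trade_date: date

    def __post_init__(self):
        if self.currency not in VALID_CURRENCIES:
            raise ValueError("unexpected currency: %s" % self.currency)
        if self.amount <= 0:
            raise ValueError("amount must be positive")

trade = Trade(account="A-100", currency="CAD", amount=250.0, trade_date=date(2009, 1, 5))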

All that said, here are my definitions:

Unstructured Data: Data which does not have the appropriate semantic structure to allow for computer processing within a particular business context.

Semi-structured Data: Data which has some form of semantic structure that allows for a degree of computer processing within a particular business context, but may need some human assistance. Processing may apply some heuristics, but may fail due to volatility of the structure or incorrect assumptions about the structure.

Structured Data: Data which is well positioned to be reliably processed by computer within a particular business context. It has a well-defined and rigorously controlled syntactic and semantic structure. The elements of the data have well-defined datatypes and rules about valid values and ranges. The meaning of these data elements is well understood, both in isolation and in their relationships to other elements. Elements are also traceable to their originating sources, and that path is verifiable.