Business Intelligence for Small Businesses with SQL Server
In these times of economic uncertainty, it is vital for your customers to understand how their business is performing. Are they losing customers? If so, which ones, and are they gaining others? What do the new customers buy?
Those businesses that can’t react rapidly to changing trends may not survive the recession. Business Intelligence (BI) systems are designed to provide companies with precisely the information they need to maximise opportunities and hence minimise the effects of the downturn. BI has traditionally been restricted to large enterprises, but it can work just as effectively for smaller businesses – indeed in some ways it can work even better because smaller organisations can often react more quickly. If you are already familiar with Microsoft SQL Server, you’re in a great position to start now and offer your customers a service that will become a core part of their business tools.
BI can help your clients to understand their business; but how does it do that? At its simplest, BI is a set of methodologies, tools and practices that allow you to extract information from data. Once you have the information, business people can analyse it. At the enterprise level, BI systems are often complex because the data held by large organisations is complex. The good news is the modular nature of BI allows us to pick and choose the appropriate bits for smaller companies, to simplify the BI system that we build for them.
So, a crucial question at this point is what’s the difference between data and the information that we want to extract?
Data is the stuff that organisations (of any size) collect as they go about their business: sales records, stock control, purchases and personnel data for example. These types of data are often stored in operational databases with each row of data typically recording a single transaction; for instance, a row from the sales database might tell you that Shapes-To-Go Ltd bought 1,000 blue widgets on the 16th of April 2008. That’s the raw data and, as in this example, it is often very focused; it’s about a single event. Information on the other hand typically tells you something that cannot be revealed by one transaction – for example, the fact that that blue widgets sell twice as well as yellow ones, except in the south-west where yellow dominates; or that the majority of your customers are over 50. At a very simplistic level, we turn data into information by aggregation – by looking at many transactions together.
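The data-to-information step described above can be sketched in a few lines. This is purely illustrative – the table, columns and company names are invented for the example, using an in-memory SQLite database to stand in for an operational system:

```python
import sqlite3

# Invented sample data: each row is a single transaction (the raw data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, colour TEXT, region TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Shapes-To-Go Ltd", "blue",   "north",      1000),
        ("Acme Ltd",         "blue",   "north",       400),
        ("Widgetry plc",     "yellow", "south-west",  900),
        ("Kerno Ltd",        "yellow", "south-west",  700),
        ("Acme Ltd",         "yellow", "north",       300),
    ],
)

# Information: aggregate many transactions together to reveal a pattern
rows = conn.execute(
    "SELECT region, colour, SUM(qty) FROM sales "
    "GROUP BY region, colour ORDER BY region, colour"
).fetchall()
for region, colour, total in rows:
    print(region, colour, total)
```

No single row tells you that yellow dominates in the south-west; the aggregating query does.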
This information is clearly useful to the business because it can now ensure that the depot serving the south-west is kept stocked with widgets of the favoured colour and set the marketing department to develop an advertising campaign to appeal to younger customers. The right information gives us the knowledge we need to make the correct decisions and, as Sir Francis Bacon told us over 400 years ago ‘Knowledge is power’; he was right then and he’s still right.
Structuring Data for Transactions and Analysis
OK, so BI appears to be really simple. If it’s about information, and information is simply an aggregation of data, surely all we have to do is to run aggregating queries against the operational systems (such as the sales database) to get the information we need?
Well, you knew it was never going to be quite as easy as that and there are good reasons why we have to make it somewhat more complex. The most important issue is to do with the way we structure data.
Most operational databases store data as discrete transactions; each order that’s placed, each invoice that’s paid and so on. These systems are designed to be very efficient at processing discrete operations, and they are. Sadly almost all systems designed to be good at manipulating transactions are lamentably poor at running the kind of aggregating queries we need to extract information.
Why? Well, transactional systems are usually normalised, which means they don’t store repeated data and they don’t store aggregated and other cumulative values that can be derived from the individual rows. That’s bad for analytical systems because it is precisely these aggregated values that we need.
So, what would happen if we tried to run an analytical type query against an operational system? Well, imagine for a moment that you are Mr. (or Ms.) Walmart. You have a sales database that records every transaction. Given the size of Walmart, that database will be storing millions of rows every day. You want some information; say you want to know how the sales of cat food have varied, month by month, in Wapping. So you run a query against that database. The problem is twofold. For a start, the query has to hit millions (possibly billions) of rows to find the ones that refer to cat food and Wapping. That takes a very long time. And there is another problem. Remember that this database is operational, so while that query is running, other transactions (the life blood of the company) will be slowed to a crawl or stopped. This will significantly impact your popularity in the short term and your employability in the long. Essentially, operational data and analytical queries don’t mix.
But, of course, we have a way around this problem. We accept the fact that, for analysis, we need an alternative way of storing and structuring data. This alternative is often called OLAP (On-Line Analytical Processing). We extract a copy of the data from the operational database and restructure it into an OLAP structure – known as an OLAP cube.
OLAP cubes are designed specifically for analysis. This means that they do store aggregated values (these are calculated when the data is loaded into the structure). In addition they hold the data in a multi-dimensional form which is specifically designed for analytical queries. For example, suppose we want to analyse the sale of different products, by location, over time.
In our diagram of a cube, there is a cell at the intersection of each time period on the x axis, product on the y axis, and city on the z axis, and in each cell is the corresponding sales total. At the intersection marked with a red star we find the number of blue widgets sold in Newcastle during Q1 of last year. As we’ve said, in a cube the values at the intersections are very often aggregated from the raw data: here we’re looking at the aggregate of all sales of a particular product in a quarter.
Our cube shows just three dimensions (time, location and product) but an OLAP cube can have many more dimensions. A 3-D cube is easy to draw but further dimensions are almost impossible to illustrate. However, understanding that values are stored at the intersections of dimensions gives you the key to understanding OLAP cubes.
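The idea of values stored at the intersections of dimensions can be sketched very simply. This is a toy illustration only – the periods, products and cities are invented, and a real cube engine such as Analysis Services does far more – but it shows the principle of aggregating at load time and then answering queries by lookup:

```python
from collections import defaultdict

# Invented raw transactions: (time, product, city, quantity)
transactions = [
    ("Q1", "blue widget", "Newcastle", 120),
    ("Q1", "blue widget", "Newcastle", 80),
    ("Q1", "blue widget", "London",    50),
    ("Q2", "blue widget", "Newcastle", 60),
]

# "Process the cube": aggregate once, storing a total at each intersection
cube = defaultdict(int)
for quarter, product, city, qty in transactions:
    cube[(quarter, product, city)] += qty

# An analytical "query" is now a lookup at an intersection,
# not a scan of millions of raw rows
print(cube[("Q1", "blue widget", "Newcastle")])   # 200
```

Adding a fourth or fifth dimension is just a longer key – hard to draw, easy to store.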
Measures and Dimensions
In order to analyse the data, we need to extract a copy of the data from the transactional system and restructure it as an OLAP cube. That’s fine, but how do you decide what data to put into the OLAP cube and how to organise it?
Imagine you have a client with a sales system. The starting point is the analysis that people want to perform. Talk to the client: ask what analysis is currently performed and what analysis they would like to perform in an ideal world. Ask them to draw or sketch out the kind of analysis they would like to see on screen. Their reply will almost always be in the form of graphs, grids of data (as in spreadsheets) and reports, for the simple reason that these are the way people think about, and picture, information. If the people are keen to get information (and they should be!) then you’ll probably be presented with a bewildering array of dozens of different graphs, grids and reports. The good news is that the bewilderment is easily dispelled because, no matter how they describe it, all the analysis that business people wish to perform can be broken down into two components: measures and dimensions.
Dimensions are the facets of the business by which you wish to analyse. A dimension that occurs in almost every cube is Time (as in the simple cube above): we’re often interested in behaviour over time so we can spot trends. Product, Customer and Employee are also likely dimensions for a sales database. Other dimensions might show the manufacturer of the product, the store location and so on.
Dimensions are often hierarchical: Time, for instance, can have a hierarchy of years, quarters, months and days. The hierarchy reflects the way the business operates, so if it reports in two-monthly blocks that can be incorporated into the Time dimension. Customers can be grouped by title (Mr, Mrs, Ms, etc.) and these can in turn be grouped by gender (Ms, Mrs and Miss are all female; Mr is male). (This won’t work for titles like Professor and Doctor but, to keep it simple, we will assume that the sales system only stores the more common titles.)
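That title-to-gender hierarchy amounts to a simple lookup followed by a roll-up. A minimal sketch, with invented customer data and assuming only the common titles from the text:

```python
from collections import Counter

# Lower level of the hierarchy (title) maps to the upper level (gender)
TITLE_TO_GENDER = {"Mr": "Male", "Mrs": "Female", "Ms": "Female", "Miss": "Female"}

# Invented customers: (title, name)
customers = [("Mr", "J. Smith"), ("Mrs", "A. Jones"), ("Ms", "P. Patel")]

# Roll the customers up the hierarchy to the gender level
by_gender = Counter(TITLE_TO_GENDER[title] for title, _ in customers)
print(by_gender)   # Counter({'Female': 2, 'Male': 1})
```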
Measures are almost always numerical and are values like the profit or the quantity sold. Measures are so called because they measure the business process in some way. They are plotted against one or more dimensions.
Any graph that you care to draw can be broken down into a set of measures and dimensions: the measures usually appear on the Y axis, the dimensions on the X. For example, this graph has one measure (Unit Sales) and two dimensions (Time and Product, with two members of the Product dimension – red widgets and blue widgets).
Starting with the users’ requirements, the design process proceeds through three stages.
First gather details of the users’ analytical requirements in terms of grids, graphs and reports. These form what we call the user model, which is a representation of the way in which the users think about the information they need.
Next decompose these into their component parts (measures and dimensions) and, from those, construct a logical model. In this case, the type of logical model is called a Sun model. It shows the measures in a circle, surrounded by the dimensions. Show the model to the users and discuss it with them. Any measure that appears in the Sun model can be analysed by any of the dimensions that surround it. If a user looked at this model they could ask ‘So, will I be able to analyse the profit we make on a particular product, month by month, for each gender group?’ The answer is yes. Indeed, the users will be able to analyse any measure shown by any and all combinations of the associated dimensions. And they will be able to see that as a graph, in a spreadsheet or in a report. A little thought shows that this simple Sun model represents many thousands of possible grids, graphs and reports.
Finally, from the logical model you produce the physical model (specifically a type of physical model called a Star Schema). Physical models are typically similar to logical ones; they just have more technical detail. In this case we’re representing the data as tables with the measures stored in a central Fact table and each dimension as what is known (not unreasonably) as a Dimension table. It is at this point that you usually start to use software rather than whiteboards. In this case we’re using SQL Server Business Intelligence Development Studio (BIDS to its friends).
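The shape of a Star Schema can be shown in a few lines of SQL. This is not BIDS output – just an invented, minimal schema run against an in-memory SQLite database to illustrate the fact table surrounded by dimension tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One central fact table (the measures) and two dimension tables around it.
# All names and values are invented for illustration.
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, colour TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, qty INTEGER, profit REAL);
""")
db.execute("INSERT INTO dim_product VALUES (1, 'widget', 'blue')")
db.execute("INSERT INTO dim_time VALUES (1, 2008, 'Q1')")
db.execute("INSERT INTO fact_sales VALUES (1, 1, 500, 250.0)")
db.execute("INSERT INTO fact_sales VALUES (1, 1, 300, 150.0)")

# An analytical query joins the dimensions in around the fact table
row = db.execute("""
    SELECT t.quarter, p.colour, SUM(f.qty), SUM(f.profit)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY t.quarter, p.colour
""").fetchone()
print(row)   # ('Q1', 'blue', 800, 400.0)
```

The star shape – measures in the middle, dimensions radiating out – is exactly the Sun model made physical.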
This physical model is linked back to the data in the operational database so that when we process the cube we pull the current data from the operational database, aggregate it so that we have the total sales for March and so on, and write it all to an OLAP cube. Once the cube is processed the users can connect to it with a visualisation tool capable of supporting multi-dimensional querying.
Here we’re using a tool called ProClarity but Excel 2007 has excellent capabilities for connecting to OLAP cubes and allowing data analysis. The crucial point is that, once the cube has been built, the users can very easily and intuitively query the data. In fact, querying is really the wrong word because, after years and years of reliance on the SQL language, the term ‘querying’ has come to mean something difficult. The user interfaces that come with these tools (Excel included) mean that the users can simply select the measures and dimensions they want, choose the display format they want (which type of grid, graph or report) and the information appears.
Data Warehouses
You might have been told you need to implement a centralised data warehouse before multi-dimensional analysis can be performed. Whilst this may well prove a necessary step in large-scale businesses, in a small business you can analyse very successfully without one.
A data warehouse is a central repository into which is placed the data necessary to support analysis. One extra problem faced by large enterprises is that they tend to have a very large number of transactional databases. Not only do they have multiple databases (HR, sales, finance and so on), they very often have multiple databases of each type because of mergers, take-overs and so on. Inevitably this data is stored in incompatible systems (Oracle, DB2, SQL Server and more) with incompatible data types. In addition there might be three HR systems but, for analysis, we want only one employee list. So a great deal of effort is involved in extracting the data, transforming it into a compatible format, cleansing it and finally loading it into a central repository. This is so common in BI that a range of so-called ETL (Extract, Transform and Load) tools has been developed. The one in SQL Server is called SQL Server Integration Services.
The transformation process also has to ensure that the meaning of data is standardised so that the results of analyses tell a coherent story. For example, suppose that in the order entry database the term ‘sales person’ means anyone who makes a sale whilst the finance system uses the same term to mean only those in tele-sales (excluding the people on the sales floor). In this case we cannot simply combine the two ‘sales person’ lists into the data warehouse because they have very different meanings. In the data warehouse we may well want to be able to analyse on both definitions, but the business must agree on two new terms (perhaps ‘all sales people’ and ‘tele-sales people’) and provide exact definitions so that the data we extract exactly matches them. This semantic and definition work initially sounds trivial but it turns out to be a hugely significant part of the work in creating an enterprise-level BI system. The final step is to load the clean data into the warehouse.
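The standardisation step above can be sketched as a toy Extract-Transform-Load pass. A real ETL tool like Integration Services does this graphically and at scale; here the two ‘systems’, the names and the agreed terms are all invented for illustration:

```python
# Extract: two source systems hold 'sales person' under different definitions
order_entry_sales_people = ["Ann", "Bob", "Cas"]   # anyone who makes a sale
finance_sales_people     = ["Bob"]                 # tele-sales only

# Transform: map both lists onto the two exactly-defined terms the
# business has agreed, rather than merging incompatible meanings
warehouse = {
    "all sales people":  sorted(set(order_entry_sales_people)),
    "tele-sales people": sorted(set(finance_sales_people)),
}

# Load: the cleaned, standardised lists go into the central repository
print(warehouse["all sales people"])    # ['Ann', 'Bob', 'Cas']
print(warehouse["tele-sales people"])   # ['Bob']
```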
In smaller companies, however, almost all this extra complexity can usually be ignored and analysis performed without the need for a warehouse. The semantic analysis is still important but with smaller companies, it is usually far, far less complex.
Data Mining
Data mining describes the process of running tests against data which look for complex correlations, anomalies and trends hidden within the mass of data. It’s very different from normal querying of relational or multi-dimensional data where you ask the question to which you want the answer. With data mining you do not necessarily know what answer you are looking for when you begin.
Data mining used to be very difficult and expensive, and it’s often still regarded in this way. In recent years the whole process has been made much simpler, especially for users of Analysis Services. An excellent range of data mining algorithms is provided as part of Analysis Services, and there is also a data mining add-in for Excel which offers a familiar environment for looking at the results.
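To give a flavour of the idea – and only a flavour, since the algorithms in Analysis Services are far richer – here is a toy scan that surfaces, for every product, the customer segment with the strongest affinity for it, without anyone having asked about a particular segment up front. The purchase records and field names are invented:

```python
from collections import Counter

# Invented purchase records: (product, gender, age band)
purchases = [
    ("gadget", "M", "35-45"), ("gadget", "M", "35-45"), ("gadget", "M", "35-45"),
    ("gadget", "F", "25-35"),
    ("widget", "F", "45-55"), ("widget", "M", "25-35"),
]

# Count purchases per (gender, age band) segment for each product
by_product = {}
for product, gender, age in purchases:
    by_product.setdefault(product, Counter())[(gender, age)] += 1

# Report the dominant segment for every product - the 'question'
# nobody thought to ask
for product, segments in by_product.items():
    segment, count = segments.most_common(1)[0]
    print(product, "->", segment, count)
```

Here the scan volunteers the pattern (males, 35-45, buy the gadget) rather than answering a question the user had already framed.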
BI will give your customers insight into how their business actually works, so they can improve it. Imagine your customer starts mining their sales data, which reveals that a particular product is bought by males in the 35-45 age bracket in the latter half of the year. They would never have asked that question specifically but the mining process shows them a trend that is encapsulated in the data – giving them a business opportunity to capitalise on that knowledge and improve sales for the rest of the year.