Posts Tagged 'SQL Server'

Google Refine and Data Quality Services (Denali)

I’ve recently come across a new(ish) product from Google called Refine – http://code.google.com/p/google-refine/ and I thought I would see how it compares to the upcoming Data Quality Services (DQS) that is going to be part of SQL 11 (Denali).

Refine is a tool that can easily and powerfully cleanse and enhance data from a flat file source. If you haven’t seen the videos on Refine then check them out at – http://code.google.com/p/google-refine/. To be honest I’m pretty impressed with Refines features and it’s something I will defiantly use on some projects. But I can see some huge limitations especially when wanting to use it as part of a larger project. A quick pros and cons:

Pros

  • Powerful text search and clustering allowing quick data cleansing (Facets)
  • Integration of Freebase data allowing “reconciliation” with external data sets and including additional columns
  • Scripting language, web service and JSON support
  • Full Undo/Redo support

Cons

  • Only seems to work on flat file or web service based data
  • Once a project is created its not easily repeatable or automatable on similar data
  • It’s run as a standalone application

All up this means that Refine is more suited to cleansing/creating data sets in a smaller once-off fashion. For example using Refine for a monthly ETL of *.CSV based data involves the manual steps of opening up a previous project, exporting the Undo/Redo JSON object, creating a new project based on the new data, and applying the JSON object – a bit of a pain, and it’s something which hopefully the project addresses in the future.

So how does Refine compare to DQS? DQS was not part of Denali CTP1 and there isn’t much information that I can find, however I was able to get a look at it in action while I was at the SQL PASS 2010 conference. While its too early to do a pros and cons on DQS I can say that it far surpasses Refine in usability and scale.

DQS has broken the area of data cleansing into 3 general areas:

1. Knowledge management – DQS uses a similar method of searching and clustering data, calling them Domains instead of Facets. Domains are created either via external data sets (from the Azure Marketplace or other reference data) or from your own data. Once defined domains become part of whats called a DQ Knowledge Base (DQKB) which can be mapped to a data source, which leads to part 2.

2. Cleansing and Matching – It’s this stage where DQS shines over Refine. The created DQKB can then be applied over existing data or can be incorporated into an ETL process via SSIS. The domains apply their rules over the data and produce 1 of 4 results: Correct, Corrected, Not Correct or Auto Suggested. While the Not Correct and Auto Suggested results still require some manual intervention, the domain rules can then be updated to further enhance the DQKB. DQS can also apply matching over columns to remove duplicated in the same way Refine does.  This then leads on to part 3.

3. Administration - Not much has been revealed about this stage of the process except that Microsoft will make available the means to monitor and control the quality of data.

While there isn’t much to see yet, DQS is looking like a far superior product than Refine – I can’t wait to get my hands on it!

Denali Columnstore Indexes

Denali introduces a new type of index called the columnstore. This new type of index is built up on the values across columns instead of traditional row based indexes. As data tends to be less unique across a column it allows the columnstore to efficiently compress and store data. The columnstore is currently read-only, however it can be updated via dropping and recreating the index, or switching in a partition.

Due to the ability to compress and keep the index in memory the Columnstore can give massive (10x, 100x, 1000x…) increase in speed to warehouse queries – but have been warned that not all queries can benefit and some can regress. In general typical star join queries found in a Data Warehouse when only a portion of the columns are selected will get the biggest benefit.

For more info see the following white paper - http://download.microsoft.com/download/8/C/1/8C1CE06B-DE2F-40D1-9C5C-3EE521C25CE9/Columnstore%20Indexes%20for%20Fast%20DW%20QP%20SQL%20Server%2011.pdf

My SQL Server Execution plan is a lier!

UPDATE: It’s been brought to my attention that apparently I spelt lier wrong. Just letting everyone know I still stand by the original title – http://dictionary.reference.com/browse/lier

In preparation for my talk today at the Adelaide SQL Server User Group (AdSSUG) on what’s new in SQL11, I’ve been putting together a few T-SQL demos based on the talk given by Tobias Ternström at the PASS Summit 2010.

One new feature that I’m really liking is the new OFFSET operator so I thought I would give it a closer look comparing it to traditional paging methods.

To start I put together a simple query using the AdventureWorks2008 database just to see the execution plan that was generated -

select CustomerID, StoreID, TerritoryID, AccountNumber, ModifiedDate
from Sales.Customer
order by CustomerID
offset 10 rows
fetch next 10 rows only

With the plan as follows:

Execution plan from query using the new OFFSET syntax

I then compared the query with the current equivalent using a CTE -

WITH a AS ( -- CTE
	SELECT ROW_NUMBER() OVER (ORDER BY CustomerID) AS RowNum, -- Ranking by CustomerID
	CustomerID, StoreID, TerritoryID, AccountNumber, ModifiedDate
	from Sales.Customer
)
SELECT a.CustomerID, a.StoreID, a.TerritoryID, a.AccountNumber, a.ModifiedDate
FROM a
WHERE a.RowNum BETWEEN 10 AND 20
ORDER BY RowNum;

When comparing the two execution plans the OFFSET clause performed slightly better 0.003964 vs. 0.0033952 excellent!
So I then decided to raise the stakes a little, to see what happens when paging a much larger data set.

I created a test table with about 2 million rows -

-- Create a table of numbers
create table dbo.nums (num int primary key);
insert dbo.nums values (1);
go
insert dbo.nums select num + count( * ) over () from dbo.nums;
go 20

create table dbo.MillionRows
(ID int identity(1,1) primary key,
someDate datetime not null,
someVarchar varchar(20) not null)

-- Insert 2000000 rows
insert dbo.MillionRows
select dateadd(second, num, '20000101') as someDate, 'Number ' + cast(num as varchar(10)) as someVarchar
from dbo.nums
where num <= 2000000;

I then ran both paging queries to collect 20 000 rows this time with the time statistics turned on -

SET STATISTICS TIME ON
GO
WITH a AS ( -- CTE
	SELECT
	ROW_NUMBER() OVER (ORDER BY ID) AS RowNum,
	ID, someDate, someVarchar
	from dbo.MillionRows
)
SELECT a.ID, a.someDate, a.someVarchar
FROM a
WHERE a.RowNum BETWEEN 600001 AND 620000
ORDER BY RowNum;

-- New way
select ID, someDate, someVarchar
from dbo.MillionRows
order by ID
offset 600000 rows
fetch next 20000 rows only;

SET STATISTICS TIME OFF;
GO

Once I ran it I got a surprise, upon looking at the execution plan the OFFSET paging performs horribly.

But looking at the execution times reveals the exact opposite -

Method Estimated Subtree Cost Actual Execution Times
CTE 0.0038488 CPU time = 703 ms, elapsed time = 1158 ms.
OFFSET 2.93175 CPU time = 172 ms, elapsed time = 608 ms.

Something isn’t quite right here…

To get to the bottom of the problem I asked Rob Farley for some help and quickly discovered that the Query Optimiser had incorrectly estimated the amount of rows returned by the clustered index scan causing the plan to wrongly assume that the CTE was better!

So the good news is that is safe to use the OFFSET clause :)

CTE Bad Estimate

For more info on Paging and performance have a look at this great post by Dave Ballantyne - http://sqlblogcasts.com/blogs/sqlandthelike/archive/2010/11/19/denali-paging-key-seek-lookups.aspx

Denali T-SQL Enhancements

While T-SQL hasn’t been the major focus in Denali there are still a few new things that should at the very least raise an eyebrow.

The first and I think most useful is the new paging syntax. Gone are the days of using a CTE with a ROW_NUMBER ranking function in order to page returned rows. The new syntax provides a much more elegant solution. For example:

select CustomerID, AccountNumber, CustomerType
from Sales.Customer
order by CustomerID
offset 10 rows --Where to start page
fetch next 10 rows only --How many rows to return

Will return rows 11  - 20 from the Sales.Customer table. It’s worth noting that the ORDER BY and OFFSET both required in order for it to work.

Throw

Another useful feature that most developers will notice is the inclusion of the THROW clause. One of the great uses for this is when using a try/catch in a stored procedure that is called by another stored procedure. In the following example:

begin try
	RAISERROR('Error', 18, 1);
end try
begin catch
	--handle in some way
	throw
end catch

The calling or parent stored procedure can still be notified of the exact exception, but the child can still perform any necessary clean-up.

Containment

Since forever the issue of moving or restoring databases from different environments has been a problem. The most difficult of course being collation mismatches and missing roles. Denali aims to bring an end to the pain by introducing the CONTAINMENT option.

Databases can now be created with the option of a containment level, in Denali the only supported containment is PARTIAL meaning non-enforced containment (The database will still let you create un-contained entities). For example:

CREATE DATABASE [Contained] SET CONTAINMENT = PARTIAL;

It’s also possible to create per database users without a SQL logon. For example:

CREATE USER [DatabaseUser] WITH PASSWORD = 'DB1';

For more info see Aaron Bertrans in depth look on Contained Databases – http://sqlblog.com/blogs/aaron_bertrand/archive/2010/11/16/sql-server-v-next-denali-contained-databases.aspx

Sequence Generators

Sequence generators are something that might not be new in other database offerings, but has made it to SQL Server in Denali. Used as an alternative to server generated identity values the CREATE SEQUENCE syntax creates a unique identity that can be accessed globally. See here for a detailed run down on how it can be used – http://www.sergeyv.com/blog/archive/2010/11/09/sql-server-sequence-generators.aspx

Other T-SQL enhancements of note are support for UTF-16 with new collations (marked as <collation>_SC for Special Character) and circular arc segments for spacial data types. See here for more info – http://msdn.microsoft.com/en-us/library/cc645577(v=SQL.110).aspx

Microsoft Codename “Atlanta”

Atlanta is a new cloud based monitoring service for SQL Server 2008 and up. Set to be released in the Denali time frame, the goal of Atlanta is to be able to monitor and analyse all SQL Server instances within an organisation and provide alerts in order to monitor SQL Server.

More info can be found here: http://onlinehelp.microsoft.com/en-us/atlanta/ff962512.aspx

Its currently running as an open beta, and signing up is relatively painless here: https://www.microsoftatlanta.com/

Installation is relatively straight forward and consists of two separate components.

The first is the Agent, which is installed on every server  with a SQL instance installed. Its the agent that monitors all activity in SQL Server.

The second component is the Gateway. The Gateway is one or more internet connected server(s) and is used by the Agents when they have something to report. The Gateway is connected to the cloud and secured to a Windows Live ID via a certificate that is generated via the dashboard. During installation the certificate file is all that is required to set the connection. Additional servers can also be added using the same certificate.

Setting up both the Agent and the Gateway is very straight forward with no restart required. Once installed the information gathered is access via a silverlight enabled dashboard which displays a list of alerts for all servers registered. Access to server configurations are also available as well as a full history of past configurations.

Of course the first thing that jumps to anyone’s mind when they think of their data being pushed to the cloud is how is this secured? As Brent Ozar rightly points out:

Some of my clients can’t let their own developers get access to queries running on the production server, let alone send them offsite to servers that could be in any geographic location – http://www.brentozar.com/archive/2010/11/microsoft-atlanta-cloudbased-sql-server-monitoring/

I had the same initial fear when I first heard about Atlanta and would have to agree with Brent’s conclusion – that this would be really useful for small business with no full time DBA. I know that as a consultant who is at times required to jump in and fix a problem, I could be far more effective with a remote dashboard which allows me to see what’s going on.

For me Atlanta is a product that I can defiantly see has some great potential uses. As far as features go I certainly hope there is more in store. One limitation that I found was the lack of being able to assign other Windows Live ID to the same “set” of servers, unless I’ve missed something I can’t see how to do that just yet.

Overview of Denali announcements

After a whirlwind trip to the US I’m back from SQL Pass 2010. It was an awesome conference, I can honestly say that I learnt a lot, met some great people and generally has a good time.

I’m planning to put together a series of posts that covers the new features and announcements to be in SQL 2011 a.k.a. Denali. Below is the 30 000 foot view of some of what I’m planing to cover

  • Project Atlanta
  • T-SQL enhancements
  • Data Quality Services
  • Column-store indexes
  • PowerPivot and BISM
  • Project Crescent (SSRS)
  • Parallel Data Warehouse
  • SSIS enhancements
  • CTP’s and possible release date

What we currently know about SQL Server 2011

I’m just waiting in Sydney for our flight to LA with Rob and thought I’d quickly look around in preparation for my presentation on SQL 2011 (aka Denali), mostly about what has been announced/leaked to date.

Unfortunately for me there is scant information around so I’ll have to wait till the Keynote of Pass 2010 when I’m sure more will come to light.

There are two sessions on Denali planned for Pass, one on high availability and the other on Data Warehouse query performance. I’m really looking forward to the Data Warehouse improvements talk by Susan Price, as I’m really hoping that Microsoft will continue to build upon the BI emphasis from SQL Server 2008 R2.

Flight time, gotta go!

First post

When thinking recently about ways to fast track self improvement to be honest starting a blog was not at the top of my list. I already have a blog, but I haven’t used it in a long time, and it’s looking kinda sad.

I’m presenting in a few weeks at the Adelaide SQL Server User Group (AdSSUG from now on) and my topic is on what’s new in SQL Server 2011 (Denali). Now to prepare I could just type something up in my word processor of choice, but I want to make the whole process a bit more open. In the next few days and weeks I’ll be posting here anything that I can find on Denali as I prepare for my talk. I’ll also be attending Pass Summit 2010 which is where more will also be announced.

In the future expect to see posts on SQL Server, SharePoint and general BI.

Enjoy


Roger Noble

Developer and Data Visualisation Consultant at Roger Noble Analytics
Follow me on Twitter
LinkedIn

Twitter