First experiences with Project Gemini

What are my Gemini First Impressions?

I’m one of the chosen many who have access to the Gemini CTP2 – and had my first hands on experience with Gemini yesterday. I was up and running with it very quickly and grinning very shortly afterwards. The time it took me from connecting to the AdventureworksDW to producing a nice interactive “super” pivot table with slicers and very rapid response was very short.

As someone with experience of being an analyst, Gemini looks like a great tool for rapidly accessing and manipulating data. You connect directly to your data source (which doesn’t have to be SQL Server, btw), create some relationships and start playing. There are some new formulas to understand, but if you can work with a relational database this stuff will be a breeze. The response times are impressive, as are the connectivity options once you have finished. A Gemini workbook can become a data source for reports and deployed to MOSS without losing its connectivity as it falls into Excel Services.

Is this a Data Warehouse Killer?

One of my contemporaries said he was very concerned about Gemini leading to the next generation of spreadmarts, and that it could damage the market for Data Warehouses. I half agree, half disagree.

Firstly, the concern that Gemini will create the next spreadmart nightmare. Yes, it will create a new mess of user created, uncentralised data. However – as with spreadmarts – we will as an industry learn from this and find a way to manage it. Only this time we have prior experience of the phenomenon. A further mitigation factor for this risk is that I see the number of people using Gemini will be smaller than those using Excel. Those people will also be more likely to be analysts and hopefully more conscious of the risks of a Gemini-mart.

Will it kill the Data Warehouse? Probably not – analysts – the market for Gemini – don’t want to spend their time preparing and cleaning data. The DW will remain a good source of clean data for analysts to work with, and will remain a valuable part of the enterprise BI suite. It may remove the need for some warehouses – especially those based on clean data sources. I can also see it opening the door for new warehouses – as Gemini will make more data sources available to analysts. They will see it, play with it – declare it as important and bring it into the warehouse. So overall I expect the impact to be neutral.

So how does the BI Monkey feel about Gemini?

Easy – very, very excited!

The Term Extraction Transformation and “Animal Farm”

b

Fig 1: The Term Extraction Transformation

In this post I will be covering the Term Extraction Transformation. The sample package can be found here for 2005 and guidelines on use are here. Todays exercise will be a fun one as i’m going to apply the transformation to George Orwell’s book Animal Farm – a copy of which I obtained in text form from Project Gutenburg Australia.

What does the Term Extraction Transformation do?

In simplest terms it can extract individual nouns and collections of nouns and adjectives from text (these are the “Terms”) and returns them with a frequency count or score.In my example, a common Noun term is “Animal”, and a common Noun Phrase term is “Animal Farm”.

Because it uses an internal dictionary to simplify terms to identify repeated elements – such as removing plurals – it only works for English text. The dictionary is not exposed and cannot be edited, nor can the component be pointed at a custom dictionary of your choosing, so like the Fuzzy Lookup it is a bit of a black box in terms of your ability to tweak its operation – the algorithms and dictionary are fixed – i’ll pick up some flaws with this later. The only real control you can have over the content of the output is the use of an Exclusion List, which allows you to feed a list of terms to ignore into the component.

Configuring the Term Extraction Transformation

b

Fig 2: The Term Extraction Tab

The first thing to configure is the input column on the “Term Extraction” tab – this transformation accepts a single input column which must be either a Unicode Text Stream or Unicode String. In the example package i’ve simply used a Data Conversion task to convert my Non-Unicode input stream prior to the Term Extraction. You can also assign custom names to the Term and Score columns as well.

b

Fig 3: The Exclusion Tab

Next up is to specify your Exclusion list, if you are using one – this must be in the form of a single column in a table in either a SQL Server or Access Database (apparently Excel is also an undocumented  option) .In my example I have used the Name column of the Adventureworks Departments table, so the names of any Departments that appear in the text won’t appear in the output. Admittedly this is unlikely in Animal Farm, but if you were web mining your own website may choose to ignore your company name as it will appear often and may tell you nothing.

b

Fig 4: The Advanced Tab

The final page is the most important in terms of affecting the output. Term Type controls whether the component returns Nouns, Noun Phrases – or both. Score type controls whether the score returned is a simple count or the TFIDF – the Inverse Document Frequency – TFIDF of a Term T = (frequency of T) * log( (#rows in Input) / (#rows having T) ). I’m sure that’s a useful number to someone. Parameters sets the minimum frequency a term has to have before it will be output – obviously a setting of 1 would  return every siingle noun and /or noun phrase found. Maximum length of term sets the maximum number of words in a term. Finally Options sets the case sensitivity of the search.

The Term Extraction Transformations’ dictionary limits

The problem with this component stems from its black box dictionary which limits how well it can handle data. As an example, despite it claiming to remove plurals, if you look in the results of the example package, both Commandment and Commandments appear as distinct terms. If you extend this to the real world – say, mining emails or web pages – misspellings are common, product names are often nonsensical from a dictionary point of view – and a custom dictionary would allow you to work around that. As it is you would end up having to fix it after extracting it.

By adding a custom dictionary, or allowing it to be extended in the reverse of an exclusion table, this component would become more useful. I’ve added a connect article suggesting this – please vote it up if you think it will improve your lot.

When would you use the Term Extraction Transformation?

Douglas Laudenschlager comments here on some scenarios envisaged by Microsoft Research in China for use of terms within text data for mining. It should be applied to situations where you need to trawl through large amounts of (English) text data to pull out common terms. One use I attempted when learning SSIS was to try and emulate the Quackometer, a web based tool that tries to analyse web pages and determine if their content is valid science or junk science. I did this by pulling down the web pages as text, running them through the Term Extraction and then trying to detect common valid and junk science terms (and using an Exclusion list to remove common HTML terms). I never finished it but it remains a lurking project which may yet reappear on these pages.

MSDN Documentation for the Term Extraction Transformation can be found here for 2008 and here for 2005.

A caution on using Dimensional DSVs in Data Mining – part 2

As a followup to this post I have found that not only does using a table external to the one being mined to provide a grouping fail to actually group within the model, it also confuses the Mining Legend in the Mining Model Viewer.

What I was seeing in the Mining Legend for a node in a Decision Tree was like this:

Total Cases: 100

Category A: 10 Cases

Category B: 25 Cases

Category C: 0 Cases

Category D: 9 Cases

… so the Total cases and the cases displayed didn’t tie up. By digging further using the Microsoft Mining Content Viewer and looking at the NODE_DISTRIBUTION I saw that there were multiple rows for the categories, and the Mining Legend was just picking one of those values.

So if you find youself looking at a node and wondering why the numbers don’t add up – it’s because your grouping hasn’t been used by the model.

A caution on using Dimensional DSVs in Data Mining

If you are using a dimensional-style DSV in a Data Mining project, such as below:
b

Fig 1: A Dimensional DSV

Be aware that if you include a column from a Dimension table in your Mining Structure, the model will actually identify each key entry on the source table as a distinct value, rather than each distinct value in the Dimension table. I found this out because I added a grouping category to one of my dimensional tables – a simple high – medium – low group – and there were multiple values in the attribute states for each grouping, as below:

b

Fig 2: Mining Legend

To work around this you will need to add a Named Calculation to get the group on the main table, or convert the main table to a Named Query.

Quick book review: Data Mining with SQL Server 2005

I’ve just about squeezed all I can from Data Mining with SQL Server 2005 by ZhaoHui Tang and Jamie MacLennan – both of whom were part of the Data Mining development team for SQL Server 2005.

This book provides a lot of what seems to be absent from BOL and MSDN – it goes through most facets of Data Mining using SQL server reasonably thoroughly, but from a very technical angle. It is littered with big chunks of code and feels and reads like technical documentation most of the way through. It doesn’t provide much insight into how to carry out effective Data Mining or interpret results – what little is there is useful, but it’s a slog to find it.

As a technical reference I’d recommend it, not least because of the dearth of decent documentation. If you’re a beginner trying to work out how to use the product to get results, you need to look elsewhere.

Cannot View Data Mining Model in BIDS – function does not exist

I’d been running some Naive Bayes Data Mining models without problems as part of initiating a Data Mining exercise, so it was time to move on and cut the data some different ways. So I set up a Decision Tree model and it processed fine, but when I tried to view it a message box appeared telling me it wasn’t going to co-operate:

The tree graph cannot be created because of the following error:

‘Query (1,6) The

‘[System].[Microsoft].[AnalysisServices].[System].[DataMining].[DecisionTrees].[GetTreeScores] function does not exist.’.

Fortunately someone had hit this before, as the solution is rather obscure. The install I am working against is non-standard, being split across two drives. What had happened is the path for the Data Mining dll’s set up in the install process didn’t actually match where they were placed.

So when I looked under the assembly location – SSMS > AS Server > Assemblies > System > Properties, the Source Path referenced a dll that didn’t actually exist – so it appears this incorrect path does not raise an error when trying to start the server. To fix it, I located located where the dll really was, then updated the config files where this path is stored – System.0.asm.xml and VBAMDX.0.asm.xml – to point to that path.

A restart of the server and the models reprocessed and I could happily view the output!

SSAS Training Resource

I have added a link to Craig Utley’s excellent SSAS training video resource site LearnMicrosoftBI.com, which contains training videos on a variety of subjects in SSAS – dimensional modelling, Actions and the one I found most useful explaining the tricky but critical subject of Attribute Relationships (Video SSAS 109). Recommended for anyone starting out in SSAS or needing concepts clarifying.

Registration is required to download the videos (not sure why) – but it seems to generate no spam so not a big issue, and the content is very high quality for free content.