In this post I will be covering the Term Extraction Transformation. The sample package can be found here for 2005 and guidelines on use are here. Today’s exercise will be a fun one, as I’m going to apply the transformation to George Orwell’s book Animal Farm – a copy of which I obtained in text form from Project Gutenberg Australia.
What does the Term Extraction Transformation do?
In simplest terms, it extracts individual nouns and collections of nouns and adjectives from text (these are the “Terms”) and returns them with a frequency count or score. In my example, a common noun term is “Animal”, and a common noun phrase term is “Animal Farm”.
Because it uses an internal dictionary to simplify terms and identify repeated elements – such as removing plurals – it only works for English text. The dictionary is not exposed and cannot be edited, nor can the component be pointed at a custom dictionary of your choosing, so like the Fuzzy Lookup it is a bit of a black box in terms of your ability to tweak its operation: the algorithms and dictionary are fixed. I’ll pick up some flaws with this later. The only real control you have over the content of the output is an Exclusion List, which allows you to feed the component a list of terms to ignore.
Configuring the Term Extraction Transformation
The first thing to configure is the input column on the “Term Extraction” tab – this transformation accepts a single input column, which must be either a Unicode text stream or a Unicode string. In the example package I’ve simply used a Data Conversion task to convert my non-Unicode input stream prior to the Term Extraction. You can also assign custom names to the Term and Score output columns here.
Next up is to specify your Exclusion List, if you are using one – this must be a single column in a table in either a SQL Server or Access database (apparently Excel is also an undocumented option). In my example I have used the Name column of the AdventureWorks Departments table, so the names of any departments that appear in the text won’t appear in the output. Admittedly this is unlikely in Animal Farm, but if you were mining your own website you may choose to ignore your company name, as it will appear often and tell you nothing.
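The effect of an Exclusion List can be sketched as a simple filter over the extracted terms. The term counts and exclusion values below are made up for illustration – the real component reads the list from a SQL Server or Access column and does the filtering internally:

```python
# Hypothetical output of the Term Extraction: term -> frequency.
extracted = {"Animal": 12, "Farm": 9, "Engineering": 2}

# Hypothetical exclusion list, e.g. loaded from the Departments Name column.
exclusions = {"Engineering", "Sales"}

# Terms appearing on the exclusion list are dropped from the output.
filtered = {term: freq for term, freq in extracted.items()
            if term not in exclusions}
print(filtered)  # {'Animal': 12, 'Farm': 9}
```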
The final page is the most important in terms of affecting the output. Term Type controls whether the component returns nouns, noun phrases, or both. Score Type controls whether the score returned is a simple frequency count or the TFIDF (Term Frequency–Inverse Document Frequency), where the TFIDF of a term T = (frequency of T) * log( (#rows in input) / (#rows having T) ). I’m sure that’s a useful number to someone. Parameters sets the minimum frequency a term must have before it will be output – obviously a setting of 1 would return every single noun and/or noun phrase found. Maximum length of term sets the maximum number of words in a term. Finally, Options sets the case sensitivity of the search.
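The TFIDF calculation above can be sketched in a few lines of Python. This is just a minimal illustration of the formula, not the component’s actual implementation, and the corpus rows and terms are made up:

```python
import math

# Hypothetical corpus: each element stands in for one input row's text.
rows = [
    "the animals of animal farm",
    "animal farm is a farm",
    "the pigs run the farm",
]

def tfidf(term, rows):
    # Total occurrences of the term across all rows.
    frequency = sum(row.split().count(term) for row in rows)
    # Number of rows in which the term appears at least once.
    rows_having = sum(1 for row in rows if term in row.split())
    # TFIDF of T = (frequency of T) * log(#rows in input / #rows having T)
    return frequency * math.log(len(rows) / rows_having)

print(tfidf("farm", rows))    # appears in every row, so log(3/3) = 0 -> 0.0
print(tfidf("animal", rows))
```

Note how a term that appears in every row scores zero regardless of its raw frequency – that is the point of the IDF factor: it discounts terms that are everywhere and therefore tell you nothing.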
The Term Extraction Transformation’s dictionary limits
The problem with this component stems from its black-box dictionary, which limits how well it handles real-world data. As an example, despite the claim that it removes plurals, if you look at the results of the example package, both Commandment and Commandments appear as distinct terms. If you extend this to the real world – say, mining emails or web pages – misspellings are common and product names are often nonsensical from a dictionary point of view, and a custom dictionary would allow you to work around that. As it is, you end up having to fix the output after extracting it.
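Since the dictionary cannot be extended, one workaround is to fold variant terms together yourself after extraction. A rough sketch follows – the naive singularisation rule here is my own assumption for illustration and nothing like the component’s real dictionary logic, which is not documented:

```python
def singularize(term):
    # Naive English plural stripping; a real dictionary also handles
    # irregular plurals ("geese", "mice") that this will miss.
    if term.endswith("ies"):
        return term[:-3] + "y"
    if term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term

def merge_terms(extracted):
    # extracted: {term: frequency} as output by the transformation.
    merged = {}
    for term, freq in extracted.items():
        key = singularize(term)
        merged[key] = merged.get(key, 0) + freq
    return merged

print(merge_terms({"Commandment": 3, "Commandments": 5, "Animal": 7}))
# {'Commandment': 8, 'Animal': 7}
```

In practice you would want a proper stemmer rather than suffix rules, but the principle is the same: the tidy-up has to happen downstream because the component gives you no hook upstream.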
By allowing a custom dictionary, or allowing the built-in one to be extended (the reverse of an exclusion table), this component would become more useful. I’ve added a Connect article suggesting this – please vote it up if you think it will improve your lot. Update 21/10/2010: the SSIS team are not implementing this feature, which is a shame.
When would you use the Term Extraction Transformation?
Douglas Laudenschlager comments here on some scenarios envisaged by Microsoft Research in China for mining terms within text data. The transformation is suited to situations where you need to trawl through large amounts of (English) text to pull out common terms. One use I attempted when learning SSIS was to try to emulate the Quackometer, a web-based tool that analyses web pages and tries to determine whether their content is valid science or junk science. I did this by pulling down the web pages as text, running them through the Term Extraction and then trying to detect common valid-science and junk-science terms (using an Exclusion List to remove common HTML terms). I never finished it, but it remains a lurking project which may yet reappear on these pages.