What does the Percentage Sampling Transformation do?
This component is very simple: it splits a dataset by randomly directing each row to one of two possible outputs (as you can see in example 2 in the package, you can use just a single output if you want). All you need to decide is the proportion, as a whole percentage, in which you want the rows split between the two output data flows. In the picture below you see the configuration options: the percentage split, the names of the two outputs and the Random Seed.
The effect of the Random Seed can be seen in the sample package. If you run it multiple times you will get a different split each time, because when no seed is set the component derives one from the tick count of the operating system (roughly, a counter of the time elapsed since the system started), so the seed differs on every run. Note that in the example, even though the percentage sample is set to 30%, it's unusual for the output rows to be split exactly 30:70. This is because each row is allocated to an output by a throw of the randomiser's dice. If you set a value for the Random Seed you fix the results of the throws and, for the same input, will always get the same rows sent to the same outputs, though there is still no guarantee the split will be exactly 30:70. As the data set you split gets bigger, the deviation from 30:70 becomes smaller relative to the total row count.
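To make the dice-throwing idea concrete, here is a minimal Python sketch of the behaviour described above. It is only an illustration, not the component's actual implementation: the function name `percentage_sample`, the per-row probability test and the seed value 42 are all assumptions made for the example.

```python
import random

def percentage_sample(rows, percentage, seed=None):
    """Hypothetical stand-in for the transformation: each row goes to the
    'selected' output with probability percentage/100, otherwise to 'unselected'."""
    # With seed=None Python picks its own seed, so every run differs;
    # a fixed seed makes the sequence of "dice throws" repeatable.
    rng = random.Random(seed)
    selected, unselected = [], []
    for row in rows:
        if rng.random() < percentage / 100:
            selected.append(row)
        else:
            unselected.append(row)
    return selected, unselected

rows = list(range(1000))

# Same seed, same input: the same rows land in the same outputs.
first, first_rest = percentage_sample(rows, 30, seed=42)
second, _ = percentage_sample(rows, 30, seed=42)
print(first == second)

# The split is close to 30:70 but is not guaranteed to be exact.
print(len(first), len(first_rest))

# No seed: the counts will usually change from run to run.
unseeded, _ = percentage_sample(rows, 30)
print(len(unseeded))
```

Because every row is decided independently, the selected count only hovers around 30% of the input. Fixing the seed makes the result repeatable, but it does not make the split exact.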
Where would you use this transformation?
The main use for this, as far as Microsoft is concerned, is carving up data sets for Data Mining into training and test cases. But anywhere you need to divide a dataset randomly (e.g. separating out a group of customers for a different targeted mailing), this is the component for the job.