So far we’ve opened up the client and taken a look at the areas for working with. In this post I’ll look at setting up a Knowledge Base from scratch using some sample data I’ve mocked up based on some Netflix catalogue data.
Note all of the images are of near full screen so I’ve shrunk them in the post, just click on the image to see the full screen.
Creating a Knowledge Base
To create a new KB just select “New Knowledge Base” from the client front screen, and give it a name and description. You can either clone an existing one on already the server or import, but in this case I’m just going to start from scratch. There’s three possible modes in which you can create a new KB:
- Domain Management – Creating from scratch with no guidance
- Knowledge Discovery – Using a sample data set to guide building your KB with a view to using it for data cleansing rules
- Matching Policy – Using a sample data set to guide building your KB with a view to using its record matching capabilities
As I’m not attempting to do any matching, I’m using the Knowledge Discovery approach. (Note: because I’m using an Excel source I need Excel installed on the machine).
Mapping Columns to Domains
Once I’ve picked my data source, I need to map columns:
At this stage, there is no facility to automatically create domains, so before I map the fields to a Domain, I need to create each of them:
Now a (known) defect appears to be fields from Excel are automatically treated as text – so if they contain numbers you cannot map them to a Numeric domain. So I’ll skip over evaluating Year for now.
Analysing and Managing the results
Once all the mapping is done, click next and you can upload the file for analysis. Now my sample file is 5,000 records, and it took DQS a few minutes just to upload this – so big sets may take a while. Anyway, the output of this is an analysis of the data, displayed in a profiler screen:
This breaks down the records by the following criteria:
- New records – in this case, all of them as it’s a first pass
- Unique – how many unique records
- Valid – how many valid records – again, as it’s a first pass, everything is valid
So, we skip on to the next stage – to manage the results of the findings. This is the first time we start to see what DQS can offer us in terms of cleansing:
Here we are reviewing the values found for Country, and can manage the values that come through, flagging them as Correct, Error or Invalid – and assign a corrected value to incorrect ones.
I’m trying to see if there is a functional difference between Error and Invalid. As per documentation:
The status of the value, as determined by the discovery process. You can change the type by clicking the down arrow and selecting a different type. A green check indicates that the value is correct or corrected; a red cross indicates that the value is in error; and an orange triangle with an exclamation point indicates that the value is not valid.
A value that is not valid does not conform to the data requirements for the domain. A value that is in error can be valid, but is not the correct value for data reasons.
Update: the difference has been clarified here in the DQS Forum.
Skipping over the semantics issues, what we see here is a list of the values that the DQS Client has found in the Knowledge Discovery analysis of the data. We can then flag these values to Invalid or Error as we see fit – or leave them at their default value of Correct. Once we have flagged them as not correct, it is then possible to enter the Correct value in the “Correct To” column. Handily, the client then groups your corrected values under the correct value in the list.
The final thing is to click Next and Publish the Knowledge Base (i.e. store the results back on the DQS Server)
So in this post we have quickly reviewed the creation of Knowledge Base through the Knowledge Discovery Mode. This has allowed us to create a set of values in our Knowledge Base using some sample data and then apply some corrections to those values, using a simple GUI to manage the results.
In the next post I will look at working in more depth with this created Knowledge Base using the “Domain Management” mode.