Data masking, or data obfuscation, is the process of hiding original data by replacing it with random characters or other data.
The main reason for masking a data field is to protect data that is classified as personally identifiable data, sensitive personal data or commercially sensitive data. However, the data must remain usable for the purposes of undertaking valid test cycles, and it must also look real and appear consistent. Masking is most commonly applied to data that is represented outside of a corporate production system, in other words, where data is needed for application development, building program extensions and conducting various test cycles. It is common practice in enterprise computing to take data from the production systems to fill the data component required for these non-production environments. However, the practice is not always restricted to non-production environments. In some organizations, data that appears on terminal screens to call centre operators may have masking applied dynamically based on user security permissions (for example, preventing call centre operators from viewing credit card numbers in billing systems).
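The dynamic masking of credit card numbers described above can be illustrated with a minimal sketch. The function names, the permission string "view_full_card_number" and the choice of leaving the last four digits visible are assumptions made for illustration only; they are not taken from any particular billing system.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask every digit except the last `visible` ones, keeping separators in place."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)
    out = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            # Separators (spaces, dashes) and the trailing digits are kept as-is.
            out.append(ch)
    return "".join(out)


def display_card_number(card_number: str, permissions: set) -> str:
    """Return the value to show on screen for a user holding `permissions`.

    "view_full_card_number" is a hypothetical permission name.
    """
    if "view_full_card_number" in permissions:
        return card_number
    return mask_card_number(card_number)
```

In this sketch the stored value is never altered; only the value shown to a user without the assumed permission is masked, which is what distinguishes dynamic masking from masking a copied data set.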
The primary concern from a corporate governance perspective is that personnel conducting work in these non-production environments are not always security cleared to operate with the information contained in the production data. This practice represents a security hole: data can be copied by unauthorised personnel, and security measures associated with standard production-level controls can be easily bypassed. It is therefore an access point for a data security breach.
The overall practice of data masking at an organisational level should be tightly coupled with the test management practice and its underlying methodology, and should incorporate processes for the distribution of masked test data subsets.
Data involved in any data-masking or obfuscation must remain meaningful at several levels:
Substitution is one of the most effective methods of applying data masking while preserving the authentic look and feel of the data records.
It allows the masking to be performed in such a manner that another authentic-looking value is substituted for the existing value. There are several data field types where this approach provides optimal benefit in disguising whether or not the overall data subset is a masked data set. For example, if dealing with source data that contains customer records, real-life surnames or first names can be randomly substituted from a supplied or customised lookup file. If the first pass of the substitution applies a male first name to all first names, then the second pass would need to apply a female first name to all first names where gender equals "F". Using this approach, the gender mix within the data structure can easily be maintained and anonymity applied to the data records, while keeping a realistic-looking database that cannot easily be identified as consisting of masked data.
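The two-pass substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory lookup lists stand in for a supplied lookup file, and the record field names (first_name, surname, gender) and the function name are hypothetical.

```python
import random

# Illustrative lookup values; in practice these would come from a lookup file.
MALE_FIRST_NAMES = ["James", "Robert", "Michael", "David"]
FEMALE_FIRST_NAMES = ["Mary", "Linda", "Susan", "Karen"]
SURNAMES = ["Smith", "Jones", "Taylor", "Brown"]


def mask_customers(records, seed=None):
    """Return masked copies of customer records using two-pass substitution."""
    rng = random.Random(seed)
    # Work on copies so the source records are left untouched.
    masked = [dict(record) for record in records]

    # Pass 1: give every record a random male first name and a random surname.
    for record in masked:
        record["first_name"] = rng.choice(MALE_FIRST_NAMES)
        record["surname"] = rng.choice(SURNAMES)

    # Pass 2: overwrite the first name with a female one where gender is "F".
    for record in masked:
        if record["gender"] == "F":
            record["first_name"] = rng.choice(FEMALE_FIRST_NAMES)

    return masked
```

Because the gender field is left unchanged and only the names are replaced, the gender mix of the original data is preserved. Python's random module is not intended for security purposes, so a real masking tool would need to consider how its replacement values are chosen.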