What is Data Masking? Benefits & Use Cases
What does Data Masking do?
Data masking uses obfuscation to desensitize an original value while preserving a level of user or business context. For example, masking a credit card’s Primary Account Number (PAN) of 3566-0020-2036-0505 to reveal only the last four digits might look like XXXX-XXXX-XXXX-0505.
Why mask data?
With nearly 2,000 data breaches in the first half of 2022 alone and the increasing number of data privacy regulations, like CCPA and the GDPR, businesses need to use confidential information intelligently. Data masking offers a smart way to minimize or eliminate compliance requirements while maintaining day-to-day operations.
Here is a comprehensive overview of data masking, its benefits, and examples.
What is Data Masking?
Masked values provide a level of functionality while maintaining a desired level of privacy and confidentiality. Because of this, masked data can be used for things like sales demos, user training, account validation, and software testing without increasing a company’s risk footprint the way plaintext data does. It also protects private information when you share business data with a third party.
Data Masking Use Cases and Examples
When secured with strong privacy policies, identity access management tools, and role-based access controls, data masking provides a powerful and dynamic tool for desensitizing and using data. Let’s look at a few ways organizations do masking today.
Using masked data as a visual prompt
Masked data provides powerful visual cues to those familiar with the underlying data. For example, when a customer checks out at an online store, her digital wallet prompts her to select a stored card and shows the last four digits of each card’s PAN to help her choose.
Using masked data for testing
Humans and applications need data to test various system functions or standard operating procedures. Using sensitive plaintext data, or original values, is dangerous and considerably expands compliance scope (and, with it, costs).
When done right, masked data provides a cost-effective way to help test if a system or design will perform as expected in real-life scenarios.
Using masked data to help migrate data
Data masking can apply new formats to the underlying data. When combined with an abstraction layer, like tokenization, masked data may help format, structure, or clean data to satisfy new business or schema requirements encountered during a migration.
Types of Data Masking
While other methods exist, masked data is primarily created as a copy of the original data or during processing. Let’s take a closer look at the two.
Static Data Masking (SDM)
SDM duplicates your data with your data masking rules and algorithms applied to the new data set. This is helpful for test data environments, especially with third parties, where you may want to avoid the risk of something bad happening to your source data.
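As a rough illustration of the idea, the sketch below builds a masked copy of a small data set and leaves the source untouched. The field names, the MASKING_RULES table, and the static_mask helper are all hypothetical, and the PAN rule assumes the 19-character dashed format used earlier in this article.

```python
import copy

# Hypothetical masking rules: field name -> function applied to that field.
MASKING_RULES = {
    # Assumes a PAN formatted like 3566-0020-2036-0505.
    "pan": lambda v: "XXXX-XXXX-XXXX-" + v[-4:],
    # Keep the first character of the name, mask the rest.
    "name": lambda v: v[0] + "*" * (len(v) - 1),
}

def static_mask(records):
    """Return a masked deep copy of the records; the source list is not modified."""
    masked = copy.deepcopy(records)
    for row in masked:
        for field, rule in MASKING_RULES.items():
            if field in row:
                row[field] = rule(row[field])
    return masked

source = [{"name": "Claude", "pan": "3566-0020-2036-0505"}]
test_data = static_mask(source)
print(test_data)
```

The masked copy is what would be handed to a test environment or third party; the original records stay where they are.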
Dynamic Data Masking (DDM)
DDM does not require a second data source to store the masked data. Instead, it masks and presents data dynamically, according to the access policies and permissions of the actor requesting the data. For example, a customer support manager may have access to the full PAN while his support representatives only have access to the last four digits.
In doing so, DDM provides access to masked data in real-time to an authorized user and limits the exposure risk static data may create.
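A minimal sketch of that behavior, assuming a simple in-code policy table: the role names, the POLICY mapping, and read_pan are hypothetical, and a real deployment would take these decisions from its identity and access management system.

```python
# Hypothetical role policy: which view of the PAN each role may see.
POLICY = {
    "support_manager": "full",
    "support_rep": "last_four",
}

def read_pan(stored_pan: str, role: str) -> str:
    """Mask the stored PAN at read time, based on the requesting role."""
    view = POLICY.get(role, "none")
    if view == "full":
        return stored_pan
    if view == "last_four":
        # Assumes a PAN formatted like 3566-0020-2036-0505.
        return "XXXX-XXXX-XXXX-" + stored_pan[-4:]
    # Unknown roles get a fully masked value.
    return "XXXX-XXXX-XXXX-XXXX"

pan = "3566-0020-2036-0505"
print(read_pan(pan, "support_manager"))
print(read_pan(pan, "support_rep"))
print(read_pan(pan, "contractor"))
```

Only one copy of the data exists here; the mask is computed per request, so changing the policy changes what each role sees without rewriting any stored data.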
What are some data masking techniques?
Because there are so many ways to obfuscate data, it’s helpful to consider how each of the following techniques gets you closer to your usability and risk goals. In fact, most companies mix and match these approaches based on the effort it takes to build, maintain, and use them.
For example, while using format-preserving encryption to mask a social security number adds a significant level of security, an organization may not pursue it because of the complexity and cost it adds to any application that must decrypt the value. (We’ll get into encryption a bit more later.)
Obfuscating all but the last four digits of a card’s Primary Account Number (PAN) provides some level of protection and usability to both users and organizations alike.
Substitution replaces part of an original value with other characters or numbers, sometimes similar in nature. For example, if your email was Claude.Shannon@fakeemail.com, a substituted version might look like Terry.Shannon@fakeemail.com.
There are many ways to populate the data used to substitute all or part of the original value. For example, you could draw from a look-up table of fake names or dates, use a generator or function to produce values, or calculate an average from aggregate values.
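The look-up table approach might be sketched as follows. The FAKE_FIRST_NAMES list and substitute_first_name are hypothetical, the function assumes a well-formed address, and hashing the original name to pick a replacement is just one possible design choice: it makes the same input always map to the same substitute.

```python
import hashlib

# Hypothetical look-up table of replacement first names.
FAKE_FIRST_NAMES = ["Terry", "Pat", "Jordan", "Casey"]

def substitute_first_name(email: str) -> str:
    """Swap the first name in 'First.Last@domain' for one from the look-up table."""
    local, _, domain = email.partition("@")
    first, sep, rest = local.partition(".")
    # Hash the original first name so the same input always gets the same substitute.
    digest = hashlib.sha256(first.lower().encode("utf-8")).digest()
    fake = FAKE_FIRST_NAMES[digest[0] % len(FAKE_FIRST_NAMES)]
    return f"{fake}{sep}{rest}@{domain}"

print(substitute_first_name("Claude.Shannon@fakeemail.com"))
```

A random pick would also work when consistency across records does not matter; with a small table like this one, distinct names will often share a substitute.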
Redaction or masking out data
Redaction, also known as masking out data, is worth calling out as a notable sub-genre of substitution because of its popularity. This approach obscures a portion of the original value with fixed characters, like “X” or “*”, and is by far the simplest form of data abstraction (it requires the least amount of power from humans and computers alike to understand and process).
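A minimal redaction sketch is shown here. The redact function and its parameters are illustrative names, not part of any particular product; it masks every letter or digit except the last few and leaves separators such as dashes in place.

```python
def redact(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask every letter or digit except the last `visible` ones; keep separators."""
    remaining = sum(ch.isalnum() for ch in value)
    out = []
    for ch in value:
        if ch.isalnum():
            # Keep the character only once we are down to the last `visible` ones.
            out.append(ch if remaining <= visible else mask_char)
            remaining -= 1
        else:
            out.append(ch)
    return "".join(out)

print(redact("3566-0020-2036-0505"))             # XXXX-XXXX-XXXX-0505
print(redact("123-45-6789", mask_char="*"))      # ***-**-6789
```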
Deleting or truncating data
Deletion or truncation desensitizes a piece of information by removing its essential elements.
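Two small truncation sketches, with hypothetical helper names: one keeps only the trailing digits of a card number, the other reduces a birth date to its year. Unlike redaction, the removed portion leaves no placeholder behind.

```python
from datetime import date

def truncate_pan(pan: str, keep: int = 4) -> str:
    """Drop everything except the last `keep` digits of the card number."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    return digits[-keep:]

def truncate_birth_date(born: date) -> int:
    """Keep only the year of a birth date."""
    return born.year

print(truncate_pan("3566-0020-2036-0505"))       # 0505
print(truncate_birth_date(date(1916, 4, 30)))    # 1916
```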
Shuffling creates a mask by rearranging the characters or numbers of the original value; however, its inclusion in this list is purely informational. While it’s low effort to build, shuffling is easily reverse-engineered, and its use for sensitive data is strongly discouraged.
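The sketch below, using a hypothetical shuffle_value helper, hints at why shuffling is weak: the output contains exactly the same characters as the input, so anyone holding it only has to work out the order.

```python
import random

def shuffle_value(value: str, seed: int) -> str:
    """Rearrange the characters of a value using a seeded pseudo-random shuffle."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

original = "356600202036"
shuffled = shuffle_value(original, seed=7)

# Every original character is still present; only the order is hidden.
print(sorted(shuffled) == sorted(original))      # True
```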
Is encrypting data the same as masking data?
Encryption uses math and one or more keys to transform an original value into an unrecognizable string of characters. In symmetric encryption, the same key used to encrypt the original value is then used to decrypt it.
Many maintain that encryption is a form of masking. It does obfuscate data and render it unreadable, but an encrypted string provides no context to a user or business trying to complete a task, so it struggles to meet our definition of data masking.
While format-preserving encryption does offer some value here, the compute costs, latency, and operational overhead of managing the necessary keys and services have traditionally made the other data masking techniques much more attractive.
Learn more about managing your keys with Basis Theory’s Open Source KMS.
What’s the difference between data masking and data tokenization?
Masking data modifies the existing original value while tokenizing data creates a net new one, called a token.
Using masked data in applications and databases is a great way to reduce your compliance footprint in your environment, but it doesn’t dismiss the compliance obligations or security risks that come with storing the original value. Depending on the type of data being held and its usage, this could cost hundreds of thousands of dollars to build, maintain, and audit.
Tokenization platforms, like Basis Theory, store the original sensitive data in a compliant, hosted environment outside of your system. Instead of holding onto the full plaintext value, you’d store references to it, called Token Identifiers or Token IDs, in your application or database. Systems use these Token IDs to authorize and call back different properties of the original value (e.g., masks), allowing them to do many of the same things they could with the plaintext value without increasing scope or risk. Let’s look at a tokenization example while incorporating what we know about DDM.
Say your customer support application needs access to the last four digits of a customer’s PAN to help a representative process a return on behalf of a customer. The application would use the Token ID to request a redacted version of the data. Basis Theory uses DDM, so once the application and user (i.e., the support rep) have been authorized, a mask of the customer’s card, “XXXX-XXXX-XXXX-0505”, would be generated and passed back to the application.
In this situation, the original value stays encrypted in a PCI-compliant environment and a masked value is returned to the representative so the customer can confirm which card to use for the refund. This workflow keeps your larger system and the web app out of PCI scope while reducing or even eliminating many of the obligations and costs that come with hosting full plaintext card numbers.
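The flow can be approximated with a toy, in-memory vault. This is not Basis Theory’s API: ToyTokenVault, its methods, and the role names are hypothetical, and the sketch stores the PAN unencrypted in a dictionary purely to keep the example short, where a real service would encrypt it in a compliant environment.

```python
import uuid

class ToyTokenVault:
    """In-memory stand-in for a hosted tokenization service (illustration only)."""

    ALLOWED_ROLES = {"support_rep", "support_manager"}  # hypothetical roles

    def __init__(self):
        self._store = {}  # token_id -> original value

    def tokenize(self, pan: str) -> str:
        """Store the original value and hand back a reference to it."""
        token_id = str(uuid.uuid4())
        self._store[token_id] = pan
        return token_id

    def get_masked(self, token_id: str, role: str) -> str:
        """Return a mask of the stored value if the caller's role is authorized."""
        if role not in self.ALLOWED_ROLES:
            raise PermissionError(f"role {role!r} may not read this token")
        pan = self._store[token_id]
        # Assumes a PAN formatted like 3566-0020-2036-0505.
        return "XXXX-XXXX-XXXX-" + pan[-4:]

vault = ToyTokenVault()
token_id = vault.tokenize("3566-0020-2036-0505")   # the app stores only this ID
print(vault.get_masked(token_id, "support_rep"))   # XXXX-XXXX-XXXX-0505
```

The application never holds the full card number after tokenizing it; it keeps the Token ID and asks for a mask when a representative needs one.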
What are the Benefits of Data Masking?
Here are reasons organizations might use data masking:
- Reduce compliance scope: Using masked data inside your applications and databases instead of plaintext greatly reduces your compliance scope, lowers the cost and complexity of your compliance efforts, and speeds them up.
- Least privilege: When combined with strong access rules and policies, masking ensures authorized actors have just the right amount of visibility needed to complete a job or function.
- Compliance: If done well, masked data not only satisfies business needs but also sovereign or industry requirements, like GDPR, CCPA, or PCI, too.
- Programmability: Masked data can be generated to assume or change its value based on various static or dynamic factors.
- Made for humans: Human memory is a fickle thing. It’s easier for us to remember four digits than 12 and to recall an email address when provided half of it.
- Reduce the impact of a breach: Masking data greatly decreases the value of your data to outside parties, making an attack less worthwhile for cyber criminals.
What are some of the drawbacks or limitations of data masking?
Not a silver bullet: While using masked data in your applications inherently lowers your risk and scope compared to plaintext, you’ll still have obligations related to holding the original value. As discussed, storing the last four digits in your support services web application may not bring it into PCI scope, but organizations still need to prove that the database storing the full PAN is PCI compliant. Similarly, Personally Identifiable Information (PII) may have different masking requirements depending on where and for how long that information is stored (e.g., USA vs. European Union).
Some things sold separately: You may have noted several times the use of the words “permissions”, “access controls,” “policies,” etc., in this blog. Having solid data governance policies, identity access management systems, and role-based access controls (RBAC) to govern the “who”, “what”, “when”, “where”, and “how” around data permissions is the best way to ensure an organization meets its usability, risk, and compliance goals for its data. Data masking will only be as strong as these systems are complete and accurate.
Preserving formats: If you’re using an algorithm to mask data, you’ll need to accommodate the multitude of variations for that data. For example, while social security numbers in the USA are predictably formatted, emails can vary in length, numbers, and characters. Masking these incorrectly could lead to unfortunate downstream consequences.
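As one illustration of handling that variation, the hypothetical mask_email helper below adapts to the length of the local part and rejects input that does not look like an email address, rather than producing a malformed mask. The rule of showing the first and last characters is an arbitrary choice for the sketch.

```python
def mask_email(email: str) -> str:
    """Mask the local part of an email, whatever its length, and keep the domain."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a well-formed email address")
    if len(local) <= 2:
        # Too short to reveal anything safely: mask the whole local part.
        masked_local = "*" * len(local)
    else:
        masked_local = local[0] + "*" * (len(local) - 2) + local[-1]
    return f"{masked_local}@{domain}"

print(mask_email("Claude.Shannon@fakeemail.com"))
print(mask_email("ab@fakeemail.com"))
```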
Preserving social norms: Robots have a tough time determining things that humans may instinctively know as correct. One often cited example is gender. If you’re using an algorithm to substitute PII, like names and gender, in a data set, you’ll want one or more lookup tables with a population of male or female names proportional to your data set’s gender breakdown (or use popular neutral names, like Pat or Terry).
Re-identification: Given enough context, some data can be pieced back together. To put that into perspective, let’s look at one study that found 87% of the U.S. population could be uniquely identified with just three data points: gender, ZIP code, and birth date. While the study used plaintext data (i.e., not masked) to identify individuals, the same principle applies to data masking. Suppose multiple variations of masked data in your production system were to leak. In that case, hackers could use a combination of manual efforts, AI/ML programs, or other data sets to reverse engineer an original value.
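One rough way to gauge this exposure in your own data is to count how many records are unique on a set of quasi-identifiers. The sketch below assumes hypothetical field names, and the four records are made up for illustration only.

```python
from collections import Counter

def unique_fraction(records, keys=("gender", "zip", "birth_date")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Made-up sample records.
records = [
    {"gender": "F", "zip": "02138", "birth_date": "1970-01-01"},
    {"gender": "F", "zip": "02138", "birth_date": "1970-01-01"},
    {"gender": "M", "zip": "02139", "birth_date": "1985-06-15"},
    {"gender": "F", "zip": "02139", "birth_date": "1990-12-31"},
]
print(unique_fraction(records))  # 0.5
```

A high fraction suggests that masking names alone would leave many records identifiable from the remaining fields.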
Avoiding duplicates: While rare in smaller data sets, large data sets with thousands or millions of stored values could accidentally contain two identical masked values. While this doesn’t matter for some things, like birth dates, it may for unique values, like social security numbers.
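A simple check for this, sketched with a hypothetical find_mask_collisions helper and made-up social security numbers, groups original values by their mask and reports any mask shared by more than one original.

```python
from collections import defaultdict

def find_mask_collisions(originals, mask):
    """Map each masked value to the distinct originals that produced it, if more than one."""
    by_mask = defaultdict(set)
    for value in originals:
        by_mask[mask(value)].add(value)
    return {m: vals for m, vals in by_mask.items() if len(vals) > 1}

# Made-up values: the first two share their last four digits.
ssns = ["123-45-6789", "987-65-6789", "111-22-3333"]
collisions = find_mask_collisions(ssns, lambda s: "***-**-" + s[-4:])
print(collisions)
```

Running a check like this before relying on a masked column as a unique key can surface problems early.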
Data Masking with Basis Theory
Unlike other tokenization providers, Basis Theory allows developers to segment their data’s access controls as containers and define their tokens’ masks using a familiar Liquid syntax, called Expressions. If you’re interested in learning more, reach out for a quick demo!