Key Takeaways
- Labeling sensitive content in unstructured data repositories aids, but does not completely address, data privacy compliance
- Manually applying data sensitivity labels is time-consuming and difficult to implement consistently and accurately
- Tools that apply sensitivity labels automatically can be costly and slow to apply them
- Some states have data privacy statutes that require organizations to limit the personal data they collect, not just manage it properly, which calls for approaches beyond labeling
In Part 1, we provided an approach for conducting a sensitive data-driven cleanup project. In this post, we’ll discuss why using privacy labeling alone does not fully address data privacy compliance. In our next post, we’ll describe an approach for utilizing insights from a sensitive data cleanup project and ongoing review actions to reduce and minimize sensitive data collection going forward.
Shifting from Sensitive Data Accumulation to Data Minimization
Back in 2019, we called for data minimization as a new way of working to make the disparate and shifting data privacy requirements more manageable. Under this approach, data privacy and retention policies are operationalized, sensitive data collection is limited to only what is necessary, disposition occurs in the normal course of business, and continuous monitoring and reporting ensure ongoing compliance.
Before we get into how to achieve sensitive data minimization, let’s look at some of the current thinking on how to manage sensitive data through its lifecycle.
Labeling Sensitive Content Will Only Take You Part of the Way to Compliance
One approach gaining popularity is using labels to identify unstructured data with sensitive information and manage it where it lives through its lifecycle. There are two approaches to using labels, and both fall short of full privacy compliance.
In the first approach, users apply labels to content that contains personally identifiable information (PII) and other sensitive information. There are several limitations with the user-driven approach to applying labels. One is that it takes a lot of time for people to review and classify their existing unstructured content. Once existing content is reviewed, this approach is tough to scale going forward. It requires significant change management to motivate users to add a new step to existing workflows. Even if users do apply sensitivity labels, they don’t necessarily do it consistently or accurately. There is a constant risk of users under- or over-applying privacy labels, which leads to the inconsistent application of policy.
Another option is the auto-application of privacy labels within unstructured content management platforms. These programs rely on a combination of data privacy rules and file analysis technology to automatically apply the appropriate sensitivity label to content at the time of creation. One limitation of this approach is the cost. Many auto-application labeling solutions require premium licenses and are expensive, especially when used with optical character recognition (OCR) technology for images. In a pay-as-you-go pricing model, organizations are charged on a per-page basis for PDF and image files. For example, if a program charges $0.001 per page to review a PDF file, and the average PDF for an organization is 50 pages and 1 MB in size, each file costs $0.05 to analyze. A terabyte holds roughly one million such files, so it will cost $50,000 per terabyte of analyzed content.
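The per-terabyte figure above can be reproduced with a few lines of Python. This is only a rough sketch for plugging in your own numbers: the function name and parameters are hypothetical, and it assumes a decimal terabyte (1 TB = 1,000,000 MB) and that every file matches the average page count and size.

```python
def analysis_cost_per_terabyte(price_per_page, pages_per_file, file_size_mb,
                               mb_per_tb=1_000_000):
    """Estimate the per-page analysis cost for one terabyte of files.

    Assumes a decimal terabyte and that every file is 'average'
    (same page count and size). Hypothetical helper for illustration.
    """
    files_per_tb = mb_per_tb / file_size_mb       # how many files fit in 1 TB
    pages_per_tb = files_per_tb * pages_per_file  # total pages to be analyzed
    return pages_per_tb * price_per_page          # total charge in dollars


# The example from the text: $0.001 per page, 50-page PDFs of 1 MB each.
cost = analysis_cost_per_terabyte(price_per_page=0.001,
                                  pages_per_file=50,
                                  file_size_mb=1.0)
print(f"${cost:,.0f} per terabyte")
```

Because the cost scales with pages rather than bytes, the estimate is most sensitive to page density: halving the average file size at the same page count doubles the per-terabyte cost.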
In addition, many automated labeling solutions are slow to process large volumes of data, taking up bandwidth and storage space as they crawl and copy data for analysis. Some limit the amount of data that can be processed within a specific period of time, and labels are not always applied in a timely manner. In some cases, it can take weeks for a label to be applied, leaving sensitive data unprotected in the interim. That is why, at ActiveNav, we challenged ourselves to find faster, more efficient ways to conduct sensitive data discovery in large volumes of unstructured data without copying the data.
Bottom line, finding and labeling sensitive content in unstructured data will only take you so far in achieving compliance. California and Maryland have requirements for organizations to limit the collection of sensitive data to only what is essential for interacting with stakeholders. Organizations doing business in or with stakeholders in these states must review and take action on what data is collected. Labels are useful for understanding and managing sensitive content; however, they do not address the root cause of collecting and storing unnecessary data. In our next post, we’ll share an approach for using the insights from your sensitive data cleanup and labeling efforts to reduce and minimize sensitive data going forward.