
Start Here: How to Clean Up a Petabyte of Unstructured Data

Written by Christian Paschke | May 6, 2026

“But we still have the petabyte.”

I hear it frequently from IG professionals at AmLaw 150 firms. The DMS is configured. The collaboration channels are governed. The policies are finally in place. And then they pause, because none of that touches the decades of data that accumulated before any of it existed.

You know what it looks like. Windows file shares. Email. Personal workspaces that ballooned when users moved content from shared drives to the cloud. At many firms, it’s a combination of all three: decades of data created without governance, blocking the program’s future-state plans for lifecycle data management.

It is solvable. What follows is the methodology I recommend for working through it: technology-agnostic, defensible, and designed to scale from the first terabyte to the last.

Make policy the basis for cleanup

Using data retention and deletion policy as the basis for cleanup is the first step in setting your project up for success. Before beginning cleanup work, review your existing policy and procedures to identify and address any gaps. The policy needs to outline the guardrails for your project and how data is treated once you know what it is: what information goes where, how long it needs to be retained, the process for defensibly deleting eligible data, how sensitive information is treated, and the legal hold preservation standards. Getting clarity on the rules before you start is what makes every subsequent decision defensible.

As you do this work, imagine your firm is hit with a spoliation claim by opposing counsel over data deleted during your dark data cleanup project. How will you defend the decision to destroy it? Do your information governance policies, procedures, and planned project documentation provide sound cover for how and why decisions were made?

Working with your general counsel and stakeholders to plan for this scenario ensures you have agreement on a defensible approach for acting on your firm’s dark data at scale. The policy framework you create now will also serve as the basis for configuring your software for data discovery and how you document your decisions and actions once the data is classified. Taking the time to get it right upfront will make your project both more defensible and more efficient.

Determine how to handle data that won’t come out of the dark

Most dark data remediation projects encounter data that wants to stay dark. Sometimes it’s corrupt files or legacy file formats that cannot be read. Other times, files lack a readable last-modified or last-accessed date. Another common challenge is attributing ownership of a file to an individual or business unit. Agreeing with your general counsel and stakeholders on policy for corrupt, undated, and orphaned files lets you plan for these challenges and keeps your project from hitting a roadblock.
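One way to operationalize that agreement is to triage problem files into the agreed buckets before normal classification runs. Here is an illustrative Python sketch, not any vendor’s tooling; the record fields, date floor, and bucket names are all assumptions standing in for whatever your policy defines:

```python
from datetime import datetime

# Timestamps before this floor are treated as implausible (e.g. zeroed dates).
PLAUSIBLE_FLOOR = datetime(1980, 1, 1)

def triage(record):
    """Classify one file-metadata record into a remediation bucket.

    record is a dict with optional keys: 'readable' (bool),
    'modified' (datetime or None), 'owner' (str or None).
    """
    if not record.get("readable", True):
        return "corrupt"      # cannot be read: route per corrupt-file policy
    modified = record.get("modified")
    if modified is None or modified < PLAUSIBLE_FLOOR:
        return "undated"      # missing or implausible last-modified date
    if not record.get("owner"):
        return "orphaned"     # no individual or business unit to attribute
    return "ok"               # proceeds through normal classification

files = [
    {"readable": False},
    {"modified": None, "owner": "finance"},
    {"modified": datetime(2015, 3, 2), "owner": None},
    {"modified": datetime(2019, 7, 9), "owner": "hr"},
]
buckets = [triage(f) for f in files]
```

The value of a triage step like this is that every file lands in a bucket your policy already covers, so nothing stalls the workflow as an unhandled exception.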

Map the destinations before moving anything

There is valuable data buried inside your unstructured data mountain. Where it will go is an essential question to answer before cleanup begins, because migrating large volumes of orphaned matter data to the DMS without guidelines can quickly exceed cloud storage limits.

Before anyone starts moving files, map each data category to a destination. Open matter content, closed matter content with retention remaining, and back-office records for IT, HR, Finance, and Legal Operations all need a home. Establish guidelines for what goes to active versus cold storage.
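A lightweight way to capture that mapping is as an explicit routing table the migration workflow consults. The categories, system names, and tiers below are illustrative placeholders, not a prescribed taxonomy:

```python
# Hypothetical destination map: each data category resolves to a
# (system, storage tier) pair agreed on before any files move.
DESTINATIONS = {
    "open_matter":            ("DMS", "active"),
    "closed_matter_retained": ("DMS", "cold"),
    "it_records":             ("records_repository", "cold"),
    "hr_records":             ("records_repository", "cold"),
    "finance_records":        ("records_repository", "cold"),
    "legal_ops_records":      ("records_repository", "cold"),
}

def route(category):
    """Return (system, storage_tier) for a data category.

    Raising on an unmapped category is deliberate: a file with no agreed
    destination should stop the workflow, not be moved ad hoc.
    """
    if category not in DESTINATIONS:
        raise KeyError(f"no destination mapped for category: {category}")
    return DESTINATIONS[category]
```

Failing loudly on an unmapped category is the design choice worth copying: it forces the destination conversation to happen before migration rather than during it.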

Work with the willing

Start with the back office, and start with your most willing partner: IT. Whether they’re planning a cloud migration, strengthening security posture, or consolidating systems, IT shares your motivation to address the legacy data. Get them to the table early. Consider IT the pilot for your cleanup project, where the process gets built.

Before anyone touches a file, IT and IG need to agree on what gets deleted versus retained, where retained data migrates, and what tools will handle classifying, moving, and deleting data in large quantities. Disposition authority needs to be established in advance. Documentation and reporting need to be agreed upon before the work starts, not after. Then map out which repositories you’ll act on and in what order.

Configure before you classify

When using technology to analyze and classify data, let the policy work you did at the outset do the heavy lifting. Use the policy to define what you’re looking for and how you need to see the results. The goals of your cleanup determine what data you need to surface. Are you looking for active or expired matter content? Sensitive data? ROT and duplicates? Configuring your classification tools to find exactly what you need, both inside and outside of policy, sets the project up to succeed.

Then think through reporting. In most organizations, review and disposition decisions happen at the department or team level. Setting up your software to organize results by business area, practice group, or location, to match your actual workflow, saves time at every stage that follows.
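As a sketch of that reporting shape, here is how classification output might be bucketed by the unit that will actually review it. The record fields are assumptions; any real tool has its own export format:

```python
from collections import defaultdict

def group_for_review(results, key="practice_group"):
    """Bucket classification results by business area for reviewer handoff.

    Records missing the grouping key land in 'unassigned' so they are
    surfaced for follow-up rather than silently dropped.
    """
    grouped = defaultdict(list)
    for record in results:
        grouped[record.get(key, "unassigned")].append(record)
    return dict(grouped)

results = [
    {"path": "/shares/lit/brief.docx", "practice_group": "litigation", "label": "matter"},
    {"path": "/shares/tax/model.xlsx", "practice_group": "tax", "label": "matter"},
    {"path": "/shares/tmp/copy.docx", "label": "duplicate"},
]
by_group = group_for_review(results)
```

Grouping by reviewer rather than by repository is the point: each stakeholder receives only the slice they have authority to decide on.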

Test small before you scale

For every repository you tackle, start with a small subset, a project folder or a user sub-share, before running full discovery. Different repositories hold different types of data, and classification configurations often need adjustment to accurately surface what you want to find and report in the format you need to make a decision on it. At this step, the goal is to test the combination of technology, policy, workflow, and agreed-upon risk tolerance on a small dataset.
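A reproducible pilot sample makes those configuration tweaks comparable across runs. This sketch (the sample size and seed are arbitrary choices, not recommendations) pulls the same small subset from a repository inventory every time:

```python
import random

def pilot_sample(paths, size=100, seed=42):
    """Return a reproducible sample of at most `size` paths.

    A fixed seed means reruns after a configuration change see the same
    files, so differences in results reflect the change, not the sample.
    """
    rng = random.Random(seed)
    if len(paths) <= size:
        return list(paths)
    return rng.sample(list(paths), size)

inventory = [f"/shares/it/folder{i}/file{i}.log" for i in range(1000)]
subset = pilot_sample(inventory, size=25)
```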

Plan on finding things to refine. Discovering that your software settings need adjustment after running classification across all of IT’s shared drive folders means restarting discovery from scratch. That’s the smaller problem. Revising your policy or remediation workflow after deciding and acting on data could compromise both your project timeline and its defensibility. Restoring deleted data is difficult, if not impossible in some cases.

Refining on a small dataset first keeps the process efficient and improves accuracy. Once results are reliable and the process is repeatable to your stakeholders’ comfort, proceed with full discovery and work through review, decision, and disposition in segments: back-office functions first, legal data after. The sequence is deliberate. Back-office cleanup is where you build the process. Legal data is where you prove it.

Start legal review with the General Counsel’s office

With the back office complete, begin legal practice area review with the General Counsel’s office. It’s the right starting point because it lets you optimize your tools on actual legal data, verify classification of matter content, ROT, and sensitive material, and refine your workflow before the stakes get higher.

At this stage, it’s critical to engage the appropriate practice area SMEs as data owners and test decision workflows and authority. How much or how little information do they need to make a decision? Is there any data that requires no review? Are there classes of data that require spot-checking? Are there categories that require precise review and validation? What you learn from this process will help you refine your workflows and engage with stakeholders on the remaining legal data.

After completing review and disposition of the OGC’s data, conduct a structured post-mortem before expanding to the billing practice areas.

Sequence the remaining practice areas deliberately

How you expand to the rest of the legal data depends on what’s driving the project. Systems consolidation points toward sequencing by office location. Compliance priorities argue for starting with the highest-risk teams. Storage cost reduction points toward prioritizing teams by volume.

The sequencing question is the easy part. For each practice area or office, the harder work happens before review begins: confirming the workflow for decision and disposition, establishing who has authority to approve disposition, verifying legal hold preservation requirements, deciding whether the defensible deletion standard applies, confirming where retained data is going, and agreeing on how disposition actions will be documented. Engage your practice area SMEs in these questions and share what you learned from the OGC data disposition process. Getting those answers in advance is what keeps the project from stalling mid-execution.

Make ongoing monitoring the finish line

Once you’ve worked through the practice areas and addressed all the legacy data, the work isn’t over. It’s entering a new phase. Data always ends up where it doesn’t belong, stays past its expiration date, and multiplies as technology makes content creation faster and easier. If you conduct a cleanup project to policy and then don’t carry the work forward, you weaken the defensibility of every deletion completed during the cleanup. Claiming you deleted data in accordance with policy, then failing to enforce that standard going forward, is just as bad as, if not worse than, having no policy at all from a defensibility standpoint. Continuing to ignore dark data across unstructured repositories also risks non-compliance with regulatory and outside counsel guidelines and enlarges the attack surface in a cybersecurity event.

Aside from the risk, there are real benefits to regular data discovery and reporting. Ongoing monitoring, on a regular cadence, shows you where IG policy is working and where it’s falling short. That visibility lets you strengthen policy, improve procedures, and act before the problem compounds. This work also reduces storage and eDiscovery costs, simplifies matter mobility requests, and improves data quality for AI applications.
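As a minimal sketch of what a scheduled check might compute (the retention windows here are invented for illustration, not policy guidance), each scan can flag files that have aged past their category’s retention period, turning monitoring into a recurring worklist:

```python
from datetime import datetime

# Illustrative retention windows in days; real values come from policy.
RETENTION_DAYS = {"matter": 3650, "back_office": 2555, "rot": 0}

def past_retention(files, today):
    """Return paths of files older than the retention window for their category."""
    flagged = []
    for f in files:
        limit = RETENTION_DAYS.get(f["category"])
        if limit is not None and (today - f["modified"]).days > limit:
            flagged.append(f["path"])
    return flagged

today = datetime(2026, 1, 1)
scan = [
    {"path": "a.doc", "category": "matter", "modified": datetime(2010, 1, 1)},
    {"path": "b.doc", "category": "matter", "modified": datetime(2024, 1, 1)},
    {"path": "c.tmp", "category": "rot", "modified": datetime(2025, 12, 1)},
]
flagged = past_retention(scan, today)
```

Run on a cadence, a report like this keeps the worklist small and the deletions consistent with the standard established during the cleanup.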

Most importantly, ongoing monitoring of unstructured data prevents the need for large-scale cleanup projects in the future. The petabyte sitting between your firm and its future state didn’t accumulate overnight. Clearing it requires a methodology that’s systematic, defensible, and built to last. The goal is not to move mountains from one repository to the next. It’s to flatten them entirely and make sure they never grow back.

For more on building a sustainable unstructured data governance program, explore ActiveNav’s resources on data discovery and classification.

 

Unstructured Data Cleanup Checklist for Law Firms

 

Use this as a starting point for every function or practice area you tackle.

Before you begin

  • IG policy defines what is and is not a record

  • Retention, deletion, and legal hold requirements are documented

  • A defensible deletion standard is established

  • Migration destinations are mapped for all data categories

For each function or practice area

  • SMEs identified and briefed on project goals and process

  • Workflow for discovery, review, and disposition is defined

  • Disposition decision authority is established

  • Legal hold preservation requirements are confirmed

  • Retained data destination is confirmed

  • Documentation and reporting process is in place

Before scaling discovery

  • Tools are tested on a small subset

  • Configurations are refined until technology and results are reliable

  • Stakeholders support workflows and decisions

After each function is complete

  • Disposition actions are fully documented

  • Progress is reported to stakeholders

After cleanup project is completed

  • Go-forward unstructured repositories are identified

  • Ongoing discovery cadence is decided and scheduled

  • Workflow for addressing discovered data is established

  • Reporting process is in place