Automating Data Classification with AI Agents

Ben Saunders

As organisations race to adopt AI technologies, the foundation of trustworthy AI implementations often gets overlooked: data governance. In our AI Governance month, we're exploring how AI itself can be used to strengthen governance practices, starting with one of the most crucial yet challenging aspects: data classification.

Understanding Data Classification

Before diving into automation, it's crucial to understand what data classification means and why it's become even more critical in the era of generative AI. Data classification is the process of organising data based on its sensitivity, value, and handling requirements.

Consider a typical HR department: whilst some information like holiday policies and organisational charts should be accessible to all employees, other data such as salary information, performance reviews, or personal development plans must be restricted to authorised personnel only. Without proper classification, organisations risk exposing sensitive information or, conversely, creating unnecessary barriers to accessing public information.

The classification process typically involves:

  • Identifying data types and their sensitivity

  • Determining appropriate access levels

  • Establishing handling requirements

  • Implementing protection measures

  • Regular review and updates
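To make these steps concrete, here is a minimal Python sketch of how sensitivity levels, access levels, and handling requirements might be recorded, using the HR example above. The four-tier scheme, the role names, and every policy value are illustrative assumptions rather than a recommended standard.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet


class Sensitivity(IntEnum):
    """Hypothetical four-tier scheme; real schemes vary by organisation."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class HandlingPolicy:
    allowed_roles: FrozenSet[str]  # access level: which roles may read the data
    encrypt_at_rest: bool          # example protection measure
    review_interval_days: int      # how often the label is re-checked


ALL_STAFF = frozenset({"employee", "hr_officer", "hr_manager"})

# Illustrative values only (assumptions, not guidance).
POLICIES = {
    Sensitivity.PUBLIC: HandlingPolicy(ALL_STAFF, False, 365),
    Sensitivity.INTERNAL: HandlingPolicy(ALL_STAFF, False, 365),
    Sensitivity.CONFIDENTIAL: HandlingPolicy(frozenset({"hr_officer", "hr_manager"}), True, 180),
    Sensitivity.RESTRICTED: HandlingPolicy(frozenset({"hr_manager"}), True, 90),
}

# Data types from the HR example, mapped to assumed sensitivity levels.
HR_DATA_TYPES = {
    "holiday_policy": Sensitivity.INTERNAL,
    "org_chart": Sensitivity.INTERNAL,
    "performance_review": Sensitivity.CONFIDENTIAL,
    "salary": Sensitivity.RESTRICTED,
}


def can_access(role: str, data_type: str) -> bool:
    """Return True if the role is allowed to read the given data type."""
    policy = POLICIES[HR_DATA_TYPES[data_type]]
    return role in policy.allowed_roles
```

With this in place, `can_access("employee", "holiday_policy")` is allowed whilst `can_access("employee", "salary")` is refused, which mirrors the split described in the HR example.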

In the context of generative AI, proper data classification becomes even more crucial. AI models trained on corporate data need clear guidelines about what data they can access, process, and include in their outputs. Without robust classification, organisations risk exposing sensitive information through AI-generated content or limiting their AI's effectiveness by being overly restrictive with data access. This is especially the case with embedded AI systems like Microsoft Copilot, Amazon Q and Google’s Gemini.
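One way to picture this is as a gate that sits between a document store and an AI assistant. The sketch below is an assumption-laden illustration: the level names, the `Document` type, and the `filter_for_assistant` function are hypothetical and do not describe how any of the products named above work.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed ordering of classification levels, lowest sensitivity first.
LEVELS = ["public", "internal", "confidential", "restricted"]


@dataclass
class Document:
    doc_id: str
    text: str
    classification: Optional[str]  # None means the document is not yet classified


def filter_for_assistant(docs: List[Document], max_level: str = "internal") -> List[Document]:
    """Keep only documents at or below the ceiling the assistant may read."""
    ceiling = LEVELS.index(max_level)
    allowed = []
    for doc in docs:
        if doc.classification is None:
            # Unclassified content is excluded by default: without a label
            # there is no basis for deciding that it is safe to expose.
            continue
        if LEVELS.index(doc.classification) <= ceiling:
            allowed.append(doc)
    return allowed
```

The design choice worth noting is the default for unclassified content. Excluding it protects sensitive data but can starve the assistant of useful context, which is the trade-off described above and the reason classification coverage matters.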

The Data Governance Challenge

Traditional approaches to data governance have relied heavily on frameworks like DAMA (Data Management Association) and CDMC (Cloud Data Management Capabilities), providing organisations with robust guidelines for managing their data assets. These frameworks have served as the backbone for establishing data quality, security, and compliance practices. However, implementing these standards at scale presents a significant challenge: it's resource-intensive, time-consuming, and often struggles to demonstrate immediate ROI despite its critical importance.

In the age of AI, this challenge has become even more pressing. The quality and trustworthiness of AI outputs are directly tied to the governance of their underlying data. As the saying goes: garbage in, garbage out. But what if we could use AI itself to solve this challenge?

Enter AI Agents: The New Data Stewards

In a previous blog, we explained the DNA of an AI agent. We believe that AI agents represent a revolutionary approach to scaling data governance practices. By combining Large Language Models (LLMs) with specialised tools and domain knowledge, these agents can automatically assess data assets, identify sensitive information, and assign classification scores based on established frameworks.

The process works as follows:

  1. Initial Assessment: AI agents can analyse data from multiple sources depending on an organisation's data maturity level. This includes direct scans of systems of record, analysis of data products in data lakes or warehouses, or assessment of data extracts. The flexibility to work with different data sources ensures organisations can begin their governance journey regardless of their current data infrastructure maturity.

  2. Classification Analysis: AI agents require context about data standards to make informed classification decisions. This context can come from two sources:

    • Existing Standards: Organisations with established data governance frameworks can provide their current policies and standards to the agents, enabling classification aligned with existing practices.

    • AI-Generated Standards: For organisations starting their governance journey, AI can help draft appropriate data policies and standards. By providing information about:

      • Jurisdictional operations (e.g., which countries you operate in)

      • Applicable regulations (e.g., GDPR, CCPA, HIPAA)

      • Preferred framework basis (e.g., DAMA, CDMC)

      AI agents can generate tailored data governance policies that align with industry best practices whilst meeting your specific regulatory requirements.
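The assessment and classification steps above can be sketched in a few lines of Python. This is a simplified stand-in, not the agent itself: two regular expressions take the place of the LLM and policy context an agent would use, and the pattern set, the 0.5 threshold, the label names, and all function names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Simplified detectors standing in for an agent's LLM-based analysis (assumption).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}


@dataclass
class ColumnFinding:
    column: str
    pii_types: List[str]
    proposed_label: str  # a proposal for human review, not a final decision


def assess_column(name: str, samples: List[str], threshold: float = 0.5) -> ColumnFinding:
    """Propose a label for one column from a sample of its values."""
    non_empty = [s for s in samples if s]
    found = []
    for pii_type, pattern in PII_PATTERNS.items():
        hits = sum(1 for s in non_empty if pattern.search(s))
        rate = hits / len(non_empty) if non_empty else 0.0
        if rate >= threshold:
            found.append(pii_type)
    label = "restricted" if found else "internal"
    return ColumnFinding(name, found, label)


def assess_table(rows: List[Dict[str, str]]) -> List[ColumnFinding]:
    """Assess every column of a data extract given as a list of row dicts."""
    columns = list(rows[0].keys()) if rows else []
    return [assess_column(c, [str(r.get(c, "")) for r in rows]) for c in columns]
```

Pattern matching of this kind only catches values with a recognisable shape. It would miss free-text PII such as names or addresses, which is where the context an LLM brings, together with the organisation's own standards, is expected to add value.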

Bringing It All Together: The Agent Studio

To make these capabilities accessible and practical for organisations, we've developed the Agent Studio - a self-service, natural language platform that enables you to create, deploy, and manage intelligent agents for any operational workflow in your organisation. The Agent Studio significantly simplifies the process of implementing AI-driven data governance.

Here's an excerpt from a recent analysis performed by one of our agents on a dataset that we synthetically created. The agent was instructed to scan the data and report on what PII the dataset contained:

The following image provides a brief glimpse into what we have been building with the Agent Studio. Our plan is to provide users with the power to create intelligent agents that can address virtually any task or operational workflow within their organisation, whilst enabling them to connect those agents to their existing data sources and systems to unlock new levels of automation and operational efficiency.

This is all done in natural language and powered by a structured blueprint and governance standard for building validated and approved agents and multi-agent workflows in the enterprise.

We will be sharing more on this front in the weeks ahead.

Human Validation: Ensuring Accuracy and Compliance

This agent-based workflow would incorporate a sophisticated human-in-the-loop feedback mechanism that brings together key stakeholders in the data governance process:

Data Stewards: Responsible for overseeing data quality and governance processes, data stewards review agent classifications for accuracy and consistency with organisational standards. They can:

  • Validate or modify classification decisions

  • Add context-specific notes

  • Flag edge cases for further review

  • Provide feedback to improve agent accuracy

Data Owners: Business unit leaders or subject matter experts who have ultimate accountability for specific data domains. They:

  • Approve final classification decisions

  • Define access requirements

  • Determine handling procedures

  • Set review frequency requirements

This dual-validation process creates several key advantages:

  1. Digital Audit Trail: Every review and approval can be automatically documented, creating a comprehensive audit trail that demonstrates compliance with governance requirements.

  2. Flexible Review Cycles: Organisations can configure review frequencies based on their risk profile and regulatory requirements. High-risk data might require quarterly reviews, whilst lower-risk data could be reviewed annually.

  3. Continuous Improvement: Feedback from both data stewards and owners helps refine agent decision-making, creating a virtuous cycle of improving accuracy.
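As a rough illustration of the dual-validation flow, the Python sketch below records a steward review followed by an owner approval, logs each step to an audit trail, and derives the next review date from the risk level. The class names, status values, and the 91-day and 365-day intervals are assumptions chosen to match the quarterly and annual examples above.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

# Assumed review cycles: roughly quarterly for high risk, annual for low risk.
REVIEW_INTERVALS = {"high": timedelta(days=91), "low": timedelta(days=365)}


@dataclass
class AuditEvent:
    actor: str
    role: str    # "steward" or "owner"
    action: str
    label: str   # the classification label in force after this event
    on: date
    note: str = ""


@dataclass
class ClassificationRecord:
    asset: str
    label: str   # label proposed by the agent
    risk: str    # "high" or "low"
    status: str = "proposed"
    trail: List[AuditEvent] = field(default_factory=list)

    def steward_review(self, steward: str, on: date,
                       new_label: Optional[str] = None, note: str = "") -> None:
        """Steward validates the proposed label, or modifies it."""
        if self.status != "proposed":
            raise ValueError("record is not awaiting steward review")
        action = "validated"
        if new_label is not None and new_label != self.label:
            self.label = new_label
            action = "modified"
        self.status = "steward_validated"
        self.trail.append(AuditEvent(steward, "steward", action, self.label, on, note))

    def owner_approve(self, owner: str, on: date) -> date:
        """Owner gives final approval; returns the date of the next review."""
        if self.status != "steward_validated":
            raise ValueError("steward validation must come first")
        self.status = "approved"
        self.trail.append(AuditEvent(owner, "owner", "approved", self.label, on))
        return on + REVIEW_INTERVALS[self.risk]
```

Because every state change appends an `AuditEvent`, the trail doubles as the compliance record, and steward modifications are the natural signal to feed back into the agent.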

Beyond Structured Data: Tackling Unstructured Content

While data classification is often associated with structured databases, modern organisations face an equally significant challenge with unstructured data. AI agents can extend their classification capabilities to content stored in systems like SharePoint, network drives, or document management systems.

For example, an agent can:

  1. Scan document repositories for unclassified files

  2. Analyse content for sensitive information (PII, financial data, intellectual property)

  3. Assess document sharing settings and access patterns

  4. Generate risk reports highlighting:

    • Documents containing sensitive data but lacking proper classification

    • Files with overly permissive access settings

    • Potential data exposure risks

  5. Provide recommended classification levels based on content analysis

  6. Request human feedback on the proposed changes, then apply metadata labels or classification settings to object-level data.
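The risk-report step can be sketched as follows. This is a minimal Python illustration that works on file metadata already held in memory. It does not call SharePoint or any document management API, and the `FileInfo` fields, sharing values, detection patterns, and recommended label are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Simplified content detectors standing in for agent analysis (assumption).
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "financial_keyword": re.compile(r"\b(?:salary|payroll|invoice)\b", re.IGNORECASE),
}

# Sharing scopes treated as too broad for sensitive content (assumption).
OVERLY_PERMISSIVE = {"organisation", "anyone_with_link"}


@dataclass
class FileInfo:
    path: str
    text: str
    label: Optional[str]  # existing classification label, if any
    shared_with: str      # "private", "team", "organisation" or "anyone_with_link"


def build_risk_report(files: List[FileInfo]) -> List[Dict[str, object]]:
    """List files whose content looks sensitive and whose controls look weak."""
    report = []
    for f in files:
        detected = sorted(k for k, p in SENSITIVE_PATTERNS.items() if p.search(f.text))
        if not detected:
            continue
        issues = []
        if f.label is None:
            issues.append("sensitive content without classification")
        if f.shared_with in OVERLY_PERMISSIVE:
            issues.append("overly permissive sharing")
        if issues:
            report.append({
                "path": f.path,
                "detected": detected,
                "issues": issues,
                "recommended_label": "confidential",  # a proposal only
                "needs_human_review": True,           # nothing is applied automatically
            })
    return report
```

The report only recommends. Applying a label or tightening access is left until a person has confirmed the proposal, in line with the final step above.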

The Path Forward

As organisations continue their AI journey, establishing robust data governance practices becomes increasingly critical. AI agents offer a scalable solution to this challenge, enabling organisations to:

  • Implement comprehensive data classification at scale

  • Maintain consistent governance standards across the enterprise

  • Reduce the manual burden on data governance teams

  • Build a strong foundation for trustworthy AI implementations

The future of data governance lies in this symbiotic relationship between human expertise and AI capabilities. By leveraging AI agents for the heavy lifting of data classification and governance, organisations can focus their human resources on strategic decisions and edge cases that require nuanced judgement.

If You Are Just Getting Started

Organisations looking to implement AI agents for data governance should consider a phased approach:

  1. Start with a pilot project focusing on a specific dataset or system

  2. Define clear success criteria based on your organisation's governance framework

  3. Implement human-in-the-loop validation to build confidence in the system

  4. Gradually expand scope based on lessons learned

  5. Continuously refine agent capabilities based on feedback and changing requirements

When beginning your data classification journey, consider these additional steps:

  1. Identify key data domains and their owners

  2. Document current classification practices and gaps

  3. Define success metrics for both structured and unstructured data

  4. Plan for regular stakeholder reviews and feedback sessions

  5. Establish clear escalation paths for classification disputes

The journey to robust AI governance starts with strong data governance. By embracing AI agents as partners in this journey, organisations can build the foundation needed for responsible and effective AI adoption.

Remember, in the world of AI, the quality of your outputs is only as good as the governance of your inputs. Making this investment in automated data classification today will pay dividends in your AI initiatives tomorrow.
