I’m sure we all did something similar once or twice in our DBA lives. I had to create a simple script to perform regular expression based data discovery for Oracle. This script will be used as a check in our McAfee Database Vulnerability Manager. We do support data discovery directly in the tool but the advantage of such a script is that all the data is processed directly in the database without pulling it over the network to the discovery tool.

First, we need to create the tables to hold the discovered data:

CREATE SEQUENCE seq_data_discovery MAXVALUE 9999999999 CYCLE
CREATE TABLE data_discovery
 id NUMBER(10),
 owner VARCHAR2(30),
 table_name VARCHAR2(30),
 column_name VARCHAR2(30),
 CONSTRAINT data_discovery_pk PRIMARY KEY (id)
CREATE TABLE data_discovery_rows
 id NUMBER(10),
 row_id VARCHAR2(20),
 val VARCHAR2(4000),
 CONSTRAINT data_discovery_rows_fk FOREIGN KEY (id) REFERENCES data_discovery(id)

Next, comes the stored procedure to populate the data using simple PL/SQL.

CREATE OR REPLACE procedure discover_data(p_regex IN VARCHAR2, p_owner IN VARCHAR2 := '%',
    p_table_name IN VARCHAR2 := '%', p_limit IN NUMBER := NULL)
-- Discover data based on the given parameters.
-- We iterate on relevant column types for the given owners and tables and match
-- the contents to the given regex returning data into a table to the specified limit
    l_stmt VARCHAR2(32760);
    l_id NUMBER(10);
    -- Might want to handle varios XML types as well
    FOR r_col IN (
        SELECT tc.owner, tc.table_name, tc.column_name, tc.data_type
        FROM all_tab_columns tc, all_tables t
        WHERE tc.owner LIKE p_owner AND
            tc.table_name LIKE p_table_name AND
            tc.table_name NOT LIKE 'DATA_DISCOVERY%' AND
            tc.owner = t.owner AND
            tc.table_name = t.table_name AND
            tc.data_type IN ('VARCHAR2', 'NVARCHAR2', 'NUMBER', 'CHAR', 'NCHAR', 'CLOB', 'NCLOB')
        ORDER BY owner, table_name, column_id) LOOP
        -- Insert the header
        INSERT INTO data_discovery
        VALUES (seq_data_discovery.nextval, r_col.owner, r_col.table_name, r_col.column_name)
        RETURNING id INTO l_id;
        l_stmt := 'INSERT INTO data_discovery_rows SELECT ' || TO_CHAR(l_id) || ', rowid, $col FROM "' ||
            r_col.owner || '"."' || r_col.table_name ||
            '" WHERE REGEXP_LIKE($col, :1, ''i'')';
        IF p_limit IS NOT NULL
            l_stmt := l_stmt || ' AND rownum <= :2';
        END IF;
        IF r_col.data_type IN ('VARCHAR2', 'NVARCHAR2', 'CHAR', 'NCHAR')
            l_stmt := REPLACE(l_stmt, '$col', '"' || r_col.column_name || '"');
        ELSIF r_col.data_type IN ('CLOB', 'NCLOB')
            l_stmt := REPLACE(l_stmt, '$col', 'dbms_lob.substr("' || r_col.column_name || '", 4000)');
        ELSIF r_col.data_type = 'NUMBER'
            l_stmt := REPLACE(l_stmt, '$col', 'TO_CHAR("' || r_col.column_name || '")');
        END IF;
        IF p_limit IS NOT NULL
            EXECUTE IMMEDIATE l_stmt USING p_regex, p_limit;
            EXECUTE IMMEDIATE l_stmt USING p_regex;
        END IF;

I’m sure this can be improved in many ways but I basically had 30 minutes to hack this together and connect it to the scan tool.