Sanitized SQL

3 weeks ago 3

A couple times within the past month, I’ve had people send me a message asking if I have any suggestions about where to learn postgres. I like to share the collection of links that I’ve accumulated (and please send me more, if you have good ones!) but another thing I always say is that the public postgres slack is a nice place to see people asking questions (Discord, Telegram and IRC also have thriving Postgres user communities). Trying to answer questions and help people out can be a great way to learn!

Last month there was a brief thread on the public postgres slack about the idea of sanatizing SQL and this has been stuck in my head for awhile.

The topic of sensitive data and SQL is actually pretty nuanced.

First, I think it’s important to directly address the question about how to treat databases schemas – table and column names, function names, etc. We can take our cue from the many industry vendors with data catalog, data lineage and data masking products. Schemas should be internal and confidential to a company – but they are not sensitive in the same way that PII or PCI data is. Within a company, it’s desirable for most schemas to be discoverable by engineers across multiple development teams – this is worth the benefits of better collaboration and better architecture of internal software.

Unfortunately, the versatile SQL language does not cleanly separate things. A SQL statement is a string that can mix keywords and schema and data all together. As Benoit points out in the slack thread – there are prepared (parameterized) statements, but you can easily miss a spot and end up with literal strings in queries. And I would add that most enterprises have the occasional need for manual “data fixes” which often involve simple scripts where literal values are common.

Benoit’s suggestion was to run a full parse of the query text. This is a good idea – in fact PgAnalyze already maintains a standalone open-source library which can be used to directly leverage Postgres’ query parser in many languages. This is really the best solution. However it is worth noting that I’m interested in cases of post-processing query texts from pg_stat_activity and pg_stat_statements, both of which have maximum lengths and will truncate text that’s longer. So query parsing would need to still work with truncated texts that throw syntax errors.

The PgAnalyze library approach is interesting, but I think a simple regex-based approach actually has a lot of merit. This can give very useful sanatized SQL for developers to debug, it can still make strong guarantees that we won’t include sensitive data, and the code is incredibly simple… especially compared with importing the entire postgres parser and trying to link to compiled C libraries in other languages!

Tonight I finally got around to a POC for this.

My design choices here were very intentional:

  • I’m stripping out comments because libraries like sqlcommenter will add unique values via comments which break any ability to aggregate and summarize and report top queries or problem queries.
  • I would always include the query_id alongside the sanitized SQL text. A user can always go back to the database later and look directly at pg_stat_statements to get the full query text as long as they have the Query ID.
  • My decision to include the first three words (excluding comments) and two words following every occurrence of FROM is very strategic. In most cases (CTEs being the exception), the first three words will tell what kind of command is being executed – SELECT or DML or DDL or some utility/misc statement. By including two words after the command, we will generally see the table name for inserts and updates. By including words after FROM, we’ll know at least one of the tables being operated on for queries and deletes. This means we always know at a glance “it’s updating table X” or “it’s querying table Y”.
  • The likelihood of this algorithm including a string literal with sensitive data is next-to-zero. We’re guaranteed never to get literals from INSERTs or UPDATEs. Function and procedure calls must always include parentheses, so that’s easily mitigated with a simple regex to nuke anything after an open-parenthesis.
  • There may be a few cases where this algorithm’s sanitized SQL isn’t as useful as it could be. But that’s why we include the Query ID for retrieving the full query text if needed – and my main goal here is just to have something that’s helpful most of the time and that we can ensure is safe for developers and operators without requiring PII/PCI data controls.
  • If the string ‘FROM’ occurs in a string literal, then we aren’t going to distinguish that from a keyword. I think the risk of sensitive data exposure here is minimal. (See test 15 below for an example.)

Sanitize SQL PL/pgSQL Function

https://gist.github.com/ardentperf/44e94ac484e53ff8353f6c1dc0b8f272

Here’s what the code looks like:

CREATE OR REPLACE FUNCTION sanatize_sql(sql_text text) RETURNS text AS $$ DECLARE cleaned_text text; first_part_regex_3words text := '([^[:space:]]+)[[:space:]]+([^[:space:]]+)[[:space:]]+([^[:space:]]+)'; first_part_regex_2words text := '([^[:space:]]+)[[:space:]]+([^[:space:]]+)'; first_part_regex_1words text := '([^[:space:]]+)'; first_part text; match_array text; from_parts_regex_3words text := '(FROM)[[:space:]]+([^[:space:]]+)[[:space:]]*([^[:space:]]*)'; from_parts text := ''; BEGIN -- Remove multi-line comments (/* ... */) cleaned_text := regexp_replace(sql_text, '/\*.*?\*/', '', 'g'); -- Remove single-line comments (-- to end of line) cleaned_text := regexp_replace(cleaned_text, '--.*?(\n|$)', '', 'g'); -- Extract the first keyword and up to two words after it first_part := array_to_string(regexp_match(cleaned_text,first_part_regex_3words),' '); if first_part is null or first_part ILIKE '% FROM %' or first_part ILIKE '% FROM' then first_part := array_to_string(regexp_match(cleaned_text,first_part_regex_2words),' '); if first_part is null or first_part ILIKE '% FROM' then first_part := array_to_string(regexp_match(cleaned_text,first_part_regex_1words),' '); end if; end if; first_part := regexp_replace(first_part, '\(.*','(...)'); -- Find all occurrences of FROM and two words after each FOR match_array IN SELECT array_to_string(regexp_matches(cleaned_text,from_parts_regex_3words,'gi'),' ') LOOP match_array := regexp_replace(match_array, '\(.*','(...)'); from_parts := from_parts || '...' || match_array; END LOOP; -- Return combined result RETURN first_part || from_parts; END; $$ LANGUAGE plpgsql;

Test 1: Literal Sensitive Data with the Letters “FROM”

postgres=# SELECT sanatize_sql($test$ -- Example: inserting sensitive data as literals INSERT INTO customers (first_name, last_name, email, ssn) VALUES ('ERICH', 'FROM', '[email protected]', '123-45-6789'); $test$); sanatize_sql ----------------------- INSERT INTO customers (1 row)

This is very unlikely to happen in practice but if we are very intentional about the spacing, then we can break the algorithm (this is effectively a SQL injection attack):

postgres=# SELECT sanatize_sql($test$ -- Example: inserting sensitive data as literals INSERT INTO customers (first_name, last_name, email, ssn) VALUES (' ERICH ',' FROM ','[email protected]', '123-45-6789'); $test$); sanatize_sql --------------------------------------------------------------------------- INSERT INTO customers...FROM ','[email protected]', '123-45-6789'); (1 row)

Test 1: Sensitive Data in a Function Call

postgres=# SELECT sanatize_sql($test$ SELECT pgp_sym_encrypt('123-45-6789', 'my_secret_key') AS encrypted_ssn; $test$); sanatize_sql ----------------------------- SELECT pgp_sym_encrypt(...) (1 row)

Test 2: Simple SELECT with Inline and Block Comments

postgres=# SELECT sanatize_sql($test$ -- Fetch active users only SELECT id, name -- user info FROM users /* main table */ WHERE active = TRUE; /* status flag */ $test$); sanatize_sql ------------------------------------ SELECT id, name...FROM users WHERE (1 row)

Test 3: SELECT with Subquery and Mixed Comment Styles

postgres=# SELECT sanatize_sql($test$ SELECT id, name FROM users WHERE id IN ( /* subquery for high-value customers */ SELECT user_id -- link to users.id FROM orders WHERE total > 100 -- filter expensive orders ); -- end of query $test$); sanatize_sql -------------------------------------------------------- SELECT id, name...FROM users WHERE...FROM orders WHERE (1 row)

Test 4: SELECT + CTE with Comments Inside and Outside

postgres=# SELECT sanatize_sql($test$ -- recent orders per user WITH recent_orders AS ( SELECT user_id, MAX(created_at) AS last_order FROM orders GROUP BY user_id /* aggregation */ ) SELECT u.name, r.last_order FROM users u JOIN recent_orders r ON u.id = r.user_id; -- join results $test$); sanatize_sql ---------------------------------------------------------- WITH recent_orders AS...FROM orders GROUP...FROM users u (1 row)

Test 5: INSERT with Comments in Values

postgres=# SELECT sanatize_sql($test$ INSERT INTO users (name, email, created_at) VALUES ( 'Alice', -- first name '[email protected]', /* email */ NOW() /* timestamp */ ); -- new user inserted $test$); sanatize_sql ------------------- INSERT INTO users (1 row)

Test 6: UPDATE with Trailing and Embedded Comments

postgres=# SELECT sanatize_sql($test$ UPDATE users SET last_login = NOW() -- set current time WHERE id = 42 /* target specific user */; -- done $test$); sanatize_sql ------------------ UPDATE users SET (1 row)

Test 7: DELETE with Multi-line Comment Block

postgres=# SELECT sanatize_sql($test$ /* * Delete old sessions. * Keep data from the last 30 days. * Be careful: irreversible. */ DELETE FROM sessions WHERE last_access < NOW() - INTERVAL '30 days'; $test$); sanatize_sql ------------------------------ DELETE...FROM sessions WHERE (1 row)

Test 8: UPSERT (Insert … On Conflict) with Inline + Header Comments

postgres=# SELECT sanatize_sql($test$ -- Upsert settings INSERT INTO user_settings (user_id, theme, notifications) VALUES ( 1, /* user id */ 'dark', -- theme TRUE -- notifications on ) ON CONFLICT (user_id) DO UPDATE SET theme = EXCLUDED.theme, -- overwrite notifications = EXCLUDED.notifications; $test$); sanatize_sql --------------------------- INSERT INTO user_settings (1 row)

Test 9: CTE-Based UPDATE with Nested Comments

postgres=# SELECT sanatize_sql($test$ -- mark inactive users WITH inactive_users AS ( SELECT id FROM users WHERE last_login < NOW() - INTERVAL '1 year' /* cutoff */ ) UPDATE users SET active = FALSE WHERE id IN ( SELECT id FROM inactive_users -- CTE reference ); $test$); sanatize_sql -------------------------------------------------------------------- WITH inactive_users AS...FROM users WHERE...FROM inactive_users ); (1 row)

Test 10: DDL with Comments Everywhere

postgres=# SELECT sanatize_sql($test$ -- create table if missing CREATE TABLE IF NOT EXISTS audit_log ( /* audit records */ id SERIAL PRIMARY KEY, -- identity column user_id INT REFERENCES users(id), /* FK */ action TEXT NOT NULL, -- what happened created_at TIMESTAMP DEFAULT NOW() /* timestamp */ ); $test$); sanatize_sql ----------------- CREATE TABLE IF (1 row)

Test 11: Complex Query with Multi-CTE, Inline + Block Comments

postgres=# SELECT sanatize_sql($test$ /* This query finds top customers. It uses multiple CTEs and subqueries. */ WITH order_totals AS ( SELECT user_id, SUM(total) AS lifetime_value FROM orders GROUP BY user_id -- one row per user ), top_customers AS ( SELECT user_id FROM order_totals WHERE lifetime_value > 10000 /* threshold */ ) SELECT u.id, u.name, o.lifetime_value -- main output FROM users u JOIN order_totals o ON u.id = o.user_id WHERE u.id IN (SELECT user_id FROM top_customers) ORDER BY o.lifetime_value DESC /* high to low */ LIMIT 10; -- top 10 $test$); sanatize_sql --------------------------------------------------------------------------------------------------------------- WITH order_totals AS...FROM orders GROUP...FROM order_totals WHERE...FROM users u...FROM top_customers) ORDER (1 row)

Test 12: Function Call in the FROM Clause

postgres=# SELECT sanatize_sql($test$ SELECT * FROM generate_series(1,10); $test$); sanatize_sql -------------------------------------- SELECT *...FROM generate_series(...) (1 row)

Test 13: Anonymous Code Block

postgres=# SELECT sanatize_sql($test$ DO $$ DECLARE tbl RECORD; BEGIN OPEN table_cursor; LOOP FETCH table_cursor INTO tbl; EXIT WHEN NOT FOUND; EXECUTE 'VACUUM ' || tbl.tablename; END LOOP; CLOSE table_cursor; END $$; $test$); sanatize_sql --------------- DO $$ DECLARE (1 row)

Test 14: Declaring a Cursor

postgres=# SELECT sanatize_sql($test$ -- Declare a cursor for employees in Engineering DECLARE emp_cursor CURSOR FOR SELECT id, name, salary FROM employees WHERE department = 'Engineering'; $test$); sanatize_sql -------------------------------------------------- DECLARE emp_cursor CURSOR...FROM employees WHERE (1 row)

Test 15: Joining Multiple Tables and FROM in a String Literal

postgres=# SELECT sanatize_sql($test$ SELECT c.name AS customer_name, o.order_id, o.order_date, oi.product_name, oi.quantity, 'Orders coming from customers are listed below' AS description FROM customers c, orders o, order_items oi WHERE c.customer_id = o.customer_id AND o.order_id = oi.order_id ORDER BY c.name, o.order_date; $test$); sanatize_sql ----------------------------------------------------------- SELECT c.name AS...from customers are...FROM customers c, (1 row)

.

Read Entire Article