The CASE of the Conditional Records

I’m working on a change request for a web service tied to an application we support at work, and I ran into one requirement (out of many) which I didn’t immediately know how to resolve. Maybe it was just the way it was worded in the spec, but it had me scratching my head for a good few hours before I got to the heart of the matter.

This post is mostly a reminder to myself to take a step back, and work through a problem in a notebook or similar; go back and forward a few times, from different angles until the puzzle is clear and you can find a way to approach it. Once I worked through it, and figured out exactly what was being asked for, the problem itself turned out to be significantly easier than I had first imagined.

The Problem

I have two linked tables of journey information: 1 for the journey record, one for each leg of that journey. For each leg, there could be up to 2 records, the planned leg (P), and the actual leg (A). When we return the results of the query, we want a row combining the flight information and the leg information, for every leg. The kicker was we never want to return a P leg record, when an A leg record existed. But we don’t know if an A leg record exists before we run the query.

[Side note – I am massively simplifying the data structure, and some of the problem for clarity and confidentiality]

 ---------------------------    ------------------------------------ 
| journey_id | journey_date |  | journey_id | leg_number | leg_type |
 ---------------------------    ------------------------------------ 
|     1      |  01/07/2015  |  |     1      |      1     |    P     |
 ---------------------------    ------------------------------------ 
|     2      |  01/07/2015  |  |     1      |      1     |    A     |
 ---------------------------    ------------------------------------ 
|     3      |  01/07/2015  |  |     1      |      2     |    P     |
 ---------------------------    ------------------------------------ 

So from the above data we want to return the following:

 ------------------------------------------------------- 
| journey_id | journey_date |  leg_number  |  leg_type  |
 ------------------------------------------------------- 
|     1      |  01/07/2015  |       1      |     A      |
 ------------------------------------------------------- 
|     1      |  01/07/2015  |       2      |     P      |
 ------------------------------------------------------- 

And not this, which a straightforward JOIN would produce.

 ------------------------------------------------------- 
| journey_id | journey_date |  leg_number  |  leg_type  |
 ------------------------------------------------------- 
|     1      |  01/07/2015  |       1      |     P      |
 ------------------------------------------------------- 
|     1      |  01/07/2015  |       1      |     A      |
 ------------------------------------------------------- 
|     1      |  01/07/2015  |       2      |     P      |
 ------------------------------------------------------- 

Obviously we need a WHERE clause in there, but like I said, we don’t know what records exist for each leg, so most common clauses aren’t going to give desirable results.

Effectively we are conditionally returning a record only if a certain other, nearly identical record in the same table does not exist – otherwise we want to return that record instead.

Additional Restraints

  1. Everything has to come from running a single SQL query against a table/set of tables. We can’t utilise Oracle packages or procedures. The query is embedded in the application and called as a prepared statement (to feed in the “real” filtering criteria later). We can use functions, but should be mindful of the performance implications
  2. We can’t do any “post-processing” of the data in the web service application code. Well, we could, but it’d be costly and clunky to do in the current framework.
  3. Everything should (obviously) be as performant as possible.

Solution

First Attempt

An infrequently used, but powerful, part of SQL is the CASE construct:

CASE
	WHEN foo THEN bar
	ELSE baz
END

By utilising a CASE statement and some subqueries, I could check if a row was in the actual legs of a journey, or whether it was a planned leg and there was no matching actual leg:

SELECT  j.journey_id, j.journey_date,
		jl.leg_number, jl.leg_type
FROM    journies j, journey_legs jl
WHERE
(CASE
	WHEN (jl.journey_id, jl.leg_number) IN
	     (SELECT journey_id, leg_number
	         FROM flight_legs
	        WHERE journey_id = fl.journey_id
	          AND leg_number = fl.leg_number
	          AND leg_type = 'A') THEN
	 (1)
	WHEN ((fl.journey_id, fl.leg_number) NOT IN
	     (SELECT journey_id, leg_number
	         FROM flight_legs
	        WHERE journey_id = fl.journey_id
	          AND leg_number = fl.leg_number
	          AND leg_type = 'A') AND fl.leg_type = 'P') THEN
	 (1)
	ELSE
	 (0)
END) = 1;

Execution time: 1.8 seconds

Second Attempt

A nice start, the above query returns what we want, but it’s a bit bloated. Do we need the first subquery at all? No, it turns out:

SELECT  j.journey_id, j.journey_date,
		jl.leg_number, jl.leg_type
FROM    journies j, journey_legs jl
WHERE
(CASE
	WHEN (jl.leg_type = 'A') THEN
	 (1)
	WHEN ((fl.journey_id, fl.leg_number) NOT IN
	     (SELECT journey_id, leg_number
	         FROM flight_legs
	        WHERE journey_id = fl.journey_id
	          AND leg_number = fl.leg_number
	          AND leg_type = 'A') AND fl.leg_type = 'P') THEN
	 (1)
	ELSE
	 (0)
END) = 1;

Execution time: 1 second

Final Attempt

Awesome, we are getting the data we want, the query is running at an acceptable pace for the number of records we have… but it’s kinda clunky looking. We want to try simplify it if possible so it’s easier for anyone who comes after us to maintain. Is it possible to rewrite this to take out the CASE, and make it more readable? Of course:

SELECT  j.journey_id, j.journey_date,
		jl.leg_number, jl.leg_type
FROM    journies j, journey_legs jl
WHERE (jl.leg_type = 'A' OR
       (jl.leg_type = 'P' AND
        (jl.flight_number, jl.leg_order) NOT IN
       	(SELECT journey_id, leg_order
             FROM journey_legs
            WHERE journey_id = jl.journey_id
              AND leg_order = jl.leg_order
              AND leg_type = 'A')));

Execution time: 0.85 seconds

That’s a bit better for anyone who maybe isn’t familiar with CASE to understand, and has a bit of hierarchy to it so you can mostly see at a glance what it’s doing. I don’t think we can get rid of the subquery, but sometimes these things are unavoidable.

%d bloggers like this: