The Meaning of “Set-Based”

The term set-based is used to describe an approach to handle querying tasks and is based on principles from the relational model. Remember that the relational model is based in part on mathematical set theory. Set-based solutions use T-SQL queries, which operate on the input tables as sets of rows. Such solutions are contrasted with iterative solutions that use cursors or other iterative constructs to handle one row at a time.

According to set theory, a set should be considered as a whole. This means that your attention should be focused on the set and not on its individual elements. With iterative solutions, you break this principle by operating on one element (row) at a time. Also, a set has no particular order to its elements. So when you use set-based solutions, you cannot make any assumptions about the order of the data. Similarly, unless you add an ORDER BY clause to the query, you’re not guaranteed that the data will be returned in any particular order. With iterative solutions, you process one row at a time and you can do so in a specific order.

When you use set-based solutions, you provide your request as a declarative plain language query. In your request, you focus on the “what” part of the request and let the database engine worry about the “how” part. With iterative solutions, you need to implement both the what and the how parts in your code. As a result, iterative solutions tend to be much longer than set-based ones and harder to follow and maintain.

Another reason why you should stick to set-based solutions is a very pragmatic one—performance. Iterative constructs in T-SQL are very slow. For one thing, loops in T-SQL are much slower than those in other programming languages such as in Microsoft .NET code. Secondly, each record fetch from a cursor by using the FETCH NEXT command has quite a high overhead associated with it. There’s no such overhead when SQL Server processes a set-based solution, even if internally the execution plan for the query involves iterations. As a result, if you know how to tune queries, you are often able to achieve much better performance compared to using iterative solutions.

For example, I have seen something like this:
SET NOCOUNT ON;
DECLARE @Result AS TABLE (
actid INT,
mx MONEY );

DECLARE
@actid AS INT,
@val AS MONEY,
@prevactid AS INT,
@prevval AS MONEY;

INSERT INTO #tbTemp
SELECT actid, val
  FROM dbo.Transactions
  ORDER BY actid, val;

SELECT TOP(1) @prevactid = actid, @prevval = val
FROM #tbTemp;

WHILE @prevactid NOT IS NULL
BEGIN
IF @actid @prevactid
     INSERT INTO @Result(actid, mx) VALUES(@prevactid, @prevval);
 

  DELETE #tbTemp
  WHERE actid = @actid;
 

  SELECT TOP(1) @prevactid = @actid, @prevval = @val
  FROM #tbTemp;
END

IF @prevactid  IS NOT NULL
INSERT INTO @Result(actid, mx) VALUES(@prevactid, @prevval);

SELECT actid, mx
FROM @Result;
GO

Some people think this solution is set-based because it doesn’t explicitly declare and use a cursor object (we talk about this later or never 🙂 ). However, recall that one of the principles implemented by set-based solutions is that they treat the set as a whole as opposed to handling one element at a time. This principle is violated here. Also, set-based solutions do not rely on the order of the data, and this one does.

How we can do this with a set-based rule. Well, lets see the following code:
SELECT
T1.actid, T1.tranid, T1.val,
SUM(T2.val) AS balance
FROM dbo.Transactions AS T1
JOIN dbo.Transactions AS T2 ON T2.actid = T1.actid AND T2.tranid <= T1.tranid
GROUP BY T1.actid, T1.tranid, T1.val;

The code above use a recursive join, recursive querys are commonly used with CTE (we talk later) and for this time this works well, but the best way in SQL Server 2012 is using Window Functions.

SELECT actid, tranid, val,
SUM(val) OVER( PARTITION BY actid
               ORDER BY tranid
               ROWS UNBOUNDED PRECEDING)
AS balance
FROM dbo.Transactions;

The conclusion is, always have a way to work with sets, the first intent wasn’t set-based because works one by one element at time.
I used that code with a temporary table (#tbTemp) because I will discuss this subject in the next post.

 

Execution Plan

books

Why I started my blog with some posts about query optimizer, execution plans and logical processing? Well, before you learn about codes and more codes we need to know how this works in SQL Server. For example, the developer know how to write a single select, insert or update statement or  a batch with conditions clauses like IF, but what happens when you execute the query? We need to know how the whole thing works and the concepts to work with sets.

So, SQL Server and others databases has a lot of information for your simple or complex batch to work well and others tools to help our job. The most important is to write the best query as possible. If you write a simple select, the first thing your have to see before you put in your application is the execution plan, IO and time of query response.

People are not very keen on waiting. They get nervous in a traffic jam. They are not too satisfied if they have to sit without a drink for a while in a bar. Similarly, they want their applications to be as responsive as possible. End users perceive performance problems through waiting.

Except for very simple queries, a query can be executed in many different ways. How many ways? Well, the number of different ways of execution or execution plans grows exponentially with query complexity.

For example, analyze the following pseudo-query very superficially:
SELECT
    A.col5
    , SUM (C.col6) AS col6sum
FROM
    TableA AS A
INNER JOIN TableB AS ON A.col1 B.col1
INNER JOIN TableC AS C ON B.col2 = C.col2
WHERE
    A.col3constant1 AND B.col4 constant2
GROUP BY
    A.col5;

Start with the FROM part. Which tables should SQL Server join first, TableA and TableB or TableB and TableC? And in each join, which of the two tables joined should be the left and which one the right table? The number of all possibilities is six, if the two joins are evaluated linearly, one after another. With evaluation of multiple joins at the same time, the number of
all possible combinations for processing the joins is already 12. The actual formula for possible combinations of join evaluation is n!, or n factorial, for linear evaluation, and (2n -2)!/ (n-1)! for parallel evaluation of possible joins.

This already gives four options for each join. So far, there are 6 x 4 = 24 different options for only the FROM part of this query. But the real situation is even worse. SQL Server can execute a hash join in three different ways. As mentioned, this is just a quick superficial analysis of pseudo-query execution, and for this introduction to query optimization problems, such details are not needed.

In the WHERE clause, two expressions are connected with a logical AND operator. The logical AND operator is commutative, so SQL Server can evaluate the second expression first.

Again, there are two choices. Altogether, there are already 6 x 4 x 2 = 48 choices. And again, the real situation is much worse. Because in the pseudo-query all joins are inner joins and because expressions in the WHERE clause are commutative, SQL Server can even start executing the query with any of the expressions of the WHERE clause, then switch to the FROM clause and perform first a join, evaluate the second expression from the WHERE clause, and so on.

So the number of possible plans is already much higher than 48. For this superficial overview, continue with the GROUP BY clause. SQL Server can execute this part in two ways, as an ordered group or as a hash group. Therefore, the number of options for executing the pseudo-query is already 6 x 4 x 2 x 2 = 96.

The important conclusion is that you can see that the number of different possible execution plans for a query grows factorially with query complexity. You can quickly get billions of possible execution plans. SQL Server has to decide which one to use in a very short time. You wouldn’t want to wait, for example, for a whole day for SQL Server to find out the best possible plan and then execute your queries in 5 seconds instead of in 15 seconds. Now you can imagine the complexity of the problems the SQL Server Query Optimizer has to solve with any single query.

 Writing queries requires basic T-SQL knowledge; writing well-performing queries needs much more advanced knowledge.