Locks everywhere

In this post I’m going to talk about locks in SQL Server. Locks are necessary; they are used in every operation in the database. Don’t confuse locking with blocking: locking and blocking are two different things.

When we talk about locks, almost anything you do in the database, such as an UPDATE or a SELECT, will acquire some type of lock. The SELECT statement acquires a shared lock, which means you can share reads with other sessions, and that usually does not cause blocking.

SQL Server has several lock modes, such as (S) Shared, (U) Update, (X) Exclusive, (I) Intent, (Sch) Schema, Bulk Update and Key-Range.

  • (S) Shared locks are used in read operations.
  • (U) Update locks are used to avoid a common deadlock scenario when a row is read first and updated later.
  • (X) Exclusive locks prevent access to a resource by concurrent transactions.
  • (I) Intent locks prevent other transactions from modifying the higher-level resource and improve the efficiency of the Database Engine in detecting lock conflicts at the higher level.
  • (Sch) Schema locks are used for schema operations; for example, SQL Server takes schema modification (Sch-M) locks during a table data definition language (DDL) operation.

The following table shows the compatibility of the most commonly encountered lock modes.

Requested mode                        Existing granted mode
                                      IS    S     U     IX    SIX   X
Intent shared (IS)                    Yes   Yes   Yes   Yes   Yes   No
Shared (S)                            Yes   Yes   Yes   No    No    No
Update (U)                            Yes   Yes   No    No    No    No
Intent exclusive (IX)                 Yes   No    No    Yes   No    No
Shared with intent exclusive (SIX)    Yes   No    No    No    No    No
Exclusive (X)                         No    No    No    No    No    No
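To see which locks a session is holding, you can query the sys.dm_tran_locks DMV. Below is a minimal sketch; dbo.Orders is an assumption, so swap in any table from your database.

BEGIN TRAN;

SELECT TOP (1) *
FROM dbo.Orders WITH (HOLDLOCK);      -- keep the shared locks until the end of the transaction

SELECT resource_type,
       request_mode,                  -- S, U, X, IS, IX, SIX, ...
       request_status,
       request_session_id
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;

COMMIT TRAN;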

 

Memory-optimized Logging

In a previous post I talked about how the transaction log works, but what about when we use a memory-optimized table?

SQL Server has the feature Memory-Optimized Objects to improve performance. In-memory nonclustered indexes are implemented using a data structure called a Bw-Tree. A Bw-Tree is a lock and latch-free variation of a B-Tree.

In-memory architecture (diagram).

To enable an application to use In-Memory OLTP, you need to complete the following tasks:

  • Create a memory-optimized data filegroup and add a container to the filegroup (see the sketch after this list).
  • Create memory-optimized tables and indexes.
  • Load data into the memory-optimized table and update statistics after loading the data and before creating the compiled stored procedures.
  • Create natively compiled stored procedures to access data in memory-optimized tables. You can also use traditional, interpreted Transact-SQL to access data in memory-optimized tables.

  • As needed, migrate data from existing tables to memory-optimized tables.
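For the first step, here is a minimal sketch of the filegroup and container setup, assuming a database named TestDB and a local folder C:\Data that already exists (both names are illustrative):

-- Add a memory-optimized filegroup and a container to it.
ALTER DATABASE TestDB
ADD FILEGROUP imoltp_fg CONTAINS MEMORY_OPTIMIZED_DATA;
GO
ALTER DATABASE TestDB
ADD FILE (NAME = 'imoltp_container', FILENAME = 'C:\Data\imoltp_container')
TO FILEGROUP imoltp_fg;
GO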

In this new architecture, let’s see what happens at the logging level.

I created a table using the code:

CREATE TABLE InMemoryTable (
 ID INT IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED
 , FIRST_NAME VARCHAR(100)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY)

As you can see, the first part is the same as for a normal table; to make it an in-memory table we add the MEMORY_OPTIMIZED and DURABILITY options. In this case, I’d like you to pay attention to DURABILITY being set to SCHEMA_ONLY.

Let’s take a look at the transaction log after creating the table.

We can see all the system-table changes needed to create the table. Now let’s insert a row into the table and look at the log again:

INSERT INTO InMemoryTable VALUES('Douglas Correa')
GO


You can see that nothing changed. But where’s my data? The data is in the table, but only in memory. As you can imagine, both logging and saving the data to disk are expensive operations.

That means that with DURABILITY = SCHEMA_ONLY the data won’t be there after a crash or a server restart. If we change it to SCHEMA_AND_DATA and look at the log file again, we will see the log operations generated when inserting data.
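As far as I know, DURABILITY can’t be changed in place, so a sketch of the test is to drop and recreate the table with the new setting and insert the row again:

-- Recreate the table as durable: schema and data are logged and persisted.
DROP TABLE InMemoryTable;
GO
CREATE TABLE InMemoryTable (
 ID INT IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED
 , FIRST_NAME VARCHAR(100)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA)
GO
INSERT INTO InMemoryTable VALUES('Douglas Correa')
GO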


As you can see, the log operation is LOP_HK; the row is in the log, so in case of a crash SQL Server can redo the insert.

Conclusion

Memory-optimized tables are fast and can improve performance, especially if you don’t need to persist the data, but there are limitations. One limitation I didn’t like is that I can’t detach and attach the database to recreate the log file.

 

What does the Checkpoint do?

Checkpoints flush dirty data pages from the buffer cache of the current database to disk. This minimizes the active portion of the log that must be processed during a full recovery of a database. During a full recovery, the following types of actions are performed:

  • The log records of modifications not flushed to disk before the system stopped are rolled forward.
  • All modifications associated with incomplete transactions, such as transactions for which there is no COMMIT or ROLLBACK log record, are rolled back.

Checkpoints occur in the following situations:

  • A CHECKPOINT statement is explicitly executed (see the sketch after this list). A checkpoint occurs in the current database for the connection.
  • A minimally logged operation is performed in the database; for example, a bulk-copy operation is performed on a database that is using the Bulk-Logged recovery model.
  • Database files have been added or removed by using ALTER DATABASE.
  • An instance of SQL Server is stopped by a SHUTDOWN statement or by stopping the SQL Server (MSSQLSERVER) service. Either action causes a checkpoint in each database in the instance of SQL Server.
  • An instance of SQL Server periodically generates automatic checkpoints in each database to reduce the time that the instance would take to recover the database.
  • A database backup is taken.
  • An activity requiring a database shutdown is performed. For example, AUTO_CLOSE is ON and the last user connection to the database is closed, or a database option change is made that requires a restart of the database.
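The first case is the only one you trigger directly. A minimal sketch, including the optional checkpoint_duration argument:

-- Run in the database you want to checkpoint.
CHECKPOINT;      -- let SQL Server decide how much I/O to use
CHECKPOINT 10;   -- ask for the checkpoint to complete within roughly 10 seconds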


VLF (Virtual Log Files)


To see how many VLFs you have, simply look at the number of rows returned by DBCC LOGINFO.
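A minimal sketch of both the older and the newer way to count VLFs (sys.dm_db_log_info is available from SQL Server 2016 SP2 / 2017 onward, so check your version):

DBCC LOGINFO;                         -- one row per VLF in the current database

SELECT COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID());     -- newer DMF, also one row per VLF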

The size and number of VLFs you’ll have depends largely on the size of the chunk that is added to your transaction log when it grows.

There is no general rule for determining the best value for the auto-growth option, as it varies from case to case, but having too many or too few virtual log files hurts performance.

Having an excessive number of VLFs can negatively impact all transaction log related activities and you may even see degradation in performance when transaction log backups occur.

Most of the time, excessive VLF fragmentation is caused by frequent file growth in small increments. For example, a database whose transaction log file grows 5 MB at a time will end up with a large number of VLFs if the log grows often.

Growth                 Number of VLFs created
<= 64 MB               4
> 64 MB and <= 1 GB    8
> 1 GB                 16

Wait wait wait…

Let’s talk about why we have to wait and how to understand the wait types.

Paul Randal in his post Wait statistics, or please tell me where it hurts said:

A thread is using the CPU (called RUNNING) until it needs to wait for a resource. It then moves to an unordered list of threads that are SUSPENDED. In the meantime, the next thread on the FIFO (first-in-first-out) queue of threads waiting for the CPU (called being RUNNABLE) is given the CPU and becomes RUNNING. If a thread on the SUSPENDED list is notified that its resource is available, it becomes RUNNABLE and is put on the bottom of the RUNNABLE queue. Threads continue this clockwise movement from RUNNING to SUSPENDED to RUNNABLE to RUNNING again until the task is completed.

That explains a lot, because SQL Server threads don’t all run at the same time. A good example is when a query is doing physical reads. The I/O subsystem is the slowest of our resources and will probably take some time if the query is reading gigabytes of data.


After the CPU requests the data from disk, the disk goes to work, but before it can send the data back, the data needs to be placed in memory first, and memory may not have the necessary space free. The thread is going to wait until some resources are released. There are many scenarios to consider: for example, how many threads are running this query? How long will the application take to consume and display the data?

So, every time a thread needs to wait for a resource, SQL Server accumulates time against a wait type, such as PAGEIOLATCH_XX, PAGELATCH_XX, ASYNC_NETWORK_IO, CXPACKET or RESOURCE_SEMAPHORE. I will talk more about waits in the next posts.
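A minimal sketch to see where your instance has been waiting, aggregated since the last restart (the excluded wait types are just a few common benign ones; the filter is illustrative, not exhaustive):

SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                        'SQLTRACE_BUFFER_FLUSH', 'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;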

SQL Operations Studio on Linux

Great news, a visual way to manage SQL Server from Linux.

SQL Operations Studio fills a gap for those of us who were waiting for a way to develop and manage SQL Server databases on Linux. At https://github.com/Microsoft/sqlopsstudio we can find more information, like the features below.

Feature Highlights

  • Cross-Platform DB management for Windows, macOS and Linux with simple XCopy deployment
  • SQL Server Connection Management with Connection Dialog, Server Groups, and Registered Servers
  • Object Explorer supporting schema browsing and contextual command execution
  • T-SQL Query Editor with advanced coding features such as autosuggestions, error diagnostics, tooltips, formatting and peek definition
  • Query Results Viewer with advanced data grid supporting large result sets, export to JSON\CSV\Excel, query plan and charting
  • Management Dashboard supporting customizable widgets with drill-through actionable insights
  • Visual Data Editor that enables direct row insertion, update and deletion into tables
  • Backup and Restore dialogs that enables advanced customization and remote filesystem browsing, configured tasks can be executed or scripted
  • Task History window to view current task execution status, completion results with error messages and task T-SQL scripting
  • Scripting support to generate CREATE, SELECT and DROP statements for database objects
  • Workspaces with full Git integration and Find In Files support to managing T-SQL script libraries
  • Modern light-weight shell with theming, user settings, full screen support, integrated terminal and numerous other features

Let’s see how it works:

The process is very simple. The first step is to download the file from this link: https://go.microsoft.com/fwlink/?linkid=862646

After downloading, the next step is to extract the files to a folder of your choice and run the sqlops file.

My first impression is quite good: a solid visual tool for working with SQL Server on Linux. I like the execution plan view and the database management features.


 


Compared with SQL Server Management Studio, I miss some options, such as:

  • LinkedServer Objects
  • Maintenance Plans
  • Replication
  • Database Mail
  • Extended Events

The conclusion is that I really can work with it: I can run all my scripts and create objects like linked servers, extended events or replication. That helps a lot when trying out concepts on a new platform.

 

Search Arguments

SARGs, or search arguments, are very important for a good query. Even if a predicate is very selective and supported by an index, that does not guarantee that SQL Server will use the index. You need to write an appropriate predicate to allow the Query Optimizer to take advantage of the indexes. The Query Optimizer is not omnipotent; it can decide to use an index only when the arguments in the predicate are searchable. You have to learn how to write appropriate search arguments (SARGs).


To write an appropriate SARG, you must ensure that a column that has an index on it appears in the predicate alone, not as a function parameter. SARGs must take the form column inclusive_operator <value> or <value> inclusive_operator column. The column name is alone on one side of the expression, and the constant or calculated value appears on the other side. Inclusive operators include =, >, <, >=, <=, BETWEEN, and LIKE.
However, the LIKE operator is inclusive only if you do not use a wildcard % or _ at the beginning of the string you are comparing the column to.

Suppose a query filters on orderdate but wraps the column in a function. The WHERE predicate returns only two rows, so it is very selective, and there is a nonclustered index on the orderdate column. Even so, SQL Server does not use the index.

The orderdate column in the predicate does not appear alone; it is instead an argument of a function. You can rewrite such a query in many ways. The following rewrite produces the same result, but this time the predicate is a SARG, and the two queries have very different costs.
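A minimal sketch of both versions, assuming a Sales.Orders table with a nonclustered index on orderdate (as in the common sample databases):

-- Not a SARG: the indexed column is an argument of a function, so the index cannot be used for a seek.
SELECT orderid, orderdate
FROM Sales.Orders
WHERE YEAR(orderdate) = 2007;

-- SARG: the column appears alone on one side of the comparison, so an index seek is possible.
SELECT orderid, orderdate
FROM Sales.Orders
WHERE orderdate >= '20070101'
  AND orderdate <  '20080101';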


Using the AND operator in the WHERE clause predicate means that each part of the predicate limits the result set even more than the previous part. For example, if the first condition limits a query to five rows, then the next condition connected to the first one with the logical AND operator limits the query to five rows at most. The Query Optimizer understands how the logical AND operator works, and can use appropriate indexes.

However, the logical OR operator is inclusive. For example, if the first condition in a predicate would limit the query to 5 rows and the next condition connected to the first condition with the logical OR operator would limit the query to 6 rows, then the result set could have anything between 6 and 11 rows. If the two conditions use two different columns, then SQL Server conservatively takes the worst case and estimates that the query would return 11 rows.

Having multiple conditions in a predicate connected with the OR operator lowers the chance that SQL Server will use indexes. You should consider rewriting the predicate to a logically equivalent predicate that uses the AND operator.

Another thing to keep in mind is that SQL Server executes the IN and OR operators in the same way. For example, the following two queries return the same two rows and are treated as equal internally.
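A minimal sketch, again assuming Sales.Orders (the order IDs are just illustrative values):

SELECT orderid, orderdate
FROM Sales.Orders
WHERE orderid IN (10248, 10249);

SELECT orderid, orderdate
FROM Sales.Orders
WHERE orderid = 10248
   OR orderid = 10249;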


Temporary Tables

Sometimes we need temporary tables to do aggregations, joins, or simply to make a job easier. SQL Server supports a number of options for storing data temporarily: you can use temporary tables and table variables. Before we start with code, we need to know some concepts.

Local temporary tables are named with a single number sign as a prefix; for example, #T1.

They are visible only to the session that created them. Different sessions can actually create temporary tables with the same name, and each session will see only its own table. Behind the scenes, SQL Server adds unique suffixes to make the names unique in the database, but this is transparent to the sessions.

Local temporary tables are visible throughout the level that created them, across batches, and in all inner levels of the call stack. So if you create a temporary table in a specific level in your code and then execute a dynamic batch or a stored procedure, the inner batch can access the temporary table. If you don’t drop the temporary table explicitly, it is destroyed when the level that created it terminates.
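A minimal sketch of that visibility rule, using a dynamic batch as the inner level:

CREATE TABLE #T1 (col1 INT NOT NULL);
INSERT INTO #T1 (col1) VALUES (1);

EXEC ('SELECT col1 FROM #T1;');   -- the inner batch can see the outer level's #T1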

Global temporary tables are named with two number signs as a prefix; for example, ##T1.
They are visible to all sessions. They are destroyed when the session that created them terminates and there are no active references to them.

Table variables are declared, as opposed to being created. They are named with the at sign (@) as a prefix; for example, @T1. They are visible only to the batch that declared them and are destroyed automatically at the end of the batch. They are not visible across batches in the same level, and are also not visible to inner levels in the call stack.

Temporary tables are created in tempdb in the dbo schema. As already mentioned, you can create two temporary tables with the same name in different sessions, because SQL Server internally adds a unique suffix to each. But if you create temporary tables in different sessions with the same constraint name, only one will be created and the other attempts will fail.

Physical Representation in tempdb

There’s a common misconception that only temporary tables have a physical representation in tempdb and that table variables reside only in memory. This isn’t true. Both temporary tables and table variables have a physical representation in tempdb.

You can find entries in the sys.objects view for the internal tables that SQL Server creates in tempdb to implement your temporary tables and table variables. As an example, the following code creates a temporary table called #T1 and then queries the sys.objects view in tempdb looking for table names starting with #.
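A minimal sketch of that check; it also declares a table variable in the same batch so we can see both kinds of objects:

CREATE TABLE #T1 (col1 INT NOT NULL);
DECLARE @T2 AS TABLE (col1 INT NOT NULL);

SELECT name, object_id, create_date
FROM tempdb.sys.objects
WHERE name LIKE '#%';    -- both #T1... and the internal table backing @T2 show up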

As we can see, both the table variable and the temporary table get objects created for them in tempdb.

So when should I use one or the other?

When it comes to performance, there’s a very important difference between temporary tables and table variables. SQL Server maintains distribution statistics (histograms) for temporary tables but not for table variables. This means that, generally speaking, you tend to get more optimal plans for temporary tables. This comes at the cost of maintaining histograms, and at the cost of recompilations that are associated with histogram refreshes.
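A minimal sketch of the comparison, generating 100,000 rows so the difference in estimates is visible (the inline INDEX syntax requires SQL Server 2014 or later); compare the actual execution plans and estimated row counts of the two filtered queries:

CREATE TABLE #T
(
    col1 INT NOT NULL PRIMARY KEY,
    col2 INT NOT NULL INDEX idx_col2 NONCLUSTERED
);

INSERT INTO #T (col1, col2)
SELECT TOP (100000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 10000
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;

-- Selective filter: the optimizer has a histogram on col2 and can pick a seek plus key lookups.
SELECT col1, col2 FROM #T WHERE col2 = 3;

DECLARE @T AS TABLE
(
    col1 INT NOT NULL PRIMARY KEY,
    col2 INT NOT NULL INDEX idx_col2 NONCLUSTERED
);

INSERT INTO @T (col1, col2)
SELECT col1, col2 FROM #T;

-- Same filter: no histogram, so the row estimate falls back to a hardcoded guess.
SELECT col1, col2 FROM @T WHERE col2 = 3;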


The plan for the temporary table query is very efficient. The optimizer examined the histogram on col2 and estimated that a very small number of rows would satisfy the filter. Because the filter is very selective, the plan uses the index on col2, and only a small number of key lookups are required to obtain the respective data rows. For such a selective filter, this plan is preferable to one that does a full clustered index scan.


Unlike with temporary tables, SQL Server doesn’t maintain histograms for table variables. Not being able to accurately estimate the selectivity of the filter, the optimizer relies on hardcoded estimates that assume fairly low selectivity (30 percent). It just doesn’t realize that the filter is actually very selective, and that a plan more similar to the one shown earlier in the temporary table would have been much more efficient.

A common question is whether table expressions such as common table expressions (CTEs) also get persisted like temporary tables and table variables. The answer is no. When SQL Server optimizes a query against a table expression, it unnests the inner query’s logic and interacts directly with the underlying tables. This means that unlike temporary tables and table variables, table expressions have no physical side to them.

The conclusion from this example is that when the plan efficiency depends on existence of histograms, you should use temporary tables. Table variables are fine to use in two general cases. One is when the volume of data is so small, like a page or two, that the efficiency of the plan isn’t important. The other case is when the plan is trivial. A trivial plan means that there’s only one sensible plan and the optimizer doesn’t really need histograms to come up with this conclusion. An example for such a plan is a range scan in a clustered index or a covering index. Such a plan is not dependent on selectivity of the filter—it’s simply always a better option
than a full scan.

The Meaning of “Set-Based”

The term set-based is used to describe an approach to handle querying tasks and is based on principles from the relational model. Remember that the relational model is based in part on mathematical set theory. Set-based solutions use T-SQL queries, which operate on the input tables as sets of rows. Such solutions are contrasted with iterative solutions that use cursors or other iterative constructs to handle one row at a time.

According to set theory, a set should be considered as a whole. This means that your attention should be focused on the set and not on its individual elements. With iterative solutions, you break this principle by operating on one element (row) at a time. Also, a set has no particular order to its elements. So when you use set-based solutions, you cannot make any assumptions about the order of the data. Similarly, unless you add an ORDER BY clause to the query, you’re not guaranteed that the data will be returned in any particular order. With iterative solutions, you process one row at a time and you can do so in a specific order.

When you use set-based solutions, you provide your request as a declarative plain language query. In your request, you focus on the “what” part of the request and let the database engine worry about the “how” part. With iterative solutions, you need to implement both the what and the how parts in your code. As a result, iterative solutions tend to be much longer than set-based ones and harder to follow and maintain.

Another reason why you should stick to set-based solutions is a very pragmatic one—performance. Iterative constructs in T-SQL are very slow. For one thing, loops in T-SQL are much slower than those in other programming languages such as in Microsoft .NET code. Secondly, each record fetch from a cursor by using the FETCH NEXT command has quite a high overhead associated with it. There’s no such overhead when SQL Server processes a set-based solution, even if internally the execution plan for the query involves iterations. As a result, if you know how to tune queries, you are often able to achieve much better performance compared to using iterative solutions.

For example, I have seen something like this:
SET NOCOUNT ON;

-- Result: one row per account with its maximum transaction value.
DECLARE @Result AS TABLE (
    actid INT,
    mx MONEY );

DECLARE
    @actid AS INT,
    @val AS MONEY,
    @prevactid AS INT,
    @prevval AS MONEY;

-- Stage the rows in a temporary table so we can consume them one at a time.
CREATE TABLE #tbTemp (actid INT NOT NULL, val MONEY NOT NULL);

INSERT INTO #tbTemp (actid, val)
SELECT actid, val
  FROM dbo.Transactions;

-- Fetch the "first" row in (actid, val) order.
SELECT TOP (1) @actid = actid, @val = val
FROM #tbTemp
ORDER BY actid, val;

WHILE @actid IS NOT NULL
BEGIN
  -- When the account changes, the previous row held that account's maximum value.
  IF @actid <> @prevactid
     INSERT INTO @Result(actid, mx) VALUES(@prevactid, @prevval);

  SELECT @prevactid = @actid, @prevval = @val;

  -- Consume the current row and fetch the next one.
  DELETE FROM #tbTemp
  WHERE actid = @actid AND val = @val;

  SELECT @actid = NULL, @val = NULL;
  SELECT TOP (1) @actid = actid, @val = val
  FROM #tbTemp
  ORDER BY actid, val;
END;

-- Don't forget the last account.
IF @prevactid IS NOT NULL
   INSERT INTO @Result(actid, mx) VALUES(@prevactid, @prevval);

SELECT actid, mx
FROM @Result;

DROP TABLE #tbTemp;
GO

Some people think this solution is set-based because it doesn’t explicitly declare and use a cursor object (we may talk about cursors later, or never 🙂 ). However, recall that one of the principles of set-based solutions is that they treat the set as a whole, as opposed to handling one element at a time. That principle is violated here. Also, set-based solutions do not rely on the order of the data, and this one does.

How can we work on this kind of task in a set-based way? For the maximum value per account, a single GROUP BY query with MAX(val) is enough. A more interesting, related task is the running balance per account and transaction, which the following code solves with a self-join:
SELECT
T1.actid, T1.tranid, T1.val,
SUM(T2.val) AS balance
FROM dbo.Transactions AS T1
JOIN dbo.Transactions AS T2 ON T2.actid = T1.actid AND T2.tranid <= T1.tranid
GROUP BY T1.actid, T1.tranid, T1.val;

The code above uses a self-join on the same table; this pattern is also commonly written with a CTE (we’ll talk about that later), and it works well enough here, but from SQL Server 2012 onward the best way is to use window functions.

SELECT actid, tranid, val,
SUM(val) OVER( PARTITION BY actid
               ORDER BY tranid
               ROWS UNBOUNDED PRECEDING)
AS balance
FROM dbo.Transactions;

The conclusion is: always look for a way to work with sets. The first attempt wasn’t set-based because it works on one element at a time.
I used a temporary table (#tbTemp) in that code because I will discuss this subject in the next post.

 

Execution Plan


Why did I start my blog with posts about the query optimizer, execution plans and logical processing? Well, before we learn about code and more code, we need to know how things work inside SQL Server. For example, a developer knows how to write a single SELECT, INSERT or UPDATE statement, or a batch with conditional clauses like IF, but what happens when you execute the query? We need to know how the whole thing works and the concepts behind working with sets.

So, SQL Server and other databases hold a lot of information to make your simple or complex batch work well, and they offer tools to help with the job. The most important thing is to write the best query you can. If you write even a simple SELECT, the first things to look at before putting it into your application are the execution plan, the I/O and the response time.
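A minimal sketch of that routine check (in SSMS, also enable “Include Actual Execution Plan”); the query here is just a stand-in for the one you want to measure:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT name, object_id
FROM sys.objects;        -- replace with your own query

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;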

People are not very keen on waiting. They get nervous in a traffic jam. They are not too satisfied if they have to sit without a drink for a while in a bar. Similarly, they want their applications to be as responsive as possible. End users perceive performance problems through waiting.

Except for very simple queries, a query can be executed in many different ways. How many ways? Well, the number of different ways of execution or execution plans grows exponentially with query complexity.

For example, analyze the following pseudo-query very superficially:
SELECT
    A.col5
    , SUM(C.col6) AS col6sum
FROM
    TableA AS A
INNER JOIN TableB AS B ON A.col1 = B.col1
INNER JOIN TableC AS C ON B.col2 = C.col2
WHERE
    A.col3 = constant1 AND B.col4 = constant2
GROUP BY
    A.col5;

Start with the FROM part. Which tables should SQL Server join first, TableA and TableB or TableB and TableC? And in each join, which of the two tables should be the left one and which the right one? The number of possibilities is six if the two joins are evaluated linearly, one after another. With evaluation of multiple joins at the same time, the number of possible combinations for processing the joins is already 12. The formula for the possible join orders is n! (n factorial) for linear evaluation, and (2n - 2)! / (n - 1)! for parallel evaluation of the joins.

In addition, SQL Server can execute each join with different algorithms: nested loops, merge, or hash, and a hash join can additionally use bitmap filtering. This already gives four options for each join. So far, there are 6 x 4 = 24 different options for only the FROM part of this query. But the real situation is even worse: SQL Server can execute a hash join in three different ways. As mentioned, this is just a quick, superficial analysis of pseudo-query execution, and for this introduction to query optimization problems, such details are not needed.

In the WHERE clause, two expressions are connected with a logical AND operator. The logical AND operator is commutative, so SQL Server can evaluate the second expression first.

Again, there are two choices. Altogether, there are already 6 x 4 x 2 = 48 choices. And again, the real situation is much worse. Because in the pseudo-query all joins are inner joins and because expressions in the WHERE clause are commutative, SQL Server can even start executing the query with any of the expressions of the WHERE clause, then switch to the FROM clause and perform first a join, evaluate the second expression from the WHERE clause, and so on.

So the number of possible plans is already much higher than 48. For this superficial overview, continue with the GROUP BY clause. SQL Server can execute this part in two ways, as an ordered group or as a hash group. Therefore, the number of options for executing the pseudo-query is already 6 x 4 x 2 x 2 = 96.

The important conclusion is that the number of different possible execution plans for a query grows factorially with query complexity. You can quickly get billions of possible execution plans, and SQL Server has to decide which one to use in a very short time. You wouldn’t want to wait a whole day for SQL Server to find the best possible plan just to execute your query in 5 seconds instead of 15. Now you can imagine the complexity of the problems the SQL Server Query Optimizer has to solve for every single query.

 Writing queries requires basic T-SQL knowledge; writing well-performing queries needs much more advanced knowledge.