Fast Data Loading with SQL Server 2008
A common operation in many systems is bulk loading of data into a database, be it from another system, a flat file, or generated output, which may or may not involve updating existing records. A lesser known class in .Net, SqlBulkCopy, and a new feature in SQL Server 2008, MERGE can greatly increase the speed of these operations.
Testing
All tests and performance results in this post are based on a test bed program, linked to later in this post. This program is set up to repeat each test a set number of times (25 for the figures below), and average the results. Included in each test is removing and recreating the table from the last run, creating a transaction that all operations take place in, performing the operation, and then committing the transaction.
Prepared Statements vs. SqlBulkCopy
A naïve implementation of data loading would be along the lines of the following:
static void InsertWithPreparedQuery(SqlConnection connection, int rowCount) { var cmd = new SqlCommand("INSERT BenchmarkTable VALUES (@Id, @Data1, @Data2, @Data3, @Data4)", connection); cmd.Parameters.Add("@Id", SqlDbType.Int); cmd.Parameters.Add("@Data1", SqlDbType.NVarChar, 255); cmd.Parameters.Add("@Data2", SqlDbType.NVarChar, 255); cmd.Parameters.Add("@Data3", SqlDbType.Int); cmd.Parameters.Add("@Data4", SqlDbType.Int); cmd.Prepare(); foreach(var record in GetValues(rowCount)) { cmd.Parameters["@Id"].Value = record.Item1; cmd.Parameters["@Data1"].Value = record.Item2; cmd.Parameters["@Data2"].Value = record.Item3; cmd.Parameters["@Data3"].Value = record.Item4; cmd.Parameters["@Data4"].Value = record.Item5; cmd.ExecuteNonQuery(); } }
Straightforward and easy to understand, although not the most performant. GetValues is an IEnumerable function yield returning tuples. Inserting 100k rows with this method takes 6496ms on average.
The SqlBulkCopy class has been in the Framework since 2.0. It, however, seems to be one of the obscure of the framework. It allows you to perform a fast bulk insert into a table. However, there is one major complication to using it, but that can be easily overcome.
The WriteToServer function is overloaded to take either an array of DataRows, a DataTable, or an IDataReader. As most new development no longer uses DataSets, this makes the first two overloads problematic, which leads us with the IDataReader interface. This interface looks daunting to implement at first, with 6 properties, and 27 methods. However, for the purposes of SqlBulkCopy, only a handful need to be implemented. Using this, it’s possible to make a IDataReader that provides records from an IEnumerable instance, opening up use to all of the collection types in .Net. Included in the code is an implementation of this.
With the ability to get data from any IEnumerable, using the SqlBulkCopy class is straightforward:
static void InsertWithBulkLoad(SqlConnection connection, int rowCount) { using (var bulkCopy = new SqlBulkCopy(connection)) using (var reader = new EnumerableDataReader(GetValues(rowCount).ToList())) { bulkCopy.DestinationTableName = "BenchmarkTable"; bulkCopy.ColumnMappings.Add("Item1", "Id"); bulkCopy.ColumnMappings.Add("Item2", "Data1"); bulkCopy.ColumnMappings.Add("Item3", "Data2"); bulkCopy.ColumnMappings.Add("Item4", "Data3"); bulkCopy.ColumnMappings.Add("Item5", "Data4"); bulkCopy.WriteToServer(reader); } }
With the same parameters as the previous example, the load takes 1151ms, well over a 5x speedup for a minimal increase in complexity. In addition, if you are using Entities, you can create instances of the entities, add them to a collection, and then pass it off to SqlBulkCopy, as most ORMs have less than stellar bulk insert performance.
Merging Data
The downside of SqlBulkCopy: While it is fast, it doesn’t work if you have to update existing data, failing on the first unique key violation. In most cases you will want to update existing rows with updated data from the feed, along with inserting new rows. To simulate this with my test bed program, after dropping the table, I insert 100k rows, and then when calling the test functions, I tell it to process 125k rows. This gives a decent amount of new rows for testing overall performance.
A simple update/insert method might resemble the following:
static void UpdateWithPreparedQueries(SqlConnection connection, int rowCount) { var existingIds = new HashSet<int>(); using (var selectCmd = new SqlCommand("SELECT Id FROM BenchmarkTable",connection)) using (var insertCmd = new SqlCommand("INSERT BenchmarkTable VALUES (@Id, @Data1, @Data2, @Data3, @Data4)", connection)) using (var updateCmd = new SqlCommand("UPDATE BenchmarkTable SET Data1=@Data1, Data2=@Data2, Data3=@Data3, Data4=@Data4 WHERE Id=@Id", connection)) { insertCmd.Parameters.Add("@Id", SqlDbType.Int); insertCmd.Parameters.Add("@Data1", SqlDbType.NVarChar, 255); insertCmd.Parameters.Add("@Data2", SqlDbType.NVarChar, 255); insertCmd.Parameters.Add("@Data3", SqlDbType.Int); insertCmd.Parameters.Add("@Data4", SqlDbType.Int); updateCmd.Parameters.Add("@Id", SqlDbType.Int); updateCmd.Parameters.Add("@Data1", SqlDbType.NVarChar, 255); updateCmd.Parameters.Add("@Data2", SqlDbType.NVarChar, 255); updateCmd.Parameters.Add("@Data3", SqlDbType.Int); updateCmd.Parameters.Add("@Data4", SqlDbType.Int); insertCmd.Prepare(); updateCmd.Prepare(); using (var reader = selectCmd.ExecuteReader()) { while (reader.Read()) { existingIds.Add((int) reader["Id"]); } } foreach (var record in GetValues(rowCount)) { if (existingIds.Contains(record.Item1)) { updateCmd.Parameters["@Id"].Value = record.Item1; updateCmd.Parameters["@Data1"].Value = record.Item2; updateCmd.Parameters["@Data2"].Value = record.Item3; updateCmd.Parameters["@Data3"].Value = record.Item4; updateCmd.Parameters["@Data4"].Value = record.Item5; updateCmd.ExecuteNonQuery(); } else { insertCmd.Parameters["@Id"].Value = record.Item1; insertCmd.Parameters["@Data1"].Value = record.Item2; insertCmd.Parameters["@Data2"].Value = record.Item3; insertCmd.Parameters["@Data3"].Value = record.Item4; insertCmd.Parameters["@Data4"].Value = record.Item5; insertCmd.ExecuteNonQuery(); } } } }
This loads all existing IDs into memory, and then runs either the update or the insert based on if that ID exists in the collection. This method takes 9296ms.
In SQL 2008, the MERGE statement was added. This allows you to update one table based on the contents of another table, with specific actions if there is a match (Update), not matched (Insert), or not matched by the source table (Delete). It also allows you to return fields from the merge operation, so, for example, you could get the resultant data set, or just the key fields. Since it is designed to work from two tables, the updated data needs to be loaded into a temporary table. This is the one downside to this method: It can grow the transaction log for tempdb to quite a large size, depending on your dataset.
Example usage of the MERGE statement:
static void UpdateWithMerge(SqlConnection connection, int rowCount) { var cmd = new SqlCommand("CREATE TABLE #BenchmarkTemp (Id int, Data1 nvarchar(255), Data2 nvarchar(255), Data3 int, Data4 int)", connection); cmd.ExecuteNonQuery(); using (var bulkCopy = new SqlBulkCopy(connection)) using (var reader = new EnumerableDataReader(GetValues(rowCount).ToList())) { bulkCopy.DestinationTableName = "#BenchmarkTemp"; bulkCopy.ColumnMappings.Add("Item1", "Id"); bulkCopy.ColumnMappings.Add("Item2", "Data1"); bulkCopy.ColumnMappings.Add("Item3", "Data2"); bulkCopy.ColumnMappings.Add("Item4", "Data3"); bulkCopy.ColumnMappings.Add("Item5", "Data4"); bulkCopy.WriteToServer(reader); } const string mergeStatement = @"MERGE BenchmarkTable AS target USING (SELECT Id, Data1, Data2, Data3, Data4 FROM #BenchmarkTemp) AS source (Id, Data1, Data2, Data3, Data4) ON (target.Id = source.Id) WHEN MATCHED THEN UPDATE SET target.Data1 = source.Data1, target.Data2 = source.Data2, target.Data3 = source.Data3, target.Data4 = source.Data4 WHEN NOT MATCHED THEN INSERT VALUES (Id, Data1, Data2, Data3, Data4);"; var mergeCmd = new SqlCommand(mergeStatement, connection); mergeCmd.ExecuteNonQuery(); }
First, a temp table is created, then it is filled with SqlBulkCopy. Then the MERGE statement is used on both tables to update the permanent table. Please be aware that the semi-colon at the end of the MERGE statement is required. This method averages 2913ms to work with the same dataset as the previous, giving an over 3x speedup.
Conclusion
Using the SqlBulkCopy class and the MERGE statement, you can greatly increase the speed of your bulk loading and updating operations, without adding too much complexity to your program.
Code
The Test Bed program and the implementation of EnumerableDataReader is available on BitBucket at the following URL: https://bitbucket.org/msieker/unidevsamples/src

i used same approach for data loading but my merge statement take lot of time,
and finally sql time out error comes,
so,how to improve merge performance
any idea for that