Tuesday, September 4, 2012

Bulk Inserts with the Entity Framework

A bulk insert is a SQL operation that allows you to insert a large number (hundreds or thousands) of records into a single table all at once (conceptually similar to array binding with an Oracle prepared statement).  Without a bulk insert, the records would have to be inserted one at a time, incurring the expense of many SQL commands as opposed to just one bulk insert command. 
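For orientation, here is a minimal sketch of a standalone bulk insert using SqlBulkCopy on its own (no EF involved); the connection string and the "Orders" table are hypothetical stand-ins:

```csharp
using System.Data;
using System.Data.SqlClient;

public static class BulkInsertSketch
{
    public static void InsertOrders(string connectionString, DataTable orders)
    {
        using (var copy = new SqlBulkCopy(connectionString))
        {
            copy.DestinationTableName = "Orders";

            // every row is streamed to the server in one operation,
            // instead of issuing one INSERT command per row
            copy.WriteToServer(orders);
        }
    }
}
```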

As of version 5.0 of the Entity Framework (EF), there is still no support for bulk inserts.  If records are to be inserted in bulk, it will have to be done outside the EF.  There are many examples online showing how to do this with an EF context and the SqlBulkCopy class.  However, they assume that the entire SQL transaction will consist of nothing but the records in the bulk insert.  In reality, transactions are almost always going to involve inserting records into more than one table.  Also, you'll likely want to use a bulk insert where it improves performance, but continue to use the EF for adding/updating other objects because it is so much more convenient.  This article shows how to accomplish this with the EF.

NOTE: consider disabling lazy loading in conjunction with the below technique.  Lazy loading (combined with the EF "fixup" feature) has serious performance implications, especially when large numbers of objects are involved.
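For reference, lazy loading can be switched off on the context before doing the bulk work.  This is only a fragment: `GetContext()` is this article's (unshown) context factory, assumed here to return an `ObjectContext`:

```csharp
using (var dc = GetContext())
{
    // turn off lazy loading so iterating entities doesn't trigger
    // a flood of hidden queries and relationship fixup
    dc.ContextOptions.LazyLoadingEnabled = false;

    // ... perform the bulk work ...
}
```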

For example, imagine we are creating a business function that inserts 1 object of type Customer and 10000 of type Order.  To get better performance, we'll use a bulk insert for the Order objects:

public void CreateCustomer(int id)
{
    using(var dc = GetContext())
    {
        var customer = new Customer { Id = id };
        dc.Customers.AddObject(customer);
        IList<Order> orders = new List<Order>();

        for(int orderId = 1; orderId <= 10000; orderId++)
        {
            // for example simplification, I did not associate order with customer
            var order = new Order { Id = orderId };
            orders.Add(order);
        }

        // SqlBulkCopy has no (connectionString, tableName) constructor;
        // the destination table is set via DestinationTableName.
        // dc.Connection is an EntityConnection wrapping the store connection.
        var storeConnectionString = ((EntityConnection)dc.Connection).StoreConnection.ConnectionString;
        var copy = new SqlBulkCopy(storeConnectionString);
        copy.DestinationTableName = "Orders";

        // AsDataReader() taken from http://code.msdn.microsoft.com/LinqEntityDataReader
        copy.WriteToServer(orders.AsDataReader());
        dc.SaveChanges();
    }
}

This will "work", but it definitely will not be transactional.  Heck, it won't even all occur over the same SQL connection!  First, SqlBulkCopy is given a connection string, not a connection, so it won't share a connection with the context.  Second, SqlBulkCopy will begin and end its own transaction, as will the call to SaveChanges().  This means that if there is a failure, the Customer might be inserted but not the Orders, or vice versa.

To make this transactional we must (unfortunately) apply several reflection and casting hacks in the process of performing the following steps:
  • Forcibly open a database connection
  • Manually begin and end an entity transaction
  • Obtain the current SQL transaction from the entity transaction
  • Instantiate a SqlBulkCopy object using the connection and transaction
We would restructure the above method as follows:

public void CreateCustomer(int id)
{
    using(ObjectContext dc = GetContext())
    {
        // force open the connection
        // if we don't explicitly open it first, the EF will open and close it at will,
        // making it impossible to reliably obtain one when we need it
        dc.Connection.Open();

        try
        {
            // create a transaction now so that either the entire operation
            // succeeds or nothing succeeds.  Also, we'll need this transaction
            // later when instantiating the SqlBulkCopy object
            using (var tx = dc.Connection.BeginTransaction())
            {
                var customer = new Customer { Id = id };
                dc.Customers.AddObject(customer);

                IList<Order> orders = new List<Order>();

                for(int orderId = 1; orderId <= 10000; orderId++)
                {
                    // for example simplification, I did not associate order with customer
                    var order = new Order { Id = orderId };
                    orders.Add(order);
                }

                BulkInsert(dc, orders.AsDataReader(), "Orders");
                dc.SaveChanges();

                // without an explicit commit, the transaction rolls back on dispose
                tx.Commit();
            }
        }
        finally
        {
            dc.Connection.Close();
        }
    }
}

public void BulkInsert(ObjectContext context, IDataReader reader, string tableName)
{
    // context.Connection is declared as DbConnection but is actually an EntityConnection
    var entityConnection = context.Connection as EntityConnection;
    var sqlConnection = entityConnection != null ? entityConnection.StoreConnection as SqlConnection : null;
    if (sqlConnection != null)
    {
        var parserProp = sqlConnection.GetType().GetProperty("Parser", BindingFlags.NonPublic | BindingFlags.Instance);

        if (parserProp != null)
        {
            var parser = parserProp.GetValue(sqlConnection, null);
            var sqlTxProp = parser.GetType().GetProperty("CurrentTransaction", BindingFlags.NonPublic | BindingFlags.Instance);
            var currentTransaction = sqlTxProp.GetValue(parser, null);
            var sqlTransaction = currentTransaction as SqlTransaction;

            if (sqlTransaction == null)
            {
                // this means that the current transaction is likely an SqlInternalTransaction and
                // that we must obtain the true, underlying transaction from the Parent property
                var parentProp = currentTransaction.GetType().GetProperty("Parent", BindingFlags.NonPublic | BindingFlags.Instance);
                currentTransaction = parentProp.GetValue(currentTransaction, null);
                sqlTransaction = currentTransaction as SqlTransaction;
            }

            if (sqlTransaction != null)
            {
                var sqlBulkCopy = new SqlBulkCopy(sqlConnection, SqlBulkCopyOptions.Default, sqlTransaction);
                sqlBulkCopy.DestinationTableName = tableName;
                sqlBulkCopy.WriteToServer(reader);
            }
        }
    }
}

This technique is limited in that the objects you bulk insert are not tracked by the EF.  Because the EF is unaware of any object that is bulk inserted, there is the potential for conflicts if the same object is both bulk inserted and inserted via AddObject().  Also, SqlBulkCopy will not use any of the EF's navigation properties because they have no meaning to it; you must instead ensure all required scalar properties are initialized.

If the limitations are not an issue for you or are easy to avoid (which in general they should be), this technique is useful for improving the efficiency of any business function that requires the insertion of large numbers of records.

2 comments:

  1. I never was able to figure out what to do with calculated columns. There doesn't seem to be any way to tell SqlBulkCopy to ignore those columns. As a result, it attempts to insert a null for those columns and causes a SQL error.
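    One possible way around this (a sketch, not tested against the commenter's schema): when any explicit column mappings are added to SqlBulkCopy, only the mapped columns are sent, so a calculated column can simply be left unmapped and SQL Server will compute it itself.  The table and column names below are hypothetical:

    ```csharp
    using System.Data;
    using System.Data.SqlClient;

    public static class BulkCopyMapping
    {
        public static void BulkInsertOrders(SqlConnection connection, SqlTransaction tx, IDataReader reader)
        {
            using (var copy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, tx))
            {
                copy.DestinationTableName = "Orders";

                // map only the real (non-computed) columns
                copy.ColumnMappings.Add("Id", "Id");
                copy.ColumnMappings.Add("CustomerId", "CustomerId");

                // deliberately no mapping for the computed column (e.g. "OrderTotal"),
                // so SqlBulkCopy does not attempt to insert a value into it
                copy.WriteToServer(reader);
            }
        }
    }
    ```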

  2. ZZZ Projects offers two kinds of bulk operations via the Entity Framework Extensions Library (http://zzzprojects.com/entity-framework-extensions/) which drastically increase performance over Entity Framework's SaveChanges method.

    BulkSaveChanges
    The first is our main feature, the BulkSaveChanges method, which directly replaces the SaveChanges method. You can expect to save your entities 15x faster or more when you need to save thousands of entities, while having the same or better performance with even a single entity. This method supports all kinds of associations and entity types (TPC, TPH, and TPT); you use it the same way you use SaveChanges.

    Bulk Operations
    The second is the Bulk Operations methods (BulkDelete, BulkInsert, BulkUpdate, and BulkMerge), which increase performance even more and allow you to customize many settings, such as the primary key to use.
