Generate an SSIS Package from Metadata & Exploit Parallelism

 

 

“With a final, working SSIS package, you’ve got access to the good performance of parallel loading using a work queue – and you’ve also got a way to keep it ‘in sync’ with your evolving materialized view, freeing you from plumbing work, letting you focus on the business value contained in the source query that you’re aiming to persist for reporting and other needs.”

In a recent article, I dove into a strategy for loading materialized views that has a number of advantages, including reduced contention and improved performance for some types of source queries. I offered a few variations on how to implement this strategy, and the clear winner in a production scenario I used as a benchmark was to take advantage of parallelism, as we can easily do from SSIS. In this article I'll offer a simple way to generate the necessary SSIS package and supporting SQL, without writing much code. This sorcery is brought to you by XS Tool Suite, where we can use templates to generate any type of output: SQL, C#, SSIS packages, and so on.

 

Concept

 

I won't spend a lot of time explaining the problem to solve, since the linked article gives most of the background. Where that article ends, I've presented an SSIS package that resembles this:

The population task here involves inserting simple key values into a queue table. I generally accomplish this using a stored procedure. This has the benefit of letting me include logging that can reference @@ROWCOUNT, for example, so I can tell how many records have been added to my queue – and other similar instrumentation. A simple version of my populate procedure might look like this (code available on GitHub):

CREATE PROCEDURE dbo.up_WidgetLatestStateQueue_Populate
       @start datetime2 = NULL
AS
BEGIN
       SET NOCOUNT ON;

       IF @start IS NULL
              SET @start = DATEADD(dd, -1, SYSDATETIME());

       INSERT dbo.WidgetLatestStateQueue (WidgetID)
       SELECT DISTINCT e.WidgetID
       FROM
              [Source].[Event] e
       WHERE
              e.EventDate >= @start
       AND NOT EXISTS (SELECT 0 FROM dbo.WidgetLatestStateQueue q WHERE e.WidgetID = q.WidgetID);
END
GO
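
For reference, the queue table this procedure fills can be as simple as the following. This is a sketch only: the key column's type is an assumption here, and the template discussed later generates the real definition.

CREATE TABLE dbo.WidgetLatestStateQueue (
       WidgetID int NOT NULL PRIMARY KEY   -- just the key values to (re)process
);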

 

My SSIS SQL task "Populate queue" simply invokes this procedure with no parameter provided, so the default of @start = NULL is used, meaning the filter is calculated as 24 hours in the past. This means that if this job runs at least once a day, we should capture all widgets where there could be a real change in our materialized view. If I had other tables that participated in the source query for the materialized view, I'd include them in either the INSERT from a SELECT shown here or in separate INSERT statements – which approach I'd use would depend on performance. (If you index your dates properly, this step should be very fast – in my production benchmark, it represented less than 1% of the overall cost.)
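
The SQL task's statement therefore amounts to nothing more than:

EXEC dbo.up_WidgetLatestStateQueue_Populate;   -- @start defaults to 24 hours ago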

After the populate step, I’ve included an SSIS script task that runs the following C# script:

// Assumes the usual SSIS script task scaffolding, plus these namespaces:
//   System.Collections.Generic, System.Data, System.Data.SqlClient,
//   System.Linq, System.Threading.Tasks
public void Main()
{
    const int MAX_RETRIES = 2;
    const int COMMAND_TIMEOUT = 60;
    const int MAX_THREADS = 8;

    var connectionString = Dts.Connections["SQLDatabase"].ConnectionString;

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        int lastCount = -1;

        // Per-key retry counts, shared across workers (access is synchronized below).
        var retries = new Dictionary<string, int>();

        // Keep pulling batches until the queue comes up empty.
        while (lastCount != 0)
        {
            using (var da = new SqlDataAdapter($"SELECT TOP {MAX_THREADS} [WidgetID] FROM [Dest].[WidgetLatestStateQueue]", conn))
            using (var dt = new DataTable())
            {
                da.Fill(dt);
                lastCount = dt.Rows.Count;

                // Process the batch concurrently; each worker uses its own connection.
                Parallel.ForEach(dt.DefaultView.Cast<DataRowView>(), (drv) =>
                {
                    try
                    {
                        using (var connInner = new SqlConnection(connectionString))
                        {
                            connInner.Open();

                            using (var cmd = new SqlCommand("[Dest].[up_WidgetLatestState_ByID]", connInner))
                            {
                                cmd.CommandTimeout = COMMAND_TIMEOUT;
                                cmd.CommandType = CommandType.StoredProcedure;
                                cmd.Parameters.AddWithValue("@WidgetID", drv["WidgetID"]);

                                // The procedure MERGEs one widget, then removes it from the
                                // queue, so a failed item stays queued for a later pass.
                                cmd.ExecuteNonQuery();
                            }
                        }
                    }
                    catch
                    {
                        if (MAX_RETRIES > 0)
                        {
                            string key = drv["WidgetID"].ToString();

                            // Dictionary<K,V> is not thread-safe, so serialize access.
                            lock (retries)
                            {
                                if (!retries.ContainsKey(key))
                                {
                                    retries[key] = 1;
                                }
                                else
                                {
                                    retries[key] = retries[key] + 1;

                                    if (retries[key] > MAX_RETRIES)
                                    {
                                        throw;
                                    }
                                }
                            }
                        }
                        else
                        {
                            throw;
                        }
                    }
                });
            }
        }
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}

 

By taking advantage of the .NET FCL's Parallel class, we're spinning up concurrent worker threads to perform work in chunks. The "TOP" clause returns up to 8 rows of data – in this case, simply the WidgetIDs that were queued in the population step. The workers invoke a stored procedure, passing the WidgetID. The "ByID" procedure performs a filtered MERGE, with the passed @WidgetID limiting the scope of our source query (and MERGE) to a single WidgetID, which can prove favorable in many types of workloads. (To be fair: it can lead to worse performance in some types of workloads as well, so you will likely want to "test and tune" as you implement.)
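
For illustration, here's a minimal sketch of what such a "ByID" procedure can look like. The payload column LastEventDate and the view uv_WidgetLatestState are assumptions made for the example (the naming conventions are covered in the notes below), and the template-generated procedure discussed later will differ in its details:

CREATE PROCEDURE [Dest].[up_WidgetLatestState_ByID]
       @WidgetID int
AS
BEGIN
       SET NOCOUNT ON;

       -- Both sides of the MERGE are scoped to one widget, keeping its footprint small.
       MERGE [Dest].[WidgetLatestState] AS t
       USING (SELECT WidgetID, LastEventDate FROM [Dest].[uv_WidgetLatestState]
              WHERE WidgetID = @WidgetID) AS s
       ON (t.WidgetID = s.WidgetID)
       WHEN MATCHED THEN
              UPDATE SET LastEventDate = s.LastEventDate
       WHEN NOT MATCHED BY TARGET THEN
              INSERT (WidgetID, LastEventDate) VALUES (s.WidgetID, s.LastEventDate);

       -- Dequeue only after a successful MERGE (see the notes below).
       DELETE dbo.WidgetLatestStateQueue WHERE WidgetID = @WidgetID;
END
GO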

Of note in this solution:

- I'm using standardized object naming conventions for consistency. The queue table takes the name of the materialized view table with a "Queue" suffix. The population procedure is named up_<QueueTableName>_Populate (e.g. up_WidgetLatestStateQueue_Populate) and the processing procedure up_<MaterializedViewName>_ByID.

- As described in an older article, I'll often create a SQL view that matches the query I want to persist. This centralizes the logic and, as we'll see shortly, makes it easy to "drop in" this query to my processing procedure. My convention is to name the view "uv_" (user view) followed by the materialized view table name. (You don't have to follow this pattern, but the template I discuss shortly understands these standards.)

- The "ByID" stored procedure does double duty here: it also deletes the input widget ID from the work queue. Since this is done after the MERGE, we can be sure our data is up to date when the item is removed, not to be seen again until a fresh change in the source data requires another MERGE. Note that the project I have on GitHub to demonstrate queued merging does not include deletion in the "ByID" procedure, since it shows the row-by-row, all-SQL approach where deletion can be done in the WHILE loop contained in the stored procedure up_WidgetLatestStateQueue_Process.

- Yes, the script task is light on error handling 😊. That's somewhat intentional, since what's done for error handling should match your standards for error handling within packages. (I discuss customization later.) In the above code, SQL errors are effectively retried up to 2 times (the MAX_RETRIES constant), which can be helpful if, for example, a transient problem such as a deadlock occurs. You should still care about deadlocks, but you can enforce single-execution semantics by setting MAX_RETRIES to 0.

 

Automation

 

I recently created an XS Tool Suite template called “Populate Materialized View Using SSIS” and have published this for everyone to use:

This template is tied to tables, which in this case should match your materialized view’s persisted home. This template generates two types of output: supporting SQL that can become your “populate and processing” procedures, and XML that can become your .DTSX package to use from SSIS. (You’ll need version 5.3.1 or higher of XS Tool Suite to take advantage of this dual output-type template.)

When we run this template inside of XS Tool Suite, it opens two windows. The first is SQL and includes a script that creates the objects I've previously discussed: the work queue table, a procedure to populate it, and a procedure to process individual items. The population procedure includes the text "DO NOT DROP" in its body, which is important! The script looks for this text when determining whether an object can be overwritten, so if you re-run this template, by default you won't lose any customizations you make in the population procedure.
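
Presumably the guard works along these lines (this is a sketch of the idea, not the template's actual output; the procedure name follows the earlier example):

IF OBJECT_DEFINITION(OBJECT_ID('dbo.up_WidgetLatestStateQueue_Populate'))
       NOT LIKE '%DO NOT DROP%'
BEGIN
       -- Marker text is absent, so the object is safe to overwrite.
       DROP PROCEDURE dbo.up_WidgetLatestStateQueue_Populate;
       -- ... followed by CREATE PROCEDURE with the default generated body ...
END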

The second window looks like this:

This is the body of a .DTSX file that you can "save as" and include in your SSIS project; once added, it looks like the package I showed above. There is one difference, though:

One extra manual step is required: open the script task, open the script itself, and, without making any changes, close the script editor – voila, a working package! (Opening and closing the script is what causes SSIS to compile it and embed the script's binary in the package.)

In terms of the template parameters available, they bear a close resemblance to the MERGE template parameters I covered in another article:

The natural key for the materialized view and the columns to queue by default to any primary key that exists on the materialized view table. You've also got the option to queue at a level that differs from the natural key of the view. For example, if the view can include multiple records per widget (say widget ID and trip ID form the natural key), you might want to recalculate over all trips, widget by widget. (Reasons for doing it this way could include data integrity concerns, or even just simplicity of code, depending on your solution.)
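
As a hypothetical sketch of that widget-by-widget variant (the table, view, and column names here are invented for the example), the queue would carry only WidgetID, while the MERGE inside the processing procedure still matches on the full natural key:

       -- The view's natural key is (WidgetID, TripID), but we queue by WidgetID alone,
       -- so one queued widget recalculates all of its trips in a single MERGE.
       MERGE [Dest].[WidgetTripState] AS t
       USING (SELECT WidgetID, TripID, TripStatus FROM [Dest].[uv_WidgetTripState]
              WHERE WidgetID = @WidgetID) AS s
       ON (t.WidgetID = s.WidgetID AND t.TripID = s.TripID)
       WHEN MATCHED THEN
              UPDATE SET TripStatus = s.TripStatus
       WHEN NOT MATCHED BY TARGET THEN
              INSERT (WidgetID, TripID, TripStatus) VALUES (s.WidgetID, s.TripID, s.TripStatus);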

The population date range defaults to a trailing 8-hour window, but you can change this in your population stored procedure – one piece that doesn't lend itself easily to complete automation. You'll need to include the necessary rules to write records into the queue, representing keys for entities that may require refreshing.

Some of the SSIS parameters influence the generated script, including the maximum number of worker threads used. An example of customization would be to pull some of these values from a configuration table instead of hard-coding them into packages.
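
As a sketch of that idea (the table and setting names here are invented), the script task could query a settings table at start-up instead of relying on the hard-coded constant:

       -- Hypothetical configuration table and seed value:
       CREATE TABLE dbo.ETLSetting (
              SettingName varchar(100) NOT NULL PRIMARY KEY,
              SettingValue varchar(100) NOT NULL);

       INSERT dbo.ETLSetting VALUES ('WidgetLatestState.MaxThreads', '8');

       -- The script task would read the value with something like:
       SELECT SettingValue FROM dbo.ETLSetting WHERE SettingName = 'WidgetLatestState.MaxThreads';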

If you check "source query is view," the item-by-item MERGE encapsulated in the generated "ByID" procedure assumes that your materialized view is sourced from an actual SQL VIEW object, named "uv_" followed by the materialized view table name. If you instead implement your source query within the "ByID" stored procedure itself, you'll undoubtedly want to include the text "DO NOT DROP" within its body, since running the template again and executing the generated SQL would otherwise drop your version of the "ByID" procedure, losing all your customizations! (Encapsulating the query in a view means there's no problem at all with re-generating and re-running the SQL, which is helpful if you've, say, added new columns and want your MERGE updated to match.)

"Allow Delete" is unchecked by default. This assumes your materialized view can insert or update data, but deletion is either not material or not possible. Note that MERGE performance can suffer when NOT MATCHED BY SOURCE is used to perform deletion, so this is something you should weigh.
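
If you do enable deletion, one subtlety of a filtered MERGE is worth noting: the delete branch needs the same key predicate as the source filter, or every target row outside the current widget would count as "not matched by source" and be deleted. Added to the earlier hypothetical "ByID" sketch, the extra branch would read:

       -- Scoped delete branch within the single-widget MERGE (sketch only):
       WHEN NOT MATCHED BY SOURCE AND t.WidgetID = @WidgetID THEN
              DELETE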

Clearly this overall process isn't difficult, and if you change your materialized view, re-doing these steps yields a working package again, with your changed columns accounted for. However, the steps can be simplified even further by using a .shcmd file. To do this, when we're prompted for template parameters, check the "Save .shcmd file" checkbox, as shown above, at the bottom of the prompt window. When you do this, you get an additional pop-up:

The output file here can name your final .dtsx file (e.g. as located in your SSIS Visual Studio project). Because the "execute generated" checkbox is checked, the generated SQL will be executed when the .shcmd file is run; it will also be saved, side-by-side with the .dtsx file. Checking "Interactive status" shows progress in a command window. Adding this to the quick launch list makes this specific template invocation accessible right within XS Tool Suite:

You can also invoke this by double-clicking on the .shcmd file from Windows Explorer: the SHCMD file association understands it should invoke the XS Tool Suite command line tool, which does all the steps encompassed by the template and .shcmd settings.

Note that if you're using SQL Authentication in your XS Tool Suite connections, one additional change you'll need to make in your package is to correct your connection manager's connection string – your password will not be included in generated code.

 

Summary

 

To summarize: if you’re going to use the XS Tool Suite template to generate your infrastructure, all you need to do is:

1. Create your materialized view table, with the columns you want. If you start with an actual SQL view that matches the table format, you can create your materialized view table simply by running something like this:

SELECT * INTO MaterializedViewTableName FROM ViewName WHERE 1=0

Here we get an empty table (because 1=0 matches no rows), with a schema based on the view. (Remember to add applicable indexes!)

2. Run the template against the materialized view table, saving a .shcmd file at the same time. This step creates our queue table, along with the stored procedures.

3. Customize your Populate stored procedure to populate the queue table based on the data that your materialized view uses.

4. Optional: customize the "ByID" procedure if and only if you aren't using a SQL view as the source of your data. If you are, then you've presumably checked the "Source query is a view" checkbox when you ran the template, and this step can be omitted.

5. Add the generated .dtsx file to an SSIS Visual Studio project and perform any clean-up needed, as described above.

6. Run the Populate procedure, back-dating to whatever date you wish to start "as of." This primes your queue table so that the next time the package runs, it populates your materialized view table with not just items changed in the last 8 hours (the default), but as much history as you choose to start with.
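
For example, to start "as of" the beginning of 2019 (the date is purely illustrative):

EXEC dbo.up_WidgetLatestStateQueue_Populate @start = '20190101';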

Now say you add a new field to your materialized view. The steps to follow would be:

1. Add the field to your materialized view table.

2. Customize the "ByID" procedure if and only if you aren't using a SQL view as the source of your data. (By now I hope you agree that using a SQL view is a good idea: it cuts out work!) If you are using a SQL view, you'll need to add the new field there as well, based on whatever the rules are for calculating it – but nothing special needs to be done related to code generation, at least.

3. Run your previously generated .shcmd file.

4. Open your .dtsx file that now sits in an SSIS Visual Studio project and perform any clean-up, such as recompiling its binary code (described above).

5. Optional: modify your Populate stored procedure if the new field requires changes to the rules that determine which items get queued for reprocessing. Not every new field requires this, which is why I note it as "optional."

6. Run the Populate procedure, back-dating your start date to whatever time is needed to capture values for the new field. (Not doing this means you'll presumably be missing values for the new field for everything except "new changes" in your source data.)

That’s it!

With a final, working SSIS package, you’ve got access to the good performance of parallel loading using a work queue – and you’ve also got a way to keep it “in sync” with your evolving materialized view, freeing you from plumbing work, letting you focus on the business value contained in the source query that you’re aiming to persist for reporting and other needs.

If you want to change something about the generated SQL or DTSX contents, you can do so by copying the template file and modifying it – it’s just an XML file you can edit with a text editor (or XS Tool Suite). I discuss the mechanics of customization more as it relates to the MERGE template here, and the principles apply to this template as well.

Feel free to download XS Tool Suite for free, try your hand at automation, and provide feedback!

 


- codexguy

