Building Sample Data Sets Using xSkrape

 

 

“It's easy to justify the cost savings to generate a very large data set using xSkrape”

What can we do with sample data and why would we want to generate it in Excel? Some possibilities include:

·         Create data that we can experiment with.

 

·         Create a data set that resembles what we think might be realistic in the future (i.e. we might not have comparable data today), and we would like to demonstrate it to others.

 

·         Modeling situations when we don't have a complete data set (i.e. we may want to augment an existing data set, and adding completely random data would make it look "weird").

 

·         Use it for creating "test data" that can be imported into other product such as SQL databases, where it becomes usable for testing (e.g. stress testing, unit testing, etc.) or demonstration for all the reasons described above. For our purposes, we talk about "sample data" and "test data" interchangeably since xSkrape as the means to generate them is the same for either case.

Using the approach described here to obscure existing data (aka data obfuscation) is also possible and is covered in a different article.

How easy is it to use the xSkrape for Excel add in to generate sample data? Extremely easy! For example, here we've added a data creation task and configured it to populate a cell range using lastname, firstname as our data generation pattern, and requiring that all generated names be unique (100% uniqueness):

If we click on the "Create" button or the “Create Random Data” button on the Excel toolbar:

Then we would see data that looks like this:

If we added a second data generation task that uses the pattern {"numberfrom":"20", "numberto":"100", "precision":"2", "exponent":"0.3"} and execute it, our data now looks like this:

Of note from this example:

·         The use of "precision":"2" means we get two decimal places, which could be used as a currency value.

 

·         The use of "exponent":"0.3" means we favor numbers closer to 100 than 20. This might match a scenario where say the number represents a price paid for an item we sell, where it's more common to sell items that are higher in value. The evidence this works can be seen by applying this task to a number of cells and looking at the average value as shown here:

We can see that the average is 83.2 where if values were distributed evenly, the expected average would be closer to "60" (the mid-point between 20 and 100).

The "language" used here to describe what should be generated is documented fully on our web site and as more features are added, this documentation will evolve.

You may wonder: "This isn't too complex - I could probably write some code that would create this kind of sample data myself." Let's examine that premise.

Let's assume you're interested in data that's similar to what you could get from xSkrape: realistic names, control over distributions of numbers and dates, etc. As such, we'd expect this is task that might take a few hours, not a few minutes.

Next, how much is your time worth? Maybe $80/hour? That's likely a conservative estimate, but if we go with that, you could purchase 60,000 xSkrape credits for 3 hours of your time over-and-above the relatively small amount of time you'd need to set up one or more xSkrape data creation tasks.

With 60,000 credits available, you could generate 6,000,000 data values. (This is at prices as of 8/13/2016.) If you had a table with 6 columns, that's a million rows of data.

In other words, it's easy to justify the cost savings to generate a very large data set using xSkrape. Not to mention incremental improvements in our obfuscation language, support and other excellent feature areas.

You can use xSkrape data generation for free when you only need a reasonably small amount of test data. Current rates imply you could generate 25,000 data values for free, per month.

Also check out the test data generation capabilities in our SQL-Hero tool: you can do very similar things to what we've described here, but import generated data directly to your SQL Server database quickly (using bulk insert), respecting referential integrity, check constraints & parent-child relationships, with automation options, and more! We cover some practical applications of data generation for SQL-Hero in a different whitepaper.

Does this all seem interesting but too complicated to implement yourself? Feel free to contact us about consulting options.


- @codexguy

Did you like this article? Please rate it!

Back to Article List