Friday, June 27, 2014

Powershell - Adding Concurrency in Powershell 3 and 4

Our scheduler was throwing errors when invoking a decompression routine I wrote in Powershell. The requirement was to invoke the decompression script many times concurrently with different file name parameter values. Due to limitations in the command field length in the scheduler, we had to call my script from a bat file, which was swallowing any of my errors and output.

I had a few scenarios I had to test.
Can a simple Powershell script be called a second time while it is still executing? I had assumed it could, but needed to prove this before moving to more relevant scenarios.

I was unsure if my code was written to run concurrently. Rather than confuse my code with concurrency, I wrote a simple test script and a simple concurrency test. This proved I could invoke a standard PS script at least 6 times in the same nanosecond, using different parameters, without any issue.

So calling the script itself was not an issue. I was unsure whether my code was thread safe. I was also unsure whether the scheduled jobs were written correctly. I saved the thread safety test for last, assuming the issue was with the way my code was being called in the scheduler.



Create a file named ConcurrencyTest.ps1
param(
    # Number of seconds to sleep; also makes each run's output easy to tell apart
    [INT]$WAITER = $(throw "You did not provide a value for the WAITER parameter.")
)

# ISO 8601 timestamp with colons replaced so it is a legal file name
$time = Get-Date -Format o | ForEach-Object { $_ -replace ":", "." }
$fileName = ".\OutPut$time.txt"
New-Item $fileName -ItemType File

# Record when this invocation started, sleep, then record when it finished
Write-Output "start $WAITER $time" > $fileName
Start-Sleep -Seconds $WAITER
$time = Get-Date -Format o | ForEach-Object { $_ -replace ":", "." }
Write-Output "end attempt $WAITER $time" >> $fileName

Create a file named ConcurrencyWorkflow.ps1

workflow RunStuffParallel
{
    # 100 attempts; each attempt number doubles as the sleep-time parameter
    $executionAttempts = @(1..100)

    ForEach -Parallel ($Attempt in $executionAttempts)
    {
        Invoke-Expression ".\ConcurrencyTest.ps1 $Attempt"   # varying sleep adds some variability
#       Invoke-Expression ".\ConcurrencyTest.ps1 5"          # hard-coded sleep adds linearity
    }
}

RunStuffParallel
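
To run the test, save both files in the same folder and launch the workflow script from that location. This is a minimal sketch; the folder path below is just an example, and the sort only works because the timestamp is embedded in each output file name.

Set-Location C:\Temp\ConcurrencyTest   # example path; use wherever you saved the two files
.\ConcurrencyWorkflow.ps1              # defines RunStuffParallel and then runs it

# Review the start/end stamps each test run wrote
Get-ChildItem .\OutPut*.txt | Sort-Object Name | Get-Content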

Tuesday, June 10, 2014

Informatica Performance Tuning Cheat Sheet

If you are reading this, stop. Log into the Informatica support site and download the performance tuning guide. That is the first step. It is not comprehensive as each database structure and corporate infrastructure is different. But the basics are there in detail.

Before you continue, read your thread statistics in the session log. I assure you that this is going to point you in the general direction of the issue. Save the thread stats for comparison after tuning.

Use Grid Computing
If your company is not penny wise and pound foolish, take the time to do a cost estimate on a grid and an integration server load balancer. Without a grid (which I do not have at the time of this writing), you are constrained beyond anything your coding can resolve.

Use Concurrency
This is true of all databases, but has a different flavor depending on your target and source structure (flat file to OLTP, OLTP to flat DW, Flat stage to schema data mart, OLTP to flat file).

Override Tracing
One more option: in the session properties, under the Config Object tab --> Error Handling section, change Override Tracing from NONE to NORMAL. You can observe at least a 25% performance improvement.

OLTP
Make sure your OLTP database has no more dependencies than required. For an OLTP design where you are not the database developer, work with the application development team to understand the requirements around all the foreign keys. This helps you determine what the load order for tables can be, as well as allowing you to see what groups of tables can be loaded at the same time.

Data Marts
For a star schema, only allow dimensions to be linked through the fact. I know that is obvious, but people always ask for one little compromise for one report or another. Load conformed dimensions before all other data mart processing. Then load as many dims at the same time as your database engine can handle. Then load the facts. I prefer using post-SQL to call a sproc to calculate the dimensions' start and end dates.

Load hub-and-satellite models using Data Vault 2.0 loading logic. It is somewhat similar to loading dims before facts, but you have links to account for as well. It has been a while since I have done this, but I believe the order is hubs, then links, then satellites. Look it up.

Line Size
All session tuning revolves around limiting the precision of each port and knowing how big the largest datum will be for each port. For file sources or targets, add up the maximum size of each field in a row; that total is what you set as the line sequential buffer length.
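
For example, with made-up field sizes: a file row with three fields whose largest values are 10, 50, and 200 bytes adds up to 10 + 50 + 200 = 260 bytes, so the line sequential buffer length needs to cover at least 260.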

DTM Buffer Block Size


Increase the DTM buffer block setting in relation to the size of the rows. The integration service only allocates two blocks for each source and target by default, and large volume systems do not play well with this. However, over-allocation will gracelessly fail the session for reasons that only Informatica understands. I would rather have that as an option (degraded throughput vs. failure) than have it rammed down my throat.


The block size calculation depends on the number of sources and targets and the precision of their rows.

Calculation for SessionBufferBlock
SessionBufferBlock = (NumberOfSource + NumberOfTarget) * 2

Calculation for Port Precision
Add the maximum size of the data types of all the ports in the source and target. I *think* this excludes the intermediary transformations. You can read these from Mapping Designer - Source Instance - Ports tab and Mapping Designer - Target Instance - Ports tab.

Calculation for Buffer Block Size
BufferBlockSize = 20 * (Total Precision of Source and Target)

Calculation for DTM size to accommodate the calculated BufferBlockSize:
DTMBufferSize = (SessionBufferBlock * BufferBlockSize * NumberOfPartitions) / 0.9
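
Putting the three formulas together with made-up numbers (one source, one target, a combined source and target precision of 1,000 bytes, 2 partitions):

SessionBufferBlock = (1 + 1) * 2 = 4
BufferBlockSize = 20 * 1,000 = 20,000 bytes
DTMBufferSize = (4 * 20,000 * 2) / 0.9 ≈ 177,778 bytes

In practice you would still respect the 12 MB default minimum for DTM buffer memory discussed below; the point is only to show how the numbers feed into each other.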

WF Mgr - Session - Config Object - Advanced - Default Buffer Block Size: this is where you provide enough blocks for all sources and targets. I was told not to use Auto unless you have only one of each.



To quote the guide:

For a session that contains n partitions, set the DTM Buffer Size to at least n times the value for the session with one partition. The Log Manager writes a warning message in the session log if the number of memory blocks is so small that it causes performance degradation. The Log Manager writes this warning message even if the number of memory blocks is enough for the session to run successfully. The warning message also gives a recommended value.

DTM Buffer Size
At a minimum, size it to allow 20 rows to be processed. The DTM Buffer Size setting specifies the amount of memory the Integration Service uses for Data Transformation Manager buffer memory. The default allocates a minimum of 12 MB for DTM buffer memory. When you enter a value without a unit of measure, it is taken as bytes (1024 = 1024 bytes). The units you can use are KB, MB, and GB.

Keep the buffer for non-double-byte data at 12MB.
Increase the buffer size for Unicode double-byte characters to 24MB.
Multiply either of these by the number of partitions.
If any block you are writing holds a data type larger than the specified size (e.g. a BLOB), increase the size to the maximum expected for a single insert.
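
As a quick sanity check with made-up numbers: a Unicode session running 4 partitions would start at 24MB * 4 = 96MB of DTM buffer memory, and then be tuned up from there in buffer-block-size increments.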

WF Mgr - Session - Properties - DTM Buffer Size:
DTMBufferSize = ((NumberOfSource + NumberOfTarget) * 2 * BufferBlockSize * Partitions) / 0.9
This equation is from the Performance Analyzer output on the Communities pages. I have also seen it rearranged as: (session buffer blocks) = (.9) * (DTM Buffer Size) / (Default Buffer Block Size) * (number of partitions).

Increase the property by multiples of the buffer block size, and then run and time the session after each increase.


Caches
Some transformations require caching: Aggregator, Rank, Lookup, Joiner.
- Limit the number of connected input/output and output-only ports.
- Select the optimal cache directory location that is available to the Integration Service as a PowerCenter resource.
- Increase the cache sizes.
- Use the 64-bit version of PowerCenter to run large cache sessions.

When you attach the session log to a post-session email, enable flat file logging.

Connections
Always use native/manufacturer connections over generic ODBC connections.
For Microsoft SQL Server, consult your database documentation for information about how to increase the packet size. You must also change the packet size in the relational connection object in the Workflow Manager to reflect the database server packet size. The current size can be found in SSMS - Server - Properties - Advanced - Network Packet Size (4096 is the default, I think).

TIPS:
If you are not sure you have a target bottleneck, change the target to a flat file local to the integration server. If performance is better, then you need to tune the target's checkpoint interval, packet size, database design (indexing, partitioning), or timing in relation to other operations on the target (indexing, backups).

REFERENCES:
http://aambarish.blogspot.com/2012/05/tuning-sessions-for-better-performance.html

http://makingdatameaningful.com/2012/09/18/data_vault-hubs_links_and_satellites_with_associated_loading_patterns/