Thursday, December 11, 2014

Conditionally Running SQL Based on Server

In data operations we often have to limit access to personally identifiable information (PII) in environments with fewer safeguards. In production we may import personal identifiers such as credit card numbers or government-issued IDs (Social Security numbers in the US), while in a development environment we may want to mask that data on import. Written incorrectly, this requires two different import processes, meaning that what is tested in dev or QA is not what is promoted to production. Instead, the conditional logic can be based on the server name, provided that name is a consistent alias for the host managing the distributed transaction coordination. Here is a template.

USE <DATABASE>


IF @@SERVERNAME IN ('Server1\Instance', 'Server2')
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION
        --insert your production code here

        --no changes past here
        COMMIT
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK
        PRINT ERROR_MESSAGE()
    END CATCH
END


IF @@SERVERNAME NOT IN ('Server1\Instance', 'Server2')
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION
        --insert your PII masking code here

        --no changes past here
        COMMIT
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK
        PRINT ERROR_MESSAGE()
    END CATCH
END

Thursday, November 13, 2014

Numeric Datatype Precision

.NET Datatype              | Max Exponent | .NET Precision | .NET Max Value             | SQL Datatype                | SQL Datatype Max                | SQL Precision | Storage
DOUBLE                     | 308          | 15             | ~1.7976931348623157E+308   | FLOAT                       | 1.79E+308                       | 7/15          | 4 or 8 bytes
LONG                       | 19           | 19             | 9223372036854775807        | BIGINT                      | 9223372036854775807             | 19            | 8 bytes
INT64                      | 19           | 19             | 9223372036854775807        | BIGINT                      | 9223372036854775807             | 19            | 8 bytes
ULONG                      | 20           | 20             | 18446744073709551615       | NUMERIC(20,0)/DECIMAL(20,0) |                                 | 38            |
UINT64                     | 20           | 20             | 18446744073709551615       | NUMERIC(20,0)/DECIMAL(20,0) |                                 | 38            |
FLOAT                      | 7            | 7              | ~3.402823E+38              | FLOAT/DECIMAL               | 1.79E+308                       | 7/15          |
System.Numerics.BigInteger | No max       | No max         | No max                     | VARBINARY(MAX)              | (2^31-1)-2 bytes                | > 80,340      |
Decimal                    | 28           | 28             | ~7.9E+28                   | FLOAT/DECIMAL               |                                 |               |
                           |              |                |                            | INT                         | -2,147,483,648 to 2,147,483,647 |               | 4 bytes
                           |              |                |                            | SMALLINT                    | -32,768 to 32,767               |               | 2 bytes
                           |              |                |                            | TINYINT                     | 0 to 255                        |               | 1 byte
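
As a quick illustration of the mapping above, here is a minimal sketch of a staging table for .NET values; the table and column names are hypothetical.

-- Hypothetical staging table illustrating the .NET-to-SQL type mapping above.
CREATE TABLE dbo.DotNetTypeStage
(
    LongValue    BIGINT,          -- .NET long / Int64
    DoubleValue  FLOAT,           -- .NET double (15-digit precision)
    DecimalValue DECIMAL(28, 10), -- .NET decimal; choose a scale that suits the data
    IntValue     INT,             -- .NET int / Int32
    ULongValue   DECIMAL(20, 0)   -- .NET ulong / UInt64 (SQL Server has no unsigned 64-bit integer)
);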

Tuesday, August 12, 2014

Informatica MD5 Hash vs SQL MD5 Hash

This shows the difference between SQL Server MD5 hash values and those generated by Informatica, which matters when comparing hash columns used for type 2 history that were generated by different tooling across databases. HASHBYTES returns a VARBINARY(16); converting it to VARCHAR with style 2 yields the 32-character hexadecimal string form, which is comparable to the string Informatica's MD5 function returns (watch the case of the hex digits if your collation is case sensitive).

SELECT
    CONVERT(VARCHAR(35), HASHBYTES('MD5', CONVERT(VARCHAR(500), 'VALUE STRING 1') + 'VALUE STRING N'), 2) AS InfaHash, --hex string, Informatica style
    HASHBYTES('MD5', CONVERT(VARCHAR(500), 'VALUE STRING 1') + 'VALUE STRING N') AS NormalHash                         --raw VARBINARY(16)

Friday, August 1, 2014

Debugging Unicode Character Mapping Issues

Recently I had an issue where Informatica reported that a field was truncating, even though the source value was the same size as the target definition.

It turned out that the source data contained smart quotes, and the Unicode code page for the SQL Server relational connection did not support them. However, this was in a very wide comment column, which made it hard to pin down as the source of the issue.

1. Open your session log.
2. Open the bad file output and read it. In my case the bad file explicitly escaped the character, incorrectly representing the data.
3. Extract the bad row from your source file and create a smaller test file. See Tool 1 below.
4. Load the record into a table variable with loose data typing, and also into a clone of your table. Look for truncation. Compare the values with a SELECT...EXCEPT...SELECT query (a sketch appears after Tool 2 below). Start removing columns from the selects to determine what is not loading into your table. For me this did not work, as the issue was not my database definition.
5. Run the second tool against your table variable to get an inventory of your characters. Anything weird like smart quotes or cedillas?
6. Run the bad row file through your debugger, ensuring you write to a table where you will not hit a PK violation. Start replacing values in the file and removing sections of long strings until you find the issue.
 

 
TOOL 1: PowerShell String Puller
Get-Content -Encoding UTF8 I:\somefile.dat | Where-Object {$_ -match '<Any Word from File Header>' -or $_ -match '<PK FROM ERROR LOG>'}| Set-Content -Encoding UTF8 OutputErrorRecord.dat


TOOL 2: SQL Character Counter
-- Build a 1..999999 tally table, then count every distinct character in the suspect column.
DECLARE @Numbers TABLE (Number INT NOT NULL PRIMARY KEY);

WITH digits (d) AS (
    SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
    SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION
    SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION
    SELECT 0
)
INSERT @Numbers (Number)
SELECT Number
FROM (SELECT i.d + ii.d * 10 + iii.d * 100 + iv.d * 1000 +
             v.d * 10000 + vi.d * 100000 AS Number
      FROM digits i
      CROSS JOIN digits ii
      CROSS JOIN digits iii
      CROSS JOIN digits iv
      CROSS JOIN digits v
      CROSS JOIN digits vi) AS Numbers
WHERE Number > 0;

-- Inventory of characters and their counts; the binary collation keeps look-alike characters distinct.
SELECT ch COLLATE Latin1_General_BIN2 AS ch, COUNT(*) AS occurrences
FROM   <SOME TABLE>
CROSS APPLY (SELECT ch = SUBSTRING(<SUSPECT COL1>, Number, 1)
             FROM @Numbers
             WHERE Number BETWEEN 1 AND LEN(<SUSPECT COL1>)) AS x
GROUP BY ch COLLATE Latin1_General_BIN2
ORDER BY 2
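
For step 4, here is a minimal sketch of the SELECT...EXCEPT comparison, assuming a hypothetical loosely typed table variable and a hypothetical clone table dbo.MyTable_Clone; substitute your own columns.

-- Compare a loosely typed copy of the bad row against the typed clone table.
DECLARE @LooseRow TABLE (KeyCol NVARCHAR(100), CommentCol NVARCHAR(MAX));

INSERT @LooseRow (KeyCol, CommentCol)
VALUES (N'12345', N'comment text from the extracted bad row');  -- paste the bad row values here

SELECT KeyCol, CommentCol FROM @LooseRow
EXCEPT
SELECT CAST(KeyCol AS NVARCHAR(100)), CAST(CommentCol AS NVARCHAR(MAX))
FROM dbo.MyTable_Clone;
-- Remove columns from both SELECT lists until the offending column drops out of the difference.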

Friday, June 27, 2014

PowerShell - Adding Concurrency in PowerShell 3 and 4

Our scheduler was throwing errors when invoking a decompression routine I wrote in PowerShell. The requirement was to invoke the decompression script many times concurrently with different file name parameter values. Due to a limit on the command field length in the scheduler, we had to call my script from a bat file, which was swallowing my errors and output.

I had a few scenarios to test.
Can a simple PowerShell script be called a second time while it is still executing? I had assumed it could, but needed to prove this before moving to more relevant scenarios.

I was unsure if my code was written to run concurrently. Rather than confuse my code with concurrency, I wrote a simple test script and a simple concurrency test. This proved I could invoke a standard PS script at least 6 times in the same nanosecond, using different parameters, without any issue.

So calling the script itself was not an issue. I was unsure whether my code was thread safe, and whether the scheduled jobs were written correctly. I saved the thread safety test for last, assuming the issue was with the way my code was being called in the scheduler.



Create a file named ConcurrencyTest.ps1
param(
    [INT]$WAITER = $(throw "You did not provide a value for the WAITER parameter.")
)

# Timestamp with colons replaced so it is valid in a file name
$time = Get-Date -Format o | ForEach-Object {$_ -replace ":", "."}
$fileName = ".\OutPut$time.txt"
New-Item $fileName -ItemType file
Write-Output "start $WAITER $time" > $fileName

Start-Sleep $WAITER    # simulate work of varying length

$time = Get-Date -Format o | ForEach-Object {$_ -replace ":", "."}
Write-Output "end attempt $WAITER $time" >> $fileName

Create a file named ConcurrencyWorkflow.ps1

workflow RunStuffParallel
{
    $executionAttempts = @(1..100)

    ForEach -Parallel ($Attempt in $executionAttempts)
    {
        Invoke-Expression ".\ConcurrencyTest.ps1 $Attempt"   # varying sleep adds some variability
#       Invoke-Expression ".\ConcurrencyTest.ps1 5"          # hard-coded value adds linearity
    }
}

RunStuffParallel
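
Run ConcurrencyWorkflow.ps1 from the directory that contains ConcurrencyTest.ps1, since the workflow calls the test script by relative path; each invocation leaves its own output file with a start and end timestamp, which makes the overlap easy to verify.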

Tuesday, June 10, 2014

Informatica Performance Tuning Cheat Sheet

If you are reading this, stop. Log into the Informatica support site and download the performance tuning guide. That is the first step. It is not comprehensive, as each database structure and corporate infrastructure is different, but the basics are there in detail.

Before you continue, read your thread statistics in the session log. I assure you that this is going to point in the general direction of the issue. Save the thread stats for comparison after tuning.

Use Grid Computing
If your company is not penny wise and pound foolish, take the time to do a cost estimate on a grid and an integration service load balancer. Without this (which I do not have at the time of this writing), you are constrained beyond any measure your coding can resolve.

Use Concurrency
This is true of all databases, but has a different flavor depending on your target and source structure (flat file to OLTP, OLTP to flat DW, Flat stage to schema data mart, OLTP to flat file).

Override Tracing
One more option: in the session properties, under the Config Object tab --> Error Handling section, change Override Tracing from NONE to NORMAL; you can observe at least a 25% performance improvement.

OLTP
Make sure your OLTP database has no more dependencies than required. For an OLTP design where you are not the database developer, work with the application development team to understand the requirements around all the foreign keys. This helps you determine what the load order for tables can be, as well as allowing you to see what groups of tables can be loaded at the same time.

Data Marts
For a star schema, only allow dimensions to be linked through the fact. I know that is obvious, but people always ask for one little compromise for one report or another. Load conformed dimensions before all other data mart processing. Then load as many dims at the same time as your database engine can handle, then load the facts. I prefer using post SQL to call a sproc that calculates the dimension start and end dates.
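
Here is a minimal sketch of that kind of post-load sproc logic, assuming a hypothetical dimension dbo.DimCustomer with a CustomerId business key and EffectiveStartDate/EffectiveEndDate columns; this is an illustration, not the exact procedure I use.

-- Close out each version's end date with the start date of the next version for the same business key.
UPDATE d
SET    d.EffectiveEndDate = nxt.NextStartDate
FROM   dbo.DimCustomer AS d
CROSS APPLY (SELECT MIN(n.EffectiveStartDate) AS NextStartDate
             FROM   dbo.DimCustomer AS n
             WHERE  n.CustomerId = d.CustomerId
               AND  n.EffectiveStartDate > d.EffectiveStartDate) AS nxt
WHERE  nxt.NextStartDate IS NOT NULL;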

Load hub and satellite models using Data Vault 2.0 loading logic. It is somewhat similar to loading dims before facts, but you have links to account for as well. It has been a while since I have done this, but I believe the order is hubs, then links, then satellites. Look it up.

Line Size
All session tuning revolves around limiting the precision of each port and knowing how big the largest datum will be for each port. For file sources or targets, add up the max size of each field in a row; that per-row total is what you set as the line sequential buffer length.

DTM Buffer Block Size


Increase the DTM buffer block setting in relation to the size of the rows. The integration service only allocates two blocks for each source and target by default. Large volume systems do not play well with this. However, over-allocation will gracelessly fail the session for reasons that only Informatica understands. I would rather have that as an option (degraded throughput vs. failure) than have it rammed down my throat.


The block size calculation depends on the number of sources and targets and on the precision of their rows.

Calculation for SessionBufferBlock
SessionBufferBlock = (NumberOfSource + NumberOfTarget) * 2

Calculation for Port Precision
Add up the maximum sizes of the data types of all the ports in the source and target (Mapping Designer - Source Instance - Ports tab and Mapping Designer - Target Instance - Ports tab). I *think* this excludes the intermediary transformations.

Calculation for Buffer Block Size
BufferBlockSize = 20 * (Total Precision of Source and Target)

Calculation for DTM size to accommodate the calculated BufferBlockSize:
DTMBufferSize = (SessionBufferBlock * BufferBlockSize * NumberOfPartitions) / 0.9
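
As a hypothetical worked example (the numbers are illustrative, not from a real mapping): with one source, one target, a combined port precision of 1,000 bytes, and one partition, SessionBufferBlock = (1 + 1) * 2 = 4, BufferBlockSize = 20 * 1,000 = 20,000 bytes, and DTMBufferSize = (4 * 20,000 * 1) / 0.9 ≈ 88,889 bytes, which you would then round up in multiples of the buffer block size.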

WF Mgr - Session - Config Object - Advanced - Default Buffer Block Size: how many blocks do I need to provide for all sources and targets? Don't use Auto, I was told, unless you have only one of each.



To quote the guide:

For a session that contains n partitions, set the DTM Buffer Size to at least n times the value for the session with one partition. The Log Manager writes a warning message in the session log if the number of memory blocks is so small that it causes performance degradation. The Log Manager writes this warning message even if the number of memory blocks is enough for the session to run successfully. The warning message also gives a recommended value.

DTM Buffer Size
Minimum size to allow 20 rows to be processed. The DTM Buffer Size setting specifies the amount of memory the Integration Service uses for Data Transformation Manager (DTM) buffer memory. The default allocates a minimum of 12 MB for DTM buffer memory. When you enter a value without a unit of measure, it is taken to mean bytes (1024 = 1024 bytes). The units you can use are KB, MB, and GB.

You keep the buffer for non-double-byte data at 12 MB.
You increase the buffer size for Unicode double-byte characters to 24 MB.
Multiply either of these by the number of partitions.
If any block you are writing contains a data type larger than the specified size (e.g. a BLOB), increase the size to the max expected for a single insert.

WF Mgr - Session - Properties - DTM Buffer Size:
DTMBufferSize = ((NumberOfSource + NumberOfTarget) * 2 * BufferBlockSize * Partitions) / 0.9
This equation is from the Performance Analyzer output on the Communities pages. I have also seen: (Session Buffer Blocks) = 0.9 * (DTM Buffer Size) / (Default Buffer Block Size) * (number of partitions).

Increase the property by multiples of the buffer block size, and then run and time the session after each increase.


Caches
Some transformations require caching: Aggregator, Rank, Lookup, Joiner.
- Limit the number of connected input/output and output-only ports.
- Select the optimal cache directory location that is available to the Integration Service as a PowerCenter resource.
- Increase the cache sizes.
- Use the 64-bit version of PowerCenter to run large cache sessions.

When you attach the session log to a post-session email, enable flat file logging.

Connections
Always use native/manufacturer connections over generic ODBC connections.
For Microsoft SQL Server, consult the database documentation for information about how to increase the packet size. You must also change the packet size in the relational connection object in the Workflow Manager to match the database server packet size. The current size can be found under SSMS - Server - Properties - Advanced - Network Packet Size (4096 is the default, I think).
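
If you would rather query it than click through SSMS, a quick check of the server-side setting can be run against sys.configurations (values are in bytes):

-- Current and configured network packet size for the SQL Server instance, in bytes.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = N'network packet size (B)';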

TIPS:
If you are not sure you have a target bottleneck, change the target to a flat file local to the integration server. If performance is better, then you need to tune the target: checkpoint interval, packet size, database design (indexing, partitioning), or timing in relation to other operations on the target (indexing, backups).

REFERENCES:
http://aambarish.blogspot.com/2012/05/tuning-sessions-for-better-performance.html

http://makingdatameaningful.com/2012/09/18/data_vault-hubs_links_and_satellites_with_associated_loading_patterns/

Tuesday, May 6, 2014

T-SQL - OPENROWSET Access to Excel Data

This is a sad story. A developer came to me with the question: is there a DOS command to change an Excel file to a CSV file?

There was no context to the question, but I knew this was going to require a step back. The intention was to load different data sets on the same sheet into a database as part of a conversion away from EUC solutions to an enterprise solution. We discussed the merits of using third-party libraries to access the Excel data, using an ETL tool with built-in access engines (Informatica or SSIS), or using raw SQL. As he is stronger in SQL, we discussed using OPENROWSET to access the data through the Excel/Access data access engine. We then planned to load into a staging table and parse out the subsets of data using the subtotal columns as max row number delimiters.

One issue was that all the examples we could find used the Jet 4.0 engine:

SELECT * INTO XLImport3 FROM OPENDATASOURCE('Microsoft.Jet.OLEDB.4.0',
'Data Source=C:\test\xltest.xls;Extended Properties=Excel 8.0')...[Customers$]

SELECT * INTO XLImport4 FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=C:\test\xltest.xls', [Customers$])

SELECT * INTO XLImport5 FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=C:\test\xltest.xls', 'SELECT * FROM [Customers$]')
 

This is only useful for 32-bit applications and servers. I found that the 64-bit data access provider is the ACE provider. It needs to be downloaded and registered on the database server to allow access to EUC sources like Excel and Access in a 64-bit operating environment. The SQL would be rewritten as:


SELECT * INTO XLImport5 FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 8.0;Database=C:\test\xltest.xls', 'SELECT * FROM [Customers$]')
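
One caveat worth noting (a configuration assumption on my part, not something from our original setup notes): ad hoc OPENROWSET/OPENDATASOURCE queries are disabled by default on SQL Server, so the 'Ad Hoc Distributed Queries' option may also need to be enabled.

-- Allow ad hoc OPENROWSET / OPENDATASOURCE calls against OLE DB providers such as ACE.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;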



REFS:

http://www.microsoft.com/en-us/download/details.aspx?id=13255

http://support.microsoft.com/kb/321686

http://social.msdn.microsoft.com/Forums/en-US/b6274401-d8fe-49ed-bd79-e8cd431fd3ce/microsoftjetoledb40-has-not-been-registered-for-openrowset-query?forum=transactsql