Wednesday, November 25, 2015

Mongo Performance Monitoring

A basic database top for Mongo is
> & 'mongostat' /host:localhost /port:27014 /username:mongoMonitor /password:<> /authenticationDatabase:admin

insert query update delete getmore command % dirty % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     time
    *0    *0     *0     *0       0     1|0     0.0   34.2       0 641.0M 562.0M   0|0   1|0   79b    15k    1 15:25:26
    *0    *0     *0     *0       0     1|0     0.0   34.2       0 641.0M 562.0M   0|0   1|0   79b    15k    1 15:25:27
    *0    *0     *0     *0       0     1|0     0.0   34.2       0 641.0M 562.0M   0|0   1|0   79b    15k    1 15:25:28
insert query update delete getmore command % dirty % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     time
    *0    *0     *0     *0       0     1|0     0.0   34.2       0 641.0M 562.0M   0|0   1|0   79b    15k    1 15:25:36

This is very console driven and requires capturing the console output for trend analysis.

There are a number of for-cost solutions for monitoring Mongo. Most monitoring platforms have some way to hook into Mongo's built-in performance metrics. SolarWinds uses a PowerShell wrapper (on Windows, probably bash on Linux) in their Mongo template. This was interesting because it showed a clear pattern for building your own monitor if you do not have really nice monitoring tools like we do.

The path to writing your own monitor (see the sketch below) is:
  • creating a connection
  • adding a polling interval
  • creating an object to load the JSON stats into
  • parsing the object into discrete KVPs
  • associating the measures with a date and time
  • adding aggregates to the measures
  • storing the metrics in a database or file
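
A minimal sketch of that loop in Python, assuming the pymongo driver. The connection string matches the mongostat example above, but the password, the CSV file name, and the save_metrics helper are illustrative placeholders, not a specific product's API.

import csv
import time
from datetime import datetime, timezone

from pymongo import MongoClient


def flatten(doc, prefix=""):
    """Parse the serverStatus document into discrete key/value pairs."""
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, name + ".")
        else:
            yield name, value


def save_metrics(rows, path="mongo_metrics.csv"):
    """Store the metrics in a file (swap this out for a database insert)."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)


# placeholder credentials; match them to your mongoMonitor user
client = MongoClient("mongodb://mongoMonitor:secret@localhost:27014/admin")
poll_interval = 60  # seconds; tighten or relax as utilization levels out

while True:
    stats = client.admin.command("serverStatus")   # the serverStatus command shown below
    polled_at = datetime.now(timezone.utc)         # associate measures with date and time
    save_metrics((polled_at.isoformat(), key, value) for key, value in flatten(stats))
    time.sleep(poll_interval)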

The basic query you can run would be
    db.runCommand( { serverStatus: 1} )

A more discrete monitor call (for queue exhaustion in this case) would be
    db.runCommand( { serverStatus: 1, metrics: 0, locks: 0, globalLock: 1, asserts: 0, connections: 0,  network:0, cursors: 0, extra_info:0 , opcounters:0, opcountersRepl: 0, storageEngine:0, wiredTiger:0})

Using discrete calls per metric group increases the number of connections, queues, and I/O. However, it allows you to poll individual metric groups at different intervals.

The globalLock stats look like this:
 {"totalTime":1048003623000,"currentQueue":{"total":0,"readers":0,"writers":0},"activeClients":{"total":10,"readers":0,"writers":1}}

globalLock:
  totalTime: 1048003623000
  currentQueue:
    total: 0
    readers: 0
    writers: 0
  activeClients:
    total: 10
    readers: 0
    writers: 1

Polling this on a one-minute interval can give you really detailed utilization patterns when you first implement applications with Mongo. However, over time you should be able to scale back to 5 or 15 minute intervals as your average utilization levels out.

Monday, October 5, 2015

Database Scaling Projections




Metric                                                        | Current | Last Year | Monthly Growth | 12 Month
--------------------------------------------------------------|---------|-----------|----------------|---------
Database Size (GB, TB, PB)                                    |         |           |                |
Average Stored Record Size                                    |         |           |                |
Average Collection or Table Size                              |         |           |                |
Largest Collection or Table                                   |         |           |                |
Writes / Day                                                  |         |           |                |
Reads / Day                                                   |         |           |                |
% Queries Using PK                                            |         |           |                |
% Queries Using Non-PK Idx                                    |         |           |                |
% Queries Using Aggregations (sort of Mongo or Hive specific) |         |           |                |
% MapReduce Jobs                                              |         |           |                |
Avg Records Returned per Query                                |         |           |                |
Avg Record Size Returned per Query                            |         |           |                |
Avg Hash Table Size for UNION or JOIN Operations              |         |           |                |
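
The Monthly Growth and 12 Month columns are related by simple compounding. A quick sketch of that arithmetic, assuming the monthly growth rate compounds (which may be optimistic or pessimistic for your workload):

def project(current, monthly_growth_rate, months=12):
    """Project a metric forward assuming compounding monthly growth."""
    return current * (1 + monthly_growth_rate) ** months

# e.g. a 500 GB database growing 4% per month
print(project(500, 0.04))   # ~800 GB after 12 months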

Tuesday, September 22, 2015

Running ETL Code in Powershell Workflow

I needed a test harness to run multiple concurrent versions of the same Pentaho job. I wanted to test that the pid file feature I added prevented subsequent executions while the job was already running.

Create a file named ConcurrencyTest.ps1 with the following content
param(
 [Parameter(Position=0,
      Mandatory=$True,
      ValueFromPipeline=$True)]
    [INT]$Attempts=$(throw "You did not provide a value for Attempts parameter.")
    )

function DoStuff
{
    param(
    [Parameter(Position=0,
    Mandatory=$True,
    ValueFromPipeline=$True)]
    [int]$Iter
    )
    $root ="$env:programfiles"
    Set-Location $root\Pentaho\design-tools\data-integration
    cmd /c .\Kitchen.bat /file:C:\Source\Trunk\Transforms\Job_Ods_AggregateMongo.kjb /Level:Detailed | Out-File out.$Iter.log
}

workflow RunStuffParallel
{
    param(
        [Parameter(Position=0,
        Mandatory=$True)]
        [int]$MaxIter
    )

    $ExecutionAttempts=@(1..$MaxIter)
  
    ForEach -Parallel ($Attempt in $ExecutionAttempts)
    {
        DoStuff -Iter $Attempt
    }
}

RunStuffParallel -MaxIter $Attempts

Execute the test using .\ConcurrencyTest.ps1 -Attempts 5

Wednesday, September 16, 2015

Microsoft Web API Custom Model Binding

Web API is great. It is easier than WCF service creation and costs 100% less than ServiceStack (ServiceStack is more feature rich, FWIW). There is a ton of documentation online for Web API, and I like it better than the ServiceStack documentation available outside of Pluralsight (my opinion, not a huge deal).

One topic I found annoyingly less well documented was how to parse the URI query string, validate the values, and build a conditional filter (or where clause) from the query string in the repository class. The links below include a really good article showing some of this.

using System;
using DemoService.Common;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Web.Http.Controllers;
using System.Web.Http.ModelBinding;
using Newtonsoft.Json;

namespace DemoService.Models
{
    public class ServiceOptionsModelBinder : IModelBinder
    {

       
        public bool BindModel(HttpActionContext actionContext, ModelBindingContext bindingContext)
        {

            var key = bindingContext.ModelName;
            var val = bindingContext.ValueProvider.GetValue(key);   // not used below yet; kept for later validation work
            List<KeyValuePair<string, string>> requestValuePairs;
            var request = actionContext.Request;
            var requestMethod = request.Method;
            // Hydrate any custom header values into a QueryObj (from DemoService.Common); the
            // headers are re-serialized to JSON first so DeserializeObject receives a string.
            var requestHeader = JsonConvert.DeserializeObject<QueryObj>(
                JsonConvert.SerializeObject(request.Content.Headers.ToDictionary(h => h.Key, h => h.Value)));


            //Check and get source data from uri
            if (!string.IsNullOrEmpty(request.RequestUri.Query))
            {
                //also consider using QueryStringValueProvider
                requestValuePairs = request.GetQueryNameValuePairs().ToList();
            }
            //TODO: when we need to create a POST request, fix the type mismatch below
            //Check and get source data from body
            else if (request.Content.IsFormData())
            {
                var requestBody = request.Content.ReadAsStringAsync().Result;
                requestValuePairs = null;
                //requestValuePairs = Parsers.ConvertToKvp(requestBody);
            }

            else throw new NotSupportedException("Not supported, Aint HTTP compatible");
            bindingContext.Model = requestValuePairs;          

            return true;
        }
    }


}
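
The binder above just collects the key/value pairs; the validate-and-build-a-filter step happens in the repository. That part of the pattern is not framework specific, so here is a rough Python sketch of the same idea. The allowed fields, the validators, and the Mongo-style filter target are all assumptions for illustration, not part of the Web API sample.

from urllib.parse import parse_qsl

# whitelist of filterable fields and simple validators (assumed for illustration)
ALLOWED = {
    "status": lambda v: v in {"open", "closed"},
    "customerId": str.isdigit,
}

def build_filter(query_string):
    """Parse the query string, validate values, and build a Mongo-style filter."""
    filters = {}
    for key, value in parse_qsl(query_string):
        validator = ALLOWED.get(key)
        if validator is None:
            continue                      # ignore unknown parameters
        if not validator(value):
            raise ValueError(f"invalid value for {key}: {value}")
        filters[key] = int(value) if value.isdigit() else value
    return filters                        # e.g. pass to collection.find(filters)

print(build_filter("status=open&customerId=42"))
# {'status': 'open', 'customerId': 42}

The whitelist keeps unknown or unvalidated parameters from ever reaching the data layer, which is the main point of doing the binding and filtering explicitly.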


REFS:
http://www.strathweb.com/2013/04/asp-net-web-api-parameter-binding-part-1-understanding-binding-from-uri/
http://www.codeproject.com/Articles/701182/A-Custom-Model-Binder-for-Passing-Complex-Objects
http://stackoverflow.com/questions/29393442/custom-model-binder-for-a-base-class-in-web-api


Tuesday, July 21, 2015

C# Notation Reminder

I often forget the nomenclature of a few C# coding constructs. Maybe writing them down will help.

In C# 2.0, generics were introduced. For a generic, there is a <T>, which stands for generic type parameter. The generic type is a blueprint for what you actually want to implement. An example is the code below, where I want to implement an IoC container to write session details to Redis.


            /// <summary>
            /// funq IoC container for in memory cache client like Redis
            /// used for Auth
            /// </summary>
            /// <param name="container"></param>
            public override void Configure(Container container)
            {
                Plugins.Add(new AuthFeature(
                    () => new AuthUserSession(),
                    new IAuthProvider[] {new BasicAuthProvider(), }));

                container.Register<ICacheClient>(new MemoryCacheClient());
                var userRepository = new InMemoryAuthRepository();
                container.Register<IUserAuthRepository>(userRepository);
            }
        }

The other thing I forget is what () => means for lambdas. To create a lambda expression, you optionally specify input parameters on the left side of the lambda operator =>, and you put the expression or statement block on the other side. I always forget that () is just an empty parameter list. I am not even sure why I used an empty list in the delegate above. Why!?

For example, the lambda expression x => x * x specifies a parameter that’s named x and returns the value of x squared.

REFS:
https://msdn.microsoft.com/en-us/library/0zk36dx2.aspx
https://msdn.microsoft.com/en-us/library/bb397687.aspx

Thursday, July 9, 2015

NodeJS - Concurrency Model

NodeJS is magical. It is fast and easy and monolingual from UI to DAL. It is the purest unicorn of technology. It supports concurrency but is single threaded. So your code executes in the single thread; however, all I/O is evented and asynchronous, so I/O calls won't block the server. The callback and promise abstractions provide easy access to the event loop. Any I/O call saves the callback and returns control to the node runtime environment. One key idea is "CPU-intensive work should be split off to another process with which you can interact as with events, or by using an abstraction like WebWorkers." Competing ideas to ponder are the use of the cluster module presented in Chris McDonald's Portland user group example, or using PM2 as outlined in "clustering made easy, managing eventing".

Basic Concepts
  • Continuation-passing style: a functional programming paradigm where each function takes an extra argument used to pass its return value along. That means that when invoking a CPS function, the calling function is required to supply a procedure to be invoked with the subroutine's "return" value. Expressing code in this form makes a number of things explicit which are implicit in direct style: procedure returns become calls to a continuation; intermediate values are all given names; the order of argument evaluation is made explicit; and the final action of the called procedure is simply to call a procedure with the same continuation, unmodified, that was passed to its caller. (See the short sketch after this list.)
  • Streams: readable and writable streams are an alternative way of interacting with (file|network|process) I/O.
  • Buffers: buffers provide a binary-friendly, higher-performance alternative to strings by exposing raw memory allocation outside the V8 heap.
  • Events: many Node.js core libraries emit events. You can use EventEmitters to implement this pattern in your own applications.
  • Timers: setTimeout for one-time delayed execution of code, setInterval for periodically repeating execution of code. See http://book.mixu.net/node/ch9.html
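
Continuation-passing style is easier to see in a toy example than in a definition. A tiny Python sketch of the idea (the continuation k plays the role a Node callback plays):

# Direct style: the caller gets a return value.
def add(a, b):
    return a + b

# Continuation-passing style: the caller supplies the "rest of the program"
# as an extra argument, and the result is handed to it instead of returned.
def add_cps(a, b, k):
    k(a + b)

def square_cps(x, k):
    k(x * x)

# (1 + 2) squared, with every intermediate value passed to a continuation.
add_cps(1, 2, lambda total: square_cps(total, print))   # prints 9

Nothing is ever returned to the caller; every result is handed forward, which is exactly the shape Node's callback-based I/O takes.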

Cautionary Tales and Coding Practices
http://callbackhell.com/
http://becausejavascript.com/node-js-process-nexttick-vs-setimmediate/
http://howtonode.org/understanding-process-next-tick

PayPal has a lot of data and a lot of concurrent users. This is a problem I want to have, so when I see them move from Java to Node, I pay attention. PayPal developed Java and Node pages side by side to benchmark performance. To quote, the node.js app was:
  • Built almost twice as fast with fewer people
  • Written in 33% fewer lines of code
  • Constructed with 40% fewer files

Both CRUD apps were simple (few routes, few API calls). Node outperformed Java, even given the fact that the Java app had a two-month head start. You can see the performance benchmarks here:
https://www.paypal-engineering.com/2013/11/22/node-js-at-paypal/

I care about Node because I want an easier path to building a site to secure access to business intelligence assets. Most data services teams I work with have no web developers at all and rely on SharePoint and Power BI to deliver what-if analysis. Based on my review of Express with templating and the elegance of D3 on Angular for charting, I think Node becomes less "too hard" to implement for a reporting application team.

I wanted to understand the concurrency model as compared to using the actor pattern implemented in Hadoop via Akka.



Some ideas I liked:

"It is useful to understand how node and V8 interact. Node handles waiting for I/O or timers from the operating system. When node wakes up from I/O or a timer, it generally has some JavaScript callbacks to invoke. When node runs these callbacks, control is passed into V8 until V8 returns back to node.
So, if you do var ii = 1; ii++;, you will never find that ii is anything other than 2. All JavaScript runs until completion, and then control is passed back to node. If you do doSomething(); doSomething(); that will always run doSomething twice, and it will not return to node's event loop until the second invocation of doSomething returns. This means you can completely lock up node from a simple error like this:
for (var i=0 ; i >= 0 ; i++) {}
It doesn't matter how many I/O callbacks you have registered, timers set to go off, or sockets waiting to be read. Until V8 returns from that infinite loop, node does no more work.
This is part of what makes programming in node so nice. You never have to worry about locking. There are no race conditions or critical sections. There is only one thread where your JavaScript code runs."

REFS
https://www.paypal-engineering.com/2013/11/22/node-js-at-paypal/
http://blog.mixu.net/2011/02/01/understanding-the-node-js-event-loop/
http://stackoverflow.com/questions/5153492/models-of-concurrency-in-nodejs
https://creationix.com/jsconf.pdf
https://github.com/xk/node-threads-a-gogo

Wednesday, July 1, 2015

A Basic Outline For NodeJS Rapid Development

I have intentionally veered away from web development work. This is due to some preconceived notions about the market and talent. The basic naysaying sounded like this: there is too much abstraction to understand web code, the learning curve is too steep in modern web development, there are already too many talented Java/.Net web developers out there to compete with, etc. I am really being forced to rethink what lines I should draw for myself. A good way to break down some of that lack of knowledge seems to be learning the Node JS ecosystem (i.e., the MEAN stack). It has been tons of fun so far.

Here is a set of notes I am using to learn the different templating engines available for scaffolding a new Node project. This is based on a Windows work environment. I can already see how ramping up a templated site to allow for D3 on Angular visualizations is a great entry point for any BI project. Since it is all JavaScript, some of the IoC and MVC pattern abstractions that make C# code hard to read become far less indirect (at least to me so far). Also, everything has a cute name.

notes:


//install node; I used Chocolatey
Write-host "install node globally" -foreground "Darkcyan"
choco install node
npm update -g npm
Write-host "install express globally" -foreground "DarkCyan"
npm install -g express
Write-host "install node-inspector debugger globally" -foreground "DarkCyan"
npm install -g node-inspector
Write-host "install yeoman app templating globally" -foreground "DarkCyan"npm install -g yo
Write-host "install daemon to monitor changes to your node site globally" -foreground "Darkcyan"
npm install -g nodemon
Write-host "install grunt gulp and bowerbuild support globally" -foreground "Darkcyan"
npm install -g bower grunt-cli gulp
Write-host "install express site templating generator. Sort of sucks but oh well. You can extend it with Hogan for logicless templates or ejs/jade for logic supporting templates" -foreground "darkcyan"
npm install -g express-generator
Write-host "install hottowel app templating generator." -foreground "darkcyan"
npm install -g generator-hottowel

/*
-h, --help          output usage information
-V, --version       output the version number
-e, --ejs           add ejs engine support (defaults to jade)
    --hbs           add handlebars engine support
-H, --hogan         add hogan.js engine support
-c, --css <engine>  add stylesheet <engine> support (less|stylus|compass) (defaults to plain css)
    --git           add .gitignore
-f, --force         force on non-empty directory
*/


//change to root source dir

express AppName -H -c less; cd AppName; npm install; pwd; ls;
if (Test-Path node_modules){write-host "node-modules found"} else {write-host "node-modules not found. did you do npm install as root?"}
more .\package.json;.\bin\www

//change the second line in bin\www appName:server to just app (matching the app.js in root dir)
//add a Gruntfile.js
New-Item Gruntfile.js -ItemType file

//start Node CLI repl
node --debug

//start an inspector
node-inspector
//open a chrome tab to the posted URL

//Run site in Debug
$env:DEBUG = "App:*"; npm start

//open chrome browser to url and port in bin/www settings
http://127.0.0.1:3000/

//run app
node ./bin/www
//or maybe
npm start

//or let's try creating a new app using hottowel
mkdir yoApp
cd yoApp
yo hottowel helloWorld
//another way to start an app
gulp serve-dev --sync

# uninstall commands
$NodeInstallPath  = cmd /c where node
$NpmInstallPath  = cmd /c where npm;if($NpmInstallPath.Count -gt 1){$NpmInstallPath = $NpmInstallPath[0]}


$AppData = Get-Childitem env:APPDATA | %{ $_.Value }
$userconfig       = npm config get userconfig
$globalconfig     = npm config get globalconfig
$globalignorefile = npm config get globalignorefile
$cache            = npm config get cache
$prefix           = npm config get prefix

if (test-path $globalconfig){
    write-host -foreground "cyan" "cleaning out $globalconfig"
    rmdir -force -recurse $globalconfig}
if (test-path $userconfig){
    write-host -foreground "cyan" "cleaning out $userconfig"
    rmdir -force -recurse $userconfig}
if (test-path $globalignorefile){
    write-host -foreground "cyan" "cleaning out $globalignorefile"
    rmdir -force -recurse $globalignorefile}
if (test-path $cache){
    write-host -foreground "cyan" "cleaning out $cache"
    rmdir -force -recurse $cache}
if (test-path $prefix){
    write-host -foreground "cyan" "cleaning out $prefix"
    rmdir -force -recurse $prefix}
if (test-path $NodeInstallPath){
    write-host -foreground "cyan" "cleaning out $NodeInstallPath"
    choco uninstall nodejs}

Monday, June 1, 2015

Actor Pattern Work Distribution Model

We determined that the combined implementation and support costs of Hadoop on RHEL were undesirable if they could be avoided. It was also clear to us that, in the time we were looking to implement, Hadoop on Windows had the following unappealing challenges:
  • it was a support problem for Microsoft internally, even with all the resources available to the Azure team
  • the online community was too limited to provide consistent consensus for making architectural choices
  • initial attempts to deploy to Windows laptops for local development resulted in frustration.
  • OS licensing costs for Windows blades lowered the value proposition of Hadoop.
I was given the assignment to research a framework to support the actor pattern as a way to build a lightweight, real-time analytics system on a Windows platform. We started looking at Akka.NET and trying to understand the difference an actor framework would make over managing thread pools and concurrency directly. Several team members had been personally interested in Akka.NET, which is the only reason we did not start with the Orleans project (which may have been a better fit for us, actually). We knew we really just needed a data abstraction layer like Spark (which is predicated on Akka in Scala) and a distribution framework (like Akka...).
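
To make the contrast with managing thread pools directly a bit more concrete, here is a bare-bones mailbox actor in Python. It is not Akka.NET or Orleans, just the minimal idea: a single-threaded entity you only talk to through messages, with the threading hidden behind the mailbox.

import queue
import threading


class Actor:
    """A minimal actor: one mailbox, one worker thread, no shared state exposed."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        self._mailbox.put(message)          # fire-and-forget send

    def stop(self):
        self._mailbox.put(None)             # poison pill
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:
                break
            self._handler(message)          # messages are processed one at a time


counts = {"words": 0}

def count_words(line):
    counts["words"] += len(line.split())

actor = Actor(count_words)
for line in ["the actor pattern", "hides thread management"]:
    actor.tell(line)
actor.stop()
print(counts["words"])   # 6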

As Akka.NET is a port to .NET, much of the documentation was written for Java or Scala. I found Typesafe had good documentation; Akka.io did as well. Some of the basic concepts of the actor pattern were available via Coursera or YouTube.


Basic steps to prove out (partially complete) a simple micro service:
  • starting a standalone HTTP server
  • handling simple file-based configuration
  • logging
  • routing
  • deconstructing requests
  • serializing JSON entities to class entities
  • deserializing class entities to JSON messages
  • error handling
  • issuing requests to external services
  • managing requests from external services
  • recovery of failed actors
  • queue and/or database persistence
  • integration testing with mocking of external services
  • operationalizing the code

Some basic architectural guidelines that were provided for Orleans but may apply to any actor model:
• Significant number of loosely coupled entities (hundreds to millions)
• Entities are small enough to be single threaded
• Workload is interactive: request/response, start/monitor/complete
• Need or may need to run on >1 server
• No need for global coordination, only between a few entities at a time
• Different entities used at different times

Problematic fit
• Entities need direct access to each other’s memory
• Small number of huge entities, multithreaded
• Global coordination/consistency needed
• Long running operations, batch jobs, SIMD

REFS:
http://research.microsoft.com/pubs/244727/Orleans%20Best%20Practices.pdf
https://www.typesafe.com/activator/template/akka-http-microservice
http://akka.io/

Wednesday, May 13, 2015

Python JSON Based Configuration

Given a JSON configuration file such as this:

{
    "queue_args":{
        "host"                   :"localhost",
        "port"                   :"15672",
        "virtual_host"           :"/",
        "channel_max"            :"None",   /* Int of AMQP channel_max value*/
        "frame_max"              :"None",   /* Int of AMQP frame_max value*/
        "heartbeat_interval"     :"None",   /* Int of AMQP heartbeat_interval*/
        "ssl"                    :"None",   /* Bool to enable ssl*/
        "ssl_options"            :"None",   /* Dict of ssl options. See https://www.rabbitmq.com/ssl.html*/
        "connection_attempts"    :"1000",   /* Int maximum number of retry attempts*/
        "retry_delay"            :"0.25",   /* Float time to wait in seconds, before the next.*/
        "socket_timeout"         :"None",   /* Int socket timeout (in seconds?) for high latency networks*/
        "locale"                 :"None",  
        "backpressure_detection" :"None",   /* Bool to toggle backpressure detection*/
        "login"                  :"guest",
        "password"               :"guest",
        "exchange"               :"",
        "exchange_type"          :"fanout"
    },
    "daemon_args":{
        "daemon"     : "False",             /* Bool to run as a daemon rather than as an immediate process*/
        "pidfile"    : "StreamMessage.pid", /* the daemon PID file (default: %default)*/
        "working-dir": ".",                 /* the directory to run the daemon in*/
        "uid"        : "os.getuid()",       /* the userid to run the daemon as (default: inherited from parent process)*/
        "gid"        : "os.getgid()",       /* the groupid to run the daemon as (default: inherited from parent process)*/
        "umask"      : "0022",              /* the umask for files created by the daemon (default: 0022)*/
        "stdout"     : "False",             /* sends standard output to the file STDOUT if set*/
        "stderr"     : "False"              /* sends standard error to the file STDERR if set*/   
    },   
    "spark_args":{
        "connection":  "main",              /* The name of the connection configuration to host */
        "connector":   "connection",        /* Override the connector implementation entry point */
        "class": "MessageReceiver",         /* The entry point for your application (e.g. org.apache.spark.examples.SparkPi)*/
        "master": "spark://192.168.56.101", /* The master URL for the cluster. TODO: determine correct port  (e.g. spark://23.195.26.187:7077)*/
        "deploy-mode": "client",            /* Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client)*/
        "conf": "None",                     /* Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).*/
        "application-jar": "None",          /* Path to jar including your application and all dependencies. Must be globally visible inside of your cluster, hdfs:// or file:// present on all nodes.*/
        "application-arguments": "None"     /* Arguments passed to the main method of your main class, if any*/
    }
}

You will need to minify it (to strip the comments), parse it, and unit test it.

from jsmin import jsmin
import json

json_file = 'c:\\Temp\\config.json'
with open(json_file) as raw_data:              # closes the file automatically
    mini_data = jsmin(raw_data.read())         # strip the /* */ comments so it is valid JSON
json_data = json.loads(mini_data)
print(json_data['queue_args']['exchange_type'])
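
And a small unit test sketch using the standard library's unittest; the path and the expected values are just the ones from the example config above.

import unittest
import json
from jsmin import jsmin

def load_config(path):
    """Minify (strip the /* */ comments) and parse a JSON config file."""
    with open(path) as f:
        return json.loads(jsmin(f.read()))

class ConfigTests(unittest.TestCase):
    def test_queue_args_exchange_type(self):
        config = load_config('c:\\Temp\\config.json')
        self.assertEqual(config['queue_args']['exchange_type'], 'fanout')

    def test_all_sections_present(self):
        config = load_config('c:\\Temp\\config.json')
        for section in ('queue_args', 'daemon_args', 'spark_args'):
            self.assertIn(section, config)

if __name__ == '__main__':
    unittest.main()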

Tuesday, May 12, 2015

SQL on Hadoop Part 1 - Hive

I ended up taking a wrong turn on the path to building Spark applications. So, to get a quick win on my project, we decided to delay the streaming and move to SQL and tabular storage of data for trend analysis.

In the Hadoop world you can use SQL with several engines against heterogeneous data sources. The easiest way to render data using ANSI 92 SQL is Hive, which is a database on HDFS plus a rendering engine. The Hive engine differs wildly from a SQL Server or Oracle database engine. In SQL Server, the query optimizer uses a cost-based approach to determine which physical operator (access to a tabular pointer/location table and column) will implement the logical operators (algebraic operations like UNION) in the DML statement. In Hive, the logical operators in the DML statement are rendered into map reduce jobs, each of which spins up a JVM, which is a costly and unintuitive process to try to performance tune.



One major difference, and a selling point for Hadoop, is the idea that schema is not applied to data as it is written to Hadoop. Instead, it is expected that you will serialize and deserialize data written to Hadoop as part of your process and that you will infer the schema on reading the data. This means there is no costly application of schema before you can start importing data, and you can restructure the same data several ways without reimporting and rewriting that data to different data marts. The downside is that for schema on read to be reusable, the deserialization libraries applying the schema must be a shared resource across the various data applications.

The Hive warehouse is the metadata store (Derby or MySQL) describing the Hive layout.

Databases in Hive are more of a namespace abstraction separating data stores. Hive supports the following storage objects:

HDFS files: distributed file system managed by Hadoop, allowing ubiquitous access across data nodes, redundancy, and durability. Differs from SAN and RAID type storage technologies in management, but not really in access or application of ACLs.

Databases: namespace for lower level data objects

Tables: columnar data storage just like any database
Partitions: Split a table based on the value of a column that determines where the data is physically stored. Partitioning tables will change how Hive structures the data storage on HDFS and will create subdirectories reflecting the structure of the partitioning. In SQL Server 2008 there was a limit of 1,000 partitions (Month 1, Month 2, ..., Month 1,000); later it was raised to 10,000 partitions. I am not sure what the limit is in Hadoop. Over-partitioning can create a large number of files and directories and add thrashing/processing overhead to the NameNode (which manages the file system locations in memory).

Buckets: A partition physically organizes data horizontally (by row) based on the range of values in the partitioning column to be used in the WHERE clause. A Hive table can be PARTITIONED BY (PostalCode STRING, StartedDate DateTime). Bucketing decomposes data sets based on a column as well. However, a column that is used as a bucketing column will be hashed into a user-defined number of buckets. This is coded as CLUSTERED BY (Col1) INTO 12 BUCKETS, and that will create 12 data groups per partition (say... PostalCode), hashed on Col1. The number of buckets does not change because of data volume (count or storage size). Bucketing is essential in performance tuning map-side joins and probably other operations.
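
Conceptually, the bucket assignment is just a hash of the bucketing column modulo the bucket count. Hive's actual hash function is different, so this Python sketch is only meant to show why the number of groups is fixed by the bucket count rather than by data volume:

BUCKETS = 12

def bucket_for(col1_value):
    """Rows with the same Col1 value always land in the same bucket."""
    return hash(col1_value) % BUCKETS

rows = ["cust-17", "cust-17", "cust-42", "cust-99"]
print([bucket_for(value) for value in rows])   # the two "cust-17" rows share a bucket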

Warnings on external tables: if you drop an object external to (not managed by) Hive, the metadata about it is removed from the Hive warehouse, but the data is not removed.

The SQL is not actually ANSI 92, even though some say it is. Hive SQL allows Java regex column specification, uses LIMIT in place of TOP, and allows a really wrong-looking SQL syntax alongside ANSI-compliant SQL. Both of these are legal:
A. SELECT Col1, Col2 AS Total FROM Table1 WHERE 1=1 LIMIT 100;
B. FROM Table1 SELECT Col1, Col2 AS Total WHERE 1=1;

The terminating semicolon is required, not optional.

Hive supports subqueries, but only nested in the FROM clause (not in the projected column list):

SELECT Tbl2.Col1, Tbl1.Col2
FROM (SELECT ColA + ColB AS Col1, FK
      FROM Table2) Tbl2
JOIN Table1 Tbl1 ON (Tbl1.Pk = Tbl2.FK);

Hive supports UNION ALL (return duplicates) but not UNION (return unique values).

Hive can access data in "Hive Managed Tables" on HDFS, or external tables in HBase or Cassandra. This may work with Mongo leveraging the Mongo-Hadoop Connector.

PRO TIP: Tables have custom/extended properties. These are not just descriptive, but can be leveraged by SerDes (serializer/deserializer) to determine data structure.

One of the biggest risks in Hive is the lazy evaluation of schema. You can define a table based on input of a deserialized HL7 OBX and insert a MISMO-based document into the same table. Neither Hive nor Hadoop will warn you of the clash. When you go to retrieve a record from the second document, an error will occur. So each app needs to be aware of potential issues caused by ETL not honoring the schema.
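
A rough Python illustration of that risk (the field names are made up): nothing complains at write time, and the mismatch only surfaces when a reader applies the expected schema.

import json

# Two records written to the same "table" (a JSON-lines file): one matches the
# expected layout, one does not. The write succeeds either way.
with open("observations.jsonl", "w") as f:
    f.write(json.dumps({"obx_id": 1, "value": "120/80"}) + "\n")       # HL7 OBX-ish record
    f.write(json.dumps({"loan_id": "A-7", "amount": 250000}) + "\n")   # MISMO-ish record

# Schema is only applied on read, so the clash shows up here, not at load time.
with open("observations.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["obx_id"], record["value"])   # raises KeyError on the second record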