October 3rd, 2012

Joining Multiple Tables, Grouping and Evaluating Totals

PowerShell Team
PowerShell Team

Joining Multiple Tables, Grouping And Evaluating Totals

This is a continuation to the Join-Object blog post considering multiple joins and SQL style grouping and total evaluation.

Problem:

In the Join-Object blog we’ve seen that relational data naturally organizes in multiple tables. Once we have those tables, we frequently want to see the information back together, so we need something like Join-Object. Sometimes we also need to group the information and evaluate totals inside the groups. For instance, what are the total sales for each customer?

How can we use Join-Object in multiple tables and group information from the tables to report totals?

Joining multiple tables:

Consider the following tables describing sales at a fruit stand:

Customers

Id

Name

Phone

1

John Smith

(206)555-8899

2

Mark Stone

(425)999-8811

3

Ruth Moore

(425)888-3355

 

Products

Id

Name

Price

1

Apple

10

2

Orange

12

3

Kiwi

15

 

Orders

Id

Date

CustomerId

1

6/6/2012 9:12:44 AM

1

2

6/7/2012 6:12:10 PM

1

3

6/5/2012 12:05:03 PM

2

 

OrderItems

Id

Quantity

ProductId

OrderId

1

2

3

1

2

1

2

1

3

4

5

1

4

5

1

2

5

4

2

2

6

1

1

3

 

As we saw in the Join-Object blog, we need this many tables to deal with the simple conceptual problem of selling fruit!

One might question the need for the OrderItems table. An order item is a line in the receipt for an order that lists a product and its quantity. Since there can be many items per order, it needs its own table as mentioned in the Join-Object blog.

As we can see from the data, we have 3 customers, 3 products, 3 orders, and 6 order items. The three first order items are for the Order with Id 1.

Good data design forces us to split the tables, but after this is done, we frequently need to see the data together.

For example, suppose I want a report with all basic receipt information, including customer name, order date, all products bought, their quantity, and price. Here is the SQL statement that produces the report:

SELECT     Customer.Name, [Order].Date, OrderItem.Quantity,

           Product.Name AS [Product Name], Product.Price

FROM       OrderItem INNER JOIN Product ON OrderItem.ProductId = Product.id INNER JOIN

           Customer INNER JOIN  [Order] ON Customer.id = [Order].CustomerId ON OrderItem.OrderId = [Order].id

 

Here is the result:

 

Joining All Tables

Name

Date

Quantity

Product Name

Price

John Smith

6/6/2012 9:12:44 AM

2

Apple

10

John Smith

6/6/2012 9:12:44 AM

1

Orange

12

John Smith

6/6/2012 9:12:44 AM

4

Kiwi

5

John Smith

6/7/2012 6:12:10 PM

5

Apple

10

John Smith

6/7/2012 6:12:10 PM

4

Orange

12

Mark Stone

6/5/2012 12:00:00 AM

1

Apple

10

 

A little explanation of the SQL statement

The  “AS [Product Name]” segment sets the column name in the output. Both product name and customer name are called “Name” in their own tables, so we need to differentiate them in the output.

The brackets around “[Order]” differentiate it from the SQL “Order” keyword.

The Inner Join shows only the items present in both tables.

All 4 tables are “inner joined” in the statement.

 

Putting the data in Windows PowerShell

From here on, we will use Windows PowerShell to manipulate this data, so let’s start by putting this data in PowerShell.

The following script creates four CSV files and then imports them into four variables that contain our data.

@”

Id,Name,Phone

1,John Smith,(206)555-8899

2,Mark Stone,(425)999-8811

3,Ruth Moore,(425)888-3355

“@ > c:\temp\customers.csv

 

$customers=import-csv c:\temp\customers.csv

 

@”

Id,Name,Price

1,Apple,10

2,Orange,12

3,Kiwi,15

“@ > c:\temp\products.csv

 

$products=import-csv c:\temp\products.csv

 

@”

Id,Date,CustomerId

1,6/6/2012 9:12:44 AM,1

2,6/7/2012 6:12:10 PM,1

3,6/5/2012 12:05:03 PM,2

“@ > c:\temp\orders.csv

 

$orders=import-csv c:\temp\orders.csv

 

@”

Id,Quantity,ProductId,OrderId

1,2,1,1

2,1,2,1

3,4,3,1

4,5,1,2

5,4,2,2

6,1,1,3

“@ > c:\temp\orderItems.csv

 

$orderItems=import-csv c:\temp\orderItems.csv

 

 

“Customers “;$customers | format-table -AutoSize

“`r`nProducts “;$products | format-table -AutoSize

“`r`nOrders “;$orders | format-table -AutoSize

“`r`nOrder Items “;$orderItems | format-table -AutoSize

 

Here is the output:

Customers

 

Id Name        Phone

— —-        —–

1  John Smith  (206)555-8899

2  Mark Stone  (425)999-8811

3  Ruth Moore  (425)888-3355

 

 

 

Products

 

Id Name    Price

— —-    —–

1  Apple   10

2  Orange  12

3  Kiwi    15

 

 

 

Orders

 

Id Date      CustomerId

— —-      ———-

1  6/6/2012  9:12:44 AM 1

2  6/7/2012  6:12:10 PM 1

3  6/5/2012  12:05:03 PM 2

 

 

 

Order Items

 

Id Quantity ProductId OrderId

— ——– ——— ——-

1  2        1         1

2  1        2         1

3  4        3         1

4  5        1         2

5  4        2         2

6  1        1         3

 

 

The equivalent joins in PowerShell with the help of Join-Object are as follows:

 

# Joins the customer to his orders and places Name(from Customer),

# Date(from Order) and Id (from Order)

$customerOrders=Join-Object -Left $customers -Right $orders -LeftProperties Name -RightProperties Date,Id -Where {$args[0].Id -eq $args[1].CustomerId}

 

# Adds the OrderItems to the data already gathered. The output objects will

# have Name(from Customer), Date(from Order), Quantity(from OrderItem)

# and ProductId (from OrderItem).

$customerOrderItems=Join-Object -Left $customerOrders -Right $orderItems -LeftProperties Name,Date -RightProperties Quantity,ProductId -Where {$args[0].Id -eq $args[1].OrderId}

 

# Adds the Products to the data already gathered. The output objects will

# have Name (from Customer), Date(from Order), Quantity (from OrderItem)

# Product Name (from Product) and Price from Product

# We use a Hashtable instead of simply “Name” for Product Name because

# “Name” is already present in the output object for the customer name,

# so we need to differentiate the Product name from that.

# If you did not know the syntax for the Hashtable Expression, you will

# be interested in the fact it also works for Select-Object properties

$customerProducts=Join-Object -Left $customerOrderItems -Right $products -LeftProperties Name,Date,Quantity -RightProperties @{Name=“Product Name”;Expression={$_.Name}},Price -Where {$args[0].ProductId -eq $args[1].Id}

 

 

$customerProducts | format-table

 

Here is the output:

Name       Date                 Quantity Product Name Price

—-       —-                 ——– ———— —–

John Smith 6/6/2012 9:12:44 AM  2        Apple        10  

John Smith 6/6/2012 9:12:44 AM  1        Orange       12  

John Smith 6/6/2012 9:12:44 AM  4        Kiwi         15  

John Smith 6/7/2012 6:12:10 PM  5        Apple        10  

John Smith 6/7/2012 6:12:10 PM  4        Orange       12  

Mark Stone 6/5/2012 12:05:03 PM 1        Apple        10  

 

 

So, the following three lines of code …

$customerOrders=Join-Object -Left $customers -Right $orders -LeftProperties Name -RightProperties Date,Id -Where {$args[0].Id -eq $args[1].CustomerId}

 

$customerOrderItems=Join-Object -Left $customerOrders -Right $orderItems -LeftProperties Name,Date -RightProperties Quantity,ProductId -Where {$args[0].Id -eq $args[1].OrderId}

 

$customerProducts=Join-Object -Left $customerOrderItems -Right $products -LeftProperties Name,Date,Quantity -RightProperties @{Name=“Product Name”;Expression={$_.Name}},Price -Where {$args[0].ProductId -eq $args[1].Id}

 

… do the same job as the SQL statement:

 

SELECT   Customer.Name, [Order].Date, OrderItem.Quantity,

         Product.Name AS [Product Name], Product.Price

FROM     OrderItem INNER JOIN Product ON OrderItem.ProductId = Product.id INNER JOIN

         Customer INNER JOIN  [Order] ON Customer.id = [Order].CustomerId ON OrderItem.OrderId = [Order].id

 

There are some advantages to the SQL statement:

·         It is a single statement that is somewhat shorter than the three statements  in PowerShell.

·         It has a better syntax for aliasing a property in the output (“as ProductName” as opposed to “@{Name=“Product Name”;Expression={$_.Name}}”).

 

In my opinion, those are small advantages. Each one of the PowerShell join statements is easy to understand in isolation. The hashtable syntax used in the third command (to change a property name) is a bit long, but very common to PowerShell users, because  it is useful for other cmdlets including Select-Object.

I considered changing Join-Object to take all the lists and Where clauses in a single cmdlet call, but the need to pair each of two lists with a Where clause (and a Join type) makes the command a bit too cryptic for my taste. I prefer three statements that are easy to understand to one that is not. Performance-wise, there is no difference, because each Join has to be considered in isolation.

Now that the data is joined, what about getting the total money spent by each customer, the total money spent in each product and the total money spent?

Total Money Spent by Each Customer

Let’s start with the whole command:

 

$customerProducts | Group-Object name | `

    Select-Object Name,@{Name=‘Total’;

                         Expression={$_.Group | ForEach-Object `

                           -Begin {$total=0;} `

                           -Process {$total+=[int]$_.Price*$_.Quantity} `

                           -End {$total}}} | format-table  

 

 

It produces this output:

Name       Sum

—-       —

John Smith 190

Mark Stone  10

 

 

The first line:

$customerProducts | Group-Object name

Produces two groups:

Count Name                      Group                         

—– —-                      —–                                                                                         

    5 John Smith                {@{Name=John Smith; Date=6/6/2012 9:12:44 AM;…

    1 Mark Stone                {@{Name=Mark Stone; Date=6/5/2012 12:05:03 PM;…  

 

In order to better see what is in each group, we run:

($customerProducts | Group-Object name)[0].Group

 

Which produces:

Name        : John Smith

Date        : 6/6/2012 9:12:44 AM

Quantity    : 2

ProductName : Apple

Price       : 10

 

Name        : John Smith

Date        : 6/6/2012 9:12:44 AM

Quantity    : 1

ProductName : Orange

Price       : 12

 

Name        : John Smith

Date        : 6/6/2012 9:12:44 AM

Quantity    : 4

ProductName : Kiwi

Price       : 15

 

Name        : John Smith

Date        : 6/7/2012 6:12:10 PM

Quantity    : 5

ProductName : Apple

Price       : 10

 

Name        : John Smith

Date        : 6/7/2012 6:12:10 PM

Quantity    : 4

ProductName : Orange

Price       : 12

 

And:

($customerProducts | Group-Object name)[1].Group

 

Which produces:

Name        : Mark Stone

Date        : 6/5/2012 12:05:03 PM

Quantity    : 1

ProductName : Apple

Price       : 10

 

So, inside the “Group” property of each object in the output of group-object is the Price and Quantity we want to multiply in order to get a total. If we wanted just the total quantity of fruit we could do:

$customerProducts | Group-Object name | `

    Select-Object Name, `

    @{Name=‘Total’;Expression={($_.Group | Measure-Object Quantity -Sum).Sum}} `

    | format-table 

 

With output:

Name       Total

—-       —–

John Smith    16

Mark Stone     1

 

Name is the first column in the output. The second column is given by:

@{Name=‘Total’; Expression={($_.Group | Measure-Object Quantity -Sum).Sum}}

 

This hashtable is a syntax for select object which means a column with a Name = Total and a value which is the Expression. The Expression gets the value of the Group property and pipes to Measure-Object to sum the ‘Quantity’ property of each member of the Group. Finally the Sum property of the Measure-Object result is retrieved. That property contains the actual sum.

Since we want to get Price * Quantity and measure-object only supports plain properties we end up with the full command repeated here:

$customerProducts | Group-Object name | `

    Select-Object Name,@{Name=‘Total’;

                         Expression={$_.Group | ForEach-Object `

                           -Begin {$total=0;} `

                           -Process {$total+=[int]$_.Price*$_.Quantity} `

                           -End {$total}}} | format-table  

 

Instead of | measure-object we have a | ForEach-Object where we use –Begin, -Process and –End to evaluate the total. This could be easily modified to calculate anything else, including averages, standard deviations, etc.

 

Note: The [int] cast in front of $_.Price is required because the price is a string out of our CSV file. This could be avoided by using Export-CliXml and Import-CliXml for our data, whioch would preserve the data type.

 

Here is the corresponding SQL statement:

SELECT   Customer.Name, SUM(OrderItem.Quantity * Product.Price) AS SUM

FROM     OrderItem INNER JOIN

            Product ON OrderItem.ProductId = Product.id INNER JOIN

            Customer INNER JOIN

            [Order] ON Customer.id = [Order].CustomerId ON OrderItem.OrderId = [Order].id

GROUP BY Customer.Name

 

The SQL statement combines the joining, grouping and total evaluation(Price * Quantity) operations.

The Powershell statement is only doing the grouping and total evaluation, since the $customerProducts variable has the result of the joins performed in the previous section of this blog post.

The bold segments of the SQL statement are performing the grouping and the PowerShell statement.

SQL is designed specifically for this kind of data, so it provides a shorter syntax overall, but I believe the PowerShell equivalent is simple enough to understand.

Total Money Spent in Each Product

The only thing that changes from the previous section is that the property we’re passing to Group-Object is “ProductName”:

$customerProducts | Group-Object ProductName | `

    Select-Object Name,@{Name=‘Total’;

                         Expression={$_.Group | ForEach-Object `

                           -Begin {$total=0;} `

                           -Process {$total+=[int]$_.Price*$_.Quantity} `

                           -End {$total}}} | format-table  

 

Name   Sum

—-   —

Apple   80

Orange  60

Kiwi    60

 

The corresponding SQL is:

SELECT  Product.Name, SUM(OrderItem.Quantity * Product.Price) AS Sum

FROM    OrderItem INNER JOIN

        Product ON OrderItem.ProductId = Product.id

GROUP BY Product.Name

 

Total Money Spent

Here is the command that produces this total:

$customerProducts | ForEach-Object `

                           -Begin {$total=0;} `

                           -Process {$total+=[int]$_.Price*$_.Quantity} `

                           -End {$total}

 

 

Here is the corresponding SQL:

SELECT  SUM(OrderItem.Quantity * Product.Price) AS Sum

FROM    OrderItem INNER JOIN

        Product ON OrderItem.ProductId = Product.id

 

 

Lucio Silveira[MSFT]

Category
PowerShell

Author

PowerShell Team
PowerShell Team

PowerShell is a task-based command-line shell and scripting language built on .NET. PowerShell helps system administrators and power-users rapidly automate tasks that manage operating systems (Linux, macOS, and Windows) and processes.

0 comments

Discussion are closed.