Introduction:

Pentaho is a business intelligence suite that serves an organization as a strong ETL tool: it helps extract data from various sources such as data warehouses, the cloud, and other data stores. The data extracted in this way can be transformed and cleansed as per the requirement. In addition, Pentaho provides OLAP services, reporting, information dashboards, and data mining capabilities, and it has since evolved into a platform for predictive analytics and Artificial Intelligence. This article, however, is a basic introduction to Pentaho with hands-on examples of creating transformations and jobs.

Pentaho comes in two editions which are as follows:

  1. Community Edition

The community edition of the software is freely available and can be downloaded from https://community.hitachivantara.com/docs/DOC-1009931-downoads.

  2. Enterprise Edition

The Enterprise edition of the software can be bought from its vendor, Hitachi Vantara, and incurs a licensing cost.

Installation:

It is relatively easy to install the software on the Windows operating system: after downloading the setup file, one can follow the instructions in the setup wizard to install the software on the local machine. On Linux, after navigating to the downloaded file in a terminal, one can run command 1 (if the system has the graphical GTK libraries) or command 2 (if the system does not, in which case the installer runs in text-only console mode).

command 1: ./pentaho-business-analytics-8.2.0-x64.bin --mode xwindow

command 2: ./pentaho-business-analytics-8.2.0-x64.bin --mode text

Information on manual installation of the software can be found at https://help.pentaho.com/Documentation/8.1/Setup/Installation.

Data Integration (DI) Server (Server Application) of Pentaho:

The DI server in Pentaho executes jobs and transformations using the PDI (Pentaho Data Integration) engine. A transformation is a network of logical tasks called steps; it is essentially a data flow. A job is a workflow-like model for coordinating resources, execution, and dependencies of ETL activities. The PDI suite includes the following components:

  1. Spoon – the standalone GUI design tool used to develop all jobs and transformations.
  2. Kitchen – a command-line tool to run jobs (see the example commands after this list).
  3. Pan – a command-line tool to run just the transformations.
  4. Carte – a lightweight remote ETL server.
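
For illustration, a transformation or job saved on disk can be run from a terminal in the PDI installation directory. The file paths below are hypothetical; -file and -level are standard Pan/Kitchen options:

./pan.sh -file=/home/user/read_csv.ktr -level=Basic

./kitchen.sh -file=/home/user/first_job.kjb -level=Basic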


Data Integration Using Spoon:

Spoon is the data integration design tool of Pentaho. The picture below depicts what Spoon looks like. A new transformation can be created via File –> New –> Transformation. Pentaho transformations have .ktr as an extension.

Figure 1: Spoon Data Integration tool

In the figure above, the left panel under the “Design” tab consists of components such as Input, Output, and Streaming; these components provide several options that can be used in a transformation. The “View” tab shows the attributes of the current transformation.

Figure 2: Sub-components of the options available on the left panel of Spoon

In the figure above, various sub-components of the Input component, including Text file input, CSV file input, Data grid, OLAP, etc., are shown. These sub-components can be dragged to the central region called the “canvas” (which shows “Drag & Drop” in Figure 1) to begin creating a transformation. For instance, to create a transformation that reads a CSV file from a certain location, the CSV file input sub-component can be dragged from the left panel onto the canvas. The location and the CSV file to be read as input can be specified by double-clicking the component, as shown in Figure 3.

Figure 3: CSV file Input and options available
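
As a rough sketch, the options in this window for a simple load (all values hypothetical) might be filled in as follows:

Filename: /home/user/sales.csv
Delimiter: ,
Enclosure: "
Header row present: checked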

The data from the CSV file can be written to a text file; the Output component in the left panel provides this facility. The transformation that performs this functionality is given in the figure below, wherein the stream of data read from an input CSV file is written to an output text file. The flow of data in the transformation is indicated by an arrow that connects the CSV file input step to the Text file output step. This pathway, which connects the steps together and allows schema metadata to pass from one step to another, is called a hop.

Figure 4: Transformation to read a CSV file and write its content to a text file

Job using Spoon:

A job can be created in Spoon following the same steps used to create transformations: a new job is created via File –> New –> Job. The left panel consists of components with specific functionality; once a component is expanded, its sub-components can be dragged and dropped onto the canvas. A Spoon job has .kjb as an extension.
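
Incidentally, both .ktr and .kjb files are plain XML documents, so they can be inspected in a text editor and kept under version control. A heavily abridged sketch of a transformation file (element names follow the Kettle file format; the ellipsis stands for omitted detail) looks roughly like this:

<transformation>
  <info>
    <name>read_csv</name>
  </info>
  <step>
    <name>CSV file input</name>
    <type>CsvInput</type>
    ...
  </step>
  <order>
    <hop><from>CSV file input</from><to>Text file output</to><enabled>Y</enabled></hop>
  </order>
</transformation>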

Figure 5: A Spoon Job

In Figure 5, a job is created that starts with the “START” entry and then executes the transformation that was created in the “Data Integration Using Spoon” section. The “Transformation” entry in the job above provides options to specify the transformation that has to be executed: once the entry is double-clicked, a small window pops up (as in Figure 3) wherein the transformation to be executed and other options related to it can be specified. The execution of the job completes with the “Success” entry, which indicates the successful execution of the job.

The log and metrics, such as execution time, the number of read and write operations, and errors in each entry of a job as well as a transformation, can be viewed in the “Execution Results” panel, as shown in the figure below.

Figure 6: A Job along with Log in Spoon

Creating a Database Connection (MS SQL) in Spoon:

Pentaho allows ETL operations on the tables of several DBMSs such as MS SQL Server, Oracle, HBase, etc. In order to perform any transactions on these databases, a connection has to be created first. A new connection can be created via File –> New –> Database Connection. The following screen shows up while creating a database connection.

Figure 7: Creating Database connection in Spoon

After filling in all the credentials, such as Host Name, Database Name, Port Number, Username, and Password, a connection name has to be specified. The connection can be tested by clicking on the “Test” button. The connection thus created can then be used to fetch data from the database.
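
For example, a connection to a local SQL Server instance might be configured with values like the following (all hypothetical; 1433 is the default SQL Server port):

Connection Name: local_mssql
Connection Type: MS SQL Server
Host Name: localhost
Database Name: SalesDB
Port Number: 1433
Username: etl_user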


Figure 8: Reading the data from a MS SQL table and creating a copy of the table

The figure above is an example of a transformation that reads data from a SQL table in the “Table input” entry. It then adds a sequence number to each row received from the table in the “Add sequence” entry; these rows are also passed on to a dummy output so that the output of the previous entry can be viewed in the “Logging” tab of the execution panel, as described for Figure 6. After that, the rows are filtered in the “Sample rows” entry based on the sequence: the even rows pass on to “Dummy (do nothing)” whereas the odd rows pass on to “Dummy (do nothing) 2”. Finally, a table is created in the “Table output” entry with all the even rows received from its predecessor. It is to be noted that by double-clicking the “Table input” entry, one can write a SQL query, as in the figure below.

Figure 9: SQL query in the Table input entry
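
For instance, a query of the following form (table and column names are hypothetical) could be entered in the Table input step so that only the needed columns flow downstream:

SELECT id, customer_name, amount
FROM dbo.Sales
WHERE amount > 0;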

Scripting in Pentaho Transformation:

Pentaho supports writing JavaScript, Java, and SQL scripts within transformations; however, not all features of these scripting languages are available. The entries that allow writing JavaScript, Java, and SQL scripts can be found under the “Scripting” component in the left panel. Some examples of such scripts are given below.

Figure 10: JavaScript code in Pentaho Transformation

Generally, JavaScript code is used in transformations either to set a value for a variable or to perform certain mathematical operations on the variables.
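
As a minimal sketch (the field names are hypothetical), a “Modified Java Script Value” step might derive new fields from the incoming row as follows; each variable declared in the script can then be exposed as an output field via the step’s Fields grid:

// Compute a line total from two incoming numeric fields
var line_total = unit_price * quantity;

// Flag large orders with a simple conditional
var is_large_order = (line_total > 1000) ? "Y" : "N";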

A snippet of Java that sets a session cookie from a response header received during a REST call is as follows:

Figure 11: Java code to extract cookie from a response header in REST call
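
At its core, such logic is plain string handling. A simplified, self-contained sketch (the header value is hardcoded for illustration; no Pentaho-specific API is used) might look like this:

public class CookieExtractor {
    // Extracts the "name=value" pair from a Set-Cookie response header,
    // dropping attributes such as Path and HttpOnly.
    static String extractSessionCookie(String setCookieHeader) {
        if (setCookieHeader == null) return null;
        int end = setCookieHeader.indexOf(';');
        return end >= 0 ? setCookieHeader.substring(0, end) : setCookieHeader;
    }

    public static void main(String[] args) {
        String header = "JSESSIONID=ABC123; Path=/; HttpOnly"; // hypothetical header
        System.out.println(extractSessionCookie(header));      // prints JSESSIONID=ABC123
    }
}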

Conclusion:

Although resources for Pentaho may be scarce, it is a very useful ETL tool that provides a GUI requiring little manual coding to perform data extraction and cleansing tasks. In addition, it has several free libraries and plug-ins, and the community that uses it is expanding rapidly. Since its community edition is open-access software and the hardware requirements to run the Pentaho server are minimal compared to other such business intelligence software, its value for money is very high. Lastly, the ability to integrate Pentaho with Big Data services such as Hadoop and cloud computing services like Azure and AWS makes this software highly scalable.

