This Data Science for Solution Architects training course helps Solution Architects and other IT practitioners understand the value proposition, methodology and techniques of the emerging discipline of Data Science. The class also introduces students to several production-ready technologies and capabilities that enable enterprises to build cost-efficient Big Data processing solutions.

[Delivery options: Web Age data training classes are delivered in a traditional classroom-style format. Online data training classes are also available in a synchronous, instructor-led format.]


Objectives:

This intensive big data training course covers the theoretical and technical aspects of Data Science and Business Analytics. It presents fundamental and advanced concepts and methods for deriving business insights from raw data using cost-effective data processing solutions. Hands-on labs supplement the course and help attendees reinforce their understanding of the material.


Topics:

  •     Applied Data Science and Business Analytics
  •     Algorithms, Techniques and Common Analytical Methods
  •     NoSQL and Big Data Systems Overview
  •     MapReduce
  •     Big Data Business Intelligence and Analytics
  •     Visualizing and Reporting Processed Results
  •     Data Analysis with R
  •     Hadoop Programming Ecosystem

Data Science for Solution Architects

Course Code: GTBD8

Duration: 4 Days

Course Fee: POA

Accreditation: N/A

Target Audience

  • Enterprise Architects, Solution Architects, Information Technology Architects, Business Analysts, Senior Developers, and Team Leads

Attendee Requirements

  • Participants should have a general knowledge of statistics and programming.

Course Outline

Chapter 1. Applied Data Science

  •     What is Data Science?
  •     Data Science Ecosystem
  •     Data Mining vs. Data Science
  •     Business Analytics vs. Data Science
  •     Who is a Data Scientist?
  •     Data Science Skill Sets Venn Diagram
  •     Data Scientists at Work
  •     Examples of Data Science Projects
  •     An Example of a Data Product
  •     Applied Data Science at Google
  •     Data Science Gotchas

Chapter 2. Data Analytics Life-cycle Phases

  •     Big Data Analytics Pipeline
  •     Data Discovery Phase
  •     Data Harvesting Phase
  •     Data Priming Phase
  •     Exploratory Data Analysis
  •     Model Planning Phase
  •     Model Building Phase
  •     Communicating the Results
  •     Production Roll-out

Chapter 3. Getting Started with R

  •     Introduction
  •     Positioning of R in the Data Science Arena
  •     R Integrated Development Environments
  •     Running R
  •     Running RStudio
  •     Ending the Current R Session
  •     Getting Help
  •     Getting System Information
  •     General Notes on R Commands and Statements
  •     R Data Structures
  •     R Objects and Workspace
  •     Assignment Operators
  •     Assignment Example
  •     Arithmetic Operators
  •     Logical Operators
  •     System Date and Time
  •     Operations
  •     User-defined Functions
  •     User-defined Function Example
  •     R Code Example
  •     Type Conversion (Coercion)
  •     Control Statements
  •     Conditional Execution
  •     Repetitive Execution
  •     Built-in Functions
  •     Reading Data from Files into Vectors
  •     Example of Reading Data from a File
  •     Writing Data to a File
  •     Example of Writing Data to a File
  •     Logical Vectors
  •     Character Vectors
  •     Matrix Data Structure
  •     Creating Matrices
  •     Working with Data Frames
  •     Matrices vs Data Frames
  •     A Data Frame Sample
  •     Accessing Data Cells
  •     Getting Info About a Data Frame
  •     Selecting Columns in Data Frames
  •     Selecting Rows in Data Frames
  •     Getting a Subset of a Data Frame
  •     Sorting (ordering) Data in Data Frames by Attribute(s)
  •     Applying Functions to Matrices and Data Frames
  •     Using the apply() Function
  •     Example of Using apply()
  •     Executing External R commands
  •     Loading External Scripts in RStudio
  •     Listing Objects in Workspace
  •     Removing Objects in Workspace
  •     Saving Your Workspace in R
  •     Saving Your Workspace in RStudio
  •     Saving Your Workspace in R GUI
  •     Loading Your Workspace
  •     Hands-on Exercises
  •     Getting and Setting the Working Directory
  •     Getting the List of Files in a Directory
  •     Diverting Output to a File
  •     Batch (Unattended) Processing
  •     Importing Data into R
  •     Exporting Data from R
  •     Hands-on Exercise
  •     Standard R Packages
  •     Extending R
  •     Extending R in R GUI
  •     Extending R in RStudio
  •     CRAN Page

Chapter 4. R Statistical Computing Features

  •     Statistical Computing Features
  •     Descriptive Statistics
  •     Basic Statistical Functions
  •     Examples of Using Basic Statistical Functions
  •     Using the summary() Function
  •     Math Functions Used in Data Analysis
  •     Examples of Using Math Functions
  •     Correlations
  •     Correlation Example

Chapter 5. Data Science Algorithms and Analytical Methods

  •     Supervised vs Unsupervised Machine Learning
  •     Supervised Machine Learning Algorithms
  •     Unsupervised Machine Learning Algorithms
  •     Choose the Right Algorithm
  •     Life-cycles of Machine Learning Development
  •     Classifying with k-Nearest Neighbors (SL)
  •     k-Nearest Neighbors Algorithm
  •     The Error Rate
  •     Hands-on Exercise
  •     Decision Trees (SL)
  •     Using Decision Trees
  •     Random Forests
  •     Naive Bayes Classifier (SL)
  •     Classification of Documents with Naive Bayes
  •     Unsupervised Learning Type: Clustering
  •     K-Means Clustering (UL)
  •     K-Means Clustering in a Nutshell
  •     Regression Analysis
  •     Types of Regression
  •     Simple Linear Regression Model
  •     Linear Regression Illustration
  •     Least-Squares Method (LSM)
  •     LSM Assumptions
  •     Fitting Linear Regression Models in R
  •     Example of Using R's lm() Function
  •     Example of Using lm() with a Data Frame
  •     Regression Models in Excel
  •     Hands-on Exercise
  •     Logistic Regression
  •     Regression vs Classification
  •     Time-Series Analysis
  •     Decomposing Time-Series

Chapter 6. Text Mining

  •     What is Text Mining?
  •     The Common Text Mining Tasks
  •     What is Natural Language Processing (NLP)?
  •     Some of the NLP Use Cases
  •     Machine Learning in Text Mining and NLP
  •     Machine Learning in NLP
  •     TF-IDF
  •     The Feature Hashing Trick
  •     Stemming
  •     Example of Stemming
  •     Stop Words
  •     Popular Text Mining and NLP Libraries and Packages

Chapter 7. What is NoSQL?

  •     Limitations of Relational Databases
  •     Limitations of Relational Databases (Cont'd)
  •     Defining NoSQL
  •     What are NoSQL (Not Only SQL) Databases?
  •     The Past and Present of the NoSQL World
  •     NoSQL Database Properties
  •     NoSQL Benefits
  •     NoSQL Database Storage Types
  •     The CAP Theorem
  •     NoSQL Systems CAP Triangle
  •     Mechanisms to Guarantee a Single CAP Property
  •     Limitations of NoSQL Databases
  •     Big Data Sharding
  •     Sharding Example
  •     Quiz
  •     Quiz Answers

Chapter 8. MapReduce Overview

  •     The Client – Server Processing Pattern
  •     Distributed Computing Challenges
  •     MapReduce Defined
  •     Google's MapReduce
  •     The Map Phase of MapReduce
  •     The Reduce Phase of MapReduce
  •     MapReduce Explained
  •     MapReduce Word Count Job
  •     MapReduce Shared-Nothing Architecture
  •     Similarity with SQL Aggregation Operations
  •     Example of Map & Reduce Operations using JavaScript
  •     Problems Suitable for Solving with MapReduce
  •     Typical MapReduce Jobs
  •     Fault-tolerance of MapReduce
  •     Distributed Computing Economics
  •     MapReduce Systems

Chapter 9. Hadoop Overview

  •     Apache Hadoop
  •     Apache Hadoop Logo
  •     Typical Hadoop Applications
  •     Hadoop Clusters
  •     Hadoop Design Principles
  •     Hadoop Versions
  •     Hadoop's Main Components
  •     Hadoop Simple Definition
  •     Side-by-Side Comparison: Hadoop 1 and Hadoop 2
  •     Hadoop-based Systems for Data Analysis
  •     Other Hadoop Ecosystem Projects
  •     Hadoop Caveats
  •     Hadoop Distributions
  •     Cloudera Distribution of Hadoop (CDH)
  •     Cloudera Distributions
  •     Hortonworks Data Platform (HDP)
  •     MapR

Chapter 10. Hadoop Distributed File System Overview

  •     Hadoop Distributed File System (HDFS)
  •     HDFS High Availability
  •     HDFS "Fine Print"
  •     Storing Raw Data in HDFS
  •     Hadoop Security
  •     HDFS Rack-awareness
  •     Data Blocks
  •     Data Block Replication Example
  •     HDFS NameNode Directory Diagram
  •     Accessing HDFS
  •     Examples of HDFS Commands
  •     Other Supported File Systems
  •     WebHDFS
  •     Examples of WebHDFS Calls
  •     Client Interactions with HDFS for the Read Operation
  •     Read Operation Sequence Diagram
  •     Client Interactions with HDFS for the Write Operation
  •     Communication inside HDFS

Chapter 11. MapReduce with Hadoop

  •     Hadoop's MapReduce
  •     MapReduce 1 and MapReduce 2
  •     Why Do I Need a Discussion of the Old MapReduce?
  •     MapReduce v1 ("Classic MapReduce")
  •     JobTracker and TaskTracker (the "Classic MapReduce")
  •     YARN (MapReduce v2)
  •     YARN vs MR1
  •     YARN As Data Operating System
  •     MapReduce Programming Options
  •     Hadoop's Streaming MapReduce
  •     Python Word Count Mapper Program Example
  •     Python Word Count Reducer Program Example
  •     Setting up Java Classpath for Streaming Support
  •     Streaming Use Cases
  •     The Streaming API vs Java MapReduce API
  •     Amazon Elastic MapReduce
  •     Apache Tez

Chapter 12. Apache Pig Scripting Platform

  •     What is Pig?
  •     Pig Latin
  •     Apache Pig Logo
  •     Pig Execution Modes
  •     Local Execution Mode
  •     MapReduce Execution Mode
  •     Running Pig
  •     Running Pig in Batch Mode
  •     What is Grunt?
  •     Pig Latin Statements
  •     Pig Programs
  •     Pig Latin Script Example
  •     SQL Equivalent
  •     Differences between Pig and SQL
  •     Statement Processing in Pig
  •     Comments in Pig
  •     Supported Simple Data Types
  •     Supported Complex Data Types
  •     Arrays
  •     Defining Relation's Schema
  •     Not Matching the Defined Schema
  •     The bytearray Generic Type
  •     Using Field Delimiters
  •     Loading Data with TextLoader()
  •     Referencing Fields in Relations

Chapter 13. Apache Pig Relational and Eval Operators

  •     Pig Relational Operators
  •     Example of Using the JOIN Operator
  •     Example of Using the Order By Operator
  •     Caveats of Using Relational Operators
  •     Pig Eval Functions
  •     Caveats of Using Eval Functions (Operators)
  •     Example of Using Single-column Eval Operations
  •     Example of Using Eval Operators For Global Operations

Chapter 14. Hive

  •     What is Hive?
  •     Apache Hive Logo
  •     Hive's Value Proposition
  •     Who uses Hive?
  •     Hive's Main Sub-Systems
  •     Hive Features
  •     The "Classic" Hive Architecture
  •     The New Hive Architecture
  •     HiveQL
  •     Where are the Hive Tables Located?
  •     Hive Command-line Interface (CLI)
  •     The Beeline Command Shell

Chapter 15. Hive Command-line Interface

  •     Hive Command-line Interface (CLI)
  •     The Hive Interactive Shell
  •     Running Host OS Commands from the Hive Shell
  •     Interfacing with HDFS from the Hive Shell
  •     The Hive in Unattended Mode
  •     The Hive CLI Integration with the OS Shell
  •     Executing HiveQL Scripts
  •     Comments in Hive Scripts
  •     Variables and Properties in Hive CLI
  •     Setting Properties in CLI
  •     Example of Setting Properties in CLI
  •     Hive Namespaces
  •     Using the SET Command
  •     Setting Properties in the Shell
  •     Setting Properties for the New Shell Session
  •     Setting Alternative Hive Execution Engines
  •     The Beeline Shell
  •     Connecting to the Hive Server in Beeline
  •     Beeline Command Switches
  •     Beeline Internal Commands

Chapter 16. Hive Data Definition Language

  •     Hive Data Definition Language
  •     Creating Databases in Hive
  •     Using Databases
  •     Creating Tables in Hive
  •     Supported Data Type Categories
  •     Common Numeric Types
  •     String and Date / Time Types
  •     Miscellaneous Types
  •     Example of the CREATE TABLE Statement
  •     Working with Complex Types
  •     Table Partitioning
  •     Table Partitioning on Multiple Columns
  •     Viewing Table Partitions
  •     Row Format
  •     Data Serializers / Deserializers
  •     File Format Storage
  •     File Compression
  •     More on File Formats
  •     The ORC Data Format
  •     Converting Text to ORC Data Format
  •     The EXTERNAL DDL Parameter
  •     Example of Using EXTERNAL
  •     Creating an Empty Table
  •     Dropping a Table
  •     Table / Partition(s) Truncation
  •     Alter Table/Partition/Column
  •     Views
  •     Create View Statement
  •     Why Use Views?
  •     Restricting Amount of Viewable Data
  •     Examples of Restricting Amount of Viewable Data
  •     Creating and Dropping Indexes
  •     Describing Data

Chapter 17. Apache Sqoop

  •     What is Sqoop?
  •     Apache Sqoop Logo
  •     Sqoop Import / Export
  •     Sqoop Help
  •     Examples of Using Sqoop Commands
  •     Data Import Example
  •     Fine-tuning Data Import
  •     Controlling the Number of Import Processes
  •     Data Splitting
  •     Helping Sqoop Out
  •     Example of Executing Sqoop Load in Parallel
  •     A Word of Caution: Avoid Complex Free-Form Queries
  •     Using Direct Export from Databases
  •     Example of Using Direct Export from MySQL
  •     More on Direct Mode Import
  •     Data Export from HDFS
  •     Export Tool Common Arguments
  •     Data Export Control Arguments
  •     Data Export Example
  •     INSERT and UPDATE Statements
  •     INSERT Operations
  •     UPDATE Operations
  •     Example of the Update Operation
  •     Failed Exports
  •     Sqoop2

Chapter 18. Introduction to Functional Programming

  •     What is Functional Programming (FP)?
  •     Terminology: Higher-Order Functions
  •     Terminology: Lambda vs Closure
  •     A Short List of Languages that Support FP
  •     FP with Java
  •     FP With JavaScript
  •     Imperative Programming in JavaScript
  •     The JavaScript map (FP) Example
  •     The JavaScript reduce (FP) Example
  •     Using reduce to Flatten an Array of Arrays (FP) Example
  •     The JavaScript filter (FP) Example
  •     Common Higher-Order Functions in Python
  •     Common Higher-Order Functions in Scala
  •     Elements of FP in R

Chapter 19. Introduction to Apache Spark

  •     What is Apache Spark?
  •     A Short History of Spark
  •     Where to Get Spark?
  •     The Spark Platform
  •     Spark Logo
  •     Common Spark Use Cases
  •     Languages Supported by Spark
  •     Running Spark on a Cluster
  •     The Driver Process
  •     Spark Applications
  •     Spark Shell
  •     The spark-submit Tool
  •     The spark-submit Tool Configuration
  •     The Executor and Worker Processes
  •     The Spark Application Architecture
  •     Interfaces with Data Storage Systems
  •     Limitations of Hadoop's MapReduce
  •     Spark vs MapReduce
  •     Spark as an Alternative to Apache Tez
  •     The Resilient Distributed Dataset (RDD)
  •     Spark Streaming (Micro-batching)
  •     Spark SQL
  •     Example of Spark SQL
  •     Spark Machine Learning Library
  •     GraphX
  •     Spark vs R

Chapter 20. The Spark Shell

  •     The Spark Shell
  •     The Spark Shell UI
  •     Spark Shell Options
  •     Getting Help
  •     The Spark Context (sc) and SQL Context (sqlContext)
  •     The Shell Spark Context
  •     Loading Files
  •     Saving Files
  •     Basic Spark ETL Operations

Chapter 21. Spark RDDs

  •     The Resilient Distributed Dataset (RDD)
  •     Ways to Create an RDD
  •     Custom RDDs
  •     Supported Data Types
  •     RDD Operations
  •     RDDs are Immutable
  •     Spark Actions
  •     RDD Transformations
  •     Other RDD Operations
  •     Chaining RDD Operations
  •     RDD Lineage
  •     The Big Picture
  •     What May Go Wrong
  •     Checkpointing RDDs
  •     Local Checkpointing
  •     Parallelized Collections
  •     More on parallelize() Method
  •     The Pair RDD
  •     Where do I use Pair RDDs?
  •     Example of Creating a Pair RDD with Map
  •     Example of Creating a Pair RDD with keyBy
  •     Miscellaneous Pair RDD Operations
  •     RDD Caching
  •     RDD Persistence
  •     The Tachyon Storage

Chapter 22. Parallel Data Processing with Spark

  •     Running Spark on a Cluster
  •     Spark Stand-alone Option
  •     The High-Level Execution Flow in Stand-alone Spark Cluster
  •     Data Partitioning
  •     Data Partitioning Diagram
  •     Single Local File System RDD Partitioning
  •     Multiple File RDD Partitioning
  •     Special Cases for Small-sized Files
  •     Parallel Data Processing of Partitions
  •     Spark Application, Jobs, and Tasks
  •     Stages and Shuffles
  •     The "Big Picture"

Chapter 23. The Spark Machine Learning Library

  •     What is MLlib?
  •     Supported Languages
  •     MLlib Packages
  •     Dense and Sparse Vectors
  •     Labeled Point
  •     Python Example of Using the LabeledPoint Class
  •     LIBSVM format
  •     An Example of a LIBSVM File
  •     Loading LIBSVM Files
  •     Local Matrices
  •     Example of Creating Matrices in MLlib
  •     Distributed Matrices
  •     Example of Using a Distributed Matrix
  •     Classification and Regression Algorithms
  •     Clustering

Lab Exercises

    Lab 1. Getting Started with R
    Lab 2. Working with R
    Lab 3. Data Import and Export in R
    Lab 4. k-Nearest Neighbors Algorithm
    Lab 5. Simple Linear Regression
    Lab 6. Common Text Mining Tasks with the tm Library
    Lab 7. Learning the Lab Environment
    Lab 8. The Hadoop Distributed File System
    Lab 9. Getting Started with Apache Pig
    Lab 10. Working with Data Sets in Apache Pig
    Lab 11. The Hive and Beeline Shells
    Lab 12. Hive Data Definition Language
    Lab 13. The Spark Shell
    Lab 14. Spark ETL and HDFS Interface
    Lab 15. Using k-means Algorithm from MLlib

Learning Path
  • There are no specific requirements for this course. Please contact us to discuss your suitability.
Ways to Attend
  • Attend a public course, if there is one available. Please check our schedule, or register your interest in joining a course in your area.
  • Private onsite team training is also available; please contact us to discuss. We can customise this course to suit your business requirements.

