Processing big data in real time is challenging because of scalability, data inconsistency, and fault-tolerance requirements. Big Data Analysis with Python teaches you how to use tools that help you manage this data avalanche. In this course, you’ll learn practical techniques to aggregate data into useful dimensions for later analysis, extract statistical measurements, and transform datasets into features for other systems.

The course begins with an introduction to data manipulation in Python using pandas. You’ll then get familiar with statistical analysis and plotting techniques. Through multiple hands-on activities, you’ll learn to analyze data that is distributed across several computers by using Dask. As you progress, you’ll study how to aggregate data for plots when the entire dataset does not fit in memory. You’ll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The course also covers Spark and its interaction with other tools.

By the end of this course, you’ll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.
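
As a rough illustration of aggregating data that does not fit in memory, the sketch below reads rows one at a time and keeps only running totals. It uses only the Python standard library rather than pandas or Dask, and the sample data, column names, and the `streaming_mean_by_key` helper are hypothetical, chosen for illustration only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a file too large to load at once.
SAMPLE_CSV = """region,amount
north,10.0
south,4.5
north,2.5
south,3.0
east,7.0
"""


def streaming_mean_by_key(lines, key_field, value_field):
    """Compute a per-key mean in one pass, holding only running totals in memory."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row[key_field]] += float(row[value_field])
        counts[row[key_field]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Any iterable of text lines works here, including an open file handle.
means = streaming_mean_by_key(io.StringIO(SAMPLE_CSV), "region", "amount")
print(means)
```

Because only one row and the per-key totals are held at any time, memory use depends on the number of distinct keys rather than on the size of the file.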

LEARNING OUTCOMES

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics using data on disk
  • Work with computing tasks distributed over a cluster
  • Convert data from various sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals

Big Data Analysis with Python

Course Code

GTDBDAP

Duration

2 Days

Course Fee

POA

Accreditation

N/A

Target Audience

  • Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help in understanding various concepts explained in this course.

Course Outline

Lesson 1: The Python Data Science Stack

  • Python Libraries and Packages
  • Using Pandas
  • Data Type Conversion
  • Aggregation and Grouping
  • Exporting Data from Pandas
  • Visualization with Pandas
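
The Lesson 1 topics of data type conversion, aggregation and grouping, and exporting can be sketched in a few lines of pandas. This is a minimal example under stated assumptions, not course material: the DataFrame contents and column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical records; the column names are illustrative only.
raw = pd.DataFrame({
    "city": ["Dublin", "Cork", "Dublin", "Cork"],
    "visits": ["12", "7", "8", "3"],  # numbers stored as strings
    "date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
})

# Data type conversion: strings to integers and to timestamps.
raw["visits"] = raw["visits"].astype(int)
raw["date"] = pd.to_datetime(raw["date"])

# Aggregation and grouping: total and mean visits per city.
summary = raw.groupby("city")["visits"].agg(["sum", "mean"])

# Exporting: write the summary as CSV to an in-memory buffer.
# Passing a file path to to_csv instead would write to disk.
buffer = io.StringIO()
summary.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```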

Lesson 2: Statistical Visualizations

  • Types of Graphs and When to Use Them
  • Components of a Graph
  • Which Tool Should Be Used?
  • Types of Graphs
  • Pandas DataFrames and Grouped Data
  • Changing Plot Design: Modifying Graph Components
  • Exporting Graphs

Lesson 3: Working with Big Data Frameworks

  • Hadoop
  • Spark
  • Writing Parquet Files
  • Handling Unstructured Data

Lesson 4: Diving Deeper with Spark

  • Getting Started with Spark DataFrames
  • Writing Output from Spark DataFrames
  • Exploring Spark DataFrames
  • Data Manipulation with Spark DataFrames
  • Graphs in Spark

Lesson 5: Handling Missing Values and Correlation Analysis

  • Setting up the Jupyter Notebook
  • Missing Values
  • Handling Missing Values in Spark DataFrames
  • Correlation
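
The ideas in Lesson 5 can be illustrated on a small scale. The sketch below uses pandas rather than Spark DataFrames, and the data and column names are hypothetical. It counts missing values, shows two common ways to handle them (dropping rows and filling with the column mean), and computes a Pearson correlation on the complete rows.

```python
import pandas as pd

# Hypothetical measurements with one missing value in each column.
df = pd.DataFrame({
    "hours": [1.0, 2.0, None, 4.0, 5.0],
    "score": [2.0, 4.0, 6.0, None, 10.0],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each missing value with the mean of its column.
filled = df.fillna(df.mean())

# Pearson correlation between the two columns, using complete rows only.
corr = dropped["hours"].corr(dropped["score"])
print(missing_per_column)
print(corr)
```

Which option is appropriate depends on the data: dropping rows loses information, while mean filling can dampen the variance and affect the correlation.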

Lesson 6: Exploratory Data Analysis

  • Defining a Business Problem
  • Translating a Business Problem into Measurable Metrics and Exploratory Data Analysis (EDA)
  • Structured Approach to the Data Science Project Life Cycle

Lesson 7: Reproducibility in Big Data Analysis

  • Reproducibility with Jupyter Notebooks
  • Gathering Data in a Reproducible Way
  • Code Practices and Standards
  • Avoiding Repetition

Lesson 8: Creating a Full Analysis Report

  • Reading Data in Spark from Different Data Sources
  • SQL Operations on a Spark DataFrame
  • Generating Statistical Measurements
Ways to Attend
  • Attend a public course, if one is available. Please check our schedule, or register your interest in joining a course in your area.
  • Private onsite team training is also available; please contact us to discuss. We can customise this course to suit your business requirements.

Private Team Training is available for this course

We deliver this course either on-site or off-site in various regions around the world, and can customise your delivery to suit your exact business needs. Talk to us about how we can fine-tune a course to suit your team's current skillset and ultimate learning objectives.

Technical ICT learning & mentoring services

Private Team Training

Our instructors are specialist consultants with extensive real-world experience and expertise, which allows them to design and deliver client-focused courses for your organisation.

About GuruTeam

GuruTeam is a high-level ICT learning, mentoring, and consultancy services company. We specialise in delivering instructor-led on-site and off-site training in Blockchain, Linux, Cloud, Big Data, DevOps, Kubernetes, Agile, and Software & Web Development technologies.
