Python for Big Data Analytics and Machine Learning 101

Nov 30 - Dec 02, 2017

Hellenic American Education Center, Athens

Duration: 16 hours

Learn the fundamentals of data science with Python, whether working on your laptop or a big data cluster, using numpy, pandas and PySpark. Communicate your analyses using informative graphics from matplotlib and seaborn. Configure, train and assess machine learning models with scikit-learn.

Overview

Development of contemporary machine learning and cluster computing frameworks is geared towards Python. Even when Python is not a framework’s primary API, there is almost always a Python binding. A notable example is Spark, whose primary API is in Scala but which is most often used through its Python binding.

Python is also the language most data scientists prefer for desktop data analysis. It offers a rich set of ready-made ML algorithms, along with libraries for pulling data from and pushing data to large storage backends in different formats, making the whole process of Exploratory Data Analysis (EDA) effective and easy.

This course is a 3-day hands-on lab on Python’s numpy, pandas, PySpark, matplotlib, seaborn and scikit-learn packages, the de facto standard toolset for data scientists. Along the way we’ll test our knowledge with exercises using real-life datasets from Kaggle and elsewhere.

Objectives

Upon course completion, the participants will know:

the essential statements, constructs and idioms of Python and how to develop and share their code using Jupyter notebooks.
the basics of numpy and pandas libraries for querying in-memory tabular data.
how to visualize the outcomes of data analyses using matplotlib and seaborn.
They will also learn how to process data on large clusters using PySpark and how to set up and assess machine learning models with scikit-learn.

Who should attend

Software engineers who want to make a transition to data science practice.
Data scientists who want to learn about the Python data-analysis and machine learning toolset.
Business analysts who want to take an evolutionary leap to big data analytics.
Technical managers involved in the evaluation of technologies and human resource skills related to analytics and big data.

Prerequisites

Knowledge of installing and configuring computer software.
Understanding of computer programming concepts.
General knowledge of data formats and data transformations (filtering and reduction).
Basic knowledge of Python and of descriptive statistics is helpful but not mandatory.
A laptop running Ubuntu 16.04 or Windows 10, with at least 8GB of RAM and 12GB of free disk storage.

Course Outline

Development front-ends: Jupyter console, Jupyter notebook and qtconsole
    Using the command history
    Interacting with the OS
    The interactive debugger
Python bootcamp
    Literals, expressions and statements
    Python containers, comprehensions and generator expressions
    Function objects, lambdas and closures
Fast array calculations with the numpy package
    The ndarray object
    Universal functions
    Integer and boolean slicing
    Set logic
Tabular data management with the pandas package
    Indexing, selection and filtering
    Function application and mapping
    Data filtering and reductions
    Handling missing data
    Hierarchical indexing
Cluster computing with Spark and PySpark
    Installing and configuring Spark in standalone cluster mode
    DataFrames and untyped operations
    Running SQL programmatically
    Schema objects and types
    Aggregations
Plotting with the matplotlib and seaborn packages
    Matplotlib API primer
    Figures, subplots, axes, lines and markers
    Line and bar plots
    Histograms and density plots
    Scatter plots
Introduction to scikit-learn
    Decision trees
    Random forests
    Gradient boosted trees

Event Speakers

Christos Malliopoulos

Location

Hellenic American Education Center, Massalias 22, Athens 106 80