Python for Big Data Analytics and Machine Learning 101

Nov 30 - Dec 02, 2017

Hellenic American Education Center, Athens

Duration: 16 hours

Learn the fundamentals of data science with Python, whether you work on your laptop or on a big data cluster, using numpy, pandas and PySpark. Communicate your analyses with informative graphics from matplotlib and seaborn. Configure, train and assess machine learning models with scikit-learn.

Overview

Development of contemporary machine learning and cluster computing frameworks is geared towards Python. Even when Python is not a framework's primary API, there is usually a Python binding. A notable example is Spark, whose primary API is Scala but which is widely used through its Python binding, PySpark.

Python is also the language most data scientists prefer for desktop data analysis. It offers a rich set of ready-made ML algorithms, along with libraries for pulling data from, and pushing data to, large storage backends in different formats, which makes the whole process of Exploratory Data Analysis (EDA) effective and easy.

This course is a 3-day hands-on lab on Python's numpy, pandas, PySpark, matplotlib, seaborn and scikit-learn packages, the de facto standard toolset for data scientists. Along the way we'll test our knowledge with exercises using real-life datasets from Kaggle and elsewhere.

Objectives

Upon course completion, the participants will know

  • the essential statements, constructs and idioms of Python, and how to develop and share their code using Jupyter notebooks.
  • the basics of the numpy and pandas libraries for querying in-memory tabular data.
  • how to visualize the outcomes of data analyses using matplotlib and seaborn.
  • how to process data on large clusters using PySpark.
  • how to set up and assess machine learning models with scikit-learn.

Who should attend

  1. Software engineers who want to make a transition to data science practice.
  2. Data scientists who want to learn about the Python data-analysis and machine learning toolset.
  3. Business analysts who want to move into big data analytics.
  4. Technical managers involved in the evaluation of technologies and human resource skills related to analytics and big data.

Prerequisites

  1. Knowledge of installing and configuring computer software.
  2. Understanding of computer programming concepts.
  3. General knowledge of data formats and data transformations (filtering and reduction).
  4. Basic knowledge of Python and of basic descriptive statistics is helpful but not mandatory.
  5. A laptop running Ubuntu 16.04 or Windows 10, with at least 8 GB of RAM and 12 GB of free disk space.

Course Outline

  1. Development front-ends: Jupyter console, Jupyter notebook and qtconsole
    Using the command history
    Interacting with the OS
    The interactive debugger
  2. Python bootcamp
    Literals, expressions and statements
    Python containers, comprehensions and generator expressions
    Function objects, lambdas and closures
  3. Fast array calculations with the numpy package
    The ndarray object
    Universal functions
    Integer and boolean indexing
    Set logic
  4. Tabular data management with the pandas package
    Indexing, selection and filtering
    Function application and mapping
    Data filtering and reductions
    Handling missing data
    Hierarchical indexing
  5. Cluster computing with Spark and PySpark
    Installing and configuring Spark in standalone cluster mode
    DataFrames and untyped operations
    Running SQL programmatically
    Schema objects and types
    Aggregations
  6. Plotting with the matplotlib and seaborn packages
    Matplotlib API primer
    Figures, subplots, axes, lines and markers
    Line and bar plots
    Histograms and density plots
    Scatter plots
  7. Introduction to scikit-learn
    Decision trees
    Random forests
    Gradient boosted trees

Location

Hellenic American Education Center, Massalias 22, Athens 106 80