Python for Big Data Analytics and Machine Learning 101

Oct 03 - 04, 2019

Hellenic American Education Center, Athens

Development on contemporary machine learning and cluster computing frameworks is geared towards Python. Even when Python is not a framework’s primary API, a Python binding is usually available. A notable example is Spark, whose primary API is Scala but which is most often used through its Python binding, PySpark.

Python is also the language most data scientists prefer for desktop data analysis. A rich set of ready-made ML algorithms, together with libraries that read from and write to large storage backends in various formats, makes the whole process of Exploratory Data Analysis (EDA) effective and easy.

This course is a 2-day hands-on lab on Python’s numpy, pandas, PySpark, matplotlib, seaborn and scikit-learn packages, the de facto standard toolset for data scientists. Along the way we’ll test our knowledge with exercises using real-life datasets from Kaggle and elsewhere.

An important takeaway for all participants is the set of Python notebooks used in the course, which will serve as a valuable reference for their future tasks.

Objectives

Upon course completion, the participants will know

  • the essential statements, constructs and idioms of Python, and how to develop and share their code using Jupyter notebooks.
  • the basics of the numpy and pandas libraries for querying in-memory tabular data.
  • how to visualize the outcomes of data analyses using matplotlib and seaborn.
  • how to process data on large clusters using PySpark.
  • how to set up and assess machine learning models with scikit-learn.

Who Should Attend

  • Software engineers who want to make a transition to data science practice.
  • Data scientists who want to learn about the Python data-analysis and machine learning toolset.
  • Business analysts who want to make an evolutionary leap to big data analytics.
  • Technical managers who evaluate technologies and human resources, or who shape big data strategies within their enterprise’s policies.

Prerequisites

  • Knowledge of installing and configuring computer software.
  • Understanding of computer programming concepts.
  • General knowledge of data formats and data transformations (filtering and reduction).
  • Knowledge of basic descriptive statistics is helpful but not mandatory.
  • A laptop running Ubuntu 16.04, Windows 10 or macOS, with at least 4 GB of RAM and 32 GB of disk storage.

Course Outline

  1. Development front-ends: Jupyter console, Jupyter notebook and qtconsole
    Using the command history
    Interacting with the OS
    The interactive debugger
  2. Python bootcamp
    Literals, expressions and statements
    Python containers, comprehensions and generator expressions
    Function objects, lambdas and closures
  3. Fast array calculations with the numpy package
    The ndarray object
    Universal functions
    Integer and boolean slicing
    Set logic
  4. Tabular data management with the pandas package
    Indexing, selection and filtering
    Function application and mapping
    Data filtering and reductions
    Handling missing data
    Hierarchical indexing
  5. Cluster computing with Spark and PySpark
    Installing and configuring Spark on its standalone cluster manager
    PySpark DataFrames and untyped operations
    Running SQL programmatically
    Schema objects and types
    Aggregations
  6. Plotting with the matplotlib and seaborn packages
    Matplotlib API primer
    Figures, subplots, axes, lines and markers
    Line and bar plots
    Histograms and density plots
    Scatter plots
  7. Introduction to scikit-learn
    Decision trees
    Random forests
    Gradient boosted trees
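
For a flavour of the "Python bootcamp" part of the outline, here is a minimal sketch of comprehensions, a generator expression, a closure and a lambda. It is not taken from the course notebooks: the readings list and the make_scaler helper are made-up names and values used only for illustration, and the code relies on nothing beyond the Python standard library.

```python
# Illustrative only: the readings below are made-up values, not course data.
readings = [12.5, 7.0, 19.25, 3.5, 15.0]

# List comprehension: build a new list of the readings above a threshold.
high = [r for r in readings if r > 10]

# Generator expression: sum the squares lazily, without an intermediate list.
sum_of_squares = sum(r * r for r in readings)

# Dict comprehension: map each position to its reading.
by_position = {i: r for i, r in enumerate(readings)}


def make_scaler(factor):
    """Return a closure that multiplies its argument by `factor`."""
    def scale(value):
        return value * factor
    return scale


# The returned function remembers factor=2 from its enclosing scope.
double = make_scaler(2)
doubled = list(map(double, high))

# Lambda as a sort key: order the readings by their distance from 10.
closest_first = sorted(readings, key=lambda r: abs(r - 10))

print(high)
print(sum_of_squares)
print(doubled)
print(closest_first)
```

The same filter-and-reduce pattern reappears later in the outline in vectorised form, as boolean slicing in numpy and as filtering and reductions in pandas.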

 

Location

Hellenic American Education Center, Massalias 22, Athens 106 80