Python for Big Data Analytics and Machine Learning 101
Duration: 16 hours
Learn the fundamentals of data science with Python, whether working on your laptop or a big data cluster, using numpy, pandas and PySpark. Communicate your analyses using informative graphics from matplotlib and seaborn. Configure, train and assess machine learning models with scikit-learn.
Overview
Development of contemporary machine learning and cluster computing frameworks is geared towards Python. Even when Python is not a framework’s primary API, a Python binding is almost always available. A notable example is Spark, whose primary API is Scala but which is most often used through its Python binding, PySpark.
Python is also the language most data scientists prefer for desktop data analysis. It offers a rich set of ready-made ML algorithms, along with libraries that read from and write to large storage backends in a variety of formats, which makes Exploratory Data Analysis (EDA) efficient and straightforward.
This course is a 3-day hands-on lab on Python’s numpy, pandas, PySpark, matplotlib, seaborn and scikit-learn packages, the de facto standard toolset for data scientists. Along the way we’ll test our knowledge with exercises using real-life datasets from Kaggle and elsewhere.
Objectives
Upon completing the course, participants will know
- the essential statements, constructs and idioms of Python, and how to develop and share their code using Jupyter notebooks.
- the basics of the numpy and pandas libraries for querying in-memory tabular data.
- how to visualize the outcomes of data analyses using matplotlib and seaborn.
- how to process data on large clusters using PySpark, and how to set up and assess machine learning models with scikit-learn.
Who should attend
- Software engineers who want to make a transition to data science practice.
- Data scientists who want to learn about the Python data-analysis and machine learning toolset.
- Business Analysts who want to take an evolutionary leap to big data analytics.
- Technical managers involved in the evaluation of technologies and human resource skills related to analytics and big data.
Prerequisites
- Knowledge of installing and configuring computer software.
- Understanding of computer programming concepts.
- General knowledge of data formats and data transformations (filtering and reduction).
- Basic knowledge of Python and basic descriptive statistics is helpful but not mandatory.
- A laptop with Ubuntu 16.04 or Windows 10, at least 8 GB of RAM and 12 GB of free disk storage.
Course Outline
- Development front-ends: Jupyter console, Jupyter notebook and qtconsole
Using the command history
Interacting with the OS
The interactive debugger
- Python bootcamp
Literals, expressions and statements
Python containers, comprehensions and generator expressions
Function objects, lambdas and closures
- Fast array calculations with the numpy package
The ndarray object
Universal functions
Integer and boolean slicing
Set logic
- Tabular data management with the pandas package
Indexing, selection and filtering
Function application and mapping
Data filtering and reductions
Handling missing data
Hierarchical indexing
- Cluster computing with Spark and PySpark
Installing and configuring Spark on its standalone cluster manager
Dataframes and untyped operations
Running SQL programmatically
Schema objects and types
Aggregations
- Plotting with the matplotlib and seaborn packages
Matplotlib API primer
Figures, subplots, axes, lines and markers
Line and bar plots
Histograms and density plots
Scatter plots
- Introduction to scikit-learn
Decision trees
Random forests
Gradient boosted trees
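As a taste of how the pandas and scikit-learn topics in the outline fit together, here is a minimal sketch rather than actual course material. It assumes a small synthetic dataset generated on the spot; the column names `x1`, `x2` and `label`, and all model settings, are illustrative choices only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not a course dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=400),
    "x2": rng.normal(size=400),
})
# The label is 1 when the two features sum to a positive number.
df["label"] = (df["x1"] + df["x2"] > 0).astype(int)

# pandas: boolean filtering followed by a reduction.
mean_x1_positive = df.loc[df["label"] == 1, "x1"].mean()

# Hold out a quarter of the rows to assess the models.
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["label"], test_size=0.25, random_state=0
)

# The three tree-based model families listed in the outline.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Every estimator exposes the same `fit` and `score` methods, which is why the three models can be trained and compared in a single loop.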