Python for Big Data Analytics and Machine Learning 101
Development on contemporary machine learning and cluster computing frameworks is geared towards Python. Even when Python is not a framework’s primary API, a Python binding is almost always available. A notable example is Spark, whose primary API is Scala but which is most often used through its Python binding, PySpark.
Python is also the language most data scientists prefer for desktop data analysis. It offers a rich set of ready-made ML algorithms, along with libraries for pulling data from and pushing data to large storage backends in different formats, which makes the whole process of Exploratory Data Analysis (EDA) effective and easy.
This course is a 2-day hands-on lab on Python’s numpy, pandas, PySpark, matplotlib, seaborn and scikit-learn packages, the de facto standard toolset for data scientists. Along the way we’ll test our knowledge with exercises using real-life datasets from Kaggle and elsewhere.
An important takeaway for all participants is the set of Python notebooks used in the course, which will serve as a valuable reference for their future tasks.
Objectives
Upon course completion, the participants will know:
- the essential statements, constructs and idioms of Python, and how to develop and share their code using Jupyter notebooks.
- the basics of the numpy and pandas libraries for querying in-memory tabular data.
- how to visualize the outcomes of data analyses using matplotlib and seaborn.
- how to process data on large clusters using PySpark.
- how to set up and assess machine learning models with scikit-learn.
Who Should Attend
- Software engineers who want to make a transition to data science practice.
- Data scientists who want to learn about the Python data-analysis and machine learning toolset.
- Business analysts who want to make an evolutionary leap to big data analytics.
- Technical managers involved in evaluating technologies and human resources, or in shaping big data strategies within the framework of related enterprise policies.
Prerequisites
- Knowledge of installing and configuring computer software.
- Understanding of computer programming concepts.
- General knowledge of data formats and data transformations (filtering and reduction).
- Knowledge of basic descriptive statistics is helpful but not mandatory.
- A laptop running Ubuntu 16.04, Windows 10 or macOS, with at least 4 GB of RAM and 32 GB of disk storage.
Course Outline
- Development front-ends: Jupyter console, Jupyter notebook and qtconsole
  - Using the command history
  - Interacting with the OS
  - The interactive debugger
- Python bootcamp
  - Literals, expressions and statements
  - Python containers, comprehensions and generator expressions
  - Function objects, lambdas and closures
- Fast array calculations with the numpy package
  - The ndarray object
  - Universal functions
  - Integer and boolean slicing
  - Set logic
- Tabular data management with the pandas package
  - Indexing, selection and filtering
  - Function application and mapping
  - Data filtering and reductions
  - Handling missing data
  - Hierarchical indexing
- Cluster computing with Spark and PySpark
  - Installing and configuring Spark in standalone cluster mode
  - PySpark DataFrames and untyped operations
  - Running SQL queries programmatically
  - Schema objects and types
  - Aggregations
- Plotting with the matplotlib and seaborn packages
  - Matplotlib API primer
  - Figures, subplots, axes, lines and markers
  - Line and bar plots
  - Histograms and density plots
  - Scatter plots
- Introduction to scikit-learn
  - Decision trees
  - Random forests
  - Gradient boosted trees
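To give a flavor of the Python bootcamp part of the outline above, here is a minimal sketch of comprehensions, a generator expression, a closure and a lambda. It is not taken from the course notebooks; the data and the names `readings` and `make_scaler` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: `readings` and `make_scaler` are hypothetical names.

readings = [3.5, 7.25, 1.0, 9.5, 4.0]

# List comprehension: build a new list from an existing container.
above_four = [r for r in readings if r > 4]

# Dict comprehension: map each position to its reading.
by_index = {i: r for i, r in enumerate(readings)}

# Generator expression: values are produced lazily, one at a time,
# so no intermediate list is built before summing.
total = sum(r for r in readings)


def make_scaler(factor):
    """Return a closure that remembers `factor` after make_scaler returns."""
    def scale(value):
        return value * factor
    return scale


double = make_scaler(2)

# Lambda as a sort key: order the readings by their distance from 5.
closest_to_five = sorted(readings, key=lambda r: abs(r - 5))

print(above_four)
print(total)
print(double(readings[0]))
print(closest_to_five)
```

The same snippet can be pasted into a Jupyter notebook cell or run as a plain script; it uses only the standard library.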