Python still stands strong as a top tool for Data Engineers. I recently reviewed 21 Data Engineer job postings and found that 17 (81%) of the jobs listed Python as a requirement.
Here are the three highest-rated books on Amazon that feature Python for Data Engineering. All three appear on Amazon's Data Modeling & Design list.
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
Number four on Amazon's Data Modeling & Design list with an impressive customer review score of 4.6 out of 5.
If you are looking to focus on data preparation, transformation and creating datasets for analytics, then this book has a lot to offer. It is written by Wes McKinney, the creator of pandas, the open-source Python package for data analysis.
NumPy, matplotlib, IPython (the interactive shell that underpins Jupyter notebooks) and SciPy are also featured.
Data Pipelines Pocket Reference: Moving and Processing Data for Analytics
Number eight on Amazon's Data Modeling & Design list with a similarly impressive customer review score of 4.5 out of 5.
This is a very different book from the first one on the list: its core focus is building pipelines that extract, transform and load (ETL) data. The code samples are written in Python and SQL.
Technologies used in the examples also include MySQL, PostgreSQL, MongoDB, REST APIs, AWS S3, Amazon Redshift, Snowflake, Kafka and Apache Airflow.
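To give a feel for what an ETL pipeline in Python and SQL looks like, here is a minimal sketch. It is not taken from the book: the sample data, the orders table and the function names are all made up for illustration, and it uses only Python's standard library (csv and sqlite3) rather than any of the technologies listed above.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a file, an API or a database.
RAW_CSV = "order_id,customer,amount\n1,alice,20.00\n2,bob,\n3,alice,5.50\n"


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast the column types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


# Run the three steps against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Once loaded, the data can be queried with plain SQL.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping extract, transform and load as separate functions makes each step easy to test on its own, and easy to swap out later, for example replacing SQLite with PostgreSQL or scheduling the steps with a tool such as Apache Airflow.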
Data Engineering with Python: Work with massive datasets to design data models and automate data pipelines using Python
Number fourteen on Amazon's Data Modeling & Design list with a customer review score of 4.1 out of 5.
Like the Data Pipelines Pocket Reference, this book teaches readers how to build data pipelines and perform ETL with Python. Chapter 2 is dedicated to building out the infrastructure for data engineering.
Technologies used in the examples include Python, Apache NiFi, Apache Airflow, Elasticsearch, Kibana, PostgreSQL, Kafka, and Spark.
A New Approach for Data Engineering
Why not try Data Engineering without needing to extract and load data, and without needing to write SQL or Python code?
With Promethium, we use a CTP (Connect, Transform, Publish) approach instead of ETL.
Data doesn't need to be staged before it is transformed, and it doesn't need to be loaded to make it available for analytics.