Data Engineering and Computer Science
The role of data engineering is to ensure an uninterrupted flow of data between servers and applications
Resources
- https://github.com/ossu/computer-science
- What is Data Engineering and Why Is It So Important?
- ETL (extract, transform, load)
- Have we bridged the gap between Data Science and DevOps?
- Codelabs
- Google Developers Codelabs provide a guided, tutorial, hands-on coding experience. Most codelabs will step you through the process of building a small application, or adding a new feature to an existing application
Python
See AI/Data Engineering/Python
Julia
Javascript
- https://www.w3schools.com/js/
- https://codesandbox.io
- https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics
- https://dtabio.gitbooks.io/data-science-with-javascript/content/links_and_resources.html
- http://www.kdnuggets.com/2016/06/top-machine-learning-libraries-javascript.html
Bash
CUDA
- https://developer.nvidia.com/cuda-education
- https://dragan.rocks/articles/18/Interactive-GPU-Programming-1-Hello-CUDA
Books
See AI/Data Engineering/Python#Books
- #BOOK Mining of Massive Datasets (Leskovec, 2014 CAMBRIDGE)
- #BOOK Advanced Analytics with Spark (Ryza, 2017 OREILLY)
- #BOOK The Big Book of Data Engineering (Databricks)
R
- #BOOK R para profesionales de los datos: una introducción
- #BOOK Geocomputation with R
- #BOOK Efficient R programming
- #BOOK Engineering Production-Grade Shiny Apps
- #BOOK Advanced R
- #BOOK Hands-On Programming with R
- #BOOK R Packages (Wickham 2020)
Courses
- See AI/Data Engineering/Python#Courses
- #COURSE Intro to Hadoop and MapReduce
- #COURSE Mining Massive Data Sets (CS246 Stanford)
- #COURSE Getting and Cleaning Data (Coursera)
- SQL:
- Tutorial and exercises
- SQL (basic, intermediate, advanced / pet problems):
Code
- See AI/Data Engineering/ML Ops
- #CODE ABSL.flags - Defines a distributed command-line flag system that replaces manual argument parsing
- #CODE Memray - Memray is a memory profiler for Python
- #CODE mmap.ninja - Memory mapped numpy arrays of varying shapes
- You can use mmap_ninja with any training framework (such as TensorFlow, PyTorch, MXNet), as it stores your dataset as a memory-mapped numpy array
- A memory-mapped file is a file that is physically present on disk in a way that the correlation between the file and the memory space permits applications to treat the mapped portions as if they were primary memory, allowing very fast I/O
- #CODE Polars - Fast multi-threaded, hybrid-out-of-core DataFrame library in Rust | Python | Node.js
- #CODE Pandas AI/Data Engineering/Pandas
- #CODE Modin - Scale your pandas workflows by changing one line of code
- #CODE Xarray AI/Data Engineering/Xarray
- #CODE Dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution
- #CODE PyTables
- #CODE H5py
- #CODE Singer - Simple, Composable Open Source ETL
- #CODE Docker
- #CODE Kubernetes - K8s is an open-source system for automating deployment, scaling, and management of containerized applications.
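The memory-mapped file idea mentioned under mmap.ninja above can be illustrated with Python's standard `mmap` module. This is only a minimal sketch of the general mechanism, not the mmap_ninja API; the scratch file name and its contents are made up for the example.

```python
import mmap
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.bin")

# Write 16 bytes to disk so there is something to map.
with open(path, "wb") as f:
    f.write(bytes(range(16)))

# Map the whole file into memory and treat it like a mutable byte buffer.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        first_four = bytes(mm[:4])  # read by slicing, no explicit read() call
        mm[0] = 255                 # in-place write to the mapped region
        mm.flush()                  # push the change back to the file

# Re-read the file the ordinary way to confirm the write reached disk.
with open(path, "rb") as f:
    on_disk = f.read()
```

Libraries that store arrays this way rely on the same principle: only the slices that are actually touched need to be paged in from disk.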
Business Intelligence
Big data, distributed computing
- #CODE Dask
- #CODE Ray
- A system for parallel and distributed Python that unifies the ML ecosystem
- https://ray.readthedocs.io/en/latest/
- https://ray-project.github.io/
- #TALK Ray: A Distributed Execution Framework for AI | SciPy 2018 | Robert Nishihara
- #TALK Ray: A System for Scalable Python and ML | SciPy 2020 | Robert Nishihara
- #CODE PyGDF - GPU Data Frame
- #CODE Apache Hadoop
- The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
- https://www.quora.com/What-is-the-difference-between-Apache-Spark-and-Apache-Hadoop-Map-Reduce
- Intro to Hadoop and MapReduce (Udacity)
- https://datawanderings.com/2017/01/15/your-first-diy-hadoop-cluster/
- http://ruhanixedu.com/blog/interview-question-and-answers/big-data/
- #CODE Apache Spark
- http://cacm.acm.org/magazines/2016/11/209116-apache-spark/fulltext
- http://www.kdnuggets.com/2015/11/introduction-spark-python.html
- https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
- #TALK A brief introduction to Distributed Computing with PySpark (Pydata)
- #TALK Connecting Python To The Spark Ecosystem
- http://tech.marksblogg.com/billion-nyc-taxi-rides-spark-2-1-0-emr.html
- http://ruhanixedu.com/blog/interview-question-and-answers/apache-spark-interview-questions-answers/
- Text Normalization with Spark
- Spark ML
- MLlib: http://spark.apache.org/mllib/, https://spark.apache.org/docs/latest/ml-guide.html
- PySpark
- Optimus
- #CODE Apache Storm
- #CODE Apache Arrow
- #CODE Blaze
Databases
- SQL:
- NoSQL:
Subtopics
Open datasets (for ML, DL and DS)
See AI/Data Engineering/Open ML data
MLOps
See AI/Data Engineering/ML Ops
Feature engineering
- https://en.wikipedia.org/wiki/Feature_engineering
- Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is fundamental to the application of ML, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning
- http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
- https://tech.zalando.com/blog/feature-extraction-science-or-engineering/
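As a small illustration of using domain knowledge to create features, the sketch below derives three features from a raw transaction record. The record layout, the field names and the choice of features are all assumptions made for this example.

```python
from datetime import datetime


def engineer_features(record):
    """Derive model-friendly features from a raw (hypothetical) transaction record."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        # Time of day often matters more than the raw timestamp.
        "hour": ts.hour,
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "is_weekend": ts.weekday() >= 5,
        # A ratio can be more informative than either raw column alone.
        "amount_per_item": record["amount"] / max(record["items"], 1),
    }


raw = {"timestamp": "2024-03-09T14:30:00", "amount": 120.0, "items": 4}
features = engineer_features(raw)
```

None of these columns exist in the raw data; each encodes a guess about what might be predictive, which is the manual effort that automated feature learning tries to remove.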
Feature extraction
See AI/Feature learning techniques in AI/Computer Vision/Computer Vision
Data mining
- http://nbviewer.jupyter.org/github/ptwobrussell/Mining-the-Social-Web-2nd-Edition/tree/master/ipynb/
- https://www.dataquest.io/course/apis-and-scraping
Web scraping
- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/
- https://medium.com/@hoppy/how-to-test-or-scrape-javascript-rendered-websites-with-python-selenium-a-beginner-step-by-c137892216aa#.hrjljvffd
- https://antonio-maiolo.com/2016/12/01/web-crawler-scrapy-crawl-spider-tutorial/
- http://stackoverflow.com/questions/19021541/scrapy-scrapping-data-inside-a-javascript
API
- A categorized public list of APIs from round the web
- A collective list of public JSON APIs for use in web development
- Public APIs
Databases
- https://en.wikipedia.org/wiki/Distributed_database
- ACID (Atomicity, Consistency, Isolation, Durability)
- SQL vs NoSQL
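The atomicity part of ACID can be seen with Python's built-in `sqlite3` module. The following is a minimal sketch; the `accounts` table, the `transfer` helper and the balances are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to `bob` is executed first, yet it does not survive: the rollback undoes it together with the rejected debit.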
SQL
- https://en.wikipedia.org/wiki/SQL
- https://en.wikipedia.org/wiki/Relational_database
- A relational database is a digital database whose organization is based on the relational model of data.
- https://www.analyticsvidhya.com/blog/2017/01/46-questions-on-sql-to-test-a-data-science-professional-skilltest-solution/
- Tutorial and exercises
- SQL (basic, intermediate, advanced / pet problems)
- List of SQL Commands
- JOIN
- A SQL join clause combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is. A JOIN is a means for combining columns from one (self-table) or more tables by using values common to each. ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.
- https://periscopedata.com/blog//how-joins-work.html
- https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems
- Python interface
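The JOIN types described above can be tried out through Python's built-in `sqlite3` interface. This is a minimal sketch with two made-up tables (`employees`, `departments`). It covers INNER, LEFT OUTER and CROSS only, because RIGHT and FULL OUTER joins are not available in older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    CREATE TABLE departments (id INTEGER, title TEXT);
    INSERT INTO employees VALUES ('Ana', 1), ('Ben', 2), ('Cy', NULL);
    INSERT INTO departments VALUES (1, 'Data'), (2, 'Ops'), (3, 'Legal');
    """
)

# INNER JOIN: only rows with a match on both sides.
inner = conn.execute(
    "SELECT e.name, d.title FROM employees e "
    "INNER JOIN departments d ON e.dept_id = d.id ORDER BY e.name"
).fetchall()

# LEFT OUTER JOIN: every employee, with NULL where no department matches.
left = conn.execute(
    "SELECT e.name, d.title FROM employees e "
    "LEFT OUTER JOIN departments d ON e.dept_id = d.id ORDER BY e.name"
).fetchall()

# CROSS JOIN: every combination of rows (3 x 3 here).
cross_count = conn.execute(
    "SELECT COUNT(*) FROM employees CROSS JOIN departments"
).fetchone()[0]
```

Here `Cy` has no department, so the row disappears from the inner join and shows up with `None` in the left outer join.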
NoSQL
- https://en.wikipedia.org/wiki/NoSQL
- Not only SQL: A NoSQL database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed.
- Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
- #TALK GOTO 2012 - Introduction to NoSQL - Martin Fowler
- Graph:
- A graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with a single operation.
- Graph databases employ nodes, edges and properties.
- Nodes represent entities/items you might want to keep track of (people, businesses, accounts).
- Edges, also known as graphs or relationships, are the lines that connect nodes to other nodes; they represent the relationship between them.
- Properties are pertinent information that relate to nodes (sort of keywords).
- AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
- https://neo4j.com/developer/graph-database/
- Key-value
- https://en.wikipedia.org/wiki/Key-value_database
- A key-value store, or key-value database, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash.
- Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database.
- Document-oriented database
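The node / edge / property model described under Graph above can be sketched in a few lines of plain Python. This is a toy in-memory structure, not how any of the listed graph databases is implemented; the class names, node ids and relationship labels are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    properties: dict = field(default_factory=dict)  # e.g. {"kind": "person"}


@dataclass
class Edge:
    source: str  # id of the node the relationship starts from
    target: str  # id of the node it points to
    label: str   # relationship type, e.g. "KNOWS"


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_edge(self, source, target, label):
        self.edges.append(Edge(source, target, label))

    def neighbors(self, node_id, label):
        """Follow all edges with the given label that start at node_id."""
        return [e.target for e in self.edges if e.source == node_id and e.label == label]


g = Graph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_node("acme", kind="business")
g.add_edge("alice", "acme", "WORKS_AT")
g.add_edge("alice", "bob", "KNOWS")

works_at = g.neighbors("alice", "WORKS_AT")
```

The `neighbors` call is the toy counterpart of retrieving linked data with a single operation: the relationship is stored directly, so no join over tables is needed.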
Data munging
Data preparation
- Data cleansing: Missing data
- Variables encoding
- Normalisation, scaling
- Outlier detection
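The four preparation steps listed above can be sketched with the standard library alone. The sample values, the median imputation strategy and the z-score threshold of 2 are assumptions chosen for illustration; real pipelines typically use pandas or scikit-learn for this.

```python
from statistics import mean, median, pstdev

ages = [22.0, 25.0, None, 31.0, 24.0, 95.0]  # made-up column with one missing value
colors = ["red", "green", "red"]              # made-up categorical column

# 1. Missing data: impute with the median of the observed values.
observed = [a for a in ages if a is not None]
fill = median(observed)
filled = [fill if a is None else a for a in ages]

# 2. Variable encoding: one-hot encode the categorical column.
categories = sorted(set(colors))
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# 3. Scaling: min-max normalisation to the [0, 1] range.
lo, hi = min(filled), max(filled)
scaled = [(a - lo) / (hi - lo) for a in filled]

# 4. Outlier detection: flag values more than 2 standard deviations from the mean.
mu, sigma = mean(filled), pstdev(filled)
outliers = [a for a in filled if abs(a - mu) / sigma > 2]
```

Order matters in practice: the outlier (95) stretches the min-max range, so outliers are often handled before scaling.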
Exploratory data analysis
- https://www.codementor.io/jadianes/data-science-python-r-exploratory-data-analysis-visualization-du107jjms
- http://blog.districtdatalabs.com/data-exploration-with-python-2
- https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
Big data
- http://www.datasciencecentral.com/profiles/blogs/25-big-data-terms-you-must-know-to-impress-your-date-or-whoever
- Architecture of Giants: Data Stacks at Facebook, Netflix, Airbnb, and Pinterest
MapReduce
- https://en.wikipedia.org/wiki/MapReduce
- MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
- A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
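The student-name example above can be sketched as a single-process imitation of the model. The function names and the student list are hypothetical; a real framework such as Hadoop would run the map and reduce steps in parallel across a cluster and perform the grouping (shuffle) step itself.

```python
from collections import defaultdict

students = ["Ana Ruiz", "Ben Lee", "Ana Diaz", "Cy Wong", "Ben Kim", "Ana Soto"]


def map_phase(records):
    """Map: emit a (first_name, 1) pair for every student."""
    for full_name in records:
        yield full_name.split()[0], 1


def shuffle(pairs):
    """Group values by key, forming one 'queue' per first name."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: summarise each queue by counting its entries."""
    return {key: sum(values) for key, values in groups.items()}


name_counts = reduce_phase(shuffle(map_phase(students)))
```

Because each key's values are reduced independently, the reduce step can be distributed across machines without coordination between keys.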