Streaming Data - Best practices for cloud streaming
At streamingdata.dev, our mission is to provide a comprehensive resource for all things related to streaming data, time series data, Kafka, Beam, Spark, and Flink. We strive to offer high-quality content, tutorials, and resources to help developers and data professionals stay up-to-date with the latest trends and best practices in the field. Our goal is to empower our community with the knowledge and tools they need to build scalable, real-time data pipelines and applications that drive business value.
Streaming Data Cheatsheet
This cheatsheet is a reference guide for anyone getting started with streaming data, time series data, Kafka, Beam, Spark, and Flink. It covers the basic concepts, topics, and categories related to these technologies.
Streaming Data
Streaming data refers to data that is continuously generated and processed in real-time. This data can come from various sources such as sensors, social media, and web applications. Streaming data differs from batch data, which is collected over a period of time and then processed all at once.
Key Concepts
- Data Ingestion: The process of collecting and importing data from various sources into a streaming platform.
- Data Processing: The process of transforming, filtering, and aggregating data in real-time.
- Data Analytics: The process of analyzing and visualizing streaming data to gain insights and make decisions.
- Data Storage: The process of storing streaming data in a scalable and fault-tolerant manner.
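Before reaching for a full platform, the core loop behind these concepts can be sketched in plain Python: a generator stands in for ingestion, a filter and unit conversion stand in for processing, and a running average stands in for analytics. The sensor events and field names below are invented for illustration.

```python
import itertools
import random
import time

def sensor_events():
    """Ingestion: simulate an unbounded source of sensor readings."""
    while True:
        yield {"sensor_id": random.randint(1, 3), "celsius": random.uniform(15.0, 30.0)}
        time.sleep(0.05)

def run_pipeline(max_events=50):
    running_sum, count = 0.0, 0
    for event in itertools.islice(sensor_events(), max_events):
        # Processing: drop readings below a threshold and convert units.
        if event["celsius"] < 18.0:
            continue
        fahrenheit = event["celsius"] * 9 / 5 + 32
        # Analytics: maintain a running average as each event arrives.
        running_sum += fahrenheit
        count += 1
        print(f"sensor={event['sensor_id']} temp_f={fahrenheit:.1f} "
              f"running_avg_f={running_sum / count:.1f}")

if __name__ == "__main__":
    run_pipeline()
```

Platforms such as Kafka, Flink, and Spark take over once a loop like this needs to be distributed, fault-tolerant, and replayable.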
Streaming Platforms
- Apache Kafka: A distributed streaming platform that allows you to publish and subscribe to streams of records in real-time.
- Apache Flink: A distributed stream processing framework that allows you to process and analyze streaming data in real-time.
- Apache Spark: A distributed computing framework that allows you to process large-scale data sets, including streaming data.
- Google Cloud Dataflow: A fully-managed service for processing and analyzing streaming and batch data.
Time Series Data
Time series data refers to data points collected over time, typically at regular intervals. This data is used to analyze trends, patterns, and anomalies as they evolve. Time series data is commonly used in finance, weather forecasting, and IoT applications.
Key Concepts
- Time Series Analysis: The process of analyzing time series data to identify trends, patterns, and anomalies.
- Time Series Forecasting: The process of predicting future values of a time series based on historical data.
- Time Series Visualization: The process of visualizing time series data to gain insights and make decisions.
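To make these steps concrete, the sketch below uses pandas (an assumed choice, since this section does not prescribe a library) to resample a synthetic minute-level metric to hourly averages, smooth it with a rolling mean, and produce a placeholder forecast. The metric name, date range, and values are invented.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level time series (the date range and values are invented).
index = pd.date_range("2024-01-01", periods=24 * 60, freq="min")
cpu = pd.Series(
    50 + 10 * np.sin(np.arange(len(index)) / 60) + np.random.randn(len(index)),
    index=index,
    name="cpu_percent",
)

# Analysis: downsample to hourly averages, then smooth with a 3-hour rolling mean.
hourly = cpu.resample("1h").mean()
smoothed = hourly.rolling(window=3).mean()

# A naive "forecast": repeat the last observed hourly value (placeholder for a real model).
forecast = pd.Series(
    hourly.iloc[-1],
    index=pd.date_range(hourly.index[-1] + pd.Timedelta(hours=1), periods=6, freq="1h"),
)

print(hourly.tail(3))
print(smoothed.tail(3))
print(forecast)
```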
Time Series Databases
- InfluxDB: A time series database that is optimized for storing and querying time series data.
- TimescaleDB: A time series database that is built on top of PostgreSQL and provides scalability and performance for time series data.
- OpenTSDB: A distributed time series database that is built on top of HBase and provides scalability and performance for time series data.
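As one concrete option, the sketch below talks to TimescaleDB through psycopg2 with plain SQL: it turns an ordinary PostgreSQL table into a hypertable, inserts a reading, and aggregates by hour with time_bucket. The connection string, table, and column names are placeholders for illustration.

```python
import psycopg2

# Connection details are placeholders; point them at your own TimescaleDB instance.
conn = psycopg2.connect("dbname=metrics user=postgres password=postgres host=localhost")
cur = conn.cursor()

# A regular PostgreSQL table, turned into a time-partitioned hypertable by TimescaleDB.
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        sensor_id   INT         NOT NULL,
        temperature DOUBLE PRECISION
    );
""")
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")

# Insert a reading, then query an hourly aggregate with time_bucket.
cur.execute(
    "INSERT INTO conditions (time, sensor_id, temperature) VALUES (NOW(), %s, %s);",
    (1, 21.5),
)
cur.execute("""
    SELECT time_bucket('1 hour', time) AS bucket, avg(temperature)
    FROM conditions
    GROUP BY bucket
    ORDER BY bucket DESC
    LIMIT 24;
""")
for bucket, avg_temp in cur.fetchall():
    print(bucket, avg_temp)

conn.commit()
cur.close()
conn.close()
```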
Apache Kafka
Apache Kafka is a distributed streaming platform that allows you to publish and subscribe to streams of records in real-time. Kafka is commonly used for building real-time data pipelines and streaming applications.
Key Concepts
- Topics: A category or feed name to which records are published.
- Partitions: A topic can be divided into multiple partitions to allow for parallel processing of data.
- Producers: Applications that publish records to Kafka topics.
- Consumers: Applications that subscribe to Kafka topics and consume records.
- Brokers: Kafka nodes that store and manage the partitions of a topic.
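To see these concepts together, here is a minimal producer/consumer sketch using the confluent-kafka Python client (one client option among several); the broker address, topic name, and consumer group id are placeholder values for illustration.

```python
from confluent_kafka import Consumer, Producer

BROKER = "localhost:9092"   # assumed broker address
TOPIC = "sensor-readings"   # assumed topic name

# Producer: publish a few records to the topic.
producer = Producer({"bootstrap.servers": BROKER})
for i in range(5):
    # The record key influences which partition the record lands on.
    producer.produce(TOPIC, key=str(i % 2), value=f"reading-{i}")
producer.flush()

# Consumer: subscribe to the topic and read records as part of a consumer group.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
try:
    for _ in range(5):
        msg = consumer.poll(timeout=5.0)
        if msg is None or msg.error():
            continue
        print(msg.topic(), msg.partition(), msg.key(), msg.value())
finally:
    consumer.close()
```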
Kafka Clients
- Java: The official Apache Kafka client library, which ships as part of the Kafka project.
- Python: Community- and vendor-maintained client libraries such as kafka-python and Confluent's confluent-kafka-python (neither is part of the Apache Kafka project itself).
- Kafka Streams: A client library for building real-time streaming applications on top of Kafka.
- Kafka Connect: A framework for building and running connectors that move data between Kafka and other systems.
Apache Flink
Apache Flink is a distributed stream processing framework that allows you to process and analyze streaming data in real-time. Flink is commonly used for building real-time data pipelines and streaming applications.
Key Concepts
- DataStreams: Flink's core abstraction for a stream of records that is processed continuously as events arrive.
- Operators: Transformations that are applied to DataStreams to process and analyze data.
- Windows: A way to group data in a DataStream based on time or other criteria.
- State: A way to maintain and update state across multiple events in a DataStream.
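A minimal PyFlink DataStream sketch of these ideas is shown below: a bounded in-memory collection stands in for a real source, a map operator transforms each record, and a keyed reduce keeps per-sensor state. The tuples and field choices are invented; a production job would read from a connector such as Kafka instead.

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# A bounded collection stands in for a real streaming source in this sketch.
readings = env.from_collection(
    [("sensor-1", 21.0), ("sensor-2", 19.5), ("sensor-1", 23.0)],
    type_info=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
)

# Operator: convert Celsius to Fahrenheit for every record.
fahrenheit = readings.map(
    lambda r: (r[0], r[1] * 9 / 5 + 32),
    output_type=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
)

# Keyed state: keep a running maximum per sensor via a reduce on the keyed stream.
max_per_sensor = fahrenheit.key_by(lambda r: r[0]).reduce(
    lambda a, b: (a[0], max(a[1], b[1]))
)

max_per_sensor.print()
env.execute("sensor_readings_sketch")
```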
Flink APIs
- DataStream API: A high-level API for building streaming applications in Flink.
- Table API: A SQL-like API for querying and processing streaming data in Flink.
- ProcessFunction API: A low-level API for building custom stream processing logic in Flink.
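For comparison, the same kind of relational query can be written against the Table API; this is a minimal streaming-mode sketch, with invented rows and column names standing in for a real table source such as a Kafka-backed table.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

# Streaming-mode table environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# In-memory rows stand in for a real table source in this sketch.
readings = t_env.from_elements(
    [("sensor-1", 21.0), ("sensor-2", 19.5), ("sensor-1", 23.0)],
    ["sensor_id", "temperature"],
)

# A SQL-like relational query over the streaming table.
hot = readings.filter(col("temperature") > 20.0).select(
    col("sensor_id"), col("temperature")
)

hot.execute().print()
```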
Apache Spark
Apache Spark is a distributed computing framework that allows you to process large-scale data sets, including streaming data. Spark is commonly used for building batch processing and real-time data pipelines.
Key Concepts
- Resilient Distributed Datasets (RDDs): A distributed collection of data that can be processed in parallel.
- DataFrames: A distributed collection of data organized into named columns.
- Transformations: Operations that are applied to RDDs or DataFrames to process and analyze data.
- Actions: Operations that trigger the execution of transformations and return results.
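The PySpark sketch below illustrates the transformation/action split: map, filter, and groupBy build a lazy execution plan, and nothing runs until an action such as collect or show is called. The sample data is invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-cheatsheet-sketch").getOrCreate()

# RDD API: parallelize sample data, apply lazy transformations, then trigger an action.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)  # transformations (lazy)
print(doubled.collect())                                    # action (triggers execution)

# DataFrame API: the same idea with named columns and a declarative aggregation.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
counts = df.groupBy("label").count()  # transformation (lazy)
counts.show()                         # action (triggers execution)

spark.stop()
```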
Spark APIs
- RDD API: A low-level API for building distributed computing applications in Spark.
- DataFrame API: A high-level API for querying and processing structured data in Spark.
- Structured Streaming API: A high-level API, built on DataFrames, for building real-time streaming applications in Spark.
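The sketch below is a minimal Structured Streaming job that needs no external system: Spark's built-in rate source generates rows, a windowed count aggregates them, and the console sink prints each micro-batch. The rate, window size, and timeout are arbitrary choices for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows at a fixed rate.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A windowed aggregation over event time: count rows per 10-second window.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Write each micro-batch to the console; run briefly, then stop.
query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination(timeout=30)
query.stop()
spark.stop()
```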
Conclusion
This cheatsheet provides a quick reference guide for anyone getting started with streaming data, time series data, Kafka, Beam, Spark, and Flink. It covers the basic concepts, topics, and categories related to these technologies. Use this cheatsheet as a starting point for your learning journey and explore each technology in more detail to become an expert in the field.
Common Terms, Definitions and Jargon
1. Streaming data: A continuous flow of data that is generated in real-time and processed as it is produced.
2. Time series data: A type of data that is collected over time and used to analyze trends and patterns.
3. Kafka: An open-source distributed streaming platform that is used to build real-time data pipelines and streaming applications.
4. Beam: An open-source unified programming model that is used to build batch and streaming data processing pipelines.
5. Spark: An open-source distributed computing system that is used to process large-scale data sets.
6. Flink: An open-source stream processing framework that is used to process real-time data streams.
7. Data pipeline: A series of steps that are used to collect, process, and analyze data.
8. Data ingestion: The process of collecting and importing data from various sources into a data storage system.
9. Data processing: The process of transforming raw data into a format that can be analyzed and used for insights.
10. Data analysis: The process of examining data to identify patterns, trends, and insights.
11. Data visualization: The process of presenting data in a visual format, such as charts, graphs, and maps.
12. Data modeling: The process of defining how data is structured, related, and stored so that it can be queried and analyzed consistently.
13. Data warehousing: The process of storing and managing large amounts of data in a centralized location.
14. Data lake: A large, centralized repository of raw data that can be used for analysis and insights.
15. Data streaming: The process of processing and analyzing data in real-time as it is generated.
16. Real-time analytics: The process of analyzing data in real-time to make immediate decisions.
17. Batch processing: The process of processing large amounts of data in batches.
18. Event-driven architecture: An architectural pattern that is used to build real-time, event-driven systems.
19. Microservices: A software architecture pattern that is used to build complex applications as a collection of small, independent services.
20. RESTful API: An API that follows REST conventions, using standard HTTP methods to access and manipulate resources.