Consulting
A consultant is a professional who provides expert advice in a particular area. A Data Engineer 🔢 👨🔧 may provide expert advice in implementing new solutions from the ground up, or improving existing solutions. I both cases, it is valuable to translate business requirements into a high level architectural blueprint.
1. Basics
Does the system have to be real-time? What's the minimal required response time? What kind of skills does the team poses? Is performance more important than readability? Should data be available and highly consistent? Is there enough time, or enough money? Is the selected solution robust, flexible, secure?
Advice | Elucidation |
There is always a trade-off | The trade-offs are sometimes hard to find, and balancing occurs while developing a system. There are no silver bullets. Don't give answers you don't have. Research pros and cons, make an informed decision. |
Take maturity into account | Mature technologies are often more stable, and easier to implement because of greater adoption among adjacent technologies |
Make services as isolated as possible | |
Avoid premature optimizations | It is considered as breaking YAGNI. Choosing Parquet over CSV for example, just because it is performing better. You can always make this choice later. |
Write tests before refactoring | This way you can safely modify code, and verify that the output is still exactly the same. |
Avoid single points of failure | |
Start stateless |
2. Technologies
The number of possible combinations between different technologies is vast. Take a look at the following technologies within each subject:
Now, if you give a piece of advice that turned out to be wrong, you lose all credibility. So how do you go about making crucial architectural decisions? First, you should be aware that these subjects exist, second you should know some advantages and disadvantages of some technologies of each subject.
You can do this by reading the rest of this chapter!
3. Databases
A Data Engineer must be aware of the physical limitations with respect to data. For example, it is physically impossible to read a couple of records with top consistency from a distributed database under 150 microseconds.
Relational VS Non Relational
There are some big differences between relational and non relational databases, although it heavily relies on the DB implementation:
Relational | Non Relational | |
Storage | Tables and rows | JSON documents |
Language | SQL | depends on implementation |
Consistency | Consistent | Eventually Consistent |
Structure | Structured Normalized (foreign keys and indexes), rigid data model | Semi-Structured Denormalized (related data stored in same document), no schema model |
Advantages |
|
|
Disadvantages |
|
|
Schema | Predefined | Programmatically |
Scaling | Mostly vertical (improving machine), high cost, and a limitation to the level of scaling | Horizontal (adding more machines), low cost, can handle high number of operations |
Maintenance | Expensive, requires trained workforce | Cheap, automatic repair and easier distribution |
Caching | Separate hardware | Integrated |
You generally choose a relational database when ACID compliance must be ensured, and when your data is structured and unchanging. But, most importantly, it is not Big Data.
If there's little structure, and the DB must handle large VVV data, a NoSQL database is a better choice. Also, the lack of schema and migrations allows for quick development. To choose which NoSQL DB is best suited for your use case, this interesting decision tree may be useful:
Key-Value stores (Redis) are super fast in memory stores. Document stores (MongoDB) are super flexible schemaless stores with built-in features and query system, a RDBMS replacement. Wide-Column stores (Cassandra) are super capable of heavy writes and real-time querying, good when you are dealing with massive data, real-time.
CAP
A distributed database that continues to operate when network partitions occur is considered to be partition tolerant. As a consequence, data that is read from one partition may differ from the other, meaning that consistency isn't guaranteed, but the database is available. Not unless you are talking about "eventual consistency", which implies that it takes some time to synchronize. Consider the order of magnitudes between different network operations here:
For this reason, the CAP theorem states that networked shared-data systems can only guarantee/strongly support two of the following three properties:
Consistency: every read receives the most recent write or an error
Availability: every request receives a reasonable response within a reasonable amount of time
Partition Tolerance: the system will continue to function when network partitions occur
Because networks are not reliable, and eventual consistency is never instant, we must tolerate partitions in a distributed system. So we are left with the following two choices:
CP: wait for a response from the partitioned node which could result in a timeout error. The system can also choose to return an error, depending on the scenario you desire. (Choose Consistency over Availability when your business requirements dictate atomic reads and writes)
AP: return the most recent version of the data you have, which could be stale. This system state will also accept writes that can be processed later when the partition is resolved. (Choose when consistency is not crucial)
CA: highly available and consistent, but not partition tolerant (Choose when you can't have network partitions)
Transactions, ACID and BASE
ACID is an acronym that describes database transaction properties (pessimistic replication model):
Atomicity: guarantees that each transaction is handled as a single unit. A transaction consisting of multiple statements either succeeds completely (a change), or fails completely (no change).
Consistency: ensures validity of database with constraints, cascades, and triggers.
Isolation: ensures concurrent transactions to leave the database state as if they were executed sequentially.
Durability: guarantees that once a transaction has been committed, it will remain committed even in case of a system failure.
BASE is an acronym that describes eventually consistent services (optimistic replication model):
Basically Available: means that any data request should receive a response, but that the response may indicate a failure instead of the requested data.
Soft State: given eventual consistency, the system may be in a changing state until consistency is reached
Eventual Consistency: describes the situation where a DDBS achieves high availability while loosely guaranteeing that data, in the absence of updates, will eventually reflect the last updated value across the system, and therefore the data may vary in value until the system reaches a consistent state.
4. Data Ingestion
When moving data, especially unstructured data, from where it is originated into a system where it can be stored and analyzed, we are talking about ingestion. This process may be continuous or asynchronous, real-time or batched, from a wide variety of data sources.
Batch Transferring
Data can be ingested in a batch. Apache Sqoop is capable to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases. Embulk helps transfer data between various databases, storages, file formats, and cloud services.
Real-time Messaging
Then there's real-time data ingestion. When many sources are sending data to your API's, and your API goes down, the data is lost, and the application may show errors. If you put a message broker in between the application and API, the data is temporarily stored so that the API can access it when it's up again. This may also be useful when you want to replace or update an API or service. It decouples processing from data producers.
Great use cases for messaging are:
Website Activity Tracking
Collecting Metric Data
Log Aggregation
Stream Processing
Event Sourcing
Commit Log
Kafka, for example, allows producers to publish a stream of records to one or more topics, and allows consumers to subscribe to one or more topics, while processing the stream for them.
5. File System
File Systems such as HDFS and AWS S3 are used to store files. In the context of Data Engineering, files may be large files of data, not just images or videos. It may be used to store serialized files, described in the next section.
When would you need a File System? When you want to store or process massive files using Spark for example.
S3 | HDFS | |
Elasticity | Yes | No |
Cost/TB/Month | $23 | $206 |
Availability | 99.99% | 99.9% |
Durability | 99.999999999% | 99.9999% |
Transactional writes | Yes with DBIO | Yes |
6. Serialization Format
The files that you store on the file system may be stored in a particular format. How do you choose the right format?
What type of data do you have?
Is the desired format compatible with your processing tools?
What are your file sizes?
Do you have to be able to split the files for map-reduce style processing?
Do the schema's evolve over time?
Format | Advantages | Use Case |
Text CSV TSV | Readable Easy to implement | Adding large amounts of data to HDFS quickly |
JSON | Dictionary-like | Applications |
Avro | Schema evolution (allows schema change) Serialization Row-based binary format Block compression Splittable | Event data that changes over time Adding, removing, renaming columns later |
SEQ | Row oriented Splittable Block compression | Sharing datasets between MR jobs |
Parquet | Column-based binary format Quick column access Block compression Append data Optimized for reading | Column specific queries Adding columns later |
ORC | Splittable Block compression Lightweight indexing Optimized for reading | Column specific queries |
There's the idea of wide and narrow datasets:
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
7. Stream Processing
Stream processing is a paradigm, equivalent to reactive programming, allowing for more parallelism. A stream is an unbounded sequence of something. In order to process an endless sequence of data, you would have to cut the sequence somewhere, store it, then process it, and then do the next batch and worry about aggregating the data at some point. In some cases speedy results are desired.
Native Streaming
Storm, Flink, Kafka Streams, and Samza, can process every record as it arrives. Native Streaming feels natural as every record is processed as soon as it arrives, allowing the framework to achieve the minimum latency possible. But it also means that it is hard to achieve fault tolerance without compromising on throughput as for each record, we need to track and checkpoint once processed. Also, state management is easy as there are long running processes which can maintain the required state easily.
Micro Batching
Spark Streaming, Storm-Trident, micro batch records, introducing a small delay. Fault tolerance comes for free as it is essentially a batch and throughput is also high as processing and check-pointing will be done in one shot for group of records. But it will be at some cost of latency and it will not feel like a natural streaming. Also efficient state management will be a challenge to maintain.
8. Batch Processing
If results are required only once an hour or once a day, it may not be required to set up and maintain a streaming solution. In this case, data may be appended to a parquet file, and processed overnight. But processing all this data may take days, or weeks. For this problem, Hadoop was brought into existence. Map-Reduce in combination with HDFS is able to process distributed data in a distributed fashion, in which it writes intermediate results to disk, which was a slow process. Spark introduced in-memory processing, which drastically improved performance, because it only had to write the end result to HDFS or some other data source. That brings us to scheduling!
10. Workflow
Airflow is one such scheduling tool, it uses your defined DAG's to schedule tasks, which you can keep track of in the web interface:
A DAG is defined in a python file:
Other scheduling tools are available, mostly depending on the cloud platform that you're using.
Last updated