Consulting

A consultant is a professional who provides expert advice in a particular area. A Data Engineer 🔢 👨‍🔧 may provide expert advice on implementing new solutions from the ground up, or on improving existing ones. In both cases, it is valuable to be able to translate business requirements into a high-level architectural blueprint.

1. Basics

Does the system have to be real-time? What is the minimum required response time? What skills does the team possess? Is performance more important than readability? Should data be highly available, strongly consistent, or both? Is there enough time, or enough money? Is the selected solution robust, flexible, and secure?

| Advice | Elucidation |
| --- | --- |
| There is always a trade-off | Trade-offs are sometimes hard to find, and balancing happens while developing the system. There are no silver bullets. Don't give answers you don't have: research the pros and cons and make an informed decision. |
| Take maturity into account | Mature technologies are often more stable and easier to implement, because of greater adoption among adjacent technologies. |
| Make services as isolated as possible | |
| Avoid premature optimizations | Premature optimization breaks YAGNI: choosing Parquet over CSV, for example, just because it performs better. You can always make that choice later. |
| Write tests before refactoring | This way you can safely modify code and verify that the output is still exactly the same (see the sketch below this table). |
| Avoid single points of failure | |
| Start stateless | |
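
As a minimal sketch of the test-before-refactor advice, here is a characterization test that pins down current behavior before the implementation is touched. The normalize_record function is hypothetical, and pytest is assumed as the test runner:

# test_normalize.py -- run with: pytest test_normalize.py

def normalize_record(record):
    # Current behavior we want to preserve through the refactor:
    # strip whitespace from keys, lowercase string values.
    return {k.strip(): v.lower() if isinstance(v, str) else v
            for k, v in record.items()}

def test_normalize_record_behavior_is_unchanged():
    # Pin the current output; refactor freely while this stays green.
    assert normalize_record({" Name ": "Alice", "age": 30}) == {
        "Name": "alice",
        "age": 30,
    }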

2. Technologies

The number of possible combinations of different technologies is vast. Take a look at the technologies within each of the subjects covered below: databases, data ingestion, file systems, serialization formats, stream and batch processing, and workflow scheduling.

Now, if you give a piece of advice that turns out to be wrong, you lose all credibility. So how do you go about making crucial architectural decisions? First, be aware that these subjects exist; second, know some of the advantages and disadvantages of the main technologies within each subject.

You can do this by reading the rest of this chapter!

3. Databases

A Data Engineer must be aware of the physical limitations with respect to data. For example, it is physically impossible to read a couple of records with strong consistency from a distributed database in under 150 microseconds: a strongly consistent read requires at least one network round trip, and a round trip within a single data center already takes around 500 microseconds (see the latency table below).

Relational vs Non-Relational

There are some big differences between relational and non-relational databases, although much depends on the specific implementation:

| | Relational | Non-Relational |
| --- | --- | --- |
| Storage | Tables and rows | JSON documents |
| Language | SQL | Depends on implementation |
| Consistency | Consistent | Eventually consistent |
| Structure | Structured, normalized (foreign keys and indexes), rigid data model | Semi-structured, denormalized (related data stored in the same document), no schema model |
| Advantages | Complex querying; reliable ACID transactions; relationship constraints; operations are treated as transactions automatically; query routine analysis; referential integrity (cascading) | Information that belongs together is stored together; no SQL injection; easy sharding; no schema validation allows for experimenting; high availability; open source |
| Disadvantages | Doesn't scale that well; normalization means lots of joins, which may affect speed; hard to work with semi-structured data | Limited support for joins; limited indexing; denormalization means mass updates; transactions must be handled manually; nested documents are hard to keep consistent; no schema validation may leave the DB in an invalid state; weaker consistency (BASE) |
| Schema | Predefined | Defined programmatically |
| Scaling | Mostly vertical (improving the machine): high cost, and a limit to the level of scaling | Horizontal (adding more machines): low cost, can handle a high number of operations |
| Maintenance | Expensive, requires a trained workforce | Cheap, automatic repair and easier distribution |
| Caching | Separate hardware | Integrated |

You generally choose a relational database when ACID compliance must be ensured, when your data is structured and unchanging, and, most importantly, when you are not dealing with Big Data.

If there's little structure, and the DB must handle data that is high in volume, velocity, and variety (the three V's), a NoSQL database is a better choice. Also, the lack of schemas and migrations allows for quick development. To choose which NoSQL DB is best suited to your use case, it helps to compare the main families:

Key-Value stores (e.g. Redis) are very fast in-memory stores. Document stores (e.g. MongoDB) are flexible schemaless stores with built-in features and a query system, and can serve as an RDBMS replacement. Wide-Column stores (e.g. Cassandra) handle heavy writes and real-time querying well, making them a good fit for massive, real-time data.
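
To make the key-value model concrete, here is a minimal sketch assuming a locally running Redis server and the redis-py client (key names are made up):

import redis

# Connect to a locally running Redis instance.
r = redis.Redis(host="localhost", port=6379, db=0)

# Key-value stores excel at simple, very fast lookups by key.
r.set("session:42", "alice")
print(r.get("session:42"))  # b'alice'

# Keys can be given a TTL, which makes Redis a natural cache.
r.set("cache:report", "rendered-html", ex=60)  # expires after 60 seconds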

CAP

A distributed database that continues to operate when network partitions occur is considered partition tolerant. As a consequence, data read from one partition may differ from data read from another, meaning that consistency isn't guaranteed even though the database remains available, unless you settle for "eventual consistency", which implies that synchronization takes some time. To see why, consider the orders of magnitude between different network operations:

http://norvig.com/21-days.html#answers
           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet California -> Netherlands

For this reason, the CAP theorem states that networked shared-data systems can strongly guarantee at most two of the following three properties:

  • Consistency: every read receives the most recent write or an error

  • Availability: every request receives a (non-error) response within a reasonable amount of time, though not necessarily containing the most recent data

  • Partition Tolerance: the system will continue to function when network partitions occur

Because networks are not reliable, and eventual consistency is never instant, we must tolerate partitions in a distributed system. That leaves us with two practical choices (CA is only an option when the system is not actually distributed):

  • CP: wait for a response from the partitioned node which could result in a timeout error. The system can also choose to return an error, depending on the scenario you desire. (Choose Consistency over Availability when your business requirements dictate atomic reads and writes)

  • AP: return the most recent version of the data you have, which could be stale. This system state will also accept writes that can be processed later when the partition is resolved. (Choose when consistency is not crucial)

  • CA: highly available and consistent, but not partition tolerant (choose when network partitions cannot occur, i.e. the system is not distributed across an unreliable network)

Transactions, ACID and BASE

ACID is an acronym that describes database transaction properties (pessimistic replication model):

  • Atomicity: guarantees that each transaction is handled as a single unit. A transaction consisting of multiple statements either succeeds completely (a change), or fails completely (no change).

  • Consistency: ensures that a transaction can only bring the database from one valid state to another, upholding constraints, cascades, and triggers.

  • Isolation: ensures that concurrent transactions leave the database in the same state as if they had been executed sequentially.

  • Durability: guarantees that once a transaction has been committed, it will remain committed even in case of a system failure.
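
To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table is made up for illustration); the simulated crash between the two statements leaves the database unchanged:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash between the two statements")
except RuntimeError:
    pass

# Atomicity: the debit was rolled back, no money disappeared.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]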

BASE is an acronym that describes eventually consistent services (optimistic replication model):

  • Basically Available: means that any data request should receive a response, but that the response may indicate a failure instead of the requested data.

  • Soft State: given eventual consistency, the system may be in a changing state until consistency is reached.

  • Eventual Consistency: describes the situation where a distributed database achieves high availability while loosely guaranteeing that, in the absence of updates, data will eventually reflect the last updated value across the system; until then, values may differ between nodes.

4. Data Ingestion

When moving data, especially unstructured data, from where it originates into a system where it can be stored and analyzed, we speak of ingestion. This process may be continuous or asynchronous, real-time or batched, and may involve a wide variety of data sources.

Batch Transferring

Data can be ingested in batches. Apache Sqoop can efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases. Embulk helps transfer data between various databases, storages, file formats, and cloud services.

Real-time Messaging

Then there's real-time data ingestion. When many sources are sending data to your APIs and an API goes down, the data is lost and the application may show errors. If you put a message broker between the producers and the API, the data is temporarily stored so that the API can access it when it's up again. This is also useful when you want to replace or update an API or service: it decouples processing from data producers.

Great use cases for messaging are:

  • Website Activity Tracking

  • Collecting Metric Data

  • Log Aggregation

  • Stream Processing

  • Event Sourcing

  • Commit Log

Kafka, for example, allows producers to publish a stream of records to one or more topics, and allows consumers to subscribe to one or more topics and process the stream of records produced to them.
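
A minimal sketch of this pattern with the kafka-python client, assuming a broker on localhost:9092 (the topic name is made up):

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish records to a topic; the broker buffers them even
# when downstream consumers are temporarily down.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("website-clicks", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Consumer: subscribe to the topic and process records as they arrive.
consumer = KafkaConsumer(
    "website-clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value)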

5. File System

File systems such as HDFS and AWS S3 are used to store files. In the context of Data Engineering, these may be large data files, not just images or videos. A file system may also be used to store serialized files, described in the next section.

When would you need a file system? When you want to store or process massive files, using Spark for example.
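
For S3, storing and retrieving such files might look like this (a sketch using boto3; the bucket and key names are made up):

import boto3

s3 = boto3.client("s3")

# Upload a local data file to a (hypothetical) bucket.
s3.upload_file("events.parquet", "my-data-lake", "raw/events.parquet")

# Download it again, e.g. on a processing node.
s3.download_file("my-data-lake", "raw/events.parquet", "/tmp/events.parquet")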

How do S3 and HDFS compare?

| | S3 | HDFS |
| --- | --- | --- |
| Elasticity | Yes | No |
| Cost/TB/month | $23 | $206 |
| Availability | 99.99% | 99.9% |
| Durability | 99.999999999% | 99.9999% |
| Transactional writes | Yes, with DBIO | Yes |

6. Serialization Format

The files that you store on the file system may be stored in a particular format. How do you choose the right format?

  • What type of data do you have?

  • Is the desired format compatible with your processing tools?

  • What are your file sizes?

  • Do you have to be able to split the files for map-reduce style processing?

  • Do the schemas evolve over time?

| Format | Advantages | Use Case |
| --- | --- | --- |
| Text (CSV, TSV) | Readable; easy to implement | Adding large amounts of data to HDFS quickly |
| JSON | Dictionary-like | Applications |
| Avro | Schema evolution (allows schema changes); row-based binary format; block compression; splittable | Event data that changes over time; adding, removing, or renaming columns later |
| SEQ (SequenceFile) | Row-oriented; splittable; block compression | Sharing datasets between MapReduce jobs |
| Parquet | Column-based binary format; quick column access; block compression; appendable; optimized for reading | Column-specific queries; adding columns later |
| ORC | Splittable; block compression; lightweight indexing; optimized for reading | Column-specific queries |
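
For instance, writing Avro with a schema that can evolve might look like this (a sketch using the fastavro library; the schema and file name are made up). The default on the country field is what would let older files, written before the field existed, still be read with this schema:

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "Event",
    "type": "record",
    "fields": [
        {"name": "user", "type": "string"},
        # Added in a later schema version; the default keeps old files readable.
        {"name": "country", "type": "string", "default": "unknown"},
    ],
})

records = [
    {"user": "alice", "country": "NL"},
    {"user": "bob", "country": "US"},
]

with open("events.avro", "wb") as out:
    writer(out, schema, records)

with open("events.avro", "rb") as f:
    for record in reader(f):
        print(record)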

There's the idea of wide and narrow datasets:

  • Narrow: 10 million rows, 10 columns

  • Wide: 4 million rows, 1000 columns
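
To illustrate why columnar formats pay off on wide datasets, here is a small sketch using pandas with the pyarrow engine (file and column names are made up):

import pandas as pd

# Write a small dataset to Parquet (columnar, compressed, splittable).
df = pd.DataFrame({
    "user": ["alice", "bob", "carol"],
    "clicks": [10, 3, 7],
    "country": ["NL", "US", "DE"],
})
df.to_parquet("events.parquet", engine="pyarrow")

# A column-specific query reads only the requested column from disk,
# which is where columnar formats shine on wide datasets.
clicks = pd.read_parquet("events.parquet", columns=["clicks"])
print(clicks)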

7. Stream Processing

Stream processing is a paradigm, closely related to reactive programming, that allows for more parallelism. A stream is an unbounded sequence of something. To process an endless sequence of data in batches, you would have to cut the sequence somewhere, store the chunk, process it, then move on to the next batch and worry about aggregating across batches at some point. In some cases speedy results are desired, and that is where stream processing comes in.

Native Streaming

Storm, Flink, Kafka Streams, and Samza process every record as it arrives. Native streaming feels natural, as every record is processed as soon as it arrives, allowing the framework to achieve the minimum possible latency. The downside is that it is hard to achieve fault tolerance without compromising throughput, because each record needs to be tracked and checkpointed once processed. State management, on the other hand, is easy, since there are long-running processes that can maintain the required state.

Micro Batching

Spark Streaming and Storm-Trident micro-batch incoming records, introducing a small delay. Fault tolerance comes for free, as a micro-batch is essentially a batch, and throughput is also high because processing and checkpointing are done in one shot for a group of records. But this comes at some cost in latency, and it does not feel like natural streaming. Efficient state management is also a challenge.
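
A minimal micro-batching sketch with PySpark's Structured Streaming; the built-in rate source only generates test rows (in practice you would read from Kafka or a socket):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("micro-batch-sketch").getOrCreate()

# The "rate" source emits (timestamp, value) test rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count rows per 10-second window; Spark processes micro-batch by micro-batch.
counts = stream.groupBy(window(stream["timestamp"], "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()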

8. Batch Processing

If results are required only once an hour or once a day, it may not be worth setting up and maintaining a streaming solution. In that case, data may be appended to a Parquet file and processed overnight. But processing all this data may take days, or weeks. For this problem, Hadoop was brought into existence: MapReduce in combination with HDFS can process distributed data in a distributed fashion, but it writes intermediate results to disk, which is slow. Spark introduced in-memory processing, which drastically improved performance because only the end result has to be written to HDFS or some other data source.
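
The batch counterpart of the streaming sketch above is a plain Spark job, for example one that runs overnight (a sketch; paths and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-batch").getOrCreate()

# Read everything appended during the day; Parquet files are read in parallel.
events = spark.read.parquet("/data/events/2020-01-01/")

# Aggregate in memory across the cluster; only the result is written to disk.
daily_counts = events.groupBy("country").count()
daily_counts.write.mode("overwrite").parquet("/data/reports/daily_counts/")

That brings us to scheduling!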

10. Workflow

Airflow is one such scheduling tool. It uses your defined DAGs to schedule tasks, which you can keep track of in the web interface:

# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler

# visit localhost:8080 in the browser and enable the example dag in the home page

A DAG is defined in a Python file:

dag = DAG(
    'tutorial', default_args=default_args, schedule_interval=timedelta(days=1))
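
For context, a complete minimal DAG file in the Airflow 1.x style (matching the quickstart commands above) might look like this; the BashOperator task is just an illustrative placeholder:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "tutorial", default_args=default_args, schedule_interval=timedelta(days=1))

# A single daily task that just prints the date.
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)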

Other scheduling tools are available; which one fits best mostly depends on the cloud platform that you're using.
