It’s Just 1 Line of Code Change

Admit it, we’ve all been there. If you are shaking your head and saying nope, not me, you deserve a tip of the hat. The rest of us should have gotten the wag of the finger.  And we probably have already been punished with the disastrous results of our overly confident and/or overly optimistic action.

So, you’ve been working on implementing a particular feature, change, or fix, and you’ve gone back and forth coding, verifying, and testing (I hope you didn’t skip the verifying and testing part). At this point the code is so ingrained in your mind that you know it forwards and backwards. It went through whatever formal testing process your organization or team has adopted. Everything looks fine and you are ready to release. Then, at the very last second, somebody finds another, maybe very trivial, issue, or one of the stakeholders asks you to make a tiny change somewhere. The questions come: “How big of a change is it?” “Is it easy to fix?” Since you’ve been living and breathing this piece of code for the past few days, or however long it took you to implement it, you reply, “It’s just 1 line of code change.” You are thinking to yourself, “I’ll do the change and check it in; should be good to go.” Right? You are eager to release, the stakeholders are eager to release, everybody’s eager to release. Nobody wants to retest the whole thing for a one-liner, so you push to production…

What’s the worst that can happen? Well, you might want to tell that to the NASA engineers who worked on the Mariner 1 probe, launched on July 22, 1962. They might just punch you in the face, or worse.

When Mariner 1 embarked on that fateful day, instead of making history in space exploration as the first spacecraft to fly by Venus, it veered off course and was destroyed less than 5 minutes into the flight, going down in history as the most expensive programming typo.

Yes, you read that right. All it was, was a TYPO. This typo cost the U.S. government about $80 million ($630 million in 2016 dollars). Somewhere within the coded computer instructions, a hyphen was omitted. So instead of soaring into space, the Mariner came crashing back down.

Here is the official explanation provided on the NASA website:

the Mariner 1 Post Flight Review Board determined that the omission of a hyphen in coded computer instructions in the data-editing program allowed transmission of incorrect guidance signals to the spacecraft. During the periods the airborne beacon was inoperative the omission of the hyphen in the data-editing program caused the computer to incorrectly accept the sweep frequency of the ground receiver as it sought the vehicle beacon signal and combined this data with the tracking data sent to the remaining guidance computation. This caused the computer to swing automatically into a series of unnecessary course corrections with erroneous steering commands which finally threw the spacecraft off course.


To summarize:

Use hyphen, have a successful space rocket launch, conquer Venus, pop the champagne, have a parade, make history. 🚀 🎉 🍾

Omit hyphen, watch your rocket crash and burn, crush a nation’s dreams, waste the government’s money, live with the embarrassment till the end of your life. 💥 😱 💸

Ok, so maybe the software you are working on isn’t that mission critical. But that “omitted hyphen” equivalent in the one-liner you’ve just so carelessly released to production might cause your very own mini Mariner disaster, where in the best case you’ll have to scramble to fix the problem fast.

You may also argue that we don’t know whether the “omitted hyphen” was caused by a last-minute change. Probably not. It is still a great example, though, of how a tiny mistake can lead to a huge disaster. So the next time you’re faced with making that “1 line of code change” at the last minute, just remember the Mariner and the hyphen. Releasing with the issue you know, rather than with an unknown, might just be the best decision you’ve ever made.


The Man Who Knew Infinity: Coding Ramanujan’s Taxi

Ramanujan is on the left

Have you seen the movie (or read the book) The Man Who Knew Infinity?

This new movie, which stars Dev Patel and Jeremy Irons, explores the Indian mathematician Srinivasa Ramanujan and his profound understanding, ingenuity, and love of math. The film inspired me on both an intellectual and an emotional level. But what really drew my attention was a particular five-second scene.

The scene takes place in 1918. Ramanujan’s mentor and friend G.H. Hardy quips that he had just taken taxi number 1729 and finds the number “a rather dull one.” Ramanujan passionately replies, “No, Hardy, it’s a very interesting number! It’s the smallest number expressible as the sum of two cubes in two different ways.”

Ramanujan was able to see beyond the simple taxi cab number and into the depths of the expression behind it: a³ + b³ = c³ + d³…better known as Ramanujan’s Taxi. I thought this problem was fascinating and wondered how the code implementation would look. Little did I realize there were many optimization layers to this algorithm onion.

The taxi Ramanujan took — at least in the movie

First Crack at Implementing Ramanujan’s Taxi

I started with a straightforward implementation written in Scala. The code, with performance timings, can be found on GitHub.


We begin with a brute-force implementation by looping through all combinations to find where a³ + b³ = c³ + d³. We achieve O(n⁴) performance because of the four loops used to calculate all values of a³, b³, c³, and d³ equal to or less than parameter n, which bounds our search field.
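A minimal sketch of such a brute-force approach (illustrative only, not necessarily the exact code from the repository; the object, method, and helper names here are mine) might look like this:

    object BruteForceTaxi {
      // Cube as Long to avoid Int overflow for larger inputs.
      private def cube(x: Int): Long = x.toLong * x * x

      // Four nested loops over every a, b, c, d whose cubes are <= n,
      // keeping the combinations where a^3 + b^3 == c^3 + d^3.
      def bruteForce(n: Long): Seq[(Int, Int, Int, Int)] = {
        val limit = math.cbrt(n.toDouble).toInt
        for {
          a <- 1 to limit
          b <- a to limit        // b >= a skips mirrored duplicates
          c <- (a + 1) to limit  // c > a skips the trivial (a, b) == (c, d) case
          d <- c to limit
          if cube(a) + cube(b) == cube(c) + cube(d)
        } yield (a, b, c, d)
      }
    }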

This brute-force implementation, with O(n⁴) performance, kinda sucks. So, how can we do better?

We Can Do Better

The first question to ask is: do we always need to calculate all the values of a³, b³, c³, and d³? Remember, the equation we are using is a³ + b³ = c³ + d³. If we solve for d³, we get d³ = a³ + b³ – c³. Thus, once we know a³, b³, and c³, we can calculate the value of d³ directly instead of looping through all values of d³.

My next implementation, again in Scala, replaces the fourth loop with the calculation d³ = a³ + b³ – c³:
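Sketched out (again illustratively, with my own names rather than the exact repository code), the trick is to derive d³ from the other three cubes and then verify that it really is the cube of an integer:

    object ThreeLoopTaxi {
      private def cube(x: Int): Long = x.toLong * x * x

      // Three loops: d^3 is computed as a^3 + b^3 - c^3, then verified to be
      // the cube of some integer d >= c before the combination is kept.
      def threeLoops(n: Long): Seq[(Int, Int, Int, Int)] = {
        val limit = math.cbrt(n.toDouble).toInt
        for {
          a <- 1 to limit
          b <- a to limit
          c <- (a + 1) to limit
          dCubed = cube(a) + cube(b) - cube(c)
          d      = math.cbrt(dCubed.toDouble).round.toInt
          if d >= c && cube(d) == dCubed
        } yield (a, b, c, d)
      }
    }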

The 2nd version has O(n³) performance since we get to skip that final loop. Neat!

Third Time’s A Charm

We’re not done yet. There is a third, and the best yet, enhancement to consider. What if we don’t need to loop over all values of c³ either, not just d³? A few things to understand:

  1. If we calculate all values of a³ and b³ equal to or less than n, we essentially have all possible values of not only a³ and b³, but also c³ and d³.
  2. The sum a³ + b³ is equal to the sum c³ + d³.
  3. If the sum in #2 above for a given pair (a³, b³) matches the sum of a different pair, we have in essence found the (c³, d³) pair.

If we store every combination of the sum of a³ + b³ and the corresponding pair (a³, b³), any sum that has two pairs means we have found a³ + b³ = c³ + d³ where the first pair in the list can be considered (a³, b³) and the next (c³, d³).

For example, if we iterate through the combinations of a³ + b³, we will store the sum 1729 with the pair (1³, 12³). Continuing to iterate, we will see another sum of 1729 arise, but this time with the pair (9³, 10³). Because we have two different pairs both summing to 1729, we have found a Ramanujan Taxi that solves for a³ + b³ = c³ + d³.

In the third version, we use a HashMap to store the sum (key) and the corresponding list of pairs as a sorted set (value). If the list contains more than one pair, we’ve got a winner!
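A sketch of this version (illustrative, with my own names, though it uses the same HashMap-of-sorted-sets idea) groups every pair by its cube sum:

    import scala.collection.mutable

    object HashedTaxi {
      private def cube(x: Int): Long = x.toLong * x * x

      // Two loops: index every pair (a, b) by the sum a^3 + b^3. Any sum that
      // collects more than one pair gives us a^3 + b^3 = c^3 + d^3.
      def ramanujanSums(n: Long): Seq[(Long, List[(Int, Int)])] = {
        val limit = math.cbrt(n.toDouble).toInt
        val sums  = mutable.HashMap.empty[Long, mutable.SortedSet[(Int, Int)]]
        for {
          a <- 1 to limit
          b <- a to limit
        } sums.getOrElseUpdate(cube(a) + cube(b), mutable.SortedSet.empty[(Int, Int)]) += ((a, b))

        sums.iterator
          .filter { case (_, pairs) => pairs.size > 1 }        // more than one pair: a winner!
          .map    { case (sum, pairs) => (sum, pairs.toList) }
          .toSeq
          .sortBy(_._1)
      }
    }

For example, ramanujanSums(2000) should return a single entry: 1729, with the pairs (1, 12) and (9, 10).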

This implementation has O(n²) performance since we only need two loops to calculate the combinations for a³ and b³. Very neat!

I suspect there is a fourth optimization where we only need to calculate values of a³ and derive b³ from a³ (the ‘b’ loop is just an offset of the ‘a’ loop) with O(n) performance.

Another challenge is to rewrite the implementations in a functional programming style. I’ll leave that for you to explore.

An Amazing Movie, an Amazing Man

After watching The Man Who Knew Infinity, I was in awe of Ramanujan’s genius. By implementing his taxi algorithm — with its several performance optimizations — I got a glimpse of the beauty he saw in “No, Hardy, it’s a very interesting number!”

Ramanujan’s Taxi, at almost a century old, is still yielding new discoveries. Mathematicians at Emory University have found that the number 1729 relates to elliptic curves and K3 surfaces, objects important today in string theory and quantum physics.

I expect we have only scratched the surface of Ramanujan’s taxi cab number and the man’s amazing genius.

Article originally appeared at Free Code Camp


Ladders Engineering Cultural Values

The Ladders team is very excited to share with the world our Engineering Cultural Values. We hold these truths to be self-evid…just kidding. Creating a set of inspiring, long-lasting, and fundamental values isn’t easy. It is a reflective process involving many different, but often surprisingly similar, opinions.

We asked ourselves:

  • What does it mean to be an engineer at Ladders?
  • What do we expect from ourselves, our colleagues, and our company?
  • What do we take pride in and where do we need to do better?
  • What are our core values that will form the bedrock now and in the future?

Wow! That is a lot of stuff to consider. After many brainstorming sessions, inspiration from other companies (thank you, Zillow and Amazon), and plenty of tossed or consolidated ideas, we came up with 7 core cultural values.

Without much further ado, we present Ladders Engineering Cultural Values.

Ladders Engineering Cultural Values

AWS IO Performance: What’s Bottlenecking Me Now?


When moving to AWS, it can be difficult to pinpoint where your previously high-functioning system is underperforming. AWS has its own internal rhyme and reason, and anyone moving there must become familiar with how it operates. One tool that can help you on this front is AWS CloudWatch, which provides metrics and monitoring on core system performance. The focus of this article is how to configure your IO subsystem in AWS, which requires using CloudWatch to understand your subsystem’s performance. Are you hitting throughput limits? Are you utilizing all your allocated IOPS? What’s the average queue length?

Defining all these terms from scratch is outside the scope of this article – we assume you have some basic familiarity with AWS I/O terminology, or at minimum a few links to get you started. Our goal here is instead to provide some high-level guidelines on how to begin thinking through your I/O performance in AWS.

It’s Not Just the Volume

Let’s take a step back for a second to let the implications of using EBS with our instances sink in. EBS (Elastic Block Store) is a type of durable, persistent, network-attached storage available for AWS EC2 instances. Emphasis here goes on “network-attached”: whenever you read/write to an EBS volume, there is an unavoidable network component involved that sits directly between the EC2 instance and the EBS network-attached storage. In many cases, this network component will be your bottleneck unless you are running on a particularly capable instance network setup (2), since it isn’t very difficult to hit the per-instance limits discussed below.

The first step in assessing your IO woes is validating this network connection from your instance to your EBS volume. To further complicate things, your performance is limited not only by the volume you provision, but by the instance you attach said volume to! Amazon lists the network limits per instance type, ranging from 4K IOPS and 500MiB/s limits to 65K IOPS and 10GiB/s limits, so check your documentation! It is incredibly important that you take into consideration your expected disk throughput and IOPS when selecting an instance. A perfectly performant IO subsystem can’t outshine a low-throughput network connection to an instance, so be sure to test with every application.

When asking “what’s bottlenecking me now?”, both the EBS volume type and the instance type have to be considered when setting up your system. You could provision 20,000 IOPS on an io1 disk, but because of instance bottlenecks, max out at around 8,000 IOPS for a throughput of 260MiB/s. That’s 12,000 IOPS you’re paying for but never using, all because you forgot to consider your instance type as a bottleneck.
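As a rough sanity check on those numbers: throughput is roughly IOPS multiplied by I/O size, so assuming an illustrative I/O size of about 32 KiB (my figure, not one from AWS documentation), 8,000 IOPS × 32 KiB ≈ 250 MiB/s, which lands right around that 260MiB/s ceiling.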

When consulting AWS documentation about your instance, one should be cautious about taking the listed IO limits at face value. In a fair bit of documentation, AWS achieves the listed instance benchmarks only by  assuming a read-only workload in “perfect weather.” The reality is that these numbers are rounded approximations of best-case scenario performance, using block sizes that may not be relevant to your application. Read your documentation carefully and be prepared to test your assumptions. Of course, this is just the tip of the iceberg – you’ll need to weigh in other factors as well, such as the file system being used, the read/write workload, and whether options like encryption are enabled on the EBS volumes connected to the instance.

What About the Disk?

Go ahead and start monitoring your instance to quickly determine where you are particularly struggling in the IO department (if at all!). AWS CloudWatch metrics make it very easy to validate whether your high-throughput I/O application has disk-related performance bottlenecks. Combining this with standard disk benchmarking tools at the OS level will allow a clear picture of your IO subsystem to emerge. We’ve found it is fairly simple to get an IO subsystem to outperform the EBS-to-instance network connection (RAID setups come to mind), so after validating the network pipe limitations, you’ll be much better equipped to provision your EBS volume appropriately for desired IOPS and/or throughput.
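As one way to pull those numbers, the AWS CLI can fetch per-volume CloudWatch metrics such as VolumeQueueLength (the volume ID and time window below are placeholders):

    # Average queue length for one EBS volume, in 5-minute buckets.
    # The volume ID and time window are placeholder values.
    aws cloudwatch get-metric-statistics \
      --namespace AWS/EBS \
      --metric-name VolumeQueueLength \
      --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
      --start-time 2016-09-01T00:00:00Z \
      --end-time 2016-09-01T06:00:00Z \
      --period 300 \
      --statistics Average

Swapping in VolumeReadOps, VolumeWriteOps, or VolumeReadBytes gives the IOPS and throughput side of the picture.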

Appropriate drive selection depends heavily on your primary performance attribute (2): IOPS or bandwidth. For higher-IOPS applications, SSD volumes are available in general purpose burst (gp2) and provisioned IOPS (io1) formats. SSDs are more expensive than their HDD counterparts and are optimized around low-latency, smaller-block-size operations. Most general purpose and mission-critical applications are just a matter of appropriately sizing one of these two SSD types. The other EBS storage options offered by AWS are throughput optimized (st1) and cold (sc1) HDD volumes. What is unique about these storage types is their maximum bandwidth per drive, 500MiB/s and 250MiB/s respectively. These HDD volumes require a larger block size and a preferably sequential workload to perform effectively, due to their painfully low IOPS limits (500 and 250 IOPS respectively).


At the end of the day, most AWS applications outside of big data/log processing will boil down to a choice between io1 and gp2 volumes. HDD volumes have their place, but their extreme IOPS limitations must be carefully weighed, even though their admirably high bandwidth limits might be appealing.

EBS Snapshots: A Lesson in Prudence


Let’s turn now to some of the functionality provided in tandem with EBS volumes. AWS offers powerful snapshot functionality in conjunction with EBS volumes, allowing the user to make point-in-time backups of a given volume. Better yet, these snapshots are incremental – your first snapshot of a volume will take time, but any subsequent snapshot of the same volume will only snapshot blocks that have been modified since the last snapshot. AWS accomplishes this through a bit of smoke and mirrors that ultimately boils down to two things: pointers and S3. In short, all snapshotted blocks of a given volume are stored in S3. When a new EBS volume is spun up from a snapshot, AWS will initialize the volume using the newest blocks from all snapshots associated with the original drive. This means that an end user can set up a rolling backup schedule for a given EBS volume, and be assured of both data integrity to a point in time, as well as speedy incremental snapshots.

Hearing all this, it is easy to get carried away. Incremental snapshots + speedy initialization of a drive from a snapshot + programmatic control of instances and volumes via AWS API calls = all sorts of possibilities, right? One might think of backup systems, easy setup / maintenance of slave nodes or QA environments, or perhaps a data refresh scheme using snapshots? Sounds good, but stop for a second and remember: THERE IS NO FREE LUNCH. Not here, not anywhere. We’re dealing with computers, and everything has a trade-off cost. So where are we paying here?

Remember when we mentioned how AWS stores snapshot blocks in S3? So how do we get those blocks back from S3, an entirely separate system, into a drive that we initialize from a snapshot? Indeed, AWS does you no favors (or perhaps one too many favors!) on this – when you navigate a volume spun up from a snapshot, all your files look like they’re there!

Or are they? Perhaps you might notice slight performance degradation when you access a file from your snapshot-initialized volume. The first time you open a file, your IO performance is poor; the second time, much better. This pattern repeats for every file on your volume.

There it is – no free lunch. When a drive is initialized from a snapshot (whose blocks are stored on S3) and then attached to an instance, AWS fetches blocks from S3 lazily. Until you access a given block on your disk, it is not physically on your volume. So the first time you access a block, you’re paying two costs: network latency to fetch from S3, and then IO latency for reading the block from disk. And to make things worse, recall from earlier in the article that an EBS volume is essentially storage accessed through a network, which means you have even more layers of latency and bottlenecks to work through for an IO read!

Okay, so what can we do? Well, if your volume is small, and/or you have time on your side, there are a few options. The main option suggested by AWS is outlined here: in essence, use the dd tool (3) to pull all blocks from your newly-initialized volume and redirect the output to /dev/null. In other words, read everything so that all blocks are pulled from S3 (the drive is “pre-warmed”), and then discard those reads. Now you’ve eliminated the S3 latency penalty.
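Assuming the snapshot-initialized volume shows up as /dev/xvdf (device names vary by instance and OS, so confirm yours first), the pre-warm read described above might look something like this:

    # Read every block once and throw the data away; this forces all blocks to be
    # pulled down from S3 so that later reads pay only the normal EBS latency.
    # /dev/xvdf is a placeholder device name - check yours with lsblk first.
    sudo dd if=/dev/xvdf of=/dev/null bs=1M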

Of course, depending on the size of your drive, this process can be prohibitively slow. It also requires extra manual work or automation from your engineers, which can backfire depending on the complexity of your implementation.

None of this is meant to deter you from making use of snapshots – they’re a wonderful, well-implemented tool. But as with any tool, they come with drawbacks and gotchas, and this is a biggie. So just remember: EBS snapshots are back-ended by S3, and blocks for snapshot-initialized volumes are fetched lazily.

AWS is Not Magic

Your best bet is to test the network connection directly via an I/O load test utility (such as SQLIO.exe or diskspd.exe in the Windows world). Generating an insanely high IO/bandwidth workload is very easy when you’re using the right utility, and it allows you to validate the EBS-to-Instance pipe and disk limitations at not only the AWS layer (CloudWatch), but at the OS layer as well. You can easily find the network pipe limit in terms of bandwidth or I/O using an overly performant drive setup and a correctly tuned load tester utility, such as RAID 0 volumes and diskspd.exe (RAID0 being what it is, of course, exercise caution!).

AWS offers flexibility and power that far exceeds that of your average data center, but (at risk of a groan) with great power does indeed come great responsibility. You MUST be prepared to play by AWS’s rules and not go by conventional datacenter wisdom. This means a few things:

  1. Live and breathe AWS documentation.
  2. At the same time, believe nothing and test everything in documentation.
  3. Perform your proof-of-concepts at scale and as robustly as possible.
  4. Remember that AWS services are nothing more than complex abstractions – you will face bottlenecks from things AWS lists, and bottlenecks from things they never mention once. We refer you back to points one through three.

If nothing else, the intent of this article is to convey the complexity of moving to AWS through the lens of configuring I/O, and to offer some suggestions and principles on how to navigate the world of AWS. The payoff of moving to AWS can be wonderful, but AWS is not magic. Learn to navigate by its rules and you’ll be rewarded – otherwise, you’ll wonder what all the hype was about!

Reference Links

Additional Reading
