In some ways it feels like this year flew by, in others like it was as long as an eternity. Looking back, it definitely surprises even me how many significant events have happened in these 12 months. I know one can say that about pretty much every year but it feels more pronounced this time. Without much ado, here is a recap of how this year has been.

Team of one no more

By the beginning of 2017, we had secured an accepted offer from Bill, who was to become Sendgrid's second in-house DBA. While the larger tech ops org had always had my back during maternity leaves and vacations, and we have always had strong support from the folks at Pythian, it was clear that we were overdue for the in-house DB Ops team to no longer remain a one-woman team.

This was not just a great thing for the team and Sendgrid, but also tremendously useful for my career growth. I had been running on fumes for a while and had started to chafe at my day-to-day tasks, itching to get more involved in larger strategy planning rather than drowning in daily tactical work. But to do all that, I also needed to learn to let go of some things I had had full control over for a while. Now, almost a year later, I have realized that having a trusted partner on my team has led to serendipitous changes, like letting engineering teams write the chef cookbooks for managing their new database clusters. This process of ‘letting go’ was a crucial first step to the next big career event of 2017 for me. I was told recently that in the past year I have become more pragmatic, and I think this was the single most unexpected but welcome result of no longer working in isolation.

New title…a whole new role

It was clear to me in 2016 that I was hitting a ceiling in career growth if I remained a single-person team doing tactical work all the time. Once the team grew by 100% (😃), I was able to focus more on higher-level planning: getting more involved in architecture blueprints, writing more blog posts, and planning my first ever conference talk (more on all that later).

By late 2017, I earned my promotion to Principal DBA which, while the next step in the IC career track at Sendgrid, is also a role that is far more about leadership than ‘heads down on code’. Since this promotion, my role has shifted from working on code to lots and lots of planning and reading of architecture blueprints. Yes, it means more meetings…but it also means having a broader impact on how we build or rebuild pieces of our architecture and supporting entire teams. If being a ‘senior engineer’ requires one to be a force multiplier within their team, it feels like being a principal engineer demands that ten times as much, and with a much larger impact across the organisation.

I am still working on accepting the changes that have come with this role change; my calendar is certainly a testament to that, and my standup updates are almost always a mix of ‘Talked to Person X about project Foo and Person Y about project Bar’. Getting my head as an engineer out of classifying these conversations as ‘non work’ and instead recognising them as critical collaboration within a distributed engineering organisation is now part of my job. As my dear friend Sean Kilgore said in a tweet: “My specialty is random conversations. And all of the gdocs suggestions.” He meant it in jest but I think this is not a bad goal to have in 2018 😃

Blog posts

While technical posts such as How we encrypted our backups are always fun and rewarding once a project is complete, it seems the most popular post I wrote this past year was a more personal one on management vs leadership. I will not rehash my thoughts on the subject here, but I will note that, for all the focus many people in tech put on the code and the tools, it is the blog posts on the human-interaction side of the job that seem to get a lot more attention and spark more discussion. That is a good thing. It is about time we stopped pretending this field doesn’t have human-to-human responsibilities.

Second tech conference, first talk

One of the highlights of 2017 was being able to attend and speak at LISA. LISA is one of the oldest tech conferences around and, through its selection of chairs and talk chairs, has done a great job of being inclusive to attendees and first-time speakers, myself included.

My talk was nominally about working with DBAs in a DevOps world, and that was in part what I talked about, but the title was also more of a hook to talk about “how to architect products by talking to people”. I wrote a preview blog post which was very helpful in shaping exactly what I was going to cover beyond the outline of the proposal. I also got lots of help and support from wonderful people like Alice Goldfuss and Connie-Lynne Villani, both in encouraging me to submit to LISA and at the conference itself, where I was a ball of nervous energy until I was done presenting.

LISA was also my second ever tech conference to attend, and I realized that I now much prefer conferences with really diverse attendance. I got to meet so many fellow women in tech and had plenty of awesome conversations that I will cherish for a long time, and it was all because of a planning committee that put real work into making sure the conference had a friendly and inclusive environment.

IPO!

January 2018 will mark six years working for Sendgrid. When I joined, I had no idea that the business of delivering emails was…a business…and possibly a lucrative one. It was a highlight of this year, and possibly the rest of my career, to be in NYC and watch in person as the company I have poured years of hard work into went public and had its day in the sun. While this is only a milestone and not a destination, it is definitely a milestone many companies strive to accomplish, and it felt good to be a part of it.

Wrap-up

It has been a very busy year indeed. I have not yet thought about any personal goals in 2018 but I sure have plenty of them professionally. If it turns out to be half as awesome as 2017 has been, it won’t be too bad. 😉

This post is a flight of ideas. Blame Charity for getting my brain going.

This morning, I came to Twitter to find this thread of tweets from the always awesome Charity Majors.

This got me thinking about how it seems many people in our field, and by consequence so many companies, have adopted the attitude that leadership comes from above. That you must have a certain title to be able to lead in your organization.

In individual contributors, this can be a sign of lack of maturity. An engineer should not be considered ‘senior’ unless they exhibit a sense of ownership towards the business and a sensibility towards what’s best for the customers. Not only that, but a senior engineer should also be capable of influencing engineers around her to ‘do it right’. That sometimes means the longer way and sometimes even means the less ‘cool’ tech. A senior engineer, even if not a ‘team lead’, should be able to argue, with data, for solutions that provide the most business value with the least amount of risk.

Similarly, if junior/associate engineers are to grow in their careers, they should be encouraged to find their passion and translate it into business value. As engineers, we may be insulated from customer-facing work by having support teams and customer success reps, but that should not mean we cannot be aware of how the quality of our work impacts the customers and our co-workers who have to talk to the customers.

Sadly, we do not learn enough of that in our fancy CS degrees. Too much emphasis is put on algorithms and software and not enough on understanding how to talk to the business side of the company, how to have empathy for customer-facing team members, and how to behave and think as one team that is ultimately providing a service to paying customers. I was guilty of this as a fresh grad. I was looking to write code and play with software that was new to me, and handling customers’ issues was a ‘nuisance’ that I just had to deal with. It took the fall of my first company to teach me that code and elegant design are nothing if they are not providing business value and solving a problem for a paying customer.

This is also a problem with managers. Micromanagers, among many other harms, reinforce the idea that only managers can decide how things are done and make decisions. Managers who do not advocate for their team and do not know how to say no during critical organizational planning imply to the team that they cannot drive excellence themselves, but have to be told what to do by some plan laid out by executives.

Now, that is not to say that the executive perspective is not of value. Far from it: an executive is in the position of knowing the market landscape a company is competing in, and owns the strategy of the organization for developing an edge in that market. But the executive cannot be expected to know the details and severity of the tech debt the company has accumulated, and which parts of that tech debt most endanger achieving this needed business edge. Without individual contributors understanding their company strategy and making the connections between “this service is old and needs a refactor” and “we need this to make that new product scalable and an easier sell”, the executive team may never see their lofty plans come to fruition, and ultimately the business will lose customers.

So what can be done to make this better?

Yes, it involves everyone…

It is important to make ‘sense of ownership’ a part of performance reviews for everyone, not just team leads and line managers. It should be something junior individual contributors strive to internalize in order to grow in their careers, and it needs to be something senior team members have to exhibit and are scored on. Remember, it is what goes into the performance review that shows what the company really values, since it is literally ‘putting money where your mouth is’. One thing that is very dangerous is mistaking an individual contributor sounding the alarm about an architectural problem for being “a cynic” or a “pessimist”. That is especially a problem women in tech face, as we are expected to just be merry and happy all the time, and when we point out issues during design reviews, it tends to be seen as being ‘brash’ or ‘harsh’. Actively burying concerns from team members and chalking them up to ‘personality’ or the ever non-inclusive phrase ‘team fit’ will only spell long-term dysfunction for your organization. Ignore at your own risk.

Line managers, those whose direct reports are all individual contributors, need to constantly let their reports bubble up pain points in the company tech stack. Involve them in the roadmap planning process. Make sure to communicate the strategy to them, not just ‘here is what we will be doing’. For many people, knowing the why goes a very long way towards being invested in doing the best possible job. Line managers should encourage the quiet ones to still participate in this discussion, even if not in front of the whole team. Sometimes the best feedback on ‘what needs to be fixed’ comes out of 1:1 conversations. Not everyone is comfortable voicing these concerns in a group.

Finally, directors and executives should be open to feedback from all levels of the organization. Do not wait for it to come to you. At my current company we do a biannual survey that is deliberately anonymous but allows everyone to provide the executive team with feedback from all levels of the company. It is super important not just to solicit this feedback but also to transparently create action items based on it and report back to the company on the progress of those action items.

Thanks to Charity and Nicole for sparking this post and to Camille Fournier’s book that has given me a great perspective into management.

I have spoken before about how important it is for me and my team to make as many parts of the database stack as possible match our larger infrastructure. One of the most crucial ways to do this is to make sure that not only are we deploying and managing databases using configuration management, but that our cookbooks remain in lockstep with the conventions our cookbooks at large are moving towards.

When I first learned chef, the way to test your chef cookbooks was chef minitest. But that is now deprecated and no longer the recommended way to test chef cookbooks. So what is a DBA who is not always in tune with chef land to do? Learn from her teammates, of course! 😀

With the help of our ops engineering team members, I was directed to some examples in our newer cookbooks and to the resources supported by InSpec, which is what chef audit mode is based on.

So how does chef audit mode work?

Chef audit code works exactly like recipe code. In fact, you can write the audit code right inside the recipe it is auditing. Its language is very natural, which makes it easy to write before the actual chef code (hint hint: TDD FTW!), and it has lots of resource types that make for simple, easy-to-read-and-maintain tests.

So if I have a cookbook that was using minitest, how can I move it to chef audit?

You need to decide where your audit code will live. Your options are:

  • One gigantic audit recipe that is included in your run list however you normally include recipes in the cookbook. This option puts all the code in one file, but that file can get unwieldy and large as the cookbook gets more complex

  • An _audit_foo.rb recipe for every foo.rb recipe you already have. The audit recipes have to be separately included as well. This is arguably the most organized way to do it and can work very well for cookbooks with lots of internal recipes, where one combined audit file would get very large. But it can also feel like recipe sprawl.

  • Add the audit code directly inside each recipe. This is nice because you can then see, in the same file, both the code that makes the changes and the audits that validate the policies around those changes. But again, this can get harder to use if the individual recipe is long or complex

Each of these, as you can see, has pros and cons. The good thing about this flexibility is that you can pick what works best for your cookbook or organization :D
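As a sketch of the second option (the cookbook and recipe names here are hypothetical, just for illustration), the audit recipes get included in the run list right alongside the recipes they cover:

```ruby
# recipes/default.rb of a hypothetical mysql_backup cookbook
include_recipe 'mysql_backup::backup'
include_recipe 'mysql_backup::_audit_backup'
```

The underscore prefix is just a naming convention to make the audit recipes easy to spot next to the recipes they validate.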

Now onto an example…or two…

Say you have this code block in a recipe to drop a script file in a specific spot and make it executable:

cookbook_file '/usr/local/bin/test_backup.sh' do
  source 'test_backup.sh'
  owner 'mysql'
  group 'mysql'
  mode 0o755
  action :create
end

The code to audit this block would look like this:

control_group 'MySQL BackupTests : Archive access' do
  control 'test_backup.sh' do
    describe file('/usr/local/bin/test_backup.sh') do
      it { should exist }
      it { should be_file }
      it { should be_executable }
      it { should be_owned_by 'mysql' }
    end
  end
end

As you can see, the audit part is very natural in its language, and the resources are quite simple to use. So how do we tell chef to run this audit code in our test environment?

Presuming you use test kitchen, you need to edit its config to enable audit mode. Under the provisioner section of your .kitchen.yml config file, add these two lines:

client_rb:
  audit_mode: :enabled
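For context, the full provisioner section might look something like this (the chef_zero provisioner name is an assumption here; keep whatever provisioner your kitchen config already uses):

```yaml
provisioner:
  name: chef_zero
  client_rb:
    audit_mode: :enabled
```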

So what are the advantages of audit mode over the old minitests? Well, besides the fact that minitest is deprecated, I find that having the test code as part of the recipe tree (and better yet, potentially in the recipe directly) gives a very nice single view of what every recipe should look like and what the state of the host should be as a result. That should come in even handier for anyone trying to look at a cookbook I wrote and understand what it is supposed to do.

Having started a journey with chef a few years back, converting my knowledge of how to build our databases into repeatable cookbooks, I will be spending the next months with the rest of our data ops team converting our test resources into chef audit code.

In case you don’t follow SysAdvent (in which case, why?? You are missing out on great content every year!), the post on December 6th this year was by Alice Goldfuss. I tweeted a link to that post that day, and since then my tweet has been quoted and responded to, and the ensuing ‘conversation’ has been eye-opening.

A lot of the response has been anger that an operations engineer is advocating for putting developers on call. Many were quick to suggest that this is a quick way to lose engineers, that it advocates a terrible quality of life, that it stomps on work/life balance, and that it ultimately doesn’t serve to improve code quality.

The most surprising response was the suggestion that this call comes from the privilege of not having family/kids responsibilities, which I suppose implies that only single people are in ops and carrying pagers.

I can go on and on about how little empathy I have seen from people who are supposed to be fellow engineers towards their fellow human ops people.

I can go on and on about how many naively suggested that testing and ‘QA’ are sufficient to spare the people who wrote the code from owning their work in the real-world environment of a system at scale.

I can go on and on about what it says about people who presume Alice meant “give the developer a pager and send them off into the night”…I suppose that’s fine when you do it to the ops gal instead.

There is a lot to unpack in these responses, many of them disappointing from engineers I respect, but I want to pinpoint two things.

What does it mean to put engineers on call

NO, it does not mean spite, and no, it is not punishment for bugs, and no, it is not a call for dissolving the lines between life and work. I have written about my experience with burnout before, and I wish that on no one.

What it means to put the engineers who write the code on call is that they are the subject matter experts on what is running in production. They know that when error foo happens, it means the cache layer failed, DNS is taking too long, or maybe that one of the dozen or so microservices comprising that product is the one silently failing. You may think an architecture diagram can make stuff like this clear as day, but when YOU, the person on the team who has been building this thing for two quarters, have to look at your own architecture diagram a quarter later to grok which microservice has failed, you will realize why you should be paged first.

This does not mean at all that Ops wants nothing to do with your application paging at 3 AM. We put our delivery engineering teams (the term ‘delivery’ here is deliberate and very appropriate) on call, but they always have an escalation path to ops that is not gated by a timer. Ops still has your back. If you think this is a network problem you are not familiar with, you can immediately page the on-call ops engineer and she will help verify where the broken zeros and ones are.

This is not an effort to spite those who dare push a bug to production. Anyone who has been responsible for production environments knows that will happen. Hell, you don’t even HAVE to introduce bugs through a deploy, really. Unforeseen changes in your environment will cause assumptions to stop being valid and take both those who wrote the code and those with less knowledge of it (Ops) by surprise.

What this does mean is that we are explicitly saying that the code is useless unless it provides the customers the value they paid for. And when a C level executive has to explain to customers why an outage took X time to resolve, “we couldn’t get the engineering team involved for some time” is not an acceptable answer.

Your code and my servers and databases are a risk center unless we are both invested in making them run. It is as simple as that.

What does it mean to be a senior engineer

It honestly frightened me that people who are senior engineers are balking so hard at giving developers pagers. Maybe they fell for the hyperbole of “we now don’t want developers to sleep either”, maybe they’ve been on call in difficult environments before and that’s the PTSD talking. But I do not believe that any engineer should be called ‘senior’, by title or by implication if they refuse to be reachable in a team rotation in case their own work caused a customer facing issue.

Note my words, because I am trying my best to choose them carefully: ‘customer facing issue’. You can’t give people pagers without making sure you are not paging on bullshit signals that aren’t actually affecting the customers and the business. And surprise, ops engineers are people too.

So what does this mean? If you fail to see how your code is more than the sum of its functions and tests, that it needs to provide real value, and if you insist that someone else be the first line of defense when it breaks, then you are failing to acknowledge that, as a senior engineer, your job is more than producing code. You are setting a terrible example for the junior engineers on your team. They will learn the lesson of “I don’t have to own that my code provides value once it’s in production”. The damage that mentality does to an engineering organization is lasting; it takes a long time before anyone realizes it is the root of a LOT of tech debt, and it takes years to reverse.

If you are a manager promoting engineers to senior titles, and an emphasis on ownership, including stability, is not a non-negotiable criterion for that promotion, then you are damaging your organization. And if you think that a sense of ownership and a true understanding of what it takes to create stable systems of scale can happen without ever being paged by a service in production, then I am not sure why you are in charge of engineers supposedly building such systems.

Finally, no one is saying this all ignores our lives outside work. In fact, the opposite. Ops engineers have lives too. We are men and women with spouses and babies who get sick and sometimes are solo parenting. Mature teams have each others’ backs. Mature teams will override set on call shifts when life strikes and that is applauded.

What I (and I think Alice) am saying is “do not systematically make it acceptable to throw code at the production wall”…not “let’s page developers at 3 AM out of spite”.

Note: This is inspired by Julia Evans’ recent post about…capacity planning 😌

Ground rules

RDBMS

Yes…this post is geared towards those of us who use MySQL with a single writer at a time and two or more read replicas. A lot of what I will talk about here applies differently, or not at all, to multi-writer clustered datastores, although those also come with their own set of compromises and caveats. So…your mileage will definitely vary.

Sharding

I have already covered large strokes of this in one of my earlier posts, where I mostly focused on the benefits of functional or horizontal sharding. Yes, that is a prerequisite, since what you use to access the database layer WILL decide how much flexibility you have to scale.

If you are a company that experiences large differences between peak and average traffic, you should be prepared to leave the paradigm of ‘the database’ as a single physical entity behind.

Ability to split reads and writes

This is something you will need to be able to do, but not necessarily enforce as a set-in-stone rule. There will be use cases where a write needs to be read very soon after, and where tolerance for things like lag/eventual consistency is low. Those are OK to have, but in the same applications you will also have reads that can tolerate a longer span of eventual consistency. When such reads are high volume, do you really want that volume going to your single writer if it doesn’t have to? Do yourself a favor and make sure, early in your growth days, that you can control the use of a read or write IP in your code.
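As a tiny sketch of what that control can look like in application code (the endpoint names and the helper itself are hypothetical, not from any real Sendgrid code), the routing decision boils down to “does this operation write, and can this read tolerate lag?”:

```ruby
# Hypothetical endpoint names, for illustration only.
WRITER_HOST  = 'db-writer.internal'
REPLICA_HOST = 'db-replica.internal'

# Pick a database endpoint: writes always go to the writer; reads that
# must immediately see their own writes go to the writer too, while
# lag-tolerant reads can fan out to the replicas.
def db_endpoint(write:, lag_tolerant: false)
  return WRITER_HOST if write
  lag_tolerant ? REPLICA_HOST : WRITER_HOST
end
```

The point is that the application chooses per query, so the high-volume, lag-tolerant reads stop competing with writes for the single writer’s capacity.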

Now onto the thought process of actual capacity planning…

A database cluster is not keeping up. What do I do?

Determine the system bottleneck

  • Is the issue high CPU?
  • Is it IO capacity?
  • Is it growing lag without a clear query culprit?
  • Is it locks?
  • How do I know which it is?

You need a baseline

Once you know what system metric you are mostly bound by, you need to establish baseline and peak values. Otherwise, determining whether your current issue is a bug or real growth is going to be a lot more error-prone than you’d like.

Basic server metrics can only go so far; at some point you will find that you also need context-based metrics. Query performance and app-side perceived performance will tell you what the application sees as the response time to queries.

Learn your business traffic patterns

Are you a business that is susceptible to peaks on specific weekdays (marketing)? Do you have regular launches that triple or quadruple your traffic, like gaming? These sorts of questions will drive how much reserved headroom you should keep, or whether you need to invest in elastic growth.

Determine the ratio of raw traffic numbers in relation to capacity in use

This is simply the answer to “If we made no code optimizations, how many emails/sales/whatever can we serve with the database instance we have right now?”

Ideally, this is a specific value that makes planning a year’s growth a simple math equation. But life is never ideal, and this value will vary with the season or with completely external happy factors like signing a major new customer. In early startups this number is a faster-moving target, but it should stabilize as the company transitions from its early days into a more established business with more predictable growth patterns.
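As a back-of-the-envelope sketch (all the numbers and the helper below are made up for illustration), once you have that ratio, the planning math is just division plus headroom:

```ruby
# Given a projected daily volume and how much volume one replica can
# serve per day at full utilization, how many replicas do we need if
# we want to keep some fraction of capacity as headroom?
def replicas_needed(projected_per_day, per_replica_per_day, headroom: 0.3)
  usable_per_replica = per_replica_per_day * (1.0 - headroom)
  (projected_per_day / usable_per_replica).ceil
end
```

With made-up numbers: if one replica can serve 2 million emails a day flat out and we project 10 million a day, keeping 30% headroom means each replica should only carry 1.4 million, so we need eight replicas.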

Do I really need to buy more machines?

You need to find a way to determine whether this is truly a capacity problem (I need to split the writes to support more concurrent write load, or add more read replicas) or a code-based performance bottleneck (this new query can really have its results cached in something cheaper and not beat up the database as much).

How do you do that? You need to get familiar with your queries. The baby step is a combination of innotop, the slow log, and Percona Toolkit’s pt-query-digest. You can automate this by shipping the DB logs to a central location and automating the digest portion.

But that is also not the entire picture; slow logs are performance intensive if you lower their threshold too much. If you need less selective sampling, you will need to capture the entire conversations between the application and the datastore. In open source land you can go as basic as tcpdump, or you can use hosted products like Datadog, New Relic, or VividCortex.

Make a call

Capacity planning can be 90% science and 10% art, but that 10% shouldn’t mean we don’t strive for as much of the picture as we can get. As engineers, we can sometimes fixate on the missing 10% and not realize that, if we did the work, that 90% would get us very far towards a better idea of our stack’s health, a more efficient use of our time optimizing performance, and carefully planned capacity increases, which eventually results in a much better return on investment for our products.