This post is inspired by quite a few great posts, like this one by Ryn Daniels and Lara Hogan.

2015 was a year of many changes and a lot of growth for me, both professionally and personally. It also held a lot of firsts.

Going remote

Late in 2014, I announced to SendGrid that I was about to move to South Florida, where my husband had been working for months. SendGrid, while distributed across multiple offices, had always had all of its employees in one of those offices; there were no truly remote employees. So this was new territory, and it was a new challenge given how much I usually collaborate with more than just the Operations team.

After a year, while I would not declare it an absolute success, I feel that I would not still be working at SendGrid if it weren't for my team's and managers' support. It is easy to declare someone a 'good employee', with all the different meanings different companies imbue that term with. But the support I got from the team was absolutely crucial to my ability to keep providing value to the company.

First Tech conference

People still get surprised when they hear this, but until June of 2015 I had not been to a single tech conference. A combination of being permanently busy, being a mom, and moving across the country more than a couple of times had left me with no time to properly plan any conference attendance. Finally I took the plunge and picked Monitorama as my first one.

The talks were great. One talk I haven't stopped referring to since returning was "Engineering happiness" by Laura Thompson of Mozilla. I could see myself and past and present coworkers in the examples she gave of how a happy engineer slowly becomes less happy with the status quo. One of the first things I did once I returned to work was share that video with my team lead and engineering leader at the time. I felt it was important to share her advice and to internalize it.

However, conferences are not just about the talks. While I learned a lot in those, I found the hallway track the real gem of that trip and the reason that conference has left lasting memories with me.

I got to meet lots of people I had been following on Twitter for a while, people I had already learned a lot from and looked up to. I had a LOT of fun talking about being a woman in the ops community with Katherine Daniels and Jennifer Davis, who are about to publish a book I am looking forward to. Another set of conversations that were very educational, and that have pushed some of my 2016 plans, was talking about team leadership and management challenges with Roy Rapaport of Netflix.

There are too many people I enjoyed meeting and talking to at Monitorama to list here. A lot of those conversations shaped and guided things I did the rest of the year and things I have planned for 2016. I cannot recommend this conference enough. Hat tip to Jason Dixon for creating that great environment and all the #hugops.

Blog and blog posts

I started this blog early in 2015. It was not the first time I had started a blog, and I was not sure whether I was going to keep it up. But then I wrote a blog post for my company, and I decided that maybe I would want to write more.

Yes, part of that is building a personal brand, but I was also sometimes frustrated trying to talk about important things in DevOps or database architecture in 140-character pieces. I did start the blog early in 2015, but it was a blog post by Lara Hogan about celebrating our achievements, conversations with my boss in late 2014, and a personal conversation with Jennifer Davis that convinced me that I do have things to say, and that my experience so far at SendGrid could be beneficial to others who are just starting out as DBAs.

Most of what I wrote this year was technical, or 'lessons learned' from the roller coaster of managing databases at a fast-growing company. But I also wrote one post that is near and dear to my heart, about burnout, which got me to face how much I needed to take care of myself with the same level of commitment I was giving to servers and databases.

2016 Plans

  • Give a tech conference talk. I submitted my first-ever abstract. I don't know yet if it will be accepted, but I am excited to go to PLMCE for the first time nonetheless.
  • Take better care of myself. We hired our second DBA at SendGrid in late 2015. The signs of burnout in me were clear. Nothing makes that clearer than a performance review that says “I fear that Silvia works too hard. She needs to take care of herself” 😊. With the new team member onboarding, I hope to be able to split the never-ending list of things we need to get done in 2016.
  • A deeper focus on architecture and automation. I have spent a huge portion of the last few years working with engineers on schema design and bringing the first step of database configuration management to fruition. But a mature infrastructure is more than just configuration management, and I hope to grow more skills in larger system design, making the database layer more robust and a true PaaS layer.
  • Write more technical blog posts. We do so much at Sendgrid. I think we can share lots of lessons learned and I hope I can help with that.
  • Get closer to becoming a Staff Engineer.
  • Much more…

Sensu for monitoring

Here at SendGrid we have spent the last couple of years porting a lot of our service and host monitoring to Sensu. Its solid API support has meant we could write all sorts of tooling around it. We also liked the idea of standalone, client-side checks that push their status to the Sensu alerting queue asynchronously. If you are new to Sensu or haven't read up on it, this is a good place to start.

Typical usage example

The typical use I have for such standalone checks is health checks. A simple example looks like this in the Sensu client config:

{
  "checks": {
    "mysql_alive": {
      "command": "mysql-alive.rb -h <IP> -d mysql -u :::mysql.user|sensu::: -p :::mysql.password:::",
      "handlers": [
        "default"
      ],
      "standalone": true,
      "interval": 10,
      "notification": "OMG MySQL is dead!"
    }
  }
}

But if you look closer, all you are really doing is telling Sensu to run a command. So this can be…any command. This will be useful in just a moment.

What I am solving

I have traditionally used the crond service for running local management tasks on databases, like rotating partitions and triggering backups. Along with that, we had a report that would check the logs of these jobs on each backup replica and make sure they ended in success messages. This is fragile for a number of reasons:

  • Cron doesn’t have any built-in monitoring. It has no concept of ‘stale’.
  • Those reports - I will call them ‘watchers’ - are one more moving part adding complexity to the question ‘when was the last successful backup of my DB?’
  • This setup is prone to race conditions. You must time the task and its watcher in cron exactly or else the watcher can preemptively trigger an alert or signal failure when the backup is not done yet. Any drift in duration of the task will eventually make this happen (like a backup taking longer as a database grows).
  • What if the watcher script didn’t run? It is also run from cron - either on the same host or on another host, right? Either you find yourself in a rabbit hole of who watches the watchers, or a human has to notice that a report didn’t come out…and humans aren’t good at remembering things.
  • Changing the designation of a server means you must change it in a number of places or the watcher will watch the wrong host.
  • We are striving to keep our stack boring. While newer technologies like Chronos and Rundeck provide more advanced scheduling, they also need a service discovery layer to do this right. That was a bigger undertaking and too much scope creep for what I was solving.

So I decided to make Sensu work to my advantage, with the help of chef roles.

Partition rotation in Sensu

I started off by moving any credentials I need for my partition rotation script into Sensu's redacted configuration. This is good practice for anything you put into Sensu that uses secrets. The credentials are added to /etc/sensu/client.json and then referenced in check configurations using the :::secret_thingie::: notation.

_Protip_: Sensu doesn’t know what to redact in client.json. You must also define the names of the keys you want redacted, like this:

"redact": [
  "other_password",
  "password"
]

This is also not a deep-merged list. As you can see, I had to explicitly include ‘password’ once I needed to add another key to that list.
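
For context, here is a minimal sketch of what the relevant pieces of /etc/sensu/client.json can look like, with the secrets and the redact list living inside the client definition. The host name, subscriptions, and credential values below are placeholders, not our real config.

{
  "client": {
    "name": "db-replica-01",
    "address": "10.0.0.12",
    "subscriptions": ["mysql"],
    "redact": [
      "other_password",
      "password"
    ],
    "mysql": {
      "user": "sensu",
      "password": "not-a-real-password"
    }
  }
}

With that in place, :::mysql.user|sensu::: and :::mysql.password::: in a check command resolve to those client attributes, and any key matching the redact list shows up as REDACTED in the API and in logs.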

Then I needed to define the new Sensu check that rotates the partitions. I am using a chef resource as the code example.

sensu_check 'add_table_partitions' do
  command "/usr/local/pdb/bin/pdb-parted --add --interval d +7d.startof h=localhost,u=specialdbuser,p=:::redacted_partition_password:::,D=mydb,t=special_table >> /var/log/partition_rotation.log 2>&1"
  handlers %w(default_handler special_dba_handler)
  interval 86400
  standalone true
  additional(:occurrences => 3, :notification => "#{node['hostname']} failed adding partitions to special table. See log file in /var/log/partition_rotation.log")
end

Let’s inspect what just happened here…

command: pdb-parted is a very useful Perl script from the folks at PalominoDB (now part of Pythian) for rotating partitions in a MySQL DB. This is the same line I used to maintain in a crontab file.

interval: how often this Sensu check runs, in seconds. In this example these are daily partitions, so running the script daily was sufficient. The important part here is that your script is idempotent. pdb-parted is: if it finds that the needed partitions already exist, it just outputs a message to that effect and exits cleanly.

occurrences: the number of allowed failures before alerting. It is nice to have a buffer, especially since I already configure the command to create partitions days or weeks in advance.

For those who use tools other than chef and want to see what the final check configuration looks like, here it is, important bits redacted:

{
  "checks": {
    "add_table_partitions": {
      "command": "/usr/local/pdb/bin/pdb-parted --add --interval m +7m.startof h=localhost,u=specialdbuser,p=:::redacted_partition_password:::,D=mydb,t=special_table >> /var/log/partition_rotation.log 2>&1",
      "handlers": [
        "default_handler",
        "special_dba_handler"
      ],
      "standalone": true,
      "interval": 86400,
      "occurrences": 3,
      "notification": "hostname failed adding partitions to special table. See log file in /var/log/partition_rotation.log"
    }
  }
}

Partition rotation now has built-in monitoring. If the script exits with a non-zero code, the Sensu handlers take care of getting that knowledge to the predefined group of people using the method we want. We already have multiple on-call rotations, escalation paths, and all the other stuff that comes along with monitoring at scale, so we got to just leverage all of that rather than reinventing the wheel.

Planned future improvements

No solution is perfect. Neither is this one.

  • We still run the risk of more than one machine in a cluster holding the primary role. Until service discovery is in place, this is something we can build a check for by leveraging chef search (a rough sketch follows this list).
  • Now that tasks like backups run in a tool that is a first-class citizen in our stack, I can more easily add stats and graphs for how long backups take and how big they are, making capacity planning and MTTR tracking easier.
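
For the first item, here is a rough sketch of what that could look like as a standalone Sensu check, assuming knife is usable from the node running the check and using a made-up role name (this is not our production check):

#!/usr/bin/env bash
# Count how many nodes chef thinks hold the primary role for this cluster.
# "mysql_tracking_primary" is a hypothetical role name; knife's "N items found"
# header and blank lines are filtered out so only node names get counted.
ROLE="mysql_tracking_primary"
COUNT=$(knife search node "role:${ROLE}" -i 2>/dev/null | grep -Ecv '^$|items found')

if [ "${COUNT}" -eq 1 ]; then
  echo "OK: exactly one node holds ${ROLE}"
  exit 0    # Sensu reads exit 0 as OK
elif [ "${COUNT}" -eq 0 ]; then
  echo "WARNING: no node currently holds ${ROLE}"
  exit 1    # exit 1 is a warning
else
  echo "CRITICAL: ${COUNT} nodes hold ${ROLE}"
  exit 2    # exit 2 is critical
fi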

The not-so-merry go round

There is a lot of talk these days about burnout in our field, and a lot of great initiatives to get us tech people to not hide it and not sweep it under the rug.

This is a great start toward a more honest conversation about the stresses we all deal with in this industry, and yet a lot of the examples I see of people dealing with burnout involve quitting their current gig, taking a long vacation, and then working on finding the next gig.

In the long run, this is not great for companies' overall health and, more importantly, for a team's health and morale. There are already statistics showing that churn in IT is higher than in other fields, and if you are at all familiar with the effort that goes into recruiting, hiring, and onboarding in our field, you probably already know that it is an expensive process with a lead time measured in months. A lot of what I ruminated on over the last few weeks wasn't just my own words but also the words of a dear colleague who had to do this exact thing a few years back.

Seeing it

It may be irrelevant whether our industry makes us want to appear invincible or whether it just attracts those of us who want to always appear strong and have it all figured out. But I know I always try to do that. Family, kid, plus managing a database layer that has expanded twenty-fold in my three and a half years at my current job.

Let me start by saying that I do not do it alone. Early on we hired consultants to help offload a lot of the DBA tasks. But at the end of the day, I've always felt that I am the in-house DBA: I own the performance, management, and health of these systems.

This sense of ownership, while lauded, led me to unrealistic expectations of myself. I was checking work chat all the time, checking email all the time. I had slipped into a pattern of coupling being online/available all the time with doing a good job.

As the months and quarters rolled by, cracks started to appear.

I was having less fun spending time with developers talking distributed systems architecture. Suddenly, I was having the 'Sunday evening dread'…something I didn't really think would happen after working on so many exciting things and interesting problems for so long. I could see the snark levels increase in my conversations:

“yeah I will test this new shiny thing in maybe a few years”

“Sure…we will someday deprecate this old thing!”

Corroborating evidence

Sometimes, even though we know something, we need an external source with more experience to confirm that it is true, and that we aren't just 'not good enough'. For me, a lot of that came from talks by Laura Thompson of Mozilla. The first is more directed at managers, but it absolutely helped me. The second was at this year's Monitorama. All the signs were there: the decrease in my GitHub activity, the constant feeling that I was fighting emergencies all the time, physical and emotional exhaustion, a sense of ineffectiveness and lack of accomplishment (yes, I am now directly quoting the slides).

What to do

Own your boundaries

Being able to separate the time you are working from the time you are not is paramount. In the end, no one owns my well-being more than I do. I work remotely in a different time zone from my team, so boundaries were extra important to establish.

I started off by turning off push notifications for work HipChat on my phone. Not accepting meetings past 5 PM. Removing work email from the phone. And, most importantly, the laptop doesn't leave the office room during the week.

What helped me the most was turning off work email on my phone. I needed to accept that email is an asynchronous method of communication and that I shouldn't feel guilty about not checking it every hour, including right before going to sleep. I know this may sound incredibly obvious to some, but for me, constantly checking work email and work chat from my phone was like an itch, and it took me a while to accept that it was an expectation I was setting for myself, and that in the long run it was not making me a more productive employee or a better engineer in any way.

Talk to your team

The roles here differ depending on whether you are a manager or an individual contributor. None of this could work without support from my team, including management.

Managers, this is how you avoid churn. You need to make sure your team feels safe saying they need a break, and feels safe taking it. None of this 'unlimited vacation time' nonsense if no one is actually going on vacation. 1:1s are supremely important here. I am not ignoring that a busy team sometimes also means an overwhelmed manager, but this is exactly when you as a leader must prioritize keeping in touch with the team over anything else.

/soapbox

I am an individual contributor, but I am also one of the more senior team members (in tenure), so I had to accept that I carry some of the responsibility for setting the tone. Besides being honest with myself, I needed to be honest with my team. I started letting my project manager and my lead know that I would be staying off chat in the evening, making sure they had a way to get hold of me, and trusting that they would truly use it only sparingly and in emergencies. Without that framework and their support, I might know what I need, but I would not feel empowered to act on it. I am very grateful that they let me do that, and that they do the same for themselves, so we can all continue working together.

Talk to someone

This doesn’t have to be a mental health professional, although that is also a good thing to do. In the simpler sense, it helps a lot to talk to people who have been or are still going through the same thing, even with a few minor details altered. There are a lot of us in hangops.slack.com who have stories and scars from this - so much so that we have a dedicated mental health channel.

Work in progress

I mentioned in the beginning how I see most people deal with this situation. And there are many of us. Does this mean I am looking for a new gig? No. This is not a quitting post :) I like my team. A lot. And I don't just want to continue working with them, I want to continue to enjoy it, and I want them to have a good time working with me too. This work in progress has to start with me recognising what I need and communicating it, but I also have a team that has supported the steps I took to ease the stress.

I can’t stress this enough: I am still learning how to deal with this, and I know now that I always will be. It is a fine line to walk - being someone who is honest about their work and always willing to go the extra mile and do right by their employer, without sacrificing their sanity and inner peace.

I don’t want to eventually feel contempt and resentment towards that work. I am also not arguing for slacking off in the name of work-life balance. Too many companies don’t put a lot of value on keeping their employees healthy in mind as well as body, and that is a very good reason to look for another gig. There are also companies that are aware of these pressures and challenges, but they are only beginning to truly acknowledge them and start a conversation about them.

It is on all of us to stop trying to be individual heroes/ninjas/rockstars and instead promote teams of healthy, rested, smart engineers with well-balanced lives.

Many thanks to Sean Kilgore, Jennifer Davis and Charity Majors for helping put these thoughts to words.

Alfred, csshx and terminalception

I usually use Tmux, but Tmux on the Mac has not been playing nice with csshX for me. Something in the dark magic of Perl broke, with an error that looks like this:

Mar 5 08:58:42 silvias-MacBookPro.local perl[10828] : ImageIO: CGImageDestinationFinalize image destination must have at least one image
2015-03-05 08:58:42.473 perl5.18[10828:2436454] CGImageDestinationFinalize failed for output type 'public.tiff'
**** ERROR **** PerlObjCBridge:: convertPerlToObjC(): Referenced thingy not blessed
**** ERROR **** PerlObjCBridge:: convertArg() for index 2: convertPerlToObjC() failed
**** ERROR **** PerlObjCBridge:: sendObjcMessage: Error converting argument 1 for message "setObject:forKey:"
**** ERROR **** PerlObjCBridge: error [1] sending message [__NSDictionaryM setObject:forKey:] at /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level/PerlObjCBridge.pm line 248.

But I wasn’t about to let that take away my terminalception magic :)

Note: This is for a Mac environment, but you can emulate it on your favourite distro by replacing Alfred with whatever launcher you use.

What you need

The Alfred app is my favorite launcher on the Mac, but I suspect Quicksilver (if you are the quaint type) can also run commands directly in the terminal.

Next, install Homebrew. You need it to install csshx easily. Also because GNU tools :)

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install csshx

brew install csshx

If you are a chef shop like mine, a lot of the time what you are looking for is to ssh to all the machines of a certain type. In chef, we call those roles. This step will change depending on the configuration management / service discovery framework of choice in your infrastructure.

First, set the terminal app Alfred will use. I set it to Terminal because I wanted this to be separate from my usual workspace in iTerm 2.

[Screenshot: Alfred terminal settings]

Then, when I want to fire up csshX to a bunch of our servers all at once, it usually looks like this.

Protip: knife ssh can take cssh as an argument, so no awk and bash pipes required.
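
For example, the kind of command I end up typing into Alfred looks roughly like this (the role name here is just a placeholder, not one of our real roles):

knife ssh 'role:mysql_tracking' cssh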

[Screenshot: the knife ssh command typed into Alfred]

This fires up Terminal, which runs the chef search, and opens separate windows.

[Screenshot: csshX windows in Terminal]

Your cursor will be in the bottom red window by default, and input will appear in all the windows at the same time. When you are done, CMD+Q will bring up this window:

[Screenshot: the Terminal quit confirmation window]

And done :)

How it’s been

Before SendGrid, I used to deploy all my databases by hand. I'd have a documentation page (a Google doc, internal wiki page…whatever), and it would be a long bulleted list of “install this, then install this”. If you have ever maintained ‘how to’ documents like that, this picture will ring true.

[Image: code comments]

This was not a good approach, obviously. Especially when small details start changing but the ‘documentation’ lags behind. Now you have a situation that breeds tribal knowledge, which means the 3 AM on-call ops person who is not the DBA has even less ability to know what should be running on a database and what it should look like under normal operations.

Multiply by…a lot

Then came my largest deployment to date at SendGrid. We needed a data store for click and open tracking for our short URLs, and we decided to use MySQL for it. This was going to be a ton of rows, with a high demand for fast writes while supporting a lot of reads, so a single instance was not going to cut it.

Because MySQL 5.5 was the standard GA version at the time, we were still limited to a MySQL that didn't use all the cores newer server configurations could offer very efficiently. So to squeeze the most performance out of our not-so-cheap hardware, we decided to house 5 MySQL instances per box. The way to do that was to add a virtual IP per instance, use distinct data directories and config files per instance, and still make sure all 5 instances were 'equal' so that one could not starve the others of system resources.
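
In chef terms, that pattern mostly boils down to stamping out a data directory, a config file, and a service per instance. Here is a minimal sketch of the shape of it, using a made-up attribute layout, template, and service names rather than our actual cookbook code:

# Hypothetical attributes: node['mysql_multi']['instances'] maps an instance
# name to its VIP, port, and server_id. Every name in here is made up.
node['mysql_multi']['instances'].each do |name, inst|
  directory "/var/lib/mysql-#{name}" do
    owner 'mysql'
    group 'mysql'
    mode  '0750'
  end

  template "/etc/mysql-#{name}.cnf" do
    source 'my.cnf.erb'
    variables(
      bind_address: inst['vip'],
      port:         inst['port'],
      server_id:    inst['server_id'],
      datadir:      "/var/lib/mysql-#{name}"
    )
    notifies :restart, "service[mysql-#{name}]", :delayed
  end

  # Assumes an init script or unit exists per instance (e.g. via mysqld_multi);
  # creating those is left out of this sketch.
  service "mysql-#{name}" do
    action [:enable, :start]
  end
end

Keeping all 5 instances 'equal' then becomes a matter of feeding them the same tuning attributes, which is a lot harder to get wrong than five hand-edited config files.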

Enter chef

You can see where I am going with this: it became very clear to me that I could not successfully deploy this new cluster (the biggest I had done yet) using the same old method. I needed a way to automate the building of these clusters, and I needed that to also be an easy way of maintaining the state of these clusters (the configuration of MySQL and the system underneath) in code.

So why chef? Simply put, it was what SendGrid was already using for configuration management and what is now often called 'infrastructure as code'. I wanted this data store to begin the effort of making what I do not seem like black magic…because it really isn't. I work with a team of great operations engineers, and when you are trying to scale to traffic that doubles or more annually with a not very big team, consistency of tools is extremely important.

What I learned

Learning chef as a DBA was an interesting experience. I will preface this section by saying that two years later, I am rewriting not just the cookbook for this data cluster but all of my chef code at SendGrid. There are many things I learned the hard way in that first major iteration, and I can imagine the same pitfalls happening to others in DBA or similar roles at other companies.

Write your own cookbook

I am not going to go into code samples. There are a few community cookbooks for installing MySQL/Percona Server, and I consider them all a great place to find examples. Yes, you can absolutely grab them and deploy MySQL with them, and I imagine for many budding teams this may be a very fine path to take. But know the debt you take on when grabbing someone else's code to deploy your infrastructure. I chose from the very beginning to write my own cookbook, because by the time I started, SendGrid was already handling huge throughput, and that comes with a number of tweaks.

When things are similar but not the same

When I started on this cookbook-writing adventure, I thought my different database clusters were similar enough to use one cookbook with just some attribute differences. And maybe when I started 2+ years ago that was true. Very quickly though, as we sharded more tables into their own clusters and added a few brand new projects using MySQL, that stopped being true. I found myself maintaining the monorail of database cookbooks. Making its testing strategy truly comprehensive meant 3 test kitchen suites per database kind, and build times grew exponentially.

This is why in this rewrite I am heavily using what is basically a wrapper style. Yes, most of my MySQL deployments follow more or less the same pattern, but after the base server install, things usually diverge. And there are few things as frustrating as watching a multi-hour Jenkins build because I changed a config file for one specific database type.
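
To illustrate what I mean by wrapper style (the cookbook, recipe, and attribute names below are made up, not our internal code): each cluster gets a thin cookbook that depends on a shared base cookbook, overrides a handful of attributes, and layers on only what is unique to that cluster.

# metadata.rb of a hypothetical wrapper cookbook
name    'mysql_tracking'
version '0.1.0'
depends 'mysql_base'   # shared base cookbook, hypothetical name

# attributes/default.rb - override only what differs for this cluster
override['mysql']['tuning']['innodb_buffer_pool_size'] = '96G'

# recipes/default.rb - reuse the shared install/config recipe, then add
# the pieces only this cluster needs
include_recipe 'mysql_base::server'
include_recipe 'mysql_tracking::partition_rotation'

The payoff is that each wrapper carries its own test kitchen suites, so changing one cluster's config no longer rebuilds everything.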

Embrace your organisation’s cookbook hierarchy

Beyond automation and making my life easier, first and foremost I decided to learn chef and write cookbooks for our databases because database land should not be an island. This is why, in my rewrites, I made sure the operations engineering team reviewed my code. Not only is peer review from the people most immersed in chef useful in itself; they also know which parts of system management we have turned into internal lightweight resources, which makes my code even simpler and saves me from reinventing a mostly-the-same-but-not-quite wheel. This has made the rewritten cookbooks much easier to follow and maintain.

This rewrite is not done; I only have a few clusters left, with cookbooks already in progress for them. I have learned quite a lot about being an operations engineer while working on this project.