Data-Based, Part 1: Mathias Meyer and The Riak Handbook

Our own Ines Sombra interviews Mathias Meyer, the author of the Riak Handbook and Infrastructure Engineer at Travis-CI.

Mathias on Twitter: https://twitter.com/#!/roidrage
Mathias' Website: http://www.paperplanes.de/

Discussion
* 0:20 All about Mathias
* 2:45 The Riak Handbook
* 6:30 Using specific DBs for different use cases
* 9:30 Postgres and why it's exciting
* 14:30 The challenges that big data stores will face and the future of databases
* 20:00 Advice for people getting started

Promotions
Buy the Riak Handbook here, and enter code EYDATA to get 20% off through May 17: http://riakhandbook.com/

Links
* Riak - http://basho.com/products/riak-overview/
* CouchDB - http://couchdb.apache.org/
* MongoDB - http://www.mongodb.org/
* Redis - http://redis.io/
* Percona - http://www.percona.com/
* MongoDB Best Practices - https://blog.engineyard.com/2011/mongodb-best-practices/
* Travis-CI - http://travis-ci.org/
* Martin Fowler's Polyglot Persistence Blog Post - http://martinfowler.com/bliki/PolyglotPersistence.html
* Cassandra - http://cassandra.apache.org/

Ines Sombra: Hello. Welcome to our first Cloud Out Loud Focus on Data. This is Ines. I'm a data engineer for Engine Yard. And I'm interviewing Mathias Meyer, who's the author of the Riak Handbook, and he has another book that is upcoming, the NoSQL Handbook, and I'm gonna give the mic to Mathias. Would you like to introduce yourself?

Mathias Meyer: Hey, I'm Mathias, and I'm from Berlin in Germany, and not that much to say about me. I'm a software developer, slash, infrastructure madman, slash, database enthusiast. And, yeah, I'm also the author of the Riak Handbook.

Ines Sombra: So, Mathias, can you tell us a little bit, how you got started with NoSQL and actually with databases in general, too?

Mathias Meyer: It's a beautifully long story. Back in university, I started out with Sybase, which is a beautiful database – if you ever work with Microsoft SQL Server, you're still working with Sybase in one way or the other – and worked my way around the block when it comes to relational databases, did projects with MySQL, Oracle, DB2, and other weird databases.

But pretty early on, I got exposure to CouchDB, thanks to Jan Lehnardt, one of the core team members, who's also in Berlin. He gave a talk at a local user group about it. And there was another local guy, Alexander Lang, who wrote the Ruby library for CouchDB pretty much around the same time.

And, yeah, that's pretty much when I started looking into CouchDB, not into NoSQL in general just yet, but it was pretty early on. And CouchDB was one of the earliest NoSQL-ish databases that were out there.

And around that time, I'd worked with Amazon's S3 and SimpleDB and got some early exposure to EC2, which is not necessarily NoSQL related. But I started a company that did cloud management and infrastructure management pretty much around the same time that Engine Yard had started doing the same thing on EC2.

And around that time was when Redis was released, and, yeah, thanks to Ezra, a guy who used to work at Engine Yard, I got pretty good and early exposure to Redis as well. So both Redis and CouchDB ended up powering this stack of said infrastructure management solution.

And so, yeah, I started giving talks on CouchDB and Redis and kind of got fascinated with NoSQL in general and looked into MongoDB, then got curious about Riak and Cassandra and all the others. It was a downward spiral.

Ines Sombra: Yeah. So you recently published a book, the Riak Handbook. Can you tell us a little bit more about it? So you started – you went from Couch to Riak, and then you went – it seems that you went really in depth in Riak, right?

Mathias Meyer: Yes. I originally started out wanting to write a book on NoSQL, kind of pick the five most popular databases in the NoSQL space, which are hard to pin down, but I had my choices. And the first part somehow ended up being very, well, very practical introductions to the CAP Theorem and to eventual consistency.

And then I started writing the part on Riak, which, well, it ended up being a very, very long and thorough chapter. And I just kept writing and writing, and new stuff kept popping up on Riak, new questions that people asked on stuff that I wanted to add to the book, to the chapter, to the Riak chapter in particular.

And that was all around the time I worked for Basho; got to say that as well. And so I picked up a lot of knowledge about what customers did with Riak, how you modeled data structures in Riak and just production experience with Riak in general. And that added and added and added more and more and more to the chapter.

So, eventually, I decided to just separate it out as a book because, at that time, there was no book on Riak, and there still is no other book on Riak, so it seemed like a good choice to do that. And, in the end, it was a very good choice to publish it as a separate book.

Ines Sombra: Good. And you're still working on the NoSQL Handbook, right?

Mathias Meyer: Not exactly. Well, I have to figure out what exactly I want to do. Writing a book on five different NoSQL databases is a very big task. Who knew? A couple of months already went into the Riak chapter, so if you've read the Riak Handbook, you can imagine how much longer it would take to cover all the different databases.

The thing I'm currently thinking about is to publish a book on Redis because the Redis chapter was the first complete chapter I wrote, and it kind of gave me an idea of how I wanted the book to be more, you know, not talking about all the data structures and theory and stuff like that, but more to give practical approaches on how you can use the specific features.

Redis, for example, has all these weird data structures that you can use in pretty interesting ways, and I just wanted to have a chapter that explains specific use cases you can do with that, and that kind of gave me more ideas for the Riak Handbook. That's how it ended up being very practical and hands-on and just taking you from beginning to end, basically.

And the Redis chapter is – it works in a similar way, and so I'm currently thinking more of just splitting that out as a separate book and focusing on these two databases because, of all the databases, those two are the ones I know best.

Ines Sombra: Yeah, and also, they're very resilient, and they're very well tested and tried. I think the Redis Handbook would be very interesting to have as well. So I will buy it.

So, okay, so we have Riak, and we have Redis, and those are the ones that you are the most familiar with and the ones you seem to like the most. But is there any other trends regarding databases? In your opinion, has the way we develop applications changed over the years?

Mathias Meyer: Well, it definitely has changed in the way that now, we feel freer to pick more databases. So Martin Fowler just called it polyglot persistence in a blog post he wrote one or two years ago. And it's certainly that we now feel a lot freer to just, you know, to look at a particular use case and to pick the database that seems best with that particular use case.

Putting all the hype on NoSQL aside, it's gotten more obvious that it's a good idea to just use a database just like you use a programming language for a specific use case. It makes sense to do the same with a database.

The interesting part though for me has been that NoSQL databases over the last years have just gotten closer to relational databases when it comes to, for example, user friendliness. And the same is true for the other way around.

It's like when you look at a current MySQL release, MySQL 5.6 adds a memcached interface, which basically turns MySQL into a key-value store, giving you raw network access to the storage engine, so that you can bypass all the more complex things, like the SQL parser, the analyzer, and all that stuff, just giving you simple access to the storage engine that's underneath. And that's pretty cool, I think.

And MySQL Cluster is getting better and better and offering easy cluster solutions for MySQL, for example. You have Postgres offering native JSON support, having built-in key-value storage, namely hstore, and stuff like that. It's kind of fun to see.

And another proof that it kind of is a serious thing to do is that Oracle has released its own NoSQL database. I've got to admit I haven't looked into that in too much detail yet. But it's kind of interesting to see.

But I also like that companies like Google, Facebook, and Percona do a lot of work – and Twitter even; Twitter just recently released their own fork of MySQL. And all of these companies put a lot of effort into maturing MySQL and moving it in a direction that makes it more and more suitable to carry high loads, to carry high write loads, high read loads, to just serve big websites. That's kind of interesting.

What I find shocking though is how few people know Percona. If you're using MySQL, you have to know Percona. You should know about Percona's XtraBackup tool and about the Percona Toolkit in general. It's pretty nice work they're doing.

Ines Sombra: Yeah. Do you know much about XtraDB Cluster there? So I think that they're now releasing a version of MySQL that is meant to be more geared towards multimaster, and they recently pushed it into beta. Have you heard anything about that, the Percona distro?

Mathias Meyer: I have only heard of it from Percona, and I haven't looked into it myself, but it's very interesting. It's kind of interesting to see Percona coming up with their own clustering solution that stands next to Oracle's own MySQL Cluster, so kind of cool to see where that is going eventually. But, yeah, I have yet to look into that myself; so many things to look into.

Ines Sombra: Yeah, and they keep popping up.

Mathias Meyer: Yeah.

Ines Sombra: So what in the world of, like, data makes you smile these days? What do you really like? What are you interested in now?

Mathias Meyer: Believe it or not, the database I'm most interested in right now is Postgres. It's kind of cool to see what happens in Postgres in general, and I haven't spent enough time with it myself, so that's kind of what I'm digging into right now.

Also, the company I now work for, Travis CI, it's open-source continuous integration; they're also running on Postgres. And, yeah, I'm looking forward to diving into it more for that particular use case.

Ines Sombra: Yeah, the guys from Instagram had a meet-up, I think, yesterday – no, Tuesday – and they were talking about how they use Postgres and Redis to scale their architecture, but mostly anything that has to do with images was based on Postgres. And they were able to shard it, and then they were able to pretty much solve the problems that people these days are using NewSQL solutions for, just with Postgres, a couple of extensions, and just a clever way of sharding it.

So it's pretty impressive, the amount of load that it can take and how precise you can be at deploying an application that has really very high needs, but with the technology that has already been tried and true.

So I think that, yeah, it's been very – it's been maturing very well, and then the next version is gonna be even better, like, JSON support is gonna make things much more, like, loose schemas and things like that.

So I remember one time when we had this conversation that you were talking about the myth of NoSQL and schema-less. Can you tell our listeners a little bit more about it?

Mathias Meyer: Yeah, schema-less – schema-less is an interesting part of NoSQL. It's the idea that you don't have a fixed schema, that your schema just evolves with your data, that you can just add and remove new attributes to data. That was an interesting – how do I put it? – an interesting promise of NoSQL.

And I've used a database like CouchDB myself in production, and in the end, you always have a schema, even if you have, like, a document structure, there's always a schema behind it. You always have a certain set of attributes, a certain set of values that these attributes can have and things like that.

So, in one way or the other, you always end up migrating data; even NoSQL databases don't take that part away. They just force you to go different routes to actually migrate your data. For example, you can migrate data lazily, so whenever you read data, and you realize that the schema of that particular piece of data is outdated, you update it, and you write it back to the database. That's kind of a lazy approach to migrate data on the fly.
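As a rough illustration of the lazy, migrate-on-read approach described here, the following minimal sketch uses a plain Python dict in place of a real document store. The `schema_version` field, the `upgrade` and `read` helpers, and the name-splitting change are all assumptions made up for the example; they are not taken from CouchDB or any other database.

```python
# Illustrative only: a plain dict stands in for a document store.
CURRENT_VERSION = 2


def upgrade(doc):
    """Return a copy of doc migrated to CURRENT_VERSION (hypothetical schema)."""
    doc = dict(doc)
    if doc.get("schema_version", 1) < 2:
        # Assumed v2 change: split the single "name" field into two fields.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"] = first
        doc["last_name"] = last
        doc["schema_version"] = 2
    return doc


def read(store, key):
    """Read a document; if its schema is outdated, migrate it and write it back."""
    doc = store[key]
    if doc.get("schema_version", 1) < CURRENT_VERSION:
        doc = upgrade(doc)
        store[key] = doc  # the write-back is what makes the migration stick
    return doc


store = {
    "u1": {"name": "Ada Lovelace"},  # old-style document without a version
    "u2": {"first_name": "Alan", "last_name": "Turing", "schema_version": 2},
}
user = read(store, "u1")
```

Documents that are never read are never migrated, so a setup like this usually needs the application to keep understanding old versions for as long as any remain.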

And I think I remember a blog post about Engine Yard doing something similar, even with their MySQL setup. So it's kind of a mixed bag. Schema-less is not a guarantee that you can just do whatever you want. In the end, you'll always need indexes on data to have very good performance when accessing data by certain attributes. And to have indexes, you have to have an idea of what your schema looks like.

Ines Sombra: Yeah. So it's a little bit of an illusion, and I think it's just more of a marketing thing, like, maybe after you've committed to something that appears to be schema-less, you tend to find out, and then it's one of those, like, "Hey, by the way, we didn't know about this," and then, all of a sudden, it just happened.

Mathias Meyer: Yeah. One thing to add to that, you know, it's like a lot of people think that changing schemas in, for example, MySQL is expensive or in Postgres. But, you know, Postgres, I think some schema changes in Postgres are a one-off operation. They don't even involve copying around all the data.

And, for MySQL in particular, there's a tool that's part of the Percona Toolkit that allows you to do online migrations. It's a weird way to do that, and if you're not a fan of that in general, you'd probably scream when you hear about that.

But the basic idea is you create a new table with the new schema, and then you copy all the data over into that table. In the meanwhile, you capture all the data that is changing while you're copying around the data, and when you're done copying, you apply all these changes to the new table. And then you just flip a switch, and all the reads and writes go to that new table.

And, obviously, it will just – it will put some load on your database, but it's a lot less load. It gives you a lot more head room compared to having a blocking schema change that would involve locking a table and copying a lot of data around and basically meaning a maintenance mode for several hours in the worst case.
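The copy, capture, replay, and flip sequence described above can be sketched as a small simulation. This is only a toy model under stated assumptions: plain dicts stand in for tables, the `Table` class, `migrate_row`, and `online_schema_change` are hypothetical names, the "concurrent" writes are replayed after the copy rather than truly in parallel, and deletes are not handled. It is not how the Percona tool is implemented.

```python
# Illustrative simulation with plain dicts; not a real migration tool.


def migrate_row(row):
    """Hypothetical schema change: add an 'active' column with a default."""
    return {**row, "active": True}


class Table:
    def __init__(self, rows):
        self.rows = dict(rows)   # primary key -> row
        self.change_log = None   # becomes a list while a migration runs

    def write(self, key, row):
        self.rows[key] = row
        if self.change_log is not None:
            # Capture which keys changed while the copy is in progress.
            self.change_log.append(key)


def online_schema_change(old, concurrent_writes=()):
    old.change_log = []
    new_rows = {}
    # Step 1: copy the existing rows into the new table.
    for key in list(old.rows):
        new_rows[key] = migrate_row(old.rows[key])
    # Writes arriving during the copy still go to the old table.
    for key, row in concurrent_writes:
        old.write(key, row)
    # Step 2: apply the captured changes to the new table.
    for key in old.change_log:
        new_rows[key] = migrate_row(old.rows[key])
    old.change_log = None
    # Step 3: flip the switch; callers use the returned table from now on.
    return Table(new_rows)


users = Table({1: {"name": "ines"}, 2: {"name": "mathias"}})
users = online_schema_change(
    users,
    concurrent_writes=[(2, {"name": "Mathias"}), (3, {"name": "ezra"})],
)
```

The point of the pattern is that the old table stays readable and writable the whole time; only the final swap needs to be atomic.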

Ines Sombra: Yeah. So you discuss a little bit about how now databases are starting to adopt a little bit of things from, like, NoSQL, and then NoSQL are starting to bridge the gap and become – start offering a little bit more of the guarantees that relational databases have.

What do you see, the big challenges that the data stores are going to be facing, or what do you think that the direction of, like, where do you think they're going? How do you see the database world in, like, in two years or five years from now?

Mathias Meyer: Oh, God, five years, oh, I don't want to make that call.

Ines Sombra: This is one of those things where, like, finally, we're not gonna need this, and it just happens. So, yes, I guess that question is a little far-fetched.

But, at least direction-wise, where do you think that they're – do you think that they're just gonna become this mesh hybrid of being able to do different types of data access within the same database, or do you think that we're still gonna have a direction where you have data stores that are specific to the function that you want and maybe the connections between them become easier? Where do you think things are going?

Mathias Meyer: Probably, a year ago, I would have said – I would have insisted that you will still have specialized – more very specialized tools around. But, now, it's more like – in particular, the bigger players in the field, like MongoDB, Cassandra, Riak, and others, they're getting closer and closer and closer to the user, making their databases more friendly to use for users.

For example, Cassandra now has some sort of consistency guarantee. It has atomic distributed counters, which is like what people actually expect from a database. And what Amazon has done with its DynamoDB, they added a lot of features that users just expect from a database, even though you can tell them that it's distributed and then that stuff like that is hard and all these things.

You know, user friendliness is all that matters in a way. And users have some expectations of their database, and I think the databases are getting closer and closer to the users' expectations.

And that goes both ways, you know, with MySQL having a memcached interface that allows simple key-value storage. That's what users now want, and it's a fair thing to ask for, and it's actually not a hard thing to implement on top of MySQL because most of the features are in place.

And you see databases like Riak adopting things like secondary indexes that allow you to do proper queries on attributes and range queries and stuff like that. So I think user friendliness is gonna be a very big factor in the coming months and maybe years.
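To make the idea of a secondary index on a key-value store concrete, here is a minimal in-memory sketch. It is an assumption-laden toy, not Riak's implementation or API: the `IndexedStore` class and its `put` and `range` methods are hypothetical, and the index is just a sorted list searched with Python's standard `bisect` module.

```python
import bisect


class IndexedStore:
    """Toy key-value store with one secondary index on a chosen attribute."""

    def __init__(self, attribute):
        self.attribute = attribute
        self.data = {}    # primary key -> document
        self.index = []   # sorted list of (attribute value, primary key)

    def put(self, key, doc):
        old = self.data.get(key)
        if old is not None:
            # Drop the stale index entry before re-indexing the document.
            self.index.remove((old[self.attribute], key))
        self.data[key] = doc
        bisect.insort(self.index, (doc[self.attribute], key))

    def range(self, low, high):
        """Return primary keys whose indexed value v satisfies low <= v <= high."""
        start = bisect.bisect_left(self.index, (low,))
        result = []
        for value, key in self.index[start:]:
            if value > high:
                break
            result.append(key)
        return result


people = IndexedStore("age")
people.put("alice", {"age": 30})
people.put("bob", {"age": 25})
people.put("carol", {"age": 41})
people.put("bob", {"age": 35})   # an update moves bob within the index
in_their_thirties = people.range(30, 40)
```

Without the index, answering the same range query would mean scanning every document, which is why an index presupposes knowing which attributes matter.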

Ines Sombra: See, to me, like, I'm a little conflicted when it comes to user friendliness because I didn't really grow up in the Ruby community. I just – I'm fairly new to it, and the concept of ORM is still something that I understand that people use it, but I think it's one of those things that I'm conflicted about.

I think it's, like, user friendly, but at the same time, I think it helps people disconnect from their data store in ways that it's a little – to me, it's a little – it fosters this system of not understanding how the thing that you deal with works. And I think it's probably good because it makes you get started faster.

But then, you know, when you see a relational database, it's just basically ORM access, and it's not heavily optimized, and then that whole thing about whether it's normalized or not, it's, like, I'm conflicted about that. It's, like, am I understanding the situation incorrectly, or basically do you see things like ORMs and layers between the user and the database as something inherently good?

Mathias Meyer: Yes and no. ORMs are nice to – you know, you have some sort of abstractions, and people love abstractions on top of the databases. But I don't know. It's hard to say anything in specific there because, on the one hand, you want to have your database into as many hands as possible. But, on the other hand, you want people to understand what happens underneath. Those are the two – to me, they're connected.

To me, you need to have a database that is – or any tool basically that is easy to use but that's also easy to run in production and that is, well, not impossible to understand what happens underneath. So, yeah, I don't have a good answer there.

But, as with NoSQL in general, there's gonna be trade-offs in that regard. I'm not saying that every database will eventually have fancy tool chains around them allowing you to click here and click there, and everything will magically work. But it will – when I say user friendliness, I'm talking more about meeting users' expectations.

And users' expectations, for example, are having a query language, you know, having a way to index data so that you can get it back out again, having a means to implement counters in your applications without having to worry about weird data structures that you have to model yourself, stuff like that.

Ines Sombra: Yeah. And durability is huge. I mean, if you say something to the database, like, in the relational time, we had already gotten so used to ACID properties; just having something that gets committed, and you don't know if it's written or not, I find it very difficult to grasp. If you send it to the database, it should be durable, so that's one of the things.

So what is your advice for somebody that is getting started? What are the biggest dos and don'ts, do you think, that somebody who is just starting to evaluate different things should be aware of?

Mathias Meyer: So I'm not a big person when it comes to talking about dos. I talk more about don'ts. And I can only talk from my experience there. And, from my experience, we jumped on NoSQL databases full on, basically replacing relational databases on a project entirely.

And these days, I would not do that anymore. I don't do that anymore basically. My personal choice is always a relational database first, and then add things where it makes sense. And that's what I personally would recommend.

For me, there's just no reason to not use a relational database in most cases, unless I specifically know the use case and know what kind of data I can expect, and then I can make assumptions and validate then that a relational database is probably not a good fit for that.

And so my advice in general would be to start off easy and to start replacing small bits in your application where it kind of makes sense. For example, use Redis for some small computational data for which Redis is just a much faster tool to use than a relational database, for example.

And that's kind of how I would recommend approaching things, just start off small and replace small parts or add, you know, when you add new features and see something like Redis would kind of be a nice fit for that, you know, just start playing with it, start adding it with that feature, and roll it out slowly. And just watch and learn and see what happens with that particular feature, and then add it in other places where it makes sense.

Ines Sombra: Well, thank you, Mathias, for talking with us. Again, he's the author of the Riak Handbook. Go buy his book. It's actually very easy to follow, and if you're interested in Riak, I've read the book myself. It's, like, very easy to read, very easy to understand the concepts that he explains. I'm looking forward to maybe reading your Redis Handbook. I hope that you continue working on it.

And, at Engine Yard, we're working on Riak support for manage, so if any of our listeners are interested in it, just please let us know. Open a ticket, and it will eventually come to me. And is there anything else that you want to say, Mathias?

Mathias Meyer: No. It was a pleasure.

Ines Sombra: Yeah, thank you so much.

Mathias Meyer: It was a pleasure talking.

Ines Sombra: Thank you again, and I look forward to maybe having another podcast with you whenever you publish your next book.

Mathias Meyer: Sure thing.

Ines Sombra: Thanks.

Mathias Meyer: Thanks.

Ines Sombra: Bye.

Mathias Meyer: Bye.