Mongo和Couch对比

Mongo和Couch是终极关系数据库的两个杀手。

在mongodb.org有一篇文章提出两者比较和如何使用。
1.如果正在建立类似Lotuc notes应用，数据会暂时离线几个小时，然后再上线，这种模式适合Couch，和Couch的MVCC模式(Multiversion concurrency control)完全符合。

2.如果我们需要一个master-master一直的可替换的数据库，物理上分布式的，经常离线的，使用Couch

3.如果有高性能要求，使用Mongo，比如缓存等。

4.如果应用需要非常关键的事务支持，比如金融交易等，就不要使用MongoDB，还是使用传统的关系数据库。

5.如果有非常高的更新率，使用Mongo。

个人意见：Mongo倒是非常类似内存产品Terracotta，Terracotta比memcache更加适合频繁更新，而且使用Terracotta可以和传统数据库友协调。

原文:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
[该贴被admin于2009-08-10 17:55修改过]

很著名的一篇NoSQL: If Only It Was That Easy
没有SQL,如果仅仅是这么简单.
http://bjclark.me/2009/08/04/nosql-if-only-it-was-that-easy/

网址被大陆查封屏蔽,转贴英文如下:
The biggest thing in web apps since “rails can’t scale” is this idea that “your rdbms doesn’t scale.”(在Web最大的事情就是Rails并不具有伸缩性,因为Rails是依赖数据库的,而数据库是不可伸缩性,这句话我个人好像以前也说过). This has gone so far as to be dubbed the coming of age for “nosql” with lots of blog posts and even a meetup.(未来被称为无数据库时代) Indeed, there are many promising key-value stores, distributed key-value stores, document oriented dbs, and column oriented db projects on the radar( 由此诞生了很多key-value存储模式,个人推荐的缓存也是一种key-value存储模式, key-value存储模式是可伸缩的). This is *definitely* a great thing for the web application scene and this level of variety will definitely open doors for organizations large and small in the near and long term.

However, along with these great tools, an attitude that “the rdbms is dead” has popped up, and while that may be true in the long run, in the short term, it’s definitely premature.(随着这些伟大的工具推出和应用，一种观点说， “关系型数据库管理系统已经死了”，而这也许是真实的，从长远来看，在短期内，这肯定为时过早。)

(下面就不翻译了,大家用google翻译自己将就做着理解,意思关系数据库和新型key-value存储模式可能同时并存)
What is scaling?
First, lets get a couple things straight:

to scale (third-person singular simple present scales, present participle scaling, simple past and past participle scaled)

(transitive) To change the size of, maintaining proportion.
We should scale that up by a factor of 10.
(transitive) To climb.
Hilary and Norgay were the first known to have scaled Everest.
(intransitive) (computing) To tolerate significant increases in throughput or other potentially limiting factors.
That architecture won’t scale to real-world environments.
The first thing we need to agree on is what it means to scale. According to our definitions above, it’s to change the size while maintaining proportions and in CS this usually means to increase throughput. What scaling isn’t: performance.

performance (plural performances)The act of performing; carrying into execution or action; execution; achievement; accomplishment; representation by action; as, the performance of an undertaking of a duty.In Computer science: The amount of useful work accomplished by a computer system compared to the time and resources used. Better Performance means more work accomplished in shorter time and/or using less resources

[该贴被admin于2009-08-10 17:54修改过]

In reality, scaling doesn’t have anything to do with being fast. It has only to do with size. If your request takes 12 seconds, it doesn’t matter, it only matters that you can do 1 per second, 10 per second, 100 per second, 1000 per second, etc. of that 12 second query.
Now, scaling and performance do relate in that typically, if something is performant, it may not actually need to scale. Your upper limit may be high enough you don’t even need to worry about scaling. The problem with rails is not that it doesn’t scale (it happens to scale pretty easily), it’s that you have to scale it almost immediately. The problem with RDBMS isn’t that they don’t scale, it’s that they are incredibly hard to scale. Ssharding is the most obvious way to scale things, and sharding multiple tables which can be accessed by any column pretty quickly gets insane. Furthermore, you might be able to use something other than a RDBMS that you won’t need to scale because it’s more performant or efficient at doing the work you’re currently doing in a RDBMS.

So NoSQL then. . .
At my previous job, my co-workers and I evaluated most of the current nosql solutionsto varying degrees. All of the projects have been evaluated for both use as simple “tables” of data, such as storing a single type of key/value data, as well as a document db to be used as our primary data store. They have a curious data set which consists of a single set of parent objects with many child objects that relate 1-1 or 1-n with this set of objects (I call these primary objects). We then have a secondary set of objects that store changes made to both the primary objects that we primarily use for auditing. Our current db setup is standard master-slave replication in mysql with 1 master and up to 3 slaves depending on usage. The primary objects are mostly changed via UPDATES and the secondary objects are all inserted at the end of the tables. We also have a few random other data sets which loosely relate to the primary parent objects.

To get a couple out of the way, I’m not going to cover memcached (because it’s not a db), memcachedb (general sentiment that it is immature), couchdb (because we didn’t want to use map/reduce to pull information and there are questions about it’s performance and replication), dynomite (seen as immature), Amazon SimpleDB (because of size limits), or Lightcloud (seen as immature). As far as the ones that we I deemed immature, I’m sure there are people out there using these things and having a great time, but our research into them, and word of mouth from others who have tried them kept us from really going deep.

Tokyo *
url: http://tokyocabinet.sourceforge.net/
type: Key/Value store with Full Text Search*
Conclusion: Doesn’t scale.

We liked Tokyo Tyrant so much, we put it in production. In fact, every request to AboutUs.org hits Tokyo. One of the uses is as a persistent memcached replacement for caching 10 million+ wiki pages (as a json document of all the pieces of our page, which comes out to around 51gb(edited) of data), and it works great. It runs on a single server, it serves up a single type of data, very quickly, and has been a pleasure to use. We keep other ancillary data sets on some other servers too, and it’s great for this. Tokyo Tyrant is a great example of very performant software, but it doesn’t scale. If you’d like to make it scale, it’s not very hard, you scale it exactly like Memcached (by some sort of application side hashing of keys). You can have as many servers as you’d like, but you can’t easily add servers to a cluster (increase in size while maintaining proportion) and therefore, you can’t tolerate significant increases in throughput. The good news it that here “significant” is relatively massive, and you probably won’t need to scale it any time soon.

We tried to insert 160mil 2k -20k documents into a single Tokyo Tyrant server, and performance quickly dropped off and kept going down. You could have had a nice holiday skiing on the graph of inserts/sec. This is pretty typical of anything that you write to a disk. The more you write, the slower it goes. We could distribute the write easily, because Tokyo doesn’t scale.

Tokyo does support replication, and a few other great things, but these don’t make for scaling.

* We don’t use the full text search, so I can’t comment there.

Redis
url: http://code.google.com/p/redis/
type: Key/Value store with collections and counters
Conclusion: Doesn’t scale.

Redis is also awesome like Tokyo. I would say the two are pretty comparable as simple k/v stores. The counters and collections are AWESOME, and if I was still at AboutUs, I think I’d be pushing to move a couple pieces of the infrastructure to Redis. I have less to say about Redis because I haven’t used it in production, but it looks great if it fits your bill. Does it scale? No. It, just as memecached and tokyo tyrant, can do sharding by handling it in the client, and therefore, you can’t just start adding new servers and increase your throughput. Nor is it fault tolerant. Your redis server dies, and there goes that data. And just as tokyo tyrant and memcached, you probably won’t ever need to try to scale it. Redis also supports replication.

Project Voldemort
url: http://project-voldemort.com/
type: Distributed Key/Value store
Conclusion: Scales!

Voldemort is a very cool project that comes out of LinkedIn. They seem to even be providing a full time guy doing development and support via a mailing list. Kudos to them, because Voldemort, as far as I can tell, is great. Best of all, it scales. You can add servers to the cluster, you don’t do any client side hashing, throughput is increased as the size of the cluster increases. As far as I can tell, you can handle any increase in requests by adding servers as well as those servers being fault tolerant, so a dead server doesn’t bring down the cluster.

Voldemort does have a downside for me, because I primarily use ruby and the provided client is written in java, so you either have to use JRuby (which is awesome but not always realistic) or Facebook Thrift to interact with Voldemort. This means thrift has to be compiled on all of your machines, and since Thrift uses Boost C++ library, and Boost C++ library is both slow and painful to compile, deployment of Voldemort apps is increased significantly.

Voldemort is also intersting because it has pluggable data storage backend and the bulk of it is mostly for the sharding and fault tolerance and less about data storage. Voldemort might actually be a good layer on top of Redis or Tokyo Cabinet some day.

Voldemort, it should be noted, is also only going to be worth using if you actually need to spread your data out over a cluster of servers. If your data fits on a single server in Tokyo Tyrant, you are not going to gain anything by using Voldemort. Voldemort however, might be seen as a good migration path from Tokyo * when you do hit that wall were performance isn’t enough.

MongoDB
url: http://www.mongodb.org
type: Document Database
Conclusion: Doesn’t scale (yet!)

MongoDB is not a key/value store, it’s quite a bit more. It’s definitely not a RDBMS either. I haven’t used MongoDB in production, but I have used it a little building a test app and it is a very cool piece of kit. It seems to be very performant and either has, or will have soon, fault tolerance and auto-sharding (aka it will scale). I think Mongo might be the closest thing to a RDBMS replacement that I’ve seen so far. It won’t work for all data sets and access patterns, but it’s built for your typical CRUD stuff. Storing what is essentially a huge hash, and being able to select on any of those keys, is what most people use a relational database for. If your DB is 3NF and you don’t do any joins (you’re just selecting a bunch of tables and putting all the objects together, AKA what most people do in a web app), MongoDB would probably kick ass for you.

Oh, and did I mention that, of all the NoSQL options out there, MongoDB is the one of the only ones being developed as a business with commercial support available? If you’re dealing with lots of other people’s data, and have a business built on the data in your DB, this isn’t trivial.

On a side note, if you use Ruby, check out MongoMapper for very easy and nice to use ruby access.

Cassandra
url: http://incubator.apache.org/cassandra/
type: Column Database
Conclusion: Probably scales

Cassandra is another very promising project that I wouldn’t use yet. Cassandra came out of Facebook and seems to be in use there powering search in your inbox. It’s described as a distributed key/value store, where values can be collections of other key/values (called column families). It is definitely supposed to scale, and probably does at Facebook, by simply adding another machine (they will hook up with each other using a gossip protocol), but the OSS version doesn’t seem to support some key things, like loosing a machine all together. They are also in the midst of changing how the basic datastructures are stored on disk, and I don’t know that I’d trust my data to this sexy db until those things are worked out, which should be soon.

Cassandra also seems like a contender for a primary database or RDBMS replacement, as soon as it matures. The scaling possibilities are very attractive, and complex data structures shouldn’t be hard to model in it. I’m not going to go any deeper on cassandra because Evan Weaver did a great job of it here, but I will say that Cassandra is very promising and we were (when I left) looking at it very closely at AboutUs.org.

Amazon S3
url: http://aws.amazon.com/s3/
type: key/value store
Conclusion: Scales amazingly well

You’re probably all like “What?!?”. But guess what, S3 is a killer key/value store. It is not as performant as any of the other options, but it scales *insanely* well. It scales so well, you don’t do anything. You just keep sticking shit in it, and it keeps pumping it out. Sometimes it’s faster than other times, but most of the time it’s fast enough. In fact, it’s faster than hitting mysql with 10 queries (for us). S3 is my favorite k/v data store of any out there.

MySQL
url: http://www.mysql.com
type: RDBMS
Conclusion: Doesn’t Scale

Now you are probably like “Dude, what?!? You got some SQL in this NoSQL article”. I’ve got news for you guys, mysql is a pretty bad ass key/value store. It can do everything that Tokyo and Redis can do, and it really isn’t that much slower. In fact, for some data sets, I’ve seen MySQL perform ALOT faster than Tokyo Tyrant (I’ll post my findings in a follow up). For most applications (and say, FriendFeed), MySQL is plenty fast and it’s familiar and ubiquitous. I’m sure the NoSQL guys reading this will all be saying “Yeah, but we are dealing with more data than MySQL can handle”. Well, you might be dealing with more data than mysql used as a RDBMS might be able to handle, but it’s just as easy or easier to shard MySQL as it is Tokyo or Redis, and it’s hard to argue that they can win on many other points.

Conclusion
So, does RDBMS scale? I would say the answer is: not any worse than lots of other things. Most of what doesn’t scale in a RDBMS is stuff people don’t use that often anyway. And does NoSQL scale: a couple solutions do, most don’t. You might even argue that it’s just as easy to scale mysql (with sharding via mysql proxy) as it is to shard some of these NoSQL dbs. And I think it’s a pretty far leap to declare the RDBMS dead.

The real thing to point out is that if you are being held back from making something super awesome because you can’t choose a database, you are doing it wrong. If you know mysql, just used it. Optimize when you actually need to. Use it like a k/v store, use it like a rdbms, but for god sake, build your killer app! None of this will matter to most apps. Facebook still uses MySQL, a lot. Wikipedia uses MySQL, a lot. FriendFeed uses MySQL, a lot. NoSQL is a great tool, but it’s certainly not going to be your competitive edge, it’s not going to make your app hot, and most of all, your users won’t give a shit about any of this.

What am I going to build my next app on? Probably Postgres. Will I use NoSQL? Maybe. I might also use Hadoop and Hive. I might keep everything in flat files. Maybe I’ll start hacking on Maglev. I’ll use whatever is best for the job. If I need reporting, I won’t be using any NoSQL. If I need caching, I’ll probably use Tokyo Tyrant. If I need ACIDity, I won’t use NoSQL. If I need a ton of counters, I’ll use Redis. If I need transactions, I’ll use Postgres. If I have a ton of a single type of documents, I’ll probably use Mongo. If I need to write 1 billion objects a day, I’d probably use Voldemort. If I need full text search, I’d probably use Solr. If I need full text search of volatile data, I’d probably use Sphinx.

If there’s anything to take away from the NoSQL debate, it’s just to be happy there’s more tools, because more cool tools = more win for everyone.

看了一个关于存储的文章，将数据持久化分为
1、关系数据库
2、文档数据库，比如CouchDB
3、分布式key-value数据库，比如最近非常热的Voldemort
现在我怀疑key-value模式能表达出关系数据库表达的那么强大的语义么？key-value用作缓存还可以，但是存储主业务数据，很多负责的关系，比如一对多或者更负责的关系怎么体现？另外如果主数据库采用key-value那么现有的ORM框架转变为Object to Key-Value是否可转化？

>现有的ORM框架转变为Object to Key-Value是否可转化
所以，关键是是Key-Value是一种对象的Key-Value，Key是实体对象的ID标识，Value是实体根对象。

这样理解Key-Value，一切就顺了。Key-Value就应该是对象的Key-Value。

key-value的问题是缺乏结构化，用key-value存储主业务关联数据，如果别人不想用id查询，想用name查询，取不key中的id没有作用了，还要把所有数据取到内存里分别把对象的name取出来查找。。。

>取不key中的id没有作用了
key-value不是直接面向业务应用的，我前面说过，key-value是用来保存领域模型的实体对象的，这个实体对象就是DDD的实体聚合根，这个实体对象才面向业务应用，你如果想查询，使用Specification规格来实现，具体可见Evans DDD。

我们不能再用直接面向数据库思维来看待技术的跳跃，key-value背后是架构思维的转变。

我理解你得意思是Voldemort只做存储，然而对于存储数据的业务查询，查询语句的定义，如何回滚都有实现领域模型仓库的开发人员自己完成，代码放在仓库类里

对，key-value就是把过去关系数据库存储和计算两个事情分开，只做其中的存储。

哦，这样说就明白了，用key-value存储有个好处就是不需要开发人员为object to relation db 之间的不匹配烦恼了，但是无形之中较传统db增加了很多工作量，比如自己实现事务机制以及很多数据库之前实现的保障机制，而且潜在出现错误风险加大，请问放弃关系DB的目的是什么？效率？可伸缩性？

Terracotta
这个东西到现我都还不明白到底怎么使用~~

迄今为止我所听说的分布式key-value都是用作缓存或者记录状态，存储主业务数据还真是没听说过。。。