Hongtium Backend Technology

Abstraction, isolation & integration.


Introduction to the Kingsoft Cloud Service Platform

On 2013-12-05, Kingsoft Cloud released the constellation series that completes its cloud platform: the “天蝎” (Scorpio) cloud server, the “水瓶” (Aquarius) cloud disk plus “金牛” (Taurus) mass storage (the previously released KS3), the “天秤” (Libra) load balancer, and the “天琴” (Lyra) database service, along with the previously released Kuaipan application and the new lightweight tool “看孙子”.

First, the current state of server-side development:


Cloud computing is one way out.

Overview of Kingsoft Cloud's current product coverage

An overview of Kingsoft Cloud's products: the “天蝎” cloud server, the “水瓶” cloud disk plus “金牛” mass storage (the previously released KS3), the “天秤” load balancer, the “天琴” database service… Together they cover all five parts of the standard taxonomy of cloud computing: compute, storage, network, services, and applications.


The leading role: the “天蝎” cloud server (EC2)

It solves the first problem of server-side development: servers! And it comes with several major properties out of the box: elastic capacity, very high performance, very low prices…


“天蝎” cloud servers deploy quickly from the web console

Deploying 1 cloud server or 100 makes almost no difference, so business growth brings no pressure.


Performance and price comparison

A comparison chart of current prices, and a comparison chart of disk IO throughput, the bottleneck of cloud servers.


Cloud disks (EBS)

EBS is where today's cloud servers bottleneck, so Kingsoft Cloud has focused on IO performance and now leads the industry there.


Live migration (“热迁移”)

Live migration is only possible for cloud disks built on a networked, distributed file system. Kingsoft Cloud is the first in China's IaaS market to implement it.

In November, Kingsoft Cloud successfully live-migrated a customer's MySQL primary database: the service moved to the new machine within roughly 10–20 seconds.


Mass cloud storage (KS3)

Kingsoft Cloud also operates “金牛”, the largest commercial mass cloud storage system in China. Its design capacity has gone from 10PB to 100PB to 1EB over the past two years.


“金牛” comes at an extremely low cost

For companies that need to store large volumes of data, the monthly fee is almost negligible, while reliability stays extremely high.
There is no need to envy Google's or Facebook's data centers: using Kingsoft Cloud is like owning a dozen or so top-tier IDCs across the country at once, so access is no longer a problem.


The load balancing product (LBaaS)

On the network side, Kingsoft Cloud currently offers the “天秤” load balancer, which covers the usual traffic-balancing and health-check needs and brings some security capabilities along the way.


Load balancing

Developers can face massive request volumes with confidence.

A load balancer is also a basic building block for high-availability services. It is not only for web servers; services such as MySQL can sit behind one too.


“天琴” RDS

The relational database service (RDS) removes the hassle of buying your own machines and maintaining a DBMS yourself. With the platform's unified RDS, traditional DBA work shrinks, and you gain elastic scaling, data safety, and performance improvements on top.


Data Safety & Security

RDS brings your business “安全” in both senses: safety and security.


The big picture

What Kingsoft Cloud is building is a complete cloud-computing ecosystem! Through the power of integration, it breaks down the boundary between front-end and back-end development and links cloud IaaS, multiple developer platforms (PaaS), and mobile-internet channels, so that development in mobile internet and big data no longer has dead corners where time and money costs pile up.


Kingsoft Cloud's positioning

In the cloud era, Kingsoft Cloud is your infrastructure department.

All of TAB (Tencent, Alibaba, Baidu) have infrastructure departments of their own; how do you compete with that? Developers, too, should know the power of integration.


Philosophy

The four things we are fully committed to, which are also our current strengths: top-tier technology that delivers the best performance, at consistently low prices, plus the power of integration to build a complete cloud-computing ecosystem.


The ultimate goal

Together with developers, we pursue the ultimate goal: liberating productivity!

Let every limited resource go where it truly matters: business innovation!


Twitter's new tweets-per-second record: how did they do it?

https://blog.twitter.com/2013/new-tweets-per-second-record-and-how (awkward to reach from China, so transcribed here).

Recently, something remarkable happened on Twitter: On Saturday, August 3 in Japan, people watched an airing of Castle in the Sky, and at one moment they took to Twitter so much that we hit a one-second peak of 143,199 Tweets per second. (August 2 at 7:21:50 PDT; August 3 at 11:21:50 JST)

To give you some context of how that compares to typical numbers, we normally take in more than 500 million Tweets a day which means about 5,700 Tweets a second, on average. This particular spike was around 25 times greater than our steady state.

During this spike, our users didn’t experience a blip on Twitter. That’s one of our goals: to make sure Twitter is always available no matter what is happening around the world.

New Tweets per second (TPS) record: 143,199 TPS. Typical day: more than 500 million Tweets sent; average 5,700 TPS.


This goal felt unattainable three years ago, when the 2010 World Cup put Twitter squarely in the center of a real-time, global conversation. The influx of Tweets –– from every shot on goal, penalty kick and yellow or red card –– repeatedly took its toll and made Twitter unavailable for short periods of time. Engineering worked throughout the nights during this time, desperately trying to find and implement order-of-magnitudes of efficiency gains. Unfortunately, those gains were quickly swamped by Twitter’s rapid growth, and engineering had started to run out of low-hanging fruit to fix.

After that experience, we determined we needed to step back. We then determined we needed to re-architect the site to support the continued growth of Twitter and to keep it running smoothly. Since then we’ve worked hard to make sure that the service is resilient to the world’s impulses. We’re now able to withstand events like Castle in the Sky viewings, the Super Bowl, and the global New Year’s Eve celebration. This re-architecture has not only made the service more resilient when traffic spikes to record highs, but also provides a more flexible platform on which to build more features faster, including synchronizing direct messages across devices, Twitter cards that allow Tweets to become richer and contain more content, and a rich search experience that includes stories and users. And more features are coming.

Below, we detail how we did this. We learned a lot. We changed our engineering organization. And, over the next few weeks, we’ll be publishing additional posts that go into more detail about some of the topics we cover here.

Starting to re-architect

After the 2010 World Cup dust settled, we surveyed the state of our engineering. Our findings:

  • We were running one of the world’s largest Ruby on Rails installations, and we had pushed it pretty far –– at the time, about 200 engineers were contributing to it and it had gotten Twitter through some explosive growth, both in terms of new users as well as the sheer amount of traffic that it was handling. This system was also monolithic where everything we did, from managing raw database and memcache connections through to rendering the site and presenting the public APIs, was in one codebase. Not only was it increasingly difficult for an engineer to be an expert in how it was put together, but also it was organizationally challenging for us to manage and parallelize our engineering team.
  • We had reached the limit of throughput on our storage systems –– we were relying on a MySQL storage system that was temporally sharded and had a single master. That system was having trouble ingesting tweets at the rate that they were showing up, and we were operationally having to create new databases at an ever increasing rate. We were experiencing read and write hot spots throughout our databases.
  • We were “throwing machines at the problem” instead of engineering thorough solutions –– our front-end Ruby machines were not handling the number of transactions per second that we thought was reasonable, given their horsepower. From previous experiences, we knew that those machines could do a lot more.
  • Finally, from a software standpoint, we found ourselves pushed into an “optimization corner” where we had started to trade off readability and flexibility of the codebase for performance and efficiency.

We concluded that we needed to start a project to re-envision our system. We set three goals and challenges for ourselves:

  • We wanted big infrastructure wins in performance, efficiency, and reliability –– we wanted to improve the median latency that users experience on Twitter as well as bring in the outliers to give a uniform experience to Twitter. We wanted to reduce the number of machines needed to run Twitter by 10x. We also wanted to isolate failures across our infrastructure to prevent large outages –– this is especially important as the number of machines we use go up, because it means that the chance of any single machine failing is higher. Failures are also inevitable, so we wanted to have them happen in a much more controllable manner.
  • We wanted cleaner boundaries with “related” logic being in one place –– we felt the downsides of running our particular monolithic codebase, so we wanted to experiment with a loosely coupled services oriented model. Our goal was to encourage the best practices of encapsulation and modularity, but this time at the systems level rather than at the class, module, or package level.
  • Most importantly, we wanted to launch features faster. We wanted to be able to run small and empowered engineering teams that could make local decisions and ship user-facing changes, independent of other teams.

We prototyped the building blocks for a proof of concept re-architecture. Not everything we tried worked and not everything we tried, in the end, met the above goals. But we were able to settle on a set of principles, tools, and an infrastructure that has gotten us to a much more desirable and reliable state today.
The JVM vs the Ruby VM
First, we evaluated our front-end serving tier across three dimensions: CPU, RAM, and network. Our Ruby-based machinery was being pushed to the limit on the CPU and RAM dimensions –– but we weren’t serving that many requests per machine nor were we coming close to saturating our network bandwidth. Our Rails servers, at the time, had to be effectively single threaded and handle only one request at a time. Each Rails host was running a number of Unicorn processes to provide host-level concurrency, but the duplication there translated to wasteful resource utilization. When it came down to it, our Rails servers were only capable of serving 200 – 300 requests / sec / host.

Twitter’s usage is always growing rapidly, and doing the math there, it would take a lot of machines to keep up with the growth curve.

At the time, Twitter had experience deploying fairly large scale JVM-based services –– our search engine was written in Java, and our Streaming Api infrastructure as well as Flock, our social graph system, was written in Scala. We were enamored by the level of performance that the JVM gave us. It wasn’t going to be easy to get our performance, reliability, and efficiency goals out of the Ruby VM, so we embarked on writing code to be run on the JVM instead. We estimated that rewriting our codebase could get us > 10x performance improvement, on the same hardware –– and now, today, we push on the order of 10 – 20K requests / sec / host.

There was a level of trust that we all had in the JVM. A lot of us had come from companies where we had experience working with, tuning, and operating large scale JVM installations. We were confident we could pull off a sea change for Twitter in the world of the JVM. Now, we had to decompose our architecture and figure out how these different services would interact.
Programming model
In Twitter’s Ruby systems, concurrency is managed at the process level: a single network request is queued up for a process to handle. That process is completely consumed until the network request is fulfilled. Adding to the complexity, architecturally, we were taking Twitter in the direction of having one service compose the responses of other services. Given that the Ruby process is single-threaded, Twitter’s “response time” would be additive and extremely sensitive to the variances in the back-end systems’ latencies. There were a few Ruby options that gave us concurrency; however, there wasn’t one standard way to do it across all the different VM options. The JVM had constructs and primitives that supported concurrency and would let us build a real concurrent programming platform.

It became evident that we needed a single and uniform way to think about concurrency in our systems and, specifically, in the way we think about networking. As we all know, writing concurrent code (and concurrent networking code) is hard and can take many forms. In fact, we began to experience this. As we started to decompose the system into services, each team took slightly different approaches. For example, the failure semantics from clients to services didn’t interact well: we had no consistent back-pressure mechanism for servers to signal back to clients and we experienced “thundering herds” from clients aggressively retrying latent services. These failure domains informed us of the importance of having a unified, and complementary, client and server library that would bundle in notions of connection pools, failover strategies, and load balancing. To help us all get in the same mindset, we put together both Futures and Finagle.

Now, not only did we have a uniform way to do things, but we also baked into our core libraries everything that all our systems needed so we could get off the ground faster. And rather than worry too much about how each and every system operated, we could focus on the application and service interfaces.
Independent systems
The largest architectural change we made was to move from our monolithic Ruby application to one that is more services oriented. We focused first on creating Tweet, timeline, and user services –– our “core nouns”. This move afforded us cleaner abstraction boundaries and team-level ownership and independence. In our monolithic world, we either needed experts who understood the entire codebase or clear owners at the module or class level. Sadly, the codebase was getting too large to have global experts and, in practice, having clear owners at the module or class level wasn’t working. Our codebase was becoming harder to maintain, and teams constantly spent time going on “archeology digs” to understand certain functionality. Or we’d organize “whale hunting expeditions” to try to understand large scale failures that occurred. At the end of the day, we’d spend more time on this than on shipping features, which we weren’t happy with.

Our theory was, and still is, that a services oriented architecture allows us to develop the system in parallel –– we agree on networking RPC interfaces, and then go develop the system internals independently –– but, it also meant that the logic for each system was self-contained within itself. If we needed to change something about Tweets, we could make that change in one location, the Tweet service, and then that change would flow throughout our architecture. In practice, however, we find that not all teams plan for change in the same way: for example, a change in the Tweet service may require other services to do an update if the Tweet representation changed. On balance, though, this works out more times than not.

This system architecture also mirrored the way we wanted, and now do, run the Twitter engineering organization. Engineering is set up with (mostly) self-contained teams that can run independently and very quickly. This means that we bias toward teams spinning up and running their own services that can call the back end systems. This has huge implications on operations, however.
Storage
Even if we broke apart our monolithic application into services, a huge bottleneck that remained was storage. Twitter, at the time, was storing tweets in a single master MySQL database. We had taken the strategy of storing data temporally –– each row in the database was a single tweet, we stored the tweets in order in the database, and when the database filled up we spun up another one and reconfigured the software to start populating the next database. This strategy had bought us some time, but, we were still having issues ingesting massive tweet spikes because they would all be serialized into a single database master so we were experiencing read load concentration on a small number of database machines. We needed a different partitioning strategy for Tweet storage.

We took Gizzard, our framework to create sharded and fault-tolerant distributed databases, and applied it to tweets. We created T-Bird. In this case, Gizzard was fronting a series of MySQL databases –– every time a tweet comes into the system, Gizzard hashes it, and then chooses an appropriate database. Of course, this means we lose the ability to rely on MySQL for unique ID generation. Snowflake was born to solve that problem. Snowflake allows us to create an almost-guaranteed globally unique identifier. We rely on it to create new tweet IDs, at the tradeoff of no longer having “increment by 1” identifiers. Once we have an identifier, we can rely on Gizzard then to store it. Assuming our hashing algorithm works and our tweets are close to uniformly distributed, we increase our throughput by the number of destination databases. Our reads are also then distributed across the entire cluster, rather than being pinned to the “most recent” database, allowing us to increase throughput there too.
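To make the ID scheme concrete, here is a sketch of a Snowflake-style generator. The bit layout (41-bit timestamp, 10-bit worker ID, 12-bit per-millisecond sequence) follows the open-sourced Snowflake, but this is an illustration rather than Twitter's code:

class IdWorker(workerId: Long) {
  require(workerId >= 0 && workerId < 1024, "worker id must fit in 10 bits")
  private val twepoch = 1288834974657L     // Snowflake's custom epoch
  private var lastTs = -1L
  private var sequence = 0L

  def nextId(): Long = synchronized {
    var ts = System.currentTimeMillis()
    if (ts == lastTs) {
      sequence = (sequence + 1) & 0xFFF    // 12-bit sequence within one millisecond
      if (sequence == 0)                   // sequence exhausted: spin to the next tick
        while (ts <= lastTs) ts = System.currentTimeMillis()
    } else sequence = 0
    require(ts >= lastTs, "clock moved backwards")
    lastTs = ts
    ((ts - twepoch) << 22) | (workerId << 12) | sequence
  }
}

IDs built this way are roughly time-ordered rather than increment-by-1, which is exactly the tradeoff described above.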
Observability and statistics
We’ve traded our fragile monolithic application for a more robust and encapsulated, but also complex, services oriented application. We had to invest in tools to make managing this beast possible. Given the speed with which we were creating new services, we needed to make it incredibly easy to gather data on how well each service was doing. By default, we wanted to make data-driven decisions, so we needed to make it trivial and frictionless to get that data.

As we were going to be spinning up more and more services in an increasingly large system, we had to make this easier for everybody. Our Runtime Systems team created two tools for engineering: Viz and Zipkin. Both of these tools are exposed and integrated with Finagle, so all services that are built using Finagle get access to them automatically.

stats.timeFuture("request_latency_ms") {
  // dispatch to do work
}

The above code block is all that is needed for a service to report statistics into Viz. From there, anybody using Viz can write a query that will generate a timeseries and graph of interesting data like the 50th and 99th percentile of request_latency_ms.

Runtime configuration and testing
Finally, as we were putting this all together, we hit two seemingly unrelated snags: launches had to be coordinated across a series of different services, and we didn’t have a place to stage services that ran at “Twitter scale”. We could no longer rely on deployment as the vehicle to get new user-facing code out there, and coordination was going to be required across the application. In addition, given the relative size of Twitter, it was becoming difficult for us to run meaningful tests in a fully isolated environment. We had, relatively, no issues testing for correctness in our isolated systems –– we needed a way to test for large scale iterations. We embraced runtime configuration.

We integrated a system we call Decider across all our services. It allows us to flip a single switch and have multiple systems across our infrastructure all react to that change in near-real time. This means software and multiple systems can go into production when teams are ready, but a particular feature doesn’t need to be “active”. Decider also allows us to have the flexibility to do binary and percentage based switching such as having a feature available for x% of traffic or users. We can deploy code in the fully “off” and safe setting, and then gradually turn it up and down until we are confident it’s operating correctly and systems can handle the new load. All this alleviates our need to do any coordination at the team level, and instead we can do it at runtime.
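A percentage gate of that kind can be tiny. The sketch below is a hypothetical decider check, not Twitter's Decider; gating on the user ID instead of rolling a random number keeps each user's experience stable during a ramp:

class Decider(gates: Map[String, Int]) {   // feature name -> percentage 0..100
  def isAvailable(feature: String, userId: Long): Boolean =
    (userId % 100) < gates.getOrElse(feature, 0)   // 0 = fully off, 100 = fully on
}

val decider = new Decider(Map("new_search_backend" -> 5))  // on for ~5% of users
// if (decider.isAvailable("new_search_backend", userId)) newPath() else oldPath()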
Today
Twitter is more performant, efficient and reliable than ever before. We’ve sped up the site incredibly across the 50th (p50) through 99th (p99) percentile distributions and the number of machines involved in serving the site itself has been decreased anywhere from 5x-12x. Over the last six months, Twitter has flirted with four 9s of availability.

Twitter engineering is now set up to mimic our software stack. We have teams that are ready for long term ownership and to be experts on their part of the Twitter infrastructure. Those teams own their interfaces and their problem domains. Not every team at Twitter needs to worry about scaling Tweets, for example. Only a few teams –– those that are involved in the running of the Tweet subsystem (the Tweet service team, the storage team, the caching team, etc.) –– have to scale the writes and reads of Tweets, and the rest of Twitter engineering gets APIs to help them use it.

Two goals drive us as we did all this work: Twitter should always be available for our users, and we should spend our time making Twitter more engaging, more useful and simply better for our users. Our systems and our engineering team now enable us to launch new features faster and in parallel. We can dedicate different teams to work on improvements simultaneously and have minimal logjams for when those features collide. Services can be launched and deployed independently from each other (in the last week, for example, we had more than 50 deploys across all Twitter services), and we can defer putting everything together until we’re ready to make a new build for iOS or Android.

Keep an eye on this blog and @twittereng for more posts that will dive into details on some of the topics mentioned above.

Thanks goes to Jonathan Reichhold (@jreichhold), David Helder (@dhelder), Arya Asemanfar (@a_a), Marcel Molina (@noradio), and Matt Harris (@themattharris) for helping contribute to this blog post.



ZT: How will OpenStack affect the software industry?

[CSDN report] Tan Lin, founder of the Stratus Alliance China accelerator, shared a Citibank report on Weibo; below is a translated digest of that report.

Report highlights

Figure: code contributions compared between the E (Essex) and G (Grizzly) releases

  • OpenStack has too many contributors and too many interests entangled, which will delay enterprise deployments; this is not a stable model.
  • It lacks enterprise-grade vendors, and some components still lack stability. We think enterprise users will look for hybrid-cloud solutions first.
  • By the time OpenStack offers some public-cloud services, AWS will likely have become even more mature and even larger.
  • Some argue that VMware's innovation in the OpenStack ecosystem handed vCloud the initiative, but in our view VMware has effectively fallen behind.
  • Microsoft Azure is far more mature than OpenStack and is a qualified hybrid-cloud player, but it lacks a competitive ecosystem and cannot offer many next-generation applications built on cloud infrastructure.
  • Over the past three releases, Red Hat became OpenStack's largest contributor, shipping an OpenStack distribution plus commercial support. Red Hat focuses on the enterprise market.
  • Citrix offers a competing product, CloudStack, which is more mature and easier to deploy. We think the management-software market (CA, BMC) is feeling pressure from the new generation of tools offered by cloud management platforms.

OpenStack lets service providers compete with AWS the way Google and Microsoft do

Figure: OpenStack deployment models

Citibank estimates that AWS accounts for roughly 90% of Amazon's “other revenue”, which puts its 2013 revenue at about US$3.8 billion, roughly ten times that of Rackspace, AWS's largest competitor. That is very attractive to ISVs.

In large-scale public cloud, Amazon will prove very hard to compete with: in scale, workloads, the richness of its developer APIs, its self-service interface, and more, Amazon leads across the board.

Competition in the hybrid-cloud market is fierce.

OpenStack case studies

Figure: OpenStack user cases

Among projects that have already deployed OpenStack, we can observe (see the figure above):

  • Most users run the E (Essex) release, i.e., deployments that began about a year ago.
  • Most users use only the two basic projects, Nova and Swift, which shows how immature OpenStack still is; projects such as Cinder and Quantum are far from mature.
  • Most users run OpenStack as a private cloud. We are sure they have a clear need for hybrid cloud, so OpenStack's core driving force has yet to be unleashed. We believe large IT companies such as HP and IBM plan to launch turnkey hybrid-cloud services.
  • Most players did not directly replace existing systems (PayPal is the exception: it replaced its VMware platform outright) but added an OpenStack platform alongside. We therefore think private-cloud users will not build OpenStack directly on their virtualization platforms.

OpenStack business models

Figure: OpenStack business models

  • Integrators: HP and IBM both offer turnkey solutions, while Mirantis holds a large share.
  • Service providers: these players sit in the public-cloud space, with Rackspace clearly in the lead; over time, though, competition in this segment will become fierce.
  • OpenStack+: these vendors layer extra services on top of a provider, such as added management capabilities or an improved installation process.
  • Distributions: these players are close to Linux and have experience selling software to users and supporting it. HP and Dell, however, will ship distributions while also focusing on turnkey services.
  • OpenStack hardware: Nebula and Morph Labs ship distributions but, more importantly, integrated hardware, which greatly lowers the deployment barrier. We think Oracle will ally with Nebula to sell OpenStack hardware, and we see other hardware vendors driving OpenStack adoption the same way.

OpenStack's competitive landscape

Figure: monthly contributions across the four major open-source cloud platforms; OpenStack leads by a wide margin

Figure: community health compared with other open-source communities

Product maturity

Swift has run in Rackspace's production environment for years, and Nova, one of the earliest components, already runs in several large deployments such as HP's. After extensive testing, these two components are mature. Components such as Cinder and Quantum are far from mature; even so, IBM and HP will still include them in turnkey offerings.

Price comparison with AWS and GCE


Figure: OpenStack has no price advantage

Summary

We believe OpenStack's enterprise users will concentrate in Web 2.0 companies and special entities (CSDN note: the report does not explain which users this refers to). We expect OpenStack public-cloud deployments to slow; although some security-minded and government organizations will adopt hybrid architectures, that is only a fraction of enterprise IT spending. We therefore do not expect OpenStack software companies to grow significantly in 2013 or 2014.

Citibank's conclusions

  • We believe private clouds will most likely be bought as turnkey solutions (IBM and the like), while public-cloud users will lean toward service providers (Rackspace and the like). In other words, a distribution is worth less than a turnkey solution.
  • Without a killer feature, OpenStack providers will find it very hard to compete with AWS. Meanwhile, providers that can win large numbers of private-cloud customers will succeed.
  • Operations management will become smarter. Many startups (such as RightScale, ServiceMesh, Puppet Labs, Cloudsoft, Hotlink, OpsCode) will challenge CA and BMC.
  • Leadership in server virtualization does not translate into leadership in public and hybrid cloud. VMware holds about 70% of the virtualization market, but applications that run on clouds differ from those built on virtualization platforms: for example, they rely on integration with infrastructure-level APIs, scale out across hardware, and tolerate individual failures. Hypervisors and server virtualization therefore do not simply evolve into cloud platforms, and VMware is on the weak side of the public- and hybrid-cloud race.
  • Hardware budgets will be eaten away by software budgets.

OpenStack's impact on software companies

Figure: OpenStack's impact on the software industry

Red Hat

We believe that, on the strength of its robust open-source ecosystem (solid hardware and software vendor certifications), Red Hat will win many enterprise users.

VMware

We think platforms already built on VMware will stay, which guarantees the company steady profits, but license fees mainly come from new customers.

Our analysis suggests VMware will not ship OpenStack-equivalent APIs before the end of 2014, by which time OpenStack will be far more mature than it is now. If that plays out, new customers will lean toward OpenStack.

After 2014, VMware will earn considerable revenue through Nicira.

Citrix

We do not think CloudStack will bring Citrix significant revenue (likely under US$20 million a year). We remain positive on CloudStack in the short and medium term, but are unsure about the long run.

Microsoft

Azure is a head-on rival to OpenStack, and it also clearly trails Amazon. Azure is far more mature than OpenStack but lacks flexibility, because it has not broken out of the Windows ecosystem. We expect Azure to keep winning the mid-market, where users want simplicity.

Oracle

Oracle's profits come mainly from enterprise customers: databases contribute 70% of profit and applications 20%, and that software could easily be made compatible with OpenStack. Doing so, however, would mean Oracle keeping existing customers on their current platforms, which runs counter to OpenStack's direction.

CA / BMC

We believe the whole industry is re-platforming, from client/server to cloud, and for CA and BMC that could mean a major defeat.

That was the digest of the Citibank report. Now for the takeaways of OpenStack Foundation board member Cheng Hui after attending last week's OpenStack Summit:

OpenStack users

Cheng Hui mentioned users such as Comcast, Best Buy, Bloomberg, HubSpot, the NSA, and CERN. Among them:

Best Buy revealed that OpenStack carried 25% of its traffic during last Thanksgiving. Before OpenStack, its infrastructure provider charged US$20,000 per virtual machine; now, on its self-built OpenStack cloud, a rack that can host a thousand VMs costs only US$91,000.

The NSA is the United States National Security Agency, the government's most secretive, best-funded, and largest intelligence agency and the hub of all US intelligence services. Inside the NSA runs a security-hardened build of OpenStack.

Cheng Hui also mentioned an investor email describing a migration of 80,000 servers to OpenStack. He also spoke on site with a Chinese engineer who has worked at Boeing for 33 years:

He is currently responsible for Boeing's IT infrastructure and said they are replacing their earlier VMware deployment with OpenStack.

OpenStack business models and players

Nebula

Figure: Nebula Cloud Controller

Most observers, including the Citibank report above, describe Nebula as selling an all-in-one hardware appliance. After talking with them, however, Cheng Hui concluded:

Nebula only provides the Nebula Cloud Controller, a hardware-software appliance: a 2U rack server fused with a 24-port 10GbE switch, a real “monster”, since servers and network gear are rarely merged into one box. Also notable is the choice of dual AMD Opteron processors on the G34 socket rather than mainstream Intel parts, which hints at the struggles behind this design. The rest of the configuration is 64GB RAM, a 1TB disk, and a 256GB SSD cache drive.

A Controller plus at least three or so Compute Node servers and one switch forms a private cloud; with the license included, the starting price is around US$100,000.

Mirantis

Mirantis counts as one of the real successes in the US OpenStack ecosystem, and a company that actually makes money. Early this year it open-sourced Fuel, its OpenStack deployment toolkit, sharing its automation tools, reference deployment architectures, and HA schemes. At this OpenStack Summit, Mirantis SVP Boris presented Fuel Web, a productized web UI built on Fuel: through the web interface you can initialize servers and deploy OpenStack services automatically, very intuitively and conveniently.

Red Hat

Figure: Red Hat's cloud platform architecture

As the figure shows, Red Hat has finally made real gains in cloud computing. Anchored on Red Hat Linux/KVM and complemented by OpenStack, OpenShift, and the CloudForms hybrid-cloud platform, Red Hat's cloud strategy has never been clearer. But OpenStack's rapid pace, with an upgrade every six months, has so far kept Red Hat from offering commercial support for it.

Technical sessions

Nicira publicly presented NVP's architecture and views for the first time

Nicira has always felt as closed as VMware. The original Nicira team holds some sway over Quantum, and some Nicira technology has been brought into the Quantum project, yet technical talks from Nicira have been rare. This Summit was the exception.

Block storage: Ceph runs hot while Cinder gets the cold shoulder

Cinder's weakness is single-node deployment: there is no HA, and the storage node easily becomes a hotspot. Ceph, one of the open-source distributed block-storage options usable behind Cinder, is not itself an OpenStack project and has quite a history: it grew out of a 2007 PhD thesis by Sage Weil, co-founder of Dreamhost, who kept developing Ceph after graduating and founded Inktank last year to provide commercial Ceph support. I learned that Piston's enterprise distribution already integrates Ceph, and PayPal uses it internally.

Closing thoughts

Without question, after two and a half years of rapid growth OpenStack has become the hottest topic, with more and more users, successful startups (such as Mirantis), and even inspiring victories (PayPal migrating from VMware to OpenStack). But remember: AWS, OpenStack's biggest rival, is growing fast too, and its economies of scale and mature products will claim more of the market before OpenStack matures (a Baird report projects AWS sales of US$10 billion a year by 2016). (By Bao Yan; reviewed by Zhong Hao)



ZT: Some learning materials for the systems architecture field

A summary written by Baidu scientist Lin Shiding (original link). It covers topics I am interested in, such as virtual machines, distributed systems, and P2P, so I am reposting it here.

Tags: architecture, systems, system research

Systems architecture is a field where engineering meets research: it values practice yet leans on theory, it is easy to enter but hard to master, and sometimes it takes a bit of intuition, which gives it a “pseudo-science” flavor. To advance in this field, besides continually designing and building real systems, you must also study and distill methodology and design philosophy.

Students often ask how to study this field, so here is a reading list for reference. I wrote it in 2009, picking out of the vast systems literature the projects I considered worth studying; it omits work from recent years and is not comprehensive. Even so, it is enough: reading papers goes from few to many and back to few. A rough grasp of a problem's essence, background, and history, plus hands-on practice (long-term, real practice), is enough to find the way into this field.

This piece has been reposted widely online, mostly without attribution. Republishing it here today also happens to be a salute to 3.15 (Consumer Rights Day).

Engineers often hit a growth ceiling at a certain stage. To break through it, you need to study your technical field in more depth: its essential problems, its methodology and design philosophy, its history. Below are study materials for architecture-related areas, with brief comments, for interested engineers. I hope that by learning these areas you will master more system design principles, handle your own work with ease, and step into the realm of freedom.

1. Operating Systems

Mach [Intro: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/mach.html, Paper: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/doc/publications.html]

In traditional kernel implementations, interrupt handling lives in one “big function”: from the interrupt's entry to its exit it is a single control flow, and when interrupts re-enter, the logic becomes very complex. Most OSes, such as UNIX, use this monolithic kernel architecture.

The Mach project, started in 1985, proposed a brand-new microkernel structure. Academia, which had felt there was nothing left to build on after UNIX peaked in the 1970s, suddenly had something to be excited about, and the noisy monokernel-versus-microkernel debate began.

A side note: Mach's leader Richard Rashid, then a CMU professor, was sent by Bill Gates to persuade Jim Gray to join Microsoft. He ended up being drawn in himself and founded Microsoft Research. He has given several 21st Century Computing keynotes in China.

Exokernel [Intro:http://pdos.csail.mit.edu/exo/,Paper:http://pdos.csail.mit.edu/PDOS-papers.html#Exokernels]

Although the microkernel structure is elegant, it never saw wide adoption because performance was too poor, and people gradually realized that the OS problem is not implementation complexity but how flexibly applications can use resources. That is why, once kernel extensions (such as loadable modules in Linux) appeared, the debate over OS kernel architecture slowly faded from view.

Exokernel, proposed by MIT, appeared against this backdrop. It provides none of the traditional OS abstractions (process, virtual memory, and so on), focusing instead on resource isolation and multiplexing. On top of the exokernel sits a library, the famous libOS, which implements the interfaces of various OSes. This structure gives applications maximum flexibility: one application can focus on scheduling fairness or real-time response, another on resource efficiency for performance. Seen with today's eyes, an exokernel looks more like a virtual machine monitor.

Singularity [Intro: http://research.microsoft.com/os/Singularity/, Paper: http://www.research.microsoft.com/os/singularity/publications/HotOS2005_BroadNewResearch.pdf]

Singularity, proposed by Microsoft Research, appeared in the early 2000s, when viruses and spyware were endless and unkillable. Academia and industry were both discussing how to provide a trustworthy computing environment and make computer systems more manageable. Singularity's position is that solving these problems requires hard isolation from the underlying system, and the hardware virtual-memory mechanism everyone had relied on offers neither high flexibility nor good performance. With runtimes like .NET and Java, a software-level solution became possible.

On a microkernel base, Singularity uses .NET to build type-safe assemblies as the ABI and prescribes message passing for data exchange, which rules out modifying isolated data at the root. Add safety checks on applications, and you get a controllable, manageable operating system. Thanks to continuous .NET CLR optimization and hardware progress, the performance cost of these checks is acceptable relative to the good properties they buy.

This design is still at the laboratory stage; whether it can ultimately win will need the kind of opportunity UNIX once had.

2. Virtual Machines

VMWare [“Memory Resource Management in VMware ESX Server”, OSDI’02, Best paper award]

The well-known VMware; no introduction needed.

XEN [“Xen and the Art of Virtualization”, OSDI’04]

A VMM with excellent performance, from Cambridge.

Denali [“Scale and Performance in the Denali Isolation Kernel”, OSDI’02, UW]

An application-level virtual machine designed for internet services; thousands of VMs can run on an ordinary machine. Its VMM is built on an isolation kernel, which provides isolation but does not insist on perfectly fair resource allocation, trading that for lower performance overhead.

Entropia [“The Entropia VirtualMachine for Desktop Grids”, VEE’05]

To harness a company's desktop machines for computation, compute tasks must be packaged carefully so they neither disturb normal use nor touch user data. Entropia provides such an environment: an application-level virtual machine on Windows whose basic approach is to redirect the syscalls a task makes in order to guarantee isolation. Similar work includes FVM: “A Feather-weight Virtual Machine for Windows Applications”.

3. Design Revisited

“Are Virtual Machine Monitors Microkernels Done Right?”, HotOS’05

The title sounds puzzling at first; it means that VMMs are in fact the right way to implement microkernels. The paper discusses VMMs and microkernels in detail and is an excellent reference for both concepts.

“Thirty Years Is Long Enough: Getting Beyond C”, HotOS’05

C may be the most successful programming language in the world, but its weaknesses are equally plain. For example, it has no thread support, which leaves it straining on today's highly parallel hardware, exactly where functional programming languages shine. How to combine the strengths of both is a very promising area.

4. Programming Model

Why Threads Are a Bad Idea

A server built on threads alone can hardly reach truly high performance, due to memory usage, context-switch costs, synchronization costs, and the programming complexity of getting locks right.

“SEDA: An Architecture for Well-Conditioned, Scalable Internet Services”, OSDI’01

Threads are bad, but events cannot solve everything either, so we look for a combination. SEDA splits an application into multiple stages connected by queues; within one stage, several threads can run the events in its queue, and thread counts can adjust automatically through feedback.
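A minimal sketch of one SEDA-style stage (names assumed; real SEDA also resizes the thread pool from feedback):

import java.util.concurrent.{Executors, LinkedBlockingQueue}

class Stage[A, B](threads: Int, handler: A => B, next: LinkedBlockingQueue[B]) {
  val queue = new LinkedBlockingQueue[A]()         // events waiting for this stage
  private val pool = Executors.newFixedThreadPool(threads)
  (1 to threads).foreach { _ =>
    pool.submit(new Runnable {
      // Each worker drains this stage's queue and feeds the next stage.
      def run(): Unit = while (true) next.put(handler(queue.take()))
    })
  }
}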

Software Transactional Memory

If memory offered transaction semantics, we would face an entirely different world: languages, compilers, OSes, and runtimes would all change fundamentally. Intel is working on hardware transactional memory, but it is unlikely to be commercial in the foreseeable future, so people turn to software solutions. Naturally such a scheme cannot be based on native assembly; implementations exist today for C#, Haskell, and other languages. There is plenty of material; see Wikipedia.
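As a taste of the model, here is an account transfer written with the ScalaSTM library (a sketch, assuming scala-stm on the classpath); either both writes commit or neither does, with no explicit locks:

import scala.concurrent.stm._

object Transfer {
  val from, to = Ref(100)                  // transactional references
  def move(amount: Int): Unit = atomic { implicit txn =>
    from() = from() - amount               // the two updates form one atomic transaction
    to() = to() + amount
  }
}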

5. Distributed Algorithms

Logical clock [“Time, clocks, and the ordering of events in a distributed system”, Leslie Lamport, 1978]

The classic paper on logical clocks, timestamps, and distributed synchronization.
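The clock rules themselves fit in a few lines; a sketch:

// Tick on every local event; on receive, jump past the sender's timestamp.
class LamportClock {
  private var time = 0L
  def tick(): Long = synchronized { time += 1; time }   // local event or send
  def receive(msgTime: Long): Long = synchronized {
    time = math.max(time, msgTime) + 1                  // preserves happened-before
    time
  }
}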

Byzantine [“The Byzantine Generals Problem”, Leslie Lamport, 1982]

Failures in distributed systems come in many kinds: some crash on error, some limp along, and the worst act maliciously when they fail. That last kind of malicious behavior is like a general defecting on campaign, and it can seriously damage the system. For this class of problem Lamport proposed the Byzantine failure model: in a state machine of 3f+1 replicas, as long as no more than f replicas turn traitor, the state machine as a whole keeps working.

Paxos [“The part-time parliament”, Leslie Lamport, 1998]

How to reach consensus in an asynchronous distributed environment is the most fundamental question in distributed algorithms, and Paxos is the summit of this line of work. The paper is notoriously hard (the joke is that only 3.5 people in the world understand it), so Lamport later wrote a popularized version, “Paxos Made Simple”, which is still hard. Also see Butler Lampson's “The ABCD’s of Paxos” (PODC’01): its treatment of replicated state machines will seriously sharpen your understanding of the nature of the parallel world. A Turing Award winner's strength is not for show.

One name keeps recurring above: Leslie Lamport, who kept digging in distributed computing until he became a grandmaster. A few anecdotes about him: his old MSR homepage reportedly said something like, “When I was working on logical clocks, Bill Gates was still in diapers…” (paraphrased; I can no longer find the original). Also, when writing papers he likes to weave other big names, lightly disguised, into the text, which may be why he has not yet received the Turing Award.

For more of Lamport's achievements, see the paper written in tribute for his 60th birthday: “Lamport on mutual exclusion: 27 years of planting seeds”, PODC’01.

6. Overlay Networking and P2P DHT

RON [“Resilient Overlay Networks”, SOSP’01]

RON describes how to build an application-layer overlay that recovers from wide-area network-layer faults within seconds, whereas recovering connectivity through routing protocols takes tens of minutes at best. That fast recovery and flexibility are why overlay networking is now widely used.
Application Level Multicast

“End System Multicast”, SigMetrics’00
“Scalable Application Layer Multicast”, SigComm’02

There are many ALM papers. Essentially they all describe how to build a mesh network for robust delivery of control messages, plus a multicast tree for efficient data delivery, with some layered delivery tuned to the characteristics of multimedia data. Systems like Coolstreaming and PPLive from a few years back are the commercial offspring of this line of work.

P2P

P2P changed the network. By structure, P2P networks fall into three kinds:

1. Napster-style: a centralized directory service with peer-to-peer data transfer.
2. Gnutella-style: queries gossip between neighbors, also called unstructured P2P.
3. DHT: unlike unstructured P2P, a DHT gives query guarantees; if the data exists, it is found within a bounded number of hops, typically log N for N nodes.

The classic DHTs are the four designs CAN, Chord, Pastry, and Tapestry. This research is mostly at the algorithm level; the systems work built on top is mainly wide-area storage. Others study mechanisms, such as incentives for sharing and cheating prevention.
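The common core of these structured overlays is placing nodes and keys on one hash ring. A minimal consistent-hashing sketch (one hop shown; Chord's finger tables are what cut lookups to log N hops):

import scala.collection.immutable.TreeMap
import scala.util.hashing.MurmurHash3

class HashRing(nodes: Seq[String]) {
  private val ring: TreeMap[Int, String] =
    TreeMap(nodes.map(n => MurmurHash3.stringHash(n) -> n): _*)
  // A key belongs to the first node at or after its hash position, wrapping around.
  def lookup(key: String): String = {
    val it = ring.iteratorFrom(MurmurHash3.stringHash(key))
    if (it.hasNext) it.next()._2 else ring.head._2
  }
}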

7. Distributed Systems

GFS/MapReduce/BigTable/Chubby/Sawzall

Google's series of papers; everyone knows them, so there is little to add. They are easy to find online.
Storage

There are far too many distributed-storage papers; a few of the most relevant:

“Chain Replication for Supporting High Throughput and Availability”, OSDI’04
“Dynamo: Amazon’s Highly Available Key-value Store”, SOSP’07
“BitVault: a Highly Reliable Distributed Data Retention Platform”, SIGOPS OSR’07
“PacificA: Replication in Log-Based Distributed Storage Systems”, MSR-TR
Distributed Simulation

“Simulating Large-Scale P2P Systems with the WiDS Toolkit”, MASCOTS’05. The interesting part of distributed simulation is that the simulated protocol is distributed and the simulation engine itself is also distributed, so logical and physical times and events interleave in the system and must be handled with care.

8. Controversial Computing Models

Today's software systems have grown complex beyond what people can grasp: many ship with plenty of deterministic and non-deterministic bugs and can only be patched continually. Since we humans are not precise enough to fix systems completely clean, we can only study, from another angle, ways to keep systems working in this depressing environment. It is like a distributed system: failures are unavoidable, so we let the system as a whole provide high reliability.

The three below are typical representatives. Broadly, the research centers on (1) how to save state correctly; (2) how to catch errors and restore state; and (3) how to recover at the unit level without disturbing the whole.

Recovery Oriented Computing

Failure oblivious computing, OSDI’04

Treating Bugs as Allergies, SOSP’05

9. Debugging

Systems are so complex that humans cannot analyze them logically head-on; we can only observe them macroscopically through data mining.

Black box debugging[“Performance debugging for distributed systems of black boxes”, SOSP’03]

Performance debugging of large systems is very hard because many problems are non-deterministic and unreproducible. The only way is to mine the logs, pairing up calls and messages to localize problems.

CP-miner [“A Tool for Finding Copy-paste and Related Bugs in Operating System Code”, OSDI’04]

Many people copy-paste when reusing code, but a careless CP can cause serious bugs, such as duplicated local variable names. CP-miner analyzes code, builds syntax trees, and mines for this class of error.


[ZT] Finagle: A Protocol-Agnostic RPC System

Source from: Finagle: A Protocol-Agnostic RPC System

 Finagle is a protocol-agnostic, asynchronous RPC system for the JVM that makes it easy to build robust clients and servers in Java, Scala, or any JVM-hosted language.

Rendering even the simplest web page on twitter.com requires the collaboration of dozens of network services speaking many different protocols. For example, in order to render the home page, the application issues requests to the Social Graph Service, Memcached, databases, and many other network services. Each of these speaks a different protocol: Thrift, Memcached, MySQL, and so on. Additionally, many of these services speak to other services — they are both servers and clients. The Social Graph Service, for instance, provides a Thrift interface but consumes from a cluster of MySQL databases.

In such systems, a frequent cause of outages is poor interaction between components in the presence of failures; common failures include crashed hosts and extreme latency variance. These failures can cascade through the system by causing work queues to back up, TCP connections to churn, or memory and file descriptors to become exhausted. In the worst case, the user sees a Fail Whale.

CHALLENGES OF BUILDING A STABLE DISTRIBUTED SYSTEM

Sophisticated network servers and clients have many moving parts: failure detectors, load-balancers, failover strategies, and so on. These parts need to work together in a delicate balance to be resilient to the varieties of failure that occur in a large production system.

This is made especially difficult by the many different implementations of failure detectors, load-balancers, and so on, per protocol. For example, the implementation of the back-pressure strategies for Thrift differ from those for HTTP. Ensuring that heterogeneous systems converge to a stable state during an incident is extremely challenging.

OUR APPROACH

We set out to develop a single implementation of the basic components of network servers and clients that could be used for all of our protocols. Finagle is a protocol-agnostic, asynchronous Remote Procedure Call (RPC) system for the Java Virtual Machine (JVM) that makes it easy to build robust clients and servers in Java, Scala, or any JVM-hosted language. Finagle supports a wide variety of request/response-oriented RPC protocols and many classes of streaming protocols.

Finagle provides a robust implementation of:

  • connection pools, with throttling to avoid TCP connection churn;
  • failure detectors, to identify slow or crashed hosts;
  • failover strategies, to direct traffic away from unhealthy hosts;
  • load-balancers, including “least-connections” and other strategies; and
  • back-pressure techniques, to defend servers against abusive clients and dogpiling.

Additionally, Finagle makes it easier to build and deploy a service that

  • publishes standard statistics, logs, and exception reports;
  • supports distributed tracing (a la Dapper) across protocols;
  • optionally uses ZooKeeper for cluster management; and
  • supports common sharding strategies.

We believe our work has paid off — we can now write and deploy a network service with much greater ease and safety.

FINAGLE AT TWITTER

Today, Finagle is deployed in production at Twitter in several front- and back-end serving systems, including our URL crawler and HTTP Proxy. We plan to continue deploying Finagle more widely.


A Finagle-based architecture (under development). The diagram illustrates a future architecture that uses Finagle pervasively. For example, the User Service is a Finagle server that uses a Finagle memcached client, and speaks to a Finagle Kestrel service.

HOW FINAGLE WORKS

Finagle is flexible and easy to use because it is designed around a few simple, composable primitives: Futures, Services, and Filters.

Future objects

In Finagle, Future objects are the unifying abstraction for all asynchronous computation. A Future represents a computation that may not yet have completed and that can either succeed or fail. The two most basic ways to use a Future are to:

  • block and wait for the computation to return
  • register a callback to be invoked when the computation eventually succeeds or fails

Future callbacks

In cases where execution should continue asynchronously upon completion of a computation, you can specify a success and a failure callback. Callbacks are registered via the onSuccess and onFailure methods:
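The original post's listing is not preserved here; a sketch against the API just described, with client and request as stand-ins:

val future: Future[HttpResponse] = client(request)
future onSuccess { response =>
  println("got response: " + response)
} onFailure { error =>
  println("request failed: " + error)
}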

Composing Futures

Futures can be combined and transformed in interesting ways, leading to the kind of compositional behavior commonly seen in functional programming languages. For instance, you can convert a Future[String] to a Future[Int] by using map:
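For example (a sketch, assuming a call that yields a Future of a numeric string):

val fs: Future[String] = service(request)
val fi: Future[Int] = fs map { s => s.toInt }   // Future[String] => Future[Int]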

Similarly, you can use flatMap to easily pipeline a sequence of Futures:
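A sketch, with User and Tweet as assumed service façades:

val authedTweets: Future[Seq[Tweet]] =
  User.authenticate(request) flatMap { user =>
    Tweet.findAllByUser(user)
  }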

In this example, User.authenticate() is performed asynchronously; Tweet.findAllByUser() is invoked on its eventual result. This is alternatively expressed in Scala, using the for statement:
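An equivalent sketch:

val authedTweets: Future[Seq[Tweet]] = for {
  user   <- User.authenticate(request)
  tweets <- Tweet.findAllByUser(user)
} yield tweets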

Handling errors and exceptions is very easy when Futures are pipelined using flatMap or the for statement. In the above example, if User.authenticate() asynchronously raises an exception, the subsequent call to Tweet.findAllByUser() never happens. Instead, the result of the pipelined expression is still of the type Future[Seq[Tweet]], but it contains the exceptional value rather than tweets. You can respond to the exception using the onFailure callback or other compositional techniques.

A nice property of Futures, as compared to other asynchronous programming techniques (such as the continuation passing style), is that you can easily write clear and robust asynchronous code, even with more sophisticated operations such as scatter/gather:
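A sketch, assuming a fetchUrl function that returns a Future:

val futures: Seq[Future[Result]] = urls map fetchUrl        // scatter
val gathered: Future[Seq[Result]] = Future.collect(futures) // gather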

Service objects

A Service is a function that receives a request and returns a Future object as a response. Note that both clients and servers are represented as Service objects.

To create a Server, you extend the abstract Service class and listen on a port. Here is a simple HTTP server listening on port 10000:
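A sketch against Finagle's early ServerBuilder API and the Netty 3 HTTP types of that era (the post's original listing is lost here):

import java.net.InetSocketAddress
import com.twitter.finagle.Service
import com.twitter.finagle.builder.ServerBuilder
import com.twitter.finagle.http.Http
import com.twitter.util.Future
import org.jboss.netty.handler.codec.http._

val service = new Service[HttpRequest, HttpResponse] {
  def apply(request: HttpRequest) =
    Future(new DefaultHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK))
}

val server = ServerBuilder()
  .codec(Http())
  .bindTo(new InetSocketAddress(10000))
  .name("httpserver")
  .build(service)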

Building an HTTP client is even easier:
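A corresponding client sketch against the same API:

import com.twitter.finagle.builder.ClientBuilder

val client: Service[HttpRequest, HttpResponse] = ClientBuilder()
  .codec(Http())
  .hosts("localhost:10000")
  .hostConnectionLimit(1)
  .build()

val response = client(new DefaultHttpRequest(
  HttpVersion.HTTP_1_1, HttpMethod.GET, "/"))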

Filter objects

Filters are a useful way to isolate distinct phases of your application into a pipeline. For example, you may need to handle exceptions, authorization, and so forth before your Service responds to a request.

A Filter wraps a Service and, potentially, converts the input and output types of the Service to other types. In other words, a Filter is a Service transformer. Here is a filter that ensures an HTTP request has valid OAuth credentials, using an asynchronous authenticator service:
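A sketch; AuthRequest, AuthResult, and AuthenticatedRequest are hypothetical types standing in for the post's originals:

class RequireAuthentication(authenticator: Service[AuthRequest, AuthResult])
  extends Filter[HttpRequest, HttpResponse, AuthenticatedRequest, HttpResponse] {

  def apply(request: HttpRequest,
            continue: Service[AuthenticatedRequest, HttpResponse]) =
    authenticator(AuthRequest(request)) flatMap {
      // Pass an enriched request down the pipeline only when auth succeeds.
      case AuthResult(Some(user)) => continue(AuthenticatedRequest(user, request))
      case _ => Future.exception(new Exception("invalid credentials"))
    }
}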

A Filter then decorates a Service, as in this example:
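For example (handleExceptions being another hypothetical filter):

val authorize = new RequireAuthentication(authenticator)
val handleExceptions = new HandleExceptions
val protectedService = handleExceptions andThen authorize andThen myService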

Finagle is an open source project, available under the Apache License, Version 2.0. Source code and documentation are available on GitHub.

ACKNOWLEDGEMENTS

Finagle was originally conceived by Marius Eriksen and Nick Kallen. Other key contributors are Arya Asemanfar, David Helder, Evan Meagher, Gary McCue, Glen Sanford, Grant Monroe, Ian Ownbey, Jake Donham, James Waldrop, Jeremy Cloud, Johan Oskarsson, Justin Zhu, Raghavendra Prabhu, Robey Pointer, Ryan King, Sam Whitlock, Steve Jenson, Wanli Yang, Wilhelm Bierbaum, William Morgan, Abhi Khune, and Srini Rajagopal.



B+Tree and MVCC in InnoDB

A while back I gave a talk on InnoDB, mainly about the structure of InnoDB's B+Tree and the implementation of MVCC.

PPT: BpTree_MVCC

Below is a light writeup of the slides.

First, the B+Tree. Here is the structure of InnoDB's B+Tree (via

Its notable properties:

  1. The number of seeks is fixed and small (the tree is shallow), and seeks are what hard disks do slowest.
  2. Data is stored contiguously; non-leaf nodes hold only pointers while all data sits in the leaves, so the index caches well.
  3. Records are linked in a doubly linked list, so range queries are fast.
  4. Because data lives in the leaves, queries are fast (no extra seek) but inserts are slow (splits and merges move more data). In MyISAM, by contrast, leaves hold only pointers: inserts are fast, queries slower (more seeks).
  5. Although each leaf block is internally contiguous on disk, the leaves themselves are not stored contiguously. After running for a long time they fragment, which hurts range queries; MySQL does provide a way to optimize this.

I strongly recommend the article B+Tree index structures in InnoDB, which details the concrete implementation structure of InnoDB's B+Tree.

Next, MVCC.

MVCC is multi-version concurrency control; when implementing transactions, it replaces plain read-write locking. Plain read-write locks lock every piece of data touched: once read, it cannot be written; once written, it cannot be read.

Since the problem being solved is read-write conflict, the key questions are when writing is allowed and when reading is allowed, hence the notion of “isolation levels”. An isolation level states under which circumstances reads are allowed and under which writes are allowed.

InnoDB's MVCC supports four isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE. The two most used are READ COMMITTED and REPEATABLE READ.

  1. READ COMMITTED: repeated SELECTs are not guaranteed to return the same data. If the same transaction runs the same query twice and, between the two runs, another transaction modifies the queried data and commits, the two reads disagree: you “read” whatever is already “committed”.
  2. REPEATABLE READ: within a transaction, all data is visible as it stood when the transaction began; even if other transactions commit, their changes stay invisible to the current one. You can “repeatably read” the same content. (A sketch of the difference follows below.)
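A quick way to observe the difference over JDBC; a sketch (the connection URL and the accounts table are assumptions):

import java.sql.{Connection, DriverManager}

val conn = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass")
conn.setAutoCommit(false)
conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ)

val st = conn.createStatement()
val first = st.executeQuery("SELECT balance FROM accounts WHERE id = 1")
// ... another session UPDATEs the same row and COMMITs here ...
val second = st.executeQuery("SELECT balance FROM accounts WHERE id = 1")
// Under REPEATABLE READ both reads agree; under READ COMMITTED the
// second read would see the other session's committed change.
conn.commit()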

Note that regardless of isolation level, once a record has been touched by UPDATE/DELETE/SELECT FOR UPDATE, i.e., X-locked, it cannot be updated (X-locked) again by others until the transaction commits.

So how does InnoDB implement multi-versioning for transactions? In the talk I also brought out the slides of NetEase's He Dengcheng:

Link: InnoDB Transaction Lock and MVCC (Weipan, Slideshare)

Those slides detail the MVCC implementation, including the lock-related parts; here is my brief summary of the key points.

InnoDB implements the isolation levels above through the ReadView. A ReadView records, as of its creation:

  1. the smallest active transaction ID (transaction IDs are globally unique and increasing)
  2. the current transaction's ID
  3. the list of all active transaction IDs

Meanwhile, when a transaction modifies a field, it tags the new value with its own transaction ID and moves the old value, together with the old transaction ID, into the rollback segment.

With those two pieces in place, the ReadView's role emerges: a SELECT reads

  1. the data tagged with the largest transaction ID among currently non-active transactions whose IDs exceed the smallest active transaction ID
  2. or, put another way: through the ReadView, find the largest non-active (committed) transaction, take its transaction ID, and then look in the table or its rollback segments for the data carrying that ID.
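In code form, the visibility test a ReadView performs is roughly the following simplification of the real InnoDB logic (field names are illustrative):

case class ReadView(upLimit: Long,     // ids below this committed before the view
                    lowLimit: Long,    // ids at/above this began after the view
                    active: Set[Long], // ids active when the view was created
                    creator: Long) {
  def visible(trxId: Long): Boolean =
    trxId == creator ||                          // our own changes
    trxId < upLimit ||                           // committed before any active trx
    (trxId < lowLimit && !active.contains(trxId))
}

// Reading a row: walk its versions (current row, then the rollback-segment
// chain) and return the first version whose transaction ID passes visible().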

Meanwhile, any data older than the “smallest active transaction ID” can be reclaimed, since it can never be read again.

So the difference between READ COMMITTED and REPEATABLE READ comes down to when the ReadView is created: the former creates it as each statement starts and drops it when the statement ends, while the latter creates it when the transaction starts and drops it at commit. That alone yields the two behaviors.

Note that even at READ COMMITTED, if a new transaction commits while a statement is executing, the SELECT still will not see it (an edge case).

For the ReadView's storage layout and deeper study, see the slides above; I will not repeat them.

The talk also covered rollback segments, rollback methods, MySQL's two-phase commit, and some B+Tree operations; plain text feels a bit pale for those, and in any case the blogs and slides of Jeremy Cole and He Dengcheng are far more detailed and elegant. Interested readers should go take a look.



Web developing methodology – 3 essential steps

In one of my SCRM cases, a global client needed us to prove our ability to develop web products and design databases, especially in a cloud (Windows Azure) environment. Two big questions: first, how do we develop complex web products; second, what is our experience with Azure SQL?

From my point of view, this “ability” should be distilled from methodology, in three parts:

Abstraction

•Data structure -> DB table schema flexibility & adaptability,  Key-value, Key-List …
•Distributed Computing Algorithm:  Map-Reduce
•Architecture -> SOA

Isolation

•For loose coupling, modules are isolated
•For HA, front tier/middle layer/back end are isolated
•For concurrency, with async. queues, consumer/producer OR leader/follower mode (see the sketch after this list)
•For security, roles/privileges are isolated
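For the concurrency point above, a minimal consumer/producer sketch with an async queue (names are illustrative):

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object WorkQueue {
  val queue = new LinkedBlockingQueue[String](1024)  // bounded: producers block when full
  private val pool = Executors.newFixedThreadPool(4)

  def submit(task: String): Unit = queue.put(task)   // front tier enqueues and returns

  (1 to 4).foreach { _ =>                            // consumers drain independently
    pool.submit(new Runnable {
      def run(): Unit = while (true) handle(queue.take())
    })
  }
  private def handle(task: String): Unit = println("processing " + task)
}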

Integration

•OS, OpenAPI -> Unified WebService / RESTful interface,  Weixin API + Baidu + Sina …
•Data Source -> Multi-dimensional Data Modeling,
•Extendibility -> Schema Free, Column-based  DBMS

Question 2: as for “SQL Azure”, there are two types of data service in Azure (or three, counting Federation):

• Storage  (REST)
–Tables, Blobs, Queues
• SQL Database / SQL Azure (T-SQL)
–High availability: 99.9% uptime in the SLA (43 min/month)
–Scalability (scale-up/scale-out): bandwidth, cluster, load balancing (F5)
–Manageability: no hardware management, provisioning, or UI chores needed…
–Relational Model:
–Development Model: ADO.NET, ODBC, most T-SQL, Reporting Service (without Service Broker)
• SQL Azure Federation

In terms of “limitations and constraints”, here is my analysis from public documents and other channels.

• A clustered index  (doesn't matter)
• Does not support SQL Server Agent   (doesn't matter; could be replaced by other tools)
• Does not support distributed transactions (application transactions)    (trouble; could be handled at the app layer using STM and the like)
• ALTER DATABASE Transact-SQL statement is not supported     (somewhat troublesome; you must live with fixed database settings)
• Database count and size limits      (trouble in a big system)
• Connection constraints    (trouble in a big system)
……


Count A Billion Distinct Objects Using Limited Memory

On highscalability.com, the article “Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5KB Of Memory” describes three algorithms for cardinality counting and processing (AND/OR, UNIQUE… operations). I was interested in it and had faced this situation myself, so I review the three solutions here.

GOAL: “Their servers receive well over 100 billion events per month. How they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures.”

Solution 0: Obviously, all the elements could be represented by their IDs: use a HashSet to store the IDs as keys, and hashSet.size() is the answer. But 1 billion IDs (16 bytes each) consume about 16GB of memory, without even counting the overhead of the data structure itself.

Solution 1:

to be continued….
https://github.com/clearspring/stream-lib
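Solution 1 in the original article is probabilistic counting in the HyperLogLog family, which the stream-lib project above implements in production form. For flavor, a minimal HyperLogLog-style sketch, assuming MurmurHash3 as the hash:

import scala.util.hashing.MurmurHash3

class HyperLogLog(b: Int) {                       // 2^b registers, e.g. b = 11 => ~2KB
  private val m = 1 << b
  private val registers = new Array[Int](m)
  private val alpha = 0.7213 / (1 + 1.079 / m)    // bias correction, valid for m >= 128

  def offer(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - b)                      // first b bits choose a register
    val rank = Integer.numberOfLeadingZeros(h << b) + 1  // rank of first 1-bit in the rest
    if (rank > registers(idx)) registers(idx) = rank
  }

  def cardinality: Long = {
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    val est = if (raw <= 2.5 * m && zeros > 0)
      m * math.log(m.toDouble / zeros)            // linear counting for small cardinalities
    else raw
    math.round(est)
  }
}

With m registers the standard error is about 1.04/sqrt(m), so a few kilobytes suffice for billion-scale sets.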

This site is built on the Kingsoft Cloud platform.