yet.org

Ceph Workshop

I attended the first Ceph Workshop in Amsterdam on Nov 2, 2012, where the Ceph gurus Sage Weil, Wido den Hollander, Ross Turk and Gregory Farnum presented this outstanding distributed storage technology. I’m sharing my notes with you.

Keynote by Wido den Hollander, 42on (@widodh)

Partners of this first-ever workshop day:

  • Inktank - the founders
    • providing services & support
    • contributed 90% of Ceph’s code
  • 42on
    • spin-off of a European company, now focused on Ceph
  • Dell
    • provided the gear for the demo facilities

Why Ceph?

  • The world of storage is changing every day; we now store petabytes of data
  • Deciding how to store it is a lot of work
    • Do we use proprietary systems?
    • How do we scale?
    • Does that fit in our budget?
  • Use Ceph
    • Fully distributed without any SPoF
    • Scales with you into the Petabytes
    • Runs on commodity hardware

Keynote by Sage Weil (@liewegas, sage@inktank.com)

  • First ever Ceph day
  • Unified Distributed System
  • Why you should care
    • diverse storage needs
      • object storage
      • block devices (for VMs) with snapshots, cloning
      • shared file system with POSIX, coherent caches
      • structured data: files, block devices, or objects?
    • scale
      • terabytes, petabytes, exabytes
      • heterogeneous hardware
    • time
      • ease of admin
      • no manual data migration, load balancing
      • painless scaling
        • expansion and contraction
        • seamless integration
    • cost
      • linear function of size and performance
      • incremental expansion
        • no forklift upgrades
      • no vendor lock-in
        • choice of hardware and software
      • open

What it is, and what it’s for

  • It’s a storage system that stores:
    • Objects
      • native API
      • or RESTful API
    • Blocks
      • thin provisioning
      • snapshots
      • layering/cloning
    • Files
      • strong consistency, snapshots
  • It’s a distributed storage system
    • that scales from tens to 10,000s of machines
    • terabytes to exabytes
  • Which is fault tolerant - no SPoF

How it works - architecture - Ceph object model

  • Components
    • librados - client library
    • radosgw - S3- and Swift-compatible gateway to the Ceph cluster
    • rbd - RADOS Block Device, useful for virtualization

Why start with objects

  • more useful than (disk) blocks
    • names in a single flat namespace
    • variable size
    • simple API
  • more scalable than files
    • no hard-to-distribute hierarchy
    • update semantics do not span objects
    • workload is trivially parallel
  • Ceph is hardware agnostic but a big fan of Btrfs (needs better stability); also works with XFS and ext4
  • Monitors aren’t involved in the data path
  • OSD - Object Storage Daemon

Data Distribution

  • all objects are replicated N times (2 or 3, for example)
  • objects are auto placed, balanced, migrated in a dynamic cluster
  • must consider physical infrastructure
    • ceph-osds on hosts, in racks, in rows, in data center
  • 3 approaches
    • pick a spot; remember where you put it
    • pick a spot; write down where you put it (most systems do this, but scalability is a problem and it’s slow)
    • calculate where to put it (a hash-based function)
      • basic idea: determine a location based on the current cluster state
      • consistent hashing (Riak); Ceph uses CRUSH

CRUSH

  • pseudo random placement algorithm
  • fast calculation, no lookup
  • repeatable, deterministic
  • Ensures even distribution
  • Stable mapping
    • limiting data migration
  • Rule-based configuration
    • specifiable replication
    • infrastructure-topology aware
    • allows weighting
  • comes with a language to specify placement constraints (see the sketch below)
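
As an illustration, here is roughly what a rule looks like in a decompiled CRUSH map; the names (“data”, “default”) are the stock ones from this era, so treat it as a sketch rather than your exact map:

    rule data {
        ruleset 0                             # referenced by pools via their crush_ruleset
        type replicated                       # replicated pool, N copies
        min_size 1
        max_size 10
        step take default                     # start at the root of the hierarchy
        step chooseleaf firstn 0 type host    # pick N distinct hosts, one OSD under each
        step emit
    }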

Algorithm details

  1. pg = Hash(object name) - the object name hashes to a placement group (PG)
  2. CRUSH(pg, cluster map, rule) - CRUSH maps the PG onto an ordered list of OSDs (see the example below)
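
Both steps are visible with the ceph CLI: ceph osd map computes, without any lookup, where a given object would land (pool and object names here are hypothetical):

    ceph osd map rbd myobject
    # prints the PG that the object name hashes to and the OSDs that CRUSH maps it onto,
    # along the lines of: pool 'rbd' (2) object 'myobject' -> pg 2.xxxx -> up [1,0] acting [1,0]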

Rados - Ceph Object Store

  • Monitors publish osd map (state of the cluster) - ceph-mon
    • keep track of which nodes are operational at any given time
  • OSDs - ceph-osd
    • perform actual data storage/retrieval operations
    • safely replicate and store objects
    • migrate data as the cluster changes over time
    • coordinate based on shared view of reality
  • Decentralized, distributed approach allows
    • massive scales
    • illusion of a single copy with consistent behavior (self healing)

Librados

  • native protocol, highly parallel and scalable
  • C, C++, Python, PHP, Java
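
The rados CLI that ships with Ceph is built on librados and is handy for poking at a cluster; a quick sketch, with a hypothetical pool name:

    rados -p data put myobject ./localfile    # store a file as an object
    rados -p data get myobject ./copy         # read it back
    rados -p data ls                          # list objects in the pool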

Radosgw

  • separate daemon that runs on top of RADOS and provides a REST-based object storage API
  • compatible with S3 and Swift
  • uses librados to talk to the cluster
  • RESTful at the front, RADOS at the back end
  • supports buckets and accounting
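
Before clients can talk S3 to the gateway you create a user with radosgw-admin, which returns the access and secret keys; the uid and name below are made up:

    radosgw-admin user create --uid=johndoe --display-name="John Doe"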

RBD - RADOS Block Device

  • Virtual disk abstraction
  • with KVM: native support through librbd; Ubuntu Precise ships with this
  • uses librados
  • enables exciting things
    • live migration
  • kernel driver krbd
  • Storage of virtual disks in RADOS
  • decouples VMs from the hosts they run on
  • Images are striped across the cluster
  • snapshots
  • support: CloudStack, OpenStack, …
  • logical clones share identical blocks
  • the 1st use case is cloud computing, but it could also replace a legacy SAN
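
A minimal RBD session with the CLI, assuming a pool named rbd and hypothetical image names (--size is in MB):

    rbd create myimage --size 10240       # 10 GB thin-provisioned image
    rbd snap create rbd/myimage@snap1     # point-in-time snapshot
    rbd map rbd/myimage                   # attach it via the krbd kernel driver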

CEPH FS - Nearly Awesome

  • the most exciting piece of the stack right now
  • HA POSIX filesystem
  • a set of metadata servers (MDS) to check access rights, …
  • one path for metadata and one path for data, so no bottleneck
  • Metadata Server
    • Manages metadata for a POSIX-compliant shared filesystems
    • Stored metadata in RADOS
    • Does not serve file data to clients
    • Only required for shared filesystems
  • Dynamic Subtree Partitioning
    • filesystem workloads show a high level of locality: accessing one file means we are likely to access others nearby in the tree, so this is taken into account
    • entirely adaptive: the partitioning shifts depending on the workload
  • Recursive accounting
    • for every directory it keeps lots of information
      • keeps track of file and directory counts
      • keeps track of modification times
      • virtual xattrs present full stats
        • ls -alSh | head gives you correct information for free because it’s stored in the metadata - most useful thing ever
    • efficient
  • Snapshots
    • volume or sub-volume snapshots are unusable at petabyte scale
      • so Ceph snapshots arbitrary subdirectories instead
    • simple interface
      • hidden .snap directory
      • no special tools needed: you can access it with classic POSIX tools (ls, …) - see the example below
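
The .snap interface in action; directory and snapshot names are hypothetical:

    cd /mnt/ceph/projects
    mkdir .snap/before-cleanup    # take a snapshot of this subtree
    ls .snap                      # list existing snapshots
    rmdir .snap/before-cleanup    # drop the snapshot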

Multiple protocols, implementation

  • Linux kernel client
    • mount -t ceph 1.2.3.4:/ /mnt
    • export via NFS, CIFS
  • ceph-fuse (see the mount example after this list)
  • libcephfs.so
    • your app
    • samba (CIFS)
    • Userland NFS server (Ganesha)
    • map/reduce (Hadoop)
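
For the FUSE client, mounting looks like this (monitor address hypothetical):

    ceph-fuse -m 1.2.3.4:6789 /mnt/ceph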

Ceph Status

  • Ceph FS is feature complete but still missing QA effort
  • argonaut stable release v0.48
    • rados, RBD, radosgw
  • bobtail stable release v0.55 in a month or so
    • RBD cloning
    • improved performance, scaling, failure behavior
    • radosgw API, performance improvement
    • freeze in 1 week, release in 4 weeks

Roadmap

  • Filesystem
    • The engineering team here is about to do a bit of a pivot and refocus on the file system now that the object store and RBD are in pretty good shape.
    • That will mean both core fs/mds stability
    • and features as well as integration efforts, NFS (Ganesha), CIFS (Samba), Hadoop
  • RBD
    • Xen integration with iSCSI in the works
  • Radosgw
    • Keystone integration
  • RADOS
    • geo-replication
    • PG split

Why inktank works on this project

  • Current limited options for scalable open source storage
  • Proprietary solutions
    • expensive
    • don’t scale (well or out)
    • marry HW and software - an annoying situation
  • users hungry for alternatives
    • scalability, …

Two fields

  • green: cloud, big data
    • incumbents don’t have a viable solution
    • most players can’t afford to build their own
    • strong demand for open source solutions
  • brown: traditional SAN, NAS; Enterprise
    • incumbents struggle to scale out
    • can’t compete on price with open solutions

Licensing

  • promote adoption - free to everyone
  • enable community development
  • prevent Ceph from becoming proprietary
  • allow organic commercialization

Ceph License

  • LGPL2 Copyleft
    • free distribution
    • allow derivative work
    • changes you distribute/sell must be shared
  • ok to link to proprietary code
    • allow proprietary products to include and build on Ceph
    • doesn’t allow proprietary derivatives of Ceph

Fragmented copyright

  • We do not require copyright assignment from contributors, which is great because:
    • no single person or entity owns all of Ceph
    • no single entity can make Ceph proprietary
  • strong community

Why it’s so important

  • Ceph is an ingredient
    • We need to play nice in a larger ecosystem
    • community will be key to Ceph’s success
  • Truly open source solutions are disruptive
    • Open is a competitive advantage
      • frictionless integration with projects, platforms, tools
      • freedom to innovate on protocols
      • leverage community testing, development resources
      • open collaboration is efficient way to build technology

A bit of history

  • Ceph created at UC Santa Cruz (2004–2007)
  • supported by DreamHost (2008–2011)
  • Inktank (2012)
  • growing user and development community
    • Linux distros, users, cloud stacks, SIs, OEMs

Q&A with Sage Weil

  • How does it compare to Gluster?
    • Gluster is a much simpler system - it works, but with a simpler architecture chosen to deliver it quickly
    • Ceph has a much more ambitious architecture; with metadata servers it delivers many more features
    • with Gluster you add caching but sacrifice consistency; you don’t have that problem with Ceph
    • but Gluster is maybe more mature
    • Gluster doesn’t handle fault tolerance itself, so the hardware must, which is much more expensive; Ceph handles that at the software layer
  • Not recommended to use ext4?
    • Ceph relies on extended attributes, and those can get large with RADOSGW; ext4 has a 4K xattr limit
    • large attributes now spill into leveldb in the object store, so today it’s not a problem
    • it’s now OK to use ext4, but XFS is more tested by Inktank; Btrfs is for labs, not yet for production
  • Feedback on commercial development due to LGPL2?
    • business people complain, but no developers do
    • not a good idea to maintain an independent fork (example: Postgres)
  • Are there production systems that currently use multiple Ceph interfaces at the same time?
    • Keystone will make it possible to use multiple interfaces, but most current users only use one
  • Storage Tiering
    • Ceph is agnostic - the value of Ceph is that it aggregates hardware together
    • the next version will improve IOs when used in a Fusion-io environment
    • create different pools: flash, SATA
    • or put it underneath each OSD
    • Ceph has a journal device: a separate SSD, an NVRAM card, …
    • bcache, flashcache, …; Btrfs is growing support for tiered storage
    • most of this is handled outside of Ceph itself
  • Infiniband support
    • HPC world is really interested in that, so there is work in progress there
  • Quota
    • Swift doesn’t have any
    • Ceph has accounting implemented today: we keep track of everything, but there is no enforcement
    • quotas will be implemented per subtree, not per UID
  • Latency
    • OK across Europe
    • Ceph on dedicated 10G links between multiple datacenters
    • the Ceph object store gives you strong consistency, so each write goes to all replicas and waits for ACKs - beware of high latency
    • roadmap: async replication; today it’s sync
  • Ceph library deleted from Debian src 5 weeks ago
    • not aware of that

Cloud Integration by Wido den Hollander

  • a zoom in on the block device and what it does
  • more or less a stripe across RADOS objects
  • 4 MB stripes over RADOS objects; 4 MB seems to be a fine size
  • Sparse allocation (TRIM/discard support)
    • starts writing to an object only when the user writes to it
    • no creation delay at all
    • VirtIO lacks the necessary functions
    • filesystems like ext4 or Btrfs tell the block device which blocks can be discarded
    • only works with Qemu’s SCSI driver, which supports TRIM (see the sketch below)
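
A hedged sketch of wiring this up with a recent Qemu: the RBD-backed disk is attached through virtio-scsi with discard enabled, and the guest mounts its filesystem with the discard option (image and device names are made up):

    qemu-system-x86_64 ... \
      -drive file=rbd:rbd/vm-disk,if=none,id=drive0,discard=on \
      -device virtio-scsi-pci \
      -device scsi-hd,drive=drive0

    # inside the guest:
    mount -o discard /dev/sda1 /data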

Snapshotting

  • commodity
  • CoW snapshots

Layering/Cloning (same thing)

  • the VM doesn’t know about the layering
  • Disney doesn’t use RBD yet, but they never log into a VM: they never fix anything and prefer to re-create it; with layering it takes a couple of minutes to deploy hundreds of machines (see the sketch below)
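
Layering as it appears in the CLI, from bobtail onwards since that is where cloning lands; the golden-image names are hypothetical:

    rbd snap create rbd/golden@base        # snapshot the master image
    rbd snap protect rbd/golden@base       # protect it so it can't be deleted
    rbd clone rbd/golden@base rbd/vm-42    # copy-on-write clone for a new VM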

RBD in the Cloud

  • highly parallel performance due to object striping
  • discard support to free blocks discarded by the VM
  • snapshotting for rollback points in case of problems inside a VM
  • Layering for easy and quick deployment
    • save space
  • Replication will have an impact on performance
  • Use Journaling for your OSDs
  • Production ready?
    • Large scale deployment out there
    • Big OpenStack clusters backed by Ceph
    • CloudStack production deployment already

Proxmox Integration

  • small Debian-based VM appliance
  • uses Qemu, does not use libvirt
  • RBD integrated in 2.2, not in the GUI yet
  • snapshotting
  • no layering yet, no GUI, only CLI
  • RADOS itself is synchronous; caching is handled at the Ceph level, no more OS-level caching
  • the RBD cache runs in Qemu instead
  • pretty simple: Proxmox only runs KVM, with RBD for the disk as a command-line option
  • /etc/pve/storage.cfg contains
    • the Ceph configuration information (see the sketch below)
  • rbd ls
  • rbd info vm-101-disk-1
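
A sketch of what the RBD entry in /etc/pve/storage.cfg looks like; the storage ID and monitor address are made up, so check the Proxmox docs for the exact keys in your version:

    rbd: my-ceph-storage
        monhost 1.2.3.4
        pool rbd
        username admin
        content images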

Apache Cloudstack integration

  • it took some work; started 6 months ago; CloudStack uses libvirt
  • had to do some challenging work in libvirt to support RBD
  • No Snapshotting
  • No layering
  • No TRIM/Discard
  • still needs NFS for SystemVMs
    • could be 20G, a small NFS share
  • Primary storage: where VM runs
  • Secondary storage
    • Enter Zone, Pod, Rados Monitor …

RBD in OpenStack

  • both for boot and data disks
  • support for Glance too (see the sketch below)
  • DreamHost is deploying DreamCompute on top of it soon
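
For Glance, pointing the image store at RBD is a few lines in glance-api.conf; the pool and user names below are conventional, not mandated:

    default_store = rbd
    rbd_store_user = glance
    rbd_store_pool = images
    rbd_store_ceph_conf = /etc/ceph/ceph.conf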

Example

  • say 100 IOPS per disk: roughly 100,000 IOPS with 1,000 disks; but VM IOs are random, and be aware that around 70% of them are writes, 30% reads
  • blocks are files underneath the OSDs
  • disks of at most 1 TB are better
  • low-end disks aren’t that good because they constantly park their heads and then die more often
  • 4 disks per 1U is a good consolidation ratio

Case studies (RBD and S3)

DreamHost - Ross Turk, Community Manager for Ceph (@rossturk, ross@inktank.com) - VP Community at Inktank

  • 15 years creating and deploying services
  • Sage is a co-founder of DreamHost
  • Dreamhost storage selection criteria
    • deploy at large scale
    • manage in a cost-effective way
    • must be brought to market quickly
    • must enable hybrid deployments
  • DreamObjects - in public beta
    • S3 competitive offering for web storage, backups, digital media
    • priced at 7 cents per GB
    • only requires the RADOS Gateway + RADOS = few moving parts
    • uses 4 x HAProxy as load balancers
    • each machine has 12 drives for 36 TB of storage
    • Storage Nodes
      • Dell PowerEdge R515
      • 6-core AMD, 32 GB RAM
      • 2 x 300 GB SAS drives for the OS
      • 12 x 3 TB SATA drives
      • 2 x 10 GbE, 1 x 1 GbE IPMI
    • Management Node
    • 36 TB per node x 90 nodes ≈ 3 PB of raw capacity, which gives 1 PB usable (3x replication)
    • deployment Opscode Chef
      • Reduce operations overhead
      • maintain efficiency to keep costs down
    • they built a user-interface panel to manage all this from the user’s point of view
      • a substantial amount of effort
    • a usage panel for metering/reporting traffic, integrated into the DreamHost panel
    • applications can switch by just changing the endpoint: 100% Swift and S3 compatible

Best Practices / Cluster design by Gregory Farnum (developer at Inktank, one of the first three it seems)

  • What’s the best way to configure a cluster?
    • lots of ways
  • consideration
    • required size
    • required storage performance
    • cost considerations
    • failure expectations
  • Components
    • Monitors - handle failure reports
    • OSDs - do the data storage
    • Metadata servers
    • RadosGW for S3
    • they can all be scaled independently

Monitors

  • A mounted hard drive
    • yes this can be the OS drive
  • a fixed IP address (it uniquely defines the monitor and shouldn’t change)
  • How many monitors?
    • one is the leader and tells the others: write down this state
    • the others agree and write down what they’ve been told
    • it’s a voting process, dictatorial
      • 50% + 1 of the votes gives the leader his role
      • if he is the only remaining node, that’s not good
      • so 3 is the recommended number of monitors even in big environments (with 3, any 2 still form a majority)
  • Where do we put them?
    • all over your infrastructure: across electrical circuits
    • as far away from each other as possible
  • Basic Rules
    • only use an odd number of monitors (>1)
    • as few monitors as your resiliency requirements allow
    • far apart
    • if you have multiple datacenters, deploy one monitor in each one; latency isn’t a big deal for them (see the sketch below)
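
A minimal sketch of three monitors in ceph.conf, era-style; hostnames and addresses are hypothetical, ideally one per rack or circuit:

    [mon.a]
        host = mon-host-1
        mon addr = 10.0.0.1:6789
    [mon.b]
        host = mon-host-2
        mon addr = 10.0.1.1:6789
    [mon.c]
        host = mon-host-3
        mon addr = 10.0.2.1:6789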

OSD

  • require more than monitors do
    • a filesystem where all OSD data is stored
      • XFS, Btrfs, ext4 (Btrfs should become great someday)
      • RAID? Probably not everywhere - why you shouldn’t use RAID, and when you might anyway
        • data is already replicated
        • random IOs on RAID5 are slow
        • mirroring should be done at the RADOS level
        • But you might consider it if
          • too many disks to run a daemon for each disk
          • Ceph for Hadoop processing without replication
          • if you want to be different from everybody else
            • could be great for some workload
    • Network connectivity
      • TCP/IP
      • separate OSD traffic from everything else if you like (not mandatory)
      • speed appropriate for your nodes’ disks and cluster speed requirements (10 GbE is great, but 1 GbE could work)
    • 1 GHz of CPU and 1 GB of RAM per daemon (rule of thumb)

OSD Journaling Decisions

  • for consistency
  • don’t use a RAM disk for it: the OSD considers the journal to be reliable, and if you lose the journal you lose the whole OSD (with Btrfs it’s not that bad)
  • hosting VM: journal should be fast
  • Options
    • SSD (often split out between multiple daemons)
    • separate partition or RAID group on data store disk
    • File on data store filesystem
  • How big?
    • the 10 GB default is a good number per OSD
    • every write goes to the journal first (a double write) - see the sketch below
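
In ceph.conf this boils down to two settings; the device path is hypothetical (e.g. a partition carved out of a shared SSD):

    [osd]
        osd journal size = 10240    # in MB, i.e. the ~10 GB mentioned above
    [osd.0]
        osd journal = /dev/sdg1     # dedicated journal partition for this daemon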

CRUSH - data placement algorithm

  • Model your failure domains
    • default list: OSD, host, rack, row, room, datacenter, “root”
    • OSD, host, and rack are enough in most datacenters
  • if you want, you can set up OSD tiers

Node configuration

  • lots of ways to configure a node
  • the highest rack density would be great

Metadata Server - MDS (cache of information stored in RADOS)

  • More RAM -> More filesystem hierarchy cached
  • Network connectivity
  • A lot of RAM
  • How many MDSes? Only one (active) today
    • scale your MDS cluster based on active usage, not total data or tree size
  • no stateful information on the node: if it dies, just restart it somewhere else, that’s fine (a standby node can follow the active node automatically and pre-populate its cache based on the journal)

RadosGW

  • Network connectivity
    • Cluster aware so load balancing would be great for them
  • RAM
  • CPU Power

mkcephfs

  • old option
  • does a better job at auto-setting some values
  • but doesn’t support adding or removing nodes on its own
  • so it’s not a good way to maintain production systems

ceph-deploy

  • soon to be recommended, but currently new and shiny (see the sketch below)
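
The workflow it aims for looks roughly like this; hostnames and the disk are hypothetical, and since the tool was brand new at the time, treat the exact subcommands as indicative:

    ceph-deploy new mon-host-1
    ceph-deploy install mon-host-1 osd-host-1
    ceph-deploy mon create mon-host-1
    ceph-deploy osd create osd-host-1:/dev/sdb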

Chef (more common option)

  • cookbooks available
  • can add new nodes
  • DreamHost manages 800 daemons with Chef

Crowbar

  • barclamps available for Ceph
  • SUSE Cloud also uses Crowbar

Juju

  • service orchestration system
  • some charms available for Ceph, see the Charm Store (and the sketch below)
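
Deploying with the charms is pleasantly short; a sketch, assuming the ceph and ceph-osd charms from the store:

    juju deploy -n 3 ceph              # 3 units running monitors
    juju deploy -n 3 ceph-osd          # 3 OSD machines
    juju add-relation ceph ceph-osd    # wire them together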

Manual deployment

  • not really difficult
  • you need a monitor address
  • a single afternoon’s job

Q&A

  • RADOS doesn’t do asynchronous writes
  • a big SSD used for journaling should be partitioned with a small 16 GB partition; this will increase the life of the SSD compared to using the whole 200 GB

Panel: Deploying Large Ceph Clusters

Participants

  • Canonical engineer
    • scaling out is a different set of problems
    • we now have tools to scale much more easily
    • but that introduces additional problems
    • with modern tools it becomes easier to think about scalability early in the development process
    • we don’t want to panic about scale issues while in production
  • Greg Farnum core developer @ Inktank
    • I never worked at scale up
  • Senior Product Manager @ SUSE works on HPC and High Availability
    • scale up or scale out doesn’t matter; customers use both
    • enterprise customers don’t care too much about the details of scalability; they ask for a toolset/product that makes it easy

What SUSE, Ceph, Canonical are doing to allow customers to deploy large datacenters

  • SUSE: QA, fixing issues, investing in open source
  • Ceph: academic for 6 years
    • spent a lot of time instrumenting and exposing information to users
    • how to be a good citizen amongst deployment tools
  • Ubuntu: Juju is a service orchestration tool designed to deploy services
    • fully open source
    • designed to deploy to the cloud
    • deploy services
    • manages inter-relationships between services
    • architect a service with all its parts, with scaling out made easy
    • the equivalent of apt-get for services in the cloud
    • Juju charms for Ceph are already available
    • a 10-fold reduction in the complexity of deploying Ceph

Good/bad decisions when deploying a 10-node Ceph cluster

  • SUSE: don’t try too much at the same time
  • Ceph: keep the objective in mind - how much should it grow?
    • management tools differ based on cluster size
    • make sure that data is placed based on its size/availability; there are lots of configuration options
  • Canonical: start small, with a PoC environment
    • keep in mind how big it will grow
    • optimize each step of the way
    • learn which tools are great at each scale

Are tools converging or diverging?

  • Canonical: a lot of them are competing
    • some tools are very good at a particular job
    • bending them to do something else isn’t the right approach
    • an example is Puppet, which is a configuration manager, and a great one
      • Puppet Enterprise now includes provisioning, competing with Ubuntu
      • it’s for everyone to decide which is the best one for each job
      • for Canonical, Puppet is being bent to do more than it was designed to do
  • Ceph: tools are really different, no convergence on the horizon
    • Chef will take a node to the state you want - multiple weeks to learn
    • Juju - a 1-day learning process, but 1 purpose per VM
  • SUSE: we are all living on Open Source World
    • we know that a lot of projects don’t take the re-use approach but rebuild everything
    • so we will see different tools coming for the same job

Rebuilding instead of solving issues… how has this changed the upgrade process and other things?

  • Canonical: new ways of rebuilding should reduce the time it takes to solve issues
    • like in mechanical engineering, where they do component replacement
    • make it simpler and easier to redeploy
  • SUSE: lifecycle is a pain
    • Ceph and other distributed scale-out architectures should be able to update a node while the complete system keeps running
    • sysadmins fear upgrades, but in a situation where we can roll out upgrades node by node it could be safer

What problems are we still trying to solve?

  • SUSE: Biggest issue is still scaling
  • Ceph: what ubiquitous and cheap virtualization will enable us to do
    • computing will be used in a different way
    • the new Dropbox or Ubuntu One are just the beginning
    • shared hosting will be terminated; VMs will prevail
  • Canonical: hyper-scale, much denser environments - a serious challenge for the management tools
    • we will have to throw away the old management tools

Ceph Geo replication architecture (Sage Weil)

  • the most requested feature!
  • sort of an interesting question - it’s hard to know what people actually need
  • Let’s start with what Ceph does
    • node storage locations are determined by CRUSH
    • 3-way replication: the algorithm decides where to store content
    • we don’t control where each object is stored
    • in a scenario with 2 datacenters
      • set up 2 different pools with a CRUSH rule to store content locally
      • main limitation: writes are still synchronous, which could be slow
      • so US East to US West is not good enough here
      • multiple datacenters with millisecond latency, as in Europe, would be fine like this
      • the latency of spinning disks needs to be compared with the speed-of-light latency between the 2 datacenters
    • Mirroring feature available in a lot of products today
      • a slave lags behind by some amount of time
      • great for failover scenarios
      • the 1st use case targeted by the Ceph geo-replication effort
      • they plan to do this per pool
      • a secondary pool in some other location will be configured to mirror the first pool
      • some projects call rsync in the background, but that doesn’t provide consistency
      • the trick is to mirror in parallel and stop at a consistent state on the other site
      • the implicit ordering in the current architecture, like a point in time, could be used to achieve this goal
      • cross-replication will be possible
      • 2-3 months away; the next release will contain a prototype
  • Other type of replication
    • active/active scenario
    • the only way to do that is to give up the current replication mechanism
    • it would have to be a totally different mechanism, further into the future
    • they don’t want to re-implement an illusion of consistency like in Riak and Cassandra

Librados

  • great for building new services on top of RADOS
  • Pools
    • 1s to 100s
    • independent namespaces or object collections
    • per-pool replication level and placement policy (see the sketch below)
  • objects
    • bazillions
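
Creating a pool and setting its replication level from the CLI; the name and PG count are illustrative:

    ceph osd pool create mypool 128    # 128 placement groups
    ceph osd pool set mypool size 3    # keep 3 replicas of every object
    rados lspools                      # list all pools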

atomic transactions

  • client operations are sent to the OSD cluster
    • operate on a single object
    • can contain a sequence of operations
      • truncate object
      • write new object data
      • set attribute
    • atomicity
      • all operations commit or do not commit atomically

key/value storage

  • Key/value pairs in an object
  • based on Google leveldb
    • based on BigTable SSTable design
  • exposed via key/value API
  • avoids a read/modify/write cycle for updating complex objects (see the sketch below)
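
The rados CLI exposes the omap key/value API directly; object, key and value below are made up:

    rados -p mypool setomapval myobject mykey myvalue
    rados -p mypool getomapval myobject mykey
    rados -p mypool listomapvals myobject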

Watch/notify

  • establish a stateful ‘watch’ on an object
  • send ‘notify’ messages to all watchers
  • use any object as a communication/synchronization channel
    • locking, distributed coordination (à la ZooKeeper), etc.
  • example of users
    • radosgw uses it for cache consistency
      • radosgw instances watch a single object (.rgx/notify)
      • locally cache bucket metadata
      • on bucket metadata changes (removal, ACL changes)
        • write changes to the relevant bucket
        • send a notify
      • on receipt of a notify
        • invalidate the relevant portion of the cache

RADOS Classes

  • dynamically loaded .so
    • /var/lib/rados-classes/*
    • implement new object “methods” using existing methods
    • part of I/O pipeline
    • simple internal API
  • you can add all kinds of new features based on RADOS storage
    • a grep class that reads entire objects
    • sha1 - generates a hash after reading the whole object
    • images - image manipulation: rotate, thumbnail
  • Idea: build a scalable mail backend on top of RADOS (a favorite of Sage’s)
    • most hosting SPs use NFS
    • an object for each message or attachment
  • Idea: distributed key/value table
    • aggregate many k/v into a big table
    • working prototype exists (thanks Eleanor!)
  • lua rados class
    • embeds a Lua interpreter
    • distributed computing

Caveats

  • librados.h is well documented, but the C++ API has fewer docs

iSCSI

  • TGT + librbd
    • user-level
  • kernel reference implementation
    • kernel RBD + bcache + lio/SCST
    • test, validate, share

Configuration Management

  • Puppet
  • CFEngine

odds and ends

  • ACL support in kernel
  • MDS load balancing heuristics - very complicated problem

Final Q&A

About Data Integrity

  • the TCP checksum is weak, so Ceph does additional checksumming; this keeps improving
  • OSDs scrub every 24h; every OSD now also does deep scrubbing to make sure the data is good
    • Btrfs does this by default, so there will be less work to do if we use that filesystem
    • kernel-level checksumming will improve things from top to bottom and back when reads happen, to make sure everything is fine
  • today, if a replica differs, we keep the first one it seems - I’ll have to check that

Upgrade process

  • apt-get upgrade
  • apt-get install
  • restart the node
  • rolling upgrades of one node at a time are being worked on and will be available in the next version

Crash: what should we do?

  • powering down/up at the same time -> fine
  • it’s harder when
    • the underlying FS corrupts itself
    • Ceph relies on it
    • we don’t really know when that happens -> problem
    • Btrfs is better because it does checksumming
  • Ceph is careful about knowing where the data was
    • Perl or Bash scripts can recover files from RBD objects in case of failure

Any mathematically computed object-lifetime calculation?

  • Not yet implemented
  • could be implemented at the RADOS level, why not