yet.org

Ceph Workshop

I attended the first Ceph Workshop in Amsterdam on Nov 2, 2012, where the Ceph gurus Sage Weil, Wido den Hollander, Ross Turk and Gregory Farnum presented this outstanding distributed storage technology. I’m sharing my notes with you.

Keynote by Wido den Hollander, 42on (@widodh)

Partners of this first-ever workshop day:

  • Inktank - the founders
    • providing services & support
    • contributed 90% of Ceph’s code
  • 42on
    • spin-off of a European company, now focused on Ceph
  • Dell
    • provided the gear for the demo facilities

Why Ceph?

  • The world of storage is changing every day; we now store petabytes of data
  • Deciding how to store it is a lot of work
    • Do we use proprietary systems?
    • How do we scale?
    • Does that fit in our budget?
  • Use Ceph
    • Fully distributed without any SPoF
    • Scales with you into the Petabytes
    • Runs on commodity hardware

Keynote by Sage Weil (@liewegas, sage@inktank.com)

  • First ever Ceph day
  • Unified Distributed System
  • Why you should care
    • diverse storage needs
      • object storage
      • block devices (for VMs) with snapshots, cloning
      • shared file system with POSIX, coherent caches
      • structured data: files, block devices, or objects?
    • scale
      • terabytes, petabytes, exabytes
      • heterogeneous hardware
    • time
      • ease of admin
      • no manual data migration, load balancing
      • painless scaling
        • expansion and contraction
        • seamless integration
    • cost
      • linear function of size and performance
      • incremental expansion
        • no forklift upgrades
      • no vendor lock-in
        • choice of hardware and software
      • open

What it is, and what it’s for

  • It’s a storage system that stores:
    • Objects
      • native API
      • or RESTful API
    • Blocks
      • thin provisioning
      • snapshots
      • layering/cloning
    • Files
      • strong consistency, snapshots
  • It’s a distributed storage system
    • that scales from tens to 10,000s of machines
    • terabytes to exabytes
  • Which is fault tolerant - no SPoF

How it works - architecture - Ceph object model

  • Components
    • librados - client library
    • radosgw - S3- and Swift-compatible gateway to the Ceph cluster
    • rbd - RADOS Block Device, useful for virtualization

Why start with objects

  • more useful than (disk) blocks
    • names in a single flat namespace
    • variable size
    • simple API
  • more scalable than files
    • no hard-to-distribute hierarchy
    • update semantics do not span objects
    • workload is trivially parallel
  • Ceph is hardware agnostic but a big fan of Btrfs (needs better stability); also works with XFS and ext4
  • Monitors aren’t involved in the data path
  • OSD - Object Storage Daemon

Data Distribution

  • all objects are replicated N times (2 or 3, for example)
  • objects are auto placed, balanced, migrated in a dynamic cluster
  • must consider physical infrastructure
    • ceph-osds on hosts, in racks, in rows, in data center
  • 3 approaches
    • pick a spot; remember where you put it
    • pick a spot; write down where you put it (most systems do this, but scalability is a problem and it’s slow)
    • calculate where to put it (a hash-based function)
      • basic idea: determine a location based on the current cluster state
      • consistent hashing (Riak); Ceph uses CRUSH

CRUSH

  • pseudo random placement algorithm
  • fast calculation, no lookup
  • repeatable, deterministic
  • Ensures even distribution
  • Stable mapping
    • limiting data migration
  • Rule-based configuration
    • specifiable replication
    • infrastructure-topology aware
    • allows weighting
  • comes with a language to specify placement constraints (see the sketch below)
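
As an illustration, here is roughly what a rule looks like in a decompiled CRUSH map; the names (“data”, “default”) are the stock ones from this era, so treat it as a sketch rather than your exact map:

    rule data {
        ruleset 0                             # referenced by pools via their crush_ruleset
        type replicated                       # replicated pool, N copies
        min_size 1
        max_size 10
        step take default                     # start at the root of the hierarchy
        step chooseleaf firstn 0 type host    # pick N distinct hosts, one OSD under each
        step emit
    }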

Algorithm details

  1. pg = Hash(object name) - the object name hashes to a placement group (PG)
  2. CRUSH(pg, cluster map, rule) - CRUSH maps the PG onto an ordered list of OSDs (see the example below)
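
Both steps are visible with the ceph CLI: ceph osd map computes, without any lookup, where a given object would land (pool and object names here are hypothetical):

    ceph osd map rbd myobject
    # prints the PG that the object name hashes to and the OSDs that CRUSH maps it onto,
    # along the lines of: pool 'rbd' (2) object 'myobject' -> pg 2.xxxx -> up [1,0] acting [1,0]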

Rados - Ceph Object Store

  • Monitors publish osd map (state of the cluster) - ceph-mon
    • keep track of which nodes are operational at any given time
  • OSDs - ceph-osd
    • perform actual data storage/retrieval operations
    • safely replicate and store objects
    • migrate data as the cluster changes over time
    • coordinate based on shared view of reality
  • Decentralized, distributed approach allows
    • massive scales
    • illusion of a single copy with consistent behavior (self healing)

Librados

  • native protocol, highly parallel and scalable
  • C, C++, Python, PHP, Java
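
The rados CLI that ships with Ceph is built on librados and is handy for poking at a cluster; a quick sketch, with a hypothetical pool name:

    rados -p data put myobject ./localfile    # store a file as an object
    rados -p data get myobject ./copy         # read it back
    rados -p data ls                          # list objects in the pool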

Radosgw

  • separate daemon that runs on top of RADOS and provides a REST-based object storage API
  • compatible with S3 and Swift
  • uses librados to talk to the cluster
  • RESTful at the front, RADOS at the back end
  • supports buckets and accounting
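
Before clients can talk S3 to the gateway you create a user with radosgw-admin, which returns the access and secret keys; the uid and name below are made up:

    radosgw-admin user create --uid=johndoe --display-name="John Doe"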

RBD - RADOS Block Device

  • Virtual disk abstraction
  • with KVM: native support through librbd; Ubuntu Precise ships with this
  • uses librados
  • enables exciting things
    • live migration
  • kernel driver krbd
  • Storage of virtual disks in RADOS
  • decouples VMs from the hosts they run on
  • Images are striped across the cluster
  • snapshots
  • support: CloudStack, OpenStack, …
  • logical clones share identical blocks
  • the 1st use case is cloud computing, but it could also replace a legacy SAN
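
A minimal RBD session with the CLI, assuming a pool named rbd and hypothetical image names (--size is in MB):

    rbd create myimage --size 10240       # 10 GB thin-provisioned image
    rbd snap create rbd/myimage@snap1     # point-in-time snapshot
    rbd map rbd/myimage                   # attach it via the krbd kernel driver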

CEPH FS - Nearly Awesome

  • the most exciting piece of the stack right now
  • HA POSIX filesystem
  • a set of metadata servers (MDS) to check access rights, …
  • one path for metadata and one path for data, so no bottleneck
  • Metadata Server
    • Manages metadata for a POSIX-compliant shared filesystems
    • Stored metadata in RADOS
    • Does not serve file data to clients
    • Only required for shared filesystems
  • Dynamic Subtree Partitioning
    • filesystem workloads show a high level of locality: accessing one file means we are likely to access others nearby in the tree, so this is taken into account
    • entirely adaptive: the partitioning shifts depending on the workload
  • Recursive accounting
    • for every directory it keeps lots of information
      • keeps track of file and directory counts
      • keeps track of modification times
      • virtual xattrs present full stats
        • ls -alSh | head gives you correct information for free because it’s stored in the metadata - most useful thing ever
    • efficient
  • Snapshots
    • volume or sub-volume snapshots are unusable at petabyte scale
      • so Ceph snapshots arbitrary subdirectories instead
    • simple interface
      • hidden .snap directory
      • no special tools needed: you can access it with classic POSIX tools (ls, …) - see the example below
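
The .snap interface in action; directory and snapshot names are hypothetical:

    cd /mnt/ceph/projects
    mkdir .snap/before-cleanup    # take a snapshot of this subtree
    ls .snap                      # list existing snapshots
    rmdir .snap/before-cleanup    # drop the snapshot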

Multiple protocols, implementation

  • Linux kernel client
    • mount -t ceph 1.2.3.4:/ /mnt
    • export via NFS, CIFS
  • ceph-fuse (see the mount example after this list)
  • libcephfs.so
    • your app
    • samba (CIFS)
    • Userland NFS server (Ganesha)
    • map/reduce (Hadoop)
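
For the FUSE client, mounting looks like this (monitor address hypothetical):

    ceph-fuse -m 1.2.3.4:6789 /mnt/ceph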

Ceph Status

  • Ceph FS is feature complete but still missing QA effort
  • argonaut stable release v0.48
    • rados, RBD, radosgw
  • bobtail stable release v0.55 in a month or so
    • RBD cloning
    • improved performance, scaling, failure behavior
    • radosgw API, performance improvement
    • freeze in 1 week, release in 4 weeks

Roadmap

  • Filesystem
    • The engineering team here is about to do a bit of a pivot and refocus on the file system now that the object store and RBD are in pretty good shape.
    • That will mean both core fs/mds stability
    • and features as well as integration efforts, NFS (Ganesha), CIFS (Samba), Hadoop
  • RBD
    • Xen integration with iSCSI in the works
  • Radosgw
    • Keystone integration
  • RADOS
    • geo-replication
    • PG split

Why inktank works on this project

  • Current limited options for scalable open source storage
  • Proprietary solutions
    • expensive
    • don’t scale (well or out)
    • marry HW and software - an annoying situation
  • users hungry for alternatives
    • scalability, …

Two fields

  • green: cloud, big data
    • incumbents don’t have a viable solution
    • most players can’t afford to build their own
    • strong demand for open source solutions
  • brown: traditional SAN, NAS; Enterprise
    • incumbents struggle to scale out
    • can’t compete on price with open solutions

Licensing

  • promote adoption - free to everyone
  • enable community development
  • prevent Ceph from becoming proprietary
  • allow organic commercialization

Ceph License

  • LGPL2 Copyleft
    • free distribution
    • allow derivative work
    • changes you distribute/sell must be shared
  • ok to link to proprietary code
    • allow proprietary products to include and build on Ceph
    • doesn’t allow proprietary derivatives of Ceph

Fragmented copyright

  • We do not require copyright assignment from contributors, which is great because:
    • no single person or entity owns all of Ceph
    • no single entity can make Ceph proprietary
  • strong community

Why it’s so important

  • Ceph is an ingredient
    • We need to play nice in a larger ecosystem
    • community will be key to Ceph’s success
  • Truly open source solutions are disruptive
    • Open is a competitive advantage
      • frictionless integration with projects, platforms, tools
      • freedom to innovate on protocols
      • leverage community testing, development resources
      • open collaboration is efficient way to build technology

A bit of history

  • Ceph created at UC Santa Cruz (2004–2007)
  • supported by DreamHost (2008–2011)
  • Inktank (2012)
  • growing user and development community
    • Linux distros, users, cloud stacks, SIs, OEMs

Q&A with Sage Weil

  • How does it compare to Gluster?
    • Gluster is a much simpler system - it works, but with a simpler architecture chosen to deliver it quickly
    • Ceph has a much more ambitious architecture; with metadata servers it delivers many more features
    • with Gluster you add caching but sacrifice consistency; you don’t have that problem with Ceph
    • but Gluster is maybe more mature
    • Gluster doesn’t handle fault tolerance itself, so the hardware must, which is much more expensive; Ceph handles that at the software layer
  • Not recommended to use ext4?
    • Ceph relies on extended attributes, and those can get large with RADOSGW; ext4 has a 4K xattr limit
    • large attributes now spill into leveldb in the object store, so today it’s not a problem
    • it’s now OK to use ext4, but XFS is more tested by Inktank; Btrfs is for labs, not yet for production
  • Feedback on commercial development due to LGPL2?
    • business people complain, but no developers do
    • not a good idea to maintain an independent fork (example: Postgres)
  • Are there production systems that currently use multiple Ceph interfaces at the same time?
    • Keystone will make it possible to use multiple interfaces, but most current users only use one
  • Storage Tiering
    • Ceph is agnostic - the value of Ceph is that it aggregates hardware together
    • the next version will improve IOs when used in a Fusion-io environment
    • create different pools: flash, SATA
    • or put it underneath each OSD
    • Ceph has a journal device: a separate SSD, an NVRAM card, …
    • bcache, flashcache, …; Btrfs is growing support for tiered storage
    • most of this is handled outside of Ceph itself
  • Infiniband support
    • HPC world is really interested in that, so there is work in progress there
  • Quota
    • Swift doesn’t have any
    • Ceph has accounting implemented today: we keep track of everything, but there is no enforcement
    • quotas will be implemented per subtree, not per UID
  • Latency
    • OK across Europe
    • Ceph on dedicated 10G links between multiple datacenters
    • the Ceph object store gives you strong consistency, so each write goes to all replicas and waits for ACKs - beware of high latency
    • roadmap: async replication; today it’s sync
  • Ceph library deleted from Debian src 5 weeks ago
    • not aware of that

Cloud Integration by Wido den Hollander

  • a zoom in on the block device and what it does
  • more or less a stripe across RADOS objects
  • 4 MB stripes over RADOS objects; 4 MB seems to be a fine size
  • Sparse allocation (TRIM/discard support)
    • starts writing to an object only when the user writes to it
    • no creation delay at all
    • VirtIO lacks the necessary functions
    • filesystems like ext4 or Btrfs tell the block device which blocks can be discarded
    • only works with Qemu’s SCSI driver, which supports TRIM (see the sketch below)
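
A hedged sketch of wiring this up with a recent Qemu: the RBD-backed disk is attached through virtio-scsi with discard enabled, and the guest mounts its filesystem with the discard option (image and device names are made up):

    qemu-system-x86_64 ... \
      -drive file=rbd:rbd/vm-disk,if=none,id=drive0,discard=on \
      -device virtio-scsi-pci \
      -device scsi-hd,drive=drive0

    # inside the guest:
    mount -o discard /dev/sda1 /data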

Snapshotting

  • commodity
  • CoW snapshots

Layering/Cloning (same thing)

  • the VM doesn’t know about the layering
  • Disney doesn’t use RBD yet, but they never log into a VM: they never fix anything and prefer to re-create it; with layering it takes a couple of minutes to deploy hundreds of machines (see the sketch below)
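
Layering as it appears in the CLI, from bobtail onwards since that is where cloning lands; the golden-image names are hypothetical:

    rbd snap create rbd/golden@base        # snapshot the master image
    rbd snap protect rbd/golden@base       # protect it so it can't be deleted
    rbd clone rbd/golden@base rbd/vm-42    # copy-on-write clone for a new VM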

RBD in the Cloud

  • highly parallel performance due to object striping
  • discard support to free blocks discarded by the VM
  • snapshotting for rollback points in case of problems inside a VM
  • Layering for easy and quick deployment
    • save space
  • Replication will have an impact on performance
  • Use Journaling for your OSDs
  • Production ready?
    • Large scale deployment out there
    • Big OpenStack clusters backed by Ceph
    • CloudStack production deployment already

Proxmox Integration

  • small Debian-based VM appliance
  • uses Qemu, does not use libvirt
  • RBD integrated in 2.2, not in the GUI yet
  • snapshotting
  • no layering yet, no GUI, only CLI
  • RADOS itself is synchronous; caching is handled at the Ceph level, no more OS-level caching
  • the RBD cache runs in Qemu instead
  • pretty simple: Proxmox only runs KVM, with RBD for the disk as a command-line option
  • /etc/pve/storage.cfg contains
    • the Ceph configuration information (see the sketch below)
  • rbd ls
  • rbd info vm-101-disk-1
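
A sketch of what the RBD entry in /etc/pve/storage.cfg looks like; the storage ID and monitor address are made up, so check the Proxmox docs for the exact keys in your version:

    rbd: my-ceph-storage
        monhost 1.2.3.4
        pool rbd
        username admin
        content images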

Apache Cloudstack integration

  • it took some work; started 6 months ago; CloudStack uses libvirt
  • had to do some challenging work in libvirt to support RBD
  • No Snapshotting
  • No layering
  • No TRIM/Discard
  • still needs NFS for SystemVMs
    • could be 20G, a small NFS share
  • Primary storage: where VM runs
  • Secondary storage
    • Enter Zone, Pod, Rados Monitor …

RBD in OpenStack

  • both for boot and data disks
  • support for Glance too (see the sketch below)
  • DreamHost is deploying DreamCompute on top of it soon
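
For Glance, pointing the image store at RBD is a few lines in glance-api.conf; the pool and user names below are conventional, not mandated:

    default_store = rbd
    rbd_store_user = glance
    rbd_store_pool = images
    rbd_store_ceph_conf = /etc/ceph/ceph.conf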

Example

  • say 100 IOPS per disk: roughly 100,000 IOPS with 1,000 disks; but VM IOs are random, and be aware that around 70% of them are writes, 30% reads
  • blocks are files underneath the OSDs
  • disks of at most 1 TB are better
  • low-end disks aren’t that good because they constantly park their heads and then die more often
  • 4 disks per 1U is a good consolidation ratio

Case studies (RBD and S3)

DreamHost - Ross Turk, Community Manager for Ceph (@rossturk, ross@inktank.com) - VP Community at Inktank

  • 15 years creating and deploying services
  • Sage is a co-founder of DreamHost
  • Dreamhost storage selection criteria
    • deploy at large scale
    • manage in a cost-effective way
    • must be brought to market quickly
    • must enable hybrid deployments
  • DreamObjects - in public beta
    • S3 competitive offering for web storage, backups, digital media
    • priced at 7 cents per GB
    • only requires the RADOS Gateway + RADOS = few moving parts
    • uses 4 x HAProxy as load balancers
    • each machine has 12 drives for 36 TB of storage
    • Storage Nodes
      • Dell PowerEdge R515
      • 6-core AMD, 32 GB RAM
      • 2 x 300 GB SAS drives for the OS
      • 12 x 3 TB SATA drives
      • 2 x 10 GbE, 1 x 1 GbE IPMI
    • Management Node
    • 36 TB per node x 90 nodes ≈ 3 PB of raw capacity, which gives 1 PB usable (3x replication)
    • deployment Opscode Chef
      • Reduce operations overhead
      • maintain efficiency to keep costs down
    • they built a user-interface panel to manage all this from the user’s point of view
      • a substantial amount of effort
    • a usage panel for metering/reporting traffic, integrated into the DreamHost panel
    • applications can switch by just changing the endpoint: 100% Swift and S3 compatible

Best Practices / Cluster design by Gregory Farnum (developer at Inktank, one of the first three it seems)

  • What’s the best way to configure a cluster?
    • lots of ways
  • consideration
    • required size
    • required storage performance
    • cost considerations
    • failure expectations
  • Components
    • Monitors - handle failure reports
    • OSDs - do the data storage
    • Metadata servers
    • RadosGW for S3
    • they can all be scaled independently

Monitors

  • A mounted hard drive
    • yes this can be the OS drive
  • a fixed IP address (it uniquely defines the monitor and shouldn’t change)
  • How many monitors?
    • one is the leader and tells the others: write down this state
    • the others agree and write down what they’ve been told
    • it’s a voting process, dictatorial
      • 50% + 1 of the votes gives the leader his role
      • if he is the only remaining node, that’s not good
      • so 3 is the recommended number of monitors even in big environments (with 3, any 2 still form a majority)
  • Where do we put them?
    • all over your infrastructure: across electrical circuits
    • as far away from each other as possible
  • Basic Rules
    • only use an odd number of monitors (>1)
    • as few monitors as your resiliency requirements allow
    • far apart
    • if you have multiple datacenters, deploy one monitor in each one; latency isn’t a big deal for them (see the sketch below)
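
A minimal sketch of three monitors in ceph.conf, era-style; hostnames and addresses are hypothetical, ideally one per rack or circuit:

    [mon.a]
        host = mon-host-1
        mon addr = 10.0.0.1:6789
    [mon.b]
        host = mon-host-2
        mon addr = 10.0.1.1:6789
    [mon.c]
        host = mon-host-3
        mon addr = 10.0.2.1:6789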

OSD

  • require more than monitors do
    • a filesystem where all OSD data is stored
      • XFS, Btrfs, ext4 (Btrfs should become great someday)
      • RAID? Probably not everywhere - why you shouldn’t use RAID, and when you might anyway
        • data is already replicated
        • random IOs on RAID5 are slow
        • mirroring should be done at the RADOS level
        • But you might consider it if
          • too many disks to run a daemon for each disk
          • Ceph for Hadoop processing without replication
          • if you want to be different from everybody else
            • could be great for some workload
    • Network connectivity
      • TCP/IP
      • separate OSD traffic from everything else if you like (not mandatory)
      • speed appropriate for your nodes’ disks and cluster speed requirements (10 GbE is great, but 1 GbE could work)
    • 1 GHz of CPU and 1 GB of RAM per daemon (rule of thumb)

OSD Journaling Decisions

  • for consistency
  • don’t use a RAM disk for it: the OSD considers the journal to be reliable, and if you lose the journal you lose the whole OSD (with Btrfs it’s not that bad)
  • hosting VM: journal should be fast
  • Options
    • SSD (often split out between multiple daemons)
    • separate partition or RAID group on data store disk
    • File on data store filesystem
  • How big?
    • the 10 GB default is a good number per OSD
    • every write goes to the journal first (a double write) - see the sketch below
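
In ceph.conf this boils down to two settings; the device path is hypothetical (e.g. a partition carved out of a shared SSD):

    [osd]
        osd journal size = 10240    # in MB, i.e. the ~10 GB mentioned above
    [osd.0]
        osd journal = /dev/sdg1     # dedicated journal partition for this daemon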

CRUSH - data placement algorithm

  • Model your failure domains
    • default list: OSD, host, rack, row, room, datacenter, “root”
    • OSD, host, and rack are enough in most datacenters
  • if you want, you can set up OSD tiers

Node configuration

  • lots of ways to configure a node
  • the highest rack density would be great

Metadata Server - MDS (cache of information stored in RADOS)

  • More RAM -> More filesystem hierarchy cached
  • Network connectivity
  • A lot of RAM
  • How many MDSes? Only one (active) today
    • scale your MDS cluster based on active usage, not total data or tree size
  • no stateful information on the node: if it dies, just restart it somewhere else, that’s fine (a standby node can follow the active node automatically and pre-populate its cache based on the journal)

RadosGW

  • Network connectivity
    • Cluster aware so load balancing would be great for them
  • RAM
  • CPU Power

mkcephfs

  • old option
  • does a better job at auto-setting some values
  • but doesn’t support adding or removing nodes on its own
  • so it’s not a good way to maintain production systems

ceph-deploy

  • soon to be recommended, but currently new and shiny (see the sketch below)
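
The workflow it aims for looks roughly like this; hostnames and the disk are hypothetical, and since the tool was brand new at the time, treat the exact subcommands as indicative:

    ceph-deploy new mon-host-1
    ceph-deploy install mon-host-1 osd-host-1
    ceph-deploy mon create mon-host-1
    ceph-deploy osd create osd-host-1:/dev/sdb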

Chef (more common option)

  • cookbooks available
  • can add new nodes
  • DreamHost manages 800 daemons with Chef

Crowbar

  • barclamps available for Ceph
  • SUSE Cloud also uses Crowbar

Juju

  • service orchestration system
  • some charms available for Ceph, see the Charm Store (and the sketch below)
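
Deploying with the charms is pleasantly short; a sketch, assuming the ceph and ceph-osd charms from the store:

    juju deploy -n 3 ceph              # 3 units running monitors
    juju deploy -n 3 ceph-osd          # 3 OSD machines
    juju add-relation ceph ceph-osd    # wire them together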

Manual deployment

  • not really difficult
  • you need a monitor address
  • a single afternoon’s job

Q&A

  • RADOS doesn’t do asynchronous writes
  • a big SSD used for journaling should be partitioned with a small 16 GB partition; this will increase the life of the SSD compared to using the whole 200 GB

Panel: Deploying Large Ceph Clusters

Participants

  • Canonical engineer
    • scaling out is a different set of problems
    • we now have tools to scale much more easily
    • but that introduces additional problems
    • with modern tools it becomes easier to think about scalability early in the development process
    • we don’t want to panic about scale issues while in production
  • Greg Farnum core developer @ Inktank
    • I never worked at scale up
  • Senior Product Manager @ SUSE works on HPC and High Availability
    • scale up or scale out doesn’t matter; customers use both
    • enterprise customers don’t care too much about the details of scalability; they ask for a toolset/product that makes it easy

What SUSE, Ceph, Canonical are doing to allow customers to deploy large datacenters

  • SUSE: QA, fixing issues, investing in open source
  • Ceph: academic for 6 years
    • spent a lot of time instrumenting and exposing information to users
    • how to be a good citizen amongst deployment tools
  • Ubuntu: Juju is a service orchestration tool designed to deploy services
    • fully open source
    • designed to deploy to the cloud
    • deploy services
    • manages inter-relationships between services
    • architect a service with all its parts, with scaling out made easy
    • the equivalent of apt-get for services in the cloud
    • Juju charms for Ceph are already available
    • a 10-fold reduction in the complexity of deploying Ceph

Good/bad decisions when deploying a 10-node Ceph cluster

  • SUSE: don’t try too much at the same time
  • Ceph: keep the objective in mind - how much should it grow?
    • management tools differ based on cluster size
    • make sure that data is placed based on its size/availability; there are lots of configuration options
  • Canonical: start small, with a PoC environment
    • keep in mind how big it will grow
    • optimize each step of the way
    • learn which tools are great at each scale

Are tools converging or diverging?

  • Canonical: a lot of them are competing
    • some tools are very good at a particular job
    • bending them to do something else isn’t the right approach
    • an example is Puppet, which is a configuration manager, and a great one
      • Puppet Enterprise now includes provisioning, competing with Ubuntu
      • it’s for everyone to decide which is the best one for each job
      • for Canonical, Puppet is being bent to do more than it was designed to do
  • Ceph: tools are really different, no convergence on the horizon
    • Chef will take a node to the state you want - multiple weeks to learn
    • Juju - a 1-day learning process, but 1 purpose per VM
  • SUSE: we are all living on Open Source World
    • we know that a lot of projects don’t take the re-use approach but rebuild everything
    • so we will see different tools coming for the same job

Rebuilding instead of solving issues… how has this changed the upgrade process and other things?

  • Canonical: new ways of rebuilding should reduce the time it takes to solve issues
    • like in mechanical engineering, where they do component replacement
    • make it simpler and easier to redeploy
  • SUSE: lifecycle is a pain
    • Ceph and other distributed scale-out architectures should be able to update a node while the complete system keeps running
    • sysadmins fear upgrades, but in a situation where we can roll out upgrades node by node it could be safer

What problems are we still trying to solve?

  • SUSE: Biggest issue is still scaling
  • Ceph: what ubiquitous and cheap virtualization will enable us to do
    • computing will be used in a different way
    • the new Dropbox or Ubuntu One are just the beginning
    • shared hosting will be terminated; VMs will prevail
  • Canonical: hyper-scale, much denser environments - a serious challenge for the management tools
    • we will have to throw away the old management tools

Ceph Geo replication architecture (Sage Weil)

  • the most requested feature!
  • sort of an interesting question - it’s hard to know what people actually need
  • Let’s start with what Ceph does
    • node storage locations are determined by CRUSH
    • 3-way replication: the algorithm decides where to store content
    • we don’t control where each object is stored
    • in a scenario with 2 datacenters
      • set up 2 different pools with a CRUSH rule to store content locally
      • main limitation: writes are still synchronous, which could be slow
      • so US East to US West is not good enough here
      • multiple datacenters with millisecond latency, as in Europe, would be fine like this
      • the latency of spinning disks needs to be compared with the speed-of-light latency between the 2 datacenters
    • Mirroring feature available in a lot of products today
      • a slave lags behind by some amount of time
      • great for failover scenarios
      • the 1st use case targeted by the Ceph geo-replication effort
      • they plan to do this per pool
      • a secondary pool in some other location will be configured to mirror the first pool
      • some projects call rsync in the background, but that doesn’t provide consistency
      • the trick is to mirror in parallel and stop at a consistent state on the other site
      • the implicit ordering in the current architecture, like a point in time, could be used to achieve this goal
      • cross-replication will be possible
      • 2-3 months away; the next release will contain a prototype
  • Other type of replication
    • active/active scenario
    • the only way to do that is to give up the current replication mechanism
    • it would have to be a totally different mechanism, further into the future
    • they don’t want to re-implement an illusion of consistency like in Riak and Cassandra

Librados

  • great for building new services on top of RADOS
  • Pools
    • 1s to 100s
    • independent namespaces or object collections
    • per-pool replication level and placement policy (see the sketch below)
  • objects
    • bazillions
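
Creating a pool and setting its replication level from the CLI; the name and PG count are illustrative:

    ceph osd pool create mypool 128    # 128 placement groups
    ceph osd pool set mypool size 3    # keep 3 replicas of every object
    rados lspools                      # list all pools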

atomic transactions

  • client operations are sent to the OSD cluster
    • operate on a single object
    • can contain a sequence of operations
      • truncate object
      • write new object data
      • set attribute
    • atomicity
      • all operations commit or do not commit atomically

key/value storage

  • Key/value pairs in an object
  • based on Google leveldb
    • based on BigTable SSTable design
  • exposed via key/value API
  • avoids a read/modify/write cycle for updating complex objects (see the sketch below)
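
The rados CLI exposes the omap key/value API directly; object, key and value below are made up:

    rados -p mypool setomapval myobject mykey myvalue
    rados -p mypool getomapval myobject mykey
    rados -p mypool listomapvals myobject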

Watch/notify

  • establish a stateful ‘watch’ on an object
  • send ‘notify’ messages to all watchers
  • use any object as a communication/synchronization channel
    • locking, distributed coordination (à la ZooKeeper), etc.
  • example of users
    • radosgw uses it for cache consistency
      • radosgw instances watch a single object (.rgx/notify)
      • locally cache bucket metadata
      • on bucket metadata changes (removal, ACL changes)
        • write changes to the relevant bucket
        • send a notify
      • on receipt of a notify
        • invalidate the relevant portion of the cache

RADOS Classes

  • dynamically loaded .so
    • /var/lib/rados-classes/*
    • implement new object “methods” using existing methods
    • part of I/O pipeline
    • simple internal API
  • you can add all kinds of new features based on RADOS storage
    • a grep class that reads entire objects
    • sha1 - generates a hash after reading the whole object
    • images - image manipulation: rotate, thumbnail
  • Idea: build a scalable mail backend on top of RADOS (a favorite of Sage’s)
    • most hosting SPs use NFS
    • an object for each message or attachment
  • Idea: distributed key/value table
    • aggregate many k/v into a big table
    • working prototype exists (thanks Eleanor!)
  • lua rados class
    • embeds a Lua interpreter
    • distributed computing

Caveats

  • librados.h is well documented, but the C++ API has fewer docs

iSCSI

  • TGT + librbd
    • user-level
  • kernel reference implementation
    • kernel RBD + bcache + lio/SCST
    • test, validate, share

Configuration Management

  • Puppet
  • CFEngine

odds and ends

  • ACL support in kernel
  • MDS load balancing heuristics - very complicated problem

Final Q&A

About Data Integrity

  • the TCP checksum is weak, so Ceph does additional checksumming; this keeps improving
  • OSDs scrub every 24h; every OSD now also does deep scrubbing to make sure the data is good
    • Btrfs does this by default, so there will be less work to do if we use that filesystem
    • kernel-level checksumming will improve things from top to bottom and back when reads happen, to make sure everything is fine
  • today, if a replica differs, we keep the first one it seems - I’ll have to check that

Upgrade process

  • apt-get upgrade
  • apt-get install
  • restart the node
  • rolling upgrades of one node at a time are being worked on and will be available in the next version

Crash: what should we do?

  • powering down/up at the same time -> fine
  • it’s harder when
    • the underlying FS corrupts itself
    • Ceph relies on it
    • we don’t really know when that happens -> problem
    • Btrfs is better because it does checksumming
  • Ceph is careful about knowing where the data was
    • Perl or Bash scripts can recover files from RBD objects in case of failure

Any mathematically computed object-lifetime calculation?

  • Not yet implemented
  • could be implemented at the RADOS level, why not