OceanStor Dorado V6 Technical Deep Dive
Security Level: Internal Only
Overview of OceanStor Dorado V6

Type                              Dorado3000 V6     Dorado5000 V6     Dorado6000 V6     Dorado8000 V6     Dorado18000 V6
Tier                              Entry-Level       Mid-Range         Mid-Range         High End          High End
Height / controllers per engine   2U/2C             2U/2C             2U/2C             4U/4C             4U/4C
Controller expansion              2-16              2-16              2-16              2-16              2-32
Maximum disks                     1200              1600              2400              3200              6400
Cache per dual controller         192G              256G/512G         512G/1024G        512G/1024G/2048G  512G/1024G/2048G

Front-end ports: 8/16/32G FC, 1/10/25/40/100G Ethernet
Back-end ports: SAS 3.0; SAS 3.0/100G Ethernet
Huawei Confidential
Dorado V6: The Cutting Edge of Storage Innovation
Hardware Design
Extreme Reliable
Extreme Performance
High Efficiency
New Generation Innovative Hardware Platform
Design goals: standardization, high density.
[Figure: front panel and rear panel of each enclosure type.]
High-end controller enclosure
• 4U, 4 controllers per controller enclosure
• 4U, 28 shared interface slots
Mid-range controller enclosure
• 2U, 2 controllers per controller enclosure
• 2U, 36 NVMe SSDs (high density)
Entry-level controller enclosure
• 2U, 2 controllers per controller enclosure
Intelligent DAE
• 2U, 25 SAS SSDs
• 2U, 2 controllers per controller enclosure
Controller design for high-end series
[Figure: high-end controller enclosure. Labeled components: PWR, MGT board, controller, full-shared IO card, BBU, FAN.]
Controller design for mid-range series
[Figure: mid-range controller enclosure. Labeled components: 36 x Palm or 25 x 2.5" disk slots, high-speed connectors, power supply, CPU, IOB, IO cards, PWR 2000 W max.]
Controller design for Entry-Level series & Intelligent DAE
[Figure: entry-level controller and intelligent DAE board. Labeled components: high-speed connectors, power supply, 2 x Mini SAS HD, 12V to 5V converter, IO bridge, CPU, BBU, 2 x PSU, 2 x IO card, 4 x SFP+, 4 x RJ45.]
Storage unit design: low cost, high density
SAS SSD (older version compared with new version)
• Depth shortened to 126 mm.
• Less space; fits a 1-meter-deep cabinet.
NVMe SSD (U.2 NVMe SSD compared with dual-port Palm SSD)
• Width reduced by 36%.
• 40% increase in energy density.
2U, 36 disks, high capacity density
Traditional architecture design
1. The heat dissipation window is small and the wind resistance is large.
2. Double-sided connectors interfere with each other, which limits the number of hard disks: with more than 25 disks, double-sided connectors cannot be staggered.
Dual Horizontal Orthogonal Architecture Design (horizontal backplane and orthogonal connection)
1. The window area increases by 50%, and the heat dissipation capability increases by 25%.
2. Orthogonal connection without dual-side interference increases the number of hard disks by 44%.
2U integrated equipment with 36 Palm SSDs, 44% more SSDs than is typical in the industry.
PALM Disk
• Traditional U.2 NVMe SSD: size 100.6*14.8*70, volume 103cm³.
• User-defined Palm form, dual-port Palm SSD: size 160*9.5*79.8, volume 121cm³.
• 44% more disk slots fit across the width of a 19-inch cabinet.
• Same capacity, width reduced by 36%.
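The density figures for the Palm form factor can be checked with simple arithmetic. The sketch below uses only the dimensions and slot counts quoted on the slide (assumed to be in millimetres); the function names are illustrative. Note that the U.2 dimensions as printed multiply out to roughly 104 cm³, slightly above the 103 cm³ on the slide, presumably because the printed dimensions are rounded.

```python
def volume_cm3(length_mm, width_mm, height_mm):
    """Volume of a rectangular drive envelope in cubic centimetres."""
    return length_mm * width_mm * height_mm / 1000.0


def percent_change(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


u2_volume = volume_cm3(100.6, 14.8, 70)        # U.2 NVMe SSD
palm_volume = volume_cm3(160, 9.5, 79.8)       # dual-port Palm SSD
width_reduction = -percent_change(14.8, 9.5)   # 14.8 mm wide slot -> 9.5 mm
slot_increase = percent_change(25, 36)         # 25 slots -> 36 slots in 2U

print(f"U.2 volume:      {u2_volume:.1f} cm^3")
print(f"Palm volume:     {palm_volume:.1f} cm^3")
print(f"Width reduction: {width_reduction:.0f}%")
print(f"Slot increase:   {slot_increase:.0f}%")
```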
Innovative Hardware Platform Overview: with self-developed chipsets
• Network chip: Hi1822. Lower network latency, from 160 μs to 80 μs.
• CPU chip: Kunpeng 920. No. 1 ARM CPU, 930+ SPECint. Intelligent enclosure, CPU integrated.
• AI chip: Ascend 310. AI SoC for mini-scale training.
• SSD chip: Hi1812e. Lower SSD latency, from 40 μs to 20 μs (write).
• BMC chip: Hi1710. Troubleshooting accuracy rate of 93%.
Kunpeng 920, the best processor for storage
• High concurrency: up to 48 cores in one CPU.
• High integration: not only computing. Integrated blocks shown: 48 cores, acceleration engine, 100G RoCE & SAS 3.0, PCIe 4.0, 8-channel DDR4.
• High density: 4 sockets in 1U space.
Dorado V6 Design Principle: Distributed, End-to-End NVMe, and Global Shared Resource
[Figure: hosts connect over FC/FC-NVMe/NVMe over Fabric/iSCSI to storage engines. Each engine contains shared front ends, storage controllers, and shared back ends, which connect over NVMe over Fabric (RoCE) to intelligent DAEs.]
Distributed Architecture
• Consistent distributed architecture for the high-end, mid-range, and entry-level series of Dorado V6.
• Symmetric active-active cluster (supports symmetric access by the hosts).
• Load balancing across all controllers, with automatic rebalancing upon scale-out, failover, and failback.
End-to-End NVMe
• Front-end: NVMe over FC (32G) / NVMe over Fabrics (RoCE).
• Back-end: NVMe SSD / Intelligent DAE.
Global Shared Resource
• Global cache and global storage pool for all LUNs.
• High-end series support the shared front-end module and the shared back-end module (100GE RDMA).
Hardware Design
Extreme Reliable
Extreme Performance
High Efficiency
Introduction of Connectivity
SmartMatrix Technology over 100GE RDMA
[Figure: host IO enters through shared front-end modules. Each engine holds four controllers (A, B, C, D) with 48-core CPUs. Shared back-end modules connect through back-end ports to IP DAEs (SAS/NVMe).]
• Every physical port connects to all four controllers in one engine.
• Full-mesh interconnection between all controllers in each engine.
• Shared interconnection module for connections between the engines.
• One intelligent disk enclosure can be accessed by 8 controllers (2 engines) through the shared back-end module.
Intelligent Front-end Connection: Shared Interface Based on Self-developed Networking Chipset
[Figure: a server connects through an FC switch to the shared interface (WWN: 2100xxxxabcd), which connects over the backplane to Ctrl.A (failed), Ctrl.B, Ctrl.C, and Ctrl.D. No impact on the host.]
Failure mode of the shared and intelligent interface: controller failure is transparent to the host.
• No impact on the host: FC links stay up and services keep running without any alarm or event.
• Rapid takeover inside the interface: the related I/Os are redirected to other controllers by the front-end chipset.
High Availability Architecture (HyperMetro-inner for High-end Series)
[Figure: three panels, each showing two engines with shared front-end and shared back-end modules, controllers A to H holding cache copies (for example A, A’, A’’), and an intelligent DAE.]
Tolerance of 2 simultaneous controller failures
• Global cache supports 3 copies across two engines.
• At least 1 cache copy is guaranteed to remain available if 2 controllers fail simultaneously.
• A single engine can also tolerate 2 simultaneous controller failures with 3-copy global cache.
Tolerance of 1 engine failure
• Global cache supports 3 copies across two engines.
• One disk enclosure can be accessed by 8 controllers (2 engines) through the shared back-end module.
• At least 1 cache copy is guaranteed to remain available if one engine fails.
Tolerance of 7 controller failures
• Global cache provides continuous mirroring technology.
• Tolerates the failure of 7 of 8 controllers (2 engines), one by one.
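The 3-copy guarantees can be checked by brute force. The sketch below is a minimal model, not Huawei's placement algorithm: it assumes each controller's cache data is kept on itself, on its neighbour in the same engine, and on one controller in the other engine (the `place_copies` rule and the engine names are hypothetical). Continuous mirroring, which re-creates copies after each failure and underlies the 7-of-8 case, is not modeled.

```python
from itertools import combinations

ENGINES = {"engine0": ["A", "B", "C", "D"], "engine1": ["E", "F", "G", "H"]}
CONTROLLERS = [c for members in ENGINES.values() for c in members]


def place_copies(primary):
    """Hypothetical placement: two copies in the home engine, one in the other."""
    home = next(e for e, members in ENGINES.items() if primary in members)
    other = next(e for e in ENGINES if e != home)
    idx = ENGINES[home].index(primary)
    neighbour = ENGINES[home][(idx + 1) % len(ENGINES[home])]
    remote = ENGINES[other][idx]
    return {primary, neighbour, remote}


def survives(failed):
    """True if every controller's cache data still has at least one live copy."""
    failed = set(failed)
    return all(place_copies(c) - failed for c in CONTROLLERS)


any_two_ok = all(survives(pair) for pair in combinations(CONTROLLERS, 2))
any_engine_ok = all(survives(members) for members in ENGINES.values())
print("any 2 controllers:", any_two_ok)
print("any 1 engine:     ", any_engine_ok)
```

Without re-mirroring, a third simultaneous failure can still remove all copies (for example A, B, and E under this placement), which is why the 7-controller case on the slide is stated as failures one by one.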
The best Active-Active design
Vendor1: 2-controller scale-out
• Front-end: each IO interface belongs to one controller, so a controller failure causes a link switchover.
• Back-end: disk enclosure shared by a dual-controller pair; failure of both controllers (one engine) causes service interruption.
Vendor2: 4~8 controllers scale-out
• Front-end: shared front end, no switchover when any controller fails.
• Back-end: disk enclosure shared by four controllers; failure of all 4 controllers (one engine) causes service interruption.
Huawei Dorado V6
• Front-end: shared front end, no switchover when any controller fails. Controller failure is transparent to the host.
• Controller: global cache provides continuous mirroring technology and 3 copies across 2 engines.
• Back-end: disk enclosure shared by 8 controllers (2 engines). No service interruption when any 2 controllers fail at the same time, when 1 engine fails, or when 7 of 8 controllers (2 engines) fail one by one.
Multi-level reliability technology combination
Levels: component reliability, product reliability, architecture reliability, solution reliability.
• Global disk protection: reliability first in the industry.
• RAID-TP: tolerates 3 simultaneous disk failures.
• SmartMatrix: full-mesh architecture, reliability first in the industry.
• A-A without gateway: business continuity, no fault.
99.9999% high availability for the most demanding enterprise reliability needs.
Self-developed SSD disk
• Dorado supports RAID 5/6/TP, tolerating simultaneous failures of up to three disks.
• RAID 4 is supported in SSDs to ensure data reliability.
• Global wear leveling across the storage pool.
• Huawei's patent: global anti-wear leveling.
RAID 2.0+
[Figure: traditional RAID with LUN virtualization and a hot spare, compared with RAID2.0+ block virtualization and hot spare space.]
Huawei RAID2.0+: bottom-layer media virtualization plus upper-layer resource virtualization for fast data reconstruction and smart resource allocation.
Fast data reconstruction:
• Data reconstruction time is shortened from 10 hours to only 30 minutes, a 20-fold improvement in reconstruction speed.
• Adverse service impacts and disk failure rates are reduced.
• All disks in a storage pool participate in reconstruction, and only service data is reconstructed.
• The traditional many-to-one reconstruction mode is transformed into a many-to-many fast reconstruction mode.
Gateway-Free Active-Active Solution
Lightning Fast, Rock Solid
[Figure: ERP, CRM, and BI applications served by production center A and production center B, with a DR center.]
HyperMetro gateway-free active-active
• Gateway-free: fewer nodes, simplified management.
• Active-Active: load balancing between sites, RPO = 0 and RTO ≈ 0.
Easy-to-Scale
• Smooth upgrade to 3DC provides a higher level of reliability.
• Serial, parallel, and ring 3DC networking meets the most demanding enterprise reliability requirements.
• Interconnection with traditional storage reduces the costs of building disaster recovery systems.
FastWrite - Dual-Write Performance Tuning
[Figure: write I/O between Dorado V6 storage at Site A and Dorado V6 storage at Site B, 100 km apart, over 8 Gbit/s Fibre Channel/10GE. The traditional sequence (write command, transfer ready, data transfer, status good) takes two round trips (RTT-1, RTT-2). The FastWrite sequence (command with data transfer, status good) takes one round trip (RTT-1).]
Traditional solution: write I/Os require two interactions between the two sites (write command and data transfer). On a 100 km link this costs RTT (≈1.3 ms) x 2.
FastWrite: a private protocol combines the two interactions (write command and data transfer), reducing cross-site write I/O interactions by 50%. On a 100 km link only one RTT is needed, improving service performance by 25%.
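The round-trip saving of FastWrite follows directly from the numbers above. This is a minimal sketch of the link-latency arithmetic only; the function name is illustrative, and the 25% service-level gain quoted on the slide depends on other latency components that are not modeled here.

```python
RTT_MS = 1.3  # round-trip time quoted for a 100 km link


def cross_site_write_latency_ms(round_trips, rtt_ms=RTT_MS):
    """Link latency contributed by cross-site protocol round trips."""
    return round_trips * rtt_ms


traditional = cross_site_write_latency_ms(2)  # write command, then data transfer
fastwrite = cross_site_write_latency_ms(1)    # both combined into one interaction
saved = traditional - fastwrite

print(f"traditional: {traditional:.1f} ms, FastWrite: {fastwrite:.1f} ms, saved: {saved:.1f} ms")
```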
End-to-End Symmetric Architecture
[Figure: host -> SAN -> hash sharding (DHT) -> global cache -> global pool.]
Symmetric interface
• All series support active-active host access; requests can be distributed evenly across every front-end link.
• LUNs in all series have no owning controller, which simplifies use and load balancing (LUNs are divided into slices, and slices are distributed evenly across all live controllers using a DHT algorithm).
• High-end series provide a shared, intelligent front-end IO module that divides LUNs into slices and sends requests to their target controllers to reduce latency.
Global cache
• I/Os of a LUN (located in one or more slices) can be written to the caches of all controllers, after which the host is acknowledged.
• The intelligent read cache of all controllers can prefetch the data and metadata of all LUNs to improve cache hits.
Global pool
• The storage pool can span all controllers and use all SSDs connected to them to store the data and metadata of all LUNs with RAID2.0+.
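The slice-to-controller mapping described above can be illustrated with a consistent-hash ring. This is a generic DHT sketch, not the Dorado implementation: the `SliceRouter` class, the 64 MB slice size, and the virtual-node count are all assumptions made for the example.

```python
import bisect
import hashlib

SLICE_SIZE = 64 * 1024 * 1024  # assumed slice size for illustration


def _hash(key):
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class SliceRouter:
    """Consistent-hash ring that maps (LUN, slice) to a live controller."""

    def __init__(self, controllers, vnodes=64):
        self.vnodes = vnodes
        self.controllers = list(controllers)
        self._rebuild()

    def _rebuild(self):
        # Each controller contributes several points so slices spread evenly.
        self.ring = sorted((_hash(f"{c}#{v}"), c)
                           for c in self.controllers for v in range(self.vnodes))
        self.keys = [h for h, _ in self.ring]

    def remove(self, controller):
        """Drop a failed controller; only its slices change owner."""
        self.controllers.remove(controller)
        self._rebuild()

    def owner(self, lun, byte_offset):
        """Controller that owns the slice containing this byte offset."""
        slice_id = byte_offset // SLICE_SIZE
        i = bisect.bisect(self.keys, _hash(f"{lun}:{slice_id}")) % len(self.keys)
        return self.ring[i][1]
```

Because no LUN has a fixed owner, removing a controller from the ring reassigns only the slices it held, and the remaining slices stay where they were.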
How does Distributed Architecture work
[Figure: three I/O paths, each passing host -> SAN -> front-end IO module -> multiprocessor CPUs with dynamic resource allocation -> DHT -> global cache -> global pool (RAID2.0+, FlashLink 2.0).]
• Mid-range/entry-level with native multipath: active-active access (round-robin, etc.).
• Mid-range/entry-level with UltraPath: Huawei UltraPath uses an embedded router map, divides LUNs into slices, and distributes IOs to the target controller.
• High-end with native multipath: active-active access (round-robin, etc.). The shared front end uses an embedded router map, divides LUNs into slices, and distributes IOs to the target controller.
Intelligent Front-end Connection: Shared Interface Based on Self-developed Networking Chipset
[Figure: a server connects through an FC switch to the shared interface (WWN: 2100xxxxabcd), which connects over the backplane to Ctrl.0, Ctrl.1, Ctrl.2, and Ctrl.3.]
• The host sees only one session.
• Each controller has its own session to the host.
Intelligent CPU Partition Scheduling Algorithm - Reducing Latency by 30%
• CPU sockets in controllers: [Figure: (LUN, LBA) data in 64M units is mapped onto a DHT ring of nodes N1 to N7.]
• Core grouping in CPU: cores are grouped by task (I/O read/write, data switching channel, protocol parsing, data flushing); groups are either dedicated or shared.
• Core-based resource isolation: I/O read/write grouping places I/O read 1, I/O read 2, I/O write 1, and I/O write 2 on separate cores.
Global Cache with RDMA & WAL
[Figure: 4K and 8K writes from LUN0, LUN1, and LUN2 are appended (A, B, C, D, E, ...) to a write-ahead-log cache in linear space. The log lives in a global memory virtual address space (AddrN1, AddrN2, AddrN3, ...) that spans Controller-A, Controller-B, Controller-C, and Controller-D memory.]
Write latency (8K): 95us with a traditional cache, 50us with Dorado V6.
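The linear-space idea behind a write-ahead-log cache can be shown in a few lines. This is a single-node sketch with hypothetical names (`WriteAheadLogCache`, `write`, `read`); the RDMA transport and the global address space across controllers are not modeled.

```python
class WriteAheadLogCache:
    """Append-only cache: writes of any size are packed back to back in linear space."""

    def __init__(self):
        self.log = bytearray()
        self.index = {}  # (lun, lba) -> (offset, length) of the newest version

    def write(self, lun, lba, data):
        """Append data to the log tail and point the index at it."""
        offset = len(self.log)
        self.log += data
        self.index[(lun, lba)] = (offset, len(data))
        return offset

    def read(self, lun, lba):
        """Return the newest data written for (lun, lba)."""
        offset, length = self.index[(lun, lba)]
        return bytes(self.log[offset:offset + length])


cache = WriteAheadLogCache()
first = cache.write("LUN0", 0, b"a" * 4096)    # 4K write
second = cache.write("LUN1", 8, b"b" * 8192)   # 8K write
third = cache.write("LUN0", 0, b"c" * 4096)    # overwrite is appended, not updated in place
```

Appending avoids searching for a free slot that fits each write size; an overwrite simply lands at the tail and the index is repointed.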
Hardware Design
Extreme Reliable
Extreme Performance
High Efficiency
Performance Express Supported by E2E NVMe and RoCE
[Figure: host reads and writes travel over 32G/FC or 100G/RoCE to the shared front ends, storage controllers, and shared back ends of a storage engine, then over 100G/RoCE to intelligent DAEs. Latency labels shown: 50us, 100us, 30us.]
Self-developed ASIC interface module:
• Offloads the FCP and NoF protocol stacks.
• Higher rates: 32Gbps (FC) / 100Gbps (ETH).
• The chip responds to the host directly, reducing the number of I/O interactions.
• ASIC IO balancing/distribution.
• Multi-queue and polling, lock-free.
Self-developed ASIC SSD disk/enclosure:
• Read priority technology: read requests on SSDs are executed preferentially so that hosts get timely responses.
• The intelligent disk enclosure is equipped with CPU, memory, and a hardware acceleration engine. Data reconstruction is offloaded to the intelligent disk enclosure to reduce latency.
• Multi-queue and polling, lock-free.
What’s NVMe?
• SAS: designed for disk.
• NVMe: designed for flash/SCM.
[Figure: data path from CPU cores through a SAS controller to SSD/HDD, compared with the NVMe data path from CPU cores to SSD.]
NVMe Reduces Protocol Processing Latency
Reduced interactions: communication interactions are reduced from 4 to 2, lowering latency.
[Figure: SAS protocol stack (App, block layer, SCSI, SAS, SSD controller) compared with NVMe protocol stack (App, block layer, NVMe, SSD controller).]
SAS, initiator to target:
1. Transfer command
2. Ready to transfer
3. Transfer data
4. Response feedback
NVMe:
1. NVMe write command
2. NVMe write finished
NVMe provides a lower average storage latency than SAS 3.0.
NVMe Concurrent Queue and Lock-Free Processing
[Figure: SAS, single queue with lock: Core 0 to Core N share one locked queue for each of SAS SSD 0 to SAS SSD 24. Number of queues = 25 (Dorado 5000 SAS with 25 SSDs). NVMe, multiple queues and lock-free: Core 0 to Core N each own a queue on each of NVMe SSD 0 to NVMe SSD 35. Number of queues = 288 (Dorado 5000 NVMe with 36 SSDs, N = 7).]
• NVMe: every CPU core has an exclusive queue on each SSD, so no lock is needed.
• Number of queues per controller = number of disks * number of CPU cores processing back-end I/O.
• SAS: each controller has one queue to each SSD, shared by all CPU cores. Locks are added to ensure mutually exclusive access by multiple cores. The number of queues for a single controller equals the number of disks.
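The queue counts on this slide follow from the formula above. In the sketch below the function names are illustrative, and N = 7 is read as cores numbered 0 to 7, that is, 8 back-end cores, which is the interpretation that reproduces the figure of 288.

```python
def nvme_queue_count(disks, backend_cores):
    """One exclusive, lock-free queue per (core, SSD) pair."""
    return disks * backend_cores


def sas_queue_count(disks):
    """One lock-protected queue per SSD, shared by all cores."""
    return disks


highest_core_index = 7                      # N = 7 on the slide
backend_cores = highest_core_index + 1      # cores 0..7

nvme_queues = nvme_queue_count(36, backend_cores)
sas_queues = sas_queue_count(25)
print(f"NVMe queues: {nvme_queues}, SAS queues: {sas_queues}")
```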
NVMe architecture in Storage
What’s RoCE
RoCE (RDMA over Converged Ethernet) carries RDMA traffic over an Ethernet network.
RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer.
-- https://en.wikipedia.org/wiki/Remote_direct_memory_access
PCIe vs NVMe-oF

Item                  PCIe                                                 NVMe-oF (RoCE)
Latency               ≈40us                                                ≈55us
Maximum distance      < 1m                                                 7m~10m
Maximum of SSDs       256 for PCIe bus total, no more than 100 for SSDs    Unlimited
Shared architecture   Can be shared by 2 controllers                       Can be shared by 8 controllers, even 32 controllers with switch
DMA engine            Data channel, no DMA, CPU dependent                  DMA enabled, CPU independent
Intelligent NIC optimization: Traditional NIC -> TOE -> DTOE
[Figure: placement of the protocol stack for the three designs. Layers shown: PHY, MAC, IP, TCP, buffer, driver, socket, DIF, App, divided among the NIC, OS kernel space, and user space.]
Traditional NIC
• Challenge: a traditional network card triggers an interrupt to process each data packet, which consumes significant CPU resources.
TOE (TCP offload engine)
• Advantage: each application can finish a complete data processing pass before triggering an interrupt, significantly reducing how often the server must respond to interrupts.
• Challenge: there are still high-latency overheads such as kernel-mode interrupts, locks, system calls, and thread switching.
DTOE (Direct TCP offload engine)
• Advantage:
1. Moves transport-layer processing into the microcode of the Huawei customized 1822 network card.
2. Optimizes the storage application software to fit the new architecture.
3. Delivers data from the link layer directly into application memory.
4. Bypasses kernel mode, significantly reducing latency.
FlashLink: Intelligent Disk Enclosure & Collaboration of Chipsets
[Figure: system controllers (FEI front-end interface, CPUs, DIMMs, PCIe) and intelligent DAEs (CPUs, DIMMs, chips) with proprietary SSD chips. Workload labels: data read/write, cache flushing, advanced features, data reconstruction, *garbage collection, *compression, offloading.]
Offloading workloads from controllers to intelligent DAEs:
• Improves system performance by 30%.
• Improves reconstruction speed by 100%.
• Lowers the performance impact of reconstruction on business from 15% to 5%.
Offload data reconstruction to Intelligent disk enclosures
[Figure: chunks on SSDs in two enclosures, shown for both the traditional flow and the intelligent offload flow.]
Traditional reconstruction (controller with DAE A and DAE B): the controller reads the disks, calculates, and verifies.
1. Read data from disk.
2. Obtain 23 pieces of data.
3. Reconstructed data is written into hot spare space.
Intelligent offload (controller with Intelligent DAE A and Intelligent DAE B): each enclosure reads its disks, calculates, and verifies.
1.1, 1.2 Reconstruction is started on each intelligent DAE.
2.1, 2.2 Each DAE reads data from disk.
3.1, 3.2 One DAE obtains 12 pieces of data, the other 11 pieces.
4. The DAEs transmit P’ and Q’, and P” and Q”.
5. Data is reconstructed using P’, Q’, P”, and Q”, then written into hot spare space.
The reconstruction bandwidth of a single disk in the 23+2 RAID group is reduced from 24 times to 5 times.
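The offload works because parity reconstruction can be split into partial results that are combined later. The sketch below demonstrates this for XOR (P) parity only; the Q parity on the slide uses a different code and is omitted. The transfer counts are one reading of the "24 times to 5 times" figure, assumed here to mean chunk-sized transfers through the controller (23 reads plus 1 write, versus 4 partial results plus 1 write). All names are illustrative.

```python
import os
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


CHUNK = 16
data = [os.urandom(CHUNK) for _ in range(23)]    # 23 data chunks
parity = xor_blocks(data)                         # P parity only (assumption)
lost_index = 5
survivors = data[:lost_index] + data[lost_index + 1:] + [parity]  # 23 chunks left

# Each intelligent DAE reduces its local chunks to a single partial result.
partial_a = xor_blocks(survivors[:12])   # enclosure holding 12 chunks
partial_b = xor_blocks(survivors[12:])   # enclosure holding 11 chunks
recovered = xor_blocks([partial_a, partial_b])


def controller_transfers(chunks_read, daes=0, partials_per_dae=0):
    """Chunk-sized transfers through the controller to rebuild one chunk."""
    reads = daes * partials_per_dae if daes else chunks_read
    return reads + 1  # plus one write to hot spare space


traditional = controller_transfers(23)
offloaded = controller_transfers(23, daes=2, partials_per_dae=2)  # P', Q', P'', Q''
```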
Full stripe writing in RoW
Same performance across different RAID levels
[Figure: writes of 3 KB, 4 KB, and 7 KB from LUN0, LUN1, and LUN2 in logic space (16 KB) are combined by ROW and I/O aggregation into a log structure (A, B, C, D, E, ...) and written as a full stripe (A to E plus P and Q) to a CKG. A bar chart compares performance in KIOPS for RAID5, RAID6, and RAID-TP on a traditional array and on Dorado.]

Traditional way (data update)
Configuration   Extra Reads   Extra Writes   Total IOs (extra IO)
RAID-5          2             1              4 (3)
RAID-6          3             2              6 (5)
RAID-TP         4             3              8 (7)

Dorado way (full stripe)
Configuration   Extra Reads   Extra Writes   Total IOs
RAID-5          0             0              1
RAID-6          0             0              1
RAID-TP         0             0              1
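The traditional-array numbers in the table match the usual read-modify-write accounting: an in-place update reads the old data and every old parity, then writes the new data and every new parity. The sketch below reproduces the table under that assumption; the function names are illustrative, and the full-stripe case simply returns the per-write counts as the slide states them.

```python
PARITY_COUNT = {"RAID-5": 1, "RAID-6": 2, "RAID-TP": 3}


def read_modify_write_ios(level):
    """(extra reads, extra writes, total IOs) for one in-place small write."""
    parity = PARITY_COUNT[level]
    extra_reads = parity + 1    # old data plus each old parity block
    extra_writes = parity       # each updated parity block
    total = 1 + extra_reads + extra_writes
    return extra_reads, extra_writes, total


def full_stripe_ios(level):
    """ROW full-stripe write: no old data or parity is read back."""
    return 0, 0, 1


for level in PARITY_COUNT:
    print(level, read_modify_write_ios(level), full_stripe_ios(level))
```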
FlashLink: Smooth GC with Multi-stream to reduce WA by 60%
[Figure: in a standard array, physical blocks mix hot data, cold data, and deleted data. With multi-stream, a full block is reclaimed with no garbage movement. Bar charts compare write amplification (axis 0 to 4) and life cycle for standard and multi-stream arrays.]
Write amplification is reduced by over 60%, and life cycle is extended 2-fold.
FlashLink: Smooth GC with Multi-stream (cont.)
[Figure: ROW writes are steered into hot, warm, and cold block lists; TRIM and global GC reclaim blocks.]
• Hot block list (metadata): Metadata is changed frequently, so metadata flows are assigned to the same block list to reduce the amount of data moved during GC and improve GC efficiency.
• Warm block list (user data): User data is changed less frequently, and user data flows are also assigned to the same block list. Data to be moved can then be located sooner during GC and GC efficiency can be improved.
• Cold block list (global GC): User data that has remained unchanged for a long time is also less likely to be changed in the future. Such data flows are assigned to the same block list as well so that fewer blocks need to be scanned during GC and GC efficiency can be improved.
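Why separating streams lowers write amplification can be seen in a toy example. The sketch below is an illustration under simple assumptions, not FlashLink itself: each block is a list of page-valid flags, garbage collection relocates every valid page from any block that contains garbage, and write amplification is (host pages + relocated pages) / host pages.

```python
def gc_moved_pages(blocks):
    """Valid pages relocated before erasing every block that holds garbage."""
    return sum(sum(block) for block in blocks if not all(block))


def write_amplification(host_pages, blocks):
    """(host writes + GC relocations) / host writes."""
    return (host_pages + gc_moved_pages(blocks)) / host_pages


# True = page still valid, False = page invalidated by a later overwrite.
# Four hot pages were overwritten; four cold pages were not.
mixed = [[False, True, False, True], [False, True, False, True]]
separated = [[False, False, False, False], [True, True, True, True]]

wa_mixed = write_amplification(8, mixed)
wa_separated = write_amplification(8, separated)
print(f"mixed streams: {wa_mixed}, separated streams: {wa_separated}")
```

With hot and cold data in separate blocks, the hot block becomes entirely garbage and can be erased whole, while the cold block holds no garbage and is left alone.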
FlashLink: Global Garbage Collection
• FlashLink detects blocks with the highest garbage rates in real time.
• Background global GC moves the remaining valid data to new block lists through ROW.
• Blocks are trimmed, and the SSDs are informed that these blocks can be reclaimed for use, so data no longer needs to be moved within the disks.
Nearly all software components are in user-mode
Reducing latency caused by interactions between kernel mode and user mode
Traditional design: fewer user-level components, more interaction
• Components shown across user space and kernel space: OMM, driver, space mgmt., disk mgmt., value-added features, pool mgmt.
• Components call each other between user mode and kernel mode: high latency.
Dorado V6: full user mode, less interaction
• Components shown: OMM, space mgmt., value-added features, NVMe driver, disk mgmt., pool mgmt., SAS driver.
• Fewer interactions between the two modes: low latency.
CPU multi-core load balancing optimization: No grouping -> Grouping -> Grouping + Intelligent scheduling
[Figure: cores 0 to 3 running I/O and mirror tasks under each scheme; Dorado V6 adds a scheduler.]
Traditional (no grouping)
• Challenge: different tasks compete for time slices on different CPU cores, so I/O data is frequently copied between cores, resulting in high latency.
Dorado V3 (core grouping)
• Advantage: avoids interference and frequent resource switching.
• Challenge: with different services, some cores become overloaded, resulting in high latency.
Dorado V6 (grouping + intelligent scheduling)
• Advantage: based on load status, the intelligent scheduler dispatches tasks to other cores to achieve load balancing.
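The step from core grouping to grouping with intelligent scheduling can be sketched as follows. This is a minimal illustration, not the Dorado scheduler: the `CoreScheduler` class, the load unit, and the overload threshold are assumptions made for the example.

```python
class CoreScheduler:
    """Dispatch tasks within their core group, borrowing other cores when overloaded."""

    def __init__(self, groups, overload_threshold):
        self.groups = groups                # task type -> list of core ids
        self.threshold = overload_threshold
        self.load = {core: 0 for cores in groups.values() for core in cores}

    def dispatch(self, task_type, cost=1):
        """Return the core chosen for this task and account for its load."""
        own = self.groups[task_type]
        core = min(own, key=lambda c: self.load[c])
        if self.load[core] + cost > self.threshold:
            # The whole group is saturated: borrow the least loaded core overall.
            core = min(self.load, key=lambda c: self.load[c])
        self.load[core] += cost
        return core


scheduler = CoreScheduler({"io": [0, 1], "mirror": [2, 3]}, overload_threshold=2)
placements = [scheduler.dispatch("io") for _ in range(5)]
```

The first four I/O tasks stay on the I/O group (cores 0 and 1); the fifth would overload that group, so it is dispatched to an idle core from the mirror group.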
Hardware Design
Extreme Reliable
Extreme Performance
High Efficiency
Similar Deduplication VS Variable/Fixed length Deduplication
Data reduction effect:
• 50% increase, compared to fixed length deduplication.
• 30% increase, compared to variable length deduplication.
[Figure: 512B VS 64B.]
Scenarios compared (table columns: Description, Fixed, Variable, Similar; cell values not recoverable):
• Exactly the same
• Partially identical, sector offset
• Partially identical, bytes offset
• Partially identical, anywhere different
Fuzzy Matching Increases Data Reduction Rate by 25%
[Figure: bit patterns of data blocks flow through the steps below and then into deduplication and compression. Similarity scores such as 1, 0.9, 0.8, 0.7, 0.3, and 0 are shown.]
Steps shown: extract reference fingerprints, identify similar data, assemble similar data, then dedupe and compress.
25% higher image data reduction rate.
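The idea of keeping one reference block and storing only the differences for similar blocks can be sketched as below. This is a deliberately simple stand-in, not Huawei's fingerprinting algorithm: similarity is the fraction of equal bytes at equal positions, so it does not handle the offset scenarios listed earlier, and the 0.8 threshold and all function names are assumptions.

```python
def similarity(a, b):
    """Fraction of positions where two equally sized blocks hold the same byte."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def store(blocks, threshold=0.8):
    """Keep the first block of each similar group whole; keep byte deltas for the rest."""
    references, stored = [], []
    for block in blocks:
        best = max(range(len(references)),
                   key=lambda i: similarity(block, references[i]), default=None)
        if best is not None and similarity(block, references[best]) >= threshold:
            delta = {i: y for i, (x, y) in enumerate(zip(references[best], block)) if x != y}
            stored.append(("delta", best, delta))
        else:
            references.append(block)
            stored.append(("ref", len(references) - 1, None))
    return references, stored


def restore(references, entry):
    """Rebuild the original block from a stored entry."""
    kind, ref_index, delta = entry
    block = bytearray(references[ref_index])
    if kind == "delta":
        for position, value in delta.items():
            block[position] = value
    return bytes(block)


blocks = [b"0001111100", b"0001111101", b"1110000011"]
references, stored = store(blocks)
```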
Non-Disruptive Upgrading (NDU)
No business impact, no performance loss
• 94% of components run in user mode and are upgraded in 1 second, with no host connection loss thanks to the globally shared front-end card.
• 6% of components are in the stable kernel and are upgraded with a reboot that takes minutes.
[Figure: user-mode service components (protocol, data, control, management, inter-communication) are upgraded in 1 second on top of a stable OS kernel.]
Online upgrading without business disruption
Modular software architecture design, component upgrade in 1 second
[Figure: host I/O arrives at the Hi1822 HiSilicon SmartIO front-end interface; behind it run the system management, device management, configuration management, UI/CLI management, and IO processing processes.]
① User-mode upgrade, no system reboot.
② Self-designed chips hold IOs and maintain host link information.
③ The IO processing process is upgraded.
④ Host IO recovers.
Self-designed chips hold IOs and keep links alive.
IO Processing process upgrading period less than