02-Huawei OceanStor Dorado Architecture and Key Technology PDF [PDF]


OceanStor Dorado V6 Technical Deep Dive

Security Level: Internal Only

Overview of OceanStor Dorado V6

Type                                 Dorado3000 V6  Dorado5000 V6  Dorado6000 V6  Dorado8000 V6     Dorado18000 V6
Segment                              Entry-Level    Mid-Range      Mid-Range      High End          High End
Height / Controllers of each Engine  2U/2C          2U/2C          2U/2C          4U/4C             4U/4C
Controller Expansion                 2-16           2-16           2-16           2-16              2-32
Maximum Disks                        1200           1600           2400           3200              6400
Cache / Dual Controller              192G           256G/512G      512G/1024G     512G/1024G/2048G  512G/1024G/2048G

Front-end ports: 8/16/32G FC, 1/10/25/40/100G Ethernet
Back-end ports: SAS 3.0 or SAS 3.0/100G Ethernet, depending on model

Huawei Confidential

Dorado V6: The Cutting Edge of Storage Innovation


Hardware Design


Extreme Reliable

Extreme Performance

High Efficiency

New Generation Innovative Hardware Platform

Figure: front panel and rear panel views of each enclosure.

High-end controller enclosure
• 4U, 4 controllers per controller enclosure
• 4U, 28 shared interface slots

Mid-range controller enclosure
• 2U, 2 controllers per controller enclosure
• 2U, 36 NVMe SSDs (high density)

Entry-level controller enclosure
• 2U, 2 controllers per controller enclosure
• 2U, 25 SAS SSDs

Intelligent DAE
• 2U, 2 controllers per controller enclosure

Design goals: standardization, high density.

Controller design for high-end series

Figure labels: PWR, MGT board, controller, fully shared I/O card, BBU, fan.

Controller design for mid-range series

Figure labels: 36 x Palm SSD slots, 25 x 2.5-inch disk slots, high-speed connectors, power, CPU x 2, INTPS ER, IOB, I/O card x 3, PWR (2000 W max).

Controller design for entry-level series and Intelligent DAE

Figure labels: high-speed connectors, power, Mini SAS HD x 2, 12 V to 5 V converter, I/O bridge, CPU, BBU, 2 x PSU, I/O card x 2, SFP+ x 4, RJ45 x 4.

Storage unit design: low cost, high density

SAS SSD (older version vs. new version)
• Depth shortened to 126 mm.
• Takes less space and fits a 1-meter-deep cabinet.

NVMe SSD (U.2 NVMe SSD vs. dual-port Palm SSD)
• Width reduced by 36%.
• 40% increase in energy density.

2U, 36 disks, high capacity density

Traditional architecture design
1. The heat dissipation window is small and the wind resistance is large.
2. Double-sided connectors interfere with each other, which limits the number of disks. With more than 25 disks, double-sided connectors cannot be staggered.

Dual horizontal orthogonal architecture design (horizontal backplane and orthogonal connection)
1. The window area increases by 50%, and the heat dissipation capability increases by 25%.
2. Orthogonal connection removes the dual-side interference, increasing the number of disks by 44%.

Result: a 2U integrated device with 36 Palm SSDs, 44% more SSDs than the industry norm.

Disk form factor       Traditional (U.2 NVMe SSD)  User-defined Palm form (dual-port Palm SSD)
Size (mm)              100.6*14.8*70               160*9.5*79.8
Volume                 103 cm³                     121 cm³

• 44% more disk slots fit across the width of a 19-inch cabinet.
• Same capacity, width reduced by 36%.
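The density figures on the Palm SSD slide can be cross-checked with simple arithmetic. The sketch below is only a check of the quoted numbers, and it rests on two assumptions that the slide does not state: that the "width" is the middle dimension (14.8 mm vs. 9.5 mm), and that the 44% gain compares 36 Palm slots with 25 conventional slots. Note that the plain box volume of the U.2 dimensions comes out near 104 cm³, slightly above the 103 cm³ quoted.

```python
# Dimensions in millimetres as quoted on the slide: (length, thickness, height).
U2_MM = (100.6, 14.8, 70.0)    # U.2 NVMe SSD
PALM_MM = (160.0, 9.5, 79.8)   # dual-port Palm SSD


def volume_cm3(dims_mm):
    """Bounding-box volume in cubic centimetres."""
    length, thickness, height = dims_mm
    return length * thickness * height / 1000.0


def pct_change(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


# Assumption: the "width" on the slide is the thickness that sets slot pitch.
width_cut = -pct_change(U2_MM[1], PALM_MM[1])
# Assumption: 25 conventional slots vs. 36 Palm slots per 2U enclosure.
slot_gain = pct_change(25, 36)

print(f"U.2 volume:    {volume_cm3(U2_MM):.1f} cm^3")
print(f"Palm volume:   {volume_cm3(PALM_MM):.1f} cm^3")
print(f"Width cut:     {width_cut:.1f} %")
print(f"Slot increase: {slot_gain:.1f} %")
```

Under these assumptions the thickness reduction rounds to 36% and the slot gain to 44%, which matches the two headline figures.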

Innovative Hardware Platform Overview: with self-developed chipsets

• Network chip (Hi1822): lower network latency, 160 μs -> 80 μs.
• CPU chip (Kunpeng 920): No. 1 ARM CPU, 930+ SPECint. Intelligent enclosure with integrated CPU.
• AI chip (Ascend 310): AI SoC for mini-scale training.
• SSD chip (Hi1812e): lower SSD write latency, 40 μs -> 20 μs.
• BMC chip (Hi1710): troubleshooting accuracy rate of 93%.

Kunpeng 920, the best processor for storage

• High concurrency: up to 48 cores in one CPU.
• High integration: not only computing. The chip also provides an acceleration engine, 100G RoCE and SAS 3.0, PCIe 4.0, and 8-channel DDR4.
• High density: 4 sockets in 1U of space.

Dorado V6 Design Principle: Distributed, End-to-End NVMe, and Global Shared Resource

Figure: hosts connect over FC, FC-NVMe, NVMe over Fabrics, or iSCSI to the storage engines. Each storage engine contains shared front ends, storage controllers, and shared back ends. The back ends connect over NVMe over Fabrics (RoCE) to the intelligent DAEs.

Distributed Architecture
• Consistent distributed architecture for the high-end, mid-range, and entry-level series of Dorado V6.
• Symmetric active-active cluster (supports symmetric access by the hosts).
• Load balancing across all controllers, with automatic rebalancing upon scale-out, failover, and failback.

End-to-End NVMe
• Front end: NVMe over FC (32G) / NVMe over Fabrics (RoCE).
• Back end: NVMe SSD / Intelligent DAE.

Global Shared Resource
• Global cache and global storage pool for all LUNs.
• High-end series support a shared front-end module and a shared back-end module (100GE RDMA).


Introduction of Connectivity

• Every physical front-end port connects to all four controllers in one engine.
• Full-mesh interconnection between all controllers in each engine. A shared interconnection module connects the engines.
• One intelligent disk enclosure can be accessed by 8 controllers (2 engines) through the shared back-end module.
• SmartMatrix technology over 100GE RDMA.

Figure: host I/O enters through the ports of the shared front-end modules, reaches controllers A, B, C, and D (48-core CPUs), and leaves through the shared back-end modules and back-end ports to the IP DAE (SAS/NVMe).

Intelligent Front-end Connection: Shared Interface Based on Self-developed Networking Chipset

Figure: a server connects through an FC switch to the shared interface (WWN: 2100xxxxabcd), which connects over the backplane to Ctrl.A (failed), Ctrl.B, Ctrl.C, and Ctrl.D. No impact on the host.

Failure mode of the shared, intelligent interface: controller failure is transparent to the host.
• No impact on the host: FC links stay up and services keep running without any alarm or event.
• Rapid takeover inside the interface: the front-end chipset redirects the affected I/Os to other controllers.

High Availability Architecture (HyperMetro-Inner for High-End Series)

Figure: two engines, each with shared front ends, controllers, and shared back ends, attached to intelligent DAEs. Each cache block (A to H) has a primary copy and two mirror copies (for example A, A', A'') placed on different controllers.

Tolerance of 2 simultaneous controller failures
• Global cache supports 3 copies across two engines.
• At least 1 cache copy is guaranteed to remain available if 2 controllers fail simultaneously.
• A single engine can also tolerate 2 simultaneous controller failures with 3 copies in the global cache.

Tolerance of 1 engine failure
• Global cache supports 3 copies across two engines.
• One disk enclosure can be accessed by 8 controllers (2 engines) through the shared back-end module.
• At least 1 cache copy is guaranteed to remain available if one engine fails.

Tolerance of 7 controller failures
• Global cache provides continuous mirroring technology.
• Tolerates the failure of 7 out of 8 controllers (2 engines), one by one.

The best Active-Active design

Vendor 1: 2-controller scale-out
• Front end: each I/O interface belongs to one controller, so a controller failure causes a link switchover.
• Back end: the disk enclosure is shared by two controllers. Failure of both controllers (one engine) interrupts service.

Vendor 2: 4 to 8 controller scale-out
• Front end: shared front end, no switchover when any controller fails.
• Back end: the disk enclosure is shared by four controllers. Failure of all 4 controllers (one engine) interrupts service.

Huawei Dorado V6
• Front end: shared front end, no switchover when any controller fails. Controller failure is transparent to the host.
• Controller: global cache provides continuous mirroring technology and 3 copies across 2 engines.
• Back end: the disk enclosure is shared by 8 controllers (2 engines).
• No service interruption when any 2 controllers fail at the same time, when 1 engine fails, or when 7 of 8 controllers (2 engines) fail one by one.

Multi-level reliability technology combination

Reliability levels: component reliability, product reliability, architecture reliability, solution reliability.

• Global disk protection: reliability first in the industry.
• RAID-TP: tolerates the failure of 3 disks at the same time.
• SmartMatrix: fully meshed architecture, reliability first in the industry.
• Active-active without gateway: business continuity, no fault.

99.9999% high availability for the most demanding enterprise reliability needs.
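To put the 99.9999% availability figure in concrete terms, the following sketch converts an availability ratio into expected downtime per year. It is plain arithmetic on the quoted percentage, assuming a 365-day year, and says nothing about how the figure was measured.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap year


def downtime_seconds_per_year(availability):
    """Expected unavailable seconds per year for a given availability ratio."""
    return (1.0 - availability) * SECONDS_PER_YEAR


six_nines = downtime_seconds_per_year(0.999999)
five_nines = downtime_seconds_per_year(0.99999)

print(f"99.9999%: {six_nines:.1f} s of downtime per year")
print(f"99.999%:  {five_nines / 60:.2f} min of downtime per year")
```

Six nines therefore corresponds to roughly half a minute of downtime per year.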

Self-developed SSD disk

• Dorado supports RAID 5/6/TP across the disks of a storage pool, tolerating simultaneous failures of up to three disks.
• RAID 4 is supported inside SSDs to ensure data reliability.
• Global wear leveling.
• Huawei's patent: global anti-wear leveling.

RAID 2.0+

Figure: traditional RAID (LUN virtualization) vs. RAID 2.0+ (block virtualization), each with hot spare. Data reconstruction speed is improved 20-fold.

Huawei RAID 2.0+ combines bottom-layer media virtualization with upper-layer resource virtualization for fast data reconstruction and smart resource allocation.

Fast data reconstruction
• Data reconstruction time is shortened from 10 hours to only 30 minutes, a 20-fold speedup.
• Adverse service impacts and disk failure rates are reduced.
• All disks in a storage pool participate in reconstruction, and only service data is reconstructed.
• The traditional many-to-one reconstruction mode is replaced by a many-to-many fast reconstruction mode.
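The difference between many-to-one and many-to-many reconstruction can be illustrated with a rough throughput model. The sketch below assumes rebuild time is bounded only by the aggregate write bandwidth of the disks that absorb the rebuilt data. The data volume, per-disk bandwidth, and number of participating disks are hypothetical values chosen so that the output reproduces the 10-hour and 30-minute figures on the slide; they are not Huawei measurements.

```python
def rebuild_hours(data_tb, per_disk_mb_s, writer_disks):
    """Hours needed to rewrite data_tb when writer_disks absorb writes in parallel."""
    total_mb = data_tb * 1_000_000
    return total_mb / (per_disk_mb_s * writer_disks) / 3600


# Hypothetical inputs: 3.6 TB to rebuild, 100 MB/s sustained per disk.
many_to_one = rebuild_hours(data_tb=3.6, per_disk_mb_s=100, writer_disks=1)
many_to_many = rebuild_hours(data_tb=3.6, per_disk_mb_s=100, writer_disks=20)

print(f"many-to-one:  {many_to_one:.1f} h")
print(f"many-to-many: {many_to_many * 60:.0f} min")
print(f"speedup:      {many_to_one / many_to_many:.0f}x")
```

In this model the speedup equals the number of disks sharing the write load. Reconstructing only service data, as the slide notes, would shrink `data_tb` further.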

Gateway-Free Active-Active Solution

Lightning Fast, Rock Solid

Figure: ERP, CRM, and BI applications run across production center A and production center B, which are linked by HyperMetro gateway-free active-active, with a DR center as the third site.

• Gateway-free: fewer nodes, simplified management.
• Active-active: load balancing between sites, RPO = 0 and RTO ≈ 0.

Easy to Scale
• Smooth upgrade to 3DC provides a higher level of reliability.
• Serial, parallel, and ring 3DC networking meets the most demanding enterprise reliability requirements.
• Interconnection with traditional storage reduces the cost of building disaster recovery systems.

FastWrite: Dual-Write Performance Tuning

Figure: at site A and at site B, a host attaches to Dorado V6 storage. The two arrays are 100 km apart and linked by 8 Gbit/s Fibre Channel or 10GE.

Traditional solution
• Write I/Os need two cross-site interactions (write command and data transfer): command, transfer ready, data transfer, status good.
• 100 km transfer link: RTT (≈1.3 ms) x 2.

FastWrite
• A private protocol combines the two interactions (write command and data transfer), reducing cross-site write I/O interactions by 50%.
• 100 km transfer link: only one RTT, improving service performance by 25%.
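The cross-site cost described for FastWrite is a matter of counting round trips. The sketch below uses the RTT of about 1.3 ms quoted for a 100 km link and compares two round trips with one. It models only the inter-site wait, so it does not reproduce the 25% service-level improvement, which also depends on local processing time that the slide does not break down.

```python
RTT_MS = 1.3  # round-trip time quoted on the slide for a 100 km link


def cross_site_wait_ms(round_trips, rtt_ms=RTT_MS):
    """Time a mirrored write spends waiting on the inter-site link."""
    return round_trips * rtt_ms


traditional = cross_site_wait_ms(2)  # write command, then data transfer
fastwrite = cross_site_wait_ms(1)    # command and data combined
saving = 1 - fastwrite / traditional

print(f"traditional: {traditional:.1f} ms")
print(f"FastWrite:   {fastwrite:.1f} ms")
print(f"link wait reduced by {saving:.0%}")
```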

End-to-End Symmetric Architecture

Figure: host, SAN, hash sharding (DHT), global cache, global pool.

Symmetric interface
• All series support active-active host access, and requests are distributed evenly across every front-end link.
• LUNs in all series have no owning controller, which simplifies use and load balancing. LUNs are divided into slices, and the slices are distributed evenly across all live controllers by a DHT algorithm.
• High-end series provide a shared, intelligent front-end I/O module that divides LUNs into slices and sends each request to its target controller to reduce latency.

Global cache
• I/Os (located in one or more slices) of LUNs can be written to the cache of all the controllers and then acknowledged to the host.
• The intelligent read cache of all the controllers can prefetch the data and metadata of all LUNs to improve the cache hit ratio.

Global pool
• A storage pool can span all the controllers and use all the SSDs connected to them to store the data and metadata of all LUNs with RAID 2.0+.
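The slice distribution described above (LUNs cut into slices, slices spread over live controllers by a DHT) can be sketched with a generic consistent-hash ring. This is an illustration of the general technique, not Huawei's algorithm: the 64 MB slice size, the `SliceRing` class, the SHA-256 hash, and the virtual-node count are all assumptions made for the example.

```python
import bisect
import hashlib

SLICE_BYTES = 64 * 1024 * 1024  # assumed slice size, for illustration only


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class SliceRing:
    """Consistent-hash ring mapping (LUN, slice) pairs to controllers."""

    def __init__(self, controllers, vnodes=64):
        # Each controller contributes several virtual points to even out load.
        self._ring = sorted(
            (_hash(f"{name}#{v}"), name)
            for name in controllers
            for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, lun_id: int, byte_offset: int) -> str:
        """Controller that owns the slice containing byte_offset of lun_id."""
        slice_index = byte_offset // SLICE_BYTES
        h = _hash(f"{lun_id}:{slice_index}")
        i = bisect.bisect_right(self._points, h) % len(self._ring)
        return self._ring[i][1]


ring = SliceRing(["A", "B", "C", "D"])
for s in range(4):
    print(f"LUN 0, slice {s} -> controller {ring.owner(0, s * SLICE_BYTES)}")
```

With a ring like this, removing one controller only remaps the slices that controller owned. The rest keep their owner, which is the property that makes rebalancing after failover or scale-out inexpensive.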

How does Distributed Architecture work

Mid-range/entry-level with native multipath
• Host: native multipath, active-active access (round-robin, etc.).
• Path: host -> SAN -> front-end I/O module -> controller CPUs (multiprocessor, dynamic resource allocation) -> DHT -> global cache -> global pool (RAID 2.0+, FlashLink 2.0).

Mid-range/entry-level with UltraPath
• Host: Huawei UltraPath holds an embedded router map, divides LUNs into slices, and distributes I/Os to the target controller.
• Path: host -> SAN -> front-end I/O module -> controller CPUs (multiprocessor, dynamic resource allocation) -> DHT -> global cache -> global pool (RAID 2.0+, FlashLink 2.0).

High-end with native multipath
• Host: native multipath, active-active access (round-robin, etc.).
• Shared front end: holds an embedded router map, divides LUNs into slices, and distributes I/Os to the target controller.
• Path: host -> SAN -> shared front-end I/O modules -> controller CPUs (multiprocessor, dynamic resource allocation) -> DHT -> global cache -> global pool (RAID 2.0+, FlashLink 2.0).

Intelligent Front-end Connection: Shared Interface Based on Self-developed Networking Chipset

Figure: a server connects through an FC switch to the shared interface (WWN: 2100xxxxabcd), which connects over the backplane to Ctrl.0, Ctrl.1, Ctrl.2, and Ctrl.3.

• The host sees only one session.
• Each controller has its own session to the host.

Intelligent CPU Partition Scheduling Algorithm: Reducing Latency by 30%

CPU sockets in controllers
• Figure: (LUN, LBA) data, in 64M units, is placed on a DHT ring of nodes N1 to N7, which selects the CPU that handles it.

Core grouping in CPU
• Cores are grouped by task: I/O read/write, data switching channel, protocol parsing, and data flushing.
• Core groups are either dedicated or shared.

Core-based resource isolation
• Within the I/O read/write group, individual cores serve separate flows: I/O read 1, I/O read 2, I/O write 1, I/O write 2.

Global Cache with RDMA & WAL

• 8K write latency: 95 us with a traditional cache, 50 us with Dorado V6.
• Write-ahead log cache: writes of different sizes (4K, 8K) from LUN0, LUN1, and LUN2 are appended in order (A, B, C, D, E, ...) to a linear space.
• Global memory virtual address space: addresses (AddrN1, AddrN2, AddrN3, ...) map onto the memory of Controller-A, Controller-B, Controller-C, and Controller-D.

Performance Express Supported by E2E NVMe and RoCE

Figure: host reads and writes travel over 32G FC or 100G RoCE to the shared front ends of a storage engine, through the storage controllers and shared back ends, and over 100G RoCE to the intelligent DAEs. Latency labels in the figure: 50 us, 100 us, 30 us.

Self-developed ASIC interface module
• Offloads the FCP and NoF protocol stacks.
• Higher rates: 32 Gbps (FC) / 100 Gbps (Ethernet).
• The chip responds to the host directly, reducing the number of I/O interactions.
• ASIC I/O balancing and distribution.
• Multi-queue and polling, lock-free.

Self-developed ASIC SSD disk/enclosure
• Read priority technology: read requests on SSDs are executed preferentially so that hosts get timely responses.
• The intelligent disk enclosure is equipped with its own CPU, memory, and hardware acceleration engine. Data reconstruction is offloaded to the intelligent disk enclosure to reduce latency.
• Multi-queue and polling, lock-free.

What's NVMe?

• SAS: designed for disk. CPU cores reach the SSD/HDD through a SAS controller.
• NVMe: designed for flash/SCM. CPU cores reach the SSD directly, without a SAS controller.

NVMe Reduces Protocol Processing Latency

Reduced interactions: communication interactions are reduced from 4 to 2, lowering latency.

SAS protocol stack (App, block layer, SCSI, SAS), between the controller (initiator) and the SSD (target):
1. Transfer command
2. Ready to transfer
3. Transfer data
4. Response feedback

NVMe protocol stack (App, block layer, NVMe), between the controller and the SSD:
1. NVMe write command
2. NVMe write finished

NVMe provides a lower average storage latency than SAS 3.0.

NVMe Concurrent Queue and Lock-Free Processing

SAS: single queue with lock
• Each controller has one queue to each SSD, shared by all CPU cores. Locks ensure exclusive access among the cores.
• The number of queues for a single controller equals the number of disks.
• Example: number of queues = 25 (Dorado 5000 SAS with 25 SSDs, SAS SSD 0 to SAS SSD 24).

NVMe: multiple queues, lock-free
• Every CPU core (Core 0 to Core N) has an exclusive queue on each SSD, so no lock is needed.
• Count of queues for each controller = count of disks * count of CPU cores for processing back-end I/O.
• Example: number of queues = 288 (Dorado 5000 NVMe with 36 SSDs, NVMe SSD 0 to NVMe SSD 35, N = 7).
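The queue counts quoted for the SAS and NVMe models follow directly from the two rules on the slide. The sketch below reproduces them, reading "N = 7" as cores numbered 0 to 7, that is 8 back-end cores. That reading is an assumption, although it is the one that yields 288.

```python
def sas_queue_count(ssds):
    """One queue per SSD, shared (under a lock) by every core."""
    return ssds


def nvme_queue_count(ssds, backend_cores):
    """One private queue per (core, SSD) pair, so no lock is needed."""
    return ssds * backend_cores


highest_core_index = 7                   # "N = 7" on the slide
backend_cores = highest_core_index + 1   # cores are numbered from 0

print("SAS queues: ", sas_queue_count(25))
print("NVMe queues:", nvme_queue_count(36, backend_cores))
```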

NVMe architecture in Storage


What's RoCE

RoCE (RDMA over Converged Ethernet) carries RDMA traffic over an Ethernet network.

RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer.
-- https://en.wikipedia.org/wiki/Remote_direct_memory_access

PCIe vs NVMe-oF

                     PCIe                                                       NVMe-oF (RoCE)
Latency              ≈40 us                                                     ≈55 us
Maximum distance     < 1 m                                                      7 m to 10 m
Maximum SSDs         256 for the whole PCIe bus, no more than 100 for SSDs      Unlimited
Shared architecture  Can be shared by 2 controllers                             Can be shared by 8 controllers, or by 32 with a switch
DMA engine           Data channel, no DMA, CPU dependent                        DMA enabled, CPU independent

Intelligent NIC optimization: Traditional NIC -> TOE -> DTOE

Traditional NIC
• The NIC handles PHY and MAC. IP, TCP, the driver, and the socket buffer are handled in the OS kernel space, below the application in user space.
• Challenge: a traditional network card triggers an interrupt for every data packet, which consumes CPU resources heavily.

TOE (TCP offload engine)
• The TOE NIC also handles IP and TCP. The driver and socket remain in the OS kernel space.
• Advantage: each application can finish a complete data processing pass before an interrupt is triggered, significantly reducing the interrupts the server has to handle.
• Challenge: there are still high latency overheads such as kernel-mode interrupts, locks, system calls, and thread switching.

DTOE (direct TCP offload engine)
• Advantage:
  1. Transport-layer processing moves to the microcode of the Huawei customized 1822 network card.
  2. Storage application software is optimized for the new architecture.
  3. Data is delivered from the link layer directly into application memory.
  4. The kernel is bypassed, significantly reducing latency.

FlashLink: Intelligent Disk Enclosure & Collaboration of Chipsets

Figure: each system controller (FEI front-end interface, CPUs, DIMMs, PCIe) handles data read/write, cache flushing, and advanced features. Data reconstruction, *garbage collection, and *compression are offloaded to the intelligent DAEs (CPU, DIMMs, chips), which work with the proprietary SSD chips.

Offloading workloads from controllers to intelligent DAEs:
• Improves system performance by 30%.
• Improves reconstruction speed by 100%.
• Lowers the performance impact of reconstruction on business from 15% to 5%.

Offload data reconstruction to intelligent disk enclosures

Controller-led reconstruction (DAE A, DAE B)
1. The controller reads data from the disks in both enclosures.
2. The controller obtains 23 pieces of data, then calculates and verifies.
3. Reconstructed data is written into hot spare space.

Intelligent offload (Intelligent DAE A, Intelligent DAE B)
1. The controller issues a reconstruction request to each intelligent DAE (1.1, 1.2).
2. Each intelligent DAE reads data from its own disks (2.1, 2.2).
3. One intelligent DAE obtains 12 pieces of data and the other obtains 11 (3.1, 3.2). Each calculates and verifies locally.
4. One intelligent DAE transmits P' and Q', the other transmits P'' and Q''.
5. The controller reconstructs the data using P', Q', P'', and Q''. The data is then written into hot spare space.

The reconstruction bandwidth for a single disk in a 23+2 RAID group is reduced from 24 times to 5 times.
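One way to read the "24 times to 5 times" figure is to count the chunk-sized transfers that cross the controller per rebuilt chunk. The sketch below follows that interpretation, which is an assumption: 23 reads plus 1 write when the controller does everything, against 2 partial results from each of 2 enclosures plus 1 write when the work is offloaded. It also shows, with XOR only, why partial results computed per enclosure combine to the same value as a single pass. The Q parity uses Galois-field arithmetic and is not modelled here.

```python
from functools import reduce


def controller_led_transfers(surviving_chunks):
    """Controller reads every surviving chunk, then writes one rebuilt chunk."""
    return surviving_chunks + 1


def offloaded_transfers(enclosures, partials_per_enclosure=2):
    """Each enclosure returns partial P and Q; controller writes one chunk."""
    return enclosures * partials_per_enclosure + 1


def xor_all(values):
    """XOR-fold a list of integers (stand-ins for data chunks)."""
    return reduce(lambda a, b: a ^ b, values, 0)


# 23 surviving chunks, split 12 / 11 between two enclosures as on the slide.
chunks = list(range(1, 24))
partial_a = xor_all(chunks[:12])
partial_b = xor_all(chunks[12:])

print("controller-led:", controller_led_transfers(23))
print("offloaded:     ", offloaded_transfers(2))
print("partials combine correctly:", (partial_a ^ partial_b) == xor_all(chunks))
```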

Full stripe writing in RoW

Same performance across different RAID levels

Figure: writes of different sizes (3 KB, 4 KB, 7 KB, 16 KB) from LUN0, LUN1, and LUN2 in the logical space pass through ROW and I/O aggregation into a log structure (A, B, C, D, E, ...), which is written as a full stripe (A, B, C, D, E, P, Q) to a CKG. A performance chart (KIOPS) compares a traditional array with Dorado for RAID 5, RAID 6, and RAID-TP.

Traditional way (data update)
Configuration  Extra Reads  Extra Writes  Total IOs (extra IO)
RAID-5         2            1             4 (3)
RAID-6         3            2             6 (5)
RAID-TP        4            3             8 (7)

Dorado way (full stripe)
Configuration  Extra Reads  Extra Writes  Total IOs
RAID-5         0            0             1
RAID-6         0            0             1
RAID-TP        0            0             1
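The traditional-way figures in the table follow the usual read-modify-write accounting for a small in-place update: read the old data and every old parity, then write the new data and every new parity. The sketch below derives the table rows from the parity count. The function names are illustrative, and the full-stripe case simply mirrors the slide's accounting of one I/O per user write.

```python
def small_write_cost(parity_count):
    """Return (extra_reads, extra_writes, total_ios) for an in-place update."""
    extra_reads = parity_count + 1    # old data + each old parity
    extra_writes = parity_count       # each recomputed parity
    total_ios = 1 + extra_reads + extra_writes  # plus the user write itself
    return extra_reads, extra_writes, total_ios


def full_stripe_cost():
    """Per user write under ROW full-stripe writing, as counted on the slide."""
    return 0, 0, 1


for name, parities in [("RAID-5", 1), ("RAID-6", 2), ("RAID-TP", 3)]:
    reads, writes, total = small_write_cost(parities)
    print(f"{name:8s} reads={reads} writes={writes} total={total} (extra {total - 1})")
```

Because the full-stripe path never reads old data or parity, its cost does not grow with the parity count, which is why the slide claims the same performance across RAID levels.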

FlashLink: Smooth GC with Multi-stream to reduce WA by 60%

Figure: in a standard array, hot data, cold data, and deleted data are mixed within the same physical blocks. With multi-stream, data of similar temperature shares a block, so a full block can be reclaimed with no garbage movement. Two charts compare write amplification (axis 0 to 4) and life cycle for standard vs. multi-stream.

Write amplification is reduced by over 60%, and the life cycle is extended 2 times.

FlashLink: Smooth GC with Multi-stream (cont.)

Figure labels: ROW, TRIM, global GC.

Hot block list: metadata
Metadata is changed frequently, so metadata flows are assigned to the same block list to reduce the amount of data moved during GC and improve GC efficiency.

Warm block list: user data
User data is changed less frequently, and user data flows are also assigned to the same block list. Data to be moved can then be located sooner during GC and GC efficiency can be improved.

Cold block list
User data that has remained unchanged for a long time is also less likely to be changed in the future. Such data flows are assigned to the same block list as well so that fewer blocks need to be scanned during GC and GC efficiency can be improved.

FlashLink: Global Garbage Collection

1. FlashLink detects the blocks with the highest garbage rates in real time.
2. Background global GC moves the remaining valid data, using ROW, to new block lists.
3. The blocks are TRIMed, which informs the SSDs that these blocks can be reclaimed for use. Data no longer needs to be moved inside the disks.
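The victim-selection step and its effect on write amplification can be shown with a small model. The sketch below is a generic greedy collector, not FlashLink itself: the block names, page counts, and the `pick_victim` and `write_amplification` helpers are hypothetical. It illustrates why reclaiming a block that holds no valid data costs no extra writes, while reclaiming a half-valid block doubles them.

```python
def pick_victim(blocks):
    """Return the block with the highest garbage rate.

    blocks maps a block name to (valid_pages, total_pages).
    """
    return max(blocks, key=lambda name: 1 - blocks[name][0] / blocks[name][1])


def write_amplification(host_pages, moved_pages):
    """Flash pages written per page the host asked to write."""
    return (host_pages + moved_pages) / host_pages


# Hypothetical block lists after stream separation.
blocks = {
    "hot": (0, 256),      # fully invalidated: reclaim without moving data
    "warm": (128, 256),   # half valid: 128 pages must be relocated first
    "cold": (250, 256),   # almost all valid: poor GC candidate
}

victim = pick_victim(blocks)
print("victim:", victim)
print("WA, empty victim:     ", write_amplification(256, 0))
print("WA, half-valid victim:", write_amplification(128, 128))
```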

Nearly all software components are in user mode

Reducing latency caused by interactions between kernel mode and user mode

Traditional design: fewer user-level components, more interaction
• Components (OMM, value-added features, space management, pool management, disk management, driver) are split between user space and kernel space.
• User mode and kernel mode call each other frequently: high latency.

Dorado V6: full user mode, less interaction
• Components: OMM, space management, value-added features, pool management, disk management, NVMe driver, SAS driver.
• Fewer interactions between the two modes: low latency.

CPU multi-core load balancing optimization: no grouping -> grouping -> grouping + intelligent scheduling

Traditional: no grouping
• Challenge: different tasks (such as I/O and mirroring) compete for time slices on different CPU cores, so I/O data is frequently copied between cores, which causes high latency.

Dorado V3: core grouping
• Advantage: avoids interference and frequent resource switching.
• Challenge: with different services, some cores become overloaded, which causes high latency.

Dorado V6: grouping + intelligent scheduling
• Advantage: according to the load status, the intelligent scheduler dispatches tasks to other cores to achieve load balancing.
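The step from core grouping to grouping with intelligent scheduling can be sketched as a dispatcher that prefers a task's own core group but borrows an idle core from another group once its own group is busy. The `GroupedScheduler` class, the load counter, and the borrow threshold below are illustrative assumptions. The slides do not describe the real scheduling policy in this detail.

```python
class GroupedScheduler:
    """Dispatch tasks to per-kind core groups, spilling over when overloaded."""

    def __init__(self, groups, borrow_threshold):
        # groups maps a task kind to its dedicated cores, e.g. {"io": [0, 1]}.
        self.groups = groups
        self.load = {core: 0 for cores in groups.values() for core in cores}
        self.borrow_threshold = borrow_threshold

    def dispatch(self, kind, cost=1):
        """Pick a core for one task of the given kind and record its load."""
        own = min(self.groups[kind], key=lambda core: self.load[core])
        target = own
        if self.load[own] >= self.borrow_threshold:
            # Own group is saturated: borrow the least-loaded core overall.
            idlest = min(self.load, key=lambda core: self.load[core])
            if self.load[idlest] < self.load[own]:
                target = idlest
        self.load[target] += cost
        return target


scheduler = GroupedScheduler({"io": [0, 1], "mirror": [2, 3]}, borrow_threshold=2)
placements = [scheduler.dispatch("io") for _ in range(5)]
print("I/O task placements:", placements)
print("next mirror task ->", scheduler.dispatch("mirror"))
```

The first four I/O tasks stay inside the I/O group. Only the fifth, which arrives when both I/O cores have reached the threshold, is placed on a core from the mirror group.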


Similar Deduplication vs. Variable/Fixed-Length Deduplication

Data reduction effect:
• 50% increase compared to fixed-length deduplication
• 30% increase compared to variable-length deduplication

Figure labels: 512B vs. 64B.

Scenarios compared across fixed-length, variable-length, and similar deduplication:
• Exactly the same
• Partially identical, sector offset
• Partially identical, byte offset
• Partially identical, different anywhere

Fuzzy Matching Increases Data Reduction Rate by 25%

Figure: bit patterns of several data blocks with similarity scores between 0 and 1 (for example 1, 0.9, 0.8, 0.7, 0.3). Steps shown in the figure:
• Identify similar data.
• Extract reference fingerprints.
• Assemble similar data.
• Deduplicate and compress.

Result: 25% higher image data reduction rate.

Non-Disruptive Upgrading (NDU)

No business impact, no performance loss
• 94% of components run in user mode and upgrade in 1 second, with no host connection loss because the global shared front-end card holds the connection during the switch.
• 6% of components are in the stable kernel and upgrade with a reboot, in minutes.

Figure: protocol, data, control, management, and inter-communication services run above a stable OS kernel. Each service upgrades in 1 second.

Online upgrading without business disruption

Figure: host I/O enters through the Hi1822 HiSilicon SmartIO front-end interface and is served by the I/O processing process, alongside the system management, device management, configuration management, and UI/CLI management processes.

① User-mode upgrade, no system reboot.
② Self-designed chips hold I/Os and maintain host link information, keeping links alive.
③ The I/O processing process is upgraded.
④ Host I/O recovers.

Modular software architecture design: a component upgrades in 1 second.

IO Processing process upgrading period less than