What kind of value will Vector Engine provide for the Oil & Gas market?

Cost benefits of using improved HPC in Oil & Gas Domain

Improve success rate in exploration (reduce failure/delay cost)
Improve E&P by seismic modelling/imaging & reservoir simulation

You will save $750M if you could improve drilling success rate by 5%
(Assumption: cost per drill is $100M)
300M from drilling success + 300M not being delayed
or
150M x 5 wells = $750M

You will gain $250M if you could improve recovery factor by 2.5%

200M BBL x 2.5% x $50/BBL = $250M

Improving imaging + recovery factors by 4D seismic / EOR

In total $1B

Note: Numbers can vary depending on conditions

In the oil and gas industry, improving the success rate of exploration and improving Exploration and Production (E&P) through seismic modeling/imaging and reservoir simulation are major challenges. NEC would like to contribute computing resources for seismic and reservoir workloads through its HPC platform, NEC SX-Aurora TSUBASA. Seismic modeling/imaging and reservoir simulation are only one part of exploration. However, we think that improving them will create a big gain for the market.

Decrease Exploration Cost and Increase Recovery Factor with NEC

Decrease exploration cost by reducing the dry hole cost through better and faster seismic modelling/imaging
Increase recovery factor by better and faster reservoir simulation

By achieving 30% higher resolution
Empowered by 2x compute performance
(30% higher resolution per dimension: 1.3 x 1.3 x 1.3 ≈ 2.2)

NEC Vector Engine (10B) provides

2.1x performance compared to GPGPU (NVIDIA V100)
16x performance compared to Intel Gold 6148 (2-socket): about 2.5x the resolution per dimension (2.5 x 2.5 x 2.5 ≈ 16)

Again, higher resolution is only one factor. However, NEC tested the performance of reservoir simulation and achieved better performance than both GPGPU and CPU.

In Collaboration with nag®

Regarding these performance tests, we are thankful to the members of the Numerical Algorithms Group for extending their support by studying the vector architecture of NEC's Vector Engine and implementing an industry-standard workload for reverse time migration (RTM) on Intel's CPU, NVIDIA's GPU and NEC's vector architectures.

They also performed enhancements to the programs through code-level tweaks and compiler level optimizations relevant to each architecture in order to bring out the best performance on each of them. Their effort has led to this unbiased performance evaluation of the most prevalent architectures in the Oil and Gas industry and helped benchmark the capability of NEC Vector Engine for workloads relevant to the Oil and Gas Industry.

Technical Benefits of using NEC HPC for Oil & Gas Applications

So, let's start with the technical advantages of using NEC HPC for Oil & Gas applications.

Stencil Code

Stencil code refers to a procedure pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, etc.
Stencil patterns require updates to each element in a multidimensional array by referring to its neighbour elements.

These codes require high performance in both computation and memory access, since they load the value of each element several times while storing a new value only once.

Stencil shapes and sizes differ based on the problem they solve:

First, we would like to talk about stencil codes. The term stencil refers to a data access pattern that is prominent in scientific computing for simulations and in many other problems such as image processing, signal processing and, more recently, AI frameworks for deep learning. In a stencil computation, each element of a multidimensional array is updated by referring to its neighboring elements. The number of neighboring elements referred to is governed by the size of the stencil being computed.

In computer science terms, such updates require frequent access to the memory and require several loads from memory for a single store to the memory.
Here we can see several examples of varying stencil sizes over different dimensions that show the number of loads required for each store.

Reverse Time Migration (RTM)

Reverse time migration (RTM) modelling is a critical component in the seismic processing workflow of oil and gas exploration
RTM imaging enables accurate imaging in areas of complex structures and velocities by gathering a two-way acoustic image of seismic data in place of a one-way image
RTM spends most of its computation time in wave propagation kernels that utilize stencil codes

Full simulations of a generalized kernel called Anisotropic Elastic Wave Equation propagator can provide significant seismic information under a wide variety of geological assumptions

Having understood the importance of stencil codes and their compute and memory intensive attributes, we can now get a better understanding of the Seismic Imaging workloads relevant to the Oil and Gas industry.

Reverse Time Migration (RTM) modeling is a well-known critical component of seismic processing in Oil and Gas exploration. Given the programmatic design of a typical RTM workflow, stencil calculations take up a significant portion of the compute time. As observed on a 2-socket Intel Skylake based system, stencil calculations take up to 90% of the total compute time.

It hence becomes clear that for an enhanced performance on Seismic Imaging workloads, it is important to have a good performance on stencil calculations.

Fully Anisotropic Elastic Wave Equation Propagator

The wave propagation kernels numerically represent the type of physics the user needs to emphasize for the migration
Isotropic acoustics is a common and simple wave propagation kernel for driving RTM, but with fewer assumptions on subsurface geology we obtain more accurate and more expensive kernels like Vertical Transverse Isotropy (VTI) or Tilted Transverse Isotropy (TTI)
The elasto-dynamic wave equation for anisotropic media can be expressed as:
This study covers both low frequency and high frequency fully anisotropic wave equation propagators
Both propagators are relevant when considering Reverse Time Migration (RTM) and Full Waveform Inversion (FWI)
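For reference, a standard textbook form of the elastodynamic wave equation for a general anisotropic medium is sketched below. It is given as an assumption about the general form meant here, not as a transcription of the slide; the symbols are the usual ones: density ρ, displacement u, stress σ, strain ε, stiffness tensor C and body force f, with summation over repeated indices.

```latex
\rho \frac{\partial^2 u_i}{\partial t^2}
  = \frac{\partial \sigma_{ij}}{\partial x_j} + f_i,
\qquad
\sigma_{ij} = C_{ijkl}\,\varepsilon_{kl},
\qquad
\varepsilon_{kl} = \frac{1}{2}\left(
  \frac{\partial u_k}{\partial x_l} + \frac{\partial u_l}{\partial x_k}
\right)
```

In the fully anisotropic case the stiffness tensor has up to 21 independent components, which is what makes these kernels more expensive than the isotropic, VTI or TTI special cases.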

The wave propagation kernels numerically represent the physics that the migration needs to emphasize. In order to establish an understanding of the performance of an RTM program, we selected a fully anisotropic elastic wave equation propagator developed in C. Our study covers both low-frequency and high-frequency propagation, since both are relevant to well-known seismic workloads such as Reverse Time Migration and Full Waveform Inversion.

STREAM Benchmark – A Reference

STREAM benchmark evaluates memory bandwidth and is a good benchmark for preliminary comparison.

Comparison with Intel Xeon Gold 6148:

On Intel machine STREAM Triad achieves 180 GB/s
On NEC machine STREAM Triad achieves 984 GB/s
Expected maximum speed-up on NEC – 984/180 = 5.5x

Comparison with NVIDIA Tesla V100:

On NVIDIA machine STREAM Triad achieves 830 GB/s
On NEC machine STREAM Triad achieves 984 GB/s
Expected maximum speed-up on NEC – 984/830 ≈ 1.2x

The calculated speed-up builds an expected result for the experiment.

Note: the results depicted were obtained with a prototype version of NEC VE20B. The comparison will be further improved in the near future

A more practical understanding of the ideas discussed so far can be gained through the STREAM benchmark. The bar chart here captures the memory bandwidth recorded for several architectures using the well-known STREAM benchmark, and this can be a good reference to theoretically estimate speedups for memory-bound workloads. This chart covers Intel Skylake, NVIDIA Tesla V100, Arm A64FX (Fugaku) and NEC's Vector Engine.

When compared to Intel, a theoretical speedup of up to 5.5x can be expected considering the memory bandwidth capability. Similarly, for Tesla V100, a theoretically calculated expected speed-up is around 1.2x on NEC Vector Engine.

Evaluation Setup

Evaluation Target System

The evaluation target systems were chosen from the following three popular HPC architectures:

Software Setup

The implementation of the anisotropic wave equation kernel computes results for three problem sizes and three stencil lengths:

With the basis of the experiment established, let me talk about the setup.

To set up the experiment, we arranged 3 target systems:

a 2-socket Intel Skylake CPU based system,
a Vector Engine 10B single card based system, and
an NVIDIA Tesla V100 single GPU card based system

For the software setup, we developed a program in C where we evaluated different stencil sizes of 2, 4 and 8 over varying data sizes from 64-cube through 256-cube. An illustration of Stencil length=4 is shown here for reference.

Evaluation Setup

The core computation is timed where the timing results are an average of 10 iterations of the wave equation solver
Code also computes min, max, standard deviation of timings but these are largely to certify the timing is sensible, i.e. if standard deviation is large the result is discarded and run again
Each iteration is compared against an analytic solution to ensure correctness, but this comparison is not timed
On an ideal system with enough registers and perfect caching, the stencil length would not change performance at all. However, because cache and vector register sizes vary, source code tuning was attempted on all architectures.

To evaluate the performance, we timed 10 iterations of the wave equation solver and averaged the results. The program also computes the minimum, maximum and standard deviation of these timings to gauge whether the timings are sensible. If the standard deviation for a specific run is too large, we discard the results and run again. This way we ensure fairness of evaluation. For correctness, each iteration is compared against an analytic solution, but this comparison is not timed.

Ideally, from a pure software standpoint, the length of the stencil should not affect the performance at all. That is, if every piece of hardware had enough registers and perfect caching, all stencil lengths would provide similar performance. However, with different hardware designs and varying register and cache sizes, it is imperative that we tune the software in order to extract the best performance from the underlying hardware.

Source Code Modifications

Some tuning modifications that brought out the best performance of each architecture:

On Intel Xeon	On NEC VE	On NVIDIA V100
Simple loop reorder: fastest dimension kept in the innermost loop for best effect of vectorization through AVX512	Simple loop reorder: fastest dimension kept in the innermost loop for best effect of vectorization	Threading: in general, threading on GPU is more flexible than vectorization
Loop blocking: outer loops blocked for best cache utilization	Loop collapse: inner two loops collapsed to ensure a single long vector for best utilization of the long vector pipe (256 words)	Avoiding branching operations: specific cases where the ratio of taken branches was high
	Loop unroll: compiler automatically unrolls the outer loop	

While more complicated tuning approaches posed scope for better performance impacts, we didn’t lose our focus on the ease and simplicity of tuning for high performance.

Similarly, there were some more subtle tweaks performed in the code for each target architecture, and the ones that resulted in good performance boost are listed in this slide.

For CPU centric codes, a simple re-ordering of loops helped, and blocking outer loops for cache optimization led to significant performance boost on the Intel Skylake systems. AVX512 instructions were also used.

For vector centric codes, a simple reordering of loops helped in faster vector performance. Collapsing two nested loops in order to ensure a single long vector helped utilize the long vector pipes better. Unrolling the outermost loops provided an additional boost to the performance, and the compiler was able to perform this unrolling automatically, without manual code modifications.

For GPU centric codes, the overall program was written in CUDA where the computations were offloaded to the device. Complicated if-else branches were simplified where the branching was relatively more intricate.

While more complicated tuning approaches posed scope for better performance impacts, we didn't lose our focus on the ease and simplicity of tuning for high performance.

Performance Results (1)

For small problem sizes, NEC Vector Engine outperforms both CPU and GPU, although its performance is very similar to that of the GPU
These small problem sizes also do not utilize the large data processing capability of the Vector Engine because the vector lengths are shorter
Smaller vector lengths provide better cache friendliness on CPUs and tend to perform well
The speed-up over the CPU is in the range of 3.5x ~ 4.5x, which falls short of the theoretical speed-up expected from memory bandwidth considerations

Finally, we recorded the timings as planned and plotted them on the bar chart as shown in this slide. Lower bars indicate faster timings.

For the relatively small problem size of 64-cube, NEC Vector Engine outperforms both the CPU and GPU, although GPU performance is competitive, especially for stencil size = 8, where GPU is slightly faster than VE. Since the problem sizes are small, they do not tend to utilize the large data processing capability of the vector engine due to smaller vector lengths. In fact, smaller vector lengths provide better cache friendliness on CPUs and tend to perform well.

The speed-up of the Vector Engine over the CPU here is about 3.5x to 4.5x, which falls short of the theoretical speedup expected from memory bandwidth considerations.

Performance Results (2)

With increase in dataset sizes, the speed-up improves
NEC Vector Engine still outperforms both the CPU and GPU based systems, with noticeable performance benefit compared to GPU
The speed-up over the CPU is in the 6.0x ~ 8.5x range, which better reflects the speed-up expected on theoretical grounds
NEC VE provides nearly 1.7x faster performance compared to GPUs

With increase in dataset sizes, the speedup improves. This slide shows the recorded timings for the 128-cube dataset.

NEC Vector Engine still outperforms both the CPU and GPU based systems, with a noticeable performance advantage compared to GPU.

The speedup is about 6x to 8.5x compared to the CPU, which better reflects the speedup expected on theoretical grounds.

NEC VE interestingly provides nearly 1.7x faster performance compared to GPUs, which is higher than the theoretically estimated performance.

Performance Results (3)

For larger problem sizes, the speed-up of NEC VE over Intel Xeon reaches up to 16.0x, much higher than the theoretical best speed-up of 6x based solely on memory bandwidth considerations
Even against the NVIDIA GPU, the speed-up is in the 1.5x ~ 2.1x range, which is higher than the estimate based on memory bandwidth
The speed-up suggests that the NEC VE provides a good combination of boosts in memory performance as well as computational performance

With larger problem sizes observed speedups range as high as 16x between the Intel and NEC systems, much higher than the theoretical best speedup of 6x based solely on memory bandwidth considerations. There seem to be more factors at play here, such as cache limitations on the CPU for large dataset sizes.

Even for the NVIDIA GPU, the speedup is in the 1.5x ~ 2.1x range, higher than the theoretical memory bandwidth consideration.

The speedup suggests that the NEC Vector Engine provides a good combination of boosts in memory performance, as well as computational performance.

Performance Results (4)

This plot represents the number of grid-points being evaluated per-second for each architecture
NEC Vector Engine consistently outperforms the CPU and GPU architectures, up to 16x faster than Intel Skylake, and more than 2x faster than NVIDIA V100 for large datasets

Having observed the timings and speed-ups obtained on the Vector Engine, we also evaluated the performance based on the number of grid points calculated per second on each architecture for the same stencil sizes and dataset sizes. Here we get a clearer picture of how the Vector Engine consistently outperforms the CPU and GPU architectures, by calculating up to 16x more grid points than Intel Skylake, and in some cases more than 2x more grid points than NVIDIA V100 for large datasets.

Performance Results (5)

The performance patterns also reveal the ideal grid size based on best performance per-core for each architecture:

	Intel Skylake Gold 6148	NVIDIA Tesla V100	NEC VE Type 10B
Stencil Length = 2	64-grid	256-grid	256-grid
Stencil Length = 4	64-grid	256-grid	128-grid
Stencil Length = 8	64-grid	256-grid	256-grid

This table can help a developer design the granularity of parallelism for their code based on what architecture they are working on.

For each stencil length, VE is consistently the best in terms of choice of grid size for each architecture.

In fact, the performance patterns observed here also reveal the ideal grid size based on best performance per core for each architecture. This table can serve as a reference for developers and help them design the granularity of parallelism, i.e. the distribution of grid size per core, for their code based on the architecture they are working on. For each stencil length, VE is consistently the best in terms of choice of grid size for each architecture.

Summary of Performance Results

Stencil codes are performance intensive on the memory as well as compute for any given architecture
Reverse Time Migration is a performance-intensive, largely memory-bound code and poses a genuine challenge relevant to the Oil and Gas industry
Vector architectures, particularly the NEC Vector Engine, are capable of meeting the recurring challenges in seismic processing and providing better performance than the available leading architectures
Power efficient solution with minimal software engineering effort

To summarize this talk, we learned that stencil codes are performance intensive on the memory as well as compute for any given architecture.

Seismic Imaging (particularly Reverse Time Migration) is a performance-intensive code, largely memory bound, and poses a real problem relevant to the Oil and Gas Industry.

Our experiments support the initial claim that vector architectures, particularly the NEC Vector Engine with its high memory bandwidth and computational capacity, are capable of meeting the recurring challenges of prominent applications in seismic processing and of providing better performance than the available leading architectures, with lower power consumption and reduced software engineering effort.

Other Report

RTM Scalability (study by Federal University of Rio de Janeiro)
SX-Aurora TSUBASA shows good performance and scalability with fewer cores

This slide shows another report about RTM scalability studies by the Federal University of Rio de Janeiro.

In this chart, lower is better, and you can see that SX-Aurora TSUBASA, shown by the blue bar, delivers good performance and scalability compared to the Xeon processors depicted by the orange and green bars. Prof. Alvaro Coutinho stated "The NEC SX-Aurora TSUBASA vector system proved to be an excellent technology for enabling our workflow for seismic imaging under uncertainty". For details, please see his paper.

Conclusion

Conclusion of NEC HPC.

3 Reasons why you want to use the Vector Engine

Good Balance of compute power and memory capability

307 GF/core, 2.45 TF/processor
48 GB HBM2 memory on board, 1.53 TB/s mem bandwidth
Big core delivers high sustained performance

Easy to start

Start small and scale large
Standard programming (Fortran, C, C++)

Good for seismic modelling and reservoir simulation

VE is a good fit for Oil & Gas applications which require large memory bandwidth
Shorter computation time and more accurate results

NEC's Vector Engine provides a good balance of compute power and memory capacity for image processing, signal processing and other workloads that require matrix calculations.

Also, you can start with a small system and then easily expand it to a large-scale system.

Especially in seismic modeling/imaging and reservoir simulation, you can get shorter computation times.

We believe that this will save time and improve the success rate of discoveries, resulting in cost reductions or increased profits through more efficient production guided by detailed simulations.

Find more Information on our Website

If you are interested in NEC's HPC platform, you can find more information about our hardware, software, and supported applications on our SX-Aurora TSUBASA website. Thank you!

- Intel and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

- NVIDIA and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries.

- Linux is a trademark or a registered trademark of Linus Torvalds in the U.S. and other countries.

- Proper nouns such as product names are registered trademarks or trademarks of individual manufacturers.