Tuning for Data Parallelism – A practitioner’s approach to SIMD and AVX-512

Zakhar A. Matveev, Stephen Blair-Chappell, Laurent Duhem, Intel

It is well recognised that three ingredients are essential to secure maximum CPU performance: multi-core (thread-aware) parallelism, vectorization (data parallelism), and efficient use of the memory subsystem. In this tutorial we focus on vectorization and ask the following questions. How can I best vectorise my code? What are the typical hurdles to vectorisation, and how do I overcome them? How can I measure the effectiveness and efficiency of my vectorised code? How can I profile and modify the memory-access patterns in my code to get the best performance? How can I be certain that my code is ready for the next generation of ISAs, such as AVX-512, even when I don’t have access to the hardware?

In this tutorial we use a specially configured version of DL_MESO, a general-purpose computational chemistry mesoscale simulation package, as a ‘playground’ to explore the different aspects of vectorization. The tutorial is very practical in nature, with over 60% of the time dedicated to hands-on demonstration of code optimisation.

Topic area: Performance evaluation & tuning

Keywords: Data Parallelism; SIMD; AVX/AVX-512; OpenMP; Vectorization

Content Description

Using a cut-down version of DL_MESO as a ‘playground’, we use the Intel Vectorization Advisor along with the Intel compiler to profile and then ‘tune’ the code. Although we use the tools and compilers provided by Intel Parallel Studio XE, the lessons learnt in this tutorial can meaningfully be applied to other environments.

On completion of the tutorial, attendees will:

Be able to analyse an application to see how well it is vectorised
Understand the importance of vectorization in the latest generation of processors
Know the most common reasons why some code will not vectorise
Be able to identify and fix typical vectorization issues
Be able to estimate the potential speed-up (with respect to vectorization) of unvectorised and poorly vectorised code
Be able to estimate the potential speed-up on yet-to-be-released hardware (e.g. Intel’s 2nd generation Intel® Xeon® Phi, aka Intel Knights Landing, and future generation Xeon processors)
Understand how to create a vectorized application that is ‘portable’, that is, safe to run on all x86 architectures
Know how to avoid ‘hard coding’ vectorization solutions in their code
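
As a minimal sketch of the last two outcomes (portable vectorization without hard-coded solutions), the C fragment below requests vectorization with the OpenMP simd directive instead of ISA-specific intrinsics. The kernels saxpy and dot are hypothetical illustrations and are not taken from DL_MESO or the lab material. The restrict qualifiers are an assumption that the arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * restrict tells the compiler that x and y do not alias, which removes
 * one of the common obstacles to vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* Ask the compiler to vectorize this loop. The vector width is chosen
     * at compile time from the target flags, not written into the source. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical kernel: dot product. The reduction clause states that the
 * partial sums may be accumulated per SIMD lane and combined at the end. */
float dot(size_t n, const float *restrict x, const float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because no instruction set is named in the source, the same code can be rebuilt for SSE, AVX or AVX-512 by changing only the compiler's target options. The directive itself is typically enabled with an option such as -fopenmp-simd (GCC, Clang) or -qopenmp-simd (Intel compiler); without such an option the pragma is ignored and the loops still run correctly as scalar code.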

The tutorial is split into a series of steps or activities, with each activity demonstrating how to detect and fix a particular problem. Example makefiles and solutions will be provided for every step of the lab.

Logistics

The tutorial is a half-day session (3.5 hours).

Percentage of content split:

Beginner 10%
Intermediate 60%
Advanced 30%

Requirements for attendees
The tutorial will be run in ‘demonstration mode’: all the hands-on steps will be carried out by the presenter, so there is no requirement for attendees to bring their own laptops. However, we will provide lab instructions and source code, so attendees can optionally carry out the hands-on steps on their own laptops. In this case the laptops must:

Run Linux or Windows
Preferably have an Intel CPU (but 90% of the lab will work on any processor)
Have the latest version of Intel Parallel Studio XE installed (evaluation version available from software.intel.com)

CVs of Presenters

Zakhar A. Matveev
Intel® Software and Services Group, Intel Russia.
Zakhar, PhD, is an Intel Parallel Studio software architect. His current focus is software requirements, design and implementation for “Vector Advisor”, a new feature set aimed at helping with efficient x86 SIMD programming and code modernization. Before joining Intel, Zakhar worked in the broadcast software automation and embedded software/hardware co-design domains. His professional interests focus on HPC systems optimization, parallel programming, computer graphics, software design and usability.
Stephen Blair-Chappell
Intel Corporation (UK) Limited
Stephen is a Technical Consulting Engineer at Intel, and has worked in the Compiler team for the last 17 years. Part of Stephen’s current role is to provide consulting and training to the Intel Parallel Computing Centres in Europe. Stephen is the author of the book “Parallel Programming with Intel Parallel Studio XE”.
Laurent Duhem
Intel® Software and Services Group, Intel France
Laurent is a senior Application Engineer within the EMEA High Performance and Throughput Computing group. He carries out business development for Intel products, primarily by enabling Tier 1 EMEA customers, ISVs and national labs: optimizing their applications for Intel’s latest platforms, and acting as a customer influencer and software product evangelist. Laurent joined Intel in 2006. He previously served as software architect for the flagship software PAM-CRASH at ESI-GROUP.
