Tuning for Data Parallelism – A practitioner’s approach to SIMD and AVX-512

Zakhar A. Matveev, Stephen Blair-Chappell, Laurent Duhem, Intel

It is well recognised that three ingredients are essential to secure maximum CPU performance: multi-core (thread-aware) parallelism, vectorization (data parallelism), and efficient use of the memory subsystem. In this tutorial we focus on vectorization and ask the following questions: How can I best vectorise my code? What are the typical hurdles to vectorisation, and how do I overcome them? How can I measure the effectiveness and efficiency of my vectorised code? How can I profile and modify the memory-access patterns in my code to get the best performance? How can I be certain that my code is ready for the next generation of ISAs, such as AVX-512, even when I don’t have access to the hardware?

In this tutorial we use a specially configured version of DL_MESO, a general-purpose mesoscale simulation package for computational chemistry, as a ‘playground’ to explore the different aspects of vectorization. The tutorial is very practical in nature, with over 60% of the time dedicated to hands-on demonstration of code optimisation.

Topic area:  Performance evaluation & tuning

Keywords:  Data Parallelism; SIMD; AVX/AVX-512; OpenMP; Vectorization

Content Description 

Using a cut-down version of DL_MESO as a ‘playground’, we use the Intel Vectorization Advisor along with the Intel compiler to profile and then ‘tune’ the code. Although we use the tools and compilers provided by Intel Parallel Studio XE, the lessons learnt in this tutorial can meaningfully be applied to other environments.

On completion of the tutorial, the attendees will

  • Be able to analyse an application to see how well it is vectorised
  • Understand the importance of vectorization in the latest generation of processors
  • Know the most common reasons why some code will not vectorise
  • Be able to identify and fix typical vectorization issues
  • Be able to estimate the potential speed-up (with respect to vectorization) of unvectorized and
    poorly vectorised code
  • Be able to estimate the potential speed-up on yet-to-be-released hardware (e.g. Intel’s 2nd
    generation Intel® Xeon® Phi, aka Intel Knights Landing, and future generation Xeon)
  • Understand how to create a vectorized application that is ‘portable’, i.e. safe to run on all
    x86 architectures
  • Know how to avoid ‘hard coding’ vectorization solutions in their code.
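
As an illustration of the last two points, the sketch below shows one common way to request vectorization without hard coding it to a particular instruction set: an OpenMP `simd` directive on a plain C loop, in place of ISA-specific intrinsics. The kernel `saxpy` and its parameter names are hypothetical and are not taken from DL_MESO or from the lab material; the sketch only assumes a C99 compiler with OpenMP SIMD support.

```c
#include <stddef.h>

/* Hypothetical kernel (not from DL_MESO): y[i] = a * x[i] + y[i].
 * The 'restrict' qualifiers promise the compiler that x and y do not
 * overlap, which removes a common obstacle to vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Ask the compiler to vectorize this loop. No vector width or
     * instruction set is named here, so the same source can be built
     * for SSE, AVX or AVX-512 targets by changing compiler flags only.
     * If OpenMP SIMD support is not enabled, the pragma is ignored and
     * the loop still produces the same result. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With GCC or Clang the directive is typically enabled with `-fopenmp-simd`, and with the Intel compiler on Linux with `-qopenmp-simd`; the target instruction set is then selected separately on the command line rather than in the source.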

The tutorial is split into a series of steps or activities, with each activity demonstrating how to detect and fix a particular problem.  Example makefiles and solutions will be provided for every step of the lab.


The tutorial is half-day (3.5 hours) in length.

Percentage of content split:

  • Beginner 10%
  • Intermediate 60%
  • Advanced 30%

Requirements for attendees
The tutorial will be run in ‘demonstration mode’: all the hands-on steps will be carried out by the presenter, so attendees do not need to bring their own laptops. However, we will provide lab instructions and source code, so attendees can optionally carry out the hands-on steps on their own laptops. In this case the laptops must

  • Run Linux or Windows
  • Preferably have an Intel CPU (but 90% of the lab will work on any processor)
  • Have the latest version of Intel Parallel Studio XE installed (evaluation version available from

CVs of Presenters 

  • Zakhar A. Matveev 
    Intel® Software and Services Group, Intel Russia.
    Zakhar holds a PhD and is a software architect for Intel Parallel Studio. His current focus is software requirements, design and implementation for “Vector Advisor”, a new feature set aimed at helping with efficient x86 SIMD programming and code modernization. Before joining Intel, Zakhar worked in the broadcast software automation and embedded software/hardware co-design domains. His professional interests focus on HPC systems optimization, parallel programming, computer graphics, software design and usability.
  • Stephen Blair-Chappell 
    Intel Corporation (UK) Limited
    Stephen is a Technical Consulting Engineer at Intel, and has worked in the Compiler team for the last 17 years. Part of Stephen’s current role is to provide consulting and training to the Intel Parallel Computing Centres in Europe. Stephen is the author of the book “Parallel Programming with Intel Parallel Studio XE”.
  • Laurent Duhem 
    Intel® Software and Services Group, Intel France 
    Laurent is a senior Application Engineer within EMEA High Performance and Throughput Computing. He carries out business development for Intel products, primarily by enabling Tier 1 EMEA customers, ISVs and national labs, optimizing their applications for Intel’s latest platforms, and acting as a customer influencer and software product evangelist. Laurent joined Intel in 2006. He previously served as software architect for the flagship software PAM-CRASH at ESI-GROUP.

Permanent link to this article: http://europar2016.inria.fr/tutorials/tuning-for-data-parallelism-a-practitioners-approach-to-simd-and-avx-512/