Contents
Discover Where Vectorization Will Pay Off The Most
Identify Performance Bottlenecks Using Roofline
Identify High-impact Opportunities to Offload to GPU
Intel® Advisor is composed of a set of tools to help ensure Fortran, C, C++ (as well as .NET* on Windows*), OpenCL™, and Data Parallel C++ (DPC++) applications realize full performance potential on modern processors.
Intel Advisor is available as a standalone installation and as part of the Intel® oneAPI Base Toolkit.
Intel Advisor enables you to analyze your code from the following perspectives:
This document summarizes typical workflows to get started improving the performance potential of your application with Intel Advisor.
By default, the <install-dir> on Windows* OS is as follows:
By default, the <install-dir> on Linux* OS is as follows:
By default, the <install-dir> on macOS* is as follows:
On Windows* OS:
To set up the environment for Intel Advisor, run the <install-dir>\env\vars.bat script.
To set up the environment for Intel Advisor, run the following command: source <install-dir>/env/vars.sh
To set up the environment for Intel Advisor, run one of the following commands:
For step-by-step instructions on viewing your Intel Advisor results on a macOS* machine, see the Intel Advisor Cookbook: Analyze Performance Remotely and Visualize Results on a Local macOS* System.
Vectorization perspective is a vectorization analysis toolset that lets you identify loops that will benefit most from vector parallelism. Profile your application using the Survey tool to locate un-vectorized and under-vectorized time-consuming functions/loops and calculate estimated performance gain achieved by vectorization.
In the Analysis Workflow pane, use the drop-down menu to select the Vectorization and Code Insights perspective, set the data collection accuracy level to Low, and click the run button. When you select a perspective, Intel Advisor automatically sets up the analyses to be run. You can view the list of analyses that will run at the selected accuracy level under the Accuracy pane. Upon completion, Intel Advisor displays a Survey Report.
Survey Report - Offers integrated compiler report data and performance data that shows where vectorization will pay off the most; if vectorized loops are providing benefit, and if not, why not; un-vectorized loops and why they are not vectorized; and performance problems in general.
If all loops are vectorizing properly and performance is satisfactory, you are done! Congratulations!
If one or more loops is not vectorizing properly and performance is unsatisfactory:
Improve application performance using various Intel Advisor features to guide your efforts, such as:
Information in the Performance Issues column and associated Recommendations tab.
Suggestions in Examine Non-Vectorized and Under-Vectorized Loops in the Intel Advisor User Guide.
Optional Dependencies and Memory Access Patterns (MAP) analyses to help you dig deeper.
Rebuild your modified code.
Run another Survey analysis to verify all loops are vectorizing properly and performance is satisfactory.
Explore the vectorization and memory-specific capabilities of Intel Advisor for collecting data and displaying results for MPI applications, described in the Intel Advisor Cookbook: Analyze Vectorization and Memory Aspects of an MPI Application.
The CPU Roofline perspective lets you visualize actual performance against hardware-imposed performance ceilings and determine the main limiting factor (memory bandwidth or compute capacity). When you run a Roofline analysis, Intel® Advisor:
Measures the hardware limitations of your machine and collects loop/function timings using the Survey analysis.
Collects floating-point and integer operations data, and memory data using the Trip Counts and FLOP analysis.

In the Analysis Workflow pane, use the drop-down menu to select the CPU/MEMORY Roofline Insights perspective and click the run button. Upon completion, Intel Advisor displays a Roofline chart.
The Roofline chart plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance:
In general:
The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
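As a rough illustration of how a dot relates to the rooflines, the classic roofline model bounds attainable performance by the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The sketch below uses invented machine numbers (PEAK_GFLOPS, DRAM_GBPS) and hypothetical helper names; it is a textbook simplification, not how Intel Advisor measures or reports these values.

```python
def roofline_bound(ai_flops_per_byte, peak_gflops, bandwidth_gbps):
    """Attainable GFLOPS under the classic roofline model."""
    # Memory-bound kernels sit under the diagonal (AI * bandwidth),
    # compute-bound kernels sit under the horizontal peak.
    return min(peak_gflops, ai_flops_per_byte * bandwidth_gbps)


def headroom(achieved_gflops, bound_gflops):
    """Factor by which a dot could still rise before touching its roofline."""
    return bound_gflops / achieved_gflops


# Hypothetical machine ceilings, for illustration only.
PEAK_GFLOPS = 100.0
DRAM_GBPS = 20.0

# A hypothetical loop: 0.5 FLOP per byte, currently achieving 2.5 GFLOPS.
ai, achieved = 0.5, 2.5
bound = roofline_bound(ai, PEAK_GFLOPS, DRAM_GBPS)
limiter = "memory bandwidth" if ai * DRAM_GBPS < PEAK_GFLOPS else "compute capacity"
print(f"bound={bound} GFLOPS, headroom={headroom(achieved, bound)}x, limited by {limiter}")
```

A large headroom factor corresponds to a dot that sits far below its roofline, which is where optimization effort is most likely to pay off.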
If one or more loops is not vectorizing properly and performance is unsatisfactory:
Check data in associated Intel Advisor views to support your Roofline chart interpretation. For example, check the Vectorized Loops/Efficiency values in the Survey Report or the data in the Code Analytics tab.
Improve application performance using various Intel Advisor features to guide your efforts, such as:
Information in the Code Analytics tab and associated Recommendations tab.
Suggestions in Examine Relationships Between Memory Levels in the Intel Advisor User Guide
If you need more information, continue your investigation by:
Explore the iterative approach to identifying performance bottlenecks described in Intel Advisor Cookbook - Identify Bottlenecks Iteratively: Cache-Aware Roofline.
Offload Modeling perspective allows you to identify high-impact opportunities to offload to GPU as well as the areas that are not profitable to offload. It provides performance speedup projection on accelerators along with offload overhead estimation and pinpoints accelerator performance bottlenecks.
Run Offload Modeling Perspective
In the Analysis Workflow pane, use the drop-down menu to select the Offload Modeling perspective, set the data collection accuracy level to Low, and click the run button. When you select a perspective, Intel Advisor automatically sets up the analyses to be run. You can view the list of analyses that will run at the selected accuracy level under the Accuracy pane. Upon completion, Intel Advisor displays an Offload Modeling Summary.

Offload Modeling Summary offers you information on total speed-up of your application that can be achieved by offloading it to a target GPU platform, top 5 offloaded code regions in your call tree, top 5 regions that are not profitable to offload, number of offloaded functions/loops, and a fraction of offloaded code relative to total time of original application.
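To make the Summary metrics concrete, the sketch below shows one simple way a whole-application speedup and an offloaded-code fraction could be derived from per-region times. The Region type, the region names, and all numbers are hypothetical, and the formula is a simplified Amdahl-style estimate rather than the model Intel Advisor actually uses.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    cpu_time: float   # measured time on the host, in seconds
    gpu_time: float   # projected time on the target GPU, in seconds
    overhead: float   # projected offload overhead (transfers, launch), in seconds

    @property
    def profitable(self):
        # Offloading pays off only if GPU time plus overhead beats the CPU time.
        return self.gpu_time + self.overhead < self.cpu_time


def summarize(total_cpu_time, regions):
    """Return (speedup, offloaded fraction, offloaded region names)."""
    offloaded = [r for r in regions if r.profitable]
    saved = sum(r.cpu_time - (r.gpu_time + r.overhead) for r in offloaded)
    fraction = sum(r.cpu_time for r in offloaded) / total_cpu_time
    speedup = total_cpu_time / (total_cpu_time - saved)
    return speedup, fraction, [r.name for r in offloaded]


# Invented example: a 10-second application with two candidate regions.
regions = [
    Region("stencil", cpu_time=6.0, gpu_time=1.0, overhead=0.5),
    Region("reduce", cpu_time=1.0, gpu_time=0.5, overhead=1.0),
]
speedup, fraction, names = summarize(10.0, regions)
print(f"speedup={speedup:.2f}x, offloaded fraction={fraction:.0%}, regions={names}")
```

In this invented case the second region is left on the host because its projected overhead outweighs the projected GPU gain, which mirrors the "not profitable to offload" list in the Summary.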
Run Offload Modeling Perspective Using CLI Interface
Important: If you want to analyze DPC++, OpenCL™, or OpenMP* target applications, set the INTEL_JIT_BACKWARD_COMPATIBILITY=1 environment variable.
There are several methods available. These methods vary in terms of simplicity and flexibility.
Method 1: run_oa.py
This method is the simplest, but it is less flexible and is only available for non-MPI applications. For example:
advisor-python <APM>/run_oa.py <project-dir> -- <target> [target-options]
Method 2: collect.py + analyze.py
This method is simple, moderately flexible, but does not support MPI applications. collect.py automates profiling, while analyze.py implements performance modeling on a target device. For example:
advisor-python <APM>/collect.py <project-dir> -- <target> [target-options]
advisor-python <APM>/analyze.py <project-dir>
Method 3: advisor + analyze.py
This method is the most flexible and is applicable to MPI applications. To use this method:
advisor-python <APM>/collect.py <project-dir> --dry-run -- <target> [target-options]
You will get several command lines for advisor collections. You can omit markup command (advisor-python collect.py --markup <markup-type>) to get more data for analysis at the cost of collection overhead.
advisor --collect=survey --auto-finalize --stackwalk-mode=online --static-instruction-mix --project-dir=<project-dir> --profile-jit --data-transfer-analysis -- <target> [target-options]
advisor --collect=tripcounts --flop --stacks --enable-cache-simulation --cache-config=<cache-configuration> --enable-data-transfer-analysis --project-dir=<project-dir> -- <target> [target-options]
advisor --collect=dependencies --loops="scalar" --filter-reductions --loop-call-count-limit=16 --project-dir=<project-dir> -- <target> [target-options]
advisor-python <APM>/analyze.py <project-dir>
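The Method 3 sequence can be scripted. The sketch below only assembles the command lines listed in this section as argument lists and prints them; it does not run anything. The function name, project directory, and target are placeholders, <APM> and <cache-configuration> are left unresolved on purpose, and the options are copied from this section (written with ASCII double hyphens) rather than verified against a particular Intel Advisor version.

```python
import shlex


def method3_commands(project_dir, target, cache_config, apm="<APM>"):
    """Build the survey, tripcounts, dependencies, and analyze.py command lines."""
    # Every advisor collection ends with the project directory and the target.
    tail = [f"--project-dir={project_dir}", "--", *target]
    survey = ["advisor", "--collect=survey", "--auto-finalize",
              "--stackwalk-mode=online", "--static-instruction-mix",
              "--profile-jit", "--data-transfer-analysis", *tail]
    tripcounts = ["advisor", "--collect=tripcounts", "--flop", "--stacks",
                  "--enable-cache-simulation", f"--cache-config={cache_config}",
                  "--enable-data-transfer-analysis", *tail]
    dependencies = ["advisor", "--collect=dependencies", "--loops=scalar",
                    "--filter-reductions", "--loop-call-count-limit=16", *tail]
    analyze = ["advisor-python", f"{apm}/analyze.py", project_dir]
    return [survey, tripcounts, dependencies, analyze]


# Placeholder project directory and target application.
commands = method3_commands("./advi_results", ["./my_app", "-n", "4"],
                            "<cache-configuration>")
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting problems if the lists are later passed to subprocess.run; the Dependencies entry can be dropped from the list when that check is skipped.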
In the current version, you can analyze OpenMP, DPC++, oneAPI Threading Building Blocks (oneTBB), and Intel® Data Analytics Acceleration Library (Intel® DAAL) parallel regions. Parallel regions do not have loop-carried dependencies and do not require you to run a Dependencies check. To save time, you can skip the Dependencies check and use --assume-parallel / --no-assume-parallel in analyze.py.
After data collection, Offload Advisor generates a Performance Predictor report in the report.html file. The report contains the following sections: Summary, Offloaded Regions, Non-Offloaded Regions, Call Tree, Configuration, and Logs. You can switch between sections using the links in the top left box. The top right box highlights the total speedup, the number of loops and functions offloaded, and the fraction of code accelerated.
Run a Data Transfer analysis to examine data transfers for each code region and get more accurate data about how your application is going to execute on a target GPU platform.
Explore a typical scenario of optimizing GPU usage described in Intel Advisor Cookbook: Identify Code Regions to Offload to GPU and Visualize GPU Usage.
The GPU Roofline perspective lets you estimate and visualize the actual performance of GPU kernels, using benchmarks and hardware metric profiling, against hardware-imposed performance ceilings, and determine the main limiting factor. When you run a GPU Roofline analysis, Intel Advisor:
Measures the hardware limitations and collects OpenCL™ kernel timings and memory data using the Survey analysis with GPU profiling.
Collects floating-point and integer operations data using the Trip Counts and FLOP analysis with GPU profiling.
In the Analysis Workflow pane, use the drop-down menu to select the GPU Roofline Insights perspective and click the run button. Upon completion, Intel Advisor displays a GPU Roofline Summary. View the GPU Roofline Chart to identify the main factors limiting the performance of your application.
GPU profiling is applicable only to Intel® Processor Graphics.
A Roofline chart plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance:
Arithmetic intensity (x axis) - measured in floating-point operations (FLOPs) per byte transferred between the GPU and memory, based on the kernel algorithm
Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS)
In general:
The size and color of each dot represent relative execution time for each kernel. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.
Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization.
L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine. However, not all kernels can utilize maximum machine capabilities.
The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
The GPU Roofline chart is based on a CPU Roofline chart, but there are some differences:
The dots on the chart correspond to OpenCL kernels, while in the CPU version, they correspond to individual loops.
Some displayed information and controls (for example, thread/core count) are not relevant to GPU Roofline. For more information, see the table below.
The integrated chart shows multiple dots for a single kernel. These dots correspond to the different memory levels used to calculate arithmetic intensity. Hover over a dot to identify its arithmetic intensity. To show or hide certain dots on the chart, use the Memory Level drop-down filter.
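A single kernel can appear as several dots because arithmetic intensity depends on which memory level's traffic is counted. The sketch below illustrates that with an invented kernel; the FLOP count and per-level byte counts are hypothetical, and the level names simply reuse the roof names listed above.

```python
def arithmetic_intensity(flops, bytes_by_level):
    """FLOPs per byte for each memory level that saw any traffic."""
    return {
        level: flops / nbytes
        for level, nbytes in bytes_by_level.items()
        if nbytes > 0  # a level with no traffic produces no dot
    }


# Hypothetical kernel: 8 GFLOP of work, with less traffic reaching
# each level that sits farther from the execution units.
kernel_flops = 8.0e9
traffic_bytes = {"L3": 16.0e9, "SLM": 0.0, "GTI": 4.0e9, "DRAM": 2.0e9}

ai = arithmetic_intensity(kernel_flops, traffic_bytes)
for level, value in ai.items():
    print(f"{level}: {value} FLOP/byte")
```

All dots for one kernel share the same achieved performance (y value); only their x position differs, so each dot should be compared against the roof of its own memory level.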
Use the GPU Roofline Summary to compare performance of your application on a CPU and on a GPU device.
Explore the typical scenario of optimizing GPU usage described in Intel Advisor Cookbook: Identify Code Regions to Offload to GPU and Visualize GPU Usage.
The Threading perspective lets you identify the best candidates for parallelization, prototype threading designs, and check whether data dependencies prevent certain functions/loops from being parallelized.
In the Analysis Workflow pane, use the drop-down menu to select the Threading perspective, set the data collection accuracy level to Low, and click the run button. Use the resulting information to discover candidates for parallelization with threads.
Insert annotations to mark places in parts of your application that are good candidates for later replacement with parallel framework code that enables parallel execution.
The main types of Intel® Advisor annotations mark the location of:
A parallel site. A parallel site is a region of code that contains one or more tasks that may execute in one or more parallel threads to distribute work. An effective parallel site typically contains a hotspot that consumes application execution time. To distribute these frequently executed instructions to different tasks that can run at the same time, the best parallel site is not usually located at the hotspot, but higher in the call tree.
One or more parallel tasks within a parallel site. A task is a portion of time-consuming code with data that can be executed in one or more parallel threads to distribute work.
Locking synchronization, where mutual exclusion of data access must occur in the parallel application.
After adding annotations, rebuild your application to prototype threading designs using the Suitability analysis.
In the Analysis Workflow pane, set data collection accuracy level to Medium to collect Suitability data while your application executes. Upon completion, Intel Advisor displays a Suitability Report.

The Suitability Report predicts maximum speedup based on the inserted annotations and what-if modeling parameters with which you can experiment, such as:
Different hardware configurations and parallel frameworks
Different trip counts and instance durations
Any plans to address parallel overhead, lock contention, or task chunking when you implement your parallel framework code
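The flavor of such a what-if estimate can be approximated with Amdahl's law. The sketch below is a textbook simplification with a hypothetical per-thread overhead term; it is not the model behind the Suitability Report, and the input numbers are invented.

```python
def predicted_speedup(parallel_fraction, threads, overhead_per_thread=0.0):
    """Amdahl-style speedup with a simple linear overhead penalty.

    parallel_fraction: share of serial run time spent inside the parallel site.
    overhead_per_thread: hypothetical cost (as a share of serial run time)
        added for each thread, standing in for task and lock overhead.
    """
    serial = 1.0 - parallel_fraction
    parallel = parallel_fraction / threads
    overhead = overhead_per_thread * threads
    return 1.0 / (serial + parallel + overhead)


# Invented scenario: 90% of the time is inside the annotated site.
for n in (2, 4, 8, 16):
    ideal = predicted_speedup(0.9, n)
    with_overhead = predicted_speedup(0.9, n, overhead_per_thread=0.002)
    print(f"{n:2d} threads: ideal {ideal:.2f}x, with overhead {with_overhead:.2f}x")
```

Varying the thread count and the overhead term in this way mimics, very loosely, experimenting with hardware configurations and overhead assumptions before committing to an implementation.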
Use the resulting information to choose the best candidates for parallelization with threads.
If the data collection accuracy level is set to Medium, Threading Advisor collects Dependencies data while your application executes. Use the resulting information to fix the data sharing problems if the predicted maximum speedup benefit justifies the effort.
If you decide the predicted maximum speedup benefit is worth the effort to add threading parallelism to your application:
Complete developer/architect design and code reviews about the proposed parallel changes.
Choose one parallel programming framework (threading model) for your application, such as oneTBB, OpenMP*, Microsoft Task Parallel Library* (TPL), or some other parallel framework.
Add the parallel framework to your build environment.
Add parallel framework code to synchronize access to the shared data resources, such as oneTBB or OpenMP locks.
Add parallel framework code to create parallel tasks.
As you add the appropriate parallel code from the chosen parallel framework, you can keep, comment out, or replace the Intel Advisor annotations.
| Resource | Description |
|---|---|
| | Refer to this guide for instructions to get started with the command line, detailed information on analysis types, information on how to use the GUI, and more. |
| | View the most useful resources that can help you achieve better performance of your application using vectorization. |
| Roofline Resources for Intel® Advisor Users | View the most useful resources that can help you identify hardware-imposed ceilings using Intel Advisor CPU/GPU Roofline perspectives. |
| | Explore typical use cases of Intel Advisor. Follow the step-by-step instructions to help effectively use more cores, vectorization, or heterogeneous processing. |
| | Explore a built-in graphical tool that helps you visualize and analyze graphs of oneAPI Threading Building Blocks (oneTBB), OpenMP*, and Data Parallel C++ (DPC++) applications. |
| Analyze Performance Remotely and Visualize Results on a Local macOS* System | Follow step-by-step instructions to visualize Intel Advisor perspective results on a macOS machine. |
| | Explore new features of Intel Advisor. |
| Vectorization Tutorial for Windows* OS, Vectorization Tutorial for Linux* OS, Threading Tutorial for Windows* OS | View tutorials that can help you experiment with Intel Advisor sample applications and run different perspectives. |
| Offline Resources | One of the key Vectorization perspective features is GUI-embedded advice on how to fix vectorization issues specific to your code. To help you quickly locate information that augments that GUI-embedded advice, Intel Advisor provides offline compiler mini-guides. You can also find offline Recommendations and Compiler Diagnostic Details advice libraries in the same location as the mini-guides. Each issue and recommendation in these HTML files is collapsible/expandable. Linux* OS: Available offline documentation is installed inside <advisor-install-dir>/documentation/<locale>/. Windows* OS: Available offline documentation is installed inside <advisor-install-dir>\documentation\<locale>\. |
You may encounter the following known issues when viewing the documentation in these environments:
Microsoft Windows Server* 2012 system: Trusted site prompt appears. Solution: Add about:internet to the list of trusted sites in the Tools > Internet Options > Security tab. You can remove it after you finish viewing the documentation.
Microsoft Internet Explorer* 11 browser: Topics do not appear when you select them in the TOC pane. Solution: Add http://localhost to the list of trusted sites in the Tools > Internet Options > Security tab. You can remove it after you finish viewing the documentation.
Microsoft Edge browser:
Context-sensitive (also known as F1) calls to a specific topic open the title page of the corresponding document instead. Solution: Use a different default browser.
Panes are truncated and a proper style sheet is not applied. Solution: Use a different default browser.
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.