Basic Concepts of Optimizing for Parallelism And Locality
In this article, we’ll discuss some of the basic concepts in compiler design that can be used to exploit parallelism and locality. We’ll start by looking at multiprocessor hardware, and then move on to three key ideas a parallelizing compiler relies on: loop-level parallelism, data locality, and affine transforms. Finally, we’ll look at some techniques for exploiting these ideas in your own code so that it runs faster across multiple machines or on a single machine with multiple cores.
Multiprocessors:
Multiprocessors are a collection of processors that share memory. Multiprocessor systems have the advantage of being able to execute multiple operations in parallel, thus increasing performance. For example, if you want to perform several independent computations on some data at once, a single processor must run them one after another, because each instruction must finish executing before the next one can start; a multiprocessor can run them simultaneously on different processors.
Another place where this kind of parallel hardware shows up is the graphics processing unit (GPU). GPUs use hundreds or thousands of simple cores, so they can reach very high performance levels without additional hardware support from the motherboard or an increased CPU core count. This is because GPU cores are designed to be as streamlined as possible, so they don’t need large amounts of memory or complex control logic to do their job.
Parallelism in Applications:
The parallel processing of a problem can be achieved by running different parts of the program on different processors. However, as the number of processors increases, so does the complexity and cost of building such an application. To overcome this, we need to design our algorithms so that they scale well as the level of parallelism grows. This means analyzing how many resources each task within the algorithm needs and deciding which tasks should be executed in parallel and which sequentially, because the fraction of work that remains serial limits the achievable speedup (Amdahl’s Law).
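As a quick illustration of Amdahl’s Law, the sketch below computes the theoretical speedup 1 / ((1 − p) + p / N) for a few processor counts. The serial fraction and the processor counts are hypothetical values chosen for the example, not measurements from any real program.

```c
#include <stdio.h>

/* Amdahl's Law: if a fraction p of the work can be parallelized,
 * the best possible speedup on N processors is 1 / ((1 - p) + p / N). */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.9;                 /* hypothetical: 90% of the work is parallel */
    int counts[] = {2, 4, 8, 64};

    for (int i = 0; i < 4; i++)
        printf("N = %2d -> speedup = %.2f\n", counts[i], amdahl_speedup(p, counts[i]));

    /* Even with unlimited processors, the speedup is capped at 1 / (1 - p) = 10x. */
    return 0;
}
```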
In addition, even the theoretical limits on parallelism may not be reached in practice, for reasons such as:
- Poor data locality, i.e., a lack of memory access locality.
- Communication and synchronization costs between processors.
In today’s world, many applications require multiple processes or threads to execute in parallel. Examples include web servers that serve many requests at once, and multi-threaded applications that perform several tasks simultaneously (for example, a word processor that lets us keep typing while it formats and displays the document on screen).
Loop-Level Parallelism:
At the level of the loop nest, there are several tools for modifying loops to take advantage of parallelism. Some of these tools are:
- Loop interchange: Swapping the order of two loops in a nest (making the inner loop the outer one and vice versa) when doing so does not change the result. This is typically done so that the innermost loop walks through memory contiguously, or so that a parallelizable loop is moved outward (see the sketch after this list).
- Loop unrolling: Replicating the loop body several times and reducing the iteration count accordingly, which cuts loop-control overhead and exposes more independent work to the instruction scheduler. This can be done either manually or automatically by the compiler’s optimizer based on an analysis of the code structure (see the sketch after this list).
- Loop distribution: Splitting a single loop into several loops over the same index range, each containing a subset of the original statements. Independent pieces of the workload can then be run in parallel or handed to different processors, within one system or across systems connected by a network, instead of every processor waiting on the entire loop body.
- Loop partitioning: Dividing a loop’s iterations so that each thread performs only a portion of the work. This is usually done with a parallelizing compiler, i.e., one that automatically detects parallel loops and rewrites the code for multiple processors.
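A minimal sketch of two of these transformations is shown below. The array names and sizes are illustrative choices for the example, not taken from any particular program.

```c
#define N 1024
double a[N][N], b[N][N];

/* Loop interchange: with the j loop outermost, the code strides through a
 * and b column by column, which is cache-unfriendly for row-major C arrays. */
void copy_interchanged(void)
{
    /* Before interchange:
     * for (int j = 0; j < N; j++)
     *     for (int i = 0; i < N; i++)
     *         a[i][j] = b[i][j];
     */
    for (int i = 0; i < N; i++)        /* after interchange: i is the outer loop    */
        for (int j = 0; j < N; j++)    /* consecutive j accesses are now contiguous */
            a[i][j] = b[i][j];
}

/* Loop unrolling: replicate the body to cut loop-control overhead and
 * give the scheduler several independent additions per iteration. */
double sum_unrolled(const double *x, int n)   /* assumes n is a multiple of 4 */
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```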
Data Locality:
The data locality of a program is the extent to which its data references are clustered in memory and reused over time. Data locality is important for performance because it lets the processor serve most accesses from cache lines that have already been loaded, instead of going out to main memory.
Data locality can be measured by looking at how frequently the program references a particular piece of data, function, or instruction, and how close together those references are in memory and in time. For example, if you want your code to run as fast as possible, data items that are used together should be stored together, so that a single cache line (or page) fetch brings in several useful values at once rather than leaving related values scattered across different areas of memory.
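The effect is easy to see with a row-major C array. The sketch below (array name and size are illustrative) sums the same elements twice, once with good spatial locality and once without; only the traversal order differs.

```c
#define N 2048
static double m[N][N];

/* Good locality: walks memory in the order it is laid out, so each cache
 * line that is fetched is fully used before the next one is needed. */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Poor locality: jumps N * sizeof(double) bytes between consecutive
 * accesses, touching a new cache line on almost every iteration. */
double sum_column_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```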
Introduction to Affine Transform Theory:
An affine transform is a mapping of the form f(x) = Ax + b: it maps the coordinates of a point (for example, a loop index vector) to another set of coordinates using a linear function plus a constant. Affine transforms can be used to describe and optimize both parallelism and locality, so we will use them in our discussion of compiler design.
An affine transform of a loop nest is built on two pieces of information:
The data dependence property (DDP) records which iterations access the same array elements. If one iteration writes a value that another iteration reads, those iterations are dependent, and their relative order must be preserved.
The iteration space property (ISP) describes the set of index values the loop nest actually executes. When the loop bounds are affine functions of the enclosing indices, this set has no gaps or “holes”: it forms a regular region that the compiler can traverse from beginning to end, and can partition and reorder systematically.
Together, the DDP and ISP give the compiler enough information to issue loads and stores in parallel without losing data. The only exception is when there are dependences that the compiler cannot break; in that case it must wait until the values those dependences produce have been computed before issuing the dependent load or store.
In order to make parallelization possible, the compiler must determine which parts of an array can be accessed independently and which cannot. This is done with a dependence analysis algorithm, which tells us whether any dependences exist between two indexes and therefore whether the corresponding iterations can be executed in parallel or not.
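As a sketch of what dependence analysis decides, the two loops below (array names and size are illustrative) differ only in their access pattern: the first has no loop-carried dependence, so its iterations could run in parallel, while the second carries a dependence from iteration i − 1 to iteration i and must stay sequential as written.

```c
#define N 1000
double a[N], b[N];

void independent_iterations(void)
{
    /* Each iteration writes a[i] and reads only b[i]: no iteration reads a
     * value written by another, so the iterations can run in parallel. */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
}

void carried_dependence(void)
{
    /* a[i] depends on a[i - 1], which the previous iteration wrote: this
     * loop-carried dependence forces sequential execution (or a different
     * algorithm, such as a parallel prefix sum). */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}
```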
Compiler Techniques for Exploiting Parallelism and Locality:
To exploit parallelism and locality in a compiler, you should consider the following:
- Using affine transform theory to partition data.
- Using affine transform theory to partition loops.
- Using affine transform theory to partition the iteration space.
- Using affine transform theory to partition the data space.
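One common way to partition the iteration space with an affine mapping is loop tiling: the mapping (i, j) → (i / T, j / T, i mod T, j mod T) groups iterations into blocks that fit in cache and can be handed to different processors as independent units of work. A minimal sketch, with a hypothetical tile size T and illustrative array names:

```c
#define N 1024
#define T 64                        /* hypothetical tile size, tuned to the cache */
static double a[N][N], b[N][N];

/* Tiled scale: the outer (ii, jj) loops enumerate tiles of the iteration
 * space; each tile touches a T x T block of a and b, so its working set
 * stays cache-resident, and distinct tiles are independent pieces of work
 * that could be assigned to different processors. */
void scale_tiled(double c)
{
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int i = ii; i < ii + T; i++)
                for (int j = jj; j < jj + T; j++)
                    a[i][j] = c * b[i][j];
}
```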
Conclusion:
The optimization techniques outlined in this article will help to make code as efficient as possible. In addition, they can be combined with other compiler optimizations to achieve even better results.