Microsoft Research334 тыс
Опубликовано 9 сентября 2016, 2:30
Scalable Parallel Computing on Many/Multicore Systems
This set of lectures will review the application and programming model issues that one must address when one gets chips with 32-1024 cores and “scalable” approaches will be needed to make good use of such systems. We will not discuss bit-level and instruction-level parallelism i.e. what happens on a possibly special purpose core, even though this is clearly important. We will use science and engineering applications to drive the discussion as we have substantial experience in these cases but we are interested in the lessons for commodity client and server applications that will be broadly used 5-10 years from now.
We start with simple applications and algorithms from a variety of fields and identify features that makes it “obvious” that “all” science and engineering run well and scalably in parallel. We explain why unfortunately it is equally obvious that there is no straightforward way of expressing this parallelism. Parallel hardware architectures are described in enough detail to understand performance and algorithm issues and need for cross-architecture compatibility; however in these lectures we will just be users of hardware. We must understand what features of multicore chips we can and should exploit. We can explicitly use the shared memory between cores or just enjoy its implications for very fast inter-core control and communication linkage. We note that parallel algorithm research is hugely successful although this success has reduced activity in an area that deserves new attention for the next generation of architectures.
The parallel software environment is discussed at several levels including programming paradigm, runtime and operating system. The importance of libraries, templates, kernels (dwarfs) and benchmarks is stressed. The programming environment has various tools including compilers with possible parallelism hints like OpenMP; tuners like Atlas; messaging models; parallelism and distribution support as in HPF, HPCS Languages, co-array Fortran, Global arrays and UPC. We also discuss the relevance of important general ideas like object-oriented paradigms (as in for example Charm++), functional languages and Software Transactional Memories. Streaming, pipelining, co-ordination, services and workflow are placed in context. Examples discussed in the last category include CCR/DSS from Microsoft and the Common Component Architecture CCA from DoE. Domain Specific environments like Matlab and Mathematica are important as there is no universal silver programming bullet; one will need interoperable focused environments.
We discuss performance analysis including speedup, efficiency, scaled speedup and Amdahl's law. We show how to relate performance to algorithm/application structure and the hardware characteristics. Applications will not get scalable parallelism accidentally but only if there is an understandable performance model. We review some of the many pitfalls for both performance and correctness; these include deadlocks, race conditions, nodes that are busy doing something else and the difficulty of second-guessing automatic parallelization methods. We describe the formal sources of overhead; load imbalance and the communication/control overhead and the ways to reduce them. We relate the blocking used in caching to that used in parallel computation. We note that load balancing was the one part of parallel computing that was easier than expected.
We will mix both simple idealized applications with “real problems” noting that usually it is simple problems that are the hardest as they have poor computation to control/communication ratios. We will explain the parallelism in several application classes including for science and engineering: Finite Difference, Finite Elements, FFT/Spectral, Meshes of all sorts, Particle Dynamics, Particle-Mesh, and Monte Carlo methods. We will discuss the innovative Sequential Dynamical Systems approach used in the EpiSims, TransSims and other critical infrastructure simulation environments. In all applications we need to identify the intrinsic parallelism and the degrees of freedom that can be parallelized and distinguish small parallelism (local to core) compared to large parallelism (scalable across many cores)
We will not forget many critical non-technical Issues including “Who programs – everybody or just the marine corps?”; “The market for science and engineering was small but it will be large for general multicore”; “The program exists and can’t be changed”; “What features will next hardware/software release support and how should I protect myself from change?”
We will summarize lessons and relate them to application and programming model categories. In last lecture or at end of all lectures we encourage the audience to bring their own multicore application or programming model so we can discuss examples that interest you.
This set of lectures will review the application and programming model issues that one must address when one gets chips with 32-1024 cores and “scalable” approaches will be needed to make good use of such systems. We will not discuss bit-level and instruction-level parallelism i.e. what happens on a possibly special purpose core, even though this is clearly important. We will use science and engineering applications to drive the discussion as we have substantial experience in these cases but we are interested in the lessons for commodity client and server applications that will be broadly used 5-10 years from now.
We start with simple applications and algorithms from a variety of fields and identify features that makes it “obvious” that “all” science and engineering run well and scalably in parallel. We explain why unfortunately it is equally obvious that there is no straightforward way of expressing this parallelism. Parallel hardware architectures are described in enough detail to understand performance and algorithm issues and need for cross-architecture compatibility; however in these lectures we will just be users of hardware. We must understand what features of multicore chips we can and should exploit. We can explicitly use the shared memory between cores or just enjoy its implications for very fast inter-core control and communication linkage. We note that parallel algorithm research is hugely successful although this success has reduced activity in an area that deserves new attention for the next generation of architectures.
The parallel software environment is discussed at several levels including programming paradigm, runtime and operating system. The importance of libraries, templates, kernels (dwarfs) and benchmarks is stressed. The programming environment has various tools including compilers with possible parallelism hints like OpenMP; tuners like Atlas; messaging models; parallelism and distribution support as in HPF, HPCS Languages, co-array Fortran, Global arrays and UPC. We also discuss the relevance of important general ideas like object-oriented paradigms (as in for example Charm++), functional languages and Software Transactional Memories. Streaming, pipelining, co-ordination, services and workflow are placed in context. Examples discussed in the last category include CCR/DSS from Microsoft and the Common Component Architecture CCA from DoE. Domain Specific environments like Matlab and Mathematica are important as there is no universal silver programming bullet; one will need interoperable focused environments.
We discuss performance analysis including speedup, efficiency, scaled speedup and Amdahl's law. We show how to relate performance to algorithm/application structure and the hardware characteristics. Applications will not get scalable parallelism accidentally but only if there is an understandable performance model. We review some of the many pitfalls for both performance and correctness; these include deadlocks, race conditions, nodes that are busy doing something else and the difficulty of second-guessing automatic parallelization methods. We describe the formal sources of overhead; load imbalance and the communication/control overhead and the ways to reduce them. We relate the blocking used in caching to that used in parallel computation. We note that load balancing was the one part of parallel computing that was easier than expected.
We will mix both simple idealized applications with “real problems” noting that usually it is simple problems that are the hardest as they have poor computation to control/communication ratios. We will explain the parallelism in several application classes including for science and engineering: Finite Difference, Finite Elements, FFT/Spectral, Meshes of all sorts, Particle Dynamics, Particle-Mesh, and Monte Carlo methods. We will discuss the innovative Sequential Dynamical Systems approach used in the EpiSims, TransSims and other critical infrastructure simulation environments. In all applications we need to identify the intrinsic parallelism and the degrees of freedom that can be parallelized and distinguish small parallelism (local to core) compared to large parallelism (scalable across many cores)
We will not forget many critical non-technical Issues including “Who programs – everybody or just the marine corps?”; “The market for science and engineering was small but it will be large for general multicore”; “The program exists and can’t be changed”; “What features will next hardware/software release support and how should I protect myself from change?”
We will summarize lessons and relate them to application and programming model categories. In last lecture or at end of all lectures we encourage the audience to bring their own multicore application or programming model so we can discuss examples that interest you.
Свежие видео
Случайные видео