Call: Energy Efficient Technologies in HPC

Horizon Europe (HORIZON)
Application deadline: 07 Feb. 2024
Published: 07 Nov. 2023
Estimated call budget: EUR 15,000,000 - 20,000,000

Expected outcome:
Significantly improved energy efficiency at increased overall throughput and utilisation of supercomputers
An overall more competent and energy aware user community
Reliable common metrics to measure the efficiency of supercomputers to enable fine grained comparison between supercomputers and data centres to optimise operations.
A common data repository containing the complete collective operational data set of all participating HPC centres and other relevant data (e. g. application input files provided by users)
A holistic and modular European software stack of interoperable components with well-defined and well-documented interfaces, covering energy aware and dynamic workload management, monitoring and data analytics deployed in pre-exascale and exascale production environments.
A dynamic workload management solution that will support:
constrained operation, in particular with respect to power capping at different levels (global, job and node level)
malleability across all layers of the software stack with support for different workloads at the same time (rigid, mouldable, malleable, etc.), linked to the monitoring system and power management system at different levels
co-scheduling of workloads sharing heterogeneous resources on the same node in the most efficient way, with effective profiling and monitoring mechanisms, e. g. to efficiently combine CPU-, memory-, network- and I/O bound workloads on the same node
A comprehensive monitoring solution collecting all relevant operational data to identify inefficient use of resources, for example performance bottlenecks, congestion, adverse interaction of workloads, job and application configurations and power consumption, preferably using open standards such as Power API. The monitoring solution should provide a quantitative measure of the overhead it incurs and apply measures to minimize it.
An advanced data analytics framework for operational data analysis, linked to the monitoring solution, capable of processing all relevant data such as monitoring data, user input, job scripts and workflow definitions at exascale exploiting the emerging opportunities of novel approaches for intelligent decision making such as generative AI, deep learning, etc.
A data driven and AI based solution to correlate current system workload, scheduling, resource allocation, job configuration, application input etc. with performance and energy efficiency. The solution will (a) provide the relevant information to the workload management solution and (b) assist users with the preparation and selection of executables, configuration of job parameters and specific application input, to optimize the R&D output per Watt of consumed energy.
Full integration of the monitoring and operational data analytics solutions into the workload management software stack to guide scheduling, resource allocation, job and application configuration
Competent developer teams at the participating organisations with the ability to maintain and further develop the software stack
Feedback on future requirements and new capabilities for better hardware management with respect to energy monitoring and power management at different technology layers (system, node, board, chip etc.)
Production-quality solutions installed on EuroHPC supercomputers, within an ecosystem of services around the deployed solution to ensure reliable operation and support

Scope:
The action will establish and implement a strategic R&I initiative contributing to the development of innovative HPC software technology integrated with cross-layer energy-monitoring metrics tailored for exascale and post-exascale supercomputers. The action will ensure a common framework for implementation by maintaining an R&I roadmap with a critical timeline, control points and deliverables to govern the necessary activities.

Proposals should provide a holistic view on the entire energy aware software stack, including a dynamic resource and workload management solution for energy aware HPC and a comprehensive monitoring and profiling framework adhering to a common data format. Participants must commit to work towards a common standard and solution to address the challenges of energy efficiency, monitoring and resource scheduling at exascale. All participating HPC centres should gradually adopt the common solution once the required quality level has been reached, which should be measured by appropriate KPIs. The developed solution should integrate an advanced data analytics solution providing the basis for automated intelligent decisions on scheduling, dynamic resource allocation, job and application configuration. The data driven solution should provide capabilities significantly beyond the state of the art, considering recent developments in the fields of modelling and optimisation, data science, deep learning and generative AI. Moreover, the solution should identify user input and applications resulting in inefficient use of resources to provide automated feedback on the optimal use of resources.

In general, the pursued approach should follow the principles of modularity, interoperability, cross-platform compatibility (avoiding vendor lock-in and preparing for new architectures such as RISC-V based hardware), extensibility (e. g. via plugins) and openness (e. g. by permissive licensing). Another important design principle should be a low impact on system performance with negligible overhead on the compute nodes. Consequently, the definition of interfaces and software documentation should receive proportionate attention in the work plan, which should also be reflected in corresponding quantitative KPIs. Where appropriate, proposals should take into account developments in related activities outside the EuroHPC ecosystem, for example efforts of the HPC-PowerStack and PowerAPI forums.

All significant software components should be subject to a professional planning and documentation process, including at least software requirements specification, software design description, software verification and validation as well as user and developer documentation for every developed software component.

Proposals should indicate, for each participating HPC centre, the current energy consumption with respect to the current throughput and portfolio of applications (on average, expressed in Gflops/Watt) and an estimated overall reduction of energy consumption by the end of the action when the software stack is expected to be used in production.
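As an illustration of the figures requested above, the following minimal sketch computes an average efficiency in Gflops/Watt and the energy reduction implied by a target efficiency for the same amount of work. The class name, the function name and all numbers are hypothetical and are not taken from the call; how a centre averages throughput over its application portfolio is left open here.

```python
from dataclasses import dataclass


@dataclass
class CentreProfile:
    """Hypothetical per-centre figures; none of these values come from the call."""
    name: str
    sustained_gflops: float  # average sustained throughput over the application portfolio
    avg_power_watts: float   # average electrical power drawn while delivering it

    def gflops_per_watt(self) -> float:
        if self.avg_power_watts <= 0:
            raise ValueError("average power must be positive")
        return self.sustained_gflops / self.avg_power_watts


def energy_reduction(baseline_eff: float, target_eff: float) -> float:
    """Fraction of energy saved for the same amount of work when efficiency
    rises from baseline_eff to target_eff (both in Gflops/Watt)."""
    if baseline_eff <= 0 or target_eff <= 0:
        raise ValueError("efficiencies must be positive")
    # Energy for a fixed amount of work scales as 1 / efficiency.
    return 1.0 - baseline_eff / target_eff


# Made-up inputs: 2e6 Gflops sustained at 1e5 W gives 20 Gflops/Watt.
centre = CentreProfile("example-centre", sustained_gflops=2.0e6, avg_power_watts=1.0e5)
baseline = centre.gflops_per_watt()
saving = energy_reduction(baseline, 25.0)  # 25 Gflops/Watt is an assumed target
print(f"{centre.name}: {baseline:.1f} Gflops/W, estimated saving {saving:.0%}")
```

The reduction is expressed relative to the same workload; a proposal that also expects throughput to grow would need to state that assumption separately.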

While the resulting software stack should generally provide support for malleable applications and workflows, the scope of the initiative is delimited by system software such as the operating system on the one hand and, on the other hand, the boundaries to user space software and applications, including, for example, programming models, software libraries and applications.

The selection of the software stack must avoid duplicated or redundant elements. An indicative list of the software components envisaged should be provided in the proposal, explaining their specific role. The JU provides a specific template for this purpose. Alternative solutions overlapping with the role or responsibilities of a component in the consolidated reference implementation may be used by individual partners, but no resources will be made available for activities linked to such components within this action. However, participants should be invited to use the continuous integration platform free of charge to integrate their tailored solutions as an alternative into the common software stack.

All developed software and documentation should be available in a single software repository using a state of the art version control system which provides information on the development history and is accessible at least by all members of the consortium, the funding authority and external reviewers. Measures for Continuous Integration that verify dependencies between software packages before deployment should be applied. The requirements set out in the call must be reflected in the proposal as well as in corresponding deliverables and milestones. The work plan should provide for updates of deliverables whenever necessary.

Requirements

General requirements, interoperability, harmonization and standardisation:

Develop a common vision and technology roadmap towards an integrated and modular scheduling and resource management framework, including one common and complete software stack (a selection of one set of existing and to be developed components providing non-overlapping functionality) that fulfils best the requirements of all participating HPC operators.
Maintain a detailed strategic development roadmap for the action, anticipating future developments in HPC architectures and increasing heterogeneity, including emerging technologies such as quantum computing resources. The novel opportunities of exascale systems (e. g. statistical work load and user behaviour) and advances in data driven technologies (e. g. AI) should be identified and addressed.
Define common standards and data formats for the collection and exchange of data collected from the supercomputers for operational and application data analytics. Both technical and legal aspects should already be addressed by the proposal and not deferred to a later time or the consortium agreement. Where required, an appropriate modification of, e. g., the general terms and conditions for users of supercomputers should be elaborated and implemented by the participating HPC operators. A maximum of data should be shared and made available for European R&D, subject to an individual confidentiality declaration by every data analyst and in line with applicable legislation as required. Restrictions to access to specific data (e. g. on user data, vendors) must be specifically and duly justified.
Define a common approach to sanitise the shared data sets where required by the applicable legislation
Define a mechanism to pool operational data from all participating HPC centres for analytics in a common data repository
Develop performance and energy consumption metrics, supported by the collected operational data, at different levels (e. g. application and workflow performance, global throughput, tail latencies, network congestion and interactions of workloads)
Provide recommendations for additional operational data and requirements for sensor data needed (e. g. from systematic profiling, workload experiments or sensors on single component usage and occupancy) to fill gaps and improve insight and intelligence on system operations and feedback on the software development
Contribute to relevant standardisation and coordination efforts
Set up a common software repository applying best practices for continuous integration for all components developed within the action (possibly using a solution implemented by another EuroHPC JU project if available)
Define a catalogue of specific criteria and quality standards the developed solutions must fulfil before being included in (a) a pre-production environment (b) in the production supercomputing environment in operation at the participating HPC centres ("acceptance test"). The specifications must be sufficiently detailed to provide guidance to the technical implementation and ensure the participating HPC centres will adopt the solutions in their production environments. HPC centres participating in a proposal are expected to make respective commitments for deployment, subject to the condition that all previously defined requirements are met.
Define the roles and responsibilities of each software component (e. g. w.r.t. interaction with hardware), functional requirements, interfaces with the rest of the software stack in close collaboration with the competent technical experts
Define metrics including breakdown by application, throughput, availability etc. and corresponding qualitative and quantitative KPIs to drive the developments towards the objectives
Define effective KPIs on feedback, coordination and information flow between the different technical areas and developer groups, including mitigation measures in case the targets will not be met
Collect and analyse user feedback and system response to evaluate how the intelligent resource management system has affected workloads, scientific output and user productivity

Monitoring and data analysis

Collect all relevant information from hardware and system software to enable effective monitoring of energy consumption at different levels (system, job, node, processing unit)
Implement mechanisms to pool operational data from all participating HPC centres, such as job performance metrics, availability, system health, energy consumption, user activities, application and work load behaviour etc., in a common repository taking into account technical and legal aspects
Provide access to the common data repository, at least to EuroHPC actions. The data should be available for research on more efficient resource management, for example to explore fault tolerant and data/AI based dynamic scheduling approaches
Provide a data and AI driven solution to collect information on system health and performance, identify its symptoms and then diagnose, anticipate, predict and identify potential component failures, anomalies, silent data corruption, burst errors etc. with a detailed system and component health report and analysis
Monitor, profile and fingerprint applications and workflows to identify characteristic usage patterns, detecting inefficient code (e. g. not using optimized numerical libraries, wrong compiler flags), job configuration and execution at user and system level (process affinity and placement, competition with other workloads for resources, inappropriate application input)
Develop a solution, integrated in the workload management system, to respond to inefficient job configurations and application input as early as possible (e. g. preventing a job from being scheduled, started or completed) and provide automated feedback to users
Based on intelligence from operational and application input data analysis, the analytics solution should identify the most (energy) efficient scheduling and resource allocation decision taking into account the respective state of the system
Integrate the monitoring and data analytics solution with the other layers of the software stack to provide the best possible information for an efficient dynamic scheduling and resource management as well as an improved lifecycle management, for example by optimizing hardware operation parameters to increase the lifespan of components.

Dynamic resource management

Develop and deploy a hierarchical workload management solution with intelligent scheduling capabilities for heterogeneous systems, available in production and tightly integrated with the advanced monitoring and analysis system
Implement a dynamic scheduling functionality on heterogeneous resources, including a global workload scheduler and resource manager, a job level manager and a physical node manager, supporting different types of workloads (rigid, mouldable, malleable etc.)
Ensure as far as possible a programming model agnostic solution to avoid vendor lock in
Implement support for co-location/oversubscription on heterogeneous resources at the node and system level (e. g. oversubscription of nodes or islands), optimising energy efficiency and system utilisation by taking into account the specific characteristics of CPU-, memory-, network- and I/O bound workloads
Develop resource allocation and scheduling policies taking into account energy consumption, optimal usage and throughput, power capping at different levels (global, job, node,…)
Link the monitoring and analytics solution to the resource manager to support AI driven and dynamic resource allocation and smart (co-)scheduling of workloads
Implement support for response to and operation under power constraint, linking the concepts of dynamic scheduling, co-location and power management
Optimise global resource management policies for increased performance, throughput and energy efficiency
Test and optimise resource management from simulation to a real operational environment with user access (e. g. using an island of a supercomputer with the relevant heterogeneous hardware components)
Test and optimise the overall system performance and dynamic adjustment of e. g. hardware parameters for individual applications, workflow and user behaviour
Scale-up to exascale in a production environment
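The requirement for power capping at different levels (global, job, node) can be pictured with a minimal sketch that passes a global power budget down to jobs and then to nodes. The proportional policy, the function names and the wattages are assumptions made for illustration only; the call does not prescribe any allocation policy, and a production solution would also have to respect hardware minimums and priorities.

```python
from typing import Dict


def split_power_budget(budget_watts: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Split a power budget across children in proportion to their demand,
    never granting more than a child asked for."""
    if budget_watts < 0:
        raise ValueError("budget must be non-negative")
    total = sum(demands.values())
    if total <= budget_watts:
        return dict(demands)  # no capping needed at this level
    scale = budget_watts / total
    return {name: demand * scale for name, demand in demands.items()}


def apply_hierarchical_cap(global_cap_watts: float,
                           jobs: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Global level -> job level -> node level.
    `jobs` maps a job name to the power demand (watts) of each of its nodes."""
    job_demands = {job: sum(nodes.values()) for job, nodes in jobs.items()}
    job_caps = split_power_budget(global_cap_watts, job_demands)
    return {job: split_power_budget(job_caps[job], nodes) for job, nodes in jobs.items()}


# Hypothetical demand: 1000 W in total under a 500 W global cap.
jobs = {
    "job_a": {"node1": 400.0, "node2": 400.0},
    "job_b": {"node3": 200.0},
}
caps = apply_hierarchical_cap(500.0, jobs)
for job, node_caps in caps.items():
    print(job, node_caps)
```

Keeping the same splitting function at every level is one way to reflect the hierarchical structure described above; an energy aware scheduler would replace the proportional rule with a policy informed by the monitoring and analytics data.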

Consortium composition and project management

Proposals are expected to present a detailed work breakdown in their software development plan with a professional implementation and management following industrial standards. Besides the required track record of each consortium member in the respective field, the management team should also demonstrate the relevant competences. To this end, the consortium is expected to appoint a general manager with the respective professional competence and experience for the implementation, monitoring and management of complex software development projects. The management team is expected to work closely together with the funding authority, reporting any (anticipated) changes in the roadmap and timeline without undue delay.

The consortium must include all required competences and operational capacity to perform the proposed work and to achieve the objectives set out by the call. In particular, the participation of HPC centres is critical for the required deployment of the solution in a production environment.

The participation of private companies, e. g. those offering professional services for HPC, is highly encouraged. However, the software used and developed within a proposal should offer a permissive licensing model. Exceptions should be duly justified and reviewed on a regular basis during the implementation of the work. A data sheet for each significant software component should be provided according to the application template documents[1].

HPC centres that participate in the action must commit to

Contribute to the definition of the solutions developed by the action through specifications and requirements justified by their individual operational constraints
Implement the common standards and interoperability requirements defined and adopted by the consortium during the project
Provide all required data for the common data repository and within the applicable legislation. If operational or user generated data, e. g. application input, cannot be provided due to legal or contractual agreements, this must be notified and duly justified to the coordinator who will inform the granting authority. A sanitised dataset must be provided in such case.
Deploy the developed solutions to the production environment as soon as the solution has passed the common acceptance criteria defined by the consortium. Where technically possible, individual modules should be deployed as early as possible and before the implementation of the entire software stack has been completed. If a participating HPC centre deviates from the reference solution defined in the proposal, e. g. by replacing a module with an alternative solution, this should be explained and justified in the respective deliverable and progress report.
The participation of HPC centres requires that all legal aspects related to the sharing of data be clarified and that sufficiently detailed evidence of the ability to share all relevant data be provided to the granting authority before the signature of the grant agreement

Proposals should also clearly demonstrate that all partners in the consortium have a significant and justified role, including appropriate deliverables under their responsibility which cover the specific contributions of each partner. All participants in the action should contribute at least 5% of the total personnel resources, limiting the total consortium size to a maximum of 20 participants. Additionally, the contribution of each partner participating in the implementation of a particular technical area identified in the call (workload manager, monitoring framework, data driven analytics) should not be less than 2 full-time equivalents (FTEs). The consortium is required to establish an effective management structure with clear responsibilities and well defined reporting lines without boundaries across the different participating organisations. Moreover, the proposal should, in cooperation with the EuroHPC JU, develop and implement a mechanism for efficient monitoring by the funding authority with meaningful progress reporting at least on a monthly basis. The status of the implementation should be available to the competent funding authority and every participant in the project at any time, e. g. via an issue tracking and ticketing system, dashboard, backlog, results from automated testing and similar as provided by standard continuous integration solutions.
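The numeric participation rules above can be checked mechanically. The sketch below is one possible reading, assuming personnel resources are counted in a single unit such as person-months; the function name, the data layout and the example figures are hypothetical. The limit of 20 participants follows from the 5% floor, since 100% / 5% = 20.

```python
from typing import Dict, List

MIN_SHARE = 0.05        # each participant: at least 5% of total personnel resources
MAX_PARTICIPANTS = 20   # consequence of the 5% floor
MIN_AREA_FTE = 2.0      # minimum FTEs per partner in a technical area it works on


def check_consortium(personnel: Dict[str, float],
                     area_ftes: Dict[str, Dict[str, float]]) -> List[str]:
    """Return human-readable rule violations (an empty list means none found).
    `personnel` maps partner -> effort in a common unit (assumed: person-months).
    `area_ftes` maps technical area -> {partner: FTEs contributed to that area}."""
    total = sum(personnel.values())
    if total <= 0:
        raise ValueError("total personnel resources must be positive")
    problems: List[str] = []
    if len(personnel) > MAX_PARTICIPANTS:
        problems.append(f"{len(personnel)} participants exceed the maximum of {MAX_PARTICIPANTS}")
    for partner, effort in personnel.items():
        share = effort / total
        if share < MIN_SHARE - 1e-12:  # small tolerance for floating-point rounding
            problems.append(f"{partner}: {share:.1%} of personnel resources is below 5%")
    for area, contributions in area_ftes.items():
        for partner, fte in contributions.items():
            if fte < MIN_AREA_FTE:
                problems.append(f"{partner}: {fte} FTE in '{area}' is below {MIN_AREA_FTE}")
    return problems


# Hypothetical consortium: partner C is under 5%, partner B is under 2 FTE in one area.
personnel = {"A": 60.0, "B": 36.0, "C": 4.0}
area_ftes = {"workload manager": {"A": 3.0, "B": 1.5}}
problems = check_consortium(personnel, area_ftes)
for line in problems:
    print(line)
```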

Additional mandatory KPIs

Number of deployed components from common software stack with breakdown by pre-production, production environment and per HPC system
Test coverage for the developed components and APIs
Resource utilization and energy efficiency improvement of the solution on a per-job and workload basis

The JU considers that proposals requesting a contribution from the EU of up to EUR 20 million and a duration of 4 years would allow this specific challenge to be addressed appropriately. Nonetheless, this does not preclude submission and selection of proposals requesting another duration or other amounts. Only one proposal will be selected.

Background:

The operation of supercomputers consumes a considerable amount of energy. Modern exascale class supercomputers reach electricity intakes of more than 20 MW. Besides the technical challenges associated with stable electricity supply, infrastructure or heat dissipation, the economic and environmental aspects are also of outstanding importance in the context of energy efficiency. Without effective measures for energy efficient technology and HPC operations, the cost of operating supercomputers may become prohibitive and impact the availability and use of HPC infrastructure.

In line with its mission[2] and strategic programme[3], the EuroHPC JU addresses energy efficiency and environmental sustainability across the entire HPC technology stack, for example through low-power hardware technology, dynamic power-saving and re-use techniques like advanced cooling and heat recycling. While the challenge of energy efficiency has many dimensions across the entire value chain of HPC, the relevant metrics for the EuroHPC infrastructure can be broadly defined as R&D output per Watt. Hence improving the amount and quality of R&D output per Watt is a central objective in the JU’s ambition towards a more energy efficient HPC ecosystem in Europe.

Currently, perhaps the largest potential for improving energy efficiency lies in the areas of user competence and the responsible use of resources, algorithms and applications, and system operation. The JU addresses these areas with several initiatives such as EUROCC 2 (user competence), Inno4Scale (algorithms), Centres of Excellence in HPC applications (applications) and the REGALE, DEEP and SEA projects (system software). However, the JU has so far achieved limited harmonisation and uptake of a common software stack for a more energy efficient system operation of the EuroHPC supercomputers, which is critical to address the cross-cutting topic of energy efficient HPC operation in a coherent manner.

The beginning of the exascale supercomputing era with compute nodes of increasing size and heterogeneous system architectures offers unprecedented opportunities to develop intelligent scheduling mechanisms for an improved global throughput and overall energy efficiency. Recent developments in advanced modelling, deep learning and generative AI may provide the required intelligence for smart scheduling, workload configurations and user assistance to optimise performance and energy efficiency. In this regard, the availability of a comprehensive data set is a key requirement for the development of advanced techniques for energy aware and energy efficient supercomputing. The EuroHPC JU has put in place one of the largest supercomputing infrastructures in the world, which offers a unique opportunity to place Europe at the forefront of intelligent data driven and energy efficient HPC operation.

[1] Only software components which are owned or controlled by the consortium members are eligible. This may include software owned by third parties which is provided under a permissive license. In such a case the consortium must demonstrate in the proposal the ability to develop the software independently of the owner.

[2] Council Regulation (EU) 2021/1173 of 13 July 2021 on establishing the European High Performance Computing Joint Undertaking and repealing Regulation (EU) 2018/1488, http://data.europa.eu/eli/reg/2021/1173/oj

[3] EuroHPC JU Decision No 8/2023, https://eurohpc-ju.europa.eu/system/files/2023-06/Decision%2008_2023_%20Amendment%20MASP%202021-2027_0.pdf