PROJECTS AND APPLICATIONS

I/O STEERING FOR BIG DATA AND HPC

Open Position: https://seagate.taleo.net/careersection/2/jobdetail.ftl?job=170174

Institution: Seagate

Description: We have entered the era of Big Data, where the size of data generated by digital media, social networks and scientific instruments is increasing at an extreme rate. To gain value from this data, Big Data must be stored, processed and analyzed. This in turn requires introducing new techniques for Big Data management at Exascale, which include Cloud and High Performance Computing infrastructures.

The traditional data storage techniques used in high-performance parallel storage already show limitations when pushed to their limits. Their workflows and data organization were often devised decades ago and are inconsistent with the new generations of object-based data storage that are rapidly replacing the traditional distributed file system architectures adopted so far.

This PhD thesis will address the problem of efficient Cloud and HPC Big Data management, developing a framework and defining policies for guided I/O interfaces that overcome existing limitations. This work will be done with reference to real and challenging use cases such as data handling for the Square Kilometer Array radio telescope (SKA) or the flagship Human Brain projects.

Efficient data transfer and streaming strategies for workflow-based Big Data processing

ESR: Ovidiu Marcu

Institution: Inria

Description: In recent years, a subclass of Big Data, fast data (i.e., high-speed real-time and near-real-time data streams), has also exploded in volume and availability. These specific data, often denoted as events, are typically characterised by a small unit size (in the order of kilobytes) but overwhelming collection rates. Examples of such data include sensor data streams, social network feeds (e.g. 4k tweets per second, 35k Facebook likes and comments per second) and stock-market updates. Numerous applications must process vast amounts of fast data collected at increasing rates from multiple sources, with minimal latency and high scalability. Enabling fast data transfers across geographically distributed sites allows such applications to manage the continuous streams of events in real time and quickly react to changes.

Traditional workflow processing engines often consider data resources as second-class citizens and support access to data only as a side-effect of computation. Currently, workflow data handling is achieved either with application-specific overlays that map the output of one task to the input of another in a pipeline fashion or, more recently, by leveraging the MapReduce programming model, which clearly does not fit every scientific application. When deploying a large-scale workflow across multiple datacenters, the geographically distributed computation faces a bottleneck from the data transfers, which incur high costs and significant latencies. Without appropriate design and management, these geo-diverse networks can raise the cost of executing scientific applications.

The goal of this Ph.D. proposal is to design and implement next-generation data processing, transfer and streaming models, specifically targeting applications that require general data orchestration, independent of any programming model. During this three-year Ph.D. position, the student will evaluate the limitations and bottlenecks of current NoSQL / MapReduce based solutions for general workflow / streaming Big Data applications, formalize the corresponding requirements and propose processing models and techniques for optimized data transfers between workflow nodes (inter and intra datacenter) and for efficient stream processing.

Multicriteria Decision Support Systems for efficient Big Data analysis

ESR: Alvaro Brandon

Institution: Universidad Politécnica de Madrid

Description: Decision Support Systems (DSS) provide recommendations that help users make decisions in complex contexts. In the particular area of Big Data, a DSS can help to choose the best big data process for a particular organization. This decision may depend on many different factors. Therefore, it is necessary to make a multi-criteria decision that grows in complexity as more big data analytics tools appear in the market. It is not only about the steps to be taken in the analytics process from a technical perspective (i.e. choosing whether a data set requires pre-processing, understanding the algorithms available, etc.) but also about understanding other aspects that may impact a company’s performance, such as the energy consumption required to perform the process.

In this project, we will study multicriteria DSS that are able to take into account multiple aspects, including energy consumption, to help companies make decisions on how to build big data analytics processes.

Predictive Models for Big Data

ESR: Pierre Matri

Institution: Universidad Politécnica de Madrid

Description: Current data-driven decision approaches do not scale in Big Data scenarios, where the volume, velocity and variety of data make their management and analysis difficult to address.

As in the case of processing, where new programming models have arisen, novel techniques for efficient and scalable data-driven decision-making processes are needed.

The goal of this job is to examine the limitations of current data-driven decision approaches in Big Data, from scalability and performance points of view, defining new predictive models suitable for these scenarios. To achieve these goals, the current Ph.D. proposal will focus on two more specific directions:

At the data management layer, we are interested in filling the existing gap in storage for Big Data, which mainly focuses on binary large objects, by proposing new solutions for the efficient handling of huge volumes of small data objects.
At the application level, we will leverage the above solutions for both small and large objects to validate a set of original predictive models for Big Data analytics and processing.

Unifying HPC and Cloud storage

ESR: Nafiseh Moti

Institution: Johannes Gutenberg University Mainz

Description: HPC storage and Big Data storage have started from different assumptions, leading to different underlying storage architectures. This has also led to different optimizations, which from an abstract view do not seem to be compatible with each other. Nevertheless, experience with HPC and Big Data has shown that it is becoming necessary to overcome these differences, so that scientific workflows do not have to distinguish between HPC and Cloud storage but can benefit from the advantages of both. The research focuses on developing and evaluating approaches to unify HPC and Cloud storage.

The candidate will be mainly hosted at the Johannes Gutenberg University Mainz, Germany. After the first year, the candidate will visit Seagate to gain a better understanding of industrial requirements and to get in contact with experienced HPC storage developers. After the second year, the candidate will also have the opportunity to discuss their approaches with experienced HPC and Cloud software developers and administrators through exchanges with Inria and BSC.

Data placement via code inspection

ESR: Rizkallah Touma

Institution: Barcelona Supercomputing Center

Description: The Storage System Research Group at BSC is developing a distributed storage platform to store and share data between applications. A distinguishing feature of this platform is that, from the point of view of the applications using it, data is stored in the form of objects, which include data, code (methods manipulating the objects) and behavior policies that are also stored together with the data. The main purpose for storing methods in the platform is to bring execution close to the data in order to avoid unnecessary data transfers from the data store to the application that is executed in the client.

Because the platform knows the methods that directly manipulate the data, it can use this information to improve the performance of the applications using this data. In particular, by analyzing the code of the methods, one can obtain information about which objects are always accessed together, or which parts of an object are accessed together and which are not. This information must be complemented with additional data gathered during execution and analyzed using machine-learning techniques. This is essential in order to discover additional usage patterns that cannot be obtained from the methods because they depend on the applications that access them.

The candidate should obtain both kinds of information and use them to improve data placement among the different back-ends, with the goal of minimizing communications and data transfers within the platform as well, for the sake of performance, scalability and energy efficiency.
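
The idea of finding objects that are always accessed together can be illustrated with a minimal sketch. It assumes that access information is already available as a list of traces, each holding the identifiers of the objects touched by one method execution; the trace format and the function names `co_access_counts` and `always_together` are hypothetical and are not part of the BSC platform.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(traces):
    """Count how often each pair of object ids appears in the same trace."""
    counts = Counter()
    for accessed in traces:
        # Sort the distinct ids so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(accessed)), 2):
            counts[pair] += 1
    return counts


def always_together(traces):
    """Return pairs in which neither object is ever accessed without the other."""
    pair_counts = co_access_counts(traces)
    single_counts = Counter(obj for accessed in traces for obj in set(accessed))
    return {
        pair
        for pair, n in pair_counts.items()
        if n == single_counts[pair[0]] == single_counts[pair[1]]
    }


# Hypothetical traces: each list holds the object ids touched by one method call.
traces = [["a", "b", "c"], ["a", "b"], ["c", "d"], ["a", "b", "d"]]
print(always_together(traces))
```

Pairs reported by such an analysis would be natural candidates for placement on the same back-end; a real system would combine this with the runtime information and machine-learning techniques mentioned above.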

Elasticity in Cloud Computing and HPC storage solutions

ESR: Linh Thuy Nguyen

Institution: Inria

Description: The overall objective of the PhD is to investigate how HPC/Cloud distributed storage systems can integrate the notion of elasticity (i.e. vertical and horizontal scaling) in order to satisfy application requirements while minimizing "on-the-fly" resource costs.

The work will be organized around two major periods:

First, the PhD student will (i) identify the advantages and the limitations of existing storage systems for HPC and Clouds (locally attached devices vs. remote storage solutions) and (ii) determine the pros and cons of providing the elasticity capability according to the envisioned use cases.

Second, the PhD candidate will investigate how virtualization technologies and autonomous mechanisms can be used to deliver storage elasticity. Among the approaches to be explored, we envisage reorganizing the file system deployment by using consolidation techniques at the IaaS level or by using the possibility to attach/detach virtual storage systems on each virtual node. Finally, we want to offer direct control to end users so that they can decide how the distributed storage system should be deployed. Providing such an elasticity capability can be a means of reducing the energy footprint of storage systems.

During this three-year position, the student will learn to design, implement and evaluate prototypes to validate their proposals. Such validations will include manipulating simulator toolkits and conducting large-scale in-vivo experiments on well-known testbeds such as Grid’5000 or even directly on public cloud platforms.

Big Storage Handling/Big Data Processing

ESR: Athanasios Kiatipis

Institution: Fujitsu

Description: The objective of this job is to design next-generation data processing models for applications that require general data orchestration, independent of any programming model. Based on the Petabyte-Storage Device ETERNUS CD10000, you will focus on workflows that link data and applications at petabyte scale.

Huge amounts of data require a change in the model of handling them.

Previous technologies tend not to expand to the new dimensions. This opens up a range of opportunities to solve existing issues (virtualization, sync-and-sharing, mobile data) with new storage dimensions and new storage approaches.

You can examine how old and new storage systems can interact and how storage clusters and computation can work more closely together in the future.

The research focuses on the integration of new scale-out storage systems into new application areas. The candidate therefore explores these opportunities based on a list of example applications. The resulting approaches should generalize the gained insights, helping to bring storage and computation closer together.

You will be given the opportunity to work toward a PhD in cooperation with the Johannes Gutenberg University in Mainz (JGU). There will be multiple visits, which may last several months, as well as visits to other locations of partners within the ETN for education and liaison purposes.

The storage location and design shall be improved. This should result in greater efficiency for the IT “cloud” storage ecosystem. You will implement methods to achieve more efficient storing, caching, replicating and moving of big data.

Storage blobs hosted in the cloud

ESR: Fotis Nikolaidis

Institution: CEA

Description: Recent technologies (such as Cloud Storage) introduce new paradigms like storage objects. In particular, the idea of "storage blobs" came from there. A storage blob is a bag of bytes, with a weak type and full independence from other objects of the same type.

Storage blobs will apply this model to HPC in order to build massive and scalable architectures, up to exabytes or yottabytes (in 2020). Inside BigStorage, the study will focus on the technologies offered by storage clouds (Amazon, Google, …) and the available APIs to access such remote storage. The goal here is to build storage blobs residing inside the Cloud Storage, as replicas of local storage blobs.

Existing solutions will be studied and adapted as needed to fit HPC environments. In practice, blobs will be used jointly with a distributed FS. The results of this study will have a direct impact on CEA’s future data center architecture.

Storage architectures using emerging storage devices

Institution: FORTH

Description: The goal is to examine how future storage systems should be architected to take advantage of new storage devices for improving the capabilities and efficiency of Big Data applications. This work will provide the ground-work for shifting storage architectures from centralized SANs to distributed approaches over low-latency devices. It will examine techniques for achieving the required performance and reliability, while offering increased flexibility.

NVRAM technologies and HPC

ESR: Umar Hameed

Institution: Johannes Gutenberg University Mainz

Description: The number of levels in the HPC storage hierarchy is still increasing. While researchers have been able to successfully include new Flash-based devices into the HPC storage stack, it remains unclear how to integrate even faster NVRAM technologies like Phase-change RAM or MRAM.

NVRAM offers a new byte-addressable interface to persist data. This opens new opportunities to communicate between applications and the storage backend, leaving room for new data structures and storage approaches.

The research focuses on the integration of new NVRAM technologies into HPC environments. The Ph.D. candidate therefore explores these opportunities based on a list of example applications. The resulting approaches should generalize the gained insights, helping to bring storage and computation closer together.

Informed I/O data placement

ESR: Georgios Koloventzos

Institution: Barcelona Supercomputing Center

Description: The general objective of the PhD is to investigate how I/O hints (among other information) provided by applications, runtimes or users about their I/O behavior can improve the energy efficiency and/or the performance of I/O operations on the different I/O hierarchies formed by devices such as SSD, HDD, RAM and NVRAM.

The work will be organized as follows:

First, the PhD student will (i) analyze existing I/O hints on different parallel and traditional file systems and (ii) propose a set of new and advanced I/O hints to achieve the proposed objective, along with a set of applications that can make use of them.

Second, the PhD candidate will start working with data schedulers (adapted or created by him/her) and job schedulers to manage the information obtained from these hints and, by managing the data in several planes (movement, placement and scheduling), improve energy efficiency and performance. Finally, the candidate will work with different devices and architectures (real and simulated) and study their effect.

During this three-year position, the student will have to create prototypes to validate their proposals. Such prototypes will go from simulations to real world implementations and integrations, for example in an existing parallel file system of their choice.

Energy-impact of data consistency management in Clouds and Beyond

ESR: Mohammed-Yacine Taleb

Institution: Inria

Description: We have entered the era of Big Data, where the size of data generated by digital media, social networks and scientific instruments is increasing at an extreme rate. To gain value from this data, Big Data must be stored, processed and analyzed. This in turn requires introducing new techniques for Big Data management on Exascale systems, including Cloud and High Performance Computing infrastructures.

The ever-increasing size of data has elevated Big Data management to a key issue in large-scale Cloud and HPC systems. At such a large scale, the power used to store and process these data is rising sharply (i.e., the energy consumed to operate these infrastructures increases and results in bills on the order of millions of dollars). In addition, data can now be geographically distributed and processed across multiple data centers (e.g., data processed by MapReduce middleware) and can be shared and accessed (read and updated) by multiple users and processes (e.g., data accessed in social applications and data generated, shared, and processed by several users or software stacks in scientific simulations). This in turn poses, among others, the challenging issue of data consistency (i.e., how to provide a highly responsive system and at the same time ensure a satisfactory consistency level for replicated/shared data, considering the hierarchy of network latencies and heavy concurrent access). As energy consumption and scale of data are both on the rise, energy-efficient consistency management has become an important concern in Big Data infrastructures including Clouds and HPC systems.

This PhD thesis will address the problem of how to manage (i.e., store and process) Big Data on large-scale Cloud and HPC systems in an energy-efficient way. This research is expected to bring substantial innovative contributions with respect to the following aspects: 1) investigate the relation between consistency and energy consumption in Cloud and HPC systems; 2) design a model of the energy consumption of consistency management in large-scale systems; 3) define new consistency models and data placement schemes that preserve the required data consistency while optimizing the overall energy consumption.

Data reduction through novel I/O interfaces

ESR: Yevhen Alforov

Institution: DKRZ

Description: The PhD student will work in two main areas:
1. Use case analysis and evaluation
2. Energy savings for data centers based on informed storage usage

In detail, the candidate will analyze existing applications from the domain of earth system science regarding their I/O requirements and distill these requirements into appropriate benchmarks for use by other partners. These benchmarks will also be used to evaluate the approaches developed in other parts of the project. To this end, it will be necessary to communicate and cooperate with earth system scientists.

Additionally, the candidate will work on I/O interfaces that will allow the storage footprint to be reduced through the use of semantic information provided by application developers. General techniques for data reduction should also be investigated and improved.

Estimation of IT data center energy consumption for predicted workload allocation

ESR: Michał Zasadziński

Institution: CA Technologies

Description: The objective of this ESR is to manage predictive models for energy consumption as a function of projected workload allocation at the server, rack, aisle, facility, and data centre level. The project will include techniques for continuous validation of existing models and automatic or semiautomatic recalibration of these models. The combination of these models with graph models may also be considered. These graph models will be based on the use of NoSQL graph databases. Predictive models may range from simple regression models to complex Computational Fluid Dynamics (CFD) models. Techniques and specific directions will be decided depending on the availability of data, as new data centers with new sensing systems and more instrumentation become available.

The outcome of this project is expected to be a set of models and associated algorithms estimating the energy spent by a server, rack, aisle, facility and data centre over the next day or week, hour by hour and day by day. Mechanisms for model validation and recalibration may also be considered.
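
As an illustration of the simplest end of this range, the following sketch fits an ordinary least-squares line that relates server utilization to power draw and then uses it to estimate energy hour by hour. All numbers, the linear power model and the names `fit_line` and `predict` are assumptions made for the example, not data or methods from the project.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Estimate power (watts) for a projected utilization x in [0, 1]."""
    slope, intercept = model
    return intercept + slope * x


# Hypothetical measurements: CPU utilization vs. measured power in watts.
utilization = [0.0, 0.25, 0.5, 0.75, 1.0]
power_watts = [100.0, 150.0, 200.0, 250.0, 300.0]
model = fit_line(utilization, power_watts)

# Hypothetical projected workload for the next three hours.
projected = [0.2, 0.6, 0.9]
# One hour at P watts consumes P watt-hours.
hourly_wh = [predict(model, u) for u in projected]
print(hourly_wh)
```

Continuous validation, as described above, could then compare such estimates with fresh sensor readings and refit the model when the error grows.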