Data Modeling Learning Infrastructure
Arguably, Data Science is one of the most potentially innovative fields around today. It is a cross-cutting science with multiple links to mathematics (specifically, statistics) and computer science. By nature it is a multidisciplinary field, often requiring a deep comprehension of application specifics.
This raises great challenges. For example, in the insurance sector, costs could be reduced for the company while assuring improved service standards; in pharmaco-epidemiology, weak signals or anomalies could be detected; from a medical-economic point of view, treatment protocols for a given pathology could be optimized; and a conduct scorecard could be proposed to automobile drivers in order to promote responsible driving.
The specificity of research in this field is determined by the characteristics of the data to be processed. The difficulty lies in the complex structure of the information contained in the data. In general, available data does not result from an experimental design, so a recognizable structure is not always obvious; data can be incomplete, heterogeneous, and almost always massive. It can be produced at high speed and require high-speed analysis; in fact, certain cases demand instant decisions.
The particular issues related to the great volume and variety of data (Big Data) can result in problems of acquisition, storage, preparation, visualization, and the processing itself.
Clearly, this type of data compels researchers to design new approaches: which axes, which methods?
The Data Science Initiative team covers most aspects of data science research. It is a team of excellence in applied mathematics, statistics and computer science whose goal is to develop innovative methodologies within the framework of demanding academic research. Its members also have extensive experience in applying this knowledge, with a strong impact on society and industry.
The Initiative works in close association with many industrial and institutional partners. In this exchange, its array of competences in mathematics and computer science allows feedback between statistical models, their implementation (the choice of appropriate hardware, optimization of the models for the selected hardware) and the use of the models.
Because the Initiative is concerned with the evolution of data science, the algorithms and code resulting from these developments are made available to the community as open source.
The Data Science Initiative invests its effort in 4 main axes:
(Un)structured data, or how to describe it?
We have entered the era of Big Data. The explosion and abundance of available data raise new challenges and new opportunities in many disciplines, with the potential for strong impacts on science, business and society.
Managing data is at the core of the problems of Big Data, and the formalization of its structure is an unavoidable stage. It is not a field of research in itself, but rather a required step on the path to a scientific approach to research in data science.
The different forms that these data can take (often heterogeneous) require diverse approaches (preprocessing, visualization, modeling). Data can be “structured”: it then often represents independent lists of characteristics (or features), such as biological data of patients (weight, blood pressure, age, etc.) or information about the customers of a bank or an e-commerce site. These characteristics can also correspond to a real-time information flow (the price of a stock in a financial market, purchase lists or credit card withdrawals).
But often the data is semi-structured or not structured at all. The key/value model is then not strictly applicable. This is notably true in the particular case of text data (web pages analyzed by a search engine) or graph data (links between different people in a social network). With less structured data, many tasks become more difficult. For example, searching for a word or a group of words in a given text must generally take into account all the forms of the different words, and even of the group of words.
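The word-form problem mentioned above can be illustrated with a minimal sketch. It assumes English text and a toy list of suffixes; the names `SUFFIXES`, `normalize` and `find_word` are hypothetical, and a real search engine would rely on a proper stemmer or lemmatizer rather than this rule.

```python
import re

# Toy suffix list: an assumption for illustration, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(word):
    """Lowercase a word and strip one common English suffix (naive stemming)."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def find_word(text, query):
    """Return the tokens of `text` whose normalized form matches `query`."""
    target = normalize(query)
    tokens = re.findall(r"[A-Za-z]+", text)
    return [tok for tok in tokens if normalize(tok) == target]

text = "She models the data. Modeling helps: a model was modeled yesterday."
matches = find_word(text, "model")
# matches == ["models", "Modeling", "model", "modeled"]
```

A plain substring search for "model" would also find these four tokens here, but it would equally match unrelated words that merely contain the query; comparing normalized forms is one simple way to avoid that.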
A researcher in Data Science most often has a particular field of expertise: time series, graphs, text, among others.
Modeling, or how to represent a given phenomenon mathematically?
Statistical modeling is a way of approaching reality in a simple, mathematically formalized way; in other words, a way of understanding the processes which generate the data.
Models constructed this way generally depend on a certain number of parameters which must be estimated by comparing models with reality (true data).
A “good” model is then a model which, for a given set of parameters, makes it possible to describe properly the underlying data and its structure. It generally contains a random component (dependent on the parameters) that allows a certain flexibility in the description.
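As a minimal sketch of estimating parameters by comparing a model with data, consider a Gaussian model with two parameters, a mean and a standard deviation. The choice of a Gaussian, the function names and the observations below are assumptions made for the example. The log-likelihood measures how well a given set of parameters describes the data, and the maximum-likelihood estimates are the parameters that make it largest.

```python
import math
import statistics

def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of `data` under a Gaussian model with parameters (mu, sigma)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - squared_error / (2 * sigma ** 2))

def fit_gaussian(data):
    """Maximum-likelihood estimates: sample mean and population standard deviation."""
    mu_hat = statistics.fmean(data)
    sigma_hat = statistics.pstdev(data, mu_hat)
    return mu_hat, sigma_hat

# Hypothetical observations, made up for the example.
observations = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0]
mu_hat, sigma_hat = fit_gaussian(observations)

# The fitted parameters describe the data better than a shifted mean does.
best = gaussian_log_likelihood(observations, mu_hat, sigma_hat)
other = gaussian_log_likelihood(observations, mu_hat + 0.5, sigma_hat)
```

Here the random component is the Gaussian noise around the mean: its standard deviation is itself a parameter, which is what gives the description its flexibility.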
When the number of parameters is relatively high and allows structures of very different natures (with few a priori conditions), we generally speak of “non-parametric models”. There are many types of statistical models. They can be classified according to their formal structure.
When the underlying structure of the model is organized in levels (for example, a first structure at the level of groups of individuals, then another at the level of the different categories of these individuals, and a final structure at the level of the individual itself), we refer to “hierarchical models”. “Latent variable models” are used when observed data follow simple distributions whose parameters depend on “latent” variables, that is, variables that are not directly observed (for example, the data observed for an individual would depend on an unobservable category).
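One common instance of a latent variable model is a finite mixture. Written out under the assumption of K unobserved categories and Gaussian components (both choices are illustrative and not taken from the text), each individual i has a latent category z_i that selects the distribution of its observation y_i:

```latex
z_i \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K), \qquad
y_i \mid z_i = k \sim \mathcal{N}(\mu_k, \sigma_k^2), \qquad
p(y_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y_i \mid \mu_k, \sigma_k^2)
```

Only y_i is observed; the category z_i and the parameters (\pi_k, \mu_k, \sigma_k) have to be inferred from the data.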
A statistical model can also be classified according to the type of phenomenon which it seeks to describe or identify. For example, “survival models” describe the cause-and-effect links between a well-identified event (the “death” of an individual) and the different possible risk factors. “Hawkes models” are used to identify the existing links and dynamics in the propagation of information between various “nodes” of a network (for example, the propagation of tweets on Twitter, or the propagation of an earthquake). “Population models” describe biological or physical phenomena observed within a group of individuals, as well as the variability among them. This approach has applications in fields such as pharmacometrics, which requires describing and quantifying interactions between diseases, drugs and patients. These models take into account the fact that patients react differently to the same disease and the same treatment.
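To make the idea of a Hawkes model more concrete, here is a minimal sketch of its conditional intensity for a single node, assuming an exponential kernel (a common choice, but an assumption here). Each past event temporarily raises the rate at which new events occur; the parameter names `mu`, `alpha` and `beta` and the event times are hypothetical.

```python
import math

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity at time t, given past events.

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t
    )
    return mu + excitation

# Hypothetical event times (for instance, timestamps of tweets) and parameters.
events = [1.0, 1.5, 4.0]
baseline = 0.2
rate_before = hawkes_intensity(0.5, events, mu=baseline, alpha=0.8, beta=1.0)
rate_after = hawkes_intensity(1.6, events, mu=baseline, alpha=0.8, beta=1.0)
```

Before any event the intensity equals the baseline `mu`; just after two close events it is much higher, which is how the model captures bursts of activity.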
Machine learning, or how to produce instructions to automate decision-making?
Machine learning, or statistical learning, deals with designing and developing methods that give computers the ability to learn from data, with an explanatory or predictive goal.
Two types of learning can be highlighted. In supervised learning, the computer is presented with labeled data, and algorithms are developed to predict these labels based on training data sets. The computer learns a general rule to transform inputs into outputs. The most classical examples are regression and classification problems.
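A very small classification example, given as a sketch only: a nearest-neighbor rule, chosen here for its simplicity and not because the text prescribes it. The labeled training data and the function name are hypothetical.

```python
import math

def predict_nearest_neighbor(training_set, point):
    """Return the label of the training example closest to `point`."""
    features, label = min(
        training_set,
        key=lambda example: math.dist(example[0], point),
    )
    return label

# Hypothetical labeled data: (height in cm, weight in kg) -> category.
training_set = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 80.0), "large"),
    ((185.0, 90.0), "large"),
]

prediction = predict_nearest_neighbor(training_set, (178.0, 82.0))
# prediction == "large"
```

The "general rule" learned here is simply "give a new input the label of its closest training example"; other supervised methods replace this rule with a fitted function.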
In unsupervised learning, there are no known labels; instead, partitions of the data are built (clustering), or representations that summarize the information the data contains (and may even reveal hidden patterns within it).
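Clustering can be sketched with the k-means procedure (Lloyd's algorithm), one of many possible methods. The points, the number of clusters and the choice of initial centroids below are assumptions for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its cluster.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return assignments, centroids

# Hypothetical unlabeled points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Initial centroids are taken from the data; this choice is an assumption.
assignments, centroids = k_means(points, centroids=[points[0], points[3]])
# assignments == [0, 0, 0, 1, 1, 1]
```

No label is ever supplied: the two groups emerge only from the distances between points.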
There are other forms of learning, such as semi-supervised learning, reinforcement learning (which seeks a good compromise between exploiting the data already collected and exploring), and algorithms that learn how to learn.
Statistical learning involves applied mathematics and computer science, particularly in the context of Big Data, where the volume of training data requires taking into account the technological constraints within the algorithms.
As members of the Initiative, our researchers in machine learning work on different topics including:
• Missing data (how to use data in spite of the presence of “holes” in the data).
• Aggregation of estimators/combination of experts (combining algorithms to improve performance).
• Selection of variables in high-dimensional contexts (how to identify the most significant variables).
• Recommender Systems (how “to guess” the product preferences of users).
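As an illustration of the first topic in this list, here is a minimal sketch of one very simple way of handling missing data: replacing each "hole" with the mean of the observed values in the same column. This is only a baseline and not necessarily the approach studied by the Initiative; the function name and the patient records are hypothetical.

```python
import statistics

def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    columns = list(zip(*rows))
    means = [
        statistics.fmean([v for v in column if v is not None])
        for column in columns
    ]
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical patient records: weight (kg), blood pressure, age; None marks a hole.
records = [
    [70.0, 120.0, 34.0],
    [None, 130.0, 51.0],
    [80.0, None, None],
]
completed = impute_column_means(records)
# completed[1][0] == 75.0, completed[2][1] == 125.0, completed[2][2] == 42.5
```

Mean imputation keeps every record usable but understates the variability of the data, which is one reason more refined methods are a research topic.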
The processing infrastructure, or what framework (Hadoop, Spark, etc.) to use for the processing model?
The unprecedented growth of the amount of data and its “democratization” (everyone has access to data “consumption”) have created new challenges for data management systems. A first challenge consists in being able to carry out queries or statistical analyses on gigantic data sets with latency approaching real time. To achieve this goal, traditional infrastructure must change in order to transparently integrate parallel data processing, all while preserving data structure (complex temporal structures or sequential structures such as the genome). Within this framework, it is necessary to optimize system latency and throughput, but also the cost of the computing effort (in the context of resource use in the cloud).
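The principle of parallel data processing can be sketched with a word count written in the map/reduce style that frameworks such as Hadoop and Spark apply at cluster scale. This sketch uses only the Python standard library, with a thread pool standing in for the workers of a cluster; it illustrates the split, map and merge pattern and is not a performance recipe. The function names and input lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Map step: count the words of one chunk of lines independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines, n_chunks=4):
    """Split the input, process the chunks in parallel, then merge the results."""
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_counts = pool.map(count_words, chunks)
        # Reduce step: merge the per-chunk counters into one.
        total = Counter()
        for partial in partial_counts:
            total.update(partial)
    return total

lines = ["big data big models", "data flows fast", "fast data"]
totals = parallel_word_count(lines, n_chunks=2)
# totals["data"] == 3 and totals["big"] == 2
```

Because each chunk is counted independently, the same code structure carries over when the chunks live on different machines; only the merge step needs the partial results brought together.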
A second challenge is to improve the “intelligence” of data management systems, and in particular their user-friendliness, by bridging the growing gap between the volume of data and the limited human capacity to grasp it. New data mining systems are being developed to guide the user through these huge data sets towards the required information. Moreover, these systems must be capable of pointing out possible anomalies, with high-quality explanations that allow corrective or preventive actions.