On characterization of sensory data in presence of missing values: The case of sensory coffee quality assessment

Multiple factor analysis was used to examine organoleptic coffee assessments such as aroma, aftertaste, flavor, acidity, balance, body, uniformity, sweetness, clean cup, and other organoleptic-related properties used in coffee quality assessment. The sensory analysis was performed under missing-value (NA) scenarios with 5%, 10%, 20%


INTRODUCTION
Coffee is an important beverage whose varieties differ widely in their sensory characteristics. Consumer demand for products differentiated by quality characteristics is also growing [1]. Therefore, it is vital to produce high-quality and stable coffee that satisfies consumer demand and preferences [2]. Differences arising from the growing region's specific environmental conditions, such as temperature, altitude, latitude, and humidity, directly influence the beans [3]. Although several species of coffee are known today, there is particular interest in Robusta and Arabica coffee. Both are hardy crops, resistant to disease, insects, and weather. Moreover, these species have economic and cultural importance in several countries [4][5][6]. In Colombia, which has a strong and long coffee tradition, coffee export is a relevant commercial activity [7].
Organoleptic quality is one of the most important characteristics for the successful marketing of coffee. Nowadays, a common way to evaluate coffee quality is through trained tasters in a sensory panel [8,9]. Following the criteria of the Specialty Coffee Association (SCA) [10,11], the sensory panel typically evaluates sensory characteristics such as aroma, flavor, and natural and chemical factors, which are important for the consumer of special and regular coffees [8]. The sensory assessment of coffee characteristics is important because it could be related to consumers' acceptance and purchase [12,13]. Moreover, given the importance of coffee, several computational and statistical techniques can be used in sensory data analysis [14,15]. In this context, a judge or an electronic device [16] describes all sensations perceived and sets a quantitative or qualitative evaluation of the coffee beverage characteristics [17][18][19][20][21].
Multiple Factor Analysis (MFA) [22] is one of the most popular techniques to study multiple variables or factors such as those evaluated in sensory studies. It is an alternative both to Principal Component Analysis (PCA) and to Simple and Multiple Correspondence Analysis (SCA, MCA) that permits synthesizing, representing, and interpreting relationships between several sets of features [23]. This technique has visualization tools and mathematical indices that can be helpful for coffee quality researchers. MFA has been widely studied as part of multivariate statistics and can analyze quantitative and qualitative variables grouped by type and interest within the study. MFA is a standard statistical technique that works with all available data. In particular, MFA has been used in sensory data studies on wine quality, orange juices, coffee variety, and survey-based studies [24][25][26][27]. However, it is common for this type of data to suffer losses or contain errors that prevent analysis, even though there are several strategies to avoid or treat such problems. A variable may have only a small number of missing responses, but in combination the whole dataset could contain enough missing values to become a problem in the study [28].
The presence of missing values, or data not available (NA), is a problem that has been approached from different perspectives. Among them is the Regularized Iterative MFA method (RI-MFA), which uses the mean if the variable is quantitative, or the most acceptable value according to the category proportions if it is qualitative [29]. RI-MFA is considered the best strategy when data contain missing values, but evaluating its performance is vital for the proper usage of techniques dealing with missing values [30,31]. The performance evaluation should incorporate controlled simulations with incomplete data to better understand the RI-MFA technique. Imputation and missing data generation techniques are still being developed and improved, and scenario simulation can uncover issues that may arise when missingness is induced on complete data.
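The general idea behind regularized iterative imputation can be sketched in a few lines. The following is a minimal illustration in Python with NumPy, assuming purely quantitative variables and a plain PCA in place of the group-weighted MFA; it is not the missMDA implementation, and the function name `iterative_pca_impute`, the noise estimate, and the stopping rule are assumptions made for this example.

```python
import numpy as np


def iterative_pca_impute(X, ncp=2, max_iter=500, tol=1e-6):
    """Simplified regularized iterative PCA imputation for numeric data.

    Illustrative sketch only: it ignores the group weighting of RI-MFA
    and uses a simple noise estimate (an assumption of this example).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: start from the column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(max_iter):
        # Step 2: PCA (via SVD) of the centered, currently completed table.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        # Step 3: shrink the retained singular values using the mean of the
        # discarded eigenvalues as a rough noise estimate (regularization).
        eig = s ** 2
        sigma2 = eig[ncp:].mean() if ncp < len(s) else 0.0
        shrunk = np.maximum(eig[:ncp] - sigma2, 0.0) / np.maximum(s[:ncp], 1e-12)
        fitted = (U[:, :ncp] * shrunk) @ Vt[:ncp] + mean
        # Step 4: overwrite only the missing cells with the fitted values.
        updated = np.where(missing, fitted, X)
        change = np.sum((updated - filled) ** 2)
        filled = updated
        if change < tol:
            break
    return filled
```

Observed cells are never modified; only the NA cells are updated at each iteration, which is why the imputed values tend to follow the structure already present in the observed data.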
Given the above, this study aims to apply the MFA and RI-MFA methods to analyze the sweetness, flavor, bitterness, fragrance or aroma, saltiness, body, acidity, mouthfeel, aftertaste, and cup balance of a sensory coffee assessment. Although these properties are not found in the SCA protocol, organizations such as the Coffee Quality Institute (CQI), which focuses on improving coffee quality and the producers' lives, include other standards, protocols, and variables. The dataset included the evaluation of coffee from 1341 coffee-producing units located in 36 countries around the world and was obtained from the Coffee Quality Institute in 2018 using web scraping. Coherent approximations were obtained with the MFA and RI-MFA multivariate methods on the coffee sensory dataset when considering samples with desirable and probable scenarios of 5%, 10%, 20%, and 30% missing values.

MATERIAL AND METHODS
The dataset was collected from the Coffee Quality Institute (CQI) in 2018 using a custom web scraping program to study the quality of coffee. Trained and accredited CQI panelists indicate which coffee is common or special according to its sensory particularities. In the dataset, the organoleptic coffee assessment covers aroma, aftertaste, flavor, acidity, balance, body, uniformity, sweetness, clean cup, cupper points, and other coffee-related data (see Table 1).
The worldwide coffee producers were considered the individuals of interest in this study. The data collected contain information about two species of coffee, Arabica and Robusta, produced in 37 countries worldwide. A detailed description of the data and descriptive statistics of the sample scores were obtained. A radar graph of the attributes was also produced to present the available information.
Initially, all dataset values were considered. MFA is used to explore multiple data tables measured on the same set of observations. In the MFA context, the dataset is a set of tables {X1, ..., Xn} in which each row is an observed individual and each column is a feature. The number and type of features can vary from one table to another, provided that the tables describe the same individuals. Data analysis was carried out with the R language [32]. A multivariate exploratory data analysis was conducted to understand the main relationships between subjects and sensory quality variables. The MFA function from the FactoMineR package [33] and the imputeMFA function from the missMDA package [16] were used in the experiments. Web scraping was performed, and radar plots and heatmaps were generated, using the Python language.
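To make the structure of MFA concrete, the sketch below shows its core for quantitative groups only: each table is standardized, divided by the square root of the first eigenvalue of its own PCA, and the weighted tables are then analyzed jointly with a global PCA. It is a minimal NumPy illustration and not the FactoMineR implementation; the function name `mfa` and its return values are assumptions made for this example, and qualitative groups and supplementary elements are not handled.

```python
import numpy as np


def mfa(groups, ncp=2):
    """Minimal MFA for quantitative groups: weight each table, then global PCA.

    `groups` is a list of arrays sharing the same rows (individuals).
    Assumes no column has zero variance.
    """
    weighted = []
    for X in groups:
        X = np.asarray(X, dtype=float)
        n = X.shape[0]
        # Standardize every variable (zero mean, unit variance).
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        # First eigenvalue of the separate PCA of this group.
        s = np.linalg.svd(Z / np.sqrt(n), compute_uv=False)
        lambda1 = s[0] ** 2
        # MFA weighting: no single group can dominate the first global axis.
        weighted.append(Z / np.sqrt(lambda1))
    K = np.hstack(weighted)
    n = K.shape[0]
    # Global PCA of the concatenated, weighted tables.
    U, s, Vt = np.linalg.svd(K / np.sqrt(n), full_matrices=False)
    eigenvalues = s ** 2
    inertia_pct = 100.0 * eigenvalues / eigenvalues.sum()
    # Coordinates of the individuals on the first `ncp` dimensions.
    coords = np.sqrt(n) * U[:, :ncp] * s[:ncp]
    return coords, eigenvalues, inertia_pct
```

With this weighting, the first global eigenvalue lies between 1 and the number of groups, which is a convenient sanity check on any MFA output.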
Code and scripts are available on GitHub [34]. Figure 1 shows a schematic overview of the scenario creation and RI-MFA comparison. Candidate scenarios were created by inducing 5%, 10%, 20%, and 30% missing values on the coffee sensory dataset, and the missing values were then imputed with RI-MFA.
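A scenario of this kind can be generated by masking a fixed proportion of cells in the complete table. The sketch below is one possible way to do so in Python with NumPy, assuming that cells are removed completely at random; the text does not state the missingness mechanism, so this is an assumption, as are the function name `make_missing` and the placeholder data standing in for the sensory table.

```python
import numpy as np


def make_missing(X, proportion, seed=0):
    """Return a copy of X with `proportion` of its cells set to NaN.

    Cells are drawn completely at random (an assumption of this sketch);
    a row or column may become entirely missing at high proportions.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_missing = int(round(proportion * X.size))
    # Draw distinct cell positions in the flattened table.
    flat_idx = rng.choice(X.size, size=n_missing, replace=False)
    X_na = X.copy()
    X_na.flat[flat_idx] = np.nan
    return X_na


# Placeholder complete table (hypothetical values, not the CQI data).
complete = np.random.default_rng(42).normal(loc=7.5, scale=0.4, size=(100, 12))
scenarios = {p: make_missing(complete, p, seed=1) for p in (0.05, 0.10, 0.20, 0.30)}
```

Fixing the seed makes each scenario reproducible, so the same incomplete table can be passed to different imputation methods for comparison.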

RESULTS AND DISCUSSION
We conducted an exploratory data analysis to understand the coffee dataset and summarize the main characteristics. The scores given for the coffee characteristics were primarily numerical. We considered the complete dataset with two coffee species, Arabica and Robusta; Arabica is produced in 36 countries, whereas Robusta is produced in 5 countries around the world. The sensory dataset contains 1341 assessments and 12 coffee features without missing values. Table 2 gives the means and standard deviations (SD); the means were taken as the reference values.
Data were expressed in terms of individuals and scores evaluated by species and country. According to Figure 2, Arabica (red) has the highest scores for Balance (EQ) and Quality Score (SRC), whereas Robusta (blue) is superior in the other characteristics. Relevant producing units are located in Papua New Guinea, Japan, Ethiopia, and the United States. Coffee flavor (FLV) is one of the most important factors in differentiating quality and is among the attributes with the greatest weight in judging [35][36][37][38]. Cupper points (PCP) and Quality score (SRC) are global and subjective appreciations. Panelists determined the different sensory characteristics among the different samples, and according to several studies on genotype and environment influence, the source of the sample has a strong relationship with the quality of coffee.
Initially, the descriptive sensory analysis performed on the worldwide coffee quality dataset showed that producer unit performance is statistically similar. With the entire dataset, the MFA results show two selected components that explain about 79% of the total variation in the dataset. The first dimension (Dim 1) explains 63.71% of the variation, whereas the second (Dim 2) explains 15.77%. The first part of the analysis consisted of identifying the producer units' behavior according to the coffee sensory attributes, shown on the bottom-left of the corresponding figure. In comparison with the complete data, the MFA of imputed data shows an increment in the percentage of inertia near 74% when the percentage of missing values is set to 30%, which can be explained by similarity added during the imputation process, after which coffee-producing units are generally more similar to each other than in the real data. Figure 4 shows the percentage of inertia reached by MFA with imputed data for each missing-values scenario. Despite the low relevance of the scenarios with a percentage higher than 50%, the plot presents the tendency of cumulative inertia as a function of the proportion of missing values.
Cumulative inertia may suggest that the imputed data play a role similar to that of the observed data, which increases the probability of finding similar data and therefore yields results that make sense theoretically but not logically.
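The percentages of inertia discussed above are obtained from the eigenvalues of the factor analysis: each eigenvalue is divided by their sum, and the cumulative percentage is the running total. A small Python sketch of this arithmetic follows; the function name `inertia_percentages` and the example eigenvalues are hypothetical and are not taken from the study.

```python
def inertia_percentages(eigenvalues):
    """Return per-dimension and cumulative percentages of inertia."""
    total = sum(eigenvalues)
    pct = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative


# Hypothetical eigenvalues, used only to illustrate the computation.
pct, cumulative = inertia_percentages([3.0, 1.0])
print(pct)         # [75.0, 25.0]
print(cumulative)  # [75.0, 100.0]
```

Applied to the values reported for the complete data, the first two dimensions give 63.71% + 15.77% = 79.48%, which is the "about 79%" of total variation mentioned in the text.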
The analysis of missing values in different scenarios reveals that specific producing units or individuals lose their similar profile, and several are located far from each other on the factor map. Figure 5 shows factor axes and the variables. These results are the same for 35% and 45%. This finding suggests that the relationship between variables is overestimated, with a direct positive relationship between bias and the percentage of imputed data.
There are many possible approaches to dealing with missing values in multivariate data. The comparative purpose is achieved by comparing, for each RI-MFA imputed dataset, the inertia ratio obtained from the MFA result. In this way, we make a modest contribution by providing empirical evidence on how imputation influences the inertia ratio when the dataset has missing values.

CONCLUSION
We obtained coherent results in this study when missing values were imputed using RI-MFA. It is an expected result in multivariate analysis that, as the percentage of NA grows, the inertia of all data points, which reveals component importance, also increases. In practice, this may be reasonable because the imputation method uses the available information and fills in features according to other records in the dataset. However, analysts often obtain lower inertia when individuals vary little.
Both MFA and RI-MFA perform well in the presence of NA and appear to be appropriate when sensory data contain NA. Based on the simulation study, RI-MFA was observed to be a good strategy to estimate missing values. It would be interesting to see the behavior of RI-MFA in other datasets and to combine this methodology with cluster analysis. RI-MFA imputation may have good predictive accuracy with 1% to 10% of missing values; higher percentages may lead to severely biased inference when the imputed variables are used in subsequent factor analyses.
This case study shows that a correct analysis requires a broader knowledge of missing values and a careful critique of the imputation mechanism.
Imputation is implemented in many software packages and may appear to be the solution in all cases where missing values are present. Future studies and experiments should incorporate other imputation methods, estimators, interpolators, and strategies for pooling estimates which, following our schematic scenario configuration, assess the robustness of the MFA method in sensory data analysis. Future research should also include Nonlinear Iterative Partial Least Squares (NIPALS) estimation, which is based on the available information. This inclusion is warranted because recent studies have found adequate results with it, making it an alternative solution to the problem of NA [39][40][41].