Creating metrics – Objective and Procedure
Introduction
Meaningful sound evaluation
Human perception of sound is very complex and cannot be represented by a single technical measurement parameter such as the sound pressure level. If you want to define a more meaningful quality index for your sounds, you are well advised to use a calculation rule that links various parameters instead. Such calculation rules, also called metrics, combine the results of different technical analyses to determine a characteristic single value for your product sounds. Linking relevant analyses allows different sound aspects to be taken into account and incorporated into the final result. This allows, for example, not only the noise level to be included in the evaluation, but also the proportion of high frequencies, the prominence of tonal components, and the contributions of other disturbing noise patterns.
Advantages of a sound quality metric
A good sound quality metric can help you in several ways.
Metrics can be developed based on the results of a jury test, for instance. In this process, the jury test results are mapped by measurement-based analysis results. A metric ascertained in this way subsequently allows you to determine the perceived sound quality of your products in a time-saving manner, without the need for further jury tests.
HEAD acoustics products for developing metrics
The development process of a sound quality metric involves several steps, for which HEAD acoustics provides you with various tools.
Suitable data for creating metrics
Before creating a metric, it must be checked whether the input data are actually suitable for metric determination.
Ordinal scale <-> Interval scale
Jury test results to be used for metric creation should be interval-scaled. Interval-scaled results can, for example, be obtained from a jury test with categorical evaluation. Results from a ranking test or paired comparison are usually ordinal-scaled and are not readily usable.
This is because these tests only ask for the ranking order, not for the perceptual distance between the individual sounds. Thus, it is unknown whether the distance between the first and second rank equals the distance between the second and third rank. If ordinal-scaled results are simply translated into numerical values, this suggests an equidistant distribution that may not correspond to reality. Yet, if the numerical values misrepresent the jury test results, the basis of the correlation analysis is incorrect. Although statistical tools exist to convert ordinal-scaled data into interval-scaled data (e.g., Bradley-Terry-Luce (BTL) models), these must be applied with care and are not recommended in many cases.
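To illustrate the idea behind such a conversion, the following sketch fits a simple Bradley-Terry model to a hypothetical paired-comparison win matrix using the classic minorization-maximization update. All numbers are made up for illustration; this is not an ArtemiS SUITE function, and a real application would require the care noted above.

```python
import numpy as np

# Hypothetical paired-comparison results for 3 sounds:
# wins[i, j] = number of times sound i was preferred over sound j.
wins = np.array([[0., 8., 9.],
                 [2., 0., 7.],
                 [1., 3., 0.]])

def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths p_i with the MM update
    p_i <- W_i / sum_j n_ij / (p_i + p_j)."""
    n = wins.shape[0]
    p = np.ones(n)
    comparisons = wins + wins.T          # n_ij: comparisons per pair
    total_wins = wins.sum(axis=1)        # W_i: total wins of sound i
    for _ in range(iters):
        denom = np.array([
            sum(comparisons[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        p = total_wins / denom
        p /= p.sum()                     # normalize (scale is arbitrary)
    return p

p = bradley_terry(wins)
# Log-strengths give interval-scaled scores that could serve as
# regression targets instead of raw rank numbers.
scores = np.log(p)
```

The log-strengths reflect perceptual distances rather than mere rank order, which is exactly what equidistant rank coding cannot provide.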
In order to obtain a robust metric that can actually replace performing jury tests, please consider the following guidance:
Appropriate experimental design
Only if the basis, i.e., the results of the jury test, is actually meaningful and adequately represents the perception of the sounds can the resulting metric provide reasonable predictions for further sounds.
Example: You conduct a jury test and ask the participants to evaluate the quality of sounds made by seat adjustment motors. The sounds used in the jury test were acquired using different recording systems in different recording situations so that the sounds differ not only from motor to motor, but also due to the recording equipment used and the environment chosen. This leads the participants to not only include the actual motor sound quality in their evaluations but also the recording quality. Thus, the jury test results will not reflect the actual subject of the study. A metric created on the basis of these jury test results can therefore not provide a convincing prediction for further seat adjustment motors.
Meaningful statistical evaluation
When evaluating jury test results based on statistics, care must be taken to ensure that valuable evidence is not simply “averaged away”.
Example: You have conducted a jury test and, despite all possible care in formulating the task, the participants have difficulties evaluating certain sounds. As a result, one group of participants rates these sounds very well and the other group rates them very poorly. If you simply average the results at this point, these sounds will receive a medium rating. However, this medium rating does not reflect either group's evaluation. A metric created on the basis of these scores will not give a good prediction of the sound quality. In such a case, you have to assess which results are to be considered for your sounds and must not include the evaluations of the other group in the averaging. You may need to revise your test design and conduct another jury test as a check, or ask about and document the reasons for the subjects' conflicting ratings, for example by conducting an appropriate interview.
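A simple spread check before averaging can flag such split ratings. The sketch below uses made-up ratings and an arbitrary threshold; in practice you would inspect the full rating distribution per sound rather than rely on a single cutoff.

```python
import statistics

# Hypothetical ratings for one sound on a 10-point scale:
# one juror group rates it well, the other poorly.
ratings = [2, 1, 2, 3, 2, 9, 8, 9, 10, 8]

mean = statistics.mean(ratings)      # a deceptively "medium" value
spread = statistics.stdev(ratings)   # large spread flags disagreement

# Assumed rule of thumb: a standard deviation approaching half the
# scale width hints at two conflicting juror groups, so the plain
# mean should not be fed into the metric unchecked.
suspicious = spread > 3.0
```

Here the mean lands mid-scale even though no juror actually rated the sound that way, which is precisely the "averaged away" evidence described above.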
Important information on creating metrics
ArtemiS SUITE Metric Project
The Metric Project in ArtemiS SUITE calculates a metric based on the correlation of two data series, e.g., the single values from physical-technical analyses and the perceptual evaluations from a jury test. Both manual input and semi-automatic determination of the calculation rule are possible. In semi-automatic mode, the Metric Project supports you by mapping the jury test results to the measurement analysis results in the best possible way, calculating a linear regression model. You can select any number of previously calculated analysis single values, and ArtemiS SUITE will automatically determine a corresponding metric.
The correlation coefficient R is one of the values displayed for each calculated analysis. If there is a strong linear correlation between the analysis and the data from the jury test, this will result in a high correlation coefficient (maximum value: R = 1).
In addition, the quality of the current metric can be checked in the Metric Project. For this purpose, the coefficient of determination R2 may be used, which indicates the proportion of the variance in the jury test data explained by the regression model. A high coefficient of determination thus indicates that the jury test results can be reproduced very well using the mathematical formula found and the results of the technical measurement analyses (maximum value: R2 = 1).
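The underlying computation can be sketched in a few lines of numpy. All single values below are hypothetical, and this is a generic least-squares illustration of R and R2, not the Metric Project's actual implementation:

```python
import numpy as np

# Hypothetical single values for 6 sounds and their jury scores
loudness  = np.array([12.0, 15.0, 9.5, 20.0, 11.0, 17.5])   # e.g. sone
sharpness = np.array([1.8, 2.1, 1.5, 2.6, 1.7, 2.3])        # e.g. acum
jury      = np.array([4.0, 6.0, 2.5, 9.0, 3.5, 7.5])

# Correlation coefficient R between one analysis and the jury data
R = np.corrcoef(loudness, jury)[0, 1]

# Linear regression metric: jury ≈ a*loudness + b*sharpness + c
X = np.column_stack([loudness, sharpness, np.ones_like(loudness)])
coef, *_ = np.linalg.lstsq(X, jury, rcond=None)
pred = X @ coef

# Coefficient of determination R²: share of jury-score variance
# explained by the regression model
ss_res = np.sum((jury - pred) ** 2)
ss_tot = np.sum((jury - jury.mean()) ** 2)
R2 = 1 - ss_res / ss_tot
```

With strongly correlated data, R approaches 1 and the fitted metric reproduces the jury scores almost exactly.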
Developing a robust metric
In principle, metric development should aim for a high coefficient of determination of the resulting calculation rule. This is because a high coefficient of determination indicates that the metric maps the jury test results well with the single values of the technical measurement analyses. However, a high coefficient of determination must not be the sole optimization criterion.
Instead, the goal of defining a robust metric is to be able to predict not only the results of the current jury test (training data) but also the sound quality of additional sound samples.
Concrete procedures
The following guidelines should be considered when developing a robust metric.

Only select analyses whose single values are plausibly related to the sound character under investigation; a high correlation alone is not sufficient.

Example: If you exclusively wish to evaluate broadband sounds, a high correlation to the single values of the tonality can only occur by chance. Despite an apparently high correlation, this single value should not be included in the metric.

Use only a few predictors. A high number of predictors may increase the correlation with the jury test results used, but it usually does not increase the predictive quality for unknown sounds, i.e., sounds that were not used to determine the metric. In order to create a robust metric, it is therefore often more appropriate to use only a small number of analyses as predictors.
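This effect can be demonstrated with a small simulation: jury scores are driven by one real analysis value, while eight further "analyses" are pure chance. The data are synthetic and the setup is only a sketch of the overfitting argument, not an ArtemiS SUITE workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: jury scores depend on ONE real analysis value plus
# noise; eight additional "analyses" are random numbers.
n_train, n_valid = 12, 12
n = n_train + n_valid
real = rng.normal(size=n)
chance = rng.normal(size=(n, 8))
jury = 5 + 2 * real + rng.normal(scale=0.3, size=n)

def fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def r_squared(X, y, coef):
    pred = X @ coef
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones((n, 1))
X_small = np.hstack([real[:, None], ones])           # 1 real predictor
X_big = np.hstack([real[:, None], chance, ones])     # + 8 chance predictors

tr, va = slice(0, n_train), slice(n_train, None)
coef_small = fit(X_small[tr], jury[tr])
coef_big = fit(X_big[tr], jury[tr])

r2_train_small = r_squared(X_small[tr], jury[tr], coef_small)
r2_train_big   = r_squared(X_big[tr],   jury[tr], coef_big)    # looks "better"
r2_valid_small = r_squared(X_small[va], jury[va], coef_small)
r2_valid_big   = r_squared(X_big[va],   jury[va], coef_big)    # typically worse
```

The big model always matches the training data at least as well as the small one, yet its advantage evaporates (or reverses) on the validation sounds, which is why a high training R2 alone must not drive predictor selection.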
Further information
Much useful guidance on creating robust metrics is provided in the following publication: Fiebig, Kamp: "Development of metrics for characterizing product sound quality", Proceedings of the Aachen Acoustics Colloquium 2015, pp. 123–133.
Example
The example is purely fictitious and serves only to illustrate the procedure. The numerical values given do not indicate when a sufficiently high correlation or coefficient of determination is achieved; there are no generally valid limits for these values. Instead, it must be decided on a case-by-case basis when the agreement between the jury test results and the metric is sufficiently high.
The following example is intended to illustrate that the use of additional analyses is not always expedient, and that the selection of technical analyses also needs to be made with care. For the example, the sounds of twelve hair dryers were evaluated in a jury test by several participants on a categorical, ten-point scale.
In order to determine a metric, single values were first calculated for a number of analyses, including loudness, sharpness, tonality, and the speech intelligibility index (SII).
Interpretation of the indicated values
After calculating the analysis single values, the correlation values R between the analysis values and the jury test results are displayed in a table for each analysis. In this table, the desired single values for the metric can be activated. A corresponding metric based on the activated single values is automatically calculated by ArtemiS SUITE. The formula for this metric as well as its correlation coefficient R and coefficient of determination R2 are displayed directly.
Rresid
In the present example, the single values of loudness and sharpness were activated first, as these analyses show a high correlation with the jury test results (R = 0.91 and R = 0.87, respectively). The coefficient of determination of the model based on these two analyses is R2 = 0.88. In order to increase it further, another single value is to be integrated into the metric. When selecting the additional parameter, it is not only the correlation coefficient R of the respective analysis that should be considered but also, for example, the correlation coefficient with the residuals, Rresid. In this context, the residuals represent the deviations between the values from the jury test and the values calculated with the current metric. If the residuals are small, the current metric reflects the values from the jury test well. The indicated value Rresid describes the linear dependency between the respective analysis single value and the residuals. A high Rresid value means that the single values of this (deactivated) analysis are highly correlated with the residuals. That is, this analysis can probably reduce the remaining prediction errors of the present metric and improve its coefficient of determination.
This is even possible if the correlation coefficient R of this analysis is lower than that of others. A small Rresid value suggests that activating this analysis will barely improve the metric.
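The Rresid idea can be sketched with numpy. The numbers below are invented (a loudness-only metric with tonality as candidate predictor), and the snippet only illustrates the principle of correlating a candidate analysis with the residuals; it is not the Metric Project's implementation.

```python
import numpy as np

# Hypothetical single values for 8 sounds
loudness = np.array([10., 14., 9., 18., 12., 16., 11., 15.])
tonality = np.array([0.2, 0.8, 0.1, 0.3, 0.9, 0.2, 0.6, 0.4])
jury     = np.array([3.6, 6.6, 3.0, 6.3, 6.3, 5.4, 5.1, 5.7])

# Current metric: least-squares regression on loudness only
X = np.column_stack([loudness, np.ones_like(loudness)])
coef, *_ = np.linalg.lstsq(X, jury, rcond=None)
residuals = jury - X @ coef

# R: plain correlation of the candidate analysis with the jury data.
# Rresid: its correlation with the residuals, i.e., with what the
# current metric still gets wrong.
R_plain = np.corrcoef(tonality, jury)[0, 1]
R_resid = np.corrcoef(tonality, residuals)[0, 1]
```

In this constructed case, tonality's plain R to the jury data is only moderate, but its Rresid is very high, so activating it would remove most of the remaining prediction error.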
In the present example, the speech intelligibility index (SII) actually has a higher correlation to the jury test results than tonality does. However, the correlation to the residuals is higher for tonality.
The SII is defined in such a way that high values indicate good speech intelligibility and low values indicate poor speech intelligibility. The jury test results were coded such that a good evaluation corresponds to a low numerical value. In contrast to loudness, sharpness, and tonality, speech intelligibility thus shows a negative correlation to the jury test results.
Improvement by an additional predictor
This means that activating tonality will increase the coefficient of determination of the resulting metric to a greater extent than that of the speech intelligibility index. This is because the speech intelligibility index responds to high levels, as does loudness.
Thus, activating the speech intelligibility index does not provide any additional information if a single value such as loudness is already activated. In contrast, tonality does provide additional information (a measure of the tonal components contained in the sound) and improves the coefficient of determination of the resulting metric to R2 = 0.90.
Nonetheless, the increase from 0.88 to 0.90 is only a minor improvement. In order to determine whether the metric is actually improved by the additional predictor, it should be checked with the help of a validation data set. This helps to rule out that the additional predictor merely causes overfitting to the training data, and to confirm that the metric improves the prediction quality for other sound samples as well.
Conclusion
The example thus shows that, when creating metrics, analyses with high correlation coefficients should not be the only ones taken into account. Instead, users need to consider which analysis contributes additional information about the sounds and covers another relevant aspect of the noise.
Using the Created Metric
Applying the sound quality metric
Provided that a robust metric was created, the evaluations of further sounds can subsequently be predicted numerically using the mathematical formula and the results of the technical measurement analyses. It is very important that the sound quality metric is only applied to sounds with similar sound characteristics. Only then can the metric be used to make meaningful predictions; using it for other types of sounds does not provide convincing results in many cases.
Example: Sounds of sports cars during acceleration were used to perform the jury tests and create the metric. The resulting metric will reproduce the sound quality of comparable recordings very well. However, the metric will fail if it is used to evaluate idle measurements of luxury cars. Even though in both cases the sounds were generated by internal-combustion engines and measured at comparable positions (e.g., at the passenger position inside the car) and with comparable equipment, these sounds are barely comparable and cannot be analyzed with the same metric in a meaningful way.
We would be happy to advise you on the development of your sound quality metrics. Our experienced engineers will be pleased to assist you throughout the development process with technical know-how and technical measurement infrastructure. Benefit from our many years of experience in the field of automated product noise evaluation, acoustic measurement methodology and the acquisition of jury test results!
Contact us at: engineering@HEAD-acoustics.com
You can find the PDF version of this article on the HEAD acoustics website!
Do you have any questions? Your feedback is appreciated!
For questions on the content of this Article: BCremeans@HEADacoustics.com
For questions about the linked document: Imke.Hauswirth@head-acoustics.com
For technical questions on our products: SVP-Support@head-acoustics.com