
Interlaboratory comparisons (Proficiency Testing) among calibration laboratories – compliance with ISO/IEC 17043

ABSTRACT
Calibration laboratories must often organise their own interlaboratory comparisons (ILCs) to evaluate/monitor their performance, as “in the field of calibration very few regularly organised PT schemes exist” [EA-4/18, “Guidance on the level and frequency of proficiency testing participation”, 2021]. Accreditation Bodies (ABs) will typically assess such a laboratory against “relevant elements” of ISO/IEC 17043. This article will discuss which of these elements are relevant to an ILC organiser (PT provider) and to an ILC participant, in the field of calibration. It also discusses how the assigned value (Reference Value) and its uncertainty may be estimated, focussing on small ILCs.

INTRODUCTION

(Clause numbers from ISO/IEC 17043:2023 are given below, for convenience.)

A calibration laboratory may often decide to organise an ILC itself, as
(i) the parameter it wishes to evaluate is not addressed by any scheme offered by a local accredited PT provider, or
(ii) the range or particular points of interest to the laboratory are not included in available PT schemes, or
(iii) the uncertainty achievable by available PT schemes may be unacceptable, owing to lengthy circulation or characteristics of the artefact, or
(iv) the time before receiving the results from an available PT scheme may be too long to meet the laboratory’s needs.

Note: The terms “PT” and “ILC” are used interchangeably in this article – commercial calibration laboratories almost always participate in ILCs for the purpose of evaluating their performance, so such ILCs would typically be classified as “PT”, according to the Introduction of 17043. Also, as most self-organised PT involves seven or fewer laboratories, it would usually be classified as a “small ILC” according to the EA definition. (As ILCs are intended to compare results between laboratories, we count the number of laboratories (having independent measurement standards, equipment, etc.) that participate, not the number of metrologists submitting results.)

If a calibration laboratory being assessed by an AB is the ILC organiser (PT provider), they will be expected to comply with some clauses of 17043 related to personnel (6.2), design and planning (7.2), stability of ILC items (7.3), evaluation and reporting (7.4), etc.
If the lab is merely a participant, they will be assessed on the content of the PT report (7.4.3) and the fitness-for-purpose/appropriateness of the ILC (performance of the lab and criteria used to evaluate performance).
Note: Regarding those clauses applicable only to a PT provider (ILC organiser), a PT participant may still be required “to ensure that the organiser … fulfils the relevant requirements” [EA-4/21, “Guidelines for the assessment of the appropriateness of small interlaboratory comparison within the process of accreditation”, 2026, section 5].

ASSIGNED VALUE (REFERENCE VALUE)

(i) If the assigned value of the ILC item (also called Reference Value, RV) is obtained externally, from one non-participating laboratory, the ILC is, in effect, a series of bilateral comparisons, of each participant with the Reference Lab. The credibility of the Reference Lab is critical, and, as has been shown in a previous article, may not be taken for granted.

(ii) If RV is obtained from one ILC participant, it is effectively the same situation as in (i).

(iii) If RV is a consensus value, the following approach is recommended:

a) only one result per laboratory to be included in RV, to avoid biasing RV towards labs with many participating metrologists, and

b) exclude a laboratory’s own result from the RV to which it is compared, to avoid it biasing RV “towards itself” (especially for a small number of participants), and

c) use the simple (not weighted) mean as RV, since many laboratories’ accredited CMCs are not proportional to their capabilities, but arise, for example, from one ILC of many years ago, or from a very conservative uncertainty budget, or from the modest accuracy needs of their clients, not from the real capabilities of their equipment and method.

Must an ILC evaluate the difference between the participant’s result and the “true value”, or is it sufficient to demonstrate equivalence between participants?

17043 clause 7.2.3.2 requires that “PT schemes in the area of calibration shall have assigned values with metrological traceability.” If each of the laboratories uses a calibrated reference standard and makes a credible uncertainty estimate, then RV may be obtained from one laboratory or from a consensus of several: in both cases, “the result can be related to a reference through a documented unbroken chain of calibrations, each contributing to the measurement uncertainty” [VIM], so it is traceable.

One might argue that the purpose of the ILC is to evaluate performance, so one cannot assume that a participant’s uncertainty estimate is credible. However, let’s consider the case where RV comes from an external laboratory: in effect, the ILC is a series of bilateral comparisons, of each participant with that external laboratory. All that is proven is equivalence between each participant and the Reference Lab. In some cases, the Reference Lab may use a “reference method” [ISO 13528:2015 clause 7.5.1], but, in the calibration domain, the Reference Lab is usually one that reports a smaller uncertainty than any of the participants, while using the same method as they do. As the external laboratory’s traceability is not fundamentally different from that of an ILC participant, if an external RV has traceability, so does one calculated from participants’ results.

In the strict sense, all any ILC ever proves is equivalence – we rely on certain assumptions to relate this equivalence to “competence”, or agreement with the “true value”. These assumptions are:
(i) the laboratories contributing to RV are “determined to be reliable, by some pre-defined criteria, such as accreditation status or on the basis of prior performance” [13528:2015 clause 7.7.1.1], and
(ii) RV does not “include unknown bias due to the general use of faulty methodology … not reflected in the … uncertainty” [13528:2015 clause 7.7.1.4 b)].
The first assumption may be satisfied by choosing an external lab, or subset of participants, whose reliability is supported by historical performance. (As mentioned above, this is not fool-proof.)
The second assumption, that there is no common bias in the results contributing to RV (which would not be evident in the spread/dispersion of these results), is more justified, the greater the variety of equipment and variations in method that are applied by laboratories contributing to RV. So, the more independent contributors to RV, the better.
It does happen, even in Key Comparisons at NMI level, that common bias is only discovered by the “discrepant” results of a few laboratories (for example, in CCT-K7 of 2004, three out of 21 participants were “correct”), so neither the size nor the level of the ILC can completely remove the risk of bias.

In ILCs in the calibration domain, we do not realistically aim to prove anything more than equivalence to a credible Reference Value, no matter whether that Reference Value originates from one laboratory or several, participants or not.

Understanding that a PT Reference Value, unlike a Key Comparison Reference Value (KCRV), is not intended to be the best estimate of the SI value of the measurand, it is easier to accept the concept of a different consensus Reference Value for each ILC participant, which is the approach recommended by this author in (iii) b) above and in a previous paper.

The way in which the Reference Value and its uncertainty are estimated is critical to the value of any ILC, especially a small ILC using a consensus value, where the “small dataset makes it challenging to accurately identify the distribution of the data and to reliably detect outliers” [EA-4/21:2026 section 4.1].
However, remembering that the ILC’s “modest” goal is to demonstrate equivalence with a credible Reference Value, within the reported uncertainty, the following approach is recommended (considering a comparison between four laboratories, A to D), if a consensus value is to be used:

(i) For laboratory A, use RV = mean of one result each from laboratories B, C and D. If more than one metrologist from each lab submits a result, the lab must decide which one of its results will contribute to RV, before having sight of other labs’ results.

(ii) For laboratory A, use U(RV) = 1/3∙√[U(B)^2 + U(C)^2 + U(D)^2]. This formula for the uncertainty of the (unweighted) mean may be understood by analogy with the standard deviation of the mean: if U(B) = U(C) = U(D) = U, then U(RV) = 1/3∙√[3U^2] = 1/√3∙U, i.e., the uncertainty is divided by the square root of the number of results.
Do not use the spread of participants’ results as U(RV), as the goal is to demonstrate equivalence within the reported uncertainties.
If an uncertainty component related to artefact drift must be added, this should be evident from stability checks performed at the start and end of the ILC, or other evidence of stability included in the ILC report. If, after consideration of possible artefact drift, the resultant uncertainties U(RV) and U(LV) (U(A), U(B), U(C) or U(D)) do not account fully for the observed spread of results (“over-dispersion”) [13528:2015 clause 7.6.3 c)], this may be interpreted as a failure, on the part of the organiser or participants, to identify all relevant uncertainty contributors.
(The ILC protocol, or instructions for participants, should specify all conditions that might significantly affect the comparability of results. Over-dispersion is a sign that one or more parties neglected or under-estimated some such factors.)

(iii) Calculate the normalised error in the usual way, En = (LV – RV)/√[U(LV)^2 + U(RV)^2], and consider |En| ≤ 1 to be acceptable.
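As a sketch of steps (i) to (iii), the following Python snippet computes the consensus RV, U(RV) and En for laboratory A, excluding A’s own result from RV. All the numerical values are invented, for illustration only.

```python
import math

# Hypothetical results and expanded (k=2) uncertainties for labs A to D;
# all numbers are invented for illustration.
results = {"A": 100.012, "B": 100.004, "C": 100.010, "D": 99.998}
U = {"A": 0.020, "B": 0.015, "C": 0.018, "D": 0.025}

def evaluate(lab):
    """RV, U(RV) and En for `lab`, excluding its own result from RV."""
    others = [k for k in results if k != lab]
    n = len(others)
    rv = sum(results[k] for k in others) / n                       # simple (unweighted) mean
    u_rv = math.sqrt(sum(U[k] ** 2 for k in others)) / n           # (1/n)*sqrt(sum of U^2)
    en = (results[lab] - rv) / math.sqrt(U[lab] ** 2 + u_rv ** 2)  # normalised error
    return rv, u_rv, en

rv_A, u_rv_A, en_A = evaluate("A")
print(f"RV = {rv_A:.4f}, U(RV) = {u_rv_A:.4f}, En = {en_A:.2f}")
```

With these invented numbers, |En| ≤ 1, so laboratory A’s performance would be judged acceptable per (iii).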

In the case of a bilateral comparison, the above approach reduces to: RV for participant A is participant B’s result (with its uncertainty), and vice versa. If the participants’ uncertainties are similar, a bilateral comparison evaluates the performance of both participants. If U(A) is significantly smaller than U(B), then only participant B is rigorously evaluated.

May the uncertainty of the Reference Value, U(RV), be larger than the laboratory’s reported uncertainty, U(LV)? In South Africa, yes: no regulation mandates U(RV) ≤ U(LV), nor, in fact, how often an ILC should test a laboratory’s CMC. This is perhaps not unreasonable – for labs having the smallest CMCs in the country, insisting that every ILC test the CMC may be onerous. But, at least in preparation for the initial assessment, |LV-RV|, U(LV) and U(RV) should all be smaller than or equal to the lab’s proposed CMC. And, after any change in the lab’s measurement standards, equipment or method that might significantly change the achievable uncertainty, the lab should test its CMC somehow, preferably by ILC.

ILC ORGANISER (PT PROVIDER)

The following clauses of 17043 should be addressed by an ILC organiser:

If the organiser is a 17025-accredited lab, then 4.1 Impartiality, 4.2 Confidentiality and 5. Structural requirements are already addressed elsewhere, and need not specifically be addressed in the context of PT.

6.2 Personnel should be competent and authorised to organise ILCs. “it is preferable for the personnel performing the measurements not to overlap with those organising the ILC. To prevent collusion, the ILC organiser should ensure that personnel performing measurement are not informed in advance of … assigned values” [EA-4/21:2026 section 5.2.1]. In small labs, overlap between the organiser and participant may be unavoidable. In such cases, the organising lab should preferably perform measurements before the lab(s) providing RV.

In an accredited cal lab, 6.3 Facilities should be addressed elsewhere.

6.4 Externally provided products and services: If RV is provided by an external lab, criteria for choosing such a lab should be documented.

In an accredited cal lab, 7.1 Contract review should be addressed elsewhere.

7.2 Design and planning: This is a “key focus of the assessment” [EA-4/21:2026 section 5.3.2].
7.2.1.3 The PT provider plan and/or instructions for participants (ILC protocol) should address the following:
a) Main contact person. If organised jointly, the list of persons or CABs involved.
d) List of participants.
e), f), g) The measurand or characteristic to be determined: All factors that must be harmonised between participants, to ensure comparability of results, should be specified. For example, for an infrared thermometer, the size of and distance to the target should be specified, as well as the preset emissivity of the thermometer (if not obvious).
h) Quality control (stability checks) required for the ILC item (also addressed in 17043 7.3.2): If the artefact should return to the organiser for periodic intermediate checks, the frequency of such checks should be planned. For example, a PRT may require an ice point check after visiting every participant.
j), k) Timeframe of the ILC: When each participant is scheduled to measure, deadlines for submission of results.
l) Handling, preparing, measuring and shipping (also addressed in 17043 7.3.3-4): How the artefact should be handled, prepared for measurement, measured and transported. Information on the method(s) or procedures to be used by the participants.
n) Description of the reporting format for participants.
o), p), r) Description of the method for evaluating the comparability of the results, including statistical analysis and criteria used for performance evaluation. See the discussion of ASSIGNED VALUE (REFERENCE VALUE), above. Discussed further in 17043 7.2.2 and 7.2.3.
s) Reporting format from the ILC organiser.
t) Confidentiality: Results should be anonymised in the report, unless the participants waive confidentiality.

7.3.5 Instructions for participants: See 7.2.1.3 e), f), g), j), k), l), n), above.

7.4.3.2 The PT report should include:
Date of the ILC.
a) Name and contact details of ILC organiser.
g) Identification of the small ILC scheme or round.
h) Description of the ILC item, including how the stability of the ILC item was determined.
i) Participants’ results.
j), k), l), m) Method for evaluating RV and U(RV), and their resultant values.
p) Participants’ performance.
s) Comments and recommendations based on the outcome of the ILC.

In an accredited cal lab, 7.5 Records, control of data, ensuring validity of results and non-conforming work, 7.6 Complaints and 7.7 Appeals should be addressed elsewhere.

8.8 Internal audit and 8.9 Management review should include self-organised ILCs.

ILC PARTICIPANT

The lab’s PT participation plan should determine the level and frequency of ILC participation via a risk analysis, considering [EA-4/18]:
∙use of internal quality control measures, such as intermediate checks
∙number of measurements undertaken
∙turnover of technical staff
∙experience and knowledge of technical staff
∙source of metrological traceability
∙known stability/instability of the methodology
∙unsatisfactory results in past PT

The performance of the lab, and criteria used to evaluate performance, as documented in ILC reports, will be assessed.

CONCLUSIONS

∙The purpose of PT/ILC for commercial calibration laboratories is concluded to be: to demonstrate equivalence with a credible Reference Value (RV), within reported uncertainties.
∙For RV to be credible, it should be determined from (i) the results of reputable laboratories, (ii) in a mathematically appropriate way.
∙Avoiding bias in RV is especially important for small ILCs: only one result per laboratory should contribute to RV, and the laboratory being evaluated should be excluded from the RV to which its result is compared.
∙The uncertainty of the Reference Value, U(RV), should be determined from participants’ uncertainties, not from the spread of results, as the goal is to demonstrate equivalence within reported uncertainties. (This makes it difficult to use the median and Median Absolute Deviation, MADe, as RV and U(RV).)
∙A consensus Reference Value may be as metrologically traceable as one from an external laboratory.
∙For a consensus RV, the mean is preferred over the weighted mean, as participants’ reported uncertainties may vary widely, without technical justification.
∙A PT provider (ILC organiser) should have:
– personnel that are competent in the technical domain of the ILC and in statistical analysis of results,
– a PT provider plan and instructions for participants (protocol) that carefully define the measurand and conditions of measurement (so that results are comparable), and plan appropriate intermediate checks on ILC artefacts.
∙A PT participant should ensure
– that their PT participation plan plans the frequency of ILC participation in consideration of risk mitigating or enhancing factors present in their laboratory,
– that the ILCs they participate in do support their required performance (CMCs).

Contact the author at LMC-Solutions.co.za

Traceability from secondary fixed points in thermometry and hygrometry

ABSTRACT
Secondary fixed points, such as the ice point and palladium melting point in thermometry and saturated salt solutions in hygrometry, may be more reproducible than the interpolating instruments (PRTs, thermocouples and relative humidity hygrometers) with which they are used. For this reason, they are good starting points for metrological traceability in mid-level calibration laboratories.
However, the traceability chain becomes less clear in such cases, as they are neither (i) direct realisations of an SI unit, nor (ii) artefacts that may be sent to an NMI for calibration. (Such fixed points are often generated/realised when needed and thereafter disposed of.) In fact, traceability is to a method of preparation/realisation, often one which is internationally recognised and codified in a documentary standard.
However, the process of proving and maintaining traceability is similar for secondary and primary fixed points, as
(i) primary fixed points and their realisation are subject to the same sources of uncertainty, error or drift that affect secondary fixed points, and
(ii) primary fixed points (as well as interpolating standards such as PRTs) require both a) interlaboratory comparisons (ILCs) to validate the user’s ability to realise/use them, and b) intermediate checks to confirm their continuing accuracy between ILCs, in the same way that secondary fixed point realisations do.
Hence, it is argued that metrological traceability may be as reliably (and often more accurately) maintained using secondary fixed point standards (periodically benchmarked against other laboratories via ILCs using suitable transfer instruments, and subject to intermediate checks between such ILCs) as using more “traditional” thermometry and hygrometry measurement standards.
Considering the typical drift of interpolating instruments and fixed points over time, a hybrid calibration/fixed point approach is recommended for PRT thermometry, and a fully fixed point-based approach is recommended for relative humidity hygrometry.

INTRODUCTION

Metrological traceability in a mid-level Temperature or Humidity calibration laboratory is most simply achieved by periodically submitting an interpolating instrument (such as a PRT, thermocouple or hygrometer) for calibration at a more accurate laboratory, such as a National Metrology Institute (NMI) with Calibration and Measurement Capabilities (CMCs) published in Appendix C of the BIPM Key Comparison Database (KCDB) [ILAC-P10:2020, ISO/IEC 17025:2017 Annex A].
(High-level laboratories may maintain primary standards, such as ITS-90 defining fixed points in Temperature or Josephson voltage and quantum Hall resistance standards in the Electrical field, but there are no such commercial labs in South Africa.)

Traceability via periodic calibration is reasonable, if the drift of the calibrated reference standard is sufficiently small in the interval between calibrations. For the purposes of this article, we will consider drift up to one-third of the typical expanded uncertainty to be tolerable. (An uncertainty component smaller than 1/3 of the combined uncertainty is effectively negligible.)
In this article, it will be shown that
(i) while high-quality industrial (“semi-standard”) PRTs may drift little enough to rely only on periodic calibration, improved accuracy is achieved by “re-zeroing” the PRT against the ice point (i.e., applying a fresh R0 or Rtp value) between calibrations;
(ii) PRTs of moderate quality drift sufficiently that intermediate checks are essential and re-zeroing is highly recommended;
(iii) high-quality RH hygrometers are often stable enough to get by with a two-year calibration interval, but reference salt solutions may only need benchmarking every three to four years;
(iv) RH hygrometers of moderate quality may drift so much that a one-year calibration interval is barely frequent enough.

On the basis of this analysis of drift, it will be recommended that mid-level Thermometry laboratories use the ice point together with calibrated PRTs as the basis of their traceability, and that Hygrometry laboratories without access to dewpoint standards use reference salt solutions as their starting point for traceability.

DRIFT/REPRODUCIBILITY OF IPRTs AND THE ICE POINT

Let’s assume that a mid-level Thermometry laboratory wishes to achieve an expanded uncertainty of 0.1 K with its industrial PRT (IPRT) reference standard. So, they tolerate drift up to 0.033 K before recalibrating: how long may the calibration interval be?

High-quality IPRT drifts 9.7 μK/day, or 0.0035 K/year.

At 0.0035 K/year, the calibration interval of this high-quality IPRT may be as long as nine years.
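The interval arithmetic above can be sketched as follows, using the one-third-of-uncertainty drift tolerance adopted earlier in this article:

```python
# Tolerable drift is one third of the target expanded uncertainty (the
# convention adopted in this article); interval = tolerance / drift rate.
U_target = 0.1             # K, target expanded uncertainty of the IPRT
tolerance = U_target / 3   # K, tolerable drift between calibrations
drift_rate = 0.0035        # K/year, high-quality IPRT (from the data above)
interval_years = tolerance / drift_rate
print(f"{interval_years:.1f} years")
```

The result, around nine and a half years, is where the nine-year figure above comes from.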

Is the ice point reproducible enough to improve this IPRT’s performance?

ILCs at the ice point between LMC Solutions (U(LV) = 0.020 K) and NMISA (U(RV) = 0.010 or 0.008 K) or Burns Engineering (Jan 2020, U(RV) = 0.025 K).

Yes, 2015, 2017 and 2024 ILCs show agreement with NMISA within ~0.01 K (and the less accurate 2020 ILC shows agreement with the reference lab within 0.02 K), so re-zeroing the PRT’s calibration function every three years by applying a fresh R0 value may be beneficial.

The author has experience of a moderate-quality IPRT, subjected to unknown usage, where R0 dropped by 0.06 K over three years. If the drift was uniform at 0.02 K/year, an intermediate check at the ice point (and possible re-zeroing) at least every six months would be highly recommended for such a reference standard.

DRIFT/REPRODUCIBILITY OF RH HYGROMETERS AND SATURATED SALT SOLUTIONS

Let’s assume that a mid-level Hygrometry laboratory wishes to achieve an expanded uncertainty of 1.5 %rh with its relative humidity (RH) hygrometer reference standard. So, they tolerate drift up to 0.5 %rh before recalibrating: how long may the calibration interval be? First, we consider two high-quality RH hygrometers:
(Note: The data series graphed below are offset from each other for greater clarity, so only the change/variation in each series should be considered.)

High-quality RH hygrometer drifts 0.0005 %rh/day or 0.2 %rh/year [Jonker et al, “The Humidity Calibration Facility of the National Metrology Institute of South Africa (NMISA)”, Int J Thermophys, 2008].

High-quality RH hygrometer drifts 0.0006 %rh/day or 0.2 %rh/year.

At 0.2 %rh/year, these high-quality RH hygrometers may be used for around two years before recalibration.

Are saturated salt solutions more stable or reproducible than this?

Saturated salt solution capsules drift by less than 0.5 %rh over four years, or 0.1 %rh/year [Jonker et al, “The Humidity Calibration Facility of the National Metrology Institute of South Africa (NMISA)”, Int J Thermophys, 2008].

Yes, at 0.1 %rh/year, salt solutions appear to be at least twice as reproducible or stable as high-quality RH hygrometers.

Here are reproducibility data of home-made saturated salt solutions:

ILCs of home-made saturated salt solutions vs NMISA (6) or other labs (2), via RH hygrometers.

Estimating the achievable uncertainty from the spread of ILC results (LV-RV) over nine years, U(k=2) = (max-min)/√3 = 1.0 to 1.5 %rh. (Remember that the series are offset from each other in the graph: in fact, (LV-RV) is within ±1.5 %rh for all but one result over the nine years.)
(Note that the NMISA salt solutions in the preceding graph are “sealed” in capsules with a semi-permeable membrane, while the home-made ones are open. While the former may be more stable as long as a saturated solution persists, the latter are more easily refreshed by adding or removing solid or liquid. A salt solution capsule requires careful storage in an appropriate humidity-controlled environment, for a long operating life.)
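The spread-based estimate above amounts to the following sketch, with invented (LV – RV) values standing in for the graphed data:

```python
import math

# Invented (LV - RV) differences in %rh, for illustration only; the
# article's actual values are in the graph above.
diffs = [0.3, -0.5, 0.8, -0.2, 0.6, -0.7, 0.1, 0.4]
U_k2 = (max(diffs) - min(diffs)) / math.sqrt(3)   # U(k=2) = (max - min)/sqrt(3)
print(f"U(k=2) = {U_k2:.2f} %rh")
```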

How do moderate-quality RH hygrometers compare to salt solutions?

Moderate-quality RH hygrometer drifts 0.001 %rh/day or 0.4 %rh/year.

Moderate-quality RH hygrometer drifts 0.003 %rh/day or 1 %rh/year.

At 0.4 to 1 %rh/year, these moderate-quality RH hygrometers would need recalibration every 6 to 12 months to meet the drift requirement. Considering these drift rates, it would be preferable to base a mid-level Hygrometry laboratory’s traceability on saturated salt solutions (such as those studied by Greenspan), with verification of the salt standards by RH hygrometers (check hygrometers).

CONCLUSIONS

∙Measurement data over periods of five to nine years demonstrate that:
– the ice point may be reproducible to 0.01 K, equivalent to three years of drift for a high-quality IPRT, or six months of drift for a moderate-quality IPRT,
– a saturated salt solution may be stable or reproducible to 0.1 %rh/year, while an RH hygrometer may drift 0.2 to 1 %rh/year.
∙Based on these drift data, it is recommended that
– an IPRT reference standard be checked periodically at the ice point, and its calibration data be adjusted using a new value of R0, when observed drift at the ice point reaches 0.01 K,
– salt solutions be used as reference standards in preference over RH hygrometers, unless RH hygrometers of the highest quality are available,
– any fixed point, but particularly a salt solution, should be subject to intermediate checks, typically using one or more check hygrometers (or a check PRT for a thermometric fixed point).

Contact the author at LMC-Solutions.co.za.

Interpolating between discrete calibration points: least squares curve fitting

ABSTRACT
In preceding publications, the accuracy of interpolation using reference functions for temperature sensors (thermocouples and platinum resistance thermometers (PRTs)), as well as a crude approach to interpolation uncertainty in the absence of any knowledge of the interpolating function, have been discussed. This article discusses how to fit curves to PRT calibration results using the linear least squares technique (implemented using matrix functions in Microsoft Excel or Libreoffice Calc), both directly to the measured data and to deviations from a reference function. (An overview of the fitting method may be found in [Numerical Recipes in Fortran 77, Chapter 15. Modeling of data].)

INTRODUCTION

In the preceding articles, we saw that
(i) industrial PRT calibration results could be interpolated “piece-wise” (between any two calibration points a few hundred degrees Celsius apart) to an accuracy of around 0.05 °C, when expressed as deviations from the Callendar-van Dusen reference functions, and,
(ii) without any knowledge of the accuracy of the interpolating function, the accuracy of interpolated values could be coarsely estimated as 0.15 to 0.5 °C (for the above-mentioned spacing between calibration points).
While the latter approach can be applied generically to almost any calibration data, it has the drawback that the uncertainty of interpolated values is unrealistically large when working close to a calibration point. In the present article, the most accurate approach to the problem will be implemented, namely, to fit a curve to the complete set of calibration results. Firstly, a Callendar equation will be fitted directly to the measured data, and, secondly, a quadratic polynomial will be fitted to the deviations of the measured data from the ITS-90 PRT reference function. As the ITS-90 reference function deals with most PRT non-linearity, it is expected that the latter approach will produce the more accurate interpolation.

THE MODEL EQUATION

The Callendar equation, applicable above 0 °C, is: Rt = R0∙(1 + A∙t + B∙t^2), where Rt is the resistance (in Ω) at temperature t (in °C) and R0 is the resistance (in Ω) at the ice point (0.00 °C). Rearranging the equation, A∙t + B∙t^2 = Rt/R0 – 1, indicating that the independent variable x = t and the dependent variable y = Rt/R0 – 1.
(As the sensitivity, d(Rt/R0)/dt, will be required to convert uncertainties from temperature units, we also note that d(Rt/R0)/dt = A + 2∙B∙t. We will use the coefficients from IEC 60751, namely, A = 3.9083e-3 and B = -5.775e-7, to calculate sensitivities below.)
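A minimal sketch of the model and its sensitivity, using the IEC 60751 coefficients quoted above (the nominal R0 of 100 Ω is an assumption, for illustration):

```python
# Callendar equation and sensitivity with the IEC 60751 coefficients quoted
# above; the nominal R0 = 100 ohm is an assumption for illustration.
A = 3.9083e-3    # 1/degC
B = -5.775e-7    # 1/degC^2
R0 = 100.0       # ohm

def callendar_R(t):
    """Rt = R0*(1 + A*t + B*t^2), applicable above 0 degC."""
    return R0 * (1 + A * t + B * t ** 2)

def sensitivity(t):
    """d(Rt/R0)/dt = A + 2*B*t."""
    return A + 2 * B * t

print(callendar_R(100.0))     # resistance at 100 degC
print(R0 * sensitivity(0.0))  # resistance sensitivity near the ice point
```

Note that R0∙(A + 2∙B∙t) at t = 0 °C gives approximately the 0.391 Ω/°C sensitivity used below to derive R0.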

The measured data are as follows:

IPRT calibration data.

To find R0 from R(0.01 °C): R0 = R(0.01 °C) + (0.00 °C – 0.01 °C) ∙ 0.391 Ω/°C. Uncertainties are converted to the same units as y. (The 1st data point is used to determine R0, so only the four subsequent data points are used in the fit.)
IPRT data converted to appropriate units.
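The R0 derivation above amounts to one line of arithmetic; here is a sketch with an invented reading R(0.01 °C):

```python
# Shifting an invented reading R(0.01 degC) to 0.00 degC using the local
# sensitivity of 0.391 ohm/degC, as described above.
R_at_001 = 100.0039                      # ohm, hypothetical measured value
R0 = R_at_001 + (0.00 - 0.01) * 0.391    # ohm, at the ice point (0.00 degC)
print(f"R0 = {R0:.4f} ohm")
```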

The general model equation is c1∙f1(x) + c2∙f2(x) + … = y. For the Callendar equation, c1 = A, f1(x) = x, c2 = B and f2(x) = x^2.
The four data points lead to four simultaneous equations, namely:
c1∙f1(x1) + c2∙f2(x1) = y1
c1∙f1(x2) + c2∙f2(x2) = y2
c1∙f1(x3) + c2∙f2(x3) = y3
c1∙f1(x4) + c2∙f2(x4) = y4
Expressed in matrix notation, they are:
The unweighted simultaneous equations, expressed in matrix notation, are: A x c = y.

WEIGHTING FACTORS

Now, apply the weighting factors 1/ui to each equation (smaller std uncertainty => larger weight). The weighted simultaneous equations, expressed in matrix notation, are: W x A x c = W x y.
[Figure: the diagonal weighting matrix W, with elements W_ii = 1/u_i.]

THE SOLUTION – NORMAL EQUATIONS

The optimal (least squares) solution to this over-determined system is obtained by left-multiplying by the matrix transpose (W x A)^T = A^T x W^T, to obtain the normal equations:
[Figure: the normal equations, A^T x W^T x W x A x c = A^T x W^T x W x y.]
Matrix W^T x W simply contains 1/variance at each calibration point:
[Figure: the diagonal matrix W^T x W, with elements 1/u_i^2.]
[Figure: the solution for the coefficients, c = (A^T x W^T x W x A)^-1 x A^T x W^T x W x y.]
Then, to find the fitted y-value for any value of x:
[Figure: fitted value y = c1∙f1(x) + c2∙f2(x).]
[Figure: variance(y) = f^T x (A^T x W^T x W x A)^-1 x f, where f = (f1(x), f2(x))^T is the vector of basis functions.]
Finally, expanded uncertainty of y = 2*√variance(y): this is called the “propagated uncertainty” of y at x.
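The matrix steps above can also be sketched in Python with numpy; the x, y and u data below are invented (roughly following an IEC 60751 PRT), for illustration only:

```python
import numpy as np

# Weighted linear least squares via the normal equations, as derived above.
# x = t (degC), y = Rt/R0 - 1, and standard uncertainties u (same units as y).
x = np.array([50.0, 100.0, 150.0, 200.0])
y = np.array([0.193973, 0.385051, 0.573247, 0.758563])   # invented data
u = np.array([2e-5, 2e-5, 3e-5, 3e-5])                   # invented uncertainties

A = np.column_stack([x, x ** 2])   # design matrix: f1(x) = x, f2(x) = x^2
W = np.diag(1.0 / u)               # weighting matrix, W_ii = 1/u_i

AtWtW = A.T @ W.T @ W
cov_c = np.linalg.inv(AtWtW @ A)   # covariance matrix of the coefficients
c = cov_c @ (AtWtW @ y)            # least squares coefficients (A, B of Callendar)

def fitted(t):
    """Fitted y and its propagated expanded (k=2) uncertainty at t."""
    f = np.array([t, t ** 2])
    return c @ f, 2.0 * np.sqrt(f @ cov_c @ f)

y_fit, U_fit = fitted(120.0)
```

Each operation corresponds one-to-one to the TRANSPOSE(), MMULT() and MINVERSE() spreadsheet functions.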

The numerical implementation of this “linear least squares” curve-fitting technique follows, using the Excel or Calc matrix functions TRANSPOSE(), MMULT() and MINVERSE():
[Figure: spreadsheet implementation using TRANSPOSE(), MMULT() and MINVERSE()]
Now to check the fitted curve against the measured data (residual = measured – fitted):
[Table: residuals (measured – fitted)]
Are the residuals small enough, or, does the model fit the data well enough, relative to the uncertainties?: The chi-squared statistic, chi^2 = [(residual_1/u1)^2 + (residual_2/u2)^2 + …], will tell us:
[Table: chi-squared calculation]
Chi-squared is larger than the degrees of freedom (= number of data points – number of fitted parameters = 4 – 2 = 2), so either the uncertainties are underestimated, or the model does not represent the data well (relative to the uncertainties). If chi^2 ≲ d.o.f., then the fit is “good enough”. (This is only strictly true if the uncertainties at different points are uncorrelated, but we will assume that this is the case.)
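
The same curve fit can be scripted outside a spreadsheet. The sketch below uses numpy with illustrative, made-up data (the measured table above is shown as an image), but the structure – design matrix, weights, normal equations, covariance and chi-squared – follows the procedure described in the text:

```python
import numpy as np

# Illustrative, MADE-UP data for the model y = c1*t + c2*t^2 (Callendar,
# with y = Rt/R0 - 1); these are not the measured values from the table.
t = np.array([50.0, 100.0, 150.0, 200.0])       # degC
y = 3.9083e-3 * t - 5.775e-7 * t**2             # exact Callendar values
y = y + np.array([2e-5, -1e-5, 1e-5, -2e-5])    # plus small "errors"
u = np.array([3e-5, 3e-5, 4e-5, 4e-5])          # std uncertainties of y

# Design matrix A (columns f1=t, f2=t^2) and weighting matrix W = diag(1/u)
Amat = np.column_stack([t, t**2])
W = np.diag(1.0 / u)

# Normal equations: (A^T W^T W A) c = A^T W^T W y
M = Amat.T @ W.T @ W @ Amat
c = np.linalg.solve(M, Amat.T @ W.T @ W @ y)    # fitted coefficients
C = np.linalg.inv(M)                            # covariance matrix of c

def fit(x):
    """Fitted y and its expanded (k=2) propagated uncertainty at x."""
    f = np.array([x, x**2])
    return f @ c, 2.0 * np.sqrt(f @ C @ f)

resid = y - Amat @ c                            # measured - fitted
chi2 = np.sum((resid / u) ** 2)                 # compare to d.o.f. = 2
```

With well-estimated uncertainties and a good model, chi2 here comes out well below the two degrees of freedom.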

DEVIATION FROM ITS-90 REFERENCE FUNCTION

Perhaps direct fitting to the Callendar equation is not good enough, at this level of uncertainty. Let’s try deviations from the ITS-90 reference function – the reference function should take care of much of the PRT’s non-linearity, leaving us with smoother data to fit.

For the range 0 to 660 °C, the ITS-90 document gives the deviation function as: W-Wr = a∙(W-1) + b∙(W-1)^2 + c∙(W-1)^3, where W = Rt/Rtp and Wr is the value of the reference function.
It is good practice in curve fitting to use as few coefficients as will adequately fit the data, so we will try two, a and b.
[Table: ITS-90 deviation data]
[Figure: implementation of the deviation-function fit]
Now to check the fitted curve against the measured data (residual = measured – fitted):
[Table: residuals of the deviation fit]
It can be seen that this fit is better than the previous one. (Residuals are around half the size.)
[Table: chi-squared for the deviation fit]
Chi-squared is smaller than the degrees of freedom, indicating that the curve fits the data well, within the uncertainties.

Below is a table of fitted values. Because of the structure of the ITS-90 functions, the uncertainty of the curve is zero at the water triple point (WTP, 0.01 °C). The uncertainty at WTP is added to the uncertainty of the curve, in the rightmost column.
[Table: fitted values, with propagated uncertainty and total uncertainty]

CONCLUSIONS

∙Both curves, fitted directly to measured PRT data, and to deviations from a reference function, represent the behaviour of the UUT better than do the individual data points. (The curves take into account all the data points, thereby potentially “smoothing out” random errors in measurement.)
∙The ITS-90 deviation function fits the data better, as is expected when using a good reference function.
∙When weighted least squares fitting is performed, the covariance matrix provides a statistically justified manner of propagating uncertainty to intermediate points, which results in small uncertainties. (To be strictly correct, we should have considered possible correlation between data points, but, as long as the dominant uncertainty component(s) are uncorrelated, our approach is reasonable.)

(Contact the author at LMC-Solutions.co.za.)

Interpolating between discrete calibration points: the effect on uncertainty – Addendum

ABSTRACT
This publication follows “Interpolating between discrete calibration points: the effect on uncertainty” of September 2025. It describes a more general approach to interpolation of calibration uncertainty (not specific to a particular type of sensor), draws a tentative conclusion regarding the spacing of calibration points versus resultant uncertainty, and applies this approach to digital thermometer, PRT and thermocouple calibration data.

INTERPOLATING UNCERTAINTY “BY EYE”

Here are two sets of calibration results, reporting correction or error of the Unit Under Test (UUT), and expanded uncertainty, at various temperatures:

Calibration results of a digital thermometer with type K thermocouple sensor, and an industrial PRT.

Looking at the left-hand (digital thermometer) results, what do we observe?:
1. The correction at all three temperatures is effectively constant, relative to the calibration uncertainty.
2. The UUT is a thermocouple thermometer with a range of -50 to 300 °C. So, none of the calibration temperatures appears to be “special”. (In this context, “special” temperatures are those where the thermometer might be adjusted to have small corrections, for example, the ends of the operating range, room temperature (where all signal comes from Cold Junction Compensation and none from the sensor), and temperatures where the measuring electronics change range. For a liquid-in-glass thermometer, “special” temperatures would be scale pointing marks.)
These observations lead us to a (tentative) conclusion: The thermometer correction is stable at three “random” temperatures, so the correction at intermediate temperatures can probably be estimated with confidence, without any increase in uncertainty.

Now, looking at the right-hand (IPRT) results, we see:
1. The error varies significantly between calibration points, relative to the uncertainty.
2. The error varies fairly linearly with temperature, though deviation from a straight line is sometimes larger than the calibration uncertainty (at 232 °C, in this case):

Calibration results of the above industrial PRT, graphed.

What may we conclude from the IPRT data?: If we wish to estimate the error at an intermediate temperature by linear interpolation, the uncertainty at this intermediate temperature should probably be larger than that at the neighbouring calibration temperatures.

INTERPOLATING UNCERTAINTY – NUMERICAL ESTIMATE

How may we numerically estimate the additional uncertainty caused by interpolation?: If we assume that the value at the interpolated point lies between the two neighbouring calibration values, with an equal probability anywhere in that range, the uncertainty caused by interpolation may be estimated as |corr_1 – corr_2|, as the full-width of a rectangular distribution. (Note: The assumption that the value lies between the two neighbouring calibration values is not necessarily correct for “badly behaved” instruments.) To obtain standard uncertainty, u_interp, divide this by 2√3. Or, for expanded uncertainty, U_interp = u_interp*2 = |corr_1 – corr_2| / √3. Total uncertainty U_tot = √[U_cal^2 + U_interp^2].
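
This estimate is simple to implement (a minimal sketch; the numerical values in the example are made up):

```python
from math import sqrt

def interp_total_U(corr1, corr2, U_cal):
    """Expanded uncertainty at a point between two calibration points:
    |corr1 - corr2| is taken as the full width of a rectangular
    distribution, so u_interp = |corr1 - corr2|/(2*sqrt(3)) and
    U_interp = 2*u_interp = |corr1 - corr2|/sqrt(3).  The total is
    U_tot = sqrt(U_cal^2 + U_interp^2)."""
    U_interp = abs(corr1 - corr2) / sqrt(3)
    return sqrt(U_cal**2 + U_interp**2)

# hypothetical corrections of -0.1 and +0.3 degC, with U_cal = 0.2 degC
print(round(interp_total_U(-0.1, 0.3, 0.2), 3))
```
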
This technique is applied for the digital thermometer (using a more recent, larger, data set) and the IPRT mentioned above, with values interpolated between low, middle and high temperatures (in bold) being compared to measured values at intermediate temperatures:

Linear interpolation between bold calibration results, for digital thermometer and IPRT.

Linear interpolation between the bold calibration results differs from the measured corrections or errors at intermediate temperatures (residual = measured – interpolated) by less than the estimated total uncertainty, suggesting that this approach to interpolating uncertainty is “safe”, at least for these two instruments.
For the digital thermometer, the uncertainty of interpolated values is essentially the same as that of neighbouring calibration points (differs less than 5%), while interpolated values have significantly (~ten times) larger uncertainty for the IPRT. This seems like a reasonable approach, without having any deeper knowledge of the interpolating function (reference function) being used. (The only assumption is that the correction or error varies “more-or-less” monotonically between calibration points.) Note that interpolation may be much more accurate if one does have deeper knowledge of the interpolating function, as reported in the preceding paper, or if one performs a least squares fit to the full set of data.

The digital thermometer results above were stable, and those of the IPRT close to linear. What about non-linear results? The tables and graphs below show results for a type R thermocouple (linear) and a type K thermocouple (non-linear):

Linear interpolation between bold calibration results, for type R and type K thermocouples.

Calibration results for type R and type K thermocouples, graphed.

It can be seen that, though the type K’s results are non-linear (and even slightly non-monotonic) at low temperatures, the residual (= measured – interpolated) is always smaller than the total uncertainty.

SPACING BETWEEN CALIBRATION POINTS

If one wishes to add negligible uncertainty from interpolation, how far apart may the calibration points be? Consider the following “generic” calibration results:

Generic calibration results; correction or error y vs independent variable x.

If the correction or error, y, changes by at most half the expanded uncertainty, between one calibration point and the next, then the total uncertainty at interpolated points is essentially the same as at the calibration points (differs less than 5%).
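
A quick arithmetic check of the 5 % claim, using the rectangular-distribution estimate above: if the change in correction equals half the expanded uncertainty, then U_interp = (U/2)/√3, and the total uncertainty grows by about 4 %:

```python
from math import sqrt

# If the correction changes by exactly U/2 between calibration points,
# then U_interp = (U/2)/sqrt(3), and the total relative to U is:
ratio = sqrt(1.0 + (0.5 / sqrt(3)) ** 2)   # = sqrt(1 + 1/12)
print(ratio)   # about 1.041, i.e. a ~4 % increase (under the 5 % threshold)
```
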

CONCLUSION

∙The correction of a UUT at a point intermediate between calibration points may be estimated in a simple manner, by interpolating linearly between the two adjacent calibration values. The uncertainty at this interpolated point may be grossly estimated by adding the difference between the two calibration values in quadrature to the calibration uncertainty. For UUTs with stable corrections, this adds negligibly to the uncertainty, but if the correction varies significantly between calibration points, the resultant uncertainty at the intermediate point is much larger.
∙If the user specifies the calibration points, he takes responsibility for the uncertainty between points [Petersen, “Principles for Calibration Point Selection”, NCSLI Measure, Volume 8, No 3, 2013].
∙If the user only specifies the range, the calibration laboratory should suggest calibration points based on
– instrument manufacturer’s recommendation
– understanding of the operating principles of the instrument
– historical experience with this type or model of instrument (“type testing”)
– in the absence of further information, sufficient points that the UUT correction changes by at most half the expanded uncertainty, between one calibration point and the next (unless some uncertainty is added for interpolation)
– ideally, the calibration laboratory should be confident enough to include a statement in the certificate such as “Effects due to interpolation are considered negligible over the calibrated range.”
∙The user should agree to the suggested calibration points during contract review.

(Contact the author at LMC-Solutions.co.za.)

Interlaboratory comparisons (Proficiency Testing) among calibration laboratories – how to choose the assigned value (Reference Value) – Addendum

ABSTRACT
This publication follows “Interlaboratory comparisons (Proficiency Testing) among calibration laboratories – how to choose the assigned value (Reference Value)” of October 2025, describing additional techniques for visualizing and interpreting results. Participants’ reported probability distributions are combined for visual review, Cox’s Largest Consistent Subset approach is applied to remove outliers, and the recommendations in this and the previous paper are applied to a small Thermometry ILC with four participants.

VISUAL REVIEW OF DATA

First, we continue to discuss the infrared thermometry ILC involving Ref Lab and 13 participants, from the previous paper: The kernel density plots that were used to examine the data for items A and D combine normal distributions with equal standard deviations (widths) around each participant’s result. (Plots were generated using the density() function in the R programming language.) What would the plots look like, if the widths were derived from the participants’ estimated uncertainties (which vary by an order of magnitude)? And, what would be the effect of including Ref Lab in the combination (convolution) of normal distributions?

[Figure: kernel density plots, item A]

[Figure: combined (convolved) distributions, item A]

For item A, using the participants’ uncertainties (second row of graphs above) tends to “smear out” the secondary modes (lower peaks) observed in the kernel density plots, suggesting that results in these areas have larger estimated uncertainties. (The effect is similar to enlarging the bandwidth of the kernel density plot.) Including the Ref Lab results in the combined distributions (third row of graphs) adds or heightens a peak in the 23, 34 and 80 °C data. As Ref Lab has the smallest uncertainties, it produces sharp, high peaks. (This is similar to how the weighted mean is dominated by the results with the smallest uncertainties.)

[Figure: kernel density plots, item D]

[Figure: combined (convolved) distributions, item D]

For item D, using the participants’ uncertainties removes the secondary peak for 23 °C, splits the main peak for 34 °C into two, and reduces the secondary peak for 40 °C.
For 80 °C, two peaks become three: -3 K (participants P1, P5, P6, P7), +1.5 K (P9, P11) and +3 K (P12, P13). It is interesting to observe that, if their uncertainties are small, just two participants (out of p = 10) can create a significant peak. Ref Lab’s result at 80 °C lies around halfway between the -3 K and +1.5 K peaks: does this suggest a difference in method between P1-P5-P6-P7 and P9-P11, Ref Lab applying a measurement technique halfway between these two populations? A similar split (P1-P2-P3-P5-P6-P7 vs P9-P10-P11) may be present at 40 °C, but the differences are smaller and therefore less distinct.
(It should be noted that Ref Lab and participants P1 to P8 used blackbody targets with emissivity ≈ 1.00, while P9 to P13 used flat plates with ε ≈ 0.95. The latter results are corrected to ε ≈ 1.00 (corrections ~ 0.7 K around 37 °C and 2.5 K at 80 °C), but perhaps some emissivity-related effects remain.)

A general observation, regarding the convolution of reported results (with widely differing uncertainties) vs the use of a kernel density plot (with equal bandwidth for all results): In practice, calibration laboratories using similar measurement standards and equipment (and therefore having similar measurement capabilities in reality) may report very different uncertainties, because, for example, some accredited CMCs are conservative and others are not. For this reason, it is suggested that reported uncertainties are of little value in visualizing the distribution of results, and a kernel density plot (with equal bandwidth for all results) is recommended.
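
Both kinds of plot can be generated from the same building block, a sum of normal distributions centred on the reported results (a minimal sketch with made-up results and uncertainties; the article’s plots were produced with R’s density() function, numpy is used here instead):

```python
import numpy as np

def combined_density(grid, results, widths):
    """Sum of normal pdfs centred on each result.  With equal widths this
    is a kernel density estimate; with widths set to the reported standard
    uncertainties it is the 'combined distribution' discussed above."""
    results = np.asarray(results, float)
    widths = np.broadcast_to(np.asarray(widths, float), results.shape)
    pdfs = [np.exp(-0.5 * ((grid - x) / w) ** 2) / (w * np.sqrt(2 * np.pi))
            for x, w in zip(results, widths)]
    return np.sum(pdfs, axis=0) / len(results)

grid = np.linspace(-5.0, 5.0, 1001)
results = [-3.0, -2.8, 1.5, 1.6, 3.0]    # made-up participant errors, K
u = [0.5, 0.4, 0.1, 0.1, 1.0]            # made-up standard uncertainties, K
kde = combined_density(grid, results, 0.6)   # equal bandwidth for all
conv = combined_density(grid, results, u)    # widths from uncertainties
```

As in the article, the two results with small uncertainties (1.5 and 1.6 K here) produce a sharp, high peak in the convolved plot that the equal-bandwidth kernel density plot does not show.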

LARGEST CONSISTENT SUBSET OF RESULTS

The above distributions of results are mostly multi-modal (having several local maxima) and asymmetric. Would the removal of “inconsistent” results improve the situation? (Our ideal combined distribution would be symmetric, with only one peak.) [Cox, “The evaluation of key comparison data: determining the largest consistent subset”, 2005] proposes a chi-squared test for consistency between the Weighted Mean (assigned value, y) and the participants’ results, x_i, with the “worst” results (largest contributors to chi-squared) being removed one-by-one, until the observed chi-squared value “passes” and we are left with the “Largest Consistent Subset” (or, LCS):
chi^2_obs = (x_1 – y)^2/u^2(x_1) + (x_2 – y)^2/u^2(x_2) + … + (x_N – y)^2/u^2(x_N), where y is the weighted mean; the results are consistent if chi^2_obs does not exceed the 5 % critical value of chi-squared with N – 1 degrees of freedom.
(The threshold value of chi^2 may be calculated using the spreadsheet function CHISQ.INV.RT(0.05,N-1), where N = number of participants contributing to RV, or the Largest Consistent Subset may be directly determined using the LCS() function in the metRology library of the R programming language.)
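
Cox’s procedure can be sketched in a few lines (made-up data; scipy’s chi2.isf() corresponds to the spreadsheet function CHISQ.INV.RT mentioned above, and the metRology LCS() function is the ready-made R equivalent):

```python
import numpy as np
from scipy.stats import chi2

def largest_consistent_subset(x, u, alpha=0.05):
    """Greedy version of Cox's LCS: drop the largest contributor to
    chi-squared until the remaining results are consistent with their
    weighted mean at the alpha significance level."""
    x, u = np.asarray(x, float), np.asarray(u, float)
    keep = np.ones(len(x), dtype=bool)
    while True:
        w = 1.0 / u[keep] ** 2
        y = np.sum(w * x[keep]) / np.sum(w)        # weighted mean (RV)
        contrib = w * (x[keep] - y) ** 2           # contributions to chi^2
        threshold = chi2.isf(alpha, keep.sum() - 1)
        if contrib.sum() <= threshold or keep.sum() <= 2:
            return keep, y
        keep[np.flatnonzero(keep)[np.argmax(contrib)]] = False

# made-up results and standard uncertainties (not the article's data):
x = [0.10, 0.12, 0.09, 0.11, 0.80]
u = [0.05, 0.05, 0.05, 0.05, 0.05]
keep, y = largest_consistent_subset(x, u)
print(keep, round(y, 3))
```

In this made-up example the fifth result is removed, and the weighted mean of the remaining four becomes the Reference Value.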
[Figure: combined distributions for item A, before and after removal of inconsistent results]
For item A, only the 80 °C results are inconsistent (according to Cox’s criterion), and the removal of inconsistent results to find the “Largest Consistent Subset” (second row of graphs above) does not, unfortunately, create a unimodal or symmetric plot.
[Figure: combined distributions for item D, before and after removal of inconsistent results]
For item D, all but the 23 °C results are inconsistent. Removing inconsistent results does improve the appearance of the combined probability distributions, though some asymmetry remains.

The LCS approach to finding an assigned value (or Key Comparison Reference Value, KCRV, for a Key Comparison between National Metrology Institutes) is intended to produce the best estimate of the SI value of the measurand [Cox, “The evaluation of key comparison data: determining the largest consistent subset”, Metrologia 44 (2007) 187–200]. This is seldom the goal for an interlaboratory comparison (ILC) between commercial calibration laboratories, where each participant simply aims to demonstrate his equivalence to a “credible” Reference Value. Bearing this in mind, and considering that the number of participants in a typical calibration ILC is small (7 or less, according to [EA-4/21 INF: 2018, “Guidelines for the assessment of the appropriateness of small interlaboratory comparison within the process of laboratory accreditation”]), the removal of participants from RV using a chi-squared test is not always reasonable: the reported uncertainties used to calculate chi-squared may be unrealistic, and the resultant number of results contributing to RV may be too small for statistical confidence. Instead, it is suggested that commercial calibration laboratories arranging a small ILC use a “consensus value from participant results” [ISO 13528:2015 section 7.7], using “a subset of participants determined to be reliable, by some pre-defined criteria … on the basis of prior performance” [13528 clause 7.7.1.1], with this “prior performance” being their reputation in the relevant field. An example of such a small ILC is presented below.

EXAMPLE: A SMALL INFRARED THERMOMETRY ILC

In this example, one infrared thermometer was calibrated from -20 to 150 °C. The pilot laboratory invited the other three laboratories to participate based on their reputation for competence in the field, so that all four laboratories might contribute to RV. Viewing conditions were specified (100 mm from a 50 mm diameter target, or 300 mm from a 150 mm target, etc), so that differing Size-of-Source Effect would not render the results incomparable. Three laboratories submitted multiple result sets (measured by different metrologists), but selected one set to contribute to RV, as required by the protocol. The results are presented below, in the sequence in which they were measured, with secondary results from relevant laboratories being identified as “x.2”:
[Table: results of ILC IR-2025-05]

VISUAL REVIEW OF DATA
The results are plotted, relative to the mean correction, below:
[Figure: results plotted relative to the mean correction]
The expected tight grouping of results within a laboratory may be clearly seen in laboratories A, C and D. Good agreement between initial and final measurements at laboratory A (results A.1 and A.2) indicates that the thermometer was stable during the one-month circulation.
Kernel density plots of the four “primary” results (chosen for inclusion in RV) are shown below:
[Figure: kernel density plots of the four primary results]
It is observed that these kernel density plots allow for better visual review of the data than does the graph of results relative to the mean. (There is significant asymmetry in the 80 °C and 120 °C results, which was not obvious in the graph relative to the mean.)

a) U(RV) FROM PARTICIPANT UNCERTAINTIES
Two Reference Values are considered, the mean and the weighted mean. For each laboratory, RV is calculated from the results of the other three, to avoid the laboratory being “compared to itself”. The uncertainty of the mean is estimated from reported participant uncertainties as u(RV) = √(u1^2 + u2^2 + … + uN^2) / N, similar to the formula for the standard deviation of the mean. The variance of the weighted mean is found from reported participant uncertainties as 1/u^2(RV) = 1/u1^2 + 1/u2^2 + … + 1/uN^2.
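
In code, the two estimates of u(RV) from reported uncertainties look like this (a sketch with made-up values; multiplying by k = 2 gives expanded uncertainties):

```python
import numpy as np

def u_mean(u):
    """Standard uncertainty of the unweighted mean from reported
    participant uncertainties: u(RV) = sqrt(u1^2 + ... + uN^2)/N."""
    u = np.asarray(u, float)
    return np.sqrt(np.sum(u**2)) / len(u)

def u_weighted_mean(u):
    """Standard uncertainty of the weighted mean:
    1/u^2(RV) = 1/u1^2 + ... + 1/uN^2."""
    u = np.asarray(u, float)
    return 1.0 / np.sqrt(np.sum(1.0 / u**2))

u = [0.10, 0.15, 0.30]   # made-up reported standard uncertainties, K
print(2 * u_mean(u), 2 * u_weighted_mean(u))   # expanded (k = 2)
```

The weighted-mean uncertainty is always the smaller of the two, being dominated by the participants with the smallest reported uncertainties.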
As seen below, either value of U(RV) is smaller than U(LV) for all but participant B.1 at lower temperatures, so RV tests the capabilities of laboratories A, C and D fairly rigorously. It does not, however, meet the criterion for uncertainty of the assigned value u(x_pt), relative to the “performance evaluation criterion” σ_pt, suggested in ISO 13528 clause 9.2.1, namely
u(x_pt) ≤ 0.3∙σ_pt
(If this criterion is met, then the uncertainty of the assigned value may be considered to be negligible.)
[Table: Reference Values and their uncertainties]
As mentioned above, the uncertainties reported by commercial calibration laboratories in ILCs differ widely, often without any technical justification. For this reason, it is recommended to use the unweighted mean as the Reference Value. However, as the goal of the ILC is to demonstrate equivalence between participants within their reported uncertainties, U(RV) is estimated from these reported uncertainties, not from the observed spread of results.

b) U(RV) FROM SPREAD OF PARTICIPANT RESULTS
The median is often used as a robust Reference Value for large ILCs, with its uncertainty being estimated from the spread of results via the scaled median absolute deviation (MADe). Is the median an appropriate Reference Value for a small ILC?
[Table: the median and its uncertainty]
(Note: In the above table, the median and its uncertainty are calculated using all four results, i.e., not excluding the participant’s own result.)
It is observed that, while the median may be a reasonable Reference Value for such an ILC, the uncertainty of the median, being derived from the spread of the participants’ results, is not suitable for the task of demonstrating equivalence within reported uncertainties. To achieve this goal, U(RV) should be derived from the reported uncertainties. Also, U(median) is often larger than U(LV), so does not test participants’ capabilities rigorously.
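
For reference, the robust estimate can be sketched as follows (made-up values; the scale factor 1.483 converts the median absolute deviation to an estimate of standard deviation for normally distributed data, and the factor 1.25/√n for the uncertainty of the median follows ISO 13528):

```python
import numpy as np

def median_and_u(x):
    """Median of the results, with uncertainty estimated from their
    spread: MADe = 1.483 * median(|x_i - median|), and
    u(median) = 1.25 * MADe / sqrt(n)."""
    x = np.asarray(x, float)
    med = float(np.median(x))
    made = 1.483 * float(np.median(np.abs(x - med)))
    return med, 1.25 * made / np.sqrt(len(x))

# made-up errors (K) of four participants at one temperature
med, u_med = median_and_u([-0.10, 0.05, 0.10, 0.40])
```

Because this u(median) derives from the spread of results rather than from reported uncertainties, it does not suit the goal of demonstrating equivalence within reported uncertainties, as noted above.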

CONCLUSION

∙Kernel density plots are recommended for visual review of data (the first step in analysis of ILC results).
∙The ILC report should present participants’ results in the order in which they were measured, so that readers may themselves look for artefact drift.
∙If a laboratory submits multiple result sets, it should choose one set to contribute to a consensus Reference Value, before having sight of other participants’ results.
∙The ILC report should indicate which result sets are from the same laboratory, so that readers may interpret data clustering correctly.
∙For small ILCs, each participant should be evaluated against a Reference Value derived from other participants’ results (to avoid bias).
∙To promote the credibility of RV, a subset of participants may be pre-selected to contribute to it, on the basis of “prior performance” (reputation in the field).
∙If a consensus Reference Value is to be used, the mean is recommended, rather than the weighted mean, as reported uncertainties often vary widely, without technical justification.
∙The uncertainty of a consensus Reference Value, U(RV), should be determined from participants’ reported uncertainties, rather than from the spread of results, to achieve the goal of demonstrating equivalence within reported uncertainties.
∙Technical assessors should ask: Is RV credible? (Is it composed of reputable labs?) Does it omit the laboratory being tested, especially if the number of participants is small? Is U(RV) < U(LV), in order to test the participant’s capability rigorously?

(Contact the author at LMC-Solutions.co.za.)