I asked a question that was basically this: Considering that all the performance testing (black box, white box, Miami-Dade) was done on 3-conclusion scales, does moving to this 5-conclusion scale undermine the validity of fingerprint identification, especially given that foundational validity, as outlined by PCAST:
“requires that [the method] be shown, based on empirical studies, to be repeatable, reproducible, and accurate, at levels that have been measured and are appropriate to the intended application.”

While obviously the OSAC members didn't think it undermined the validity (spoiler alert: it does), they also pointed to a study that compared the 3-conclusion scale to the 5-conclusion scale, which I believe was cited as evidence that the scale is valid (but not in a way recognized by the PCAST report, mind you). They also framed the reluctance of people to adopt the scale in terms of fear. I always love the fact that any time there's a new policy in the works, it's presented as literally having no risks or trade-offs associated with it. This should always be a red flag. There are trade-offs to every policy change.
Let's take a look at the paper and discuss what I mean.
From the paper:
“However, it is important to note that our results need only to be approximately similar to casework, because the goal of this study is not to measure error rates on an absolute scale, but to consider what changes might occur if an expanded conclusion scale is adopted.”

There goes the 'appropriate to the intended application' prong of the PCAST quote above regarding foundational validity. This paper, in its own words, wasn't intended to do that. That was evident from the methods section, where it's stated that:
“The experiment differed from normal casework in that participants only had 3 minutes to complete each trial and latent prints and exemplar prints were shown at the same time.”

This, in effect, makes the study more of a Seven Minute Abs of comparisons than anything else.
But wait! There’s more. Later in the paper it says:
“The distribution of proportions shown in Table 5 suggests that our comparisons were of similar difficulty to those from black box studies, which are designed to emulate the difficulty of impressions encountered in casework. Thus, we believe that our choice of latent impressions and comparison exemplars produced an environment that is similar to actual casework.”

First of all, difficulty is a function of the Examiner. I wish someone would write that down. We're more concerned with complexity, which is a function of the print: lack of clarity, lack of orientation, use of level 3 data, limited data (read: boundary cases). Complexity is even in the ULW comparison software as a checkbox. It also has a poor man's version substituting as the Quality Metric in the ULW LFIS/LFFS encoding modules.
Other experimental design flaws include:
“This experimental design omitted the ‘of value’ decision. We made this decision because the interpretation of our results depend in part on model fits from signal detection theory, and it is difficult to fit models in which an initial quality threshold is assessed.”

Don't get me wrong, I like the study. I just don't think it's evidence for the validity of the 5-conclusion standard in the way that it was sold on the call. Let's look at what the paper has to say about the results.
“First, the proportion of Identification responses to mated pairs drops from 0.377 in the 3-conclusions scale to 0.266 in the 5-conclusion scale. This suggests that examiners were redefining the term Identification to represent only the trials with the strongest evidence for same source.”

Cue needle scratch on the record sound.
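In signal detection terms (the same framework the authors fit their models in), that drop corresponds to a rightward shift of the Identification criterion. A minimal back-of-the-envelope sketch, assuming an equal-variance Gaussian model (my simplification, not the paper's actual fit):

```python
from scipy.stats import norm

# Under a Gaussian signal detection model, an examiner responds
# "Identification" when the perceived similarity of a mated pair exceeds
# a criterion c. The ID rate on mated pairs then pins down how far the
# criterion sits above the mated-pair mean, in standard deviation units.
id_rate_3 = 0.377  # ID rate on mated pairs, 3-conclusion scale (from the paper)
id_rate_5 = 0.266  # ID rate on mated pairs, 5-conclusion scale (from the paper)

# c - mu = Phi^{-1}(1 - ID rate)
c3 = norm.ppf(1 - id_rate_3)  # ~0.31 SD above the mated-pair mean
c5 = norm.ppf(1 - id_rate_5)  # ~0.63 SD above the mated-pair mean

print(f"Implied criterion shift: {c5 - c3:.2f} SD to the right")  # ~0.31 SD
```

Same examiners, same prints; the only thing that moved is where the Identification label starts.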
This is problematic in the sense that it reduces IDs to being uni-dimensional. On the call (I forget who asked the question), there was some discussion that Support for Same Source actually had a multi-dimensional component, meaning that there could be a scale within that scale. And let's look at the problem with that.
“Second, note that the Inconclusive rate drops from 0.569 in the 3-conclusion scale to 0.351 in the 5-conclusion scale. Some of these Inconclusive responses likely distributed to the Support for same source response, because not all of the Support for Same Source responses could have come from the weak Identification trials (0.377-0.266 is only 0.111, whereas the proportion of support for same source is 0.241).”

Backing up a minute to the Materials and Methods section, we see this:
“This experimental design omitted the ‘of value’ decision. We made this decision because the interpretation of our results depend in part on model fits from signal detection theory, and it is difficult to fit models in which an initial quality threshold is assessed. Both scales included an ‘inconclusive’ category, and while we understand that in casework ‘no value’ and ‘inconclusive’ have different meanings, we considered the two to be approximately equal for the purposes of comparing the traditional and expanded conclusion scales.”

Given the inconclusive rate and their definition of inconclusive in this scale as literally having no value, the implication here is that there is a tendency to erroneously associate a person to a case, whereas in the 3-point scale that is not a problem. But that's only if you think Support for Same Source means you're associating a person to the case (or that it's inculpatory, as the paper terms it). In Figure 1, it appears as though the Jury thinks that way.
So, in essence, what's happened here is that we've redefined the meaning of Identification, the jury will redefine the meaning of 'Support for Same Source', and we've actually added to this bucket people who should not have been there. The overall effect: we've outsourced the overstating from the Examiner to the jury, but made no real difference.
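The bookkeeping on that redistribution is worth spelling out. A quick sketch using the proportions quoted above (the floor on migrated Inconclusives is my arithmetic, not a figure the paper reports):

```python
# Response proportions on mated pairs, taken from the paper's results
id_3, id_5 = 0.377, 0.266    # Identification: 3- vs 5-conclusion scale
inc_3, inc_5 = 0.569, 0.351  # Inconclusive: 3- vs 5-conclusion scale
sss_5 = 0.241                # Support for Same Source: 5-conclusion scale

weak_ids = id_3 - id_5                # 0.111: former IDs demoted to a weaker label
from_inconclusive = sss_5 - weak_ids  # 0.130: SSS responses that cannot be ex-IDs

print(f"At most {weak_ids:.3f} of Support for Same Source are former Identifications")
print(f"At least {from_inconclusive:.3f} had to migrate up from Inconclusive")
print(f"Inconclusive itself dropped by {inc_3 - inc_5:.3f}, so the room is there")
```

That 0.130 is the bucket of comparisons that were Inconclusive under the 3-conclusion scale but now carry an inculpatory-sounding label.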
Hold up though, there's still one last ball we need to juggle: varying degrees of Identification.
The litmus test in any of these is always Mayfield. By the uni-dimensional redefinition of Identification to mean only the strongest, what happens to the Daoud ID? The Zero Point ID? Certainly we aren't lumping Daoud into the strongest category of ID, are we? Remember, the paper says:
“Out of 27 participants, 21 had an identification threshold shifted to the right (i.e. more conservative) in the 5 Conclusion scale relative to the 3 Conclusion scale (exact probability is 0.0029). This demonstrates that examiners redefine what they mean by an Identification when given more categories in the scale (become more conservative).”

So, in the instance of Mayfield, we actually have two competing 'Support for Same Source' propositions. Where is the guidance for that? Especially considering the fact that one of them was an error, not just by a 3-scale standard, but by any standard.
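That 0.0029, for what it's worth, checks out as a one-sided exact sign test (my reconstruction; the paper doesn't name the test in the passage quoted): the probability of 21 or more of 27 examiners shifting right if shifts were a coin flip.

```python
from scipy.stats import binomtest

# One-sided exact sign test: if criterion shifts were 50/50 coin flips,
# how likely is it that 21 or more of 27 examiners shift to the right?
result = binomtest(k=21, n=27, p=0.5, alternative="greater")
print(f"P(>= 21 of 27) = {result.pvalue:.4f}")  # 0.0030, i.e. 0.0029... before rounding
```

The statistic isn't the problem. The problem is what a rightward shift means for IDs made under the old boundary.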
And this brings me to my next point: rejection at the ASB level. On the DOJ call there were numerous questions about the appropriateness of the overlap between OSAC and ASB board members, the implication being that acceptance of the OSAC documents could be forced rather than organic. If we look at the mission statement of the ASB at https://www.asbstandardsboard.org/mission-vission/ (the typo in the URL is too ironic), they're actually charged with:
“Provide training on implementation of ASB standards”

as well as
“Foster collaboration and participation: AAFS; Constituents and Other SDOs”

So riddle me this: how is the ASB supposed to provide training on a standard they didn't approve, one which obviously contradicts itself and shows itself to be impractical (at least at this stage)? Especially in light of the fact that the OSAC cut off collaboration in spite of having overlapping members.
Lastly, there's this: does the IAI Latent Print Certification become invalid now that the DRAFT has been removed? No one has been certified under a 5-conclusion scale, after all. I would envision a whole host of problems with implementing a 5-point scale, given the fiasco we had with adding one of the 3 into the mix not too long ago.
If anyone wants to do a Zoom mock trial where they defend the 5-conclusion scale and I play the defense, let me know. We can record it and put it up here.