Sorting and Searching PAHMA Object Numbers

Analysis of the data in TMS and the TMS approach to sorting

PAHMA Object Numbers (also known as "Museum Numbers" in the PAHMA UI and elsewhere) are unique string identifiers, one per object or subobject.

Most of the identifiers (88.9%) follow a simple pattern: a short sequence of digits, a hyphen, and another string of digits, e.g. "1-123456". And 94.1% follow a slight variation: this pattern plus an "excresence" in the form of a few letters, e.g. "1-23456a,b". The remaining 5.9% (about 48,500 objects) fall into one of about 250 more varied patterns. The most frequent patterns are show in Table 1 below (up to a cumulative frequency of 99%).

  • "A" below means "a sequence of capital letters"
  • "a" below means "a sequence of lowercase letters"
  • "9" below means "a sequence of digits"

Table 1: Frequency Distribution of ObjectNumber ("Museum Number") Patterns in TMS

n=647,902

Pattern

Frequency

Relative F.

Cumulative F.

Example

9-9

575744

88.86%

88.86%

1-2345

A-9

22974

3.55%

92.41%

TB-192

9-9a

7857

1.21%

93.62%

1-45678b

9-9a,a

7085

1.09%

94.72%

1-47899a,b

9-9_a

6441

0.99%

95.71%

Aa.9

4705

0.73%

96.44%

Box.99

9.9-9

3776

0.58%

97.02%

9.9

3638

0.56%

97.58%

9-9a-a

3450

0.53%

98.11%

9-9.A#9.A

1753

0.27%

98.38%

A-A9

1723

0.27%

98.65%

9-9.A#9.A.9

892

0.14%

98.79%

Aa.9A

886

0.14%

98.92%

A-9a

801

0.12%

99.05%

New and Proposed Museum Number types

Going forward, in CSpace, PAHMA will begin to assign numbers using the "Number Patterns" provided in the UI. Some of these patterns do occur, infrequently, in the existing TMS database. Table 2 shows the various prefixes and numbering patterns that are to be used.

Table 2: Museum Number Patterns to be available in CSpace

Pattern

Description

Remarks

HMA 2012.1.2

Cataloging

LI 2012.1.2

In Loan

 

LO 2012.1.2

Out Loan

 

ACC 2012.1.2

Accession

 


Inventory/Movement

EX 2012.1.2

Object Exit

 

IN 2012.1.2

Object Entry

 

CL 2012.1.2

NAGPRA Claim

 


Group

 


Media

 

Patterns for some of the
not-yet-implemented procedures

CR 2012.1.2

Condition Checking

 

INS 2012.1.2

Insurance

 

VAL 2012.1.2

Valuation

 

EXH 2012.1.2

Exhibition

 

USE 2012.1.2

Use of Collections

 

TR 2012.1.2

Treatment

 

A few points about how these should be represented and collated:

  • Some of the patterns above are not going to be used as Museum Numbers; there is no reason therefore that they should collate like them. However, they will need to collate amongst themselves, and so will eventually deserve similiar (and hopefully identical) treatment.
  • A space between the alphabetic prefix and the number is desired, but the sortkey algorithm should accommodate numbers without a space.
  • The key of these types of numbers should sort in 4 levels, the first alphabetic, the next 3 numeric.
  • While the prefixes could be enumerated (they will be in the Number Patterns UI widget), it would be nice if the algorithm could handle new prefixes without requiring the code to be updated.

TMS Sort Key Generation

TMS employs an algorithm which generates a sort key from the input ObjectNumber value. This value is stored in the SortNumber column of the Objects table and used for collation and perhaps other purposes. The value is configurable by the user in consultation with TMS staff. The TMS manual contains the following configuration information:

Field

Example

Description

SortNumber.PartOrder

1.2.3

Specifies the order in which the different parts of the SortNumber Template should be joined.

SortNumber.Template

XXXXXX.XXXXXX.X XXXX.XXXXX.XXXX .XXX.XXX.XXXXX.X XXXX.XXXXX.XXXX X.XXXX

The template for object numbers (DO NOT CHANGE this setting without consulting with Gallery Systems support staff; see “Updating SortNumber Templates” on page 26-26)

TMS Users' Manual 26-14

(Note that this same template and instruction is used in TMS in a variety of places besides ObjectNumber.)

Options for PAHMA structuredObjectNumber in CSpace

Two options for migrating and maintaining a sortable value for objectNumbers in CSpace:

  1. Migrate the existing values of the SortNumber field to CSpace and reverse engineer the TMS method for creating them in CSpace. There is little value and some risk in doing this: since new ObjectNumbers will continue to be created, whatever algorithm (and implementation) is devised for CSpace it can be employed as part of the migration process to create new sortNumbers. Note also that the TMS implementation appears to have some problems and not all the values in the existing SortNumber field seem to be appropriate. Furthermore, implementation details for the TMS procedure are not included in the documentation, and may be hard to come by; at any rate, there is a risk that the re-implementation might not match the current implementation in some significant detail.
  2. Devise a new algorithm and implementation, hopefully relatively general and straightforward, that can be used to both re-generate sort keys as part of migration and to create new sort keys in the future.

The proposal for the second option is presented in more detail below.

Recommendations for PAHMA structuredObjectNumber in CSpace

It is desired to convert the objectNumber value into a scalar value which supports collation and range searching. This could result in a numeric or hash value being produced, but having a human-readable key that bears a structural resemblance to the original has some advantages.

As noted above, almost 95% of the existing objectNumbers in TMS match a relatively simple regex, and that this expression can be modified to handle the new _number patterns_. The remaining identifiers are pretty much all over the map, but could also be collated using a relatively simply algorithm (see below), and one which might be robust enough to serve as a general template for creating such sort keys in CSpace.

The main pattern has up to 5 parts:

PREFIX NUMBER NUMBER NUMBER ETC

Therefore, a two part algorithm is suggested:

  1. If the identifier matches the "main" 5-part pattern:
    / /^([cC](ons|ONS)?[\-\. ]?)?([A-Z]+)?([\-\. ])?(\d+)([\-\. ])?(\d+)([\.\- ]+)?(\d+)?([\.\- ]+)?(.*)$/
    we are nearly done: Ignore the first match group (the optional "component prefix" C. and its variants) and concatenate the five captured tokens (excluding punctuation) with blank as separator (padding to length 6 with zeroes if the token is numeric). If the first element is alphabetic, include it, otherwise omit it.
    Note that this suggestion has the nice property that the original value is easily retrieved from this sort key, which it closely resembles.
    Another fine point to note is that some "non-standard" input strings match the pattern (esp. cases which have an alphabetic first element and a numeric second element) and these get reformatted. This should be regarded as a feature not a bug since in the vast majority of such cases the result is an improved collation of these values, and it is unlikely that they will incorrectly collate with standard values. However, removing the \w from the third captured element (first output element) would element most of these from reformatting.
  2. For the remaining 5% that do not match this pattern, three options are possible. For PAHMA, it appears that the "pathological cases" below do not need to be handled, so "leave them as is" is fine as the ELSEWHERE case for objectNumbers that do not match the main pattern.!

Javascript snippet implementing algorithm suggested above (includes test cases)

pad = function(s, len){
  if (len-s.length+1 > 0) {
    return (new Array((len-s.length+1) ).join("0")) + s;
  }
  else {
    return s;
  }
};

isNumber = function(n) {
  return !isNaN(parseFloat(n)) && isFinite(n);
}

function testValues() {
  var resultString = "<table>";
  var objnums =
  { 'Do Not Touch!':0,
    'xxx,xxx 99':0, 
    'HMA 2012.1.20 b':4,
    '2012.10.20 b':4,
    '2012.1':4,
    'HMA 2012.01.20 b':4,
    'LI 2012.11.200':4,
    'LO 2012.1.20':4,
    'LO 2012.1.200':4,
    'LO2012.1.19':4,
    'LO 2012.1.20a':4,
    'ACC 2012.1.2':4,
    'ACC 2012.1.02x':4,
    'c-1-210':4,'C.99-1234-a':6,
    'C99-1234.9':6,'cons99-1234.92':6,
    'C 99-1234.93':6,'cons 99-1234.94':6,
    'CONS.99-1234-a99':6,
    'Cons.1-10011':1,'Cons1-10011':1,
    'IJ 1':1,
    'IJ-1':1,
    'K-1993m':1,
    'L 54.739.2':1,
    'L-16576e,f,h,i':1,
    '1-2':1,'1-200a':3,
    '1.3 a,b':2,'99-12345678912345 xx':5}
  for (objnum in objnums) {
    // console.log(objnum,createStructuredObjectNumber(objnum));
    resultString += "<tr><td>" + objnum + "<td>" + createStructuredObjectNumber(objnum);
    }
  resultString += "</table>";
  document.getElementById('results').innerHTML = '<pre>' + resultString + '</pre>';
}
function createStructuredObjectNumber(objnum) {
  //            1    2                   3        4         5    6         7    8          9     10         11   
  var objRe = /^([cC](ons|ONS)?[\-\. ]?)?([A-Z]+)?([\-\. ])?(\d+)([\-\. ])?(\d+)?([\.\- ]+)?(\d+)?([\.\- ]+)?(.*)$/;
  var objTokens = objRe.exec(objnum);
  if (objTokens == null) {
    return objnum;
  }
  else {
    for ( i = 0 ; i < objTokens.length ; i = i+1 ) {
      if (!objTokens[i]) {
        objTokens[i] = '';
      }
      else {
        if (isNumber(objTokens[i])) {
          objTokens[i] = pad(objTokens[i],6);
        }
      }
      objTokens[i] = objTokens[i] + ' ';
    }
    if (objTokens[3] == ' ') objTokens[3] == ''; // zap empty alphabetic prefix
    return objTokens[3]+objTokens[5]+objTokens[7]+objTokens[9]+objTokens[11];
  }
}

Some "Ordinary" Cases

(most of these match the "main pattern", though in some cases the resulting key is not strictly collatable)

ORIGINAL

PROPOSED KEY

Do Not Touch!

Do Not Touch!

xxx,xxx 99

xxx,xxx 99

HMA 2012.1.20 b

HMA 002012 000001 000020 b

2012.10.20 b

002012 000010 000020 b

2012.1

002012 000001

HMA 2012.01.20 b

HMA 002012 000001 000020 b

LI 2012.11.200

LI 002012 000011 000200

LO 2012.1.20

LO 002012 000001 000020

LO 2012.1.200

LO 002012 000001 000200

LO2012.1.19

LO 002012 000001 000019

LO 2012.1.20a

LO 002012 000001 000020 a

ACC 2012.1.2

ACC 002012 000001 000002

ACC 2012.1.02x

ACC 002012 000001 000002 x

c-1-210

000001 000210

C.99-1234-a

000099 001234 a

C99-1234.9

000099 001234 000009

cons99-1234.92

000099 001234 000092

C 99-1234.93

000099 001234 000093

cons 99-1234.94

000099 001234 000094

CONS.99-1234-a99

000099 001234 a99

Cons.1-10011

000001 010011

Cons1-10011

000001 010011

IJ 1

IJ 000001

IJ-1

IJ 000001

K-1993m

K 001993 m

L 54.739.2

L 000054 000739 000002

L-16576e,f,h,i

L 016576 e,f,h,i

1-2

000001 000002

1-200a

000001 000200 a

1.3 a,b

000001 000003 a,b

99-12345678912345 xx

000099 12345678912345 xx

More "Pathological" Cases

NB: the following sections are superceded, but is retained here to remind us to check these pathological cases at some point.

A more general algorithm, that can collate mixed strings of number, letters, and punctuation pretty well is the following, preserved here purely for speculative interest:

  • perform a two-step tokenization and reassembly of the parts as follows:
    • insert whitespace between each juncture of letter and digit:
      s/([a-zA-Z]+)([0-9]+)/\1 \2/g;
        s/([0-9]+)([a-zA-Z]+)/\1 \2/g;
      
    • split the identifier into tokens on white space and selected puncuation, i.e.
      split("[\s\_\-\.\;\:\,\-\+\/\#]+")
    • reassemble the tokens, with padding to fixed lengths, see script below.

These are handled by the "elsewhere" case in the script below. MTB says that most of these are ephemeral, apocryphal, or otherwise moot (i.e. they are accession numbers or "virtual objects" that do not refer to real objects, despite being in the ObjectNumber column in Objects). Nevertheless, we would be wise to spend just a little while considering identifiers like these.

ORIGINAL              PATTERN           PROPOSED KEY            TMS SORTNUM


 IJ 1                                   IJ-000001
 IJ-1                                   IJ-000001
 K-1993m                                K-001993^m
 L 54.739.2                             L-000054^739.2
 L-16576e,f,h,i                         L-016576^e,f,h,i

100.1-145272        	9.9-9          	0100-000001^145272  	^^^100^^^^^114527^^^^2
100.1-145274        	9.9-9          	0100-000001^145274  	^^^100^^^^^114527^^^^4
100.1-145275        	9.9-9          	0100-000001^145275  	^^^100^^^^^114527^^^^5
1-154197.dup        	9-9.a          	0001-154197^dup     	^^^^^1154197dup
1-205632_dupe       	9-9_a          	0001-205632^dupe    	^^^^^1205632dup
1-217004_dupe       	9-9_a          	0001-217004^dupe    	^^^^^1217004
1-217005_dupe       	9-9_a          	0001-217005^dupe    	^^^^^1217005
1-217006_dupe       	9-9_a          	0001-217006^dupe    	^^^^^1217006
14-1                	9-9            	0014-000001^        	^^^^14^^^^^1
14-10.1             	9-9.9          	0014-000010^1       	^^^^14^^^^10^^^^1
14-10.1.dup.1       	9-9.9.a.9      	0014-000010^1.dup.1 	^^^^14^^^^10^^^^1dup^^^^^1
14-10.1.dup.2       	9-9.9.a.9      	0014-000010^1.dup.2 	^^^^14^^^^10^^^^1dup^^^^^2
14-100              	9-9            	0014-000100^        	^^^^14^^^100
14-1000a.1          	9-9a.9         	0014-001000^a.1     	^^^^14^^1000a^^^^^^^^1
14-1000a.1.dup.1    	9-9a.9.a.9     	0014-001000^a.1.dup.1	^^^^14^^1000a^^^^^^^^1dup^^^1
14-1000b.2          	9-9a.9         	0014-001000^b.2     	^^^^14^^1000b^^^^^^^^2
14-1000b.2.dup.1    	9-9a.9.a.9     	0014-001000^b.2.dup.1	^^^^14^^1000b^^^^^^^^2dup^^^1
14-1000c.3          	9-9a.9         	0014-001000^c.3     	^^^^14^^1000c^^^^^^^^3
1-TEMP01a           	9-A9a          	^^^1TEMP^^01^^^a    	^^^^^1TEMP^^^^^^1a
1-TEMP01b           	9-A9a          	^^^1TEMP^^01^^^b    	^^^^^1TEMP^^^^^^1b
1-TEMP02            	9-A9           	^^^1TEMP^^02        	^^^^^1TEMP^^^^^^2
1-TEMP03            	9-A9           	^^^1TEMP^^03        	^^^^^1TEMP^^^^^^3
1168.Crem 1         	9.Aa 9         	1168Crem^^^1        	^^1168Crem^^^^^^1
1175.Crem 5         	9.Aa 9         	1175Crem^^^5        	^^1175Crem^^^^^^5

24-2382.S#12.A and B	9-9.A#9.A a A  	0024-002382^S#12.A^and^B ^^^^24^^2382S^^^^^^^12A^^^andB
4-"Blank" Box (1 of 3) Broughton 9-"Aa" Aa (9 a 9) Aa ^^^4Blank^Box^^^1^^of^^^3Broughton x_FaunalBox^^^^^^^1^broughton
4-"Blank" Box (2 of 3)	  9-"Aa" Aa (9 a 9)             ^^^4Blank^Box^^^2^^of^^^3          x_FaunalBox^^^^^^^2^broughton
Box #1 (Sac #1?) (2 of 2) Aa #9 (Aa #9?) (9 a 9)	^Box^^^1^Sac^^^1^^^2^^of^^^2     x_FaunalBox^^^^^^^1^2
Box #2 (Sac #2?) (1 of 2) Orig. Box 5	Aa #9 (Aa #9?) (9 a 9) Aa. Aa 9	^Box^^^2^Sac^^^2^^^1^^of^^^2Orig^Box^^^5	x_FaunalBox^^^^^^^0
Box (1-no #-1 of 2) 	Aa (9-a #-9 a 9)	^Box^^^1^^no^^^1^^of^^^2	x_FaunalBox^^^^^^^1
Box (1-no #-2 of 2) Broughton M-1	Aa (9-a #-9 a 9) Aa A-9	^Box^^^1^^no^^^2^^of^^^2Broughton^^^M^^^1	x_FaunalBox^^^^^^^2
Box (11-no#-1 of 2) 	Aa (9-a#-9 a 9)	^Box^^11^^no^^^1^^of^^^2	x_FaunalBox^^^^^^^1
Box (11-no#-2 of 2) 	Aa (9-a#-9 a 9)	^Box^^11^^no^^^2^^of^^^2	x_FaunalBox^^^^^^^2
1-___ CA-Mrn-193 duplicates	9-___ A-Aa-9 a 	^^^1^___^^CA^Mrn^193duplicates	^^^^^1CA^^^^Mrn^^^^193duplicates
1-___ CA-SFr-2500 Mason St. duplicate nums	9-___ A-Aa-9 Aa Aa. a a	^^^1^___^^CA^SFr2500Mason^^Stduplicatenums	^^^^^1CA^^^^SFr^^^2500Mason^^St^duplicate^nums

Hasty perl script to aid in prototyping structuredObjectNumber algorithm

execute as "perl -n sortable.pl testcases.csv". testcases.csv and this script are attached.

This script is retained for posterity, but bears only a modest resemblance to what is implemented in
the javascript above and what is defined in the algorithm.

# parse input line, clean up a bit
chomp;
s/ +$//;
s/ +\t/\t/;
my ($original,$sortNum) = split /\t/;
$_ = $original;

# make pattern
s/\d+/9/g;
s/[A-Z]+/A/g;
s/[a-z]+/a/g;
my $pattern = $_;

# generate sortkey
my $sortkey;
my ($left,$right,$extra);
($left,$right,$extra) = $original =~ /^([\d\w]+)[\-\.](\d+)(.*)$/; # basic pattern
$extra =~ s/^[\s\_\-\.\;\:\,\+\/\#]+//; # remove initial non-filing chars (i.e. punctuation)
                                        # NB: this may be problematic in some cases.
if ($left && $right) {
  if ($left =~ /^[0-9]+$/) {
    $sortkey = sprintf "%04d-%06d %s", ($left+0,$right,$extra);
  }
  else { # handle case where leftmost token is alphabetic...
    $sortkey = sprintf "%-4s-%06d %s", ($left,$right,$extra);
  }
}
else { # fallback : see prose above
  my $string = $original;
  $string =~ s/([a-zA-Z]+)([0-9]+)/\1 \2/g; # handle both cases: letter first
  $string =~ s/([0-9]+)([a-zA-Z]+)/\1 \2/g; #                    digits first
  $string =~ s/\W/ /g;
  @tokens = split(/[\s\-\.\:\,\-\+\/\#]+/,$string);
  my @paddedTokens = map { sprintf "%4s",$_ } @tokens;
  $sortkey = join '',@paddedTokens;
}
$sortkey =~ s/ /\^/g; # add ^ to make blanks visible
$sortNum =~ s/ /\^/g; # add ^ to make blanks visible
printf "%s\t%s\t%s\t%s\n", ($original,$pattern,$sortkey,$sortNum) ;