SIM_FACTOR

Define a similarity factor to manipulate the similarity of test items to training items.

Basic item similarity factor

SIM_FACTOR(aglss, 'fname') defines target values for low and high similarity test items, that is, test items whose average similarity to the training items is low or high. The target for low similarity test items is the 25th percentile of the pairwise similarities of all grammatical items to each other. The target for high similarity test items is the 75th percentile of the same pairwise similarities.

Similarity is computed using the EDIT_SIM function (or other similarity function, if specified).

The example below generates training items based on the XXX_GRAMMAR. Test items are then generated that have either low or high similarity to the training items.

The first output variable returned by SIM_FACTOR is an AGLSS object, updated to reflect the similarity factor. The second output variable is a cell array of strings naming the levels of similarity defined. The similarity target values themselves are returned as the third output variable.

s_xxx = aglss(xxx_grammar, [3 10]);

[s, levnames, tgts] = sim_factor(s_xxx, 'Sim');

levnames

tgts

s = factorial_testsets(s, {'Sim', levnames{:}});
s = choose_items(s, 6, 2);

disp('Training items:');
disp(format_train_items(s));

disp('Test items:');
disp(format_test_items(s));

 Potential items:
	Grammar involves 2 symbols (xy)
	2040 possible strings of length 3-10
		180 grammatical strings ( 8.82%)
		1860 ungrammatical strings (91.18%)
	Using all 180 grammatical strings
	Using all 1860 ungrammatical strings
 

levnames = 

    'LowSim'
    'HighSim'


tgts =

    0.5882    0.7600

 Choosing training item 1....
 Choosing training item 2....
 Choosing training item 3....
 Choosing training item 4....
 Updating potential items....
 Choosing test item 1 for each set... 2 1.
 Choosing training item 5....
 Updating potential items....
 Choosing test item 2 for each set... 2 1.
 Choosing training item 6....

Training items:
  Itm_num    Itm_name
       01    xyxyxyxy
       02      xxxxyx
       03       xxyxy
       04      xxyxyx
       05   xxxxxxyxy
       06      xyxyyx

Test items:
 Tset_num       Sim_cat  Itm_num    Itm_name        Sim    Sim_tgt
       01        LowSim       01   yyyxxyyxy      0.587      0.588
       01        LowSim       02     yyxxxyy      0.580      0.588
       02       HighSim       01   xxyxyxyxy      0.760      0.760
       02       HighSim       02      xxyxyy      0.762      0.760
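
The way percentile targets such as those above relate to pairwise similarities can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolbox's implementation: the item list is made up, the similarity measure (1 minus unit-cost edit distance divided by the longer string length) is a stand-in for EDIT_SIM, and percentiles are taken by linear interpolation between sorted values, which may differ from the rule SIM_FACTOR uses.

```matlab
% Toy set of items (illustrative only; not produced by the toolbox).
items = {'xyxy', 'xxyx', 'xyyx', 'yxyx', 'xxxy', 'xyxyxy'};
n = numel(items);
sims = zeros(1, n*(n-1)/2);   % one entry per unordered pair
k = 0;
for i = 1:n-1
    for j = i+1:n
        a = items{i};
        b = items{j};
        % Unit-cost Levenshtein distance by dynamic programming.
        D = zeros(numel(a)+1, numel(b)+1);
        D(:,1) = (0:numel(a))';
        D(1,:) = 0:numel(b);
        for r = 1:numel(a)
            for c = 1:numel(b)
                sub = D(r,c) + double(a(r) ~= b(c));
                D(r+1,c+1) = min([D(r,c+1) + 1, D(r+1,c) + 1, sub]);
            end
        end
        % Stand-in similarity (an assumption, NOT the toolbox's EDIT_SIM).
        k = k + 1;
        sims(k) = 1 - D(end,end) / max(numel(a), numel(b));
    end
end

% Percentiles by sorting and interpolating between order statistics.
sorted = sort(sims);
m = numel(sorted);
pctile = @(p) interp1(1:m, sorted, 1 + (m-1)*p/100);
tgts = [pctile(25), pctile(75)];
disp(tgts);
```

The two values in TGTS play the role of the low and high similarity targets: test items would then be sought whose average similarity to the training items lies close to one of them.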

Levels of similarity

SIM_FACTOR(aglss, 'fname', [p1 p2 ... pN]) specifies similarity percentiles for N levels of similarity. The similarity of test items in category I to training items will be approximately the same as the P(I)-th percentile of similarities of grammatical items to each other. For example, [25 75] specifies the 25th and 75th percentiles.

The example below builds on the AGLSS object S_XXX created for the previous example. Three levels of similarity are defined, for test items at the 20th, 50th, and 80th percentiles of similarity.

[s, levnames, tgts] = sim_factor(s_xxx, 'MySim', [20 50 80]);

levnames

tgts

s = factorial_testsets(s, {'MySim', levnames{:}});
s = choose_items(s, 6, 2);

disp('Training items:');
disp(format_train_items(s));

disp('Test items:');
disp(format_test_items(s));

levnames = 

    'MySim1'
    'MySim2'
    'MySim3'


tgts =

    0.5682    0.6818    0.7778

 Choosing training item 1....
 Choosing training item 2....
 Choosing training item 3....
 Choosing training item 4....
 Updating potential items....
 Choosing test item 1 for each set... 1 3 2.
 Choosing training item 5....
 Updating potential items....
 Choosing test item 2 for each set... 3 1 2.
 Choosing training item 6....

Training items:
  Itm_num    Itm_name
       01    xyxyxyxy
       02      xxxxyx
       03       xxyxy
       04      xxyxyx
       05    xxyxyxyx
       06  xxyxyxyxyy

Test items:
 Tset_num     MySim_cat  Itm_num    Itm_name      MySim  MySim_tgt
       01        MySim1       01   yyxxxyyxx      0.568      0.568
       01        MySim1       02   yyxyyxxxy      0.562      0.568
       02        MySim2       01     xyxxyyx      0.685      0.682
       02        MySim2       02  xyxyyxyyxy      0.685      0.682
       03        MySim3       01     xyxxyxy      0.770      0.778
       03        MySim3       02    xyxyxxyx      0.787      0.778
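
How closely the chosen test items approximate their targets can be checked directly from the table above. The short sketch below copies the MySim and MySim_tgt columns by hand; the variable names are illustrative and not part of the toolbox.

```matlab
% Values copied from the MySim and MySim_tgt columns of the table above.
achieved = [0.568 0.562 0.685 0.685 0.770 0.787];
target   = [0.568 0.568 0.682 0.682 0.778 0.778];

% Absolute deviation of each test item from its target.
dev = abs(achieved - target);
fprintf('Largest deviation from target: %.3f\n', max(dev));
```

In this run every test item lies within 0.01 of its target.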

Naming levels of similarity

SIM_FACTOR(aglss, 'fname', P, {'name1', 'name2', ...}) specifies names for the different levels of similarity, replacing the default names that are otherwise assigned.

The example below builds on the AGLSS object S_XXX created for a previous example. Passing [] for the percentiles keeps the default levels of similarity, which are named 'L' and 'H' (for Low and High) here. These names appear in the display of test items.

[s, levnames, tgts] = sim_factor(s_xxx, 'Sim', [], {'L', 'H'});

levnames

tgts

s = factorial_testsets(s, {'Sim', levnames{:}});
s = choose_items(s, 6, 2);

disp('Training items:');
disp(format_train_items(s));

disp('Test items:');
disp(format_test_items(s));

levnames = 

    'L'    'H'


tgts =

    0.5882    0.7600

 Choosing training item 1....
 Choosing training item 2....
 Choosing training item 3....
 Choosing training item 4....
 Updating potential items....
 Choosing test item 1 for each set... 2 1.
 Choosing training item 5....
 Updating potential items....
 Choosing test item 2 for each set... 1 2.
 Choosing training item 6....

Training items:
  Itm_num    Itm_name
       01    xyxyxyxy
       02      xxxxyx
       03       xxyxy
       04      xxyxyx
       05   xxxxxxyxy
       06      xyxyyx

Test items:
 Tset_num       Sim_cat  Itm_num    Itm_name        Sim    Sim_tgt
       01             L       01   yyyxxyyxy      0.587      0.588
       01             L       02     yyxxxyy      0.580      0.588
       02             H       01   xxyxyxyxy      0.760      0.760
       02             H       02      xxyxyy      0.762      0.760
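
In each example, the factor specification passed to FACTORIAL_TESTSETS is written as {'Sim', levnames{:}}. The sketch below uses only core MATLAB syntax to show what that expression expands to. The variable names SPEC and SPEC2 are illustrative.

```matlab
levnames = {'L'; 'H'};             % level names as a column cell array
spec = {'Sim', levnames{:}};       % comma-separated list expansion
disp(spec);                        % a 1-by-3 cell: 'Sim', 'L', 'H'

% Equivalent form without list expansion: force a row before concatenating.
spec2 = [{'Sim'}, reshape(levnames, 1, [])];
disp(isequal(spec, spec2));
```

The expansion yields the same row cell array whether LEVNAMES is a row or a column, which is convenient because both orientations appear in the output on this page.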

Changing the similarity function

SIM_FACTOR(aglss, 'fname', P, NAMES, @simfunc, {parm1 ... parmN}) specifies the function used to compute the similarity between test and training items. SIMFUNC should be a function that takes two arguments (strings or cell arrays of strings) and that may also take the additional arguments given in the cell array {PARM1 ... PARMN}. It should return a matrix of similarity values.

The example below builds on the AGLSS object S_XXX created for a previous example. Here, an anonymous similarity function computes D/(1+D), where D is the edit distance between two items. Additional parameters are passed to the EditDist function to specify equal costs for deletions, insertions, and substitutions.

[s, levnames, tgts] = sim_factor(s_xxx, 'MySim', [], [], ...
    @(x,y,varargin) EditDist(x, y, varargin{:}) ./ ...
        (1 + EditDist(x, y, varargin{:})), ...
    {1, 1, 1});

levnames

tgts

s = factorial_testsets(s, {'MySim', levnames{:}});
s = choose_items(s, 6, 2);

disp('Training items:');
disp(format_train_items(s));

disp('Test items:');
disp(format_test_items(s));

levnames = 

    'LowMySim'
    'HighMySim'


tgts =

    0.7500    0.8000

 Choosing training item 1....
 Choosing training item 2....
 Choosing training item 3....
 Choosing training item 4....
 Updating potential items....
 Choosing test item 1 for each set... 1 2.
 Choosing training item 5....
 Updating potential items....
 Choosing test item 2 for each set... 2 1.
 Choosing training item 6....

Training items:
  Itm_num    Itm_name
       01    xyxyxyxy
       02      xxxxyx
       03       xxyxy
       04      xxyxyx
       05   xyxyyxyxy
       06   yxyxyxyxy

Test items:
 Tset_num     MySim_cat  Itm_num    Itm_name      MySim  MySim_tgt
       01      LowMySim       01    xxyxxxxy      0.750      0.750
       01      LowMySim       02    xxyxxyyy      0.750      0.750
       02     HighMySim       01   yyxxyxxyy      0.800      0.800
       02     HighMySim       02   xxyyyxxyy      0.800      0.800
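
The targets reported above can be related back to edit distances. The sketch below simply evaluates D/(1+D) for a few integer distances; it does not call EditDist, and the range of distances is chosen for illustration only.

```matlab
D = 0:5;                          % candidate edit distances
simfunc = @(d) d ./ (1 + d);      % same formula as the anonymous function above
vals = simfunc(D);

% First row: distance D. Second row: D/(1+D).
disp([D; vals]);
```

With unit costs, D = 3 gives 0.75 and D = 4 gives 0.8, which matches the targets of 0.7500 and 0.8000. Because D/(1+D) increases with edit distance, larger values of this measure correspond to items that are further apart.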

Changing the relative cost of deletions, insertions and substitutions

SIM_FACTOR(aglss, 'fname', P, NAMES, [], {delCost insCost subCost}) defines similarity target values based on the specified deletion, insertion, and substitution costs for the default EditSim similarity function.

By default, the deletion and insertion costs are both 0.7 and the substitution cost is 1 (see Hahn & Bailey, 2005).

The example below builds on the AGLSS object S_XXX created for a previous example. Here, deletion, insertion, and substitution costs are all set to 1.

[s, levnames, tgts] = sim_factor(s_xxx, 'EQSim', [], [], ...
    [], {1, 1, 1});

levnames

tgts

s = factorial_testsets(s, {'EQSim', levnames{:}});
s = choose_items(s, 6, 2);

disp('Training items:');
disp(format_train_items(s));

disp('Test items:');
disp(format_test_items(s));

levnames = 

    'LowEQSim'
    'HighEQSim'


tgts =

    0.5882    0.7600

 Choosing training item 1....
 Choosing training item 2....
 Choosing training item 3....
 Choosing training item 4....
 Updating potential items....
 Choosing test item 1 for each set... 1 2.
 Choosing training item 5....
 Updating potential items....
 Choosing test item 2 for each set... 2 1.
 Choosing training item 6....

Training items:
  Itm_num    Itm_name
       01    xyxyxyxy
       02      xxxxyx
       03       xxyxy
       04      xxyxyx
       05   xxxxxxyxy
       06      xyxyyx

Test items:
 Tset_num     EQSim_cat  Itm_num    Itm_name      EQSim  EQSim_tgt
       01      LowEQSim       01   yyyxxyyxy      0.587      0.588
       01      LowEQSim       02     yyxxxyy      0.580      0.588
       02     HighEQSim       01   xxyxyxyxy      0.760      0.760
       02     HighEQSim       02      xxyxyy      0.762      0.760
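
The effect of the three cost parameters on an edit distance can be sketched with a small dynamic program. This is an illustration under stated assumptions, not the toolbox's EditDist or EditSim code: the two strings are arbitrary examples, and the recurrence is the standard weighted Levenshtein distance.

```matlab
a = 'xxyxy';
b = 'xyxyy';

% Each row is [delCost insCost subCost]: the default costs, then unit costs.
costs = [0.7 0.7 1; 1 1 1];
dist = zeros(size(costs, 1), 1);

for k = 1:size(costs, 1)
    delCost = costs(k, 1);
    insCost = costs(k, 2);
    subCost = costs(k, 3);

    % D(r+1,c+1) is the cheapest way to turn a(1:r) into b(1:c).
    D = zeros(numel(a)+1, numel(b)+1);
    D(:,1) = (0:numel(a))' * delCost;
    D(1,:) = (0:numel(b)) * insCost;
    for r = 1:numel(a)
        for c = 1:numel(b)
            D(r+1,c+1) = min([ ...
                D(r,c+1) + delCost, ...
                D(r+1,c) + insCost, ...
                D(r,c) + subCost * double(a(r) ~= b(c))]);
        end
    end
    dist(k) = D(end,end);
end

disp(dist');
```

For this pair, the cheapest alignment is one deletion plus one insertion, so the distance is 1.4 with costs of 0.7, 0.7, 1 and 2 with unit costs. Three substitutions would cost 3 in either case.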

Scaling the values returned by a similarity function

SIM_FACTOR(aglss, 'fname', P, NAMES, @simfunc, FPARMS, @scalefunc) specifies a scaling function that is applied to the values returned by the specified similarity function before goodness-of-fit between potential test items and the target similarity values is computed. The scaling function should map values returned by SIMFUNC onto a scale that lies strictly between zero and one. For comparability with pre-defined chunk strength and other factors, the scaling function should ordinarily return values of 0.25 and 0.75 for the lowest and highest target similarity values.

The example below builds on the AGLSS object S_XXX created for a previous example. Here, an anonymous similarity function returns 1 for strings whose edit distance is at most 1, and 0 otherwise. An anonymous scaling function maps 0 and 1 onto 0.25 and 0.75, respectively. The low similarity category aims to find test items that are at least two edits away from every training item. The high similarity category aims to find test items that are within a single edit of half of the training items.

[s, levnames, tgts] = sim_factor(s_xxx, 'X1Sim', [0 0.5], {'L', 'H'}, ...
    @(x,y) single(EditDist(x, y) <= 1), ...
    [], ...
    @(x) (x+0.5) ./ 2 );

levnames

tgts

s = factorial_testsets(s, {'X1Sim', levnames{:}});
s = choose_items(s, 6, 2);

disp('Training items:');
disp(format_train_items(s));

disp('Test items:');
disp(format_test_items(s));

levnames = 

    'L'    'H'


tgts =

         0    0.5000

 Choosing training item 1....
 Choosing training item 2....
 Choosing training item 3....
 Choosing training item 4....
 Updating potential items....
 Choosing test item 1 for each set... 2 1.
 Choosing training item 5....
 Updating potential items....
 Choosing test item 2 for each set... 2 1.
 Choosing training item 6....

Training items:
  Itm_num    Itm_name
       01    xyxyxyxy
       02      xxxxyx
       03       xxyxy
       04      xxyxyx
       05   xyxyyxyxy
       06        xxxx

Test items:
 Tset_num     X1Sim_cat  Itm_num    Itm_name      X1Sim  X1Sim_tgt
       01             L       01   xyyyxxxxx      0.000      0.000
       01             L       02  xyyyyxxyyy      0.000      0.000
       02             H       01       xxyxx      0.500      0.500
       02             H       02    xyyyxyxy      0.333      0.500
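
The arithmetic behind this example can be sketched in a few lines. The scaling function is the one used above. The HITS vectors are made-up 0/1 similarities of one test item to six training items, and MAKESCALE is a hypothetical helper (not part of the toolbox) for building a linear scaling from any pair of raw values.

```matlab
% Scaling function from the example: maps 0 -> 0.25 and 1 -> 0.75.
scalefunc = @(x) (x + 0.5) ./ 2;
scaled = scalefunc([0 0.5 1]);
disp(scaled);

% Average of a 0/1 similarity over six training items (made-up patterns).
hitsHalf  = [1 1 1 0 0 0];   % within one edit of three training items
hitsThird = [1 1 0 0 0 0];   % within one edit of two training items
fprintf('%.3f  %.3f\n', mean(hitsHalf), mean(hitsThird));

% Hypothetical helper: linear map sending lo -> 0.25 and hi -> 0.75.
makescale = @(lo, hi) @(x) 0.25 + 0.5 .* (x - lo) ./ (hi - lo);
generic = makescale(0, 1);
disp(generic([0 0.5 1]));    % same values as scalefunc for lo = 0, hi = 1
```

The averages 0.500 and 0.333 correspond to the X1Sim values shown for the two H test items: the first meets the 0.5 target exactly, while the second is within one edit of only two of the six training items.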