SIM_FACTOR
Define a similarity factor to manipulate the similarity of test items to training items.
Basic item similarity factor
SIM_FACTOR(aglss, 'fname') defines target values for low and high similarity test items, that is, test items that have low or high average similarity to training items. The average similarity between training items and low similarity test items will match the 25th percentile of pairwise similarities of all grammatical items to each other. The average similarity between training items and high similarity test items will match the 75th percentile of the same pairwise similarities.
Similarity is computed using the EDIT_SIM function (or other similarity function, if specified).
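The percentile-based targets described above can be sketched as follows. This is a minimal illustration, not the toolbox's implementation: the similarity matrix S is hypothetical, and the percentile is computed by linear interpolation over the sorted values, which is an assumption (SIM_FACTOR may use a different percentile convention).

```matlab
% Hypothetical pairwise similarities among four grammatical items
% (symmetric, ones on the diagonal). Values are illustrative only.
S = [1.0 0.4 0.5 0.6
     0.4 1.0 0.7 0.8
     0.5 0.7 1.0 0.9
     0.6 0.8 0.9 1.0];

% Keep each unordered pair once: the entries above the diagonal.
pairSims = S(triu(true(size(S)), 1));

% Percentiles by linear interpolation over the sorted values.
% This is one of several conventions and is an assumption of this sketch.
sortedSims   = sort(pairSims(:)).';
pctPositions = linspace(0, 100, numel(sortedSims));
tgts         = interp1(pctPositions, sortedSims, [25 75]);

fprintf('Low target: %.3f, high target: %.3f\n', tgts(1), tgts(2));
```

With these six illustrative values the targets come out at roughly 0.525 and 0.775. Test items would then be chosen so that their average similarity to the training items is close to one of these two values.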
The example below generates training items based on the XXX_GRAMMAR. Test items are then generated that have either low or high similarity to the training items.
The first output variable returned by SIM_FACTOR is an AGLSS object, updated to reflect the similarity factor. The second output variable is a cell array of strings naming the levels of similarity that were defined. The actual similarity target values are returned as the third output variable.
    s_xxx = aglss(xxx_grammar, [3 10]);
    [s, levnames, tgts] = sim_factor(s_xxx, 'Sim');
    levnames
    tgts
    s = factorial_testsets(s, {'Sim', levnames{:}});
    s = choose_items(s, 6, 2);
    disp('Training items:'); disp(format_train_items(s));
    disp('Test items:'); disp(format_test_items(s));

    Potential items:
       Grammar involves 2 symbols (xy)
       2040 possible strings of length 3-10
       180 grammatical strings ( 8.82%)
       1860 ungrammatical strings (91.18%)
    Using all 180 grammatical strings
    Using all 1860 ungrammatical strings

    levnames =
        'LowSim'    'HighSim'

    tgts =
        0.5882    0.7600

    Choosing training item 1....
    Choosing training item 2....
    Choosing training item 3....
    Choosing training item 4....
    Updating potential items....
    Choosing test item 1 for each set... 2 1.
    Choosing training item 5....
    Updating potential items....
    Choosing test item 2 for each set... 2 1.
    Choosing training item 6....

    Training items:
    Itm_num  Itm_name
    01       xyxyxyxy
    02       xxxxyx
    03       xxyxy
    04       xxyxyx
    05       xxxxxxyxy
    06       xyxyyx

    Test items:
    Tset_num  Sim_cat  Itm_num  Itm_name   Sim    Sim_tgt
    01        LowSim   01       yyyxxyyxy  0.587  0.588
    01        LowSim   02       yyxxxyy    0.580  0.588
    02        HighSim  01       xxyxyxyxy  0.760  0.760
    02        HighSim  02       xxyxyy     0.762  0.760
Levels of similarity
SIM_FACTOR(aglss, 'fname', [p1 p2 ... pN]) specifies similarity percentiles for N levels of similarity. The average similarity of test items in category I to training items will be approximately the same as the P(I)-th percentile of pairwise similarities of grammatical items to each other. For example, [25 75] specifies the 25th and 75th percentiles.
The example below builds on the AGLSS object S_XXX created for the previous example. Three levels of similarity are defined, for test items at the 20th, 50th, and 80th percentiles of similarity.
    [s, levnames, tgts] = sim_factor(s_xxx, 'MySim', [20 50 80]);
    levnames
    tgts
    s = factorial_testsets(s, {'MySim', levnames{:}});
    s = choose_items(s, 6, 2);
    disp('Training items:'); disp(format_train_items(s));
    disp('Test items:'); disp(format_test_items(s));

    levnames =
        'MySim1'    'MySim2'    'MySim3'

    tgts =
        0.5682    0.6818    0.7778

    Choosing training item 1....
    Choosing training item 2....
    Choosing training item 3....
    Choosing training item 4....
    Updating potential items....
    Choosing test item 1 for each set... 1 3 2.
    Choosing training item 5....
    Updating potential items....
    Choosing test item 2 for each set... 3 1 2.
    Choosing training item 6....

    Training items:
    Itm_num  Itm_name
    01       xyxyxyxy
    02       xxxxyx
    03       xxyxy
    04       xxyxyx
    05       xxyxyxyx
    06       xxyxyxyxyy

    Test items:
    Tset_num  MySim_cat  Itm_num  Itm_name    MySim  MySim_tgt
    01        MySim1     01       yyxxxyyxx   0.568  0.568
    01        MySim1     02       yyxyyxxxy   0.562  0.568
    02        MySim2     01       xyxxyyx     0.685  0.682
    02        MySim2     02       xyxyyxyyxy  0.685  0.682
    03        MySim3     01       xyxxyxy     0.770  0.778
    03        MySim3     02       xyxyxxyx    0.787  0.778
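As a quick check on what "approximately the same" means in practice, the sketch below compares the similarity reported for each test item in the output above with the three target values. The numbers are copied from that output; the nearest-target comparison is only an illustration and is not how CHOOSE_ITEMS selects items.

```matlab
% Values copied from the output above.
tgts      = [0.5682 0.6818 0.7778];                 % targets for MySim1..MySim3
itemSims  = [0.568 0.562 0.685 0.685 0.770 0.787];  % MySim column
itemLevel = [1 1 2 2 3 3];                          % Tset_num column

% Absolute deviation of each item (rows) from each target (columns).
% Uses implicit expansion, so MATLAB R2016b or later is assumed.
dev = abs(itemSims(:) - tgts);

% For each item, find the closest target and how far away it is.
[minDev, nearest] = min(dev, [], 2);

fprintf('Nearest level matches assigned level: %d\n', ...
        isequal(nearest(:).', itemLevel));
fprintf('Largest deviation from target: %.4f\n', max(minDev));
```

Every item lies closest to the target of the level it was assigned to, and no item is more than 0.01 away from its target.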
Naming levels of similarity
SIM_FACTOR(aglss, 'fname', T, {'name1', 'name2', ...}) specifies names for the different levels of similarity, as an alternative to the default names otherwise assigned.
The example below builds on the AGLSS object S_XXX created for a previous example. Here, the two default levels of similarity are named 'L' and 'H' (for low and high). These names appear in the display of test items.
    [s, levnames, tgts] = sim_factor(s_xxx, 'Sim', [], {'L', 'H'});
    levnames
    tgts
    s = factorial_testsets(s, {'Sim', levnames{:}});
    s = choose_items(s, 6, 2);
    disp('Training items:'); disp(format_train_items(s));
    disp('Test items:'); disp(format_test_items(s));

    levnames =
        'L'    'H'

    tgts =
        0.5882    0.7600

    Choosing training item 1....
    Choosing training item 2....
    Choosing training item 3....
    Choosing training item 4....
    Updating potential items....
    Choosing test item 1 for each set... 2 1.
    Choosing training item 5....
    Updating potential items....
    Choosing test item 2 for each set... 1 2.
    Choosing training item 6....

    Training items:
    Itm_num  Itm_name
    01       xyxyxyxy
    02       xxxxyx
    03       xxyxy
    04       xxyxyx
    05       xxxxxxyxy
    06       xyxyyx

    Test items:
    Tset_num  Sim_cat  Itm_num  Itm_name   Sim    Sim_tgt
    01        L        01       yyyxxyyxy  0.587  0.588
    01        L        02       yyxxxyy    0.580  0.588
    02        H        01       xxyxyxyxy  0.760  0.760
    02        H        02       xxyxyy     0.762  0.760
Changing the similarity function
SIM_FACTOR(aglss, 'fname', P, NAMES, @simfunc, {parm1 ... parmN}) specifies the function used to compute the similarity between test and training items. SIMFUNC should be a function that takes two arguments (strings or cell arrays of strings), and may also take additional arguments, which are supplied as the cell array {PARM1 ... PARMN}. It should return a matrix of similarity values.
The example below builds on the AGLSS object S_XXX created for a previous example. Here, an anonymous similarity function computes D/(1+D), where D is the edit distance between two items. Additional parameters are passed to the EditDist function to specify equal costs for deletions, insertions, and substitutions.
    [s, levnames, tgts] = sim_factor(s_xxx, 'MySim', [], [], ...
        @(x,y,varargin) EditDist(x, y, varargin{:}) ./ ...
                        (1 + EditDist(x, y, varargin{:})), ...
        {1, 1, 1});
    levnames
    tgts
    s = factorial_testsets(s, {'MySim', levnames{:}});
    s = choose_items(s, 6, 2);
    disp('Training items:'); disp(format_train_items(s));
    disp('Test items:'); disp(format_test_items(s));

    levnames =
        'LowMySim'    'HighMySim'

    tgts =
        0.7500    0.8000

    Choosing training item 1....
    Choosing training item 2....
    Choosing training item 3....
    Choosing training item 4....
    Updating potential items....
    Choosing test item 1 for each set... 1 2.
    Choosing training item 5....
    Updating potential items....
    Choosing test item 2 for each set... 2 1.
    Choosing training item 6....

    Training items:
    Itm_num  Itm_name
    01       xyxyxyxy
    02       xxxxyx
    03       xxyxy
    04       xxyxyx
    05       xyxyyxyxy
    06       yxyxyxyxy

    Test items:
    Tset_num  MySim_cat  Itm_num  Itm_name   MySim  MySim_tgt
    01        LowMySim   01       xxyxxxxy   0.750  0.750
    01        LowMySim   02       xxyxxyyy   0.750  0.750
    02        HighMySim  01       yyxxyxxyy  0.800  0.800
    02        HighMySim  02       xxyyyxxyy  0.800  0.800
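To make the target values in this example easier to read, the short sketch below evaluates the D/(1+D) transform on its own for a few edit distances. The distances are hypothetical inputs chosen for illustration; no toolbox function is called.

```matlab
% The transform applied to the edit distance D in the example above.
toSim = @(D) D ./ (1 + D);

% Hypothetical edit distances chosen for illustration.
D    = [0 1 3 4];
vals = toSim(D);

for k = 1:numel(D)
    fprintf('D = %d  ->  D/(1+D) = %.4f\n', D(k), vals(k));
end
```

Under this transform the targets of 0.7500 and 0.8000 shown above correspond to edit distances of 3 and 4. Note that D/(1+D) grows as D grows, so larger values of this measure indicate items that are further apart in edit distance.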
Changing the relative cost of deletions, insertions and substitutions
SIM_FACTOR(aglss, 'fname', T, NAMES, [], {delCost insCost subCost}) defines similarity target values based on specified deletion, insertion and substitution costs for the default EditSim similarity function.
The default values for these parameters are {0.7, 0.7, 1}: deletions and insertions each cost 0.7, and substitutions cost 1 (see Hahn & Bailey, 2005).
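The effect of these costs can be illustrated with a small weighted edit distance computed by dynamic programming. This is a sketch under stated assumptions, not the toolbox's EditDist or EditSim code: the cost order {delCost, insCost, subCost} follows the text above, while the choice of which edit counts as a deletion versus an insertion, and the two example strings, are assumptions made for illustration.

```matlab
% Weighted edit distance from string a to string b, computed twice:
% once with the default costs and once with equal costs.
a = 'xy';
b = 'yx';
costSets = {[0.7 0.7 1], [1 1 1]};   % {delCost, insCost, subCost}
dist = zeros(1, numel(costSets));

for k = 1:numel(costSets)
    delCost = costSets{k}(1);
    insCost = costSets{k}(2);
    subCost = costSets{k}(3);

    % D(i+1, j+1) holds the cheapest cost of turning a(1:i) into b(1:j).
    D = zeros(numel(a) + 1, numel(b) + 1);
    D(:, 1) = (0:numel(a)).' * delCost;   % delete every character of a(1:i)
    D(1, :) = (0:numel(b))   * insCost;   % insert every character of b(1:j)

    for i = 1:numel(a)
        for j = 1:numel(b)
            viaSub = D(i, j) + subCost * (a(i) ~= b(j));  % match or substitute
            viaDel = D(i, j + 1) + delCost;               % drop a(i)
            viaIns = D(i + 1, j) + insCost;               % add b(j)
            D(i + 1, j + 1) = min([viaSub, viaDel, viaIns]);
        end
    end
    dist(k) = D(end, end);
end

fprintf('Default costs: %.1f, equal costs: %.1f\n', dist(1), dist(2));
```

With the default costs, one deletion plus one insertion (1.4) is cheaper than two substitutions (2), so the distance between 'xy' and 'yx' is 1.4. With equal costs both routes cost 2.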
The example below builds on the AGLSS object S_XXX created for a previous example. Here, deletion, insertion, and substitution costs are all set to 1.
    [s, levnames, tgts] = sim_factor(s_xxx, 'EQSim', [], [], ...
        [], {1, 1, 1});
    levnames
    tgts
    s = factorial_testsets(s, {'EQSim', levnames{:}});
    s = choose_items(s, 6, 2);
    disp('Training items:'); disp(format_train_items(s));
    disp('Test items:'); disp(format_test_items(s));

    levnames =
        'LowEQSim'    'HighEQSim'

    tgts =
        0.5882    0.7600

    Choosing training item 1....
    Choosing training item 2....
    Choosing training item 3....
    Choosing training item 4....
    Updating potential items....
    Choosing test item 1 for each set... 1 2.
    Choosing training item 5....
    Updating potential items....
    Choosing test item 2 for each set... 2 1.
    Choosing training item 6....

    Training items:
    Itm_num  Itm_name
    01       xyxyxyxy
    02       xxxxyx
    03       xxyxy
    04       xxyxyx
    05       xxxxxxyxy
    06       xyxyyx

    Test items:
    Tset_num  EQSim_cat  Itm_num  Itm_name   EQSim  EQSim_tgt
    01        LowEQSim   01       yyyxxyyxy  0.587  0.588
    01        LowEQSim   02       yyxxxyy    0.580  0.588
    02        HighEQSim  01       xxyxyxyxy  0.760  0.760
    02        HighEQSim  02       xxyxyy     0.762  0.760
Scaling the values returned by a similarity function
SIM_FACTOR(aglss, 'fname', P, NAMES, @simfunc, FPARMS, @scalefunc) specifies a scaling function that is applied to the values returned by the similarity function before goodness-of-fit is computed between potential test items and the target similarity values. The scaling function should map values returned by SIMFUNC onto a scale that is strictly greater than zero and strictly less than one. For comparability with pre-defined chunk strength and other factors, the scaling function should ordinarily return values of 0.25 and 0.75 for the lowest and highest target similarity values.
The example below builds on the AGLSS object S_XXX created for a previous example. Here, an anonymous similarity function returns 1 for strings that differ by a single letter or less, and 0 otherwise. An anonymous scaling function maps 0 and 1 onto 0.25 and 0.75, respectively. The low similarity category aims to find test items that are at least two letters different from every training item. The high similarity category aims to find test items that are within a single letter of half of the training items.
    [s, levnames, tgts] = sim_factor(s_xxx, 'X1Sim', [0 0.5], {'L', 'H'}, ...
        @(x,y) single(EditDist(x, y) <= 1), ...
        [], ...
        @(x) (x+0.5) ./ 2 );
    levnames
    tgts
    s = factorial_testsets(s, {'X1Sim', levnames{:}});
    s = choose_items(s, 6, 2);
    disp('Training items:'); disp(format_train_items(s));
    disp('Test items:'); disp(format_test_items(s));

    levnames =
        'L'    'H'

    tgts =
             0    0.5000

    Choosing training item 1....
    Choosing training item 2....
    Choosing training item 3....
    Choosing training item 4....
    Updating potential items....
    Choosing test item 1 for each set... 2 1.
    Choosing training item 5....
    Updating potential items....
    Choosing test item 2 for each set... 2 1.
    Choosing training item 6....

    Training items:
    Itm_num  Itm_name
    01       xyxyxyxy
    02       xxxxyx
    03       xxyxy
    04       xxyxyx
    05       xyxyyxyxy
    06       xxxx

    Test items:
    Tset_num  X1Sim_cat  Itm_num  Itm_name    X1Sim  X1Sim_tgt
    01        L          01       xyyyxxxxx   0.000  0.000
    01        L          02       xyyyyxxyyy  0.000  0.000
    02        H          01       xxyxx       0.500  0.500
    02        H          02       xyyyxyxy    0.333  0.500
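The arithmetic behind this example can be checked in isolation. In the sketch below, the thresholding rule and the scaling function are the same expressions used in the call above, while the vector of edit distances is hypothetical and stands in for the distances from one test item to the six training items.

```matlab
% Hypothetical edit distances from one test item to six training items.
d = [1 0 1 3 2 4];

% Thresholded similarity, as in the example: 1 if at most one edit apart.
within1 = single(d <= 1);

% Average similarity of the test item to the training items.
rawSim = mean(within1);

% Scaling function from the example: maps 0 to 0.25 and 1 to 0.75.
scalefunc = @(x) (x + 0.5) ./ 2;
endpoints = scalefunc([0 1]);
scaledSim = scalefunc(rawSim);

fprintf('Raw similarity: %.3f, scaled: %.3f\n', rawSim, scaledSim);
fprintf('Scaled endpoints: %.2f and %.2f\n', endpoints(1), endpoints(2));
```

Here three of the six training items are within one edit of the test item, so the raw similarity is 0.5, which matches the 'H' target. A raw value of 0.5 is left unchanged by this particular scaling function, while the extremes 0 and 1 are pulled in to 0.25 and 0.75.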