SDTL Best Practices and Conventions
===================================

This section provides information on best practices for using SDTL.

1. **“$type” and “command” properties.** Commands and other items in SDTL are usually identified by the “$type” property. There are two important exceptions (a and c below) and one related spelling rule (b).

   a. Both “$type” and “command” are required in CommandBase. SDTL commands in CommandBase inherit the “command” property, which gives the name of the SDTL command. Even though the command name is also given in the “$type” property, both “$type” and “command” are required. This redundancy is due to a limitation in JSON, and it is needed to be sure that SDTL JSON is rendered correctly in other formats, such as XML.

   b. Command names should be spelled the same way in both the “$type” and “command” properties, including capitalization.

   c. “$type” can be omitted when only one SDTL type is allowed. A number of SDTL types are used to specify complex properties of commands. If only one SDTL type can be used in a property, “$type” may be omitted. For example, the Rename command has a property called “renames”, which is an array of “RenamePair”. Since only a “RenamePair” can be used in a “Rename.renames” property, the “$type”: “RenamePair” may be omitted.

2. **Execute**. Some statistical packages require an explicit command to “execute” or “run” a group of commands. Execute is included in SDTL for information, but it has no function at this time.

3. **Data Types and Formats.**

   a. SDTL does not have a feature for setting default data types and display formats. The **SetDataType** and **SetDisplayFormat** commands should be used whenever the type or format is known.

   b. The **SetDataType** command accepts only a short list of general data types (Text, Numeric, Boolean, Date-Time, Factor). These types can be extended by using the **subTypeSchema** to point to a controlled vocabulary with a more specific **subType**.
      The **subTypeSchema** may refer to a software vendor or to a standard list, such as a DDI Alliance controlled vocabulary. See https://ddialliance.org/controlled-vocabularies/all .

   c. Since display formats are often specific to each software package, the **SetDisplayFormat** command uses the **displayFormatSchema** and **displayFormatName** properties to point to a controlled vocabulary.

4. **Lists and Ranges**

   a. Many commands involve lists of variables or values. SDTL includes types for representing lists and ranges of variables and values. For example, a SetDisplayFormat command may declare that a long list of variables should be displayed with two decimal places.

   b. A “variable range” refers to a group of variables in consecutive columns, such as “varC to varK”. A list may include both individual variables/values and ranges of variables/values.

   c. A list or a range is considered a single element in SDTL. For example, most languages have a “max” function that returns the maximum value from a list of variables. The SDTL “max” function has only one parameter, which must be a VariableListExpression. Although the SPSS expression “MAX( varX, varY, varZ)” appears to have three parameters, it is translated into a VariableListExpression pointing to “varX, varY, varZ”, which is one parameter in SDTL.

5. **Loops and Macros**

   a. Most loops have an index parameter that changes in each pass through the loop. The index parameter may be controlled by a list of variable names or by a list of values.

   b. Whenever possible, loops should be expanded in SDTL scripts by repeating the commands within the loop for each value of the parameter that changes. In general, it is much easier to expand loops over variables than loops over values, because the latter may depend on the state of the dataset.

   c. Even if a loop is expanded, an SDTL version of the loop should be provided for reference, and the original code of the full loop should be given in the SourceInformation parameter originalSourceText.

   d. When loops have been expanded, the “processedSourceText” in the sourceInformation property is used to identify source code after expansion. Comparison of the “originalSourceText” and “processedSourceText” shows how the macro or loop has been expanded.

6. **DoIf**

   a. Used when the logical expression is evaluated once for the entire dataset before execution of the enclosed commands.

   b. SourceInformation includes the entire If/Else group of commands. SourceInformation for each subcommand in the Then block and the Else block describes only that one command.

7. **IfRows**

   a. Used when the logical expression is evaluated once on each row. Commands are executed row by row.

   b. SourceInformation includes the entire If/Else group of commands. SourceInformation for each subcommand in the Then block and the Else block describes only that one command.

8. **MergeDatasets**

   Examples of MergeDatasets can be found in the SDTL Merge Gallery:

   * Spreadsheet version: https://gitlab.com/c2metadata/sdtl-cogs/-/blob/master/CompositeTypes/MergeDatasets/SDTL_Merge_Gallery.xlsx
   * PDF version: https://gitlab.com/c2metadata/sdtl-cogs/-/blob/master/CompositeTypes/MergeDatasets/SDTL_Merge_Gallery.pdf

9. **MergeFileDescription**

   Options for **MergeFileDescription** are also described in the document *Properties and Options of MergeFileDescription*. See https://gitlab.com/c2metadata/sdtl-cogs/-/blob/master/CompositeTypes/MergeFileDescription/MergeFileDescription_options.markdown

   a. **mergeType**

      i. Sequential: Match rows from each input dataframe in sequential order.

      ii. OneToOne: Create one row for each value of the **mergeByVariables**. If a combination of the **mergeByVariables** is repeated, only one row is matched.
          Rows with repeated combinations of the **mergeByVariables** may or may not be included in the output file depending on the **newRow** property.

      iii. OneToMany: Create a row in the output dataframe by matching rows in this dataframe to every row in other dataframes with the same value of **mergeByVariables**. Note that OneToMany implies that one of the other input dataframes is set to ManyToOne.

      iv. ManyToOne: Create a row in the output dataframe by matching all rows in this dataframe to the one row in the other dataframe with the same value of **mergeByVariables**.

      v. Cartesian: Create a new row in the output dataframe for every possible combination of rows having the same value of **mergeByVariables**. This is equivalent to a many-to-many merge. R and Python use a model derived from SQL, which is based on Cartesian joins.

      vi. Unmatched: Create a new row for every row that cannot be matched on the **mergeByVariables**.

      vii. SASmatchMerge: SAS uses a merging approach that combines matching keys and sequential merges within groups.

   b. **update**

      i. Master: This dataframe is the Master dataframe.

      ii. Ignore: If a column with the same name exists in the Master dataframe, ignore the values in this dataframe.

      iii. FillNew: If a column with the same name exists in the Master dataframe, use the values from this dataframe only in new rows created from this dataframe.

      iv. UpdateMissing: If a column with the same name exists in the Master dataframe, use values from this dataframe when the value in the Master dataframe is missing. Rows not in the Master dataframe are filled from this dataframe.

      v. Replace: If a column with the same name exists in the Master dataframe, use values from this dataframe.

   c. **newRow**

      i. TRUE: Every row in the dataframe generates a new row in the output dataframe.

      ii. FALSE: Only rows that are matched generate a new row in the output dataframe.

   d. **mergeFlagVariable**

      i. **mergeFlagVariable** creates a new variable describing whether a row was derived from this file.

      ii. SPSS creates separate merge flag variables for each input file. These variables are binary (0,1).

      iii. Stata and Python create a categorical variable indicating which files contributed to each row.

10. **Use of VariableListExpression in the Function Library**. The Function Library operates by mapping parameters from other languages to a common set of parameters for the SDTL version of the function. Some functions operate on a list of variables, such as “mean(varX, varY, varZ)”. It would be impossible to specify parameters in the Function Library if every variable in a list were considered a parameter. VariableListExpression therefore allows one SDTL parameter to stand for a whole list of variables.

11. **Character strings in statistical packages.** There are two different ways that statistical packages handle variables consisting of text. SPSS and SAS operate primarily on fixed-length character variables. If the user assigns a string shorter than the declared length of the variable, it is padded with blanks on the right side. Stata, R, and Python were designed to work with string variables that vary in length.

12. **FunctionCallExpression: argumentName property required**.

    a. The **argumentName** property in a **FunctionCallExpression** must be present.

13. **Commands versus Functions**

    Some source language commands may be translated as functions in SDTL and vice versa. For example, the Python function “df.rename()” renames variables. In SDTL, Rename is a command, not a function.

14. **Parsing Comments**

    Comments in the source languages are delimited by certain special characters, which may differ depending on the language. Some languages also differentiate between single-line and multi-line comments with a different set of delimiting characters (for example, in Python, a single-line comment starts with a # symbol and ends with a new line, but a multi-line comment starts and ends with three quotation marks).
    Parsers should take care not to include comment delimiting characters in the commentText property of the corresponding SDTL Comment command. Not all source languages use the same characters for that purpose, and a comment delimiting character from one language may have an unintended side effect if the SDTL is used to translate the comment into another source language.

15. **variableInventory**

    a. **variableInventory**, a property of **DataframeDescription**, is used to provide an ordered list of the variables in a dataframe. All SDTL commands include **variableInventory**, because it is a sub-property of both **consumesDataframe** and **producesDataframe**, which are inherited from **CommandBase**.

    b. Parsers are encouraged to use **variableInventory** after any command that changes the number or order of variables in a dataframe. Most source languages allow variable ranges (SDTL **VariableRangeExpression**) in various commands. Since a variable range depends upon the order of variables in a dataframe, the parser should include that information in the SDTL script for use by updaters and other applications.

16. **Collapse** and **Aggregate**

    Some aggregation functions (e.g. mean) cannot be performed on text variables. However, users may apply these functions to a range of variables that includes text variables. Our tests in several source languages suggest that parsers can transfer the variable range in the user-supplied script to SDTL even when it includes text variables. When this happens, the statistical package will either ignore the text variables and aggregate only the numeric variables, or halt with an error message. See *Collapse: Handling of Non-numeric Variables* https://gitlab.com/c2metadata/sdtl-cogs/-/blob/master/CompositeTypes/Collapse/Collapse_Nonnumeric_Variables.rst

17. **Variable names in case-insensitive languages**

    a. If the source language is case insensitive, the parser will change all variable names to either all caps or all lower case. The originalSourceText property of the SourceInformation element will show capitalization as it appears in the original script.

    b. A Message command at the beginning of the SDTL script should say that variable names have been standardized.

    c. Standardization of variable names is necessary for compatibility between case-sensitive and case-insensitive languages.

18. **Omitting optional properties in SDTL JSON**

    There are three acceptable ways of omitting an optional property from an SDTL JSON file:

    i. The property is omitted -- used for single objects or arrays

    ii. "property":null -- used for single objects or arrays

    iii. "property":[] -- only used for arrays

19. **sourceInformation is an array**

    The **sourceInformation** property in **CommandBase** is an array, which can describe more than one command in the source script. This supports cases where two or more commands in the source script contribute to a single SDTL command.

20. **Selecting by row number**

    The SDTL **row_number()** function returns the current row number in the dataframe. This function can be used for selecting subsets by row number. For example, in Python **dataFrame.iloc[2:4]** will select the 3rd and 4th rows in the data frame. (Ranges in Python are 0-indexed and open on the right.) The **row_number()** function can be used in an expression in the **DropCases** and **KeepCases** commands to select a subset, or in the **IfRows** command to control which rows a command or group of commands operates on.

21. **Factor subtypes**

    R and Python both include a categorical data type, which is called Factor in R and Categorical in Python. SDTL calls the type **Factor**. Both R and Python allow Factor/Categorical variables to be either ordered or unordered.
    Only ordered factor variables can be used in greater/less than logical conditions, but unordered factor variables can be used in equal/not equal expressions. However, there are several differences in the ways that factors are implemented in R and Python. For example, factors in R are always string values, but factors in Python can be string or numeric. Unordered factors can be used for sorting in R but not in Python. Because of these differences between languages, Factor variables should be described using the **subTypeSchema** and **subType** properties in the SetDataType command. These can be implemented like this::

        Python factors
            subTypeSchema: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
            subType: ordered, unordered

        R factors
            subTypeSchema: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors
            subType: ordered, unordered

22. **Deep copy of a dataframe**

    Python and R distinguish between a deep copy and a shallow copy (Python) or copy by reference (R). A deep copy creates a duplicate of a dataframe that is independent of the original. A shallow copy has a new name, but it points to the storage locations of the original dataframe, so it acts as an alias for the original dataframe. If a deep copy is changed, the contents of the original dataframe are not affected. However, changing a shallow copy also changes the contents of the original dataframe. In SDTL, the NewDataframe command can be used to create deep copies. SDTL does not support shallow copies at this time.

23. **Representing indexed arrays and lists in SDTL using VariableArrayDereference() and ValueArrayDereference()**

    SDTL does not include a data type for indexed arrays or lists, but the same functionality can sometimes be achieved using the SDTL functions VariableArrayDereference() and ValueArrayDereference(). VariableArrayDereference() and ValueArrayDereference() both take two arguments.
    EXP1 is a number pointing to the location of the desired item in the list given as EXP2. EXP2 is an SDTL list expression (VariableListExpression or ValueListExpression), which may consist of a range expression (VariableRangeExpression, NumberRangeExpression, StringRangeExpression). The list expression must be repeated every time that the array dereference function is used.

    For example, the following SAS code uses a SAS array of variables in a loop. ::

        array musicArray {18} BIGBAND -- HVYMETAL ;
        do i= 1 to 18 ;
          if (musicArray[i] EQ 1 OR musicArray[i] EQ 2) then musicLike2=musicLike2 +1 ;
        end;

    In SDTL, we would replace musicArray[i] with a VariableArrayDereference(EXP1, EXP2) in which EXP1 is an SDTL IteratorSymbolExpression for i and EXP2 is a VariableRangeExpression for variables BIGBAND to HVYMETAL.
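
The replacement of musicArray[i] described above can be sketched as a small Python helper that assembles the function call as a nested dictionary and prints it as JSON. This is only an illustration, not the normative SDTL serialization: the helper name ``variable_array_dereference`` is hypothetical, and the property names "function", "arguments", "argumentValue", "name", "first", and "last" are assumptions made for this sketch. Only "$type", "argumentName", EXP1, EXP2, and the type names come from the text above.

```python
import json


def variable_array_dereference(iterator_name, first_variable, last_variable):
    """Sketch a VariableArrayDereference(EXP1, EXP2) call as a dict.

    Assumed (not confirmed by the text): the property names "function",
    "arguments", "argumentValue", "name", "first", and "last".
    """
    return {
        "$type": "FunctionCallExpression",
        "function": "VariableArrayDereference",
        "arguments": [
            {
                # EXP1: position of the desired item in the list
                "argumentName": "EXP1",
                "argumentValue": {
                    "$type": "IteratorSymbolExpression",
                    "name": iterator_name,
                },
            },
            {
                # EXP2: the list itself, here given as a variable range
                "argumentName": "EXP2",
                "argumentValue": {
                    "$type": "VariableRangeExpression",
                    "first": first_variable,
                    "last": last_variable,
                },
            },
        ],
    }


# musicArray[i] from the SAS example, expressed as one function call
expression = variable_array_dereference("i", "BIGBAND", "HVYMETAL")
print(json.dumps(expression, indent=2))
```

Because the list expression must be repeated every time the dereference function is used, each of the two occurrences of musicArray[i] in the SAS condition would carry its own copy of the EXP2 range in a structure like this one.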