SDTL Best Practices and Conventions

This section provides information on best practices for using SDTL.

  1. “$type” and “command” properties. Commands and other items in SDTL are usually identified by the “$type” property. There are two important exceptions.
    1. Both “$type” and “command” are required in CommandBase. SDTL commands in CommandBase inherit the “command” property, which gives the name of the SDTL command. Even though the command name is also given in the “$type” property, both “$type” and “command” are required. This redundancy is due to a limitation in JSON, and it is needed to be sure that SDTL JSON is rendered correctly in other formats, such as XML.
    2. Command names should be spelled the same way in both “$type” and “command” properties, including capitalization.
    3. “$type” can be omitted when only one SDTL type is allowed. A number of SDTL types are used to specify complex properties of commands. If only one SDTL type can be used in a property, “$type” may be omitted. For example, the Rename command has a property called “renames” which is an array of “RenamePair”. Since only a “RenamePair” can be used in a “Rename.renames” property, the “$type”: “RenamePair” may be omitted.
  2. Execute. Some statistical packages require an explicit command to “execute” or “run” a group of commands. Execute is included in SDTL for information, but it has no functions at this time.
  3. Data Types and Formats.
    1. SDTL does not have a feature for setting default data types and display formats. The SetDataType and SetDisplayFormat should be used whenever the type or format are known.
    2. The SetDataType command accepts only a short list of general data types (Text, Numeric, Boolean, Date-Time, Factor). These types can be extended by using the subTypeSchema to point to a controlled vocabulary with a more specific subType. The subTypeSchema may refer to a software vendor or to a standrd list, such as a DDI Alliance controlled vocabulary See https://ddialliance.org/controlled-vocabularies/all .
    3. Since display formats are often specific to each software package, the SetDisplayFormat uses the displayFormatSchema and displayFormatName properties to point to controlled a controlled vocabulary.
  4. Lists and Ranges
    1. Many commands involve lists of variables or values. SDTL includes types for representing lists and ranges of variables and values. For example, a SetDisplayFormat command may declare that a long list of variables should be displayed with two decimal places.
    2. A “variable range” refers to a group of variables in consecutive columns, such as “varC to varK”. A list may include both individual variables/values and ranges of variables/values.
    3. A list or a range is considered a single element in SDTL. For example, most languages have a “max” function that returns the maximum value from a list of variables. The SDTL “max” function has only one parameter which must be a VariableListExpression. Although the SPSS expression “MAX( varX, varY, varZ)” appears to have three parameters, it is translated into a VariableListExpression pointing to “varX, varY, varZ”, which is one parameter in SDTL.
  5. Loops and Macros
    1. Most loops have an index parameter that changes in each pass through the loop. The index parameter may be controlled by a list of variable names or by a list of values.
    2. Whenever possible, loops should be expanded in SDTL scripts by repeating commands within the loop with each value of a parameter that changes. In general, it is much easier to expand loops over variables than loops over values, because the latter may depend on the state of the dataset.
    3. Even if a loop is expanded, an SDTL version of the loop should be provided for reference, and the original code of the full loop should be given in the SourceInformation parameter originalSourceText.
    4. When loops have been expanded, the “processedSourceText’’ in the sourceInformation property is used to identify source code after expansion. Comparison of the “originalSourceText” and “processedSourceText” shows how the macro or loop has been expanded.
  6. DoIf
    1. Used when the logical expression is evaluated once for the entire dataset before execution of the enclosed commands.
    2. SourceInformation includes the entire If/Else group of commands. SourceInformation for subcommands in both the Then block and the Else block describe only one command.
  7. IfRows,
    1. Used when the logical expression is evaluated once on each row. Commands are executed row by row.
    2. SourceInformation includes the entire If/Else group of commands. SourceInformation for subcommands in both the Then block and the Else block describe only one command.
  8. MergeDatasets Examples of MergeDatasets can be found in the SDTL Merge Gallery:
  9. MergeFileDescription

Options for MergeFileDescription are also in this document Properties and Options of MergeFileDescription. See https://gitlab.com/c2metadata/sdtl-cogs/-/blob/master/CompositeTypes/MergeFileDescription/MergeFileDescription_options.markdown

  1. mergeType
    1. Sequential: Match rows from each input dataframe in sequential order.
    2. OneToOne: Create one row for each value of the mergeByVariables. If a combination of the mergeByVariables is repeated, only one row is matched. Rows with repeated combinations of the MergeByVariables may or may not be included in the output file depending on the newRow property.
    3. OneToMany: Create a row in the output dataframe by matching rows in this dataframe to every row in other dataframes with the same value of MergeByVariables. Note that OneToMany implies that one of the other input datarames is set to ManyToOne.
    4. ManyToOne: Create a row in the output dataframe by matching all rows in this dataframe to the one row in the other dataframe with the same value of MergeByVariables.
    5. Cartesian: Create a new row in the output dataframe for every possible combination of rows having the same value of MergeByVariables. This is equivalent to a many to many merge. R and Python use a model derived from SQL, which is based on Cartesian joins.
    6. Unmatched: Create a new row for every row that cannot be matched on the MergeByVariables
    7. SASmatchMerge: SAS uses a merging approach that combines matching keys and sequential merges within groups.
  2. update
    1. Master: This dataframe is the Master dataframe.
    2. Ignore: If a column with the same name exists in the Master dataframe, ignore the values in this dataframe.
    3. FillNew: If a column with the same name exists in the Master dataframe, use the values from this dataframe only in new rows created from this dataframe.
    4. UpdateMissing: If a column with the same name exists in the Master dataframe, use values from this dataframe when the value in the Master dataframe is missing. Rows not in the Master dataframe are filled from this dataframe.
    5. Replace: If a column with the same name exists in the Master dataframe, use values from this dataframe.
  3. newRow
    1. TRUE: Every row in the dataframe generates a new row in the output dataframe.
    2. FALSE: Only rows that are matched generate a new row in the output dataframe.
  4. mergeFlagVariable
    1. mergeFlagVariable creates a new variable describing whether a row was derived from this file.
    2. SPSS creates separate merge flag variables for each input file. These variables are binary (0,1).
    3. Stata and Python create a categorical variable indicating which files contributed to each row.
  1. Use of VariableListExpression in the Function Library. The Function Library operates by mapping parameters from other languages to a common set of parameters for the SDTL version of the function. Some functions operate on a list of variables, such as “mean(varX, varY, varZ). It would be impossible to specify parameters in the Function Library if every variable in a list was considered a parameter. So, VariableListExpression allows us to use one SDTL parameter for a list of variables.

  2. Character strings in statistical packages. There are two different ways that statistical packages handle variables consisting of text. SPSS and SAS operate primarily on fixed length character variables. If the user assigns a string shorter than the declared length of the variable, it is padded with blanks on the right side. Stata, R, and Python were designed to work with string variables that vary in length.

  3. FunctionCallExpression: argumentName property required.

    a. The argumentName property in a FunctionCallExpression must be present.

  4. Commands versus Functions Some source language commands may be translated as functions in SDTL and vice versa. For example, the Python function “df.rename()” renames variables. In SDTL Rename is a command not a function.

  5. Parsing Comments

    Comments in the source languages are delimited by certain special characters which may differ depending on the language; some languages also differentiate between single-line and multi-line comments with a different set of delimiting characters (for example, in Python, a single-line comment starts with a # symbol and ends with a new line, but a multi-line comment starts and ends with three quotation marks). Parsers should take care not to include comment delimiting characters in the commentText property of the corresponding SDTL Comment command because not all source languages use the same characters for that purpose and a comment delimiting character in one language may have an unintended side effect if the SDTL is used to translate the comment into another source language.

  6. variableInventory

  1. variableInventory, a property of DataframeDescription, is used to provide an ordered list of the variables in a dataframe. All SDTL commands include variableInventory, because it is a sub-property of both consumesDataframe and producesDataframe, which are inherited from CommandBase.
  2. Parsers are encouraged to use variableInventory after any command that changes the number or order of variables in a dataframe. Most source languages allow variable ranges (SDTL VariableRangeExpression) in various commands. Since a variable range depends upon the order of variables in a dataframe, the parser should include that information in the SDTL script for use by updaters and other applications.
  1. Collapse and Aggregate

Some aggregation functions (e.g. mean) cannot be performed on text variables. However, users may apply these functions to a range of variables that includes text variables. Our tests in several source languages suggest that parsers can transfer the variable range in the user-supplied script to SDTL even when it includes text variables. When this happens, the statistical packages will either ignore text variables and aggregate only the numeric variables, or it will halt with an error message.

See Collapse: Handling of Non-numeric Variables https://gitlab.com/c2metadata/sdtl-cogs/-/blob/master/CompositeTypes/Collapse/Collapse_Nonnumeric_Variables.rst

  1. Variable names in case-insensitive languages

a. If the source language is case insensitive, the parser will change all variable names to either all caps or all lower case. The originalSourceText property of the SourceInformation element will show capitalization as it appears in the original script.

b. A Message command at the beginning of the SDTL script should say that variable names have been standardized.

c. Standardization of variable names is necessary for compatibility between case sensitive and insensitive languages.

  1. Omitting optional properties in SDTL JSON

There are three acceptable ways of omitting an optional property from an SDTL JSON file:

  1. The property is omitted – used for single objects or arrays
  2. “property”:null – used for single objects or arrays
  3. “property”:[] – only used for arrays
  1. sourceInformation is an array
The sourceInformation property in CommandBase is an array, which can describe more than one command in the source script. This supports cases where two or more commands in the source script contribute to a single SDTL command.
  1. Selecting by row number
The SDTL row_number() function returns the current row number in the dataframe. This function can be used for selecting subsets by row number. For example, in Python dataFrame.iloc[2:4] will select the 3rd and 4th rows in the data frame. (Ranges in Python are 0-indexed and open on the right.) The row_number() function can be used in an expression in the DropCases and KeepCases commands to select a subset, or in the IfRows command to control which rows a command or group of commands operate on.
  1. Factor subtypes

R and Python both include a categorical data type, which is called Factor in R and Categorical in Python. SDTL calls the type Factor. Both R and Python allow Factor/Categorical variables to be either ordered or unordered. Only ordered factor variables can be used in greater/less than logical conditions, but unordered factor variables can be used in equal/not equal expressions. However, there are several differences in the ways that factors are implemented in R and Python. For example, factors in R are always string values, but factors in Python can be string or numeric. Unordered factors can be used for sorting in R but not in Python.

Because of these differences between languages, Factor variables should be described using the subTypeSchema and subType properties in the SetDataType command. These can be implemented like this:

Python factors
subTypeSchema: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
subType: ordered, unordered

R factors
subTypeSchema: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors
subType: ordered, unordered