Skip to main content

CORRELATION

The CORRELATION command calculates the correlations in your dataset between each input column, including both numeric can categorical columns.

The CORRELATION command takes no inputs, since it uses all input data. This can be adjusted by using the ignore option to ignore certain columns.

The output will contain a new field column, that shows which correlation is being made to which input column. Additionally, the output will contain NxN correlation values, i.e. correlations for all other columns and itself.

Note that input columns are exactly correlated with themselves, so the diagonal part of the result will be a set of ones. This is normal!

Technical details:

  • numeric-numeric comparisons: Pearson's standard correlation coefficient
  • categorical-categorical: Cramer's V statistic
  • numeric-categorical: Correlation ratio

Syntax

CORRELATION([, ignore=<column_names>])

Options

  • ignore can be used to specify columns (as a comma separated list) returned by the SELECT statement but which you want the CORRELATION to ignore.

Returns

Appends a new column to the input dataset named column which describes which column the comparison is being made to for each row. A column for each input feature that contain N (number of columns) correlation values, i.e. correlations for all other columns and itself.

Examples

Get the CORRELATION values in a customer table.

SELECT * FROM customer CORRELATION

Ignore the customer_id.

SELECT * FROM customer CORRELATION(ignore='customer_id')

Show correlations only for gender.

SELECT * FROM customer CORRELATION WHERE column='gender'