CORRELATION
The CORRELATION
command calculates the correlations in your dataset between each input column, including both numeric can categorical columns.
The CORRELATION
command takes no inputs, since it uses all input data.
This can be adjusted by using the ignore
option to ignore certain columns.
The output will contain a new field column
, that shows which correlation is being made to which input column.
Additionally, the output will contain NxN correlation values, i.e. correlations for all other columns and itself.
Note that input columns are exactly correlated with themselves, so the diagonal part of the result will be a set of ones. This is normal!
Technical details:
- numeric-numeric comparisons: Pearson's standard correlation coefficient
- categorical-categorical: Cramer's V statistic
- numeric-categorical: Correlation ratio
Syntax
CORRELATION([, ignore=<column_names>])
Options
ignore
can be used to specify columns (as a comma separated list) returned by theSELECT
statement but which you want theCORRELATION
to ignore.
Returns
Appends a new column to the input dataset named column
which describes which column the comparison is being made to for each row.
A column for each input feature that contain N (number of columns) correlation values, i.e. correlations for all other columns and itself.
Examples
Get the CORRELATION values in a customer
table.
SELECT * FROM customer CORRELATION
Ignore the customer_id
.
SELECT * FROM customer CORRELATION(ignore='customer_id')
Show correlations only for gender.
SELECT * FROM customer CORRELATION WHERE column='gender'