diff --git a/code-lists.md b/code-lists.md index 14552c4..4faad2e 100644 --- a/code-lists.md +++ b/code-lists.md @@ -19,6 +19,11 @@ - [Analysis function guidance on symbols and shorthand in tables](#analysis-function-guidance-on-symbols-and-shorthand-in-tables) - [Themes](#themes) - [Media types](#media-types) + - [Reusable concepts in a CSV](#reusable-concepts-in-a-csv) + - [Periods of time](#periods-of-time) + - [Area code, label and type](#area-code-label-and-type) + - [Age code and label](#age-code-and-label) + - [Sex code and label](#sex-code-and-label) ## Codelists @@ -93,8 +98,8 @@ For example: | Property | Requirement level | Notes | | ------------------ | ----------------- | ------------------------------------------------------------------------- | | `skos:inScheme` | mandatory | See [codelists](#codelists) | -| `rdfs:label` | mandatory | See [titles](style.md#titles) | -| `skos:prefLabel` | mandatory | See [titles](style.md#titles) | +| `rdfs:label` | mandatory | See [titles](style.md#titles) | +| `skos:prefLabel` | mandatory | See [titles](style.md#titles) | | `skos:notation` | mandatory | | | `skos:broader` | recommended | See [hierarchical codelists](#hierarchical-codelists) | | `skos:narrower` | recommended | See [hierarchical codelists](#hierarchical-codelists) | @@ -324,7 +329,7 @@ Statisticians may wish to report statistics against multiple classifications. Do For example, consider a dataset which mixes codes from the NUTS geography codelist with codes from the ONS geography codelist. -| geography | geography_label | value | +| area_code | area_label | value | | --------- | ------------------- | ----- | | UKC | North East, England | ... | | UKD | North West, England | ... 
| @@ -332,7 +337,7 @@ For example, consider a dataset which mixes codes from the NUTS geography codeli The NUTS codes have IRIs which are maintained by Eurostat, such as `http://data.europa.eu/nuts/code/UKC`, whereas the ONS geography codes are maintained by the ONS at the `http://statistics.data.gov.uk/id/statistical-geography/E92000001` namespace. -We map the cells of the dataset to RDF by using the `valueUrl` CSVW property. Only a single `valueUrl` can be applied to all the cells in a column. This is problematic, as the IRIs we wish to map to have different bases. Setting `valueUrl` to `http://data.europa.eu/nuts/code/{geography}` would result in a non-existant identifier `http://data.europa.eu/nuts/code/E92000001` appearing in the RDF output. +We map the cells of the dataset to RDF by using the `valueUrl` CSVW property. Only a single `valueUrl` can be applied to all the cells in a column. This is problematic, as the IRIs we wish to map to have different bases. Setting `valueUrl` to `http://data.europa.eu/nuts/code/{area_code}` would result in a non-existent identifier `http://data.europa.eu/nuts/code/E92000001` appearing in the RDF output. We address this by creating new identifiers for each of the codes under a shared namespace, and using `skos:exactMatch` relations to relate these new identifiers to the more commonly used identifiers. 
For example, @@ -508,8 +513,95 @@ Data providers should adopt the [analytical function guidance](https://analysisf > TODO: Cover media types from [IANA](https://www.w3.org/ns/iana/media-types/) -| Label | IRI | -| ------ | ------------------------------------------------------------------ | +| Label | IRI | +| ------ | ----------------------------------------------------------------- | | CSV | `http://www.w3.org/ns/iana/media-types/text/csv#Resource` | | JSON | `http://www.w3.org/ns/iana/media-types/application/json#Resource` | | Turtle | `http://www.w3.org/ns/iana/media-types/text/turtle#Resource` | + +## Reusable concepts in a CSV + +### Periods of time + +There are a variety of ways to represent time in your data. Below are some examples: + +| period_type | period_code | period_label | +| ----------- | ----------- | ---------------- | +| day | 1999-12-31 | 31-December-1999 | + +For calendar day data we require the `period_type` to be `day`. The `period_code` must be the year, then the month, then the day. The `period_label` must be the day, the month written in full, and then the year, which helps human readability. + +| period_type | period_code | period_label | +| ----------- | ----------- | ------------ | +| month | 2020-01 | January-2020 | + +For monthly data from a calendar period we require the `period_type` to be `month`. The `period_code` must be the year followed by the two-digit month. The `period_label` is the more human-readable form: the month's full name followed by the year. + +| period_type | period_code | period_label | +| ----------- | ----------- | ------------ | +| quarter | 2020-Q1 | 2020-Q1 | + +For quarterly data from a calendar period we require the `period_type` to be `quarter`. The `period_code` and `period_label` must be identical: the year followed by the quarter. 
+ +| period_type | period_code | period_label | +| ----------- | ----------- | ------------ | +| year | 2020 | 2020 | + +For calendar year data we require the `period_type` to be `year`. The `period_code` and `period_label` must be identical: just the year. + +| period_type | period_code | period_label | +| --------------- | ----------- | ------------ | +| government-year | 2020-2021 | 2020-2021 | + +For a government year, which starts in April, we require the `period_type` to be `government-year`. The `period_code` and `period_label` must be identical: the year in which the period starts followed by the year in which it ends. + +| period_type | period_code | period_label | +| ------------------ | ----------------------- | ------------ | +| gregorian-interval | 2001-04-01 00:00:00/P3M | Apr-Jun 2001 | + +A Gregorian interval can be used if the time frame of your data does not conform to a standard time frame. It can be used for monthly, quarterly and yearly data. The `period_code` begins with the start of the period that your data covers, which in the example above is 1 April 2001. The duration after the `/` states how much time has been captured: `P3M` in the example means three months, and `P1Y` would show that the data covers a period of one year. + +### Area code, label and type + +| area_code | area_label | area_type | +| --------- | -------------- | --------------------------------- | +| K02000001 | United Kingdom | Country | +| E92000001 | England | Nation | +| E12000001 | North East | Region | +| E06000047 | County Durham | County or Unitary Authority | +| E08000037 | Gateshead | Local Authority District | +| E47000006 | Tees Valley | Combined Authority or City Region | + +The table above shows the variety of area types that can be represented in your data. The important thing is that each entry in the `area_code` column has its own unique code. 
+ +### Age code and label + +| age_code | age_label | +| -------- | ---------------------- | +| Y_GE16 | Aged 16 years and over | +| Y16T24 | Aged 16 to 24 | +| Y25T34 | Aged 25 to 34 | +| Y35T44 | Aged 35 to 44 | +| Y45T54 | Aged 45 to 54 | +| Y55T74 | Aged 55 to 74 | +| Y_GE75 | Aged 75 and over | + +The examples in the table above show the best way to represent different age categories. This has come from the Statistical Data and Metadata eXchange (SDMX) guidelines. [^1] + +### Sex code and label + +| sex_code | sex_label | +| -------- | -------------- | +| F | Female | +| M | Male | +| _N | Non response | +| _O | Other | +| _U | Unknown | +| _Z | Not applicable | + +The examples in the table above show the best way to represent different sex categories. This has come from the Statistical Data and Metadata eXchange (SDMX) guidelines. [^2] + + +[^1]: +[^2]: + diff --git a/csv.md b/csv.md index af95143..a4fdfcd 100644 --- a/csv.md +++ b/csv.md @@ -33,27 +33,34 @@ CSV files used in our service should be saved as UTF-8 encoded text files with a Column headers should be in lowercase and snake case (e.g. `column_header`). This is to ensure consistency and readability. Column headers should also be unique, and should not contain any special characters (e.g. `!@#$%^&*()`). This even includes the pound sign (i.e. `£`), which should be replaced with `gbp` when appropriate. -Related columns should have the same prefix (e.g. `area_code`, `area_label`, `area_type` or `time_period_type`, `time_period_code`, `time_period_label`, or even `observation` and `observation_status`), and should be adjacent. This is to ensure that related columns are grouped together when sorted alphabetically, and to make it easier to find concepts whose values are spread across multiple columns. +Related columns should have the same prefix (e.g. 
`geography_code`, `geography_label`, `geography_type` or `period_type`, `period_code`, `period_label`, or even `observation` and `observation_status`), and should be adjacent. This is to ensure that related columns are grouped together when sorted alphabetically, and to make it easier to find concepts whose values are spread across multiple columns. **Note** When expressing a dimension which has a label and a code, the code should come first, followed by the label; in the case of area geography, you can add an additional value which helps disambiguates geography labels by providing the geography type which would only be disambiguated by the geography code. -TODO: Provide an example of a geography code, geography label, and geography type where the geography type/code is required to disambiguate the geography label. +Below is an example of how we would like code, label and type to be represented. -### Types +| period_code | period_label | period_type | geography_code | geography_label | +| ----------------------- | ----------------- | ------------------ | -------------- | --------------- | +| 1999-12-31 | 31-December-1999 | day | K02000001 | United Kingdom | +| 2020-01 | January-2020 | month | E92000001 | England | +| 2020-Q1 | 2020-Q1 | quarter | E12000001 | North East | +| 2020 | 2020 | year | E06000047 | County Durham | +| 2020-2021 | 2020-2021 | government-year | E07000088 | Gosport | +| 2001-04-01 00:00:00/P3M | Apr-Jun 2001 | gregorian-interval | E14001252 | Gosport | -There are five types of columns, and each CSV file should contain at least two of them. -#### Observation -Observation columns must only contain numbers. Suppressed or missing values must be left blank. If a value is suppressed, there should be a related column explaining the suppressed value. This is referred to as an observation status column (i.e. a special kind of attribute column called "observation status"). When dealing with whole numbers (i.e. 
counts of people) and where there isn't scaling (i.e. thousands, millions, etc.), the number should be expressed as an integer. When dealing with decimal numbers (i.e. percentages, indexes, scaled currency counts), the number should be expressed as a decimal number. +TODO: Provide an example of a geography code, geography label, and geography type where the geography type/code is required to disambiguate the geography label. -**Note:** Try and keep the same number of decimal places for all values in a given column. This will make it easier to read, and not imply false precision. +### Types + +There are five types of columns, and each CSV file should contain at least two of them. #### Dimension Dimension columns (otherwise known as factors or concepts) are used to identify the observation through a combination of concepts. Where each dimension in a CSV is filtered to a specific value there should only be one observation. In relational databases terminology all dimensions combine to a composite key. Some examples of dimensions are: -- `time_period_code` with one value being `2019-2020` (i.e. the period of `April 2019 to March 2020`) +- `period_code` with one value being `2019-2020` (i.e. the period of `April 2019 to March 2020`) - `geography_code` with one value being `E09000001` (i.e. the nation of `England`) - `sic_2007` with one value being `01.11` (i.e. the concept `Growing of cereals (except rice), leguminous crops and oil seeds`) @@ -63,25 +70,21 @@ A quick way to check if a column only contains related data and unique identifia For example the three columns prefixed with `area_` are related in the table below, filtering on any two of the three would only ever result in one value for the remaining column. The area_code column is a unique identifier for each geography. -| area_code | area_label | Area_type | value | ... | +| area_code | area_label | area_type | value | ... 
| | --------- | ----------------- | ---------------------- | ----- | --- | | E08000006 | Salford | Metropolitan Districts | 42 | ... | | E92000001 | England | Country | 1337 | ... | | K04000001 | England and Wales | England and Wales | | ... | -**Note:** Dimension columns must contain values for every row in the CSV file and not be blank (i.e. they must be dense) +If you need further help configuring dimensions such as period, geography, age and sex, see the guidance linked in the footnote. [^1] -#### Attributes - -Attribute columns are used to qualify the observation. Most commonly the attribute columns are used to describe the absence or quality of an observation, these are commonly called "observation status" columns. There are two types of attribute columns, literal and resource columns. - -##### Literal attributes +**Note:** Dimension columns must contain values for every row in the CSV file and not be blank (i.e. they must be dense) -Literal attributes are used to describe the observation. When providing point estimates, often there are additional values which help provide context. For example, when providing a point estimate for the number of people in a given area, there may be a confidence interval (of which there are two values, the upper and lower bounds), a sample size, and a standard deviation. These values are all literal attributes. +#### Observation -##### Observation status columns +Observation columns must only contain numbers. Suppressed or missing values must be left blank. If a value is suppressed, there should be a related column explaining the suppressed value. This is referred to as an observation status column (i.e. a special kind of attribute column called "observation status"). When dealing with whole numbers (i.e. counts of people) and where there isn't scaling (i.e. thousands, millions, etc.), the number should be expressed as an integer. When dealing with decimal numbers (i.e. 
percentages, indexes, scaled currency counts), the number should be expressed as a decimal number. -When creating observation status columns a naming convention helps users understand how they relate to the observation to the qualification. In this case for a given column name containing observations, the observation status column should have the same name as the observation column plus `_status` as a suffix. For example an observation column called `observation` should have a corresponding observation status called `observation_status`. +**Note:** Try to keep the same number of decimal places for all values in a given column. This will make it easier to read, and not imply false precision. #### Measure columns @@ -119,6 +122,20 @@ When scaling units take the base unit and suffix the multiplication factor prece - `ratio_0.001` for per thousands (used in SOME PUBLICATION) - `L_100` for hectolitres (used in HMRC Alcohol Bulletin) +#### Attributes + +Attribute columns are used to qualify the observation. Most commonly, attribute columns describe the absence or quality of an observation; these are commonly called "observation status" columns. There are two types of attribute columns, literal and resource columns. + +##### Literal attributes + +Literal attributes are used to describe the observation. When providing point estimates, often there are additional values which help provide context. For example, when providing a point estimate for the number of people in a given area, there may be a confidence interval (of which there are two values, the upper and lower bounds), a sample size, and a standard deviation. These values are all literal attributes. + +##### Observation status columns + +When creating observation status columns, a naming convention helps users understand how the qualification relates to the observation. 
In this case for a given column name containing observations, the observation status column should have the same name as the observation column plus `_status` as a suffix. For example an observation column called `observation` should have a corresponding observation status called `observation_status`. + +**Note:** An `observation_status` column is not required if all the cells in the `observation` column have data. + ### Ordering Ensuring that users can understand your CSV files is important. To help with this, the columns should be ordered as follows: @@ -133,6 +150,13 @@ Ensuring that users can understand your CSV files is important. To help with thi 6. Observation status column (if necessary). 7. All other attribute columns. +## Example + +Below is a basic example of how the columns should be ordered and shown. + +| period_code | period_type | period_label | geography_code | geography_label | observation | measure | unit | observation_status | +| ----------- | ----------- | ------------ | -------------- | --------------- | ----------- | ------- | ---- | ------------------ | + ## Overall principles Concept Clarity: Ensure that the concepts used in the CSV files are clear and easily understandable to enhance human readability. @@ -140,3 +164,6 @@ Concept Clarity: Ensure that the concepts used in the CSV files are clear and ea Unique Addressability: Each observation should be uniquely addressable by filtering all dimension columns to a value. Value Completeness: All columns should have values for every observation, except for the observation, observation status, or attribute type columns. + + +[^1]: \ No newline at end of file
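
The Unique Addressability and Value Completeness principles added to `csv.md` above could be checked mechanically. The following is a minimal sketch and not part of this change set: the function name `check_dimensions` and the sample rows are hypothetical, and it assumes the caller already knows which columns are dimensions.

```python
import csv
import io
from collections import Counter


def check_dimensions(csv_text, dimension_columns):
    """Return (blank_rows, duplicate_keys) for the given dimension columns.

    blank_rows: file line numbers where any dimension cell is empty.
    duplicate_keys: dimension value combinations that occur more than once.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    keys = Counter()
    blank_rows = []
    # Data rows start at line 2 because line 1 holds the column headers.
    for line_number, row in enumerate(reader, start=2):
        key = tuple((row.get(column) or "").strip() for column in dimension_columns)
        if "" in key:
            blank_rows.append(line_number)  # dimension columns must be dense
        keys[key] += 1
    # A combination seen more than once is not uniquely addressable.
    duplicate_keys = sorted(key for key, count in keys.items() if count > 1)
    return blank_rows, duplicate_keys


# Hypothetical sample: line 4 repeats the key of line 3, line 5 has a blank dimension.
sample = """period_code,geography_code,observation
2020,E92000001,1337
2020,E08000006,42
2020,E08000006,43
2020,,7
"""

blank_rows, duplicate_keys = check_dimensions(sample, ["period_code", "geography_code"])
print(blank_rows)
print(duplicate_keys)
```

Treating the dimension columns as a composite key mirrors the relational-database wording used in the Dimension section; a real check would probably also need to read the list of dimension columns from the accompanying metadata rather than take it as an argument.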