subset and replace functions to filter, select, and clean air quality data

Once you have read your air quality data into R, one of the first things to do is look for the highest concentration values. For example, you may need to find out on how many days the air quality standard was exceeded. For this case, we can use subset.

Another common task is to quickly clean your data. If the analyzers are calibrated every day at 4 am, you need to replace the data recorded at that hour over the whole study period. You may also need to replace invalid data, such as negative concentrations. For these cases, we use replace.

Creating sample data

We are going to create a sample data set with daily PM2.5 and CO concentrations.

set.seed(9999) # Fix the seed so the results are reproducible

pm25 <- runif(100, -1, 35)
co <- runif(100, -1, 5)

date <- seq(
  as.POSIXct("2023-01-01"),
  length.out = length(pm25),
  by = "day"
)

daily_pol <- data.frame(
  date = date,
  pm25 = pm25,
  co = co
)
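Before filtering anything, it can help to take a quick look at what was generated. The following is a minimal sketch that rebuilds the same sample data in one step and inspects it; the choice of str, summary, and colSums here is only a suggestion, not part of the original workflow.

```r
set.seed(9999)

# Same sample data as above, built in a single call
daily_pol <- data.frame(
  date = seq(as.POSIXct("2023-01-01"), length.out = 100, by = "day"),
  pm25 = runif(100, -1, 35),
  co   = runif(100, -1, 5)
)

# Column types and first values
str(daily_pol)

# Minimum, quartiles, mean, and maximum of each pollutant
summary(daily_pol[, c("pm25", "co")])

# Number of negative (invalid) values per pollutant
colSums(daily_pol[, c("pm25", "co")] < 0)
```

Because runif draws from -1 upward, a few negative values are expected in both columns, which is what the replace section deals with.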

Finding days with concentration higher than the air quality standard

The WHO air quality guideline values (24-hour mean) are 15 µg/m³ for PM2.5 and 4 mg/m³ for CO. We can use subset to count the days that exceeded these values. The syntax is: subset(name_of_dataframe, subset = condition_using_name_of_column).

over_who_pm25 <- subset(daily_pol, subset = pm25 > 15)
nrow(over_who_pm25)
# 54 days with concentration over the air quality standard

You can also keep only the columns you want by adding the select argument. If you only want the date and pm25 columns, you can use:

over_who_pm25 <- subset(daily_pol, 
        subset = pm25 > 15, 
        select = c("date", "pm25"))
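The select argument also accepts unquoted column names, negative selections, and column ranges. Here is a small sketch on a hypothetical four-row data frame (toy is not part of the sample data, and it uses Date instead of POSIXct only to keep the example short):

```r
# Hypothetical toy data, small enough to check by eye
toy <- data.frame(
  date = as.Date("2023-01-01") + 0:3,
  pm25 = c(10, 20, 30, 12),
  co   = c(1, 5, 2, 3)
)

# Unquoted column names
a <- subset(toy, subset = pm25 > 15, select = c(date, pm25))

# Negative selection: drop co, keep everything else
b <- subset(toy, subset = pm25 > 15, select = -co)

# Column range: every column from date to pm25
d <- subset(toy, subset = pm25 > 15, select = date:pm25)

names(a)
names(b)
names(d)
```

All three calls keep the same two rows (pm25 of 20 and 30) and the same two columns, so it comes down to which form reads best in your script.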

subset supports more complex conditions. For example, we can find the days that exceeded both the PM2.5 and the CO air quality standards:

day_over_who <- subset(daily_pol, 
        subset = pm25 > 15 & co > 4)
nrow(day_over_who)
# 8 days
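Conditions can also be combined with | (or) and can include the date column. The sketch below uses a hypothetical six-row data frame with hand-picked values so that the counts are easy to verify; the thresholds are the same 15 and 4 used above.

```r
# Hypothetical toy data with known values
toy <- data.frame(
  date = as.Date("2023-01-01") + 0:5,
  pm25 = c(10, 20, 30, 12, 8, 16),
  co   = c(1, 5, 2, 4.5, 3, 1)
)

# Days where at least one pollutant is over its threshold
either <- subset(toy, pm25 > 15 | co > 4)
nrow(either)
# 4 days (rows 2, 3, 4 and 6)

# Days over the PM2.5 threshold from January 3 onwards
late_pm25 <- subset(toy, date >= as.Date("2023-01-03") & pm25 > 15)
nrow(late_pm25)
# 2 days (rows 3 and 6)
```

With the POSIXct dates of the sample data, the same idea applies by comparing against as.POSIXct("2023-01-03") instead of as.Date.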

By the way…

The first example can also be done with the [ ] operator, but I think that using subset makes the code more readable, and that is important.

over_who_pm25 <- daily_pol[daily_pol$pm25 > 15, ]
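One difference between the two approaches shows up when the column contains NA, which will be the case once negative values are replaced. subset drops rows where the condition is NA, while [ ] returns an extra all-NA row for each of them. A small sketch on a hypothetical data frame:

```r
# Hypothetical toy data with one missing value
toy <- data.frame(day = 1:4, pm25 = c(10, NA, 30, 20))

# [ ] keeps a row of NAs where the condition is NA
with_brackets <- toy[toy$pm25 > 15, ]
nrow(with_brackets)
# 3 rows, one of them all NA

# subset treats NA in the condition as FALSE
with_subset <- subset(toy, pm25 > 15)
nrow(with_subset)
# 2 rows

# Wrapping the condition in which() makes [ ] behave like subset here
with_which <- toy[which(toy$pm25 > 15), ]
nrow(with_which)
# 2 rows
```

So if you count exceedances with nrow after cleaning, subset (or which) avoids counting missing values as exceedances.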

Replacing data

Our sample data set contains some negative values. For PM2.5, we can count them by using:
sum(daily_pol$pm25 < 0)
or by applying what we have learnt:
nrow(subset(daily_pol, pm25 < 0))
We find that there are 4 negative values.

We need to replace them with NA, and for that we use replace. The syntax is: replace(the_dataframe, the_condition, the_replace_value). For example, we can replace all the negative values in the data frame with:

daily_pol_clean <- replace(daily_pol, daily_pol < 0, NA)
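To confirm that the replacement did what we expect, we can count the NA values per column afterwards. This is a minimal sketch on a hypothetical three-row data frame, chosen so the result can be checked by hand:

```r
# Hypothetical toy data: one negative pm25 value, two negative co values
toy <- data.frame(
  pm25 = c(12, -0.5, 30),
  co   = c(-0.2, 3, -1)
)

# replace returns a modified copy; toy itself is left unchanged
toy_clean <- replace(toy, toy < 0, NA)

# Number of NA values in each column
colSums(is.na(toy_clean))
# pm25: 1, co: 2

# Summary statistics now need na.rm = TRUE
mean(toy_clean$pm25, na.rm = TRUE)
# 21
```

Note that functions such as mean and max return NA on the cleaned data unless na.rm = TRUE is set.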

Likewise, if we want to replace the values only for PM2.5, we can do the following:

daily_pol$pm25 <- replace(daily_pol$pm25, daily_pol$pm25 < 0, NA)
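The calibration case mentioned at the start works the same way, but it needs hourly data, which our daily sample does not have. The sketch below therefore builds a hypothetical hourly data frame (hourly_pol, with made-up values and a UTC time zone as assumptions) and blanks out the 4 am readings:

```r
# Hypothetical hourly data: 48 hours starting at midnight, UTC
hours <- seq(
  as.POSIXct("2023-01-01 00:00", tz = "UTC"),
  by = "hour",
  length.out = 48
)

hourly_pol <- data.frame(
  date = hours,
  pm25 = seq(5, by = 0.5, length.out = 48) # made-up concentrations
)

# TRUE for every record taken during the 4 am hour
is_calibration <- format(hourly_pol$date, "%H") == "04"

# Replace the calibration readings with NA
hourly_pol$pm25 <- replace(hourly_pol$pm25, is_calibration, NA)

sum(is.na(hourly_pol$pm25))
# 2, one per day
```

If your timestamps are stored in local time, make sure the time zone of the date column matches the time zone of the calibration schedule before extracting the hour.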

That’s it.

The complete code is available in this link.
