Water Quality

The wq submodule contains datasets of surface water chemistry at various locations worldwide. It currently includes 16 water quality datasets, and we anticipate that this number will grow. The spatial and temporal coverage of these datasets is detailed in the following table.

List of datasets

Summary of datasets
Dataset	Class / Function Name	Variables Covered	Temporal Coverage	Spatial Coverage	Reference
Surface Water Chemistry	`aqua_fetch.SWatCh`	24	1960 - 2022	Global	Lobke et al., 2022
Global River Water Quality Archive	`aqua_fetch.GRQA`	42	1898 - 2020	Global	Virro et al., 2021
water QUAlity, DIscharge and Catchment Attributes	`aqua_fetch.Quadica`	10	1950 - 2018	Germany	Ebeling et al., 2022
river chemistry for US coasts	`aqua_fetch.RC4USCoast`	21	1950 - 2020	USA	Gomez et al., 2022
Busan Beach	`aqua_fetch.busan_beach`	14	2018 - 2019	Busan, S.Korea	Jang et al., 2021
Ecoli Mekong River	`aqua_fetch.ecoli_mekong`	10	2011 - 2021	Mekong river (Houay Pano)	Boithias et al., 2022
Ecoli Mekong River (Laos)	`aqua_fetch.ecoli_mekong_laos`	10	2011 - 2021	Mekong River (Laos)	Boithias et al., 2022
Ecoli Houay Pano (Laos)	`aqua_fetch.ecoli_houay_pano`	10	2011 - 2021	Houay Pano (Laos)	Boithias et al., 2022
CamelsChem	`aqua_fetch.CamelsChem`	18	1980 - 2018	Continental USA	Sterle et al., 2024
Global River Methane	`aqua_fetch.GRiMeDB`	18		Global	Stanley et al., 2023
Sylt Roads	`aqua_fetch.SyltRoads`	18	1973 - 2019	North Sea (Germany)	Rick et al., 2023
San Francisco Bay	`aqua_fetch.SanFranciscoBay`	18	1973 - 2019	San Francisco (USA)	Schraga et al., 2017
Buzzards Bay	`aqua_fetch.BuzzardsBay`	18	1992 - 2018	Buzzards Bay (USA)	Jakuba et al.
White Clay Creek	`aqua_fetch.WhiteClayCreek`	2	1973 - 2019	White Clay Creek (USA)	Newbold and Damiano 2013
Selune River, France	`aqua_fetch.SeluneRiver`	5	2021 - 2022	Selune River, (France)	Moustapha Ba et al., 2023

Functions and Classes

class aqua_fetch.SWatCh(remove_csv_after_download=False, path=None, **kwargs)[source]

Bases: Datasets

The Surface Water Chemistry (SWatCh) database of 27 variables from 26322 locations, as introduced in Lobke et al., 2022. Note that not all variables are available at all locations. The following variables are available in the dataset:

Total Phosphorus, mixed forms

Sulfate

pH

Temperature, water

Chloride

Magnesium

Calcium

Sodium

Potassium

Aluminum

Nitrate

Nitrite

Fluoride

Hardness, carbonate

Iron

Ammonium

Organic carbon

Bicarbonate

Orthophosphate

Gran acid neutralizing capacity

Alkalinity, total

Inorganic carbon

Carbonate

Alkalinity, carbonate

Hardness, non-carbonate

Carbon Dioxide, free CO2

Alkalinity, Phenolphthalein (total hydroxide+1/2 carbonate)

Examples

>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> len(ds.parameters)
22
>>> len(ds.sites)
26322
>>> coords = ds.stn_coords()
>>> coords.shape
(26322, 2)

__init__(remove_csv_after_download=False, path=None, **kwargs)[source]

Parameters:: remove_csv_after_download (bool (default=False)) – if True, the csv will be removed after downloading and processing.

fetch(parameters: list | str = None, station_id: list | str = None, station_names: list | str = None) → DataFrame[source]

Parameters:

parameters (str/list (default=None)) – Names of parameters to fetch. By default, name, value, val_unit, location, lat, and long are read.
station_id (str/list (default=None)) – id/ids of stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_names should not be given.
station_names (str/list (default=None)) – name/names of stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_id should not be given.

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> st_name = "Jordan Lake"
>>> df = df[df['location'] == st_name]
>>> df.shape
(4, 6)
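
The rule that station_id and station_names should not be given together can be checked before any data is read. The following is only a minimal sketch of that rule: `resolve_station_filter` is a hypothetical helper, not part of aqua_fetch, and this documentation does not show how the library itself validates these arguments.

```python
from typing import List, Optional, Tuple, Union

StrOrList = Union[str, List[str]]


def resolve_station_filter(
    station_id: Optional[StrOrList] = None,
    station_names: Optional[StrOrList] = None,
) -> Tuple[Optional[str], Optional[List[str]]]:
    """Return (kind, values) describing the station filter.

    Hypothetical helper: it only mirrors the documented rule that
    ``station_id`` and ``station_names`` are mutually exclusive.
    """
    if station_id is not None and station_names is not None:
        raise ValueError("give either station_id or station_names, not both")
    if station_id is not None:
        values = [station_id] if isinstance(station_id, str) else list(station_id)
        return "id", values
    if station_names is not None:
        values = [station_names] if isinstance(station_names, str) else list(station_names)
        return "name", values
    return None, None  # default: no filter, i.e. all stations


print(resolve_station_filter(station_names="Jordan Lake"))
```

A single string and a list of strings are both normalised to a list, which matches the str/list types accepted by fetch.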

property names: dict: tells the names of parameters in this class and their original names in SWatCh dataset in the form of a python dictionary

num_samples(parameter, station_id=None) → int[source]

Parameters:

parameter (str) – name of the water quality parameter whose samples are to be quantified.
station_id – if given, samples of parameter will be returned for only this site/sites otherwise for all sites

property parameters: list: list of water quality parameters available

property site_names: list: list of site names

property sites: list: list of site names

stn_coords()[source]

Returns the coordinates of all the stations in the dataset

Returns:: A dataframe with columns ‘lat’, ‘long’
Return type:: pd.DataFrame

class aqua_fetch.GRQA(download_source: bool = False, path=None, **kwargs)[source]

Bases: Datasets

Global River Water Quality Archive following the work of Virro et al., 2021. This dataset comprises 42 parameters for 94955 sites across 116 countries.

Examples

>>> from aqua_fetch import GRQA
>>> ds = GRQA(path="/mnt/datawaha/hyex/atr/data")
>>> ds.parameters
['TPP', 'PON', 'TEMP', 'TSS', ...]
>>> print(len(ds.parameters))
42
>>> len(ds.countries)
116
>>> len(ds.stations())
94955
>>> coords = ds.stn_coords()
>>> coords.shape
(94955, 2)
>>> country = "Pakistan"
>>> len(ds.fetch_parameter('TEMP', country=country))
1324
>>> df = ds.fetch_parameter("TEMP", country=country)
>>> print(df.shape)
(1324, 38)
>>> df = ds.fetch_parameter("NH4N", country=country)
>>> print(df.shape)
(28, 36)

__init__(download_source: bool = False, path=None, **kwargs)[source]

Parameters:: download_source (bool) – whether to download source data or not

fetch_parameter(...)

Parameters:

parameter (str, optional) – name of parameter
site_name (str/list, optional) – location for which data is to be fetched.
country (str/list, optional (default=None))
st (str) – start date or index
en (str) – end date or index

Returns:

a pandas dataframe

Return type:

pd.DataFrame

Example

>>> from aqua_fetch import GRQA
>>> dataset = GRQA()
>>> df = dataset.fetch_parameter()
... # fetch data for only one country
>>> cod_pak = dataset.fetch_parameter("COD", country="Pakistan")
... # fetch data for only one site
>>> cod_kotri = dataset.fetch_parameter("COD", site_name="Indus River - at Kotri")
... # find the number of data points and sites available for a specific country
>>> for para in dataset.parameters:
...     data = dataset.fetch_parameter(para, country="Germany")
...     if len(data) > 0:
...         print(f"{para}, {data.shape}, {len(data['site_name'].unique())}")

sites_data() → DataFrame[source]: Returns the meta data for the dataset

stations() → List[str][source]: Returns names of stations/site_id

stn_coords()[source]

Returns the coordinates of all the stations in the dataset

Returns:: A dataframe with columns ‘lat’, ‘long’
Return type:: pd.DataFrame

class aqua_fetch.Quadica(path=None, **kwargs)[source]

Bases: Datasets

This dataset provides 10 water quality parameters from 1386 stations in Germany for the period 1950 to 2018, following the work of Ebeling et al., 2022. Data are available at monthly and annual time steps, but the monthly time series are not continuous. The following parameters are available in this dataset:

Q : Discharge

NO3 : Nitrate

NO3N : Nitrate-N

NMin : Mineral nitrogen

TN : Total Nitrogen

PO4 : Phosphate

PO4P : Phosphate-P

TP : Total Phosphorus

DOC : Dissolved Organic Carbon

TOC : Total Organic Carbon

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> len(dataset.stations())
1386
>>> coords = dataset.stn_coords()
>>> coords.shape
(1386, 2)
>>> df = dataset.wrtds_monthly()
>>> df.shape
(50186, 47)
>>> df = dataset.wrtds_annual()
>>> df.shape
(4213, 46)
>>> df = dataset.pet()
>>> df.shape
(828, 1386)
>>> df = dataset.avg_temp()
>>> df.shape
(828, 1388)
>>> df = dataset.precipitation()
>>> df.shape
(828, 1388)
>>> df = dataset.catchment_attributes()
>>> df.shape
(1386, 112)
>>> df = dataset.metadata()
>>> df.shape
(1386, 60)
>>> df = dataset.monthly_medians()
>>> df.shape
(16629, 18)
>>> df = dataset.annual_medians()
>>> df.shape
(24393, 18)
>>> df = dataset.fetch_monthly()
>>> df[0].shape
(50186, 47)

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

annual_medians() → DataFrame[source]

Annual medians over the whole time series of water quality variables and discharge

Returns:: a dataframe of shape (24393, 18)
Return type:: pd.DataFrame

avg_temp(...)

Monthly median average temperatures from 1950-01 to 2018-09

Parameters:

stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a pandas dataframe of shape (time_steps, stations). With default input arguments, the shape is (828, 1386)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.avg_temp() # -> (828, 1388)

catchment_attributes(parameters: List[str] | str = None, stations: List[int] | int = None) → DataFrame[source]

Returns static physical catchment attributes in the form of dataframe.

Parameters:

parameters (list/str, optional, (default=None)) – name/names of static attributes to fetch
stations (list/int, optional (default=None)) – name/names of stations whose static/physical parameters are to be read

Returns:

a pandas dataframe of shape (stations, parameters). With default input arguments, shape is (1386, 113)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> cat_features = dataset.catchment_attributes()
... # get attributes of only selected stations
>>> dataset.catchment_attributes(stations=[1,2,3])

fetch_monthly(parameters: List[str] | str = None, stations: List[int] | int = 'all', median: bool = True, fnc: bool = True, fluxes: bool = True, precipitation: bool = True, avg_temp: bool = True, pet: bool = True, only_continuous: bool = True, cat_features: bool = True, max_nan_tol: int | None = 0) → Tuple[DataFrame, DataFrame][source]

Fetches monthly concentrations of water quality parameters.

Parameters:

parameters (str/list, optional (default=None)) –
name or names of water quality parameters to fetch. By default following parameters are considered
- NO3
- NO3N
- TN
- Nmin
- PO4
- PO4P
- TP
- DOC
- TOC
stations (int/list, optional (default=None)) – name or names of stations whose data is to be fetched
median (bool, optional (default=True)) – whether to fetch median concentration values or not
fnc (bool, optional (default=True)) – whether to fetch flow normalized concentrations or not
fluxes (bool, optional (default=True)) – Setting this to true will add two parameters i.e. mean_Flux_FEATURE and mean_FNFlux_FEATURE
precipitation (bool, optional (default=True)) – whether to fetch average monthly precipitation or not
avg_temp (bool, optional (default=True)) – whether to fetch average monthly temperature or not
pet (bool, optional (default=True)) – whether to fetch potential evapotranspiration data or not
only_continuous (bool, optional (default=True)) – If true, will return data for only those stations that have continuous monthly time series data from 1993-01-01 to 2013-01-01.
cat_features (bool, optional (default=True)) – whether to fetch catchment parameters or not.
max_nan_tol (int, optional (default=0)) – setting this value to 0 will remove every time series that has any missing values. If None, no time series with NaN values will be removed.

Returns:

two dataframes whose length is same but the columns are different

a pandas dataframe of timeseries of parameters (stations*timesteps, dynamic_features)
a pandas dataframe of static parameters (stations*timesteps, catchment_features)

Return type:

tuple

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> mon_dyn, mon_cat = dataset.fetch_monthly(max_nan_tol=None)
... # However, mon_dyn contains data for all parameters and many of which have
... # large number of nans. If we want to fetch data only related to TN without any
... # missing value, we can do as below
>>> mon_dyn_tn, mon_cat_tn = dataset.fetch_monthly(parameters="TN", max_nan_tol=0)
... # if we want to find out how many catchments are included in mon_dyn_tn
>>> len(mon_dyn_tn['OBJECTID'].unique())
... # 25
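
The effect of max_nan_tol can be illustrated on toy data. The sketch below is not the library's implementation: `filter_by_nan_tolerance` is a hypothetical helper, and treating values above 0 as an upper bound on the number of NaNs per station is an assumption, since only the 0 and None cases are described above.

```python
import math
from typing import Dict, List, Optional


def filter_by_nan_tolerance(
    series_by_station: Dict[int, List[float]],
    max_nan_tol: Optional[int] = 0,
) -> Dict[int, List[float]]:
    """Keep only stations whose series has at most ``max_nan_tol`` NaNs.

    ``None`` disables the filter. Interpreting positive values as an
    upper bound on the NaN count is an assumption made for this sketch.
    """
    if max_nan_tol is None:
        return dict(series_by_station)
    return {
        stn: values
        for stn, values in series_by_station.items()
        if sum(math.isnan(v) for v in values) <= max_nan_tol
    }


nan = float("nan")
toy = {1: [0.4, 0.5, 0.6], 2: [0.3, nan, 0.2], 3: [nan, nan, 0.1]}
print(sorted(filter_by_nan_tolerance(toy, max_nan_tol=0)))     # [1]
print(sorted(filter_by_nan_tolerance(toy, max_nan_tol=None)))  # [1, 2, 3]
```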

metadata() → DataFrame[source]

fetches the metadata about the stations as pandas’ dataframe. Each row represents metadata about one station and each column represents one feature. The R2 and pbias are regression coefficients and percent bias of WRTDS models for each parameter.

Returns:: a dataframe of shape (1386, 60)
Return type:: pd.DataFrame

monthly_medians(parameters: List[str] | str = None, stations: List[int] | int = None) → DataFrame[source]

This function reads the c_months.csv file which contains the monthly medians over the whole time series of water quality variables and discharge

Parameters:

parameters (list/str, optional, (default=None)) – name/names of parameters
stations (list/int, optional (default=None)) – stations for which data is to be fetched

Returns:

a dataframe of shape (16629, 18). 15 of the 18 columns each represent a water chemistry parameter. The 16629 rows correspond to roughly one row per station per month (1386 stations * 12 months = 16632).

Return type:

pd.DataFrame

property parameters: list: names of water quality parameters available in this dataset

pet(...)

Average monthly potential evapotranspiration from 1950-01 to 2018-09

Returns:: a dataframe of shape (828, 1386), where 828 is the number of monthly time steps and 1386 is the number of stations
Return type:: pd.DataFrame
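
The first dimension of the returned dataframe is the number of monthly time steps. As a quick sanity check, and assuming one row per calendar month with both end points included, the length of a monthly index can be computed as below. `months_inclusive` is a hypothetical helper and not part of aqua_fetch.

```python
def months_inclusive(start: str, end: str) -> int:
    """Number of monthly steps from ``start`` to ``end`` ('YYYY-MM', inclusive)."""
    y0, m0 = (int(part) for part in start.split("-"))
    y1, m1 = (int(part) for part in end.split("-"))
    return (y1 - y0) * 12 + (m1 - m0) + 1


print(months_inclusive("1950-01", "2018-09"))  # 825
print(months_inclusive("1950-01", "2018-12"))  # 828
```

Under that assumption, 828 rows correspond to a monthly index that runs through 2018-12, while a range ending in 2018-09 has 825 steps.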

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.pet() # -> (828, 1386)

precipitation(...)

Sums of precipitation from 1950-01 to 2018-09

Parameters:

stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a dataframe of shape (828, 1388)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.precipitation() # -> (828, 1388)

property station_names: List[str]: names of stations

stations() → list[source]: IDs of stations for which data is available

stn_coords() → DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:: A dataframe with columns ‘lat’, ‘long’
Return type:: pd.DataFrame

to_DataSet(target: str = 'TP', input_features: list = None, split: str = 'temporal', lookback: int = 24, **ds_args)[source]

This function prepares data for machine learning prediction problem. It returns an instance of ai4water.preprocessing.DataSetPipeline which can be given to model.fit or model.predict

Parameters:

target (str, optional (default="TP")) – parameter to consider as target
input_features (list, optional) – names of input parameters
split (str, optional (default="temporal")) – if temporal, validation and test sets are taken from the data of each station and then concatenated. If spatial, training validation and test is decided based upon stations.
lookback (int)
**ds_args – key word arguments

Returns:

an instance of DataSetPipeline

Return type:

ai4water.preprocessing.DataSet
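
The difference between the two split modes can be sketched on toy data. This is an illustration only, not the ai4water or aqua_fetch implementation: the helper names and the n_test_steps / n_test_stations arguments are hypothetical, and the sketch uses only a training and a test set for brevity.

```python
from typing import Dict, List, Tuple

Series = List[float]
Split = Tuple[Dict[int, Series], Dict[int, Series]]


def temporal_split(data: Dict[int, Series], n_test_steps: int = 3) -> Split:
    """Hold out the last ``n_test_steps`` time steps of every station."""
    train = {stn: values[:-n_test_steps] for stn, values in data.items()}
    test = {stn: values[-n_test_steps:] for stn, values in data.items()}
    return train, test


def spatial_split(data: Dict[int, Series], n_test_stations: int = 2) -> Split:
    """Hold out whole stations; each station is entirely in one set."""
    stations = sorted(data)
    train = {stn: data[stn] for stn in stations[:-n_test_stations]}
    test = {stn: data[stn] for stn in stations[-n_test_stations:]}
    return train, test


# four toy stations with ten time steps each
toy = {stn: [float(t) for t in range(10)] for stn in (1, 2, 3, 4)}
t_train, t_test = temporal_split(toy)
s_train, s_test = spatial_split(toy)
print(sorted(t_test), len(t_test[1]))  # [1, 2, 3, 4] 3
print(sorted(s_test), len(s_test[3]))  # [3, 4] 10
```

With a temporal split every station contributes to both sets, whereas with a spatial split the test set contains stations the model has never seen.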

Example

>>> from aqua_fetch import Quadica
... # initialize the Quadica class
>>> dataset = Quadica()
... # define the input parameters
>>> inputs = ['median_Q', 'OBJECTID', 'avg_temp', 'precip', 'pet']
... # prepare data for TN as target
>>> dsp = dataset.to_DataSet("TN", inputs, lookback=24)

wrtds_annual(...)

Annual median concentrations, flow-normalized concentrations, and mean fluxes estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with sufficient data.

Parameters:

parameters (optional)
st (optional) – starting point of data. By default, the data starts from 1992
en (optional) – end point of data. By default, the data ends at 2013

Returns:

a dataframe of shape (4213, 46)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_annual()

wrtds_monthly(...)

Monthly median concentrations, flow-normalized concentrations and mean fluxes of water chemistry parameters, estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with sufficient data. This data is available for a total of 140 stations. Not all stations cover the same period, so the number of data points differs between stations: the maximum for a station is 576 and the minimum is 244.

Parameters:

parameters (str/list, optional)
stations (int/list, optional (default=None)) – name/names of stations whose data is to be retrieved.
st (optional) – starting point of data. By default, the data starts from 1992-09
en (optional) – end point of data. By default, the data ends at 2013-12

Returns:

a dataframe of shape (50186, 47)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_monthly()

class aqua_fetch.RC4USCoast(path=None, *args, **kwargs)[source]

Bases: Datasets

Monthly river water chemistry (N, P, SiO2, DO, etc.), discharge and temperature of 140 monitoring sites on US coasts from 1950 to 2020, following the work of Gomez et al., 2022.

Examples

>>> from aqua_fetch import RC4USCoast
>>> dataset = RC4USCoast()
>>> len(dataset.stations)
140
>>> len(dataset.parameters)
27
>>> stn_coords = dataset.stn_coords()
>>> stn_coords.shape
(140, 2)

__init__(path=None, *args, **kwargs)[source]

Parameters:: path – path where the data is already downloaded. If None, the data will be downloaded into the disk.

fetch_chem(parameter, stations: List[int] | int | str = 'all', as_dataframe: bool = False, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None)[source]

Returns water chemistry parameters from one or more stations.

Parameters:

parameter (list, str) – name/names of parameters to fetch
stations (list, str) – name/names of stations from which the parameters are to be fetched
as_dataframe (bool (default=False)) – whether to return data as pandas.DataFrame or xarray.Dataset
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201

Return type:

pandas DataFrame or xarray Dataset
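
With as_dataframe=True, the examples for this method report a multi-indexed dataframe of shape (119280, 4). The row count is consistent with one row per (time step, station) pair over the default period. The sketch below reproduces that count with the standard library only; the pairing of time steps with stations is an assumption about the index layout, and the station IDs 1 to 140 are placeholders.

```python
from itertools import product

# Monthly steps from 1950-01 to 2020-12 (the documented default st/en).
months = [(year, month) for year in range(1950, 2021) for month in range(1, 13)]
stations = range(1, 141)  # 140 monitoring sites; IDs are placeholders

# Assumed layout: one row per (time step, station) pair.
index = list(product(months, stations))

print(len(months))  # 852
print(len(index))   # 119280
```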

Examples

>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> data = ds.fetch_chem(['temp', 'do'])
>>> data
>>> data = ds.fetch_chem(['temp', 'do'], as_dataframe=True)
>>> data.shape  # this is a multi-indexed dataframe
(119280, 4)
>>> data = ds.fetch_chem(['temp', 'do'], st="19800101", en="20181230")

fetch_q(...)

Returns discharge data

Parameters:

stations – stations for which q is to be fetched
as_dataframe (bool (default=True)) – whether to return the data as pd.DataFrame or as xarray.Dataset
nv (int (default=0))
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201

Examples

>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
# get data of all stations as DataFrame
>>> q = ds.fetch_q("all")
>>> q.shape
(852, 140)  # where 140 is the number of stations
# get data of only two stations
>>> q = ds.fetch_q([1,10])
>>> q.shape
(852, 2)
# get data as xarray Dataset
>>> q = ds.fetch_q("all", as_dataframe=False)
>>> type(q)
xarray.core.dataset.Dataset
# getting data between specific periods
>>> data = ds.fetch_q("all", st="20000101", en="20181230")

property parameters: List[str]

returns names of parameters

Examples

>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> len(ds.parameters)
27

property stations: List[str]

>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast(path=r'F:\data\RC4USCoast')
>>> len(ds.stations)
140

stn_coords() → DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:: A dataframe with columns ‘lat’, ‘long’
Return type:: pd.DataFrame

class aqua_fetch.CamelsChem(path=None, **kwargs)[source]

Bases: Datasets

Water quality data from the USA following the work of Sterle et al., 2024. This dataset has 18 water chemistry parameters from 1980-01-01 to 2018-12-31. The data is downloaded from HydroShare. Out of 671 stations, 155 stations have no water quality data. The wet deposition data consists of 12 parameters from 1985 to 2018.

Examples

>>> from aqua_fetch import CamelsChem
>>> ds = CamelsChem(path='/path/to/dataset')
>>> len(ds.stations())
516
>>> len(ds.parameters)
28

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

atm_dep_data() → DataFrame[source]: reads the atmospheric deposition data

atm_dep_metadata() → DataFrame[source]: reads the atm_dep_metadata

property atm_dep_parameters: List[str]: returns the names of parameters in the atm_dep dataset

data() → DataFrame[source]: reads the main dataset which has shape of (76284, 45)

fetch(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') → Dict[str, DataFrame][source]

fetches the data for the given stations and parameters

Parameters:

stations (Union[str, List[str]]) – list of stations to fetch data for
parameters (Union[str, List[str]]) – list of parameters to fetch data for

Returns:

dictionary of dataframes for each station

Return type:

Dict[str, pd.DataFrame]

Examples

>>> ds = CamelsChem(path='/mnt/datawaha/hyex/atr/data')
>>> data = ds.fetch(stations=['1591400', '6350000'], parameters=['cl_mg/l', 'na_mg/l'])
>>> data = ds.fetch('1591400', 'cl_mg/l')['1591400']
>>> data.shape # (55, 1)
... # get all parameters for a station
>>> data = ds.fetch('1591400')['1591400']
>>> data.shape # (55, 28)
>>> all_data = ds.fetch()  # get all parameters of all stations
>>> len(all_data) # 516

fetch_atm_dep(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') → Dict[str, DataFrame][source]

fetches the data for the given stations and parameters

Parameters:

stations (Union[str, List[str]]) – list of stations to fetch data for
parameters (Union[str, List[str]]) – list of parameters to fetch data for

Returns:

dictionary of dataframes for each station

Return type:

Dict[str, pd.DataFrame]

Examples

>>> ds = CamelsChem(path='/mnt/datawaha/hyex/atr/data')
... # get data for a single station and a single parameter
>>> data = ds.fetch_atm_dep(stations='1591400', parameters='cl')
>>> print(data['1591400'].shape)  # (34, 8)
... # get data for multiple stations and multiple parameters
>>> data = ds.fetch_atm_dep(stations=['1591400', '6350000'], parameters=['cl', 'na'])
>>> print(data['1591400'].shape)  # (34, 16)
>>> print(data['6350000'].shape)  # (34, 16)
... # get data for all stations and for all parameters
>>> data = ds.fetch_atm_dep()
>>> print(len(data))  # 671

gauge_and_region_names() → DataFrame[source]: reads the gauge and region names

metrics()[source]: reads metrics.xlsx which contains metadata

property parameters: List[str]: returns the names of parameters in the dataset

stations() → List[str][source]: returns the list of stations in the dataset

stn_coords() → DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:: A dataframe with columns ‘lat’, ‘long’
Return type:: pd.DataFrame

topography() → DataFrame[source]: reads the topography data

class aqua_fetch.SyltRoads(path=None, **kwargs)[source]

Bases: Datasets

Dataset of physico-hydro-chemical time series data at Sylt Roads from 1973 to 2019, following Rick et al., 2023. The following parameters are available:

location

Depth water [m]

Sal

Temp [°C]

[PO4]3- [µmol/l]

[NH4]+ [µmol/l]

[NO2]- [µmol/l]

[NO3]- [µmol/l]

Si(OH)4 [µmol/l]

SPM [mg/l]

pH

O2 [µmol/l]

Chl a [µg/l]

DON [µmol/l]

DOP [µmol/l]

DIN [µmol/l]

Examples

>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

fetch(parameters: str | List[str] = 'all') → DataFrame[source]

Fetch the data from the dataset

Parameters:: parameters (str or List[str], optional) – Parameters to fetch. The default is ‘all’, which will fetch all parameters
Returns:: DataFrame containing the data
Return type:: pd.DataFrame

Examples

>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()
>>> df = ds.fetch()
>>> df.shape
(5710, 16)
>>> len(ds.parameters)
16
>>> ds.fetch(['Sal', 'Temp [°C]', 'pH']).shape
(5710, 3)

property parameters: List[str]: returns names of parameters in the dataset

stn_coords() → DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:: A dataframe with columns ‘lat’, ‘long’
Return type:: pd.DataFrame

class aqua_fetch.SanFranciscoBay(path=None, **kwargs)[source]

Bases: Datasets

Time series of water quality parameters from 59 stations in San Francisco Bay from 1969 to 2015. For details on the data, see Cloern et al., 2017 and Schraga et al., 2017. The following parameters are available:

Depth

Discrete_Chlorophyll

Ratio_DiscreteChlorophyll_Pheopigment

Calculated_Chlorophyll

Discrete_Oxygen

Calculated_Oxygen

Oxygen_Percent_Saturation

Discrete_SPM

Calculated_SPM

Extinction_Coefficient

Salinity

Temperature

Sigma_t

Nitrite

Nitrate_Nitrite

Ammonium

Phosphate

Silicate

Examples

>>> from aqua_fetch import SanFranciscoBay
>>> ds = SanFranciscoBay()
>>> data = ds.data()
>>> data.shape
(212472, 19)
>>> stations = ds.stations()
>>> len(stations)
59
>>> parameters = ds.parameters()
>>> len(parameters)
18
... # fetch data for station 18
>>> stn18 = ds.fetch(stations='18')
>>> stn18.shape
(13944, 18)

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

fetch(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') → DataFrame[source]

Parameters:: parameters (Union[str, List[str]], optional) – The parameters to return. The default is ‘all’.
Returns:: the data for the requested stations and parameters
Return type:: pd.DataFrame

stn_data(stations: str | List[str] = 'all') → DataFrame[source]: Get station metadata.

class aqua_fetch.GRiMeDB(path=None, **kwargs)[source]

Bases: Datasets

Global river database of methane concentrations and fluxes from 5029 stations of 305 rivers following Stanley et al., 2023

Examples

>>> from aqua_fetch import GRiMeDB
>>> ds = GRiMeDB(path='/path/to/dataset')
>>> ds.stations()
>>> ds.streams
>>> coords = ds.stn_coords()
>>> coords.shape
(5029, 2)

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

concentrations(stations: str | List[str] = 'all', streams: str | List[str] = 'all', parameters: str | List[str] = 'all')[source]

Get concentrations data.

Parameters:

stations (Union[str, List[str]], optional) – station ID or list of station IDs, by default “all”. If given, then streams must not be given. Check .stations() method for available stations.
streams (Union[str, List[str]], optional) – stream name or list of stream names, by default “all”. If given, then stations must not be given. Check .streams attribute for available streams.
parameters (Union[str, List[str]], optional) – parameters to return, by default “all”. Check .parameters attribute for available parameters.

fluxes(stations: str | List[str] = 'all') → DataFrame[source]: returns fluxes data as a pandas dataframe

stn_coords() → DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:: A dataframe with columns ‘lat’, ‘long’
Return type:: pd.DataFrame

property streams: List[str]: returns names of streams

class aqua_fetch.BuzzardsBay(path=None, **kwargs)[source]

Bases: Datasets

Water quality measurements in Buzzards Bay from 1992 to 2018. For more details on the data, see Jakuba et al. The data is downloaded from the MBLWHOI Library.

Examples

>>> from aqua_fetch import BuzzardsBay
>>> ds = BuzzardsBay()
>>> doc = ds.doc()
>>> doc.shape
(11092, 4)
>>> chla = ds.chla()
>>> chla.shape
(1028, 10)

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

fetch(parameters: str | List[str] = 'all') → DataFrame[source]: Fetch data for the specified parameters.

class aqua_fetch.WhiteClayCreek(path=None, **kwargs)[source]

Bases: Datasets

Time series of water quality parameters from White Clay Creek.

chl-a : 2001 - 2012

Dissolved Organic Carbon : 1977 - 2017

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

chla() → DataFrame[source]: Chlorophyll-a data

doc() → DataFrame[source]: Dissolved Organic Carbon data

class aqua_fetch.SeluneRiver(path=None, **kwargs)[source]

Bases: Datasets

Dataset of physico-chemical variables measured at different levels during 2021 and 2022 for characterization of the hyporheic zone of the Selune River, Manche, Normandie, France, following Moustapha Ba et al., 2023. The data is available at data.gouv.fr. The following variables are available:

water level

temperature

conductivity

oxygen

pressure

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping

data() → DataFrame[source]: Return a DataFrame of the data

aqua_fetch.busan_beach(inputs: list = None, target: list | str = 'tetx_coppml') → DataFrame[source]

Loads the antibiotic resistance genes (ARG) data from a recreational beach in Busan, South Korea, along with environmental variables.

The data is in the form of a multivariate time series and was collected over a period of 2 years during several precipitation events. The frequency of the environmental data is 30 minutes, while that of the ARG data is discontinuous. The data and its pre-processing are described in detail in Jang et al., 2021.

Parameters:

inputs –
features to use as input. By default, all environmental data is used, which consists of the following parameters:
- tide_cm
- wat_temp_c
- sal_psu
- air_temp_c
- pcp_mm
- pcp3_mm
- pcp6_mm
- pcp12_mm
- wind_dir_deg
- wind_speed_mps
- air_p_hpa
- mslp_hpa
- rel_hum
target –
feature or features to use as target/output. By default, tetx_coppml is used as the target. One or more of the following can be used as target:
- ecoli
- 16s
- inti1
- Total_args
- tetx_coppml
- sul1_coppml
- blaTEM_coppml
- aac_coppml
- Total_otus
- otu_5575
- otu_273
- otu_94

Returns:

a pandas dataframe with inputs and target, indexed with a pandas.DatetimeIndex

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import busan_beach
>>> dataframe = busan_beach()
>>> dataframe.shape
(1446, 14)
>>> dataframe = busan_beach(target=['tetx_coppml', 'sul1_coppml'])
>>> dataframe.shape
(1446, 15)
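
The column counts in the shapes above follow from the number of input and target features: the 13 default environmental inputs plus one target give 14 columns, and a second target gives 15. The sketch below reproduces that arithmetic without downloading any data. `expected_columns` is a hypothetical helper written for illustration and is not part of aqua_fetch; the ordering of inputs before targets is also an assumption.

```python
# Default environmental inputs as listed in the parameter description above.
DEFAULT_INPUTS = [
    "tide_cm", "wat_temp_c", "sal_psu", "air_temp_c",
    "pcp_mm", "pcp3_mm", "pcp6_mm", "pcp12_mm",
    "wind_dir_deg", "wind_speed_mps", "air_p_hpa", "mslp_hpa", "rel_hum",
]


def expected_columns(inputs=None, target="tetx_coppml"):
    """Hypothetical helper: column names expected for an inputs/target choice."""
    if inputs is None:
        inputs = DEFAULT_INPUTS
    # `target` may be a single name or a list of names.
    targets = [target] if isinstance(target, str) else list(target)
    # Assumption: inputs come first, followed by the target column(s).
    return list(inputs) + targets


print(len(expected_columns()))                                       # 14
print(len(expected_columns(target=["tetx_coppml", "sul1_coppml"])))  # 15
```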

See usage here for more details.

E. coli data from the Mekong river (Houay Pano) area from 2011 to 2021; see Boithias et al., 2022.

Parameters:

st (optional) – starting time. The default starting point is 2011-05-25 10:00:00
en (optional) – end time. The default end point is 2021-05-25 15:41:00
parameters (str, optional) –
names of features to use. Use 'all' to get all features. By default, the following features are selected:
- station_name name of station/catchment where the observation was made
- T temperature
- EC electrical conductance
- DOpercent dissolved oxygen saturation
- DO dissolved oxygen concentration
- pH pH
- ORP oxidation-reduction potential
- Turbidity turbidity
- TSS total suspended sediment concentration
- E-coli_4dilutions Escherichia coli concentration
overwrite (bool) – whether to overwrite the downloaded file or not

Returns:

with default parameters, the shape is (1602, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_mekong
>>> ecoli_data = ecoli_mekong()
>>> ecoli_data.shape
(1602, 10)
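
Because `parameters` takes feature names as strings, a misspelled name is easy to pass by accident. The following is a minimal sketch of checking a requested subset against the default feature names documented above before calling `ecoli_mekong`. `check_parameters` is a hypothetical helper, not part of aqua_fetch, and passing 'all' through unchanged is an assumption, since the full feature list is not given here.

```python
# Default feature names as listed in the parameter description above.
DOCUMENTED = [
    "station_name", "T", "EC", "DOpercent", "DO",
    "pH", "ORP", "Turbidity", "TSS", "E-coli_4dilutions",
]


def check_parameters(parameters="all"):
    """Hypothetical helper: validate requested feature names."""
    # Assumption: 'all' is forwarded as-is, because the complete
    # set of features is not listed in this documentation.
    if parameters == "all":
        return parameters
    requested = [parameters] if isinstance(parameters, str) else list(parameters)
    unknown = [p for p in requested if p not in DOCUMENTED]
    if unknown:
        raise ValueError(f"unknown parameter(s): {unknown}")
    return requested


print(check_parameters(["T", "pH", "E-coli_4dilutions"]))
```

The validated list could then be passed as the `parameters` argument.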

E. coli data from the Mekong river (Northern Laos).

Parameters:

st – starting time
en – end time
station_name (str)
parameters (str, optional)
overwrite (bool) – whether to overwrite or not

Returns:

with default parameters, the shape is (1131, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_mekong_laos
>>> ecoli = ecoli_mekong_laos()
>>> ecoli.shape
(1131, 10)

E. coli data from the Mekong river (Houay Pano) area.

Parameters:

st (optional) – starting time. The default starting point is 2011-05-25 10:00:00
en (optional) – end time. The default end point is 2021-05-25 15:41:00
parameters (str, optional) –
names of features to use. Use 'all' to get all features. By default, the following features are selected:
- station_name name of station/catchment where the observation was made
- T temperature
- EC electrical conductance
- DOpercent dissolved oxygen saturation
- DO dissolved oxygen concentration
- pH pH
- ORP oxidation-reduction potential
- Turbidity turbidity
- TSS total suspended sediment concentration
- E-coli_4dilutions Escherichia coli concentration
overwrite (bool) – whether to overwrite the downloaded file or not

Returns:

with default parameters, the shape is (413, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_houay_pano
>>> ecoli = ecoli_houay_pano()
>>> ecoli.shape
(413, 10)
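
The `st` and `en` arguments bound the returned records in time. The sketch below illustrates the idea with the documented default bounds, using only the standard library and a few made-up timestamps. `in_window` is a hypothetical helper, and treating both bounds as inclusive is an assumption rather than documented behaviour.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Default bounds as documented for `st` and `en`.
DEFAULT_ST = datetime.strptime("2011-05-25 10:00:00", FMT)
DEFAULT_EN = datetime.strptime("2021-05-25 15:41:00", FMT)


def in_window(timestamp, st=DEFAULT_ST, en=DEFAULT_EN):
    """Hypothetical helper: True if `timestamp` lies within [st, en]."""
    t = datetime.strptime(timestamp, FMT)
    # Assumption: both ends of the window are inclusive.
    return st <= t <= en


# Illustrative timestamps, not taken from the dataset.
samples = ["2011-05-25 09:59:00", "2016-07-01 12:00:00", "2021-05-25 15:41:00"]
kept = [s for s in samples if in_window(s)]
print(kept)  # ['2016-07-01 12:00:00', '2021-05-25 15:41:00']
```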

E. coli data from the Mekong river from 2016, collected from 29 catchments.

Parameters:

st – starting time
en – end time
parameters (str, optional) – names of parameters to use. use all to get all features.
overwrite (bool) – whether to overwrite the downloaded file or not

Returns:

with default parameters, the shape is (58, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_mekong_2016
>>> ecoli = ecoli_mekong_2016()
>>> ecoli.shape
(58, 10)