Water Quality

The wq submodule contains datasets that represent surface water chemistry at various locations worldwide. Currently, it includes 16 water quality datasets, but we anticipate this number will increase in the future. The spatial and temporal coverage of these datasets is detailed in the following table.

List of datasets

Summary of datasets

| Dataset | Class / Function Name | Variables Covered | Temporal Coverage | Spatial Coverage | Reference |
|---|---|---|---|---|---|
| Surface Water Chemistry | aqua_fetch.SWatCh | 24 | 1960 - 2022 | Global | Lobke et al., 2022 |
| Global River Water Quality Archive | aqua_fetch.GRQA | 42 | 1898 - 2020 | Global | Virro et al., 2021 |
| water QUAlity, DIscharge and Catchment Attributes | aqua_fetch.Quadica | 10 | 1950 - 2018 | Germany | Ebeling et al., 2022 |
| River chemistry for US coasts | aqua_fetch.RC4USCoast | 21 | 1850 - 2020 | USA | Gomez et al., 2022 |
| Busan Beach | aqua_fetch.busan_beach | 14 | 2018 - 2019 | Busan, South Korea | Jang et al. |
| Ecoli Mekong River | aqua_fetch.ecoli_mekong | 10 | 2011 - 2021 | Mekong River (Houay Pano) | Boithias et al., 2022 |
| Ecoli Mekong River (Laos) | aqua_fetch.ecoli_mekong_laos | 10 | 2011 - 2021 | Mekong River (Laos) | Boithias et al., 2022 |
| Ecoli Houay Pano (Laos) | aqua_fetch.ecoli_houay_pano | 10 | 2011 - 2021 | Houay Pano (Laos) | Boithias et al., 2022 |
| CamelsChem | aqua_fetch.CamelsChem | 18 | 1980 - 2018 | Continental USA | Sterle et al., 2024 |
| Global River Methane | aqua_fetch.GRiMeDB | 18 | | Global | Stanley et al., 2024 |
| Sylt Roads | aqua_fetch.SyltRoads | 18 | 1973 - 2019 | North Sea (Germany) | Rick et al., 2023 |
| San Francisco Bay | aqua_fetch.SanFranciscoBay | 18 | 1973 - 2019 | San Francisco (USA) | Schraga et al., 2017 |
| Buzzards Bay | aqua_fetch.BuzzardsBay | 18 | 1992 - 2018 | Buzzards Bay (USA) | Jakuba et al. |
| White Clay Creek | aqua_fetch.WhiteClayCreek | 2 | 1973 - 2019 | White Clay Creek (USA) | Newbold and Damiano 2013 |
| Selune River, France | aqua_fetch.SeluneRiver | 5 | 2021 - 2022 | Selune River (France) | Moustapha Ba et al., 2023 |

Functions and Classes

class aqua_fetch.SWatCh(remove_csv_after_download=False, path=None, **kwargs)[source]

Bases: Datasets

The Surface Water Chemistry (SWatCh) database of 27 variables from 26322 locations, as introduced in Lobke et al., 2022. Note that not all variables are available for all locations. The following variables are available in the dataset:

  • Total Phosphorus, mixed forms

  • Sulfate

  • pH

  • Temperature, water

  • Chloride

  • Magnesium

  • Calcium

  • Sodium

  • Potassium

  • Aluminum

  • Nitrate

  • Nitrite

  • Fluoride

  • Hardness, carbonate

  • Iron

  • Ammonium

  • Organic carbon

  • Bicarbonate

  • Orthophosphate

  • Gran acid neutralizing capacity

  • Alkalinity, total

  • Inorganic carbon

  • Carbonate

  • Alkalinity, carbonate

  • Hardness, non-carbonate

  • Carbon Dioxide, free CO2

  • Alkalinity, Phenolphthalein (total hydroxide+1/2 carbonate)

Examples

>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> len(ds.parameters)
22
>>> len(ds.sites)
26322
>>> coords = ds.stn_coords()
>>> coords.shape
(26322, 2)
__init__(remove_csv_after_download=False, path=None, **kwargs)[source]
Parameters:

remove_csv_after_download (bool (default=False)) – if True, the csv will be removed after downloading and processing.

fetch(parameters: list | str = None, station_id: list | str = None, station_names: list | str = None) DataFrame[source]
Parameters:
  • parameters (str/list (default=None)) – Names of parameters to fetch. By default, name, value, val_unit, location, lat, and long are read.

  • station_id (str/list (default=None)) – name/names of station id for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_names should not be given.

  • station_names (str/list (default=None)) – name/names of stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_id should not be given.
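
The rule that station_id and station_names must not be given together can be sketched as below. This is only an illustration under stated assumptions, not the library's implementation: the helper name resolve_station_filter and the "site_id" column name are hypothetical, while "location" is the column used in the example for this method.

```python
from typing import List, Optional, Tuple, Union

StrOrList = Union[str, List[str]]


def resolve_station_filter(
    station_id: Optional[StrOrList] = None,
    station_names: Optional[StrOrList] = None,
) -> Optional[Tuple[str, List[str]]]:
    """Hypothetical helper: decide which column to filter on.

    Returns None when no filter is requested (all stations), otherwise a
    (column, values) pair. The "site_id" column name is an assumption.
    """
    if station_id is not None and station_names is not None:
        raise ValueError("give either station_id or station_names, not both")
    if station_id is not None:
        values = [station_id] if isinstance(station_id, str) else list(station_id)
        return "site_id", values
    if station_names is not None:
        values = [station_names] if isinstance(station_names, str) else list(station_names)
        return "location", values
    return None


print(resolve_station_filter(station_names="Jordan Lake"))
```

Rejecting the ambiguous call early keeps the later filtering step simple, since it only ever sees one column and one list of values.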

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> st_name = "Jordan Lake"
>>> df = df[df['location'] == st_name]
>>> df.shape
(4, 6)
property names: dict

a Python dictionary mapping the names of parameters in this class to their original names in the SWatCh dataset

num_samples(parameter, station_id=None) int[source]
Parameters:
  • parameter (str) – name of the water quality parameter whose samples are to be quantified.

  • station_id – if given, samples of parameter will be returned for only this site/sites otherwise for all sites
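
What num_samples reports can be pictured on a long-format table in which each row is one measurement. The sketch below assumes rows with the keys "name" and "location" (two of the default columns listed for fetch); the helper count_samples and the toy records are hypothetical and do not come from the dataset.

```python
from typing import Iterable, List, Mapping, Optional, Union


def count_samples(
    rows: Iterable[Mapping[str, object]],
    parameter: str,
    station: Optional[Union[str, List[str]]] = None,
) -> int:
    """Count rows of `parameter`, optionally restricted to some stations."""
    if isinstance(station, str):
        station = [station]
    wanted = None if station is None else set(station)
    return sum(
        1
        for row in rows
        if row["name"] == parameter
        and (wanted is None or row["location"] in wanted)
    )


# toy long-format records (made-up values)
records = [
    {"name": "pH", "value": 7.1, "location": "Jordan Lake"},
    {"name": "pH", "value": 6.9, "location": "Jordan Lake"},
    {"name": "pH", "value": 7.4, "location": "Other Site"},
    {"name": "Sulfate", "value": 3.2, "location": "Jordan Lake"},
]

print(count_samples(records, "pH"))                 # 3
print(count_samples(records, "pH", "Jordan Lake"))  # 2
```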

property parameters: list

list of water quality parameters available

property site_names: list

list of site names

property sites: list

list of site names

stn_coords()[source]

Returns the coordinates of all the stations in the dataset

Returns:

A dataframe with columns ‘lat’, ‘long’

Return type:

pd.DataFrame

class aqua_fetch.GRQA(download_source: bool = False, path=None, **kwargs)[source]

Bases: Datasets

Global River Water Quality Archive following the work of Virro et al., 2021. This dataset comprises 42 parameters for 94955 sites across 116 countries.

Examples

>>> from aqua_fetch import GRQA
>>> ds = GRQA(path='/path/to/dataset')
>>> ds.parameters
['TPP', 'PON', 'TEMP', 'TSS', ...]
>>> print(len(ds.parameters))
42
>>> len(ds.countries)
116
>>> len(ds.stations())
94955
>>> coords = ds.stn_coords()
>>> coords.shape
(94955, 2)
>>> country = "Pakistan"
>>> len(ds.fetch_parameter('TEMP', country=country))
1324
>>> df = ds.fetch_parameter("TEMP", country=country)
>>> print(df.shape)
(1324, 38)
>>> df = ds.fetch_parameter("NH4N", country=country)
>>> print(df.shape)
(28, 36)
__init__(download_source: bool = False, path=None, **kwargs)[source]
Parameters:

download_source (bool) – whether to download source data or not

fetch_parameter(parameter: str = 'COD', site_name: List[str] | str = None, country: List[str] | str = None, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None) DataFrame[source]
Parameters:
  • parameter (str, optional) – name of parameter

  • site_name (str/list, optional) – location for which data is to be fetched.

  • country (str/list optional (default=None))

  • st (str) – starting date or index

  • en (str) – end date or index
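
The st and en arguments accept either a date or an index. How the library treats each type is not shown here, so the following is only a sketch under assumptions: integers are taken as positions, strings as ISO dates, both bounds are inclusive, and the series is sorted by date. The helper slice_period is hypothetical, and DatetimeIndex input is left out for brevity.

```python
from datetime import date
from typing import List, Optional, Tuple, Union

Bound = Optional[Union[int, str]]
Series = List[Tuple[date, float]]


def slice_period(series: Series, st: Bound = None, en: Bound = None) -> Series:
    """Hypothetical helper: keep observations between st and en (inclusive).

    Assumes `series` is sorted by date. Integers are positions, strings are
    ISO dates (YYYY-MM-DD).
    """

    def to_pos(bound: Bound, default: int, is_end: bool) -> int:
        if bound is None:
            return default
        if isinstance(bound, int):
            return bound + 1 if is_end else bound
        day = date.fromisoformat(bound)
        if is_end:
            return sum(1 for d, _ in series if d <= day)
        return sum(1 for d, _ in series if d < day)

    start = to_pos(st, 0, False)
    stop = to_pos(en, len(series), True)
    return series[start:stop]


# six monthly observations with made-up values
series = [(date(2000, m, 1), float(m)) for m in range(1, 7)]
by_date = slice_period(series, st="2000-02-01", en="2000-04-01")
by_index = slice_period(series, st=1, en=3)
print([v for _, v in by_date])   # [2.0, 3.0, 4.0]
print([v for _, v in by_index])  # [2.0, 3.0, 4.0]
```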

Returns:

a pandas dataframe

Return type:

pd.DataFrame

Example

>>> from aqua_fetch import GRQA
>>> dataset = GRQA()
>>> df = dataset.fetch_parameter()
... # fetch data for only one country
>>> cod_pak = dataset.fetch_parameter("COD", country="Pakistan")
... # fetch data for only one site
>>> cod_kotri = dataset.fetch_parameter("COD", site_name="Indus River - at Kotri")
... # find the number of data points and sites available for a specific country
>>> for para in dataset.parameters:
...     data = dataset.fetch_parameter(para, country="Germany")
...     if len(data) > 0:
...         print(f"{para}, {data.shape}, {len(data['site_name'].unique())}")
sites_data() DataFrame[source]

Returns the metadata for the dataset

stations() List[str][source]

Returns names of stations/site_id

stn_coords()[source]

Returns the coordinates of all the stations in the dataset

Returns:

A dataframe with columns ‘lat’, ‘long’

Return type:

pd.DataFrame

class aqua_fetch.Quadica(path=None, **kwargs)[source]

Bases: Datasets

This is a dataset of 10 water quality parameters for Germany from 1386 stations, covering 1950 to 2018, following the work of Ebeling et al., 2022. The time step is monthly and annual, but the monthly time series data is not continuous. The following parameters are available in this dataset:

  • Q : Discharge

  • NO3 : Nitrate

  • NO3N : Nitrate-N

  • NMin : Mineral nitrogen

  • TN : Total Nitrogen

  • PO4 : Phosphate

  • PO4P : Phosphate-P

  • TP : Total Phosphorus

  • DOC : Dissolved Organic Carbon

  • TOC : Total Organic Carbon

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> len(dataset.stations())
1386
>>> coords = dataset.stn_coords()
>>> coords.shape
(1386, 2)
>>> df = dataset.wrtds_monthly()
>>> df.shape
(50186, 47)
>>> df = dataset.wrtds_annual()
>>> df.shape
(4213, 46)
>>> df = dataset.pet()
>>> df.shape
(828, 1386)
>>> df = dataset.avg_temp()
>>> df.shape
(828, 1388)
>>> df = dataset.precipitation()
>>> df.shape
(828, 1388)
>>> df = dataset.catchment_attributes()
>>> df.shape
(1386, 112)
>>> df = dataset.metadata()
>>> df.shape
(1386, 60)
>>> df = dataset.monthly_medians()
>>> df.shape
(16629, 18)
>>> df = dataset.annual_medians()
>>> df.shape
(24393, 18)
>>> df = dataset.fetch_monthly()
>>> df[0].shape
(50186, 47)
__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

annual_medians() DataFrame[source]

Annual medians over the whole time series of water quality variables and discharge

Returns:

a dataframe of shape (24393, 18)

Return type:

pd.DataFrame

avg_temp(stations: List[int] | int = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]

monthly median average temperatures starting from 1950-01 to 2018-09

Parameters:
  • stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.

  • st (optional) – starting point of data. By default, the data starts from 1950-01

  • en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a pandas dataframe of shape (time_steps, stations). With default input arguments, the shape is (828, 1386)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.avg_temp() # -> (828, 1388)
catchment_attributes(parameters: List[str] | str = None, stations: List[int] | int = None) DataFrame[source]

Returns static physical catchment attributes in the form of dataframe.

Parameters:
  • parameters (list/str, optional, (default=None)) – name/names of static attributes to fetch

  • stations (list/int, optional (default=None)) – name/names of stations whose static/physical parameters are to be read

Returns:

a pandas dataframe of shape (stations, parameters). With default input arguments, shape is (1386, 113)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> cat_features = dataset.catchment_attributes()
... # get attributes of only selected stations
>>> dataset.catchment_attributes(stations=[1,2,3])
fetch_monthly(parameters: List[str] | str = None, stations: List[int] | int = 'all', median: bool = True, fnc: bool = True, fluxes: bool = True, precipitation: bool = True, avg_temp: bool = True, pet: bool = True, only_continuous: bool = True, cat_features: bool = True, max_nan_tol: int | None = 0) Tuple[DataFrame, DataFrame][source]

Fetches monthly concentrations of water quality parameters.

Parameters:
  • parameters (str/list, optional (default=None)) –

    name or names of water quality parameters to fetch. By default following parameters are considered

    • NO3

    • NO3N

    • TN

    • Nmin

    • PO4

    • PO4P

    • TP

    • DOC

    • TOC

  • stations (int/list, optional (default='all')) – name or names of stations whose data is to be fetched

  • median (bool, optional (default=True)) – whether to fetch median concentration values or not

  • fnc (bool, optional (default=True)) – whether to fetch flow normalized concentrations or not

  • fluxes (bool, optional (default=True)) – Setting this to true will add two parameters i.e. mean_Flux_FEATURE and mean_FNFlux_FEATURE

  • precipitation (bool, optional (default=True)) – whether to fetch average monthly precipitation or not

  • avg_temp (bool, optional (default=True)) – whether to fetch average monthly temperature or not

  • pet (bool, optional (default=True)) – whether to fetch potential evapotranspiration data or not

  • only_continuous (bool, optional (default=True)) – If true, will return data for only those stations that have continuous monthly time series data from 1993-01-01 to 2013-01-01.

  • cat_features (bool, optional (default=True)) – whether to fetch catchment parameters or not.

  • max_nan_tol (int, optional (default=0)) – setting this value to 0 will remove every time series that has any missing values. If None, no time series with NaN values will be removed.
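
The effect of max_nan_tol can be sketched on plain Python lists. This is not the library's code: the helper drop_series_with_nans is hypothetical, and treating a positive value as the largest number of NaNs a series may contain is an assumption, since only the cases 0 and None are described here.

```python
import math
from typing import Dict, List, Optional


def drop_series_with_nans(
    series_by_station: Dict[str, List[float]],
    max_nan_tol: Optional[int] = 0,
) -> Dict[str, List[float]]:
    """Keep stations whose series has at most `max_nan_tol` missing values.

    None disables the filter, so every series is kept.
    """
    if max_nan_tol is None:
        return dict(series_by_station)
    return {
        stn: values
        for stn, values in series_by_station.items()
        if sum(math.isnan(v) for v in values) <= max_nan_tol
    }


nan = float("nan")
toy = {"1": [1.0, 2.0, 3.0], "2": [1.0, nan, 3.0], "3": [nan, nan, 3.0]}
print(sorted(drop_series_with_nans(toy, 0)))     # ['1']
print(sorted(drop_series_with_nans(toy, 1)))     # ['1', '2']
print(sorted(drop_series_with_nans(toy, None)))  # ['1', '2', '3']
```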

Returns:

two dataframes whose length is same but the columns are different
  • a pandas dataframe of timeseries of parameters (stations*timesteps, dynamic_features)

  • a pandas dataframe of static parameters (stations*timesteps, catchment_features)

Return type:

tuple
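
The two returned tables have the same number of rows because the static catchment features are repeated for every time step of a station. A minimal sketch of that layout with plain dictionaries follows; the helper stack_dynamic_and_static and the feature name "area" are hypothetical, and "OBJECTID" is the station identifier column used in the example for this method.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def stack_dynamic_and_static(
    dynamic: Dict[int, List[Row]],
    static: Dict[int, Row],
) -> Tuple[List[Row], List[Row]]:
    """Build two row lists of equal length (stations * timesteps)."""
    dyn_rows: List[Row] = []
    cat_rows: List[Row] = []
    for stn, steps in dynamic.items():
        for step in steps:
            dyn_rows.append({"OBJECTID": stn, **step})
            # the same static row is repeated for each time step
            cat_rows.append({"OBJECTID": stn, **static[stn]})
    return dyn_rows, cat_rows


# two stations with two time steps each (made-up values)
dynamic = {1: [{"TN": 2.0}, {"TN": 2.5}], 2: [{"TN": 1.0}, {"TN": 1.2}]}
static = {1: {"area": 10.0}, 2: {"area": 20.0}}
dyn_rows, cat_rows = stack_dynamic_and_static(dynamic, static)
print(len(dyn_rows), len(cat_rows))  # 4 4
```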

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> mon_dyn, mon_cat = dataset.fetch_monthly(max_nan_tol=None)
... # However, mon_dyn contains data for all parameters and many of which have
... # large number of nans. If we want to fetch data only related to TN without any
... # missing value, we can do as below
>>> mon_dyn_tn, mon_cat_tn = dataset.fetch_monthly(parameters="TN", max_nan_tol=0)
... # if we want to find out how many catchments are included in mon_dyn_tn
>>> len(mon_dyn_tn['OBJECTID'].unique())
... # 25
metadata() DataFrame[source]

fetches the metadata about the stations as pandas’ dataframe. Each row represents metadata about one station and each column represents one feature. The R2 and pbias are regression coefficients and percent bias of WRTDS models for each parameter.

Returns:

a dataframe of shape (1386, 60)

Return type:

pd.DataFrame

monthly_medians(parameters: List[str] | str = None, stations: List[int] | int = None) DataFrame[source]

This function reads the c_months.csv file which contains the monthly medians over the whole time series of water quality variables and discharge

Parameters:
  • parameters (list/str, optional, (default=None)) – name/names of parameters

  • stations (list/int, optional (default=None)) – stations for which data is to be fetched

Returns:

a dataframe of shape (16629, 18). 15 of the 18 columns represent a water chemistry parameter. The row count is close to 1386*12 = 16632, where 1386 is the number of stations and 12 is the number of months.

Return type:

pd.DataFrame

property parameters: list

names of water quality parameters available in this dataset

pet(stations: List[str] | str = 'all', st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]

average monthly potential evapotranspiration starting from 1950-01 to 2018-09

Returns:

a dataframe of shape (828, 1386), where 828 is the number of months from 1950-01 to 2018-09 and 1386 is the number of stations

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.pet() # -> (828, 1386)
precipitation(stations: List[int] | int = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]

sums of precipitation starting from 1950-01 to 2018-09

Parameters:
  • stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.

  • st (optional) – starting point of data. By default, the data starts from 1950-01

  • en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a dataframe of shape (828, 1388)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.precipitation() # -> (828, 1388)
property station_names: List[str]

names of stations

stations() list[source]

IDs of stations for which data is available

stn_coords() DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:

A dataframe with columns ‘lat’, ‘long’

Return type:

pd.DataFrame

to_DataSet(target: str = 'TP', input_features: list = None, split: str = 'temporal', lookback: int = 24, **ds_args)[source]

This function prepares data for machine learning prediction problem. It returns an instance of ai4water.preprocessing.DataSetPipeline which can be given to model.fit or model.predict

Parameters:
  • target (str, optional (default="TP")) – parameter to consider as target

  • input_features (list, optional) – names of input parameters

  • split (str, optional (default="temporal")) – if temporal, validation and test sets are taken from the data of each station and then concatenated. If spatial, training validation and test is decided based upon stations.

  • lookback (int)

  • **ds_args – key word arguments
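
The difference between the two split modes can be sketched as follows. This is an illustration only, not what ai4water does internally: the function names and the train_frac argument are hypothetical, and the validation set is left out to keep the sketch short.

```python
from typing import Dict, List, Tuple

Samples = Dict[str, List[int]]  # station -> time-ordered samples
Pairs = List[Tuple[str, int]]


def temporal_split(data: Samples, train_frac: float = 0.5) -> Tuple[Pairs, Pairs]:
    """Split every station's series in time, then concatenate across stations."""
    train: Pairs = []
    test: Pairs = []
    for stn, samples in data.items():
        cut = int(len(samples) * train_frac)
        train += [(stn, s) for s in samples[:cut]]
        test += [(stn, s) for s in samples[cut:]]
    return train, test


def spatial_split(data: Samples, train_frac: float = 0.5) -> Tuple[Pairs, Pairs]:
    """Assign whole stations to either the training or the test set."""
    stations = sorted(data)
    cut = int(len(stations) * train_frac)
    train = [(stn, s) for stn in stations[:cut] for s in data[stn]]
    test = [(stn, s) for stn in stations[cut:] for s in data[stn]]
    return train, test


toy = {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}
print(temporal_split(toy)[1])  # [('a', 3), ('a', 4), ('b', 7), ('b', 8)]
print(spatial_split(toy)[1])   # [('b', 5), ('b', 6), ('b', 7), ('b', 8)]
```

With the temporal split every station appears in both sets, whereas the spatial split tests on stations that were never seen during training.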

Returns:

an instance of DataSetPipeline

Return type:

ai4water.preprocessing.DataSet

Example

>>> from aqua_fetch import Quadica
... # initialize the Quadica class
>>> dataset = Quadica()
... # define the input parameters
>>> inputs = ['median_Q', 'OBJECTID', 'avg_temp', 'precip', 'pet']
... # prepare data for TN as target
>>> dsp = dataset.to_DataSet("TN", inputs, lookback=24)
wrtds_annual(parameters: str | list = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]

Annual median concentrations, flow-normalized concentrations, and mean fluxes estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability.

Parameters:
  • parameters (optional)

  • st (optional) – starting point of data. By default, the data starts from 1992

  • en (optional) – end point of data. By default, the data ends at 2013

Returns:

a dataframe of shape (4213, 46)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_annual()
wrtds_monthly(parameters: str | list = None, stations: List[str] | str = 'all', st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]

Monthly median concentrations, flow-normalized concentrations and mean fluxes of water chemistry parameters. These are estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability. This data is available for a total of 140 stations. The data from all stations does not start and end at the same time, so some stations have more data points than others. The maximum number of data points for a station is 576, while the minimum is 244.

Parameters:
  • parameters (str/list, optional)

  • stations (str/list, optional (default='all')) – name/names of stations whose data is to be retrieved.

  • st (optional) – starting point of data. By default, the data starts from 1992-09

  • en (optional) – end point of data. By default, the data ends at 2013-12

Returns:

a dataframe of shape (50186, 47)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_monthly()
class aqua_fetch.RC4USCoast(path=None, *args, **kwargs)[source]

Bases: Datasets

Monthly river water chemistry (N, P, SiO2, DO, etc.), discharge and temperature of 140 monitoring sites on US coasts from 1950 to 2020, following the work of Gomez et al., 2022.

Examples

>>> from aqua_fetch import RC4USCoast
>>> dataset = RC4USCoast()
>>> len(dataset.stations)
140
>>> len(dataset.parameters)
27
>>> stn_coords = dataset.stn_coords()
>>> stn_coords.shape
(140, 2)
__init__(path=None, *args, **kwargs)[source]
Parameters:

path – path where the data is already downloaded. If None, the data will be downloaded to disk.

fetch_chem(parameter, stations: List[int] | int | str = 'all', as_dataframe: bool = False, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None)[source]

Returns water chemistry parameters from one or more stations.

Parameters:
  • parameter (list, str) – name/names of parameters to fetch

  • stations (list, str) – name/names of stations from which the parameters are to be fetched

  • as_dataframe (bool (default=False)) – whether to return data as pandas.DataFrame or xarray.Dataset

  • st – start time of data to be fetched. The default starting date is 19500101

  • en – end time of data to be fetched. The default end date is 20201201

Return type:

pandas DataFrame or xarray Dataset

Examples

>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> data = ds.fetch_chem(['temp', 'do'])
>>> data
>>> data = ds.fetch_chem(['temp', 'do'], as_dataframe=True)
>>> data.shape  # this is a multi-indexed dataframe
(119280, 4)
>>> data = ds.fetch_chem(['temp', 'do'], st="19800101", en="20181230")
fetch_q(stations: int | List[int] | str | ndarray = 'all', as_dataframe: bool = True, nv=0, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None)[source]

returns discharge data

Parameters:
  • stations – stations for which q is to be fetched

  • as_dataframe (bool (default=True)) – whether to return the data as pd.DataFrame or as xarray.Dataset

  • nv (int (default=0))

  • st – start time of data to be fetched. The default starting date is 19500101

  • en – end time of data to be fetched. The default end date is 20201201

Examples

>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
# get data of all stations as DataFrame
>>> q = ds.fetch_q("all")
>>> q.shape
(852, 140)  # where 140 is the number of stations
# get data of only two stations
>>> q = ds.fetch_q([1,10])
>>> q.shape
(852, 2)
# get data as xarray Dataset
>>> q = ds.fetch_q("all", as_dataframe=False)
>>> type(q)
xarray.core.dataset.Dataset
# getting data between specific periods
>>> data = ds.fetch_q("all", st="20000101", en="20181230")
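
The shapes quoted in the RC4USCoast examples follow from the default period (19500101 to 20201201) and the 140 stations. A quick arithmetic check is given below, using a hypothetical helper months_inclusive. That the multi-indexed frame returned by fetch_chem holds one row per station and month is an inference from these numbers, not something stated in this document.

```python
def months_inclusive(start_year: int, start_month: int,
                     end_year: int, end_month: int) -> int:
    """Number of monthly steps from the start month through the end month."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


n_months = months_inclusive(1950, 1, 2020, 12)
n_stations = 140

print(n_months)               # 852, the row count of fetch_q("all")
print(n_months * n_stations)  # 119280, the row count of the multi-indexed frame
```
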
property parameters: List[str]

returns names of parameters

Examples

>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> len(ds.parameters)
27
property stations: List[str]
>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast(path=r'F:\data\RC4USCoast')
>>> len(ds.stations)
140
stn_coords() DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:

A dataframe with columns ‘lat’, ‘long’

Return type:

pd.DataFrame

class aqua_fetch.CamelsChem(path=None, **kwargs)[source]

Bases: Datasets

Water quality data from the USA following the work of Sterle et al., 2024. This dataset has 18 water chemistry parameters from 1980-01-01 to 2018-12-31. The data is downloaded from HydroShare. Out of 671 stations, 155 stations have no water quality data. The wet deposition data consists of 12 parameters from 1985 to 2018.

Examples

>>> from aqua_fetch import CamelsChem
>>> ds = CamelsChem(path='/path/to/dataset')
>>> len(ds.stations())
516
>>> len(ds.parameters)
28
__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

atm_dep_data() DataFrame[source]

reads the atmospheric deposition data

atm_dep_metadata() DataFrame[source]

reads the atm_dep_metadata

property atm_dep_parameters: List[str]

returns the names of parameters in the atm_dep dataset

data() DataFrame[source]

reads the main dataset which has shape of (76284, 45)

fetch(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') Dict[str, DataFrame][source]

fetches the data for the given stations and parameters

Parameters:
  • stations (Union[str, List[str]]) – list of stations to fetch data for

  • parameters (Union[str, List[str]]) – list of parameters to fetch data for

Returns:

dictionary of dataframes for each station

Return type:

Dict[str, pd.DataFrame]

Examples

>>> from aqua_fetch import CamelsChem
>>> ds = CamelsChem(path='/path/to/dataset')
>>> data = ds.fetch(stations=['1591400', '6350000'], parameters=['cl_mg/l', 'na_mg/l'])
>>> data = ds.fetch('1591400', 'cl_mg/l')['1591400']
>>> data.shape # (55, 1)
... # get all parameters for a station
>>> data = ds.fetch('1591400')['1591400']
>>> data.shape # (55, 28)
>>> all_data = ds.fetch()  # get all parameters of all stations
>>> len(all_data) # 516
fetch_atm_dep(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') Dict[str, DataFrame][source]

fetches the data for the given stations and parameters

Parameters:
  • stations (Union[str, List[str]]) – list of stations to fetch data for

  • parameters (Union[str, List[str]]) – list of parameters to fetch data for

Returns:

dictionary of dataframes for each station

Return type:

Dict[str, pd.DataFrame]

Examples

>>> from aqua_fetch import CamelsChem
>>> ds = CamelsChem(path='/path/to/dataset')
... # get data for a single station and a single parameter
>>> data = ds.fetch_atm_dep(stations='1591400', parameters='cl')
>>> print(data['1591400'].shape)  # (34, 8)
... # get data for multiple stations and multiple parameters
>>> data = ds.fetch_atm_dep(stations=['1591400', '6350000'], parameters=['cl', 'na'])
>>> print(data['1591400'].shape)  # (34, 16)
>>> print(data['6350000'].shape)  # (34, 16)
... # get data for all stations and for all parameters
>>> data = ds.fetch_atm_dep()
>>> print(len(data))  # 671
gauge_and_region_names() DataFrame[source]

reads the gauge and region names

metrics()[source]

reads metrics.xlsx which contains metadata

property parameters: List[str]

returns the names of parameters in the dataset

stations() List[str][source]

returns the list of stations in the dataset

stn_coords() DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:

A dataframe with columns ‘lat’, ‘long’

Return type:

pd.DataFrame

topography() DataFrame[source]

reads the topography data

class aqua_fetch.SyltRoads(path=None, **kwargs)[source]

Bases: Datasets

Dataset of physico-hydro-chemical time series data at Sylt Roads from 1973 to 2019 following Rick et al., 2023. The following parameters are available:

  • location

  • Depth water [m]

  • Sal

  • Temp [°C]

  • [PO4]3- [µmol/l]

  • [NH4]+ [µmol/l]

  • [NO2]- [µmol/l]

  • [NO3]- [µmol/l]

  • Si(OH)4 [µmol/l]

  • SPM [mg/l]

  • pH

  • O2 [µmol/l]

  • Chl a [µg/l]

  • DON [µmol/l]

  • DOP [µmol/l]

  • DIN [µmol/l]

Examples

>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()
__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

fetch(parameters: str | List[str] = 'all') DataFrame[source]

Fetch the data from the dataset

Parameters:

parameters (str or List[str], optional) – Parameters to fetch. Default is 'all', which fetches all parameters

Returns:

DataFrame containing the data

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()
>>> df = ds.fetch()
>>> df.shape
(5710, 16)
>>> len(ds.parameters)
16
>>> ds.fetch(['Sal', 'Temp [°C]', 'pH']).shape
(5710, 3)
property parameters: List[str]

returns names of parameters in the dataset

stn_coords() DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:

A dataframe with columns ‘lat’, ‘long’

Return type:

pd.DataFrame

class aqua_fetch.SanFranciscoBay(path=None, **kwargs)[source]

Bases: Datasets

Time series of water quality parameters from 59 stations in San Francisco Bay from 1969 to 2015. For details on the data see Cloern et al., 2017 and Schraga et al., 2017. The following parameters are available:

  • Depth

  • Discrete_Chlorophyll

  • Ratio_DiscreteChlorophyll_Pheopigment

  • Calculated_Chlorophyll

  • Discrete_Oxygen

  • Calculated_Oxygen

  • Oxygen_Percent_Saturation

  • Discrete_SPM

  • Calculated_SPM

  • Extinction_Coefficient

  • Salinity

  • Temperature

  • Sigma_t

  • Nitrite

  • Nitrate_Nitrite

  • Ammonium

  • Phosphate

  • Silicate

Examples

>>> from aqua_fetch import SanFranciscoBay
>>> ds = SanFranciscoBay()
>>> data = ds.data()
>>> data.shape
(212472, 19)
>>> stations = ds.stations()
>>> len(stations)
59
>>> parameters = ds.parameters()
>>> len(parameters)
18
... # fetch data for station 18
>>> stn18 = ds.fetch(stations='18')
>>> stn18.shape
(13944, 18)
__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

fetch(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') DataFrame[source]
Parameters:

parameters (Union[str, List[str]], optional) – The parameters to return. The default is ‘all’.

Returns:

a dataframe of the requested parameters for the requested stations

Return type:

pd.DataFrame

stn_data(stations: str | List[str] = 'all') DataFrame[source]

Get station metadata.

class aqua_fetch.GRiMeDB(path=None, **kwargs)[source]

Bases: Datasets

Global river database of methane concentrations and fluxes from 5029 stations of 305 rivers following Stanley et al., 2023.

Examples

>>> from aqua_fetch import GRiMeDB
>>> ds = GRiMeDB(path='/path/to/dataset')
>>> ds.stations()
>>> ds.streams
>>> coords = ds.stn_coords()
>>> coords.shape
(5029, 2)
__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

concentrations(stations: str | List[str] = 'all', streams: str | List[str] = 'all', parameters: str | List[str] = 'all')[source]

Get concentrations data.

Parameters:
  • stations (Union[str, List[str]], optional) – station ID or list of station IDs, by default “all”. If given, then streams must not be given. Check .stations() method for available stations.

  • streams (Union[str, List[str]], optional) – stream name or list of stream names, by default “all”. If given, then stations must not be given. Check .streams attribute for available streams.

  • parameters (Union[str, List[str]], optional) – parameters to return, by default “all”. Check .parameters attribute for available parameters.
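
Because stations and streams both default to "all" and must not be given together, the selection logic can be sketched as below. This is a hypothetical illustration, not the library's implementation: the helper select_sites and the site-to-stream mapping are assumptions made for the example.

```python
from typing import Dict, List, Union

StrOrList = Union[str, List[str]]


def select_sites(
    site_to_stream: Dict[str, str],
    stations: StrOrList = "all",
    streams: StrOrList = "all",
) -> List[str]:
    """Hypothetical helper: resolve the site IDs selected by either argument."""
    if stations != "all" and streams != "all":
        raise ValueError("give either stations or streams, not both")
    if stations != "all":
        wanted = [stations] if isinstance(stations, str) else list(stations)
        missing = [s for s in wanted if s not in site_to_stream]
        if missing:
            raise KeyError(f"unknown stations: {missing}")
        return wanted
    if streams != "all":
        wanted_streams = {streams} if isinstance(streams, str) else set(streams)
        return [s for s, stream in site_to_stream.items() if stream in wanted_streams]
    return list(site_to_stream)


# made-up site IDs and stream names
mapping = {"1": "A", "2": "A", "3": "B"}
print(select_sites(mapping))                  # ['1', '2', '3']
print(select_sites(mapping, streams="A"))     # ['1', '2']
print(select_sites(mapping, stations=["3"]))  # ['3']
```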

fluxes(stations: str | List[str] = 'all') DataFrame[source]

returns fluxes data as a pandas dataframe

stn_coords() DataFrame[source]

Returns the coordinates of all the stations in the dataset in wgs84 projection.

Returns:

A dataframe with columns ‘lat’, ‘long’

Return type:

pd.DataFrame

property streams: List[str]

returns names of streams

class aqua_fetch.BuzzardsBay(path=None, **kwargs)[source]

Bases: Datasets

Water quality measurements in Buzzards Bay from 1992 to 2018. For more details on the data see Jakuba et al. The data is downloaded from the MBLWHOI Library.

Examples

>>> from aqua_fetch import BuzzardsBay
>>> ds = BuzzardsBay()
>>> doc = ds.doc()
>>> doc.shape
(11092, 4)
>>> chla = ds.chla()
>>> chla.shape
(1028, 10)
__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

fetch(parameters: str | List[str] = 'all') DataFrame[source]

Fetch data for the specified parameters.

class aqua_fetch.WhiteClayCreek(path=None, **kwargs)[source]

Bases: Datasets

Time series of water quality parameters from White Clay Creek.

  • chl-a : 2001 - 2012

  • Dissolved Organic Carbon : 1977 - 2017
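Because the two series cover different periods, any joint analysis is limited to the years they share. The sketch below computes that common window from the periods listed above. The dictionary and the helper common_period are illustrative assumptions and are not part of aqua_fetch.

```python
# Coverage periods as listed above (inclusive years).
COVERAGE = {
    "chl-a": (2001, 2012),
    "Dissolved Organic Carbon": (1977, 2017),
}


def common_period(coverage):
    """Return the (start, end) years shared by all series, or None if disjoint."""
    start = max(first for first, _ in coverage.values())
    end = min(last for _, last in coverage.values())
    return (start, end) if start <= end else None


print(common_period(COVERAGE))
```

For the two periods above, the shared window is 2001 to 2012, so the chl-a record is the limiting one.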

__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

chla() DataFrame[source]

Chlorophyll-a data

doc() DataFrame[source]

Dissolved Organic Carbon data

class aqua_fetch.SeluneRiver(path=None, **kwargs)[source]

Bases: Datasets

Dataset of physico-chemical variables measured at different levels during 2021 and 2022 to characterize the hyporheic zone of the Selune River, Manche, Normandie, France, following Moustapha Ba et al., 2023. The data is available at data.gouv.fr. The following variables are available:

  • water level

  • temperature

  • conductivity

  • oxygen

  • pressure

__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

  • processes – int number of processes to use for parallel processing

  • verbosity – int determines the amount of information to be printed

  • remove_zip – bool whether to remove the zip files after unzipping

data() DataFrame[source]

Return a DataFrame of the data

aqua_fetch.busan_beach(inputs: list = None, target: list | str = 'tetx_coppml') DataFrame[source]

Loads the antibiotic resistance genes (ARG) data from a recreational beach in Busan, South Korea, along with environmental variables.

The data is a multivariate time series collected over a period of 2 years during several precipitation events. The environmental data has a frequency of 30 minutes, while the ARG data is discontinuous. The data and its pre-processing are described in detail in Jang et al., 2021.

Parameters:
  • inputs

    features to use as input. By default, all environmental data is used, which consists of the following parameters:

    • tide_cm

    • wat_temp_c

    • sal_psu

    • air_temp_c

    • pcp_mm

    • pcp3_mm

    • pcp6_mm

    • pcp12_mm

    • wind_dir_deg

    • wind_speed_mps

    • air_p_hpa

    • mslp_hpa

    • rel_hum

  • target

    feature or features to use as target/output. By default, tetx_coppml is used as the target. One or more of the following can be used as the target:

    • ecoli

    • 16s

    • inti1

    • Total_args

    • tetx_coppml

    • sul1_coppml

    • blaTEM_coppml

    • aac_coppml

    • Total_otus

    • otu_5575

    • otu_273

    • otu_94

Returns:

a pandas DataFrame with inputs and target, indexed with a pandas.DatetimeIndex

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import busan_beach
>>> dataframe = busan_beach()
>>> dataframe.shape
(1446, 14)
>>> dataframe = busan_beach(target=['tetx_coppml', 'sul1_coppml'])
>>> dataframe.shape
(1446, 15)

See usage here for more details.
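The column counts in the example above (14 with the default target, 15 with two targets) follow from adding the selected targets to the 13 default environmental inputs. The sketch below reproduces that arithmetic. The helper expected_columns is hypothetical and the "inputs first, targets last" ordering is an assumption; it does not load any data and is not the library's implementation.

```python
# Default environmental inputs, copied from the parameter list above.
DEFAULT_INPUTS = [
    "tide_cm", "wat_temp_c", "sal_psu", "air_temp_c",
    "pcp_mm", "pcp3_mm", "pcp6_mm", "pcp12_mm",
    "wind_dir_deg", "wind_speed_mps", "air_p_hpa", "mslp_hpa", "rel_hum",
]


def expected_columns(inputs=None, target="tetx_coppml"):
    """Hypothetical helper: list the columns a given call would select."""
    chosen_inputs = DEFAULT_INPUTS if inputs is None else list(inputs)
    # A single target name is wrapped in a list; a list is used as given.
    targets = [target] if isinstance(target, str) else list(target)
    return chosen_inputs + targets  # assumed order: inputs, then targets


print(len(expected_columns()))
print(len(expected_columns(target=["tetx_coppml", "sul1_coppml"])))
```

The two printed lengths correspond to the second element of the shapes shown in the example.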

aqua_fetch.ecoli_mekong(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, overwrite=False) DataFrame[source]

E. coli data from the Mekong river (Houay Pano) area from 2011 to 2021 (Boithias et al., 2022).

Parameters:
  • st (optional) – starting time. The default starting point is 2011-05-25 10:00:00

  • en (optional) – end time. The default end point is 2021-05-25 15:41:00

  • parameters (str, optional) –

    names of features to use. Use 'all' to get all features. By default, the following features are selected:

    • station_name name of station/catchment where the observation was made

    • T temperature

    • EC electrical conductance

    • DOpercent dissolved oxygen saturation

    • DO dissolved oxygen concentration

    • pH pH

    • ORP oxidation-reduction potential

    • Turbidity turbidity

    • TSS total suspended sediment concentration

    • E-coli_4dilutions Escherichia coli concentration

  • overwrite (bool) – whether to overwrite the downloaded file or not

Returns:

with default parameters, the shape is (1602, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_mekong
>>> ecoli_data = ecoli_mekong()
>>> ecoli_data.shape
(1602, 10)
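The st and en defaults are compact date strings such as '20110101'. The sketch below shows one way such values could be parsed and used to test whether an observation falls inside the requested window. It uses only the standard library; the helpers parse_compact_date and in_window are hypothetical, and the actual function may parse dates differently (for example through pandas).

```python
from datetime import datetime


def parse_compact_date(value):
    """Parse a 'YYYYMMDD' value given as str or int (hypothetical helper)."""
    return datetime.strptime(str(value), "%Y%m%d")


def in_window(stamp, st="20110101", en="20211231"):
    """Return True if stamp lies between st and en, both inclusive."""
    # Note: a date without a time parses to midnight, so in this sketch
    # an observation later on the end day itself falls outside the window.
    return parse_compact_date(st) <= stamp <= parse_compact_date(en)


# The first default observation time quoted above lies inside the window.
print(in_window(datetime(2011, 5, 25, 10, 0)))
```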
aqua_fetch.ecoli_mekong_laos(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, station_name: str = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong river (Northern Laos).

Parameters:
  • st – starting time

  • en – end time

  • station_name (str)

  • parameters (str, optional)

  • overwrite (bool) – whether to overwrite or not

Returns:

with default parameters, the shape is (1131, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_mekong_laos
>>> ecoli = ecoli_mekong_laos()
>>> ecoli.shape
(1131, 10)
aqua_fetch.ecoli_houay_pano(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong river (Houay Pano) area.

Parameters:
  • st (optional) – starting time. The default starting point is 2011-05-25 10:00:00

  • en (optional) – end time. The default end point is 2021-05-25 15:41:00

  • parameters (str, optional) –

    names of features to use. Use 'all' to get all features. By default, the following features are selected:

    • station_name name of station/catchment where the observation was made

    • T temperature

    • EC electrical conductance

    • DOpercent dissolved oxygen saturation

    • DO dissolved oxygen concentration

    • pH pH

    • ORP oxidation-reduction potential

    • Turbidity turbidity

    • TSS total suspended sediment concentration

    • E-coli_4dilutions Escherichia coli concentration

  • overwrite (bool) – whether to overwrite the downloaded file or not

Returns:

with default parameters, the shape is (413, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_houay_pano
>>> ecoli = ecoli_houay_pano()
>>> ecoli.shape
(413, 10)
aqua_fetch.ecoli_mekong_2016(st: str | Timestamp | int = '20160101', en: str | Timestamp | int = '20161231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong river for 2016, from 29 catchments.

Parameters:
  • st – starting time

  • en – end time

  • parameters (str, optional) – names of parameters to use. Use 'all' to get all features.

  • overwrite (bool) – whether to overwrite the downloaded file or not

Returns:

with default parameters, the shape is (58, 10)

Return type:

pd.DataFrame

Examples

>>> from aqua_fetch import ecoli_mekong_2016
>>> ecoli = ecoli_mekong_2016()
>>> ecoli.shape
(58, 10)