Water Quality
The wq submodule contains datasets that represent surface water chemistry at various locations worldwide. It currently includes 16 water quality datasets, and we anticipate that this number will grow. The spatial and temporal coverage of these datasets is detailed in the following table.
List of datasets
| Dataset | Class / Function Name | Variables Covered | Temporal Coverage | Spatial Coverage | Reference |
|---|---|---|---|---|---|
| Surface Water Chemistry | SWatCh | 24 | 1960 - 2022 | Global | Lobke et al., 2022 |
| Global River Water Quality Archive | GRQA | 42 | 1898 - 2020 | Global | Virro et al., 2021 |
| water QUAlity, DIscharge and Catchment Attributes | Quadica | 10 | 1950 - 2018 | Germany | Ebeling et al., 2022 |
| river chemistry for US coasts | RC4USCoast | 21 | 1950 - 2020 | USA | Gomez et al., 2022 |
| Busan Beach | busan_beach | 14 | 2018 - 2019 | Busan, S.Korea | Jang et al., 2021 |
| Ecoli Mekong River | ecoli_mekong | 10 | 2011 - 2021 | Mekong river (Houay Pano) | Boithias et al., 2022 |
| Ecoli Mekong River (Laos) | ecoli_mekong_laos | 10 | 2011 - 2021 | Mekong River (Laos) | |
| Ecoli Houay Pano (Laos) | ecoli_houay_pano | 10 | 2011 - 2021 | Houay Pano (Laos) | |
| CamelsChem | CamelsChem | 18 | 1980 - 2018 | Continental USA | Sterle et al., 2024 |
| Global River Methane | GRiMeDB | 18 | | Global | Stanley et al., 2023 |
| Sylt Roads | SyltRoads | 18 | 1973 - 2019 | North Sea (Germany) | Rick et al., 2023 |
| San Francisco Bay | SanFranciscoBay | 18 | 1969 - 2015 | San Francisco (USA) | Cloern et al., 2017; Schraga et al., 2017 |
| Buzzards Bay | BuzzardsBay | 18 | 1992 - 2018 | Buzzards Bay (USA) | Jakuba et al. |
| White Clay Creek | WhiteClayCreek | 2 | 1977 - 2017 | White Clay Creek (USA) | |
| Selune River, France | SeluneRiver | 5 | 2021 - 2022 | Selune River (France) | Moustapha Ba et al., 2023 |
Functions and Classes
- class aqua_fetch.SWatCh(remove_csv_after_download=False, path=None, **kwargs)[source]
Bases: Datasets
The Surface Water Chemistry (SWatCh) database of 27 variables from 26322 locations, as introduced in Lobke et al., 2022. Not all variables are available for all locations. The following variables are available in the dataset:
Total Phosphorus, mixed forms
Sulfate
pH
Temperature, water
Chloride
Magnesium
Calcium
Sodium
Potassium
Aluminum
Nitrate
Nitrite
Fluoride
Hardness, carbonate
Iron
Ammonium
Organic carbon
Bicarbonate
Orthophosphate
Gran acid neutralizing capacity
Alkalinity, total
Inorganic carbon
Carbonate
Alkalinity, carbonate
Hardness, non-carbonate
Carbon Dioxide, free CO2
Alkalinity, Phenolphthalein (total hydroxide+1/2 carbonate)
Examples
>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> len(ds.parameters)
22
>>> len(ds.sites)
26322
>>> coords = ds.stn_coords()
>>> coords.shape
(26322, 2)
- __init__(remove_csv_after_download=False, path=None, **kwargs)[source]
- Parameters:
remove_csv_after_download (bool (default=False)) – if True, the csv will be removed after downloading and processing.
- fetch(parameters: list | str = None, station_id: list | str = None, station_names: list | str = None) DataFrame[source]
- Parameters:
parameters (str/list (default=None)) – Names of parameters to fetch. By default, name, value, val_unit, location, lat, and long are read.
station_id (str/list (default=None)) – ID or IDs of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_names should not be given.
station_names (str/list (default=None)) – name or names of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_id should not be given.
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> st_name = "Jordan Lake"
>>> df = df[df['location'] == st_name]
>>> df.shape
(4, 6)
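The rule that station_id and station_names must not be given together can be sketched as a small argument check. This is only an illustration of the documented contract, not the library's implementation: the helper name resolve_station_filter and the column name "station_id" are assumptions, while "location" is the column the example above filters on.

```python
from typing import List, Optional, Tuple, Union

StrOrList = Union[str, List[str]]


def resolve_station_filter(
    station_id: Optional[StrOrList] = None,
    station_names: Optional[StrOrList] = None,
) -> Optional[Tuple[str, List[str]]]:
    """Hypothetical helper: pick the column and the values to filter on.

    Returns None when neither argument is given, meaning "all stations".
    """
    if station_id is not None and station_names is not None:
        raise ValueError("give either station_id or station_names, not both")
    if station_id is not None:
        values = [station_id] if isinstance(station_id, str) else list(station_id)
        return "station_id", values  # assumed column name
    if station_names is not None:
        values = [station_names] if isinstance(station_names, str) else list(station_names)
        return "location", values  # column holding station names in the example
    return None


print(resolve_station_filter(station_names="Jordan Lake"))
```

Normalizing a single string to a one-element list keeps the later filtering step identical for both the str and the list form of each argument.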
- property names: dict
Maps the names of parameters used in this class to their original names in the SWatCh dataset, in the form of a Python dictionary.
- class aqua_fetch.GRQA(download_source: bool = False, path=None, **kwargs)[source]
Bases: Datasets
Global River Water Quality Archive following the work of Virro et al., 2021. This dataset comprises 42 parameters for 94955 sites across 116 countries.
Examples
>>> from aqua_fetch import GRQA
>>> ds = GRQA(path="/mnt/datawaha/hyex/atr/data")
>>> ds.parameters
['TPP', 'PON', 'TEMP', 'TSS', ...]
>>> print(len(ds.parameters))
42
>>> len(ds.countries)
116
>>> len(ds.stations())
94955
>>> coords = ds.stn_coords()
>>> coords.shape
(94955, 2)
>>> country = "Pakistan"
>>> len(ds.fetch_parameter('TEMP', country=country))
1324
>>> df = ds.fetch_parameter("TEMP", country=country)
>>> print(df.shape)
(1324, 38)
>>> df = ds.fetch_parameter("NH4N", country=country)
>>> print(df.shape)
(28, 36)
- __init__(download_source: bool = False, path=None, **kwargs)[source]
- Parameters:
download_source (bool) – whether to download source data or not
- fetch_parameter(parameter: str = 'COD', site_name: List[str] | str = None, country: List[str] | str = None, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None) DataFrame[source]
- Parameters:
- Returns:
a pandas dataframe
- Return type:
pd.DataFrame
Example
>>> from aqua_fetch import GRQA
>>> dataset = GRQA()
>>> df = dataset.fetch_parameter()
... # fetch data for only one country
>>> cod_pak = dataset.fetch_parameter("COD", country="Pakistan")
... # fetch data for only one site
>>> cod_kotri = dataset.fetch_parameter("COD", site_name="Indus River - at Kotri")
... # find the number of data points and sites available for a specific country
>>> for para in dataset.parameters:
...     data = dataset.fetch_parameter(para, country="Germany")
...     if len(data) > 0:
...         print(f"{para}, {data.shape}, {len(data['site_name'].unique())}")
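The per-parameter summary in the loop above (number of data points and number of distinct sites) can be tried without downloading the archive. The sketch below uses a few made-up rows in place of the frames returned by fetch_parameter; the row layout and the helper name summarize are assumptions for illustration.

```python
from collections import defaultdict

# Toy rows standing in for what fetch_parameter returns (values are made up)
rows = [
    {"parameter": "COD", "site_name": "site A"},
    {"parameter": "COD", "site_name": "site A"},
    {"parameter": "COD", "site_name": "site B"},
    {"parameter": "TEMP", "site_name": "site A"},
]


def summarize(rows):
    """Return {parameter: (number of rows, number of distinct sites)}."""
    n_rows = defaultdict(int)
    sites = defaultdict(set)
    for row in rows:
        n_rows[row["parameter"]] += 1
        sites[row["parameter"]].add(row["site_name"])
    return {para: (n_rows[para], len(sites[para])) for para in n_rows}


summary = summarize(rows)
for para, (n, n_sites) in summary.items():
    print(f"{para}, {n} rows, {n_sites} sites")
```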
- class aqua_fetch.Quadica(path=None, **kwargs)[source]
Bases: Datasets
Dataset of 10 water quality parameters from 1386 stations in Germany, covering 1950 to 2018, following the work of Ebeling et al., 2022. The data is available at monthly and annual time steps, but the monthly time series are not continuous. The following parameters are available in this dataset:
Q : Discharge
NO3 : Nitrate
NO3N : Nitrate-N
NMin : Mineral nitrogen
TN : Total Nitrogen
PO4 : Phosphate
PO4P : Phosphate-P
TP : Total Phosphorus
DOC : Dissolved Organic Carbon
TOC : Total Organic Carbon
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> len(dataset.stations())
1386
>>> coords = dataset.stn_coords()
>>> coords.shape
(1386, 2)
>>> df = dataset.wrtds_monthly()
>>> df.shape
(50186, 47)
>>> df = dataset.wrtds_annual()
>>> df.shape
(4213, 46)
>>> df = dataset.pet()
>>> df.shape
(828, 1386)
>>> df = dataset.avg_temp()
>>> df.shape
(828, 1388)
>>> df = dataset.precipitation()
>>> df.shape
(828, 1388)
>>> df = dataset.catchment_attributes()
>>> df.shape
(1386, 112)
>>> df = dataset.metadata()
>>> df.shape
(1386, 60)
>>> df = dataset.monthly_medians()
>>> df.shape
(16629, 18)
>>> df = dataset.annual_medians()
>>> df.shape
(24393, 18)
>>> df = dataset.fetch_monthly()
>>> df[0].shape
(50186, 47)
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- annual_medians() DataFrame[source]
Annual medians over the whole time series of water quality variables and discharge
- Returns:
a dataframe of shape (24393, 18)
- Return type:
pd.DataFrame
- avg_temp(stations: List[int] | int = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]
Monthly medians of average temperature from 1950-01 to 2018-09.
- Parameters:
stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09
- Returns:
a pandas dataframe of shape (time_steps, stations). With default input arguments, the shape is (828, 1386)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.avg_temp()  # -> (828, 1388)
- catchment_attributes(parameters: List[str] | str = None, stations: List[int] | int = None) DataFrame[source]
Returns static physical catchment attributes in the form of dataframe.
- Parameters:
parameters (list/str, optional, (default=None)) – name/names of static attributes to fetch
stations (list/int, optional (default=None)) – name/names of stations whose static/physical parameters are to be read
- Returns:
a pandas dataframe of shape (stations, parameters). With default input arguments, shape is (1386, 113)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> cat_features = dataset.catchment_attributes()
... # get attributes of only selected stations
>>> dataset.catchment_attributes(stations=[1,2,3])
- fetch_monthly(parameters: List[str] | str = None, stations: List[int] | int = 'all', median: bool = True, fnc: bool = True, fluxes: bool = True, precipitation: bool = True, avg_temp: bool = True, pet: bool = True, only_continuous: bool = True, cat_features: bool = True, max_nan_tol: int | None = 0) Tuple[DataFrame, DataFrame][source]
Fetches monthly concentrations of water quality parameters.
- Parameters:
parameters (str/list, optional (default=None)) – name or names of water quality parameters to fetch. By default, the following parameters are considered:
NO3
NO3N
TN
Nmin
PO4
PO4P
TP
DOC
TOC
stations (int/list, optional (default='all')) – name or names of stations whose data is to be fetched
median (bool, optional (default=True)) – whether to fetch median concentration values or not
fnc (bool, optional (default=True)) – whether to fetch flow normalized concentrations or not
fluxes (bool, optional (default=True)) – Setting this to true will add two parameters i.e. mean_Flux_FEATURE and mean_FNFlux_FEATURE
precipitation (bool, optional (default=True)) – whether to fetch average monthly precipitation or not
avg_temp (bool, optional (default=True)) – whether to fetch average monthly temperature or not
pet (bool, optional (default=True)) – whether to fetch potential evapotranspiration data or not
only_continuous (bool, optional (default=True)) – If true, will return data for only those stations that have continuous monthly time series data from 1993-01-01 to 2013-01-01.
cat_features (bool, optional (default=True)) – whether to fetch catchment parameters or not.
max_nan_tol (int, optional (default=0)) – setting this value to 0 will remove the whole time-series with any missing values. If None, no time-series with NaNs values will be removed.
- Returns:
two dataframes of the same length but with different columns:
a pandas dataframe of time series of parameters (stations*timesteps, dynamic_features)
a pandas dataframe of static parameters (stations*timesteps, catchment_features)
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> mon_dyn, mon_cat = dataset.fetch_monthly(max_nan_tol=None)
... # However, mon_dyn contains data for all parameters, many of which have
... # a large number of NaNs. To fetch only the data related to TN without any
... # missing values, we can do as below
>>> mon_dyn_tn, mon_cat_tn = dataset.fetch_monthly(parameters="TN", max_nan_tol=0)
... # to find out how many catchments are included in mon_dyn_tn
>>> len(mon_dyn_tn['OBJECTID'].unique())
... # 25
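The effect of max_nan_tol can be pictured as a per-station filter on the number of missing values. The following is a minimal sketch under stated assumptions, not the library's code: the function name filter_by_nan_tolerance and the toy series are hypothetical, and the behaviour for values other than 0 and None (dropping any series with more than max_nan_tol missing values) is an assumed generalization of the two documented cases.

```python
import math
from typing import Dict, List, Optional


def filter_by_nan_tolerance(
    series_by_station: Dict[str, List[float]],
    max_nan_tol: Optional[int] = 0,
) -> Dict[str, List[float]]:
    """Keep only the stations whose series has at most max_nan_tol NaNs.

    max_nan_tol=None disables the filter, so every series is kept.
    """
    if max_nan_tol is None:
        return dict(series_by_station)
    kept = {}
    for station, values in series_by_station.items():
        n_missing = sum(1 for v in values if math.isnan(v))
        if n_missing <= max_nan_tol:
            kept[station] = values
    return kept


# made-up monthly values for three stations
toy = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [1.0, math.nan, 3.0],
    "s3": [math.nan, math.nan, 3.0],
}
print(sorted(filter_by_nan_tolerance(toy, max_nan_tol=0)))
print(sorted(filter_by_nan_tolerance(toy, max_nan_tol=None)))
```

With the strict setting only the complete series survives, which is why a call with max_nan_tol=0 can return far fewer catchments than one with max_nan_tol=None.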
- metadata() DataFrame[source]
Fetches the metadata about the stations as a pandas dataframe. Each row represents metadata about one station and each column represents one feature. The R2 and pbias columns are the coefficient of determination and the percent bias of the WRTDS models for each parameter.
- Returns:
a dataframe of shape (1386, 60)
- Return type:
pd.DataFrame
- monthly_medians(parameters: List[str] | str = None, stations: List[int] | int = None) DataFrame[source]
This function reads the c_months.csv file which contains the monthly medians over the whole time series of water quality variables and discharge
- Parameters:
parameters (list/str, optional, (default=None)) – name/names of parameters
stations (list/int, optional (default=None)) – stations for which the data is to be fetched
- Returns:
a dataframe of shape (16629, 18). 15 of the 18 columns represent a water chemistry parameter. The 16629 rows are close to 1386*12 = 16632, where 1386 is the number of stations and 12 is the number of months.
- Return type:
pd.DataFrame
- pet(stations: List[str] | str = 'all', st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]
average monthly potential evapotranspiration starting from 1950-01 to 2018-09
- Returns:
a dataframe of shape (828, 1386), where 828 is the number of months from 1950-01 to 2018-09 and 1386 is the number of stations
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.pet()  # -> (828, 1386)
- precipitation(stations: List[int] | int = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]
Monthly sums of precipitation from 1950-01 to 2018-09.
- Parameters:
stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09
- Returns:
a dataframe of shape (828, 1388)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.precipitation()  # -> (828, 1388)
- stn_coords() DataFrame[source]
Returns the coordinates of all the stations in the dataset in wgs84 projection.
- Returns:
A dataframe with columns ‘lat’, ‘long’
- Return type:
pd.DataFrame
- to_DataSet(target: str = 'TP', input_features: list = None, split: str = 'temporal', lookback: int = 24, **ds_args)[source]
Prepares data for a machine learning prediction problem. It returns an instance of ai4water.preprocessing.DataSetPipeline, which can be given to model.fit or model.predict.
- Parameters:
target (str, optional (default="TP")) – parameter to consider as target
input_features (list, optional) – names of input parameters
split (str, optional (default="temporal")) – if temporal, validation and test sets are taken from the data of each station and then concatenated. If spatial, the training, validation and test sets are decided based upon stations.
lookback (int)
**ds_args – key word arguments
- Returns:
an instance of DataSetPipeline
- Return type:
ai4water.preprocessing.DataSet
Example
>>> from aqua_fetch import Quadica
... # initialize the Quadica class
>>> dataset = Quadica()
... # define the input parameters
>>> inputs = ['median_Q', 'OBJECTID', 'avg_temp', 'precip', 'pet']
... # prepare data for TN as target
>>> dsp = dataset.to_DataSet("TN", inputs, lookback=24)
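The difference between the temporal and spatial values of split can be sketched on toy data. This is an illustration of the two ideas only: the function names, the 70/15/15 percentages and the data layout (a dictionary of per-station series) are assumptions, and the actual splitting is done inside ai4water.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (station, value)


def temporal_split(data: Dict[str, List[float]], train_pct: int = 70, val_pct: int = 15):
    """Cut every station's series in time, then concatenate the pieces."""
    train: List[Sample] = []
    val: List[Sample] = []
    test: List[Sample] = []
    for station, series in data.items():
        n = len(series)
        i = n * train_pct // 100
        j = n * (train_pct + val_pct) // 100
        train += [(station, v) for v in series[:i]]
        val += [(station, v) for v in series[i:j]]
        test += [(station, v) for v in series[j:]]
    return train, val, test


def spatial_split(data: Dict[str, List[float]], train_pct: int = 70, val_pct: int = 15):
    """Assign whole stations to the training, validation or test set."""
    stations = list(data)
    n = len(stations)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return stations[:i], stations[i:j], stations[j:]


# ten made-up stations with twenty time steps each
toy = {f"s{k}": [float(t) for t in range(20)] for k in range(10)}
train, val, test = temporal_split(toy)
train_stn, val_stn, test_stn = spatial_split(toy)
print(len(train), len(val), len(test))
print(len(train_stn), len(val_stn), len(test_stn))
```

In the temporal case every station contributes to all three sets, so the model is tested on later periods of stations it has already seen. In the spatial case the test stations are never seen during training.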
- wrtds_annual(parameters: str | list = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]
Annual median concentrations, flow-normalized concentrations, and mean fluxes estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability.
- Parameters:
parameters (optional)
st (optional) – starting point of data. By default, the data starts from 1992
en (optional) – end point of data. By default, the data ends at 2013
- Returns:
a dataframe of shape (4213, 46)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_annual()
- wrtds_monthly(parameters: str | list = None, stations: List[str] | str = 'all', st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) DataFrame[source]
Monthly median concentrations, flow-normalized concentrations and mean fluxes of water chemistry parameters. These are estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability. This data is available for a total of 140 stations. The records of the stations do not all start and end in the same period, so some stations have more data points than others. The maximum number of data points for a station is 576, while the minimum is 244.
- Parameters:
parameters (str/list, optional)
stations (str/list, optional (default='all')) – name/names of stations whose data is to be retrieved.
st (optional) – starting point of data. By default, the data starts from 1992-09
en (optional) – end point of data. By default, the data ends at 2013-12
- Returns:
a dataframe of shape (50186, 47)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_monthly()
- class aqua_fetch.RC4USCoast(path=None, *args, **kwargs)[source]
Bases: Datasets
Monthly river water chemistry (N, P, SiO2, DO, etc.), discharge and temperature of 140 monitoring sites on US coasts from 1950 to 2020, following the work of Gomez et al., 2022.
Examples
>>> from aqua_fetch import RC4USCoast
>>> dataset = RC4USCoast()
>>> len(dataset.stations)
140
>>> len(dataset.parameters)
27
>>> stn_coords = dataset.stn_coords()
>>> stn_coords.shape
(140, 2)
- __init__(path=None, *args, **kwargs)[source]
- Parameters:
path – path where the data is already downloaded. If None, the data will be downloaded to disk.
- fetch_chem(parameter, stations: List[int] | int | str = 'all', as_dataframe: bool = False, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None)[source]
Returns water chemistry parameters from one or more stations.
- Parameters:
stations (list, str) – name/names of stations from which the parameters are to be fetched
as_dataframe (bool (default=False)) – whether to return data as pandas.DataFrame or xarray.Dataset
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201
- Return type:
pandas DataFrame or xarray Dataset
Examples
>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> data = ds.fetch_chem(['temp', 'do'])
>>> data
>>> data = ds.fetch_chem(['temp', 'do'], as_dataframe=True)
>>> data.shape  # this is a multi-indexed dataframe
(119280, 4)
>>> data = ds.fetch_chem(['temp', 'do'], st="19800101", en="20181230")
- fetch_q(stations: int | List[int] | str | ndarray = 'all', as_dataframe: bool = True, nv=0, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None)[source]
returns discharge data
- Parameters:
stations – stations for which q is to be fetched
as_dataframe (bool (default=True)) – whether to return the data as pd.DataFrame or as xarray.Dataset
nv (int (default=0))
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201
Examples
>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
... # get data of all stations as DataFrame
>>> q = ds.fetch_q("all")
>>> q.shape
(852, 140)  # where 140 is the number of stations
... # get data of only two stations
>>> q = ds.fetch_q([1,10])
>>> q.shape
(852, 2)
... # get data as xarray Dataset
>>> q = ds.fetch_q("all", as_dataframe=False)
>>> type(q)
xarray.core.dataset.Dataset
... # getting data between specific periods
>>> data = ds.fetch_q("all", st="20000101", en="20181230")
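The shapes quoted in the fetch_chem and fetch_q examples follow from the default period and the number of stations, which the short calculation below checks. The helper months_inclusive is a hypothetical name. That the multi-indexed frame holds one row per station per month is an inference from these numbers, not something the documentation states.

```python
from datetime import date


def months_inclusive(start: date, end: date) -> int:
    """Number of monthly time steps from start to end, both months included."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


# default period of the dataset: 19500101 to 20201201
n_months = months_inclusive(date(1950, 1, 1), date(2020, 12, 1))
n_stations = 140

print(n_months)               # rows of the wide discharge frame
print(n_months * n_stations)  # rows of the long, multi-indexed chemistry frame
```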
- property parameters: List[str]
returns names of parameters
Examples
>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> len(ds.parameters)
27
- class aqua_fetch.CamelsChem(path=None, **kwargs)[source]
Bases: Datasets
Water quality data from the USA following the work of Sterle et al., 2024. This dataset has 18 water chemistry parameters from 1980-01-01 to 2018-12-31. The data is downloaded from hydroshare. Out of 671 stations, 155 stations have no water quality data. The wet deposition data consists of 12 parameters from 1985 to 2018.
Examples
>>> from aqua_fetch import CamelsChem
>>> ds = CamelsChem(path='/path/to/dataset')
>>> len(ds.stations())
516
>>> len(ds.parameters)
28
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- fetch(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') Dict[str, DataFrame][source]
fetches the data for the given stations and parameters
- Parameters:
- Returns:
dictionary of dataframes for each station
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> ds = CamelsChem(path='/mnt/datawaha/hyex/atr/data')
>>> data = ds.fetch(stations=['1591400', '6350000'], parameters=['cl_mg/l', 'na_mg/l'])
>>> data = ds.fetch('1591400', 'cl_mg/l')['1591400']
>>> data.shape  # (55, 1)
... # get all parameters for a station
>>> data = ds.fetch('1591400')['1591400']
>>> data.shape  # (55, 28)
>>> all_data = ds.fetch()  # get all parameters of all stations
>>> len(all_data)  # 516
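Because fetch returns one table per station, a common next step is to stack them into a single long table with the station as an extra column. The sketch below shows that idea on plain Python data. The toy dictionary, its dates and values, and the helper name to_long are all made up for illustration.

```python
from typing import Dict, List

# Toy stand-in for the dictionary returned by fetch: station id -> rows.
# Dates and values are invented for illustration.
per_station: Dict[str, List[dict]] = {
    "1591400": [
        {"date": "1990-01-15", "cl_mg/l": 12.0},
        {"date": "1990-02-12", "cl_mg/l": 10.5},
    ],
    "6350000": [
        {"date": "1991-06-03", "cl_mg/l": 3.2},
    ],
}


def to_long(per_station: Dict[str, List[dict]]) -> List[dict]:
    """Stack per-station rows into one list, tagging each row with its station."""
    long_rows = []
    for station, rows in per_station.items():
        for row in rows:
            long_rows.append({"station": station, **row})
    return long_rows


long_rows = to_long(per_station)
print(len(long_rows))
```

With real DataFrames, pandas.concat accepts such a dictionary directly and uses its keys as the outer level of the resulting index.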
- fetch_atm_dep(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') Dict[str, DataFrame][source]
fetches the atmospheric (wet) deposition data for the given stations and parameters
- Parameters:
- Returns:
dictionary of dataframes for each station
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> ds = CamelsChem(path='/mnt/datawaha/hyex/atr/data')
... # get data for a single station and a single parameter
>>> data = ds.fetch_atm_dep(stations='1591400', parameters='cl')
>>> print(data['1591400'].shape)  # (34, 8)
... # get data for multiple stations and multiple parameters
>>> data = ds.fetch_atm_dep(stations=['1591400', '6350000'], parameters=['cl', 'na'])
>>> print(data['1591400'].shape)  # (34, 16)
>>> print(data['6350000'].shape)  # (34, 16)
... # get data for all stations and for all parameters
>>> data = ds.fetch_atm_dep()
>>> print(len(data))  # 671
- class aqua_fetch.SyltRoads(path=None, **kwargs)[source]
Bases: Datasets
Dataset of physico-hydro-chemical time series data at Sylt Roads from 1973 - 2019 following Rick et al., 2023. The following parameters are available:
location
Depth water [m]
Sal
Temp [°C]
[PO4]3- [µmol/l]
[NH4]+ [µmol/l]
[NO2]- [µmol/l]
[NO3]- [µmol/l]
Si(OH)4 [µmol/l]
SPM [mg/l]
pH
O2 [µmol/l]
Chl a [µg/l]
DON [µmol/l]
DOP [µmol/l]
DIN [µmol/l]
Examples
>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- fetch(parameters: str | List[str] = 'all') DataFrame[source]
Fetch the data from the dataset
- Parameters:
parameters (str or List[str], optional) – Parameters to fetch. Default is 'all', which will fetch all parameters
- Returns:
DataFrame containing the data
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()
>>> df = ds.fetch()
>>> df.shape
(5710, 16)
>>> len(ds.parameters)
16
>>> ds.fetch(['Sal', 'Temp [°C]', 'pH']).shape
(5710, 3)
- class aqua_fetch.SanFranciscoBay(path=None, **kwargs)[source]
Bases: Datasets
Time series of water quality parameters from 59 stations in San Francisco Bay from 1969 - 2015. For details on the data see Cloern et al., 2017 and Schraga et al., 2017. The following parameters are available:
Depth
Discrete_Chlorophyll
Ratio_DiscreteChlorophyll_Pheopigment
Calculated_Chlorophyll
Discrete_Oxygen
Calculated_Oxygen
Oxygen_Percent_Saturation
Discrete_SPM
Calculated_SPM
Extinction_Coefficient
Salinity
Temperature
Sigma_t
Nitrite
Nitrate_Nitrite
Ammonium
Phosphate
Silicate
Examples
>>> from aqua_fetch import SanFranciscoBay
>>> ds = SanFranciscoBay()
>>> data = ds.data()
>>> data.shape
(212472, 19)
>>> stations = ds.stations()
>>> len(stations)
59
>>> parameters = ds.parameters()
>>> len(parameters)
18
... # fetch data for station 18
>>> stn18 = ds.fetch(stations='18')
>>> stn18.shape
(13944, 18)
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- class aqua_fetch.GRiMeDB(path=None, **kwargs)[source]
Bases: Datasets
Global river database of methane concentrations and fluxes from 5029 stations of 305 rivers following Stanley et al., 2023.
Examples
>>> from aqua_fetch import GRiMeDB
>>> ds = GRiMeDB(path='/path/to/dataset')
>>> ds.stations()
>>> ds.streams
>>> coords = ds.stn_coords()
>>> coords.shape
(5029, 2)
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- concentrations(stations: str | List[str] = 'all', streams: str | List[str] = 'all', parameters: str | List[str] = 'all')[source]
Get concentrations data.
- Parameters:
stations (Union[str, List[str]], optional) – station ID or list of station IDs, by default “all”. If given, then streams must not be given. Check .stations() method for available stations.
streams (Union[str, List[str]], optional) – stream name or list of stream names, by default “all”. If given, then stations must not be given. Check .streams attribute for available streams.
parameters (Union[str, List[str]], optional) – parameters to return, by default “all”. Check .parameters attribute for available parameters.
- fluxes(stations: str | List[str] = 'all') DataFrame[source]
returns fluxes data as a pandas dataframe
- class aqua_fetch.BuzzardsBay(path=None, **kwargs)[source]
Bases: Datasets
Water quality measurements in Buzzards Bay from 1992 - 2018. For more details on the data see Jakuba et al. The data is downloaded from the MBLWHOI Library.
Examples
>>> from aqua_fetch import BuzzardsBay
>>> ds = BuzzardsBay()
>>> doc = ds.doc()
>>> doc.shape
(11092, 4)
>>> chla = ds.chla()
>>> chla.shape
(1028, 10)
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- class aqua_fetch.WhiteClayCreek(path=None, **kwargs)[source]
Bases: Datasets
Time series of water quality parameters from White Clay Creek.
chl-a : 2001 - 2012
Dissolved Organic Carbon : 1977 - 2017
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- class aqua_fetch.SeluneRiver(path=None, **kwargs)[source]
Bases: Datasets
Dataset of physico-chemical variables measured at different levels during 2021 and 2022 for characterization of the hyporheic zone of the Selune River, Manche, Normandie, France, following Moustapha Ba et al., 2023. The data is available at data.gouv.fr. The following variables are available:
water level
temperature
conductivity
oxygen
pressure
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- aqua_fetch.busan_beach(inputs: list = None, target: list | str = 'tetx_coppml') DataFrame[source]
Loads the antibiotic resistance genes (ARG) data from a recreational beach in Busan, South Korea, along with environmental variables.
The data is in the form of a multivariate time series and was collected over a period of 2 years during several precipitation events. The frequency of the environmental data is 30 minutes, while that of the ARG data is discontinuous. The data and its pre-processing are described in detail in Jang et al., 2021.
- Parameters:
inputs –
features to use as input. By default all environmental data is used which consists of following parameters
tide_cm
wat_temp_c
sal_psu
air_temp_c
pcp_mm
pcp3_mm
pcp6_mm
pcp12_mm
wind_dir_deg
wind_speed_mps
air_p_hpa
mslp_hpa
rel_hum
target –
feature/features to use as target/output. By default tetx_coppml is used as target. Logically one or more from following can be considered as target
ecoli
16s
inti1
Total_args
tetx_coppml
sul1_coppml
blaTEM_coppml
aac_coppml
Total_otus
otu_5575
otu_273
otu_94
- Returns:
a pandas dataframe with inputs and target and indexed with pandas.DateTimeIndex
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import busan_beach
>>> dataframe = busan_beach()
>>> dataframe.shape
(1446, 14)
>>> dataframe = busan_beach(target=['tetx_coppml', 'sul1_coppml'])
>>> dataframe.shape
(1446, 15)
See usage here for more details.
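The column counts in the shapes above (14 with the default target, 15 with two targets) follow from the 13 default input features plus the requested targets. A minimal sketch of that bookkeeping is given below. The helper select_columns is hypothetical, and placing the inputs before the targets is an assumption about the column order.

```python
from typing import List, Optional, Union

# the 13 default environmental input features listed for busan_beach
DEFAULT_INPUTS = [
    "tide_cm", "wat_temp_c", "sal_psu", "air_temp_c", "pcp_mm",
    "pcp3_mm", "pcp6_mm", "pcp12_mm", "wind_dir_deg", "wind_speed_mps",
    "air_p_hpa", "mslp_hpa", "rel_hum",
]


def select_columns(
    inputs: Optional[List[str]] = None,
    target: Union[str, List[str]] = "tetx_coppml",
) -> List[str]:
    """Hypothetical helper: input columns followed by the target column(s)."""
    chosen_inputs = list(DEFAULT_INPUTS) if inputs is None else list(inputs)
    targets = [target] if isinstance(target, str) else list(target)
    return chosen_inputs + targets


print(len(select_columns()))
print(len(select_columns(target=["tetx_coppml", "sul1_coppml"])))
```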
- aqua_fetch.ecoli_mekong(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong river (Houay Pano) area from 2011 to 2021, following Boithias et al., 2022.
- Parameters:
st (optional) – starting time. The default starting point is 2011-05-25 10:00:00
en (optional) – end time. The default end point is 2021-05-25 15:41:00
parameters (str, optional) – names of features to use. Use all to get all features. By default, the following input features are selected:
station_name : name of station/catchment where the observation was made
T : temperature
EC : electrical conductance
DOpercent : dissolved oxygen saturation
DO : dissolved oxygen concentration
pH : pH
ORP : oxidation-reduction potential
Turbidity : turbidity
TSS : total suspended sediment concentration
E-coli_4dilutions : Escherichia coli concentration
overwrite (bool) – whether to overwrite the downloaded file or not
- Returns:
with default parameters, the shape is (1602, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_mekong
>>> ecoli_data = ecoli_mekong()
>>> ecoli_data.shape
(1602, 10)
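The st and en arguments bound the observations in time. A minimal sketch of such a filter is shown below, assuming 'YYYYMMDD' strings as in the function signature. The helpers parse_bound and in_range are hypothetical, and treating an int as a year is an assumption made only for this illustration, since the documentation does not say how integers are interpreted.

```python
from datetime import datetime
from typing import List, Union

Bound = Union[str, int, datetime]


def parse_bound(value: Bound) -> datetime:
    """Hypothetical helper: turn a 'YYYYMMDD' string or a year into a datetime."""
    if isinstance(value, datetime):
        return value
    if isinstance(value, int):
        return datetime(value, 1, 1)  # assumption: an int means a year
    return datetime.strptime(value, "%Y%m%d")


def in_range(timestamps: List[datetime], st: Bound = "20110101", en: Bound = "20211231"):
    """Keep the timestamps lying between st and en, both bounds included."""
    start, end = parse_bound(st), parse_bound(en)
    # a date-only bound parses to midnight, so later times on the end day fall outside
    return [t for t in timestamps if start <= t <= end]


observations = [
    datetime(2011, 5, 25, 10, 0),
    datetime(2016, 7, 1),
    datetime(2021, 5, 25, 15, 41),
    datetime(2022, 1, 1),
]
print(len(in_range(observations)))
print(in_range(observations, st=2016, en="20161231"))
```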
- aqua_fetch.ecoli_mekong_laos(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, station_name: str = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong river (Northern Laos).
- Parameters:
- Returns:
with default parameters, the shape is (1131, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_mekong_laos
>>> ecoli = ecoli_mekong_laos()
>>> ecoli.shape
(1131, 10)
- aqua_fetch.ecoli_houay_pano(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong river (Houay Pano) area.
- Parameters:
st (optional) – starting time. The default starting point is 2011-05-25 10:00:00
en (optional) – end time. The default end point is 2021-05-25 15:41:00
parameters (str, optional) – names of features to use. Use all to get all features. By default, the following input features are selected:
station_name : name of station/catchment where the observation was made
T : temperature
EC : electrical conductance
DOpercent : dissolved oxygen saturation
DO : dissolved oxygen concentration
pH : pH
ORP : oxidation-reduction potential
Turbidity : turbidity
TSS : total suspended sediment concentration
E-coli_4dilutions : Escherichia coli concentration
overwrite (bool) – whether to overwrite the downloaded file or not
- Returns:
with default parameters, the shape is (413, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_houay_pano
>>> ecoli = ecoli_houay_pano()
>>> ecoli.shape
(413, 10)
- aqua_fetch.ecoli_mekong_2016(st: str | Timestamp | int = '20160101', en: str | Timestamp | int = '20161231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong river for 2016, from 29 catchments.
- Parameters:
- Returns:
with default parameters, the shape is (58, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_mekong_2016
>>> ecoli = ecoli_mekong_2016()
>>> ecoli.shape
(58, 10)