Waste Water Treatment

The wwt submodule contains data from approximately 20,000 experiments on the removal of various contaminants from wastewater using treatment strategies such as adsorption, photocatalysis, membrane filtration, and sonolysis. The submodule provides a unified interface, consisting of a few Python functions, for accessing this data in a standardized format; the data itself is scattered across the literature. Note that we do not introduce new data: these datasets have already been used and analyzed in peer-reviewed scientific publications. What we offer is a simple interface to this existing data. The availability of such a large corpus of experimental data can significantly aid data-driven modeling and material discovery. A summary of these datasets is provided in the following table.

List of datasets

Summary of datasets

=================  ===============================================  ==========  ==============================  ===========  ======================
Treatment Process  Function Name                                    Parameters  Target Pollutant                Data Points  Reference
=================  ===============================================  ==========  ==============================  ===========  ======================
Adsorption         aqua_fetch.ec_removal_biochar()                  26          Emerging Contaminants           3757         Jaffari et al., 2023
Adsorption         aqua_fetch.cr_removal()                          15          Cr                              219          Ishtiaq et al., 2024
Adsorption         aqua_fetch.heavy_metal_removal()                 30          Heavy Metals                    1518         Jaffari et al., 2023
Adsorption         aqua_fetch.po4_removal_biochar()                 30          PO4                             5014         Iftikhar et al., 2024
Adsorption         aqua_fetch.industrial_dye_removal()              12          Industrial Dye                  1514         Iftikhar et al., 2023
Adsorption         aqua_fetch.heavy_metal_removal_Shen()            17          Heavy Metals                    689          Shen et al., 2023
Adsorption         aqua_fetch.P_recovery()                          8           P                               504          Leng et al., 2024
Adsorption         aqua_fetch.N_recovery()                          8           N                               211          Leng et al., 2024
Adsorption         aqua_fetch.As_recovery()                         13          As                              1605         Huang et al., 2024
Photocatalysis     aqua_fetch.mg_degradation()                      11          Malachite Green                 1200         Jaffari et al., 2023
Photocatalysis     aqua_fetch.dye_removal()                         23          Dyes                            1527         Kim et al., 2024
Photocatalysis     aqua_fetch.dichlorophenoxyacetic_acid_removal()  15          2,4-Dichlorophenoxyacetic acid  1044         Kim et al., 2024
Photocatalysis     aqua_fetch.pms_removal()                                                                     2078         submitted et al., 2024
Photocatalysis     aqua_fetch.tetracycline_degradation()            8           Tetracycline                    374          Abdi et al., 2022
Photocatalysis     aqua_fetch.tio2_degradation()                    7           TiO2                            446          Jiang et al., 2020
Photocatalysis     aqua_fetch.photodegradation_Jiang()              8           Multiple                        457          Jiang et al., 2021
Membrane           aqua_fetch.micropollutant_removal_osmosis()      18          Micropollutants                 1906         Jeong et al., 2021
Sonolysis          aqua_fetch.cyanobacteria_disinfection()          6           Cyanobacteria                   314          Jaffari et al., 2024
=================  ===============================================  ==========  ==============================  ===========  ======================
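
As a quick sanity check on the table above, the data-point counts can be tallied per treatment process with a few lines of plain Python. This is only a sketch: the `DATASETS` list is copied by hand from the table and is not an object exported by the package, and `points_per_process` is a hypothetical helper.

```python
from collections import defaultdict

# (treatment process, function name, data points), copied from the summary table.
# This list is illustrative only; the package does not export it.
DATASETS = [
    ("Adsorption", "ec_removal_biochar", 3757),
    ("Adsorption", "cr_removal", 219),
    ("Adsorption", "heavy_metal_removal", 1518),
    ("Adsorption", "po4_removal_biochar", 5014),
    ("Adsorption", "industrial_dye_removal", 1514),
    ("Adsorption", "heavy_metal_removal_Shen", 689),
    ("Adsorption", "P_recovery", 504),
    ("Adsorption", "N_recovery", 211),
    ("Adsorption", "As_recovery", 1605),
    ("Photocatalysis", "mg_degradation", 1200),
    ("Photocatalysis", "dye_removal", 1527),
    ("Photocatalysis", "dichlorophenoxyacetic_acid_removal", 1044),
    ("Photocatalysis", "pms_removal", 2078),
    ("Photocatalysis", "tetracycline_degradation", 374),
    ("Photocatalysis", "tio2_degradation", 446),
    ("Photocatalysis", "photodegradation_Jiang", 457),
    ("Membrane", "micropollutant_removal_osmosis", 1906),
    ("Sonolysis", "cyanobacteria_disinfection", 314),
]


def points_per_process(datasets):
    """Sum the number of data points for each treatment process."""
    totals = defaultdict(int)
    for process, _name, n_points in datasets:
        totals[process] += n_points
    return dict(totals)


totals = points_per_process(DATASETS)
for process, n_points in totals.items():
    print(f"{process:15s} {n_points:6d}")
print(f"{'Total':15s} {sum(totals.values()):6d}")
```

With the counts listed in the table, adsorption accounts for the majority of the data points, followed by photocatalysis.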

Adsorption

aqua_fetch.ec_removal_biochar(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data on the removal of emerging contaminants/pollutants from wastewater using biochar. The data consists of three types of features: 1) adsorption experimental conditions, 2) elemental composition of the adsorbent (biochar), and 3) parameters representing the physical and synthesis conditions of the biochar. For a fuller description of this data, see Jaffari et al., 2023.

Parameters:
  • parameters

    By default, the following features are used as input:

    • adsorbent

    • pyrolysis_temperature

    • pyrolysis_time

    • C

    • H

    • O

    • N

    • (O+N)/C

    • ash

    • H/C

    • O/C

    • N/C

    • surface_area

    • pore_volume

    • average_pore_size

    • pollutant

    • adsorption_time

    • concentration

    • Solution_ph

    • rpm

    • volume

    • adsorbent_dosage

    • adsorption_temperature

    • ion_concentration

    • humid_acid

    • wastewater_type

    • adsorption_type

    • final_concentration

    • capacity

  • encoding (str, default=None) – the type of encoding to use for categorical features. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is a dictionary consisting of encoders with adsorbent, pollutant, wastewater_type and adsorption_type as keys.

Return type:

tuple

Examples

>>> from aqua_fetch import ec_removal_biochar
>>> data, _ = ec_removal_biochar()
>>> data.shape
(3757, 29)
>>> data, encoders = ec_removal_biochar(encoding="le")
>>> data.shape
(3757, 29)
>>> len(set(encoders['adsorbent'].inverse_transform(data.loc[:, "adsorbent"])))
15
>>> len(set(encoders['pollutant'].inverse_transform(data.loc[:, "pollutant"])))
14
>>> set(encoders['wastewater_type'].inverse_transform(data.loc[:, "wastewater_type"]))
{'Ground water', 'Lake water', 'Secondary effluent', 'Synthetic'}
>>> set(encoders['adsorption_type'].inverse_transform(data.loc[:, "adsorption_type"]))
{'Competative', 'Single'}

We can also use one-hot encoding to convert categorical features into numerical features. This increases the number of features/columns in the DataFrame.

>>> data, encoders = ec_removal_biochar(encoding="ohe")
>>> data.shape
(3757, 60)
>>> len(set(encoders['adsorbent'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('adsorbent')]].values)))
15
>>> len(set(encoders['pollutant'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('pollutant')]].values)))
14
>>> set(encoders['wastewater_type'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('wastewater_type')]].values))
{'Ground water', 'Lake water', 'Secondary effluent', 'Synthetic'}
>>> set(encoders['adsorption_type'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('adsorption_type')]].values))
{'Competative', 'Single'}
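
To make the difference between the two encodings concrete, here is a minimal pure-Python sketch of label encoding and one-hot encoding for a single categorical column, including the inverse step. It uses neither the package nor scikit-learn; the helper names (`label_encode`, `one_hot_encode`, `inverse_label`, `inverse_one_hot`) are hypothetical and only illustrate the idea behind `encoding="le"` and `encoding="ohe"`.

```python
def label_encode(values):
    """Map each category to an integer code, using sorted category order."""
    classes = sorted(set(values))
    index = {category: code for code, category in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Replace the column with one 0/1 column per category."""
    classes = sorted(set(values))
    rows = [[1 if v == category else 0 for category in classes] for v in values]
    return rows, classes


def inverse_label(codes, classes):
    """Recover the original categories from integer codes."""
    return [classes[code] for code in codes]


def inverse_one_hot(rows, classes):
    """Recover the original categories from one-hot rows."""
    return [classes[row.index(1)] for row in rows]


# a tiny made-up column, using category names that appear in the data
wastewater_type = ["Synthetic", "Lake water", "Synthetic", "Ground water"]

codes, classes = label_encode(wastewater_type)
rows, _ = one_hot_encode(wastewater_type)

print(codes)          # a single integer column
print(len(rows[0]))   # one column per distinct category
print(inverse_label(codes, classes) == wastewater_type)
print(inverse_one_hot(rows, classes) == wastewater_type)
```

Label encoding keeps the column count unchanged, while one-hot encoding trades the single column for as many columns as there are categories.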
aqua_fetch.cr_removal(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data from experiments conducted for Cr removal from wastewater using adsorption. For details on the data, see Ishtiaq et al., 2024.

Parameters:
  • parameters

    By default, the following parameters are used:

    • adsorbent

    • NaOH_conc_M

    • surface_area

    • pore_volume

    • C_%

    • Al_%

    • Nb_%

    • O_%

    • Na_%

    • pore_size

    • adsorption_time

    • initial_conc

    • loading_g/L

    • volume_l

    • loading_g

    • solution_ph

    • cycle_number

    • final_conc

    • adsorption_capacity

    • removal_efficiency

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is a dictionary consisting of encoder with adsorbent as key.

Return type:

tuple
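
Columns such as initial_conc, final_conc, volume_l, loading_g, adsorption_capacity and removal_efficiency are usually related through the standard batch-adsorption definitions. The sketch below assumes those textbook formulas, with concentrations in mg/L, volume in L and adsorbent mass in g. The source states neither the units nor how the dataset columns were computed, so the functions and the numbers used here are illustrative only.

```python
def removal_efficiency(initial_conc, final_conc):
    """Percentage of pollutant removed: (C0 - Ce) / C0 * 100."""
    if initial_conc <= 0:
        raise ValueError("initial_conc must be positive")
    return (initial_conc - final_conc) / initial_conc * 100.0


def adsorption_capacity(initial_conc, final_conc, volume_l, loading_g):
    """Mass of pollutant adsorbed per unit adsorbent mass: (C0 - Ce) * V / m."""
    if loading_g <= 0:
        raise ValueError("loading_g must be positive")
    return (initial_conc - final_conc) * volume_l / loading_g


# hypothetical experiment: 50 mg/L reduced to 5 mg/L in 0.5 L with 0.25 g adsorbent
efficiency = removal_efficiency(50.0, 5.0)
capacity = adsorption_capacity(50.0, 5.0, volume_l=0.5, loading_g=0.25)
print(f"removal efficiency: {efficiency:.1f} %")
print(f"adsorption capacity: {capacity:.1f} mg/g")
```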

Examples

>>> from aqua_fetch import cr_removal
>>> data, _ = cr_removal()
>>> data.shape
(219, 20)
>>> data, encoders = cr_removal(encoding="le")
>>> data.shape
(219, 20)
>>> len(set(encoders['adsorbent'].inverse_transform(data.loc[:, "adsorbent"])))
5
>>> set(encoders['adsorbent'].inverse_transform(data.loc[:, "adsorbent"]))
{'5M Nb2CTx', '20M Nb2CTx', '15M Nb2CTx', 'Nb2AlC', '10M Nb2CTx'}
>>> # one-hot encoding replaces the adsorbent column with one column per
>>> # adsorbent, which increases the number of columns from 20 to 24
>>> data, encoders = cr_removal(encoding="ohe")
>>> data.shape
(219, 24)
aqua_fetch.po4_removal_biochar(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data from adsorption experiments conducted for phosphate (PO4) removal from wastewater using biochar. For details on the data, see Iftikhar et al., 2023.

Parameters:
  • parameters

    The parameters of the adsorption. Each name must be one of the following:

    • adsorbent

    • feedstock

    • activation

    • pyrolysis_temp

    • heating_rate

    • pyrolysis_time

    • C_%

    • H_%

    • O_%

    • N_%

    • S_%

    • Ca_%

    • ash

    • H/C

    • O/C

    • N/C

    • (O+N/C)

    • surface_area

    • pore_volume

    • avg_pore_size

    • adsorption_time_min

    • Ci_ppm

    • solution_pH

    • rpm

    • volume_l

    • loading_g

    • loading_g/L

    • adsorption_temp

    • ion_concentration_mM

    • ion_type

    • final_conf

    • qe

    • efficiency

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

aqua_fetch.heavy_metal_removal(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data from experiments conducted for heavy metal removal from wastewater using adsorption. For more details on the data, see Jaffari et al., 2024.

Parameters:
  • parameters

    By default, the following parameters are used:

    • adsorbent

    • NaOH_conc_M

    • surface_area

    • pore_volume

    • C_%

    • Al_%

    • Nb_%

    • O_%

    • Na_%

    • pore_size

    • adsorption_time

    • initial_conc

    • loading_g/L

    • volume_l

    • loading_g

    • solution_ph

    • cycle_number

    • final_conc

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is a dictionary consisting of encoder with adsorbent as key.

Return type:

tuple

Examples

>>> from aqua_fetch import heavy_metal_removal
>>> data, _ = heavy_metal_removal()
>>> data.shape
(219, 18)
>>> data, encoders = heavy_metal_removal(encoding="le")
>>> data.shape
(219, 18)
>>> len(set(encoders['adsorbent'].inverse_transform(data.loc[:, "adsorbent"])))
5
>>> set(encoders['adsorbent'].inverse_transform(data.loc[:, "adsorbent"]))
{'5M Nb2CTx', '20M Nb2CTx', '15M Nb2CTx', 'Nb2AlC', '10M Nb2CTx'}
>>> data, encoders = heavy_metal_removal(encoding="ohe")
>>> data.shape
(219, 22)
>>> len(set(encoders['adsorbent'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('adsorbent')]].values)))
5
aqua_fetch.industrial_dye_removal(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data from experiments conducted for industrial dye removal from wastewater using adsorption. For more details on the data, see Iftikhar et al., 2023.

Parameters:
  • parameters

    By default, the following parameters are used:

    • adsorbent

    • calcination_temperature

    • calcination_time_min

    • C_%

    • H_%

    • O_%

    • N_%

    • ash

    • H/C

    • O/C

    • N/C

    • surface_area

    • pore_volume

    • average_pore_size

    • dye

    • adsorption_time_min

    • initial_concentration

    • solution_ph

    • rpm

    • volume_l

    • loading_g/l

    • adsorption_temperature

    • ion_concentration_M

    • humic_acid

    • wastewater_type

    • adsorption_type

    • final_concentration

    • qe

    • adsorbent_loading

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is a dictionary consisting of encoders with adsorbent and dye as keys.

Return type:

tuple

Examples

>>> from aqua_fetch import industrial_dye_removal
>>> data, _ = industrial_dye_removal()
>>> data.shape
(680, 29)
>>> data, encoders = industrial_dye_removal(encoding="le")
>>> data.shape
(680, 29)
>>> len(set(encoders['adsorbent'].inverse_transform(data.loc[:, "adsorbent"])))
7
>>> len(set(encoders['dye'].inverse_transform(data.loc[:, "dye"])))
4
>>> data, encoders = industrial_dye_removal(encoding="ohe")
>>> data.shape
(680, 38)
>>> len(set(encoders['adsorbent'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('adsorbent')]].values)))
7
>>> len(set(encoders['dye'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('dye')]].values)))
4
aqua_fetch.heavy_metal_removal_Shen(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data from experiments conducted for heavy metal removal from wastewater using adsorption. For more details on the data, see Shen et al., 2024.

Parameters:
  • parameters

    By default, the following parameters are used:

    • heavy_metal

    • hm_label

    • ph_bichar

    • C_%

    • (O+N)/C

    • O/C

    • H/C

    • ash

    • PS

    • SA

    • CEC

    • temperature

    • solution_ph

    • C0

    • χ

    • r

    • Ncharge

    • n

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is a dictionary consisting of encoders with heavy_metal and hm_label as keys.

Return type:

tuple

Examples

>>> from aqua_fetch import heavy_metal_removal_Shen
>>> data, _ = heavy_metal_removal_Shen()
>>> data.shape
(353, 18)
>>> data, encoders = heavy_metal_removal_Shen(encoding="le")
>>> data.shape
(353, 18)
>>> len(set(encoders['heavy_metal'].inverse_transform(data.loc[:, "heavy_metal"])))
10
>>> len(set(encoders['hm_label'].inverse_transform(data.loc[:, "hm_label"])))
42
>>> data, encoders = heavy_metal_removal_Shen(encoding="ohe")
>>> data.shape
(353, 68)
>>> len(set(encoders['heavy_metal'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('heavy_metal')]].values)))
10
>>> len(set(encoders['hm_label'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('hm_label')]].values)))
42
aqua_fetch.P_recovery(parameters: str | List[str] = 'all', encoding: str = None)

Data from experiments conducted for P recovery from wastewater using adsorption. For more details on the data, see Leng et al., 2024.

Parameters:
  • parameters

    Parameters to use as input. By default, the following parameters are used:

    • stir(rpm)

    • t(min)

    • T(℃)

    • pH

    • N:P

    • Mg:P

    • P_initial(mg/L)

    • P_recovery(%)

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is an empty dictionary.

Return type:

tuple

Examples

>>> from aqua_fetch import P_recovery
>>> data, _ = P_recovery()
>>> data.shape
(504, 8)
aqua_fetch.N_recovery(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, dict]

Data from experiments conducted for N recovery from wastewater using adsorption. For more details on the data, see Leng et al., 2024.

Parameters:
  • parameters

    Parameters to use as input. By default, the following parameters are used:

    • stir(rpm)

    • t(min)

    • T(℃)

    • pH

    • N:P

    • Mg: N

    • P_initial(mg/L)

    • N_recovery(%)

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is an empty dictionary.

Return type:

tuple

Examples

>>> from aqua_fetch import N_recovery
>>> data, _ = N_recovery()
>>> data.shape
(210, 8)
aqua_fetch.As_recovery(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, Any]]

Data from experiments conducted for As recovery from wastewater using adsorption. For more details on the data, see Huang et al., 2023.

Parameters:
  • parameters

    Parameters to use as input. By default, the following parameters are used:

    • material

    • biochar_modification

    • biochar_type

    • BET_surface_area

    • pore_volume

    • solution_pH

    • reactor_temperature

    • initial_As_concentration_mg_L

    • adsorbent_dosage

    • equilibrium_reaction_time_h

    • pyrolysis_temperature

    • As_mg_g

    • As_type

  • encoding (str, default=None) – the type of encoding to use for categorical parameters. If not None, it should be either ohe or le.

Returns:

A tuple of length two. The first element is a DataFrame while the second element is a dictionary consisting of encoders with material, biochar_modification, biochar_type and As_type as keys.

Return type:

tuple

Examples

>>> from aqua_fetch import As_recovery
>>> # Using default parameters
>>> data, _ = As_recovery()
>>> data.shape
(1605, 13)
>>> # Using label encoding
>>> data, encoders = As_recovery(encoding="le")
>>> data.shape
(1605, 13)
>>> len(set(encoders['material'].inverse_transform(data.loc[:, "material"])))
72
>>> len(set(encoders['biochar_modification'].inverse_transform(data.loc[:, "biochar_modification"])))
2
>>> len(set(encoders['biochar_type'].inverse_transform(data.loc[:, "biochar_type"])))
159
>>> len(set(encoders['As_type'].inverse_transform(data.loc[:, "As_type"])))
2
>>> # Using one hot encoding
>>> data, encoders = As_recovery(encoding="ohe")
>>> data.shape
(1605, 244)
>>> len(set(encoders['material'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('material')]].values)))
72
>>> len(set(encoders['biochar_modification'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('biochar_modification')]].values)))
2
>>> len(set(encoders['biochar_type'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('biochar_type')]].values)))
159
>>> len(set(encoders['As_type'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('As_type')]].values)))
2

Photocatalysis

aqua_fetch.mg_degradation(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

This data is about the photocatalytic degradation of malachite green dye using noble-metal-doped BiFeO3. For further description of this data see Jaffari et al., 2023 and for the use of this data for removal efficiency prediction see . This dataset consists of 1200 points collected during ~135 experiments.

Parameters:
  • parameters (list, optional) –

    Features to use as input. By default, the following features are used as input:

    • Catalyst_type

    • Surface area

    • Pore Volume

    • Catalyst_loading (g/L)

    • Light_intensity (W)

    • time (min)

    • solution_pH

    • HA (mg/L)

    • Anions

    • Ci (mg/L)

    • Cf (mg/L)

    • Efficiency (%)

    • k_first

    • k_2nd

  • encoding (str, default=None) – type of encoding to use for the two categorical features, i.e., catalyst_type and anions, to convert them into numerical values. Available options are ohe, le and None. If ohe is selected, the original input columns are replaced with one-hot encoded columns. This results in 6 columns for anions and 15 columns for catalyst_type.

Returns:

A tuple of length two. The first element is a DataFrame of shape (1200, len(parameters)) while the second element is a dictionary consisting of encoders with catalyst_type and anions as keys.

Return type:

tuple
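
The column counts quoted above follow from simple bookkeeping: one-hot encoding removes each categorical column and adds one column per category. The sketch below reproduces that arithmetic with the numbers stated in this section (14 default parameters, 15 catalyst types, 6 anions); `ohe_column_count` is a hypothetical helper and not part of the package.

```python
def ohe_column_count(n_columns, categories_per_feature):
    """Number of columns after one-hot encoding.

    n_columns: number of columns before encoding
    categories_per_feature: mapping of categorical column name -> number of categories
    """
    n_categorical = len(categories_per_feature)
    n_new = sum(categories_per_feature.values())
    return n_columns - n_categorical + n_new


n_ohe = ohe_column_count(14, {"catalyst_type": 15, "anions": 6})
print(n_ohe)
```

The result agrees with the 33 columns reported for encoding="ohe" in the Examples for this function.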

Examples

>>> from aqua_fetch import mg_degradation
>>> mg_data, encoders = mg_degradation()
>>> mg_data.shape
(1200, 14)
>>> # the default encoding is None, but if we want to use one hot encoder
>>> mg_data_ohe, encoders = mg_degradation(encoding="ohe")
>>> mg_data_ohe.shape
(1200, 33)
>>> encoders['catalyst_type'].inverse_transform(mg_data_ohe.loc[:, [col for col in mg_data_ohe.columns if col.startswith('catalyst_type')]].values)
>>> encoders['anions'].inverse_transform(mg_data_ohe.loc[:, [col for col in mg_data_ohe.columns if col.startswith('anions')]].values)
>>> # if we want to use label encoder
>>> mg_data_le, encoders = mg_degradation(encoding="le")
>>> mg_data_le.shape
(1200, 14)
>>> encoders['catalyst_type'].inverse_transform(mg_data_le.loc[:, 'catalyst_type'].values.astype(int))
>>> encoders['anions'].inverse_transform(mg_data_le.loc[:, 'anions'].values.astype(int))
>>> # the efficiency as well as the first-order and second-order rate
>>> # constants (k_first, k_2nd) are among the default parameters
>>> mg_data_k, _ = mg_degradation()
aqua_fetch.dye_removal(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data from experiments measuring the rate of dye removal from wastewater using photocatalysis. For more information on the data, see Kim et al., 2024.

Parameters:

parameters (list, optional) –

features to use as input. It must be a subset of the following features

  • catalyst

  • hydrothermal_synthesis_time_min)

  • energy_Band_gap_Eg) eV

  • C_%

  • O_%

  • Fe_%

  • Al_%

  • Ni_%

  • Mo_%

  • S_%

  • Bi

  • Ag

  • Pd

  • Pt

  • surface_area_m2/g

  • pore_volume_cm3/g

  • pore_size_nm

  • volume_L

  • loading_g

  • catalyst_loading_mg

  • light_intensity_watt

  • light_source_distance_cm

  • time_m

  • dye

  • log_Kw

  • hydrogen_bonding_acceptor_count

  • hydrogen_bonding_donor_count

  • solubility_g/L

  • molecular_wt_g/mol

  • pka1

  • pka2

  • dye_concentration_mg/L

  • solution_pH

  • HA_mg/L

  • anions

encoding (str, default=None) –

type of encoding to use for the three categorical features, i.e., catalyst, dye and anions, to convert them into numerical values. Available options are ohe, le and None.

Returns:

A tuple of length two. The first element is a DataFrame of shape (1527, len(parameters)) while the second element is a dictionary consisting of encoders with catalyst, dye and anions as keys.

Return type:

tuple
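
The parameters argument accepts either the string 'all' or a list of feature names that must be a subset of the features listed above. One way such an argument could be normalized and validated is sketched below. Both `resolve_parameters` and the shortened `AVAILABLE` list are hypothetical; this is not the package's actual implementation.

```python
# Shortened, illustrative list of feature names (the real list is longer).
AVAILABLE = ["catalyst", "dye", "anions", "solution_pH", "time_m"]


def resolve_parameters(parameters, available):
    """Return the list of requested feature names.

    'all' selects every available feature, a single string selects one
    feature, and a list must be a subset of the available features.
    """
    if parameters == "all":
        return list(available)
    if isinstance(parameters, str):
        parameters = [parameters]
    unknown = [name for name in parameters if name not in available]
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    return list(parameters)


print(resolve_parameters("all", AVAILABLE))
print(resolve_parameters(["dye", "anions"], AVAILABLE))
```

Raising an error for unknown names, rather than silently dropping them, makes typos in feature names visible early.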

Examples

>>> from aqua_fetch import dye_removal
>>> data, encoders = dye_removal()
>>> assert data.shape == (1527, 36)
>>> # using label encoding to encode the categorical variables
>>> data, encoders = dye_removal(encoding='le')
>>> assert data.shape == (1527, 36), data.shape
>>> catalysts = encoders['catalyst'].inverse_transform(data.loc[:, 'catalyst'].values)
>>> len(set(catalysts.tolist()))
18
>>> dye = encoders['dye'].inverse_transform(data.loc[:, "dye"].values)
>>> set(dye.tolist())
{'Melachite Green', 'Indigo'}
>>> anions = encoders['anions'].inverse_transform(data.loc[:,'anions'].values)
>>> set(anions.tolist())
{'NaCO3', 'N/A', 'Na2SO4', 'Na2HPO4', 'NaHCO3', 'NaCl'}
>>> # using one hot encoding for categorical parameters
>>> data, encoders = dye_removal(encoding='ohe')
>>> assert data.shape == (1527, 59), data.shape
>>> catalysts = encoders['catalyst'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('catalyst')]].values)
>>> len(set(catalysts.tolist()))
18
>>> dye = encoders['dye'].inverse_transform(data.loc[:, ["dye_0", "dye_1"]].values)
>>> set(dye.tolist())
{'Melachite Green', 'Indigo'}
>>> anions = encoders['anions'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('anions')]].values)
>>> set(anions.tolist())
{'NaCO3', 'N/A', 'Na2SO4', 'Na2HPO4', 'NaHCO3', 'NaCl'}
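
Selecting one-hot columns with a plain prefix test can pick up unrelated columns that share the prefix, for example dye_concentration_mg/L when filtering on 'dye', or catalyst_loading_mg when filtering on 'catalyst'. The sketch below assumes that encoded columns are named <feature>_<index>, as dye_0 and dye_1 are in the example above, and matches exactly that pattern. `ohe_columns` is a hypothetical helper, and the column list is made up for illustration.

```python
import re


def ohe_columns(columns, feature):
    """Return the columns named '<feature>_<integer>' and nothing else."""
    pattern = re.compile(rf"{re.escape(feature)}_\d+")
    return [col for col in columns if pattern.fullmatch(col)]


# illustrative column names, mixing encoded and ordinary columns
columns = [
    "catalyst_0", "catalyst_1", "catalyst_loading_mg",
    "dye_0", "dye_1", "dye_concentration_mg/L",
    "anions_0", "solution_pH",
]

print(ohe_columns(columns, "dye"))
print(ohe_columns(columns, "catalyst"))
```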
aqua_fetch.dichlorophenoxyacetic_acid_removal(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data for photodegradation of 2,4-dichlorophenoxyacetic acid using gold-doped bismuth ferrite.

Parameters:

parameters (list, optional) –

features to use as input. It must be a subset of the following features

  • catalyst

  • surface_area

  • pore_volume

  • energy_band_gap_eV

  • Au_%

  • Bi_%

  • Fe_%

  • O_%

  • catalyst_loading_g/l

  • light_intensity_watt

  • time_min

  • solution_ph

  • anions

  • ini_conc_mg/l

  • final_conc_mg/l

  • efficiency_%

encoding (str, default=None) –

type of encoding to use for the two categorical features, i.e., catalyst and anions, to convert them into numerical values. Available options are ohe, le and None.

Returns:

A tuple of length two. The first element is a DataFrame of shape (1044, len(parameters)) while the second element is a dictionary consisting of encoders with catalyst and anions as keys.

Return type:

tuple

Examples

>>> from aqua_fetch import dichlorophenoxyacetic_acid_removal
>>> # by default all parameters are returned
>>> data, encoders = dichlorophenoxyacetic_acid_removal()
>>> assert data.shape == (1044, 16), data.shape
>>> # using label encoding for categorical parameters
>>> data, encoders = dichlorophenoxyacetic_acid_removal(encoding='le')
>>> assert data.shape == (1044, 16), data.shape
>>> catalysts = encoders['catalyst'].inverse_transform(data.loc[:, 'catalyst'].values)
>>> assert len(set(catalysts.tolist())) == 7
>>> anions = encoders['anions'].inverse_transform(data.loc[:,'anions'].values)
>>> set(anions.tolist())
{'Na2SO4', 'Without Anions', 'Na2HPO4', 'NaHCO3', 'NaCO3', 'NaCl'}
>>> # using one hot encoding for categorical parameters
>>> data, encoders = dichlorophenoxyacetic_acid_removal(encoding='ohe')
>>> assert data.shape == (1044, 27), data.shape
>>> catalysts = encoders['catalyst'].inverse_transform(data.loc[:, ['catalyst_0', 'catalyst_1', 'catalyst_2',
...    'catalyst_3', 'catalyst_4', 'catalyst_5', 'catalyst_6']].values)
>>> assert len(set(catalysts.tolist())) == 7
>>> anions = encoders['anions'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('anions')]].values)
>>> set(anions.tolist())
{'Na2SO4', 'Without Anions', 'Na2HPO4', 'NaHCO3', 'NaCO3', 'NaCl'}
aqua_fetch.pms_removal(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data for photodegradation of phenol using peroxymonosulfate.

Parameters:
  • parameters (list, optional) –

    Names of the parameters to use. By default, the following parameters are used:

    • time_min

    • catalyst_type

    • magnetization_Ms_emu/g

    • energy_band_gap_eV

    • calcination_temp_C

    • min_calcination_time

    • surface_area

    • pore_size

    • pollutant

    • poll_mol_formula

    • pms_concentration_g/l

    • light_intensity_watt

    • light_type

    • catalyst_dosage_g/l

    • ini_conc_ppm

    • solution_ph

    • H2O2_Conc_ppm

    • volume_ml

    • stirring_speed_rpm

    • radical_scavenger

    • inorganic anions

    • water_type

    • cycle_num

    • final_conc_ppm

    • removal_efficiency_%

  • encoding (str, default=None) – type of encoding to use for the categorical features, i.e., catalyst_type, pollutant, poll_mol_formula and water_type, to convert them into numerical values. Available options are ohe, le and None.

Returns:

A tuple of length two. The first element is a DataFrame of shape (2078, len(parameters)) while the second element is a dictionary consisting of encoders with catalyst_type, pollutant, poll_mol_formula and water_type as keys.

Return type:

tuple

Examples

>>> from aqua_fetch import pms_removal
>>> data, encoders = pms_removal()
>>> data.shape
(2078, 25)
>>> # the default encoding is None, but if we want to use one hot encoder
>>> data_ohe, encoders = pms_removal(encoding="ohe")
>>> data_ohe.shape
(2078, 100)
>>> catalysts = encoders['catalyst_type'].inverse_transform(data_ohe.loc[:, [col for col in data_ohe.columns if col.startswith('catalyst_type')]].values)
>>> len(set(catalysts))
42
>>> pollutants = encoders['pollutant'].inverse_transform(data_ohe.loc[:, [col for col in data_ohe.columns if col.startswith('pollutant')]].values)
>>> len(set(pollutants))
14
>>> poll_mol_formula = encoders['poll_mol_formula'].inverse_transform(data_ohe.loc[:, [col for col in data_ohe.columns if col.startswith('poll_mol_formula')]].values)
>>> len(set(poll_mol_formula))
14
>>> water_type = encoders['water_type'].inverse_transform(data_ohe.loc[:, [col for col in data_ohe.columns if col.startswith('water_type')]].values)
>>> len(set(water_type))
9
>>> # if we want to use label encoder
>>> data_le, encoders = pms_removal(encoding="le")
>>> data_le.shape
(2078, 25)
>>> catalysts = encoders['catalyst_type'].inverse_transform(data_le.loc[:, 'catalyst_type'].values)
>>> len(set(catalysts))
42
>>> pollutants = encoders['pollutant'].inverse_transform(data_le.loc[:, 'pollutant'].values)
>>> len(set(pollutants))
14
>>> poll_mol_formula = encoders['poll_mol_formula'].inverse_transform(data_le.loc[:, 'poll_mol_formula'].values)
>>> len(set(poll_mol_formula))
14
>>> water_type = encoders['water_type'].inverse_transform(data_le.loc[:, 'water_type'].values)
>>> len(set(water_type))
9
aqua_fetch.tetracycline_degradation(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, dict]

Data for photodegradation of tetracycline. For details on the data, see Abdi et al., 2022.

Parameters:
  • parameters (list, optional) –

    Names of the parameters to use. By default, following parameters are used

    • surf_area_m2g

    • pore_vol_cm3g

    • catalyst_dosage_gL

    • antibiotic_dosage_mgL

    • illumination_time_min

    • pH

    • metallic_org_framework

    • efficiency_%

  • encoding (str, default=None) – type of encoding to use for the categorical features. It can be either ‘ohe’, ‘le’ or None. If ‘ohe’ is selected, the original categorical column (metallic_org_framework) is replaced with one-hot encoded columns. If ‘le’ is selected, the original column is replaced with a label encoded column. If None is selected, the original column is not replaced.

Returns:

A tuple of length two. The first element is a DataFrame of shape (374, len(parameters)) while the second element is a dictionary consisting of encoders with metallic_org_framework as key.

Return type:

tuple

Examples

>>> from aqua_fetch import tetracycline_degradation
>>> data, encoders = tetracycline_degradation()
>>> data.shape
(374, 8)
>>> data, encoders = tetracycline_degradation(encoding='le')
>>> data.shape
(374, 8)
>>> mofs = encoders['metallic_org_framework'].inverse_transform(data.loc[:, 'metallic_org_framework'].values)
>>> len(set(mofs))
10
>>> data, encoders = tetracycline_degradation(encoding='ohe')
>>> data.shape
(374, 17)
>>> mofs = encoders['metallic_org_framework'].inverse_transform(data.loc[:, [col for col in data.columns if col.startswith('metallic_org_framework')]].values)
>>> len(set(mofs))
10
aqua_fetch.tio2_degradation(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, dict]

Data for photodegradation using TiO2. For details on the data, see Jiang et al., 2020.

Parameters:
  • parameters (list, optional) –

    Names of the parameters to use. By default, the following parameters are used:

    • OC

    • i_mWpercm2

    • temp_C

    • D_gl

    • C0_mgl

    • pH

    • neglog_k_permin

  • encoding (str, default=None) – type of encoding to use for the categorical features.

Returns:

A tuple of length two. The first element is a DataFrame of shape (446, len(parameters)) while the second element is an empty dictionary.

Return type:

tuple
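
Judging from its name, the column neglog_k_permin is the negative base-10 logarithm of a rate constant expressed in min⁻¹. That reading is an assumption, as is the pseudo-first-order kinetic model used below; the source confirms neither. Under those assumptions, a rate constant can be estimated from two concentration measurements and converted to and from the negative-log form:

```python
import math


def first_order_k(c0, ct, t_min):
    """Pseudo-first-order rate constant (1/min) from ln(C0 / Ct) = k * t."""
    if c0 <= 0 or ct <= 0 or t_min <= 0:
        raise ValueError("c0, ct and t_min must all be positive")
    return math.log(c0 / ct) / t_min


def neglog_k(k):
    """Negative base-10 logarithm of a rate constant."""
    return -math.log10(k)


def k_from_neglog(value):
    """Inverse of neglog_k: recover the rate constant."""
    return 10.0 ** (-value)


# hypothetical experiment: concentration halves in 30 minutes
k = first_order_k(c0=10.0, ct=5.0, t_min=30.0)
print(f"k = {k:.4f} 1/min, -log10(k) = {neglog_k(k):.3f}")
```

Working with the negative logarithm compresses rate constants that span several orders of magnitude into a narrow numeric range, which is a common reason for using it as a modeling target.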

Examples

>>> from aqua_fetch import tio2_degradation
>>> data, encoders = tio2_degradation()
>>> data.shape
(446, 7)
aqua_fetch.photodegradation_Jiang(parameters: str | List[str] = 'all', encoding: str = None) → Tuple[DataFrame, Dict[str, OneHotEncoder | LabelEncoder | Any]]

Data for photodegradation of multiple pollutants using various photocatalysts. For details on the data, see Jiang et al., 2021.

Parameters:
  • parameters (list, optional) –

    Names of the parameters to use. By default, the following parameters are used:

    • photocatalyst

    • contaminants

    • photocat_dosage_gl

    • photocat_size_nm

    • initial_conc_mgl

    • pH

    • light_type

    • k_min-1

  • encoding (str, default=None) – type of encoding to use for the categorical features. It can be either ohe, le or None. If ohe is selected, the original categorical column is replaced with one-hot encoded columns. If le is selected, the original column is replaced with a label encoded column. If None is selected, the original column is not replaced.

Returns:

A tuple of length two. The first element is a DataFrame of shape (446, len(parameters)) while the second element is a dictionary consisting of encoders with photocatalyst and contaminants as keys.

Return type:

tuple

Examples

>>> from aqua_fetch import photodegradation_Jiang
>>> data, encoders = photodegradation_Jiang()
>>> data.shape
(449, 8)
>>> # the default encoding is None, but if we want to use one hot encoder
>>> data_ohe, encoders = photodegradation_Jiang(encoding="ohe")
>>> data_ohe.shape
(449, 16)
>>> photocatalysts = encoders['photocatalyst'].inverse_transform(data_ohe.loc[:, [col for col in data_ohe.columns if col.startswith('photocatalyst')]].values)
>>> len(set(photocatalysts))
100
>>> contaminants = encoders['contaminants'].inverse_transform(data_ohe.loc[:, [col for col in data_ohe.columns if col.startswith('contaminants')]].values)
>>> len(set(contaminants))
47
>>> # if we want to use label encoder
>>> data_le, encoders = photodegradation_Jiang(encoding="le")
>>> data_le.shape
(449, 8)
>>> photocatalysts = encoders['photocatalyst'].inverse_transform(data_le.loc[:, 'photocatalyst'].values)
>>> len(set(photocatalysts))
100
>>> contaminants = encoders['contaminants'].inverse_transform(data_le.loc[:, 'contaminants'].values)
>>> len(set(contaminants))
47

Membrane

aqua_fetch.micropollutant_removal_osmosis()

Jeong et al., 2021

aqua_fetch.ion_transport_via_reverse_osmosis()

Jeong et al., 2023

Sonolysis

aqua_fetch.cyanobacteria_disinfection()

Jaffari et al., 2024