FastData Help

Context-sensitive help entries for FastData application

Introduction

Using This Help

Start here to understand how the help content is organized.

This help guide is organized around the workflows and concepts you use in FastData.

Use the navigation on the left to jump directly to the part of the application you are working with.

Basics

Data Concepts

Files and Sheets (Imports)

Files/sheets are stored as imports; for users these are effectively the same thing.

The first concept is the import. In practice, users select files and sheets, and each selected file/sheet becomes an import.

Technically, an import is represented as a database table. From the user perspective, it is fine to think of file/sheet = import.

Typical import sources include Excel sheets and CSV files.

Imports are the atomic source units that later get grouped into datasets.

Datasets

A dataset is a collection of one or more imports used together for analysis.

A dataset groups related imports into one analysis scope.

One dataset can contain many imports (for example, multiple files/sheets from the same process).

The active dataset drives most work in the app: filtering, preprocessing, visualization, and modeling.

Use datasets to separate contexts like production lines, time periods, or raw vs cleaned data.

Systems

A system contains multiple datasets and its own feature set.

A system is a higher-level container above datasets.

Each system can hold multiple datasets, and each system owns its own feature set.

This means feature definitions are managed in system context, even when data is split across multiple datasets in that same system.

Use systems to separate fundamentally different processes, plants, machines, or product families.

Features

Features are columns used for analysis and modeling inside a system.

Features are the measurable variables (columns) used as model inputs or analysis dimensions.

Because systems own feature sets, feature metadata is managed consistently across datasets within the same system.

For each feature, users should understand at least its source, unit, and type.

Clear feature metadata improves filtering, preprocessing, model quality, and interpretation.

Tags

Tags label features for organization, filtering, and workflow clarity.

Tags are lightweight labels you assign to features (and optionally other entities) to keep large models understandable.

Typical tag examples: target, input, quality, temperature, critical, lab, calculated.

Use tags to group related features, speed up filtering, and mark feature roles such as target or input.

Tags are metadata only. They do not change raw measured values.

Data Flow

Canonical structure: files/sheets -> datasets -> systems -> features -> tags.

The conceptual model is:

  1. Files/Sheets are ingested as imports (database tables)
  2. Datasets group one or many imports
  3. Systems group multiple datasets and own feature definitions
  4. Features define the usable variables with source/unit/type metadata
  5. Tags label features for organization and filtering

Operationally, you import files first, assign/group them into datasets, work inside a system context, then manage features and tags for efficient analysis.
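The hierarchy above can be sketched in a few lines of Python. The class and attribute names here are illustrative only, not FastData's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Import:          # one ingested file/sheet, backed by a database table
    name: str

@dataclass
class Dataset:         # groups one or more imports into an analysis scope
    name: str
    imports: list = field(default_factory=list)

@dataclass
class System:          # owns datasets and the shared feature definitions
    name: str
    datasets: list = field(default_factory=list)
    features: dict = field(default_factory=dict)  # feature name -> metadata

# Build one system with one dataset containing one import
line_a = System("Line A")
raw = Dataset("raw_2024")
raw.imports.append(Import("sensor_log.xlsx"))
line_a.datasets.append(raw)
line_a.features["Temperature"] = {"unit": "°C", "tags": ["input"]}
```

Note how feature metadata lives on the system, so it applies across all datasets in that system.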

Target Variable

The variable you want to predict or model.

The Target Variable (also called the dependent variable or label) is what your model is trying to predict.

Examples: product price, process temperature, sales volume.

The target variable should be clearly defined before beginning model training.

Imported Feature

A feature column with source, unit, type, and optional tags.

This is a feature column available in the current system/dataset context.

Check and maintain the feature's metadata: its source, unit, type, and any assigned tags.

To use this feature effectively, verify that this metadata is correct and assign tags that describe its role.

You can view detailed statistics and visualizations for this feature in the data exploration tab.

Modeling

Model Selection

Choose the best algorithm for your prediction task.

Model Selection is the process of choosing the most appropriate machine learning algorithm for your specific problem.

Key considerations include the type of target variable, dataset size, interpretability requirements, and training time.

FastData lets you run and compare multiple models so you can choose the best fit based on your metrics and constraints.

Features

Data

Import data, manage databases, and define the active selection scope.

The Data tab is the main entry point for preparing data in FastData.

Selections configured here are reused by downstream tabs through the same data-selection model.

Filters

Data Filters

Apply conditions to subset your dataset.

Filters allow you to select specific rows from your dataset based on conditions.

Use filters to focus on relevant time periods, exclude outliers, or analyze specific operating conditions.

The selected filter set is applied to the dataframe used by downstream analysis and visualizations.

Systems and datasets

Scope results by system and dataset in one place.

Use these selectors together to narrow the data to specific systems and the datasets they contain.

Date range

Limit data between a start and end timestamp.

Pick a starting and ending date/time to bound the records that are loaded, previewed, and charted.

Months and groups

Filter by calendar months and predefined groups.

Select calendar months to focus on seasonal patterns, and choose database groups to restrict which entities are processed.

Imports and tags

Filter by selected imports and feature tags together.

Use Imports to restrict data to specific ingestion events in the currently selected dataset.

Use Tags to filter which features are available/selected for analysis.

Import Options

System name

Label the imported data with a system name.

Choose the system that owns the data you are importing.

Dataset name

Attach the data to a specific dataset.

Pick a dataset that belongs to the selected system.

Header row amount

Number of header rows in Excel files.

Define how many rows at the top of the sheet are header rows.

Header delimiter

Split multi-part headers by a delimiter.

Split header text into multiple parts using a delimiter (for example "_" or " | ").
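As a minimal illustration (the combined header string below is hypothetical), a delimiter of "_" splits one header into base name, source, and unit parts:

```python
# Split a combined header like "Temperature_Reactor1_°C" into its parts
header = "Temperature_Reactor1_°C"
delimiter = "_"
base_name, source, unit = header.split(delimiter)
# base_name = "Temperature", source = "Reactor1", unit = "°C"
```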

Base name row

Choose which header row contains the base feature name.

Select the header row that contains the base feature name (e.g., Temperature).

Source row

Select the header row for source/series labels.

Use this when a header row identifies the source or series for a feature.

Unit row

Select the header row containing units.

Point to the header row that contains measurement units (e.g., °C, kW).

Type row

Select the header row for qualifiers or annotations.

Qualifiers capture extra descriptors like min/max, status, or quality.

Force meta columns

Treat specific columns as metadata (not measurements).

Comma-separated list of column names to force into metadata.

Ignore column prefixes

Skip columns that start with listed prefixes.

Comma-separated prefixes to ignore when importing columns.

Date column

Choose which column contains timestamps.

Select the column that holds datetime values.

Assume day-first dates

Interpret 03/04 as 3 April instead of 4 March.

Enable this if your date strings use day/month order.
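The effect can be seen with pandas, which FastData plausibly uses for parsing (an assumption; the internal parser may differ):

```python
import pandas as pd

# "03/04/2024" is ambiguous: dayfirst decides whether it is 3 April or 4 March
ts_default = pd.to_datetime("03/04/2024")                  # month-first: 4 March
ts_dayfirst = pd.to_datetime("03/04/2024", dayfirst=True)  # day-first: 3 April
```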

Dot time formatting

Parse 9.00 as 09:00.

Convert dot-separated times to colon-separated times while parsing.
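A sketch of the conversion described above; the exact rule FastData applies internally may differ, and this regex assumes dates have already been separated from times:

```python
import re

def dot_time_to_colon(value: str) -> str:
    # Rewrite dot-separated clock times ("9.00", "14.30") to zero-padded
    # colon form ("09:00", "14:30") before datetime parsing
    return re.sub(
        r"\b(\d{1,2})\.(\d{2})\b",
        lambda m: f"{int(m.group(1)):02d}:{m.group(2)}",
        value,
    )

dot_time_to_colon("9.00")   # -> "09:00"
```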

Datetime formats

Provide explicit datetime formats (optional).

Comma-separated list of strftime-compatible formats.

CSV delimiter

Character used to separate CSV columns.

Set the delimiter between columns (for example , or ;).

CSV decimal

Decimal separator for numeric values.

Specify the decimal character for numeric values.

CSV encoding

Text encoding for CSV files.

Provide an encoding name (for example utf-8 or latin-1).
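The three CSV options above map directly onto pandas read_csv arguments (shown here as a sketch; FastData's reader may expose them differently). For files on disk you would additionally pass encoding="latin-1" or similar:

```python
import io
import pandas as pd

# European-style CSV: ';' between columns, ',' as the decimal mark
raw = "name;value\nsensor_a;1,5\nsensor_b;2,25\n"
df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
# df["value"] is parsed as float: [1.5, 2.25]
```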

Use DuckDB CSV import

Load large CSVs with DuckDB's fast parser.

Enable DuckDB's CSV import pipeline for large files.

Preprocessing

Preprocessing

Prepare selected data with the built-in preprocessing controls.

Preprocessing in FastData is configured from the shared Data selection area used across tabs.

The app currently provides these preprocessing options: Timestep, Moving average, Fill empty, and Aggregation.

These controls shape the dataframe used by Statistics, Charts, SOM, Regression, and other analysis flows.

Timestep

Resample data to a fixed interval or leave it on auto.

Choose the interval used to resample incoming measurements.

Resampling can make downstream statistics and charts easier to compare.

Moving average

Smooth measurements with an optional rolling window.

Applies a rolling mean over the selected window to reduce noise.

Use smoothing when charts or models are sensitive to rapid fluctuations.

Fill empty

Choose how to handle missing timestamps after resampling.

Determines how gaps created during resampling are filled.

Forward or backward filling is useful when signals change slowly and occasional gaps appear.

Aggregation

Summarize multiple points that fall within a timestep.

When multiple measurements land in the same resampled bucket, aggregation decides how they are combined.

Pick the method that best matches how you would summarize overlapping readings.
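The four preprocessing controls (Timestep, Aggregation, Fill empty, Moving average) correspond to a pipeline like the following pandas sketch; FastData's internal implementation may differ:

```python
import pandas as pd

ts = pd.DataFrame(
    {"value": [1.0, 3.0, 10.0]},
    index=pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 00:20", "2024-01-01 02:10"]
    ),
)

resampled = ts.resample("1h").mean()  # Timestep + Aggregation: mean per hour
filled = resampled.ffill()            # Fill empty: forward-fill the 01:00 gap
smoothed = filled["value"].rolling(window=2, min_periods=1).mean()  # Moving average
```

Here the two readings in hour 00 are averaged to 2.0, the empty 01:00 bucket is forward-filled, and the rolling mean then smooths the series.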

Selections

Create reusable feature selections and filter presets.

The Selections tab lets you curate which features to keep for downstream tasks.

Adjusting selections here keeps your preprocessing and modeling steps consistent.

Statistics

Review derived statistics and prepared measurements.

The Statistics tab focuses on validating the numbers produced from your data.

Preprocessing itself is configured in the shared Data selection widget; this tab focuses on the statistical outputs.

Statistics

Statistics actions

Run the computation and store results when you are happy with them.

Use these buttons to generate and persist the statistics previewed in this tab.

Statistics to compute

Pick the summary measures calculated for each time bucket or group.

Select one or more statistics to include in the output. Each choice adds a column to the preview and saved results.

Pick only what you need to keep previews fast; you can always rerun with more metrics.
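Conceptually, each selected statistic becomes one column per time bucket, as in this pandas sketch (illustrative, not FastData's actual computation):

```python
import pandas as pd

df = pd.DataFrame(
    {"temp": [20.0, 22.0, 21.0, 25.0]},
    index=pd.to_datetime(
        ["2024-01-01 00:10", "2024-01-01 00:40",
         "2024-01-01 01:05", "2024-01-01 01:50"]
    ),
)

# Each chosen statistic adds a column to the hourly output
stats = df["temp"].resample("1h").agg(["mean", "min", "max"])
```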

Aggregation mode

Choose whether to aggregate over time or by a specific column.

Switching modes changes how the preview table and chart summarize your data.

Group column

Pick the categorical column that defines each group in column mode.

Available options come from the dataset and only apply when Group by column is selected.

Statistics period

Define the window used for time-based aggregation.

Time-based mode buckets records into regular windows before applying your chosen statistics.

Separate timeframes

Choose whether group-kind statistics are split by each saved timeframe segment.

This setting applies when using Group by column with a database group kind (for example group:som_cluster).

Charts

Visualize your dataset with quick plots.

The Charts tab provides fast visual feedback on your data.

Use charts to spot trends, outliers, and data quality issues early.

SOM

SOM

Explore data structure with Self-Organizing Maps.

The SOM tab visualizes high-dimensional data on a 2D grid.

Use this view for exploratory analysis and for identifying interesting cohorts.

Self-Organizing Maps (SOM)

Unsupervised neural network for data visualization and clustering.

Self-Organizing Maps (also called Kohonen maps) are a type of artificial neural network trained using unsupervised learning to produce a low-dimensional representation of the input space.

Key features: unsupervised training, dimensionality reduction onto a 2D grid, and preservation of topological relationships in the data.

SOMs are particularly useful for exploratory data analysis and pattern recognition.

FastData uses the MiniSom library for efficient SOM computation.

Clustering

Feature clustering model

Pick the algorithm used to group feature planes.

Choose how features are grouped based on their component planes.

All options support automatic K search when allowed by the method.

Max K & Clusters

Control the cluster search range or pin a fixed number.

These two fields work together: Max K caps the automatic search range, while Clusters either pins an exact cluster count or selects Auto to search up to Max K.

Scoring metric

Metric used to pick the best K during auto-search.

Scores compare candidate clusterings, and the K with the best score is selected.

Only used when Clusters is set to Auto.
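An auto-K search essentially scores each candidate K and keeps the best. The sketch below uses silhouette score with KMeans as one common combination; the metrics and algorithms FastData actually offers may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated blobs, so K=2 should win
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

best_k, best_score = None, -1.0
for k in range(2, 6):  # "Max K" bounds this search range
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
```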

Cluster features

Group SOM component planes into feature clusters.

Runs clustering on the trained SOM component planes.

Auto cluster features

Automatically run feature clustering after SOM training.

When enabled, feature clustering starts automatically each time a SOM model finishes training.

Neuron clustering model

Pick the algorithm used to group SOM neurons.

The same clustering algorithms are available for neurons as for features.

Choose the method that matches how you expect neurons to organize across the grid.

Max K & Clusters (neurons)

Set the range or fixed count for neuron groups.

Configure how many neuron groups to consider: bound the automatic search with Max K, or pin a fixed count with Clusters.

Scoring metric (neurons)

Metric used when auto-selecting neuron clusters.

Uses the same metrics as feature clustering.

Cluster neurons

Partition the SOM grid into neuron clusters.

Runs clustering on the neurons themselves, using their codebook vectors.

Auto cluster timeline

Automatically run neuron clustering after SOM training.

When enabled, neuron clustering starts automatically after each SOM training run.

Controls

Cluster Timeline

Shows BMU/cluster states over time, with optional selected-feature overlays.

This chart plots timeline layers selected in Display.

Timeline Display Layers

Select one or more layers: BMU, neuron clusters, and selected features.

Use the Display multi-select control to choose timeline layers: BMU, neuron clusters, and selected features.

Data Table

Row-level BMU assignments used by the timeline.

This table lists BMU assignments for each data row.

Cluster Map

Neuron grid colored by cluster assignment.

The cluster map colors each neuron by its cluster ID.

Save as timeframes

Store contiguous cluster runs as start/end ranges instead of single points.

When saving timeline clusters, this option controls how group assignments are stored.

Hyperparameters

Map width

Horizontal size of the SOM grid (auto if left blank).

Map width controls how many neurons the map has along the X-axis.

Map height

Vertical size of the SOM grid (auto if left blank).

Map height controls how many neurons the map has along the Y-axis.

Sigma

Initial neighborhood radius for SOM training.

Sigma sets how far the learning influence of a winning neuron spreads across the grid at the start of training.

Learning rate

Step size used when updating SOM weights.

Learning rate controls how quickly the map adapts to each sample.

Epochs

How many training passes to run (defaults to 100).

Epochs determines how many times the algorithm sweeps through your data.

Normalisation

How features are scaled before training.

Scaling features keeps each variable on a comparable range.

Training mode

Choose between batch and random SOM updates.

Pick the strategy used to present samples during training.

Regression

Regression

Build and evaluate regression models.

The Regression tab guides you through training models for continuous targets.

Use this tab when predicting numeric outcomes such as prices, temperatures, or measurements.

Regression Analysis

Build predictive models to estimate continuous values.

Regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables (features).

FastData supports multiple regression algorithms, including linear, ridge, lasso, elastic net, polynomial, random forest, extra trees, gradient boosting, AdaBoost, support vector, k-nearest neighbors, decision tree, and MLP regression.

Use regression when your target variable is continuous (e.g., price, temperature, sales volume).

Cross Validation

Cross-validation strategy

Pick how training/validation folds are built.

Cross-validation helps estimate how well a model generalises. Choose a strategy that matches your data.

Time-series splits build expanding/rolling windows in chronological order, ignore shuffling, and can use a time gap to reduce leakage.
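scikit-learn's TimeSeriesSplit illustrates the expanding-window and gap behavior described above (assuming FastData uses a comparable splitter):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)

# gap=2 skips two rows between each training window and its validation window
tscv = TimeSeriesSplit(n_splits=3, gap=2)
for train_idx, val_idx in tscv.split(X):
    # every validation index comes after the training window plus the gap
    assert train_idx.max() + 2 < val_idx.min()
```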

Folds

Set how many slices to use when performing K-Fold style validation.

The number of folds controls how many train/validation rotations are executed.

Shuffle folds

Randomly reshuffle rows before forming non-time-series folds.

Enable shuffling to randomise the order of records before K-Fold or Stratified K-Fold splits.

Time gap

Skip recent observations between training and validation windows.

When using the Time series split strategy, the gap adds a buffer between the end of the training window and the start of the validation window.

Stratify by

Balance folds by matching the distribution of a feature.

Select a categorical feature or the target to stratify K-Fold splits.

Group

Choose the group label used for Group K-Fold.

Select a group kind so all rows from the same group stay in the same fold.

Hyperparameters

Fit intercept

Include an intercept term in the linear regression model.

When enabled, the model learns a bias term in addition to feature weights.

Positive coefficients

Constrain coefficients to be non-negative.

Forces all learned weights to be positive, which can help with interpretability.

Alpha (ridge)

Regularization strength for ridge regression.

Higher values apply stronger L2 penalty and reduce coefficient magnitude.

Ridge solver

Numerical method used to fit ridge regression.

Select a solver to match dataset size and stability requirements.

Random state (ridge)

Seed for solver randomness where applicable.

Set a fixed value to make the ridge solution reproducible.

Only affects solvers that rely on stochastic optimization.

Alpha (lasso)

Regularization strength for lasso regression.

Higher values apply stronger L1 penalty and promote sparsity.

Too much regularization can drive useful coefficients to zero.

Max iterations (lasso)

Maximum optimizer steps before stopping.

Increase if you see convergence warnings or unstable results.

Larger datasets or high regularization may require more iterations.

Random state (lasso)

Seed for solver randomness.

Fix this value to make lasso results repeatable.

Applies when the optimizer uses random coordinate selection.

Alpha (elastic net)

Regularization strength for elastic net.

Controls the combined L1/L2 penalty magnitude.

Increase to shrink coefficients more aggressively.

L1 ratio

Balance between L1 and L2 penalties.

0.0 is pure ridge (L2), 1.0 is pure lasso (L1).

Intermediate values blend sparsity with coefficient shrinkage.

Max iterations (elastic net)

Maximum optimizer steps before stopping.

Increase if convergence is slow or unstable.

Higher values can improve accuracy on large feature sets.

Random state (elastic net)

Seed for solver randomness.

Fix this value to reproduce elastic net results.

Only relevant when the optimizer is stochastic.

Polynomial degree

Degree of polynomial features.

Higher degrees allow more complex curves but can overfit.

Degrees 2-3 are common starting points for nonlinear trends.
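A degree-2 model recovers a quadratic trend exactly, as this scikit-learn sketch shows (illustrative data, not app output):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free quadratic: y = 2x^2 + 1
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 * X.ravel() ** 2 + 1

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
r2 = model.score(X, y)  # ~1.0 on this noise-free curve
```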

Fit intercept (polynomial)

Include a bias term in the polynomial model.

Enable unless your features are centered around zero.

Disable if you already include a bias feature in preprocessing.

Number of trees (random forest)

How many trees to build in the forest.

More trees improve stability but increase runtime.

Accuracy gains diminish once the forest is large enough.

Max depth (random forest)

Limit the depth of each tree.

Use None to expand nodes until all leaves are pure or contain fewer than the minimum split samples.

Shallower trees reduce variance but may underfit.

Min samples split (random forest)

Minimum samples required to split a node.

Higher values reduce overfitting.

Set as a count or fraction of the training samples.

Min samples leaf (random forest)

Minimum samples required in a leaf node.

Higher values smooth the model and reduce variance.

Set as a count or fraction for larger datasets.

Random state (random forest)

Seed for the bootstrap and feature selection.

Fix this value for repeatable forests.

Helps compare runs when tuning other parameters.

Number of trees (extra trees)

How many extra trees to build.

More trees improve stability but increase runtime.

Extra Trees are more randomized, so more estimators help.

Max depth (extra trees)

Limit the depth of each tree.

Use None to expand nodes until all leaves are pure or contain fewer than the minimum split samples.

Lower depths reduce variance and improve generalization.

Min samples split (extra trees)

Minimum samples required to split a node.

Higher values reduce overfitting.

Set as a count or fraction of the training samples.

Min samples leaf (extra trees)

Minimum samples required in a leaf node.

Higher values smooth the model and reduce variance.

Larger leaves can help with noisy measurements.

Random state (extra trees)

Seed for randomized splits.

Fix this value for repeatable results.

Controls the randomness of feature and split selection.

Boosting stages

Number of boosting stages to perform.

More stages can improve accuracy but raise risk of overfitting.

Pair with a smaller learning rate for smoother training.

Learning rate (gradient boosting)

Shrinkage applied to each boosting step.

Smaller values require more estimators but can generalize better.

Larger values converge faster but risk overshooting.

Max depth (gradient boosting)

Depth of individual regression trees.

Shallow trees reduce variance but may underfit.

Depth 2-4 is common for stable boosting models.

Min samples split (gradient boosting)

Minimum samples required to split a node.

Higher values reduce model variance.

Set as a count or fraction of the training samples.

Min samples leaf (gradient boosting)

Minimum samples required in a leaf node.

Higher values improve generalization on noisy data.

Larger leaves produce smoother predictions.

Random state (gradient boosting)

Seed for the boosting process.

Fix this value for repeatable results.

Ensures the same subsampling order where applicable.

Number of estimators (AdaBoost)

Number of weak learners to combine.

More estimators can improve accuracy but increase runtime.

Too many stages can overfit noisy data.

Learning rate (AdaBoost)

Contribution of each weak learner.

Lower values require more estimators to reach the same performance.

Higher values can overemphasize errors and reduce stability.

Random state (AdaBoost)

Seed for boosting randomness.

Fix this value for repeatable results.

Helps compare parameter tweaks consistently.

Kernel (SVR)

Kernel function used by support vector regression.

Choose the kernel that best matches the data shape.

Regularization (C)

Penalty for errors in SVR.

Higher values fit the training data more closely.

Too high can overfit; too low can underfit.

Epsilon (SVR)

Margin of tolerance in the loss function.

Higher values ignore small errors and create a smoother fit.

Larger epsilon usually yields fewer support vectors.

Number of neighbors (KNN)

How many neighbors to average for predictions.

Lower values fit locally; higher values smooth predictions.

Odd numbers can reduce tie votes in classification-like data.

Weights (KNN)

Weighting strategy for neighbors.

Algorithm (KNN)

Search algorithm for nearest neighbors.

Max depth (decision tree)

Limit the depth of the tree.

Use None to expand nodes until all leaves are pure or contain fewer than the minimum split samples.

Shallower trees are easier to interpret but may underfit.

Min samples split (decision tree)

Minimum samples required to split a node.

Higher values reduce overfitting.

Set as a count or fraction of the training samples.

Min samples leaf (decision tree)

Minimum samples required in a leaf node.

Higher values smooth predictions and improve generalization.

Use larger leaves for noisy or sparse datasets.

Random state (decision tree)

Seed for randomized splits.

Fix this value for repeatable trees.

Useful when comparing depth or split settings.

Layers (MLP)

Sizes of the hidden layers for the neural network.

Enter comma-separated sizes, e.g. 64,32,16 for three layers.

More layers and neurons increase capacity but can overfit.
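Assuming the backend maps the text field onto scikit-learn's MLPRegressor (a plausible but unconfirmed implementation detail), "64,32,16" parses into a three-layer architecture like this:

```python
from sklearn.neural_network import MLPRegressor

# Parse the UI string into a tuple of hidden layer sizes
layers = tuple(int(s) for s in "64,32,16".split(","))
model = MLPRegressor(hidden_layer_sizes=layers, max_iter=200, random_state=0)
```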

Activation (MLP)

Activation function used in hidden layers.

relu is a strong default; tanh can help with bounded data.

Solver (MLP)

Optimizer used to train the network.

adam is robust for most datasets.

lbfgs can converge faster on smaller datasets.

Alpha (MLP)

L2 regularization strength for the network.

Higher values apply stronger weight decay.

Learning rate schedule (MLP)

How the learning rate evolves during training.

constant uses a fixed rate.

adaptive reduces the rate when progress stalls.

Max iterations (MLP)

Maximum training epochs.

Increase if the model stops before converging.

Random state (MLP)

Seed for weight initialization.

Fix this value for repeatable runs.

Variance threshold

Remove features with variance below this threshold.

Use higher values to drop low-variance features.

A threshold of 0 removes only constant features.
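With scikit-learn's VarianceThreshold (presumably what backs this option), a threshold of 0 drops exactly the constant columns:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Middle column is constant (variance 0) and gets removed
X = np.array([[1.0, 5.0, 0.1],
              [2.0, 5.0, 0.2],
              [3.0, 5.0, 0.3]])
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)  # shape (3, 2)
```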

Number of features (Select K Best)

Pick the top K features by score.

Choose All to keep every feature.

Lower values enforce more aggressive feature selection.

Number of features (Mutual Info)

Pick the top K features by mutual information.

Choose All to keep every feature.

Use smaller K to focus on the strongest nonlinear signals.

Random state (Mutual Info)

Seed for mutual information estimation.

Fix this value for repeatable scores.

Important when comparing ranking stability.

Importance threshold (Random Forest)

Minimum importance required to keep a feature.

Use median, mean, or a numeric value.

Features below the threshold are removed from the dataset.
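The median threshold keeps only features at or above the median importance, as in this scikit-learn sketch (SelectFromModel is one way to implement this; FastData's internals are assumed, not confirmed):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Only feature 0 actually drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold="median",  # keep features with importance >= the median
)
X_reduced = selector.fit_transform(X, y)
```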

Number of trees (RF importance)

How many trees to build for importance scores.

More trees give more stable importance estimates.

Smaller values run faster but increase variance.

Random state (RF importance)

Seed for the importance estimator.

Fix this value for repeatable importances.

Useful when comparing thresholds.

Importance threshold (Extra Trees)

Minimum importance required to keep a feature.

Use median, mean, or a numeric value.

Higher thresholds keep only the most influential features.

Number of trees (Extra Trees importance)

How many trees to build for importance scores.

More trees give more stable importance estimates.

Extra Trees are noisy, so more estimators help stability.

Random state (Extra Trees importance)

Seed for the importance estimator.

Fix this value for repeatable importances.

Helps compare threshold values consistently.

Importance threshold (Gradient Boosting)

Minimum importance required to keep a feature.

Use median, mean, or a numeric value.

Higher values prune the feature set more aggressively.

Number of estimators (GB importance)

How many estimators to build for importance scores.

More estimators give more stable importance estimates.

Pair with smaller learning rates for stability.

Random state (GB importance)

Seed for the importance estimator.

Fix this value for repeatable importances.

Use when comparing thresholds across runs.

Features to select (RFE)

Target number of features to keep.

Use None to keep half the features by default.

Smaller targets yield more aggressive reduction.

Features removed per step (RFE)

How many features to eliminate each iteration.

Smaller steps are more precise but take longer.

Use larger steps for faster but coarser selection.

Random state (RFE)

Seed for randomized estimator behavior.

Fix this value for repeatable selection.

Helps verify if rankings are stable across runs.

PCA components

How many principal components to keep.

Use None to keep all components.

Lower values reduce dimensionality more aggressively.

PCA solver

Algorithm used to compute principal components.

auto picks a reasonable solver based on data shape.

full uses a deterministic full SVD.

arpack computes a truncated decomposition.

randomized is faster for large datasets but stochastic.

Whiten components

Scale components to unit variance.

Whitening can help some models but may amplify noise.

Keep it off unless you have a specific reason to enable it.

Random state (PCA)

Seed for randomized PCA behavior.

Fix this value when using the randomized solver to reproduce results.

Components (PLSRegression)

Number of latent variables to keep.

Higher values capture more signal but increase model complexity.

Keep this below the effective rank of your input data.

Scale data (PLSRegression)

Standardize X and y inside the PLS step.

Enable in most cases unless inputs are already consistently scaled.

Max iterations (PLSRegression)

Maximum iterations for the NIPALS solver.

Increase if convergence warnings occur.

Tolerance (PLSRegression)

Convergence tolerance for iterative updates.

Lower tolerance can improve precision but may require more iterations.

Components (FastICA)

Number of independent components to estimate.

Use None to infer based on feature count.

Algorithm (FastICA)

Parallel or deflation update strategy.

parallel estimates components together; deflation extracts one by one.

Whiten mode (FastICA)

Whitening behavior before ICA optimization.

unit-variance is a robust default for regression pipelines.

Contrast function (FastICA)

Nonlinearity used to estimate non-Gaussian components.

logcosh is usually stable; try others when decomposition quality is poor.

Max iterations (FastICA)

Maximum iterations before stopping.

Increase if ICA fails to converge.

Tolerance (FastICA)

Stopping tolerance for ICA updates.

Lower values enforce stricter convergence.

Random state (FastICA)

Seed for ICA initialization.

Fix this value for reproducible components.

Components (FactorAnalysis)

Number of latent factors.

Use None to infer from input dimensionality.

SVD method (FactorAnalysis)

Backend used during factor estimation.

randomized is faster on large data; lapack is deterministic.

Iterated power (FactorAnalysis)

Power iterations used in randomized SVD.

Higher values can improve approximation quality.

Rotation (FactorAnalysis)

Optional factor rotation for interpretability.

Use None for unrotated factors, or varimax/quartimax for rotated solutions.

Tolerance (FactorAnalysis)

Convergence tolerance.

Lower values require stricter convergence.

Random state (FactorAnalysis)

Seed for randomized solver components.

Set for reproducible randomized runs.

Components (TruncatedSVD)

Number of singular vectors to keep.

Higher values preserve more information but reduce compression.

Algorithm (TruncatedSVD)

Solver for truncated decomposition.

randomized is efficient for large matrices; arpack can be more precise.

Power iterations (TruncatedSVD)

Additional iterations for randomized solver accuracy.

Increase when singular value gaps are small.

Tolerance (TruncatedSVD)

Convergence tolerance for ARPACK solver.

Mostly relevant when using arpack.

Random state (TruncatedSVD)

Seed for randomized solver.

Fix to make randomized decompositions reproducible.

Inputs

Target feature

Choose the numeric column you want the models to predict.

Select exactly one target feature. The algorithms will try to predict this value from the input features.

Feature selection

Optionally enable automatic selectors before training.

Feature selectors reduce the input columns to the most informative subset.

Dimensionality reduction

Optionally project features into a lower-dimensional representation before modeling.

Dimensionality-reduction methods transform your selected input features before the regression model is trained.

PCA

Linear projection that keeps directions with the highest variance.

PCA transforms correlated features into orthogonal components that preserve as much variance as possible.

PLSRegression

Supervised projection that uses both inputs and target to build latent variables.

PLSRegression finds latent components that maximize covariance between features and the target.

FastICA

Independent component analysis for separating non-Gaussian latent sources.

FastICA estimates statistically independent components instead of variance-maximizing ones.

FactorAnalysis

Latent-factor model that explains observed variables through shared factors plus noise.

FactorAnalysis models each feature as a combination of latent factors and feature-specific noise.

TruncatedSVD

Low-rank SVD projection that works well with sparse or large matrices.

TruncatedSVD projects features onto a smaller number of singular vectors without centering.

Regression models

Pick the algorithms to evaluate for this experiment.

Select one or more models to train on the same dataset and compare their results.

Test Split

Hold-out test set

Reserve part of the data for final evaluation.

Enable the test set to keep a portion of data untouched during training and cross-validation.

Test size

Choose the fraction or fixed number of records set aside for testing.

The test size sets how much of the dataset is held out.

Test split strategy

Control how rows are separated into train and test sets.

Pick a strategy that matches the nature of your data.

Strategies that rely on stratification will use the selected feature and binning settings below.

Test stratification

Balance the test split using a target or helper feature.

Select a feature to stratify the test split when using the Stratified strategy.

Stratify bins

Set how many buckets to create when stratifying continuous values.

When the stratify field contains continuous numbers, the values are bucketed into bins to enable stratified sampling.
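
A sketch of what this bucketing looks like, using quantile bins and scikit-learn's train_test_split (illustrative; FastData's internal binning may differ): the continuous stratify feature is discretized into bin labels, and each bin is represented proportionally in the test set.

```python
# Illustrative sketch: binning a continuous feature for stratified splitting.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
y = rng.normal(loc=50, scale=10, size=1000)  # continuous stratify feature
X = rng.normal(size=(1000, 4))

n_bins = 5
# Interior quantile edges give roughly equal-sized bins.
bins = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
labels = np.digitize(y, bins)  # bin index per row, 0..n_bins-1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=labels, random_state=0
)
```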

Forecasting

Forecasting

Create time-series forecasts from historical data (this feature is disabled in default builds).

The Forecasting tab focuses on time-aware modeling.

This feature exists in the codebase but is currently disabled by default in both the development and release tab flags.

When enabled, use it for scenarios like demand planning, capacity prediction, or financial projections.

Time Series Forecasting

Predict future values based on historical time series data.

Forecasting analyzes time-ordered data to predict future values.

Key concepts in time series forecasting include the forecast horizon, the window strategy, and seasonality; each is configured in the controls below.

When the tab is enabled, FastData runs forecasting experiments that combine these models with time-series feature engineering and split strategies.

The previous sktime-based implementation is kept as an inactive reference in backend/services/legacy_forecasting/forecasting_service_sktime.py.

Common use cases: sales forecasting, demand prediction, financial projections.

Controls

Forecast horizon

How many future time steps to predict in each run.

The horizon sets how far ahead each model predicts.

Models will generate this many points for every feature you selected.

Window strategy

Pick how training and validation windows move through the series.

Choose the cross-validation style used while fitting time-series models.

Sliding windows are good when concept drift is likely; expanding windows favor stability.
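
The difference between the two strategies can be sketched with plain index arithmetic (illustrative only; FastData's actual splitter may differ in detail): a sliding window keeps a fixed training length and moves forward, while an expanding window always starts at the beginning and grows.

```python
# Illustrative sketch of sliding vs expanding window splits.
def windows(n, initial, horizon, strategy):
    """Yield (train_indices, test_indices) pairs over a series of length n."""
    splits = []
    start = 0
    end = initial
    while end + horizon <= n:
        splits.append((list(range(start, end)), list(range(end, end + horizon))))
        end += horizon
        if strategy == "sliding":
            start += horizon  # window slides forward, training size stays fixed
        # "expanding": start stays at 0, so the training window grows
    return splits

sliding = windows(12, initial=4, horizon=2, strategy="sliding")
expanding = windows(12, initial=4, horizon=2, strategy="expanding")
```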

Initial window size

Length of the first training window (computed automatically when set to Auto).

Controls how many observations are used before the first forecast window.

This applies to sliding and expanding strategies; the single split uses the full training span.

Hyperparameters

Strategy (naive)

How the naive forecaster projects future values.

Select the baseline rule used to forecast.

Seasonal period (naive)

Length of the seasonality cycle used by the naive model.

Defines how many time steps make up one season.
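
As an illustration of the seasonal variant, a seasonal naive forecast simply repeats the value from one season earlier (a sketch; the built-in naive forecaster may offer additional strategies):

```python
# Illustrative sketch of a seasonal naive forecast.
def seasonal_naive(history, horizon, sp):
    """Forecast `horizon` steps by copying values from `sp` steps earlier."""
    forecast = []
    series = list(history)
    for _ in range(horizon):
        forecast.append(series[-sp])
        series.append(series[-sp])
    return forecast

# Quarterly-style pattern with season length 4: the last season repeats.
y = [10, 20, 30, 40, 11, 21, 31, 41]
preds = seasonal_naive(y, horizon=4, sp=4)
```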

Seasonal period (theta)

Season length used by the theta forecaster.

Helps the model separate seasonality from trend when present.

Deseasonalize (theta)

Remove seasonality before fitting the theta model.

Enable when your series has strong seasonal patterns.

Trend (exponential smoothing)

Type of trend component to include.

Controls how the model extrapolates the long-term movement.

Seasonal (exponential smoothing)

Seasonality mode for the exponential smoothing model.

Select how seasonal patterns are modeled.

Seasonal period (exponential smoothing)

Number of steps in one seasonal cycle.

Use the length of your repeating pattern to get better seasonal fits.

Polynomial degree

Complexity of the polynomial trend.

Higher degrees allow more curvature in the trend.
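
A sketch of why degree matters, using numpy.polyfit for illustration: a degree-1 fit cannot follow a quadratic trend, while a degree-2 fit recovers it exactly.

```python
# Illustrative sketch: polynomial degree controls trend curvature.
import numpy as np

t = np.arange(30, dtype=float)
y = 0.05 * t**2 - t + 3.0  # quadratic trend, no noise

linear = np.polyfit(t, y, deg=1)
quadratic = np.polyfit(t, y, deg=2)

resid_linear = np.abs(np.polyval(linear, t) - y).max()
resid_quadratic = np.abs(np.polyval(quadratic, t) - y).max()
# The degree-2 fit recovers the quadratic; degree 1 leaves large residuals.
```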

Use Box-Cox (BATS)

Apply a Box-Cox transform to stabilize variance.

Helps when variability grows with the level of the series.
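
A minimal sketch of the transform itself, using scipy.stats.boxcox for illustration (the BATS implementation fits the transform internally):

```python
# Illustrative sketch: Box-Cox on a series whose noise grows with its level.
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(4)
level = np.linspace(10, 100, 200)
# Noise proportional to the level: variance grows as the series grows.
y = level * (1 + 0.1 * rng.normal(size=200))

transformed, lam = boxcox(y)  # lam is the fitted Box-Cox lambda
```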

Use trend (BATS)

Include a trend component in the BATS model.

Enable to capture long-term upward or downward movement.

Use damped trend (BATS)

Dampen the trend so it flattens over time.

Useful when trends should level off rather than grow indefinitely.

Use Box-Cox (TBATS)

Apply a Box-Cox transform to stabilize variance.

Improves modeling when variance changes with the level.

Use trend (TBATS)

Include a trend component in the TBATS model.

Enable to track long-term movement in the series.

Use damped trend (TBATS)

Dampen the trend component over time.

Prevents overly aggressive trend extrapolation.

Smoothing (Croston)

Smoothing factor for intermittent demand forecasting.

Controls how quickly the model reacts to new observations.
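
Croston's method can be sketched as two exponential smoothers, one for nonzero demand sizes and one for the intervals between them; the forecast is their ratio. This is an illustrative implementation, and the built-in model may differ in detail.

```python
# Illustrative sketch of Croston's method for intermittent demand.
def croston(y, alpha):
    z = None  # smoothed demand size
    p = None  # smoothed interval between demands
    q = 1     # steps since the last nonzero demand
    for value in y:
        if value > 0:
            z = value if z is None else alpha * value + (1 - alpha) * z
            p = q if p is None else alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
    return z / p  # flat forecast for all future steps

demand = [0, 0, 5, 0, 0, 0, 4, 0, 0, 6]
forecast = croston(demand, alpha=0.2)
```

A higher alpha weights recent demands and intervals more heavily, so the forecast reacts faster to change.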

Alpha (time series ridge)

Regularization strength for ridge regression in time series.

Higher alpha shrinks coefficients and reduces overfitting.

Window length (time series ridge)

Number of past steps used as features.

Longer windows capture more history but increase model size.

Strategy (time series ridge)

How multi-step forecasts are generated.

Choose between recursive, direct, or multioutput strategies.
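
A sketch of how window length and the recursive strategy fit together, using scikit-learn's Ridge for illustration (FastData's implementation may differ): the window turns the series into lag features, and the recursive strategy feeds each one-step prediction back in until the horizon is reached.

```python
# Illustrative sketch: lag features plus recursive multi-step forecasting.
import numpy as np
from sklearn.linear_model import Ridge

def make_lag_matrix(y, window):
    """Rows of `window` consecutive values; target is the next value."""
    X = np.column_stack([y[i:len(y) - window + i] for i in range(window)])
    return X, y[window:]

rng = np.random.default_rng(5)
y = np.sin(np.linspace(0, 20, 200)) + 0.05 * rng.normal(size=200)

window = 10
X, target = make_lag_matrix(y, window)
model = Ridge(alpha=1.0).fit(X, target)

# Recursive strategy: predict one step, append it, repeat to the horizon.
history = list(y)
preds = []
for _ in range(5):
    x_next = np.array(history[-window:]).reshape(1, -1)
    p = model.predict(x_next)[0]
    preds.append(p)
    history.append(p)
```

The direct strategy instead trains a separate model per horizon step, and multioutput predicts all steps at once.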

Alpha (time series lasso)

Regularization strength for lasso regression.

Higher alpha encourages sparse coefficients.

Window length (time series lasso)

Number of lagged steps used as input features.

Increase to capture longer temporal dependencies.

Strategy (time series lasso)

Approach for multi-step forecasting.

Recursive, direct, or multioutput behavior for generating horizons.

Estimators (time series random forest)

Number of trees in the forest.

More trees improve stability but increase runtime.

Max depth (time series random forest)

Maximum depth of each tree.

Use None to allow trees to expand fully.

Window length (time series random forest)

Number of historical steps turned into features.

Larger windows capture more history but can add noise.

Strategy (time series random forest)

How the model produces multi-step forecasts.

Select recursive, direct, or multioutput forecasting.

Random state (time series random forest)

Seed for the random forest training process.

Set a fixed seed to make runs reproducible.

Estimators (time series gradient boosting)

Number of boosting stages.

More stages can improve accuracy but increase risk of overfitting.

Max depth (time series gradient boosting)

Depth of individual regression trees.

Shallower trees generalize better on noisy data.

Learning rate (time series gradient boosting)

Shrinkage applied to each boosting step.

Smaller values require more estimators but often generalize better.

Window length (time series gradient boosting)

Number of lagged steps used as input features.

Increase to capture longer temporal patterns.

Strategy (time series gradient boosting)

Multi-step forecasting method for gradient boosting.

Choose between recursive, direct, or multioutput forecasting.

Random state (time series gradient boosting)

Seed for model training randomness.

Set to ensure reproducible results across runs.

Layers (time series MLP)

Hidden layer sizes for the MLP regressor.

Enter comma-separated sizes, e.g. 64,32,16.

More layers increase capacity but also training time.
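
The comma-separated text plausibly maps onto a tuple of layer sizes, as expected by a scikit-learn-style MLP regressor (an assumption about the backend); a parsing sketch:

```python
# Illustrative sketch: parsing "64,32,16" into a hidden-layer-size tuple.
def parse_layers(text):
    """Turn '64,32,16' into (64, 32, 16), ignoring stray whitespace."""
    return tuple(int(part.strip()) for part in text.split(",") if part.strip())

layers = parse_layers("64, 32, 16")
# Could then be passed as MLPRegressor(hidden_layer_sizes=layers).
```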

Activation (time series MLP)

Activation function in hidden layers.

relu is a good default; tanh can smooth outputs.

Solver (time series MLP)

Optimizer for MLP training.

adam works well for most datasets.

lbfgs can be faster on smaller datasets.

Alpha (time series MLP)

L2 regularization strength.

Higher values apply more weight decay to reduce overfitting.

Learning rate (time series MLP)

Learning rate schedule.

constant keeps a fixed rate.

adaptive reduces the rate when learning stalls.

Max iterations (time series MLP)

Maximum training epochs.

Increase if the model fails to converge.

Window length (time series MLP)

Number of lagged steps used as features.

Longer windows capture more history but increase model size.

Strategy (time series MLP)

How the MLP produces multi-step forecasts.

Recursive, direct, or multioutput forecasting.

Random state (time series MLP)

Seed for weight initialization.

Set to ensure reproducible results.

Models

Forecasting models

Pick one or more algorithms to run for each selected feature.

Select the models to include in the experiment. Each option offers different strengths.

Run several models together to compare metrics and pick the best-performing approach.

Target feature

Optional exogenous target for time series regression models.

Select a target feature when using regression-based forecasters (ridge, lasso, random forest, gradient boosting).

Only one target can be active at a time; deselect it to return to univariate mode.

Chat

Chat sessions

Create, switch, and delete independent chat histories.

Chat history is stored per session.

Using Chat

Ask the built-in assistant and keep a record of conversations.

The chat window lets you send questions to the assistant and keeps a history in the log database.

Settings

LLM provider

Choose the backend that powers chat replies.

Select which provider handles the chat requests sent from the chat window.

If you want more guidance, use Ask from AI below to send this topic to the chat window.

Model name

Enter the model identifier for the selected provider.

Provide the exact model name your provider expects.

Use the refresh icon next to Model name to fetch available models for the selected provider.

You can click Ask from AI below for help choosing a model.

Thinking mode

Control how much intermediate reasoning the model exposes in chat.

Choose how much model reasoning is shown in the assistant reply card.

Behavior varies by provider and model. If a model ignores this setting, chat still works and returns normal replies.

OpenAI API key

Add your OpenAI key for ChatGPT/OpenAI access.

This key is only required for the ChatGPT (OpenAI) provider.

Create or manage keys at OpenAI API keys.

If you do not have a key yet, sign in and generate one, then paste it here.

Use Test to save the key and verify it by fetching the available OpenAI models.

Use Ask from AI below if you want step-by-step guidance.