polytex.stats

polytex.stats.bw_opt

polytex.stats.bw_opt.bw_scott(sigma, n='')[source]

Scott’s rule for bandwidth selection.

Parameters
sigma : float

The standard deviation of the data.

n : int

The number of data points.

Returns
bw : float

The bandwidth of the kernel.
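As a rough illustration of what a Scott-type rule computes from `sigma` and `n`, here is a minimal sketch. It assumes the common one-dimensional form h = sigma * n^(-1/5); some references multiply by a constant of about 1.06, and the constant used by `bw_scott` is not stated here. The name `scott_bandwidth` is hypothetical and is not part of polytex.

```python
def scott_bandwidth(sigma, n):
    """Hypothetical helper: one-dimensional Scott-type bandwidth.

    Assumes h = sigma * n ** (-1/5); the library's constant may differ.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of data points")
    return sigma * n ** (-1.0 / 5.0)


# The bandwidth shrinks slowly as the sample grows.
h_small = scott_bandwidth(1.0, 32)      # approximately 0.5
h_large = scott_bandwidth(1.0, 32 * 32)  # approximately 0.25
```

The slow n^(-1/5) decay is the point of the rule: a sample 32 times larger only halves the bandwidth.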

polytex.stats.bw_opt.log_likelihood(pdf)[source]

Calculate the likelihood of the given probability density function, where f(x_i) is the density evaluated at data point x_i and N is the number of data points. The likelihood is:

\[L = \frac{1}{N}\sum_{i=1}^{N} f(x_i)\]

Parameters
pdf : NumPy array

The probability density function.

Returns
LL : float

The log-likelihood of the given probability density function.
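The formula above averages the density values, while the return value is described as a log-likelihood. The sketch below shows both quantities side by side so the difference is visible. It assumes the conventional definition of a log-likelihood as the sum of log densities; which of the two `log_likelihood` actually computes is not confirmed here, and both function names are hypothetical.

```python
import math


def mean_likelihood(pdf_values):
    """Average density, exactly as written in the formula above."""
    return sum(pdf_values) / len(pdf_values)


def sum_log_likelihood(pdf_values):
    """Conventional log-likelihood: sum of log f(x_i). Assumed, not confirmed."""
    return sum(math.log(p) for p in pdf_values)


densities = [0.2, 0.4]  # made-up density values for illustration
avg = mean_likelihood(densities)
ll = sum_log_likelihood(densities)
```

Either quantity can rank candidate bandwidths, but they are not interchangeable: the log form penalizes near-zero densities much more heavily.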

polytex.stats.bw_opt.opt_bandwidth(variable, x_test, bw)[source]

Find the optimal bandwidth by tuning the bandwidth parameter via cross-validation, and return the value that maximizes the log-likelihood of the data.

Parameters
variable : NumPy array

An N x 1 NumPy array containing the data for kernel density estimation.

x_test : NumPy array

Test data at which the density distribution is evaluated.

bw : list of float

The candidate kernel bandwidths to test.

Returns
kde.bandwidth : float

The optimal bandwidth of the kernel.
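The selection idea can be sketched without any dependencies. The example below assumes leave-one-out cross-validation with a Gaussian kernel and scores each candidate by the summed log density of the held-out points; the cross-validation scheme used by `opt_bandwidth` is not specified here, and all function names are hypothetical.

```python
import math


def gaussian_kde_at(x, data, bw):
    """Gaussian kernel density estimate at a single point x."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / bw) ** 2) for xi in data)


def loo_log_likelihood(data, bw):
    """Sum of log densities, each point scored by a KDE fitted without it."""
    total = 0.0
    for i, x in enumerate(data):
        rest = data[:i] + data[i + 1:]
        density = gaussian_kde_at(x, rest, bw)
        # Floor the density so a too-narrow kernel cannot trigger log(0).
        total += math.log(max(density, 1e-300))
    return total


def pick_bandwidth(data, candidates):
    """Return the candidate with the highest held-out log-likelihood."""
    return max(candidates, key=lambda bw: loo_log_likelihood(data, bw))


data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
best = pick_bandwidth(data, [0.001, 0.1, 100.0])
```

Holding each point out matters: scoring on the training points themselves would always favor the smallest bandwidth.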

polytex.stats.kde

polytex.stats.kde.kdePlot(xkde, ykde, cluster_center_idx)[source]
Parameters
xkde : NumPy array

The normalized distance.

ykde : NumPy array

The probability density distribution corresponding to the normalized distance.

cluster_center_idx : NumPy array

The indices of the cluster centers.

The index of the cluster centers.

Returns
None.

polytex.stats.kde.kdeScreen(variable, x_test, bw, kernels='gaussian', plot=False)[source]

This function estimates the probability density distribution of the input variable with the non-parametric kernel density estimation (KDE) method. The local maxima and minima of the density are then identified to decompose the input variable into a set of clusters: the maxima serve as the cluster centers and the minima as the cluster boundaries.

Parameters
variable : NumPy array

An N x 1 NumPy array containing the data for kernel density estimation.

x_test : NumPy array

Test data at which the density distribution is evaluated. It has the same shape as the given variable and should cover the whole range of the variable.

bw : float

The bandwidth of the kernel.

kernels : string, optional

The kernel to use. The default is ‘gaussian’. The possible values are {‘gaussian’, ‘tophat’, ‘epanechnikov’, ‘exponential’, ‘linear’, ‘cosine’}.

plot : bool, optional

Whether to plot the probability density distribution. The default is False.

Returns
clusters : dictionary

The indices of the cluster centers and cluster boundaries, and the probability density distribution (pdf).
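The decomposition described for `kdeScreen` can be illustrated with a small dependency-free sketch: evaluate a Gaussian KDE on a grid, then take strict local maxima as cluster centers and strict local minima as cluster boundaries. This is an illustration of the idea under those assumptions, not the library's implementation; the function names and the dictionary keys are hypothetical.

```python
import math


def gaussian_kde(data, grid, bw):
    """Gaussian kernel density estimate evaluated at every grid point."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bw) ** 2) for x in data)
        for g in grid
    ]


def local_extrema(pdf):
    """Indices of strict interior local maxima and minima of a sequence."""
    maxima, minima = [], []
    for i in range(1, len(pdf) - 1):
        if pdf[i] > pdf[i - 1] and pdf[i] > pdf[i + 1]:
            maxima.append(i)
        elif pdf[i] < pdf[i - 1] and pdf[i] < pdf[i + 1]:
            minima.append(i)
    return maxima, minima


# Two well-separated groups of made-up values on a normalized [0, 1] axis.
data = [0.10, 0.12, 0.14, 0.80, 0.82, 0.84]
grid = [i / 100 for i in range(101)]
pdf = gaussian_kde(data, grid, bw=0.05)
centers, boundaries = local_extrema(pdf)

# Hypothetical key names; the real dictionary layout is not documented here.
clusters = {"centers": centers, "boundaries": boundaries, "pdf": pdf}
```

The grid plays the role of `x_test`: extrema can only be found where the density is evaluated, which is why the test data should span the whole range of the variable.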

polytex.stats.kde.movingKDE(dataset, bw=0.002, windows=1, n_clusters=20, x_test=None)[source]

This function applies the kernel density estimation (KDE) method to the input dataset with a moving window: the dataset is divided into a set of windows and KDE is applied to each window separately. This captures more detail of the geometry changes along a fiber tow.

Parameters
dataset : NumPy array

An N x 2 NumPy array for kernel density estimation. The first column is the variable under analysis; the second is the label of the cross-section that each value belongs to.

bw : NumPy array or float, optional

A range of bandwidth values for the KDE, usually generated with np.arange(). The optimal bandwidth is identified within this range and used for kernel density estimation. If a single number is given, it is used directly as the bandwidth. The default is 0.002.

windows : int, optional

The number of windows (segments) for KDE analysis. The default is 1, in which case the whole dataset is analyzed at once and the result is the same as calling kdeScreen() directly.

n_clusters : int, optional

The target number of cluster centers. The default is 20.

x_test : NumPy array, optional

Test data at which the density distribution is evaluated. The default is None.

Returns
kdeOutput : NumPy array

An N x 3 NumPy array. The first column is the label of the window under analysis, the second is the normalized distance, and the third is the probability density.

cluster_center : NumPy array

An M x N NumPy array, where M is the number of windows and N - 1 is the number of cluster centers. The first column is the maximum index for each window; the remaining columns are the cluster centers.
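The windowing step can be sketched as follows. The example assumes windows are contiguous blocks of sorted cross-section labels and that each window gets its own Gaussian KDE on a shared grid; how `movingKDE` actually partitions the data is not described here, and every name in the sketch is hypothetical. Rows mirror the three documented `kdeOutput` columns (window label, normalized distance, density).

```python
import math


def gaussian_kde(data, grid, bw):
    """Gaussian kernel density estimate evaluated at every grid point."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bw) ** 2) for x in data)
        for g in grid
    ]


def split_into_windows(dataset, windows):
    """Group values by contiguous blocks of cross-section labels (assumed scheme)."""
    labels = sorted({label for _, label in dataset})
    size = math.ceil(len(labels) / windows)
    groups = []
    for w in range(windows):
        block = set(labels[w * size:(w + 1) * size])
        groups.append([value for value, label in dataset if label in block])
    return groups


def moving_kde_sketch(dataset, grid, bw, windows=1):
    """Return rows of (window index, grid position, density)."""
    rows = []
    for w, values in enumerate(split_into_windows(dataset, windows)):
        if not values:
            continue  # more windows than labels leaves some windows empty
        for g, p in zip(grid, gaussian_kde(values, grid, bw)):
            rows.append((w, g, p))
    return rows


# (value, cross-section label) pairs; made-up data for illustration.
dataset = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
rows = moving_kde_sketch(dataset, [0.0, 0.5, 1.0], bw=0.1, windows=2)
```

With `windows=1` every label falls into a single block, so the sketch reduces to one KDE over the whole dataset, matching the documented default behavior.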