polytex.stats

polytex.stats.bw_opt

polytex.stats.bw_opt.bw_scott(sigma, n='')

Scott’s rule for bandwidth selection.

Parameters

sigma (float): The standard deviation of the data.
n (int): The number of data points.

Returns

bw (float): The bandwidth of the kernel.
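As a rough illustration, the sketch below evaluates Scott's rule in the form SciPy's `gaussian_kde` uses for one-dimensional data, h = sigma * n^(-1/5). The helper name `scott_bandwidth` is hypothetical, and this page does not state the exact expression or constant that `bw_scott` applies, so treat the formula as an assumption (some texts multiply by a constant close to 1.06).

```python
import statistics


def scott_bandwidth(sigma, n):
    """Hypothetical helper: 1-D Scott factor, h = sigma * n**(-1/5)."""
    if n <= 0:
        raise ValueError("n must be a positive number of data points")
    return sigma * n ** (-1.0 / 5.0)


# Small made-up sample, used only to show the call pattern.
data = [1.2, 0.7, 1.9, 1.4, 0.9, 1.6, 1.1, 1.3]
sigma = statistics.stdev(data)
h = scott_bandwidth(sigma, len(data))
print(h)
```

The bandwidth shrinks slowly as the sample grows: multiplying `n` by 32 only halves `h`.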

polytex.stats.bw_opt.log_likelihood(pdf)

Calculate the likelihood of the given probability density function. The likelihood is:

\[L = \frac{1}{N}\sum_{i=1}^{N} f(x_i)\]

Parameters

pdf (Numpy array): The probability density function.

Returns

LL (float): The log-likelihood of the given probability density function.
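The formula above averages the density values, while the return value is described as a log-likelihood, which is conventionally the sum of log f(x_i). The sketch below computes both quantities with NumPy so they can be compared. It does not call polytex, the function names are hypothetical, and it makes no claim about which of the two `log_likelihood` actually returns.

```python
import numpy as np


def mean_density(pdf):
    """Average of the density values, as written in the formula above."""
    pdf = np.asarray(pdf, dtype=float)
    return pdf.mean()


def sum_log_density(pdf):
    """Conventional log-likelihood: sum of log f(x_i); assumes every value > 0."""
    pdf = np.asarray(pdf, dtype=float)
    return np.log(pdf).sum()


# Made-up density values evaluated at three data points.
pdf = np.array([0.5, 0.25, 0.25])
print(mean_density(pdf), sum_log_density(pdf))
```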

polytex.stats.bw_opt.opt_bandwidth(variable, x_test, bw)

Finds the optimal bandwidth by tuning the bandwidth parameter via cross-validation and returns the value that maximizes the log-likelihood of the data.

Parameters

variable (Numpy array): An N x 1 numpy array. The data to which the kernel density estimation is applied.
x_test (Numpy array): Test data on which the density distribution is evaluated.
bw (list of float): The candidate kernel bandwidths to be tested.

Returns

kde.bandwidth (float): The optimal bandwidth of the kernel.
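To make the idea of cross-validated bandwidth selection concrete, here is a minimal NumPy sketch that scores each candidate bandwidth by the leave-one-out log-likelihood of a Gaussian KDE and keeps the best one. It is not the polytex implementation: the names `loo_log_likelihood` and `pick_bandwidth` are hypothetical, the leave-one-out scheme and Gaussian kernel are assumptions, and the sketch ignores the `x_test` argument that `opt_bandwidth` accepts.

```python
import numpy as np


def loo_log_likelihood(x, h):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE with bandwidth h."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    z = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(k, 0.0)  # leave each point out of its own estimate
    density = k.sum(axis=1) / (n - 1)
    with np.errstate(divide="ignore"):  # log(0) -> -inf is a valid (worst) score
        return np.log(density).sum()


def pick_bandwidth(x, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    scores = [loo_log_likelihood(x, h) for h in candidates]
    return float(candidates[int(np.argmax(scores))])


rng = np.random.default_rng(0)
x = rng.normal(size=200)                 # synthetic data for illustration
candidates = np.linspace(0.1, 1.5, 15)   # bandwidths to test
best = pick_bandwidth(x, candidates)
print(best)
```

Leaving each point out of its own density estimate matters: without it, the likelihood grows without bound as the bandwidth shrinks, and the smallest candidate would always win.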

polytex.stats.kde

polytex.stats.kde.kdePlot(xkde, ykde, cluster_center_idx)

Parameters

xkde (Numpy array): The normalized distance.
ykde (Numpy array): The probability density distribution corresponding to the normalized distance.
cluster_center_idx (Numpy array): The indices of the cluster centers.

Returns

None.

polytex.stats.kde.kdeScreen(variable, x_test, bw, kernels='gaussian', plot=False)

This function estimates the probability density distribution of the input variable with the non-parametric kernel density estimation (KDE) method. The local maxima and minima of the estimated density are then identified to decompose the input variable into a set of clusters: the maxima serve as the cluster centers and the minima as the cluster boundaries.

Parameters

variable (Numpy array): An N x 1 numpy array to which the kernel density estimation is applied.
x_test (Numpy array): Test data on which the density distribution is evaluated. It has the same shape as the given variable and should cover the whole range of the variable.
bw (float): The bandwidth of the kernel.
kernels (string, optional): The kernel to use. The default is 'gaussian'. The possible values are {'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}.
plot (bool, optional): Whether to plot the probability density distribution. The default is False.

Returns

clusters (dictionary): The indices of the cluster centers, the cluster boundaries, and the probability density distribution (pdf).
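The two steps described above, density estimation followed by extremum detection, can be sketched in plain NumPy as follows. This is an illustration under stated assumptions rather than the polytex code: `gaussian_kde_1d` and `local_extrema` are hypothetical helpers, only a Gaussian kernel is implemented, the neighbor-comparison rule for extrema is an assumption, and the dictionary keys are invented for the example.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, h):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float).ravel()
    grid = np.asarray(grid, dtype=float).ravel()
    z = (grid[:, None] - samples[None, :]) / h
    norm = samples.size * h * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def local_extrema(y):
    """Indices of strict interior local maxima and minima of a 1-D array."""
    inner = y[1:-1]
    maxima = np.where((inner > y[:-2]) & (inner > y[2:]))[0] + 1
    minima = np.where((inner < y[:-2]) & (inner < y[2:]))[0] + 1
    return maxima, minima


# Two well-separated groups of made-up normalized distances.
samples = np.concatenate([np.linspace(0.2, 0.3, 50), np.linspace(0.7, 0.8, 50)])
grid = np.linspace(0.0, 1.0, 501)  # covers the whole range of the variable
pdf = gaussian_kde_1d(samples, grid, h=0.05)

centers, boundaries = local_extrema(pdf)
clusters = {"centers": centers, "boundaries": boundaries, "pdf": pdf}
print(grid[centers], grid[boundaries])
```

With this input the density has one peak per group and a single valley between them, so the valley index splits the data into two clusters.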

polytex.stats.kde.movingKDE(dataset, bw=0.002, windows=1, n_clusters=20, x_test=None)

This function applies the kernel density estimation (KDE) method to the input dataset with a moving window: the dataset is divided into a set of windows and the KDE method is applied to each window separately. This captures more detail of the geometry changes along a fiber tow.

Parameters

dataset (Numpy array): An N x 2 numpy array for kernel density estimation. The first column is the variable under analysis; the second is the label of the cross-section that each value belongs to.
bw (Numpy array or float, optional): A range of bandwidth values for the KDE operation, usually generated with np.arange(). The optimal bandwidth is identified within this range and used for kernel density estimation. If a single number is given, it is used directly as the bandwidth. The default is 0.002.
windows (int, optional): The number of windows (segments) for KDE analysis. The default is 1, in which case the whole dataset is used for KDE analysis and the result is the same as calling kdeScreen() directly.
n_clusters (int, optional): The target number of cluster centers. The default is 20.
x_test (Numpy array, optional): Test data on which the density distribution is evaluated. The default is None.

Returns

kdeOutput (Numpy array): An N x 3 numpy array. The first column is the label of the window under analysis, the second is the normalized distance, and the third is the probability density.
cluster_center (Numpy array): An M x N numpy array, where M is the number of windows and N-1 is the number of cluster centers. The first column is the maximum index for each window; the following columns are the cluster centers.
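The windowing idea can be sketched as follows: group the cross-section labels into consecutive windows, run a KDE on the values in each window, and stack the results into a three-column array of window label, evaluation point and density. This is a hedged illustration only. The helper names are hypothetical, the way labels are split into windows (`np.array_split` over the unique labels) is an assumption, and bandwidth optimization and cluster-center extraction are omitted.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, h):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / h
    norm = samples.size * h * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def windowed_kde(dataset, h, windows, grid):
    """Apply a KDE per window of cross-section labels; returns an array with 3 columns."""
    labels = np.unique(dataset[:, 1])
    rows = []
    for w, chunk in enumerate(np.array_split(labels, windows)):
        mask = np.isin(dataset[:, 1], chunk)  # rows belonging to this window
        pdf = gaussian_kde_1d(dataset[mask, 0], grid, h)
        rows.append(np.column_stack([np.full(grid.size, w), grid, pdf]))
    return np.vstack(rows)


# Made-up dataset: 6 cross-sections with 40 normalized distances each.
values = np.tile(np.linspace(0.1, 0.9, 40), 6)
section = np.repeat(np.arange(6), 40)
dataset = np.column_stack([values, section])

grid = np.linspace(0.0, 1.0, 101)
out = windowed_kde(dataset, h=0.05, windows=3, grid=grid)
print(out.shape)
```

Setting `windows=1` in this sketch reduces it to a single KDE over the whole dataset, mirroring the default behavior described above.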