AI Headshot Creator Free

AI Headshot Creator Free — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • GitHub Codespaces

    GitHub Codespaces

    GitHub Codespaces is a cloud-based online integrated development environment developed by GitHub. It allows users to create and manage development environments directly within the browser or through Visual Studio Code desktop. Codespaces is tightly integrated with GitHub repositories and enables on-demand coding, debugging, and testing in a full-featured development container hosted in the cloud. == Features == Instant development environments integrated with GitHub Browser-based and desktop access via Visual Studio Code Configurable Dockerfile or devcontainer.json environments Built-in support for GitHub Copilot, extensions, snippets, and SSH. == Licensing == GitHub Codespaces is proprietary software and available to GitHub users under various subscription plans. Codespaces includes a monthly usage quota for free tier users of 120 hours, and expanded access for GitHub education, Pro, Team, and GitHub Enterprise plans. == GitHub Classroom == GitHub Classroom is an educational tool developed by GitHub to streamline the process of managing programming assignments and coursework. Integrated with GitHub repositories, it allows instructors to distribute starter code, automate grading workflows, and track student progress. GitHub Classroom is widely used in computer science education and supports integration with GitHub Codespaces for cloud-based development environments. == Programming languages supported == == Extensions == Some of the popular extensions include:

    Read more →
  • Witness set

    Witness set

    In combinatorics and computational learning theory, a witness set is a set of elements that distinguishes a given Boolean function from a given class of other Boolean functions. Let C {\displaystyle C} be a concept class over a domain X {\displaystyle X} (that is, a family of Boolean functions over X {\displaystyle X} ) and c {\displaystyle c} be a concept in X {\displaystyle X} (a single Boolean function). A subset S {\displaystyle S} of X {\displaystyle X} is a witness set for c {\displaystyle c} in X {\displaystyle X} if S {\displaystyle S} distinguishes c {\displaystyle c} from all the other functions in C {\displaystyle C} , in the sense that no other function in C {\displaystyle C} has the same values on S {\displaystyle S} . For a concept class with | C | {\displaystyle |C|} concepts, there exists a concept that has a witness of size at most log 2 ⁡ | C | {\displaystyle \log _{2}|C|} ; this bound is tight when C {\displaystyle C} consists of all Boolean functions over X {\displaystyle X} . By a result of Bondy (1972) there exists a single witness set of size at most | C | − 1 {\displaystyle |C|-1} that is valid for all concepts in C {\displaystyle C} ; this bound is tight when C {\displaystyle C} consists of the indicator functions of the empty set and some singleton sets. One way to construct this set is to interpret the concepts as bitstrings, and the domain elements as positions in these bitstrings. Then the set of positions at which a trie of the bitstrings branches forms the desired witness set. This construction is central to the operation of the fusion tree data structure. The minimum size of a witness set for c {\displaystyle c} is called the witness size or specification number and is denoted by w C ( c ) {\displaystyle w_{C}(c)} . The value max { w C ( c ) : c ∈ C } {\displaystyle \max\{w_{C}(c):c\in C\}} is called the teaching dimension of C {\displaystyle C} . It represents the number of examples of a concept that need to be presented by a teacher to a learner, in the worst case, to enable the learner to determine which concept is being presented. Witness sets have also been called teaching sets, keys, specifying sets, or discriminants. The "witness set" terminology is from Kushilevitz et al. (1996), who trace the concept of witness sets to work by Cover (1965).

    Read more →
  • Oja's rule

    Oja's rule

    Oja's learning rule, or simply Oja's rule, named after Finnish computer scientist Erkki Oja (Finnish pronunciation: [ˈojɑ], AW-yuh), is a model of how neurons in the brain or in artificial neural networks change connection strength, or learn, over time. It is a modification of the standard Hebb's Rule that, through multiplicative normalization, solves all stability problems and generates an algorithm for principal components analysis. This is a computational form of an effect which is believed to happen in biological neurons. == Theory == Oja's rule requires a number of simplifications to derive, but in its final form it is demonstrably stable, unlike Hebb's rule. It is a single-neuron special case of the Generalized Hebbian Algorithm. However, Oja's rule can also be generalized in other ways to varying degrees of stability and success. === Formula === Consider a simplified model of a neuron y {\displaystyle y} that returns a linear combination of its inputs x using presynaptic weights w: y ( x ) = ∑ j = 1 m x j w j {\displaystyle \,y(\mathbf {x} )~=~\sum _{j=1}^{m}x_{j}w_{j}} Oja's rule defines the change in presynaptic weights w given the output response y {\displaystyle y} of a neuron to its inputs x to be Δ w = w n + 1 − w n = η y n ( x n − y n w n ) , {\displaystyle \,\Delta \mathbf {w} ~=~\mathbf {w} _{n+1}-\mathbf {w} _{n}~=~\eta \,y_{n}(\mathbf {x} _{n}-y_{n}\mathbf {w} _{n}),} where η is the learning rate which can also change with time. Note that the bold symbols are vectors and n defines a discrete time iteration. The rule can also be made for continuous iterations as d w d t = η y ( t ) ( x ( t ) − y ( t ) w ( t ) ) . {\displaystyle \,{\frac {d\mathbf {w} }{dt}}~=~\eta \,y(t)(\mathbf {x} (t)-y(t)\mathbf {w} (t)).} === Derivation === The simplest learning rule known is Hebb's rule, which states in conceptual terms that neurons that fire together, wire together. In component form as a difference equation, it is written Δ w = η y ( x n ) x n {\displaystyle \,\Delta \mathbf {w} ~=~\eta \,y(\mathbf {x} _{n})\mathbf {x} _{n}} , or in scalar form with implicit n-dependence, w i ( n + 1 ) = w i ( n ) + η y ( x ) x i {\displaystyle \,w_{i}(n+1)~=~w_{i}(n)+\eta \,y(\mathbf {x} )x_{i}} , where y(xn) is again the output, this time explicitly dependent on its input vector x. Hebb's rule has synaptic weights approaching infinity with a positive learning rate. We can stop this by normalizing the weights so that each weight's magnitude is restricted between 0, corresponding to no weight, and 1, corresponding to being the only input neuron with any weight. We do this by normalizing the weight vector to be of length one: w i ( n + 1 ) = w i ( n ) + η y ( x ) x i ( ∑ j = 1 m [ w j ( n ) + η y ( x ) x j ] p ) 1 / p {\displaystyle \,w_{i}(n+1)~=~{\frac {w_{i}(n)+\eta \,y(\mathbf {x} )x_{i}}{\left(\sum _{j=1}^{m}[w_{j}(n)+\eta \,y(\mathbf {x} )x_{j}]^{p}\right)^{1/p}}}} . Note that in Oja's original paper, p=2, corresponding to quadrature (root sum of squares), which is the familiar Cartesian normalization rule. However, any type of normalization, even linear, will give the same result without loss of generality. For a small learning rate | η | ≪ 1 {\displaystyle |\eta |\ll 1} the equation can be expanded as a Power series in η {\displaystyle \eta } . w i ( n + 1 ) = w i ( n ) ( ∑ j w j p ( n ) ) 1 / p + η ( y x i ( ∑ j w j p ( n ) ) 1 / p − w i ( n ) ∑ j y x j w j p − 1 ( n ) ( ∑ j w j p ( n ) ) ( 1 + 1 / p ) ) + O ( η 2 ) {\displaystyle \,w_{i}(n+1)~=~{\frac {w_{i}(n)}{\left(\sum _{j}w_{j}^{p}(n)\right)^{1/p}}}~+~\eta \left({\frac {yx_{i}}{\left(\sum _{j}w_{j}^{p}(n)\right)^{1/p}}}-{\frac {w_{i}(n)\sum _{j}yx_{j}w_{j}^{p-1}(n)}{\left(\sum _{j}w_{j}^{p}(n)\right)^{(1+1/p)}}}\right)~+~O(\eta ^{2})} . For small η, our higher-order terms O(η2) go to zero. We again make the specification of a linear neuron, that is, the output of the neuron is equal to the sum of the product of each input and its synaptic weight to the power of p-1, which in the case of p=2 is synaptic weight itself, or y ( x ) = ∑ j = 1 m x j w j p − 1 {\displaystyle \,y(\mathbf {x} )~=~\sum _{j=1}^{m}x_{j}w_{j}^{p-1}} . We also specify that our weights normalize to 1, which will be a necessary condition for stability, so | w | = ( ∑ j = 1 m w j p ) 1 / p = 1 {\displaystyle \,|\mathbf {w} |~=~\left(\sum _{j=1}^{m}w_{j}^{p}\right)^{1/p}~=~1} , which, when substituted into our expansion, gives Oja's rule, or w i ( n + 1 ) = w i ( n ) + η y ( x i − w i ( n ) y ) {\displaystyle \,w_{i}(n+1)~=~w_{i}(n)+\eta \,y(x_{i}-w_{i}(n)y)} . === Stability and PCA === In analyzing the convergence of a single neuron evolving by Oja's rule, one extracts the first principal component, or feature, of a data set. Furthermore, with extensions using the Generalized Hebbian Algorithm, one can create a multi-Oja neural network that can extract as many features as desired, allowing for principal components analysis. A principal component aj is extracted from a dataset x through some associated vector qj, or aj = qj⋅x, and we can restore our original dataset by taking x = ∑ j a j q j {\displaystyle \mathbf {x} ~=~\sum _{j}a_{j}\mathbf {q} _{j}} . In the case of a single neuron trained by Oja's rule, we find the weight vector converges to q1, or the first principal component, as time or number of iterations approaches infinity. We can also define, given a set of input vectors Xi, that its correlation matrix Rij = XiXj has an associated eigenvector given by qj with eigenvalue λj. The variance of outputs of our Oja neuron σ2(n) = ⟨y2(n)⟩ then converges with time iterations to the principal eigenvalue, or lim n → ∞ σ 2 ( n ) = λ 1 {\displaystyle \lim _{n\rightarrow \infty }\sigma ^{2}(n)~=~\lambda _{1}} . These results are derived using Lyapunov function analysis, and they show that Oja's neuron necessarily converges on strictly the first principal component if certain conditions are met in our original learning rule. Most importantly, our learning rate η is allowed to vary with time, but only such that its sum is divergent but its power sum is convergent, that is ∑ n = 1 ∞ η ( n ) = ∞ , ∑ n = 1 ∞ η ( n ) p < ∞ , p > 1 {\displaystyle \sum _{n=1}^{\infty }\eta (n)=\infty ,~~~\sum _{n=1}^{\infty }\eta (n)^{p}<\infty ,~~~p>1} . Our output activation function y(x(n)) is also allowed to be nonlinear and nonstatic, but it must be continuously differentiable in both x and w and have derivatives bounded in time. == Applications == Oja's rule was originally described in Oja's 1982 paper, but the principle of self-organization to which it is applied is first attributed to Alan Turing in 1952. PCA has also had a long history of use before Oja's rule formalized its use in network computation in 1989. The model can thus be applied to any problem of self-organizing mapping, in particular those in which feature extraction is of primary interest. Therefore, Oja's rule has an important place in image and speech processing. It is also useful as it expands easily to higher dimensions of processing, thus being able to integrate multiple outputs quickly. A canonical example is its use in binocular vision. === Biology and Oja's subspace rule === There is clear evidence for both long-term potentiation and long-term depression in biological neural networks, along with a normalization effect in both input weights and neuron outputs. However, while there is no direct experimental evidence yet of Oja's rule active in a biological neural network, a biophysical derivation of a generalization of the rule is possible. Such a derivation requires retrograde signalling from the postsynaptic neuron, which is biologically plausible (see neural backpropagation), and takes the form of Δ w i j ∝ ⟨ x i y j ⟩ − ϵ ⟨ ( c p r e ∗ ∑ k w i k y k ) ⋅ ( c p o s t ∗ y j ) ⟩ , {\displaystyle \Delta w_{ij}~\propto ~\langle x_{i}y_{j}\rangle -\epsilon \left\langle \left(c_{\mathrm {pre} }\sum _{k}w_{ik}y_{k}\right)\cdot \left(c_{\mathrm {post} }y_{j}\right)\right\rangle ,} where as before wij is the synaptic weight between the ith input and jth output neurons, x is the input, y is the postsynaptic output, and we define ε to be a constant analogous the learning rate, and cpre and cpost are presynaptic and postsynaptic functions that model the weakening of signals over time. Note that the angle brackets denote the average and the ∗ operator is a convolution. By taking the pre- and post-synaptic functions into frequency space and combining integration terms with the convolution, we find that this gives an arbitrary-dimensional generalization of Oja's rule known as Oja's Subspace, namely Δ w = C x ⋅ w − w ⋅ C y . {\displaystyle \Delta w~=~Cx\cdot w-w\cdot Cy.}

    Read more →
  • KXEN Inc.

    KXEN Inc.

    KXEN was an American software company which existed from 1998 to 2013 when it was acquired by SAP AG. == History == KXEN was founded in June 1998 by Roger Haddad and Michel Bera. It was based in San Francisco, California with offices in Paris and London. On September 10, 2013, SAP AG announced plans to acquire KXEN. On October 1, 2013, a letter to KXEN customers announced the acquisition closed. KXEN primarily marketed predictive analytics software. == Predictive analytics == InfiniteInsight is a predictive modeling suite developed by KXEN that assists analytic professionals, and business executives to extract information from data. Among other functions, InfiniteInsight is used for variable importance, classification, regression, segmentation, time series, product recommendation, as described and expressed by the Java Data Mining interface, and for social network analysis. InfiniteInsight allows prediction of a behavior or a value, the forecast of a time series or the understanding of a group of individuals with similar behavior. Advanced functions include behavioral modeling, exporting the model code into different target environments or building predictive models on top of SAS or SPSS data files. Competitors are SAS Enterprise Miner, IBM SPSS Modeler, and Statistica. Open source predictive tools like the R package or Weka are also competitors, since they provide similar features free of charge.

    Read more →
  • G'MIC

    G'MIC

    G'MIC (GREYC's Magic for Image Computing) is a free and open-source framework for image processing. It defines a script language that allows the creation of complex macros. Originally usable only through a command line interface, it is currently mostly popular as a GIMP plugin, and is also included in Krita. G'MIC is dual-licensed under CECILL-2.1 or CECILL-C. == Features == G'MIC's graphical interface is notable for its noise removal filters, which came from an earlier project called GREYCstoration by the same authors. G'MIC offers many built-in commands for image processing, including basic mathematical manipulations, look up tables, and filtering operations. More complex macros and pipelines built out of those commands are defined in its library files. == Interpreters == === Command line === G'MIC is primarily a script language callable from a shell. For example, to display an image: This command displays the image contained in the file image.jpg and allows zooming in to examine values. Several filters can be applied in succession. For example, to crop and resize an image: === Graphical interface === G'MIC comes with a Qt-based graphical interface, which may be integrated as a Gimp or Krita plugin. It contains several hundred filters written in the G'MIC language, dynamically updated through an internet feed. The interface provides a preview and setting sliders for each filter. G'MIC is one of the most popular Gimp plugins. === G'MIC Online === Most of the filters available for the graphical interface are also available online. === ZArt === ZArt is a graphical interface for real-time manipulation of webcam images. === libgmic === Libgmic is a C++ library that can be linked to third-party applications. It sees integration in Flowblade and Veejay.

    Read more →
  • Recursive neural network

    Recursive neural network

    A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input, to produce a structured prediction over variable-size input structures, or a scalar prediction on it, by traversing a given structure in topological order. These networks were first introduced to learn distributed representations of structure (such as logical terms), but have been successful in multiple applications, for instance in learning sequence and tree structures in natural language processing (mainly continuous representations of phrases and sentences based on word embeddings). == Architectures == === Basic === In the simplest architecture, nodes are combined into parents using a weight matrix (which is shared across the whole network) and a non-linearity such as the tanh {\displaystyle \tanh } hyperbolic function. If c 1 {\displaystyle c_{1}} and c 2 {\displaystyle c_{2}} are n {\displaystyle n} -dimensional vector representations of nodes, their parent will also be an n {\displaystyle n} -dimensional vector, defined as: p 1 , 2 = tanh ⁡ ( W [ c 1 ; c 2 ] ) {\displaystyle p_{1,2}=\tanh(W[c_{1};c_{2}])} where W {\displaystyle W} is a learned n × 2 n {\displaystyle n\times 2n} weight matrix. This architecture, with a few improvements, has been used for successfully parsing natural scenes, syntactic parsing of natural language sentences, and recursive autoencoding and generative modeling of 3D shape structures in the form of cuboid abstractions. === Recursive cascade correlation (RecCC) === RecCC is a constructive neural network approach to deal with tree domains with pioneering applications to chemistry and extension to directed acyclic graphs. === Unsupervised RNN === A framework for unsupervised RNN has been introduced in 2004. === Tensor === Recursive neural tensor networks use a single tensor-based composition function for all nodes in the tree. == Training == === Stochastic gradient descent === Typically, stochastic gradient descent (SGD) is used to train the network. The gradient is computed using backpropagation through structure (BPTS), a variant of backpropagation through time used for recurrent neural networks. == Properties == The universal approximation capability of RNNs over trees has been proved in literature. == Related models == === Recurrent neural networks === Recurrent neural networks are recursive artificial neural networks with a certain structure: that of a linear chain. Whereas recursive neural networks operate on any hierarchical structure, combining child representations into parent representations, recurrent neural networks operate on the linear progression of time, combining the previous time step and a hidden representation into the representation for the current time step. === Tree Echo State Networks === An efficient approach to implement recursive neural networks is given by the Tree Echo State Network within the reservoir computing paradigm. === Extension to graphs === Extensions to graphs include graph neural network (GNN), Neural Network for Graphs (NN4G), and more recently convolutional neural networks for graphs.

    Read more →
  • Evolutionary programming

    Evolutionary programming

    Evolutionary programming is an evolutionary algorithm, where a share of new population is created by mutation of previous population without crossover. Evolutionary programming differs from evolution strategy ES( μ + λ {\displaystyle \mu +\lambda } ) in one detail. All individuals are selected for the new population, while in ES( μ + λ {\displaystyle \mu +\lambda } ), every individual has the same probability to be selected. It is one of the four major evolutionary algorithm paradigms. == History == It was first used by Lawrence J. Fogel in the US in 1960 in order to use simulated evolution as a learning process aiming to generate artificial intelligence. It was used to evolve finite-state machines as predictors.

    Read more →
  • GraphLab

    GraphLab

    Turi is a graph-based, high performance, distributed computation framework written in C++. The GraphLab project was started by Prof. Carlos Guestrin of Carnegie Mellon University in 2009. It is an open source project that uses the Apache License. While GraphLab was originally developed for machine learning tasks, it has also been developed for other data-mining tasks. == Motivation == As the amounts of collected data and computing power grow (multicore, GPUs, clusters, clouds), modern datasets no longer fit into one computing node. Efficient distributed parallel algorithms for handling large-scale data are required. The GraphLab framework is a parallel programming abstraction targeted for sparse iterative graph algorithms. GraphLab provides a programming interface, allowing deployment of distributed machine learning algorithms. The main design considerations behind the design of GraphLab are: Sparse data with local dependencies Iterative algorithms Potentially asynchronous execution == GraphLab toolkits == On top of GraphLab, several implemented libraries of algorithms: Topic modeling - contains applications like LDA, which can be used to cluster documents and extract topical representations. Graph analytics - contains applications like pagerank and triangle counting, which can be applied to general graphs to estimate community structure. Clustering - contains standard data clustering tools such as Kmeans Collaborative filtering - contains a collection of applications used to make predictions about users interests and factorize large matrices. Graphical models - contains tools for making joint predictions about collections of related random variables. Computer vision - contains a collection of tools for reasoning about images. == Turi == Turi (formerly called Dato and before that GraphLab Inc.) is a company that was founded by Prof. Carlos Guestrin from University of Washington in May 2013 to continue development support of the GraphLab open source project. Dato Inc. raised a $6.75M Series A from Madrona Venture Group and New Enterprise Associates (NEA). They raised a $18.5M Series B from Vulcan Capital and Opus Capital, with participation from Madrona and NEA. On August 5, 2016, Turi was acquired by Apple Inc. for $200,000,000.

    Read more →
  • Evolutionary robotics

    Evolutionary robotics

    Evolutionary robotics is an embodied approach to Artificial Intelligence (AI) in which robots are automatically designed using Darwinian principles of natural selection. The design of a robot, or a subsystem of a robot such as a neural controller, is optimized against a behavioral goal (e.g. run as fast as possible). Usually, designs are evaluated in simulations as fabricating thousands or millions of designs and testing them in the real world is prohibitively expensive in terms of time, money, and safety. An evolutionary robotics experiment starts with a population of randomly generated robot designs. The worst performing designs are discarded and replaced with mutations and/or combinations of the better designs. This evolutionary algorithm continues until a prespecified amount of time elapses or some target performance metric is surpassed. Evolutionary robotics methods are particularly useful for engineering machines that must operate in environments in which humans have limited intuition (nanoscale, space, etc.). Evolved simulated robots can also be used as scientific tools to generate new hypotheses in biology and cognitive science, and to test old hypothesis that require experiments that have proven difficult or impossible to carry out in reality. == History == In the early 1990s, two separate European groups demonstrated different approaches to the evolution of robot control systems. Dario Floreano and Francesco Mondada at EPFL evolved controllers for the Khepera robot. Adrian Thompson, Nick Jakobi, Dave Cliff, Inman Harvey, and Phil Husbands evolved controllers for a Gantry robot at the University of Sussex. However the body of these robots was presupposed before evolution. The first simulations of evolved robots were reported by Karl Sims and Jeffrey Ventrella of the MIT Media Lab, also in the early 1990s. However these so-called virtual creatures never left their simulated worlds. The first evolved robots to be built in reality were 3D-printed by Hod Lipson and Jordan Pollack at Brandeis University at the turn of the 21st century.

    Read more →
  • Variable kernel density estimation

    Variable kernel density estimation

    In statistics, adaptive or "variable-bandwidth" kernel density estimation is a form of kernel density estimation in which the size of the kernels used in the estimate are varied depending upon either the location of the samples or the location of the test point. It is a particularly effective technique when the sample space is multi-dimensional. == Rationale == Given a set of samples, { x → i } {\displaystyle \lbrace {\vec {x}}_{i}\rbrace } , we wish to estimate the density, P ( x → ) {\displaystyle P({\vec {x}})} , at a test point, x → {\displaystyle {\vec {x}}} : P ( x → ) ≈ W n h D {\displaystyle P({\vec {x}})\approx {\frac {W}{nh^{D}}}} W = ∑ i = 1 n w i {\displaystyle W=\sum _{i=1}^{n}w_{i}} w i = K ( x → − x → i h ) {\displaystyle w_{i}=K\left({\frac {{\vec {x}}-{\vec {x}}_{i}}{h}}\right)} where n is the number of samples, K is the "kernel", h is its width and D is the number of dimensions in x → {\displaystyle {\vec {x}}} . The kernel can be thought of as a simple, linear filter. Using a fixed filter width may mean that in regions of low density, all samples will fall in the tails of the filter with very low weighting, while regions of high density will find an excessive number of samples in the central region with weighting close to unity. To fix this problem, we vary the width of the kernel in different regions of the sample space. There are two methods of doing this: balloon and pointwise estimation. In a balloon estimator, the kernel width is varied depending on the location of the test point. In a pointwise estimator, the kernel width is varied depending on the location of the sample. For multivariate estimators, the parameter, h, can be generalized to vary not just the size, but also the shape of the kernel. This more complicated approach will not be covered here. == Balloon estimators == A common method of varying the kernel width is to make it inversely proportional to the density at the test point: h = k [ n P ( x → ) ] 1 / D {\displaystyle h={\frac {k}{\left[nP({\vec {x}})\right]^{1/D}}}} where k is a constant. If we back-substitute the estimated PDF, and assuming a Gaussian kernel function, we can show that W is a constant: W = k D ( 2 π ) D / 2 {\displaystyle W=k^{D}(2\pi )^{D/2}} A similar derivation holds for any kernel whose normalising function is of the order hD, although with a different constant factor in place of the (2 π)D/2 term. This produces a generalization of the k-nearest neighbour algorithm. That is, a uniform kernel function will return the KNN technique. There are two components to the error: a variance term and a bias term. The variance term is given as: e 1 = P ∫ K 2 n h D {\displaystyle e_{1}={\frac {P\int K^{2}}{nh^{D}}}} . The bias term is found by evaluating the approximated function in the limit as the kernel width becomes much larger than the sample spacing. By using a Taylor expansion for the real function, the bias term drops out: e 2 = h 2 n ∇ 2 P {\displaystyle e_{2}={\frac {h^{2}}{n}}\nabla ^{2}P} An optimal kernel width that minimizes the error of each estimate can thus be derived. == Use for statistical classification == The method is particularly effective when applied to statistical classification. There are two ways we can proceed: the first is to compute the PDFs of each class separately, using different bandwidth parameters, and then compare them as in Taylor. Alternatively, we can divide up the sum based on the class of each sample: P ( j , x → ) ≈ 1 n ∑ i = 1 , c i = j n w i {\displaystyle P(j,{\vec {x}})\approx {\frac {1}{n}}\sum _{i=1,c_{i}=j}^{n}w_{i}} where ci is the class of the ith sample. The class of the test point may be estimated through maximum likelihood.

    Read more →
  • Gremlin (query language)

    Gremlin (query language)

    Gremlin is a graph traversal language and virtual machine developed by Apache TinkerPop of the Apache Software Foundation. Gremlin works for both OLTP-based graph databases as well as OLAP-based graph processors. Gremlin's automata and functional language foundation enable Gremlin to naturally support imperative and declarative querying, host language agnosticism, user-defined domain specific languages, an extensible compiler/optimizer, single- and multi-machine execution models, and hybrid depth- and breadth-first evaluation with Turing completeness. As an explanatory analogy, Apache TinkerPop and Gremlin are to graph databases what the JDBC and SQL are to relational databases. Likewise, the Gremlin traversal machine is to graph computing as what the Java virtual machine is to general purpose computing. == History == 2009-10-30 the project is born, and immediately named "TinkerPop" 2009-12-25 v0.1 is the first release 2011-05-21 v1.0 is released 2012-05-24 v2.0 is released 2015-01-16 TinkerPop becomes an Apache Incubator project 2015-07-09 v3.0.0-incubating is released 2016-05-23 Apache TinkerPop becomes a top-level project 2016-07-18 v3.1.3 and v3.2.1 are first releases as Apache TinkerPop 2017-12-17 v3.3.1 is released 2018-05-08 v3.3.3 is released 2019-08-05 v3.4.3 is released 2020-02-20 v3.4.6 is released 2021-05-01 v3.5.0 is released 2022-04-04 v3.6.0 is released 2023-07-31 v3.7.0 is released 2025-11-12 v3.8.0 is released == Vendor integration == Gremlin is an Apache2-licensed graph traversal language that can be used by graph system vendors. There are typically two types of graph system vendors: OLTP graph databases and OLAP graph processors. The table below outlines those graph vendors that support Gremlin. == Traversal examples == The following examples of Gremlin queries and responses in a Gremlin-Groovy environment are relative to a graph representation of the MovieLens dataset. The dataset includes users who rate movies. Users each have one occupation, and each movie has one or more categories associated with it. The MovieLens graph schema is detailed below. === Simple traversals === For each vertex in the graph, emit its label, then group and count each distinct label. What year was the oldest movie made? What is Die Hard's average rating? === Projection traversals === For each category, emit a map of its name and the number of movies it represents. For each movie with at least 11 ratings, emit a map of its name and average rating. Sort the maps in decreasing order by their average rating. Emit the first 10 maps (i.e. top 10). === Declarative pattern matching traversals === Gremlin supports declarative graph pattern matching similar to SPARQL. For instance, the following query below uses Gremlin's match()-step. What 80's action movies do 30-something programmers like? Group count the movies by their name and sort the group count map in decreasing order by value. Clip the map to the top 10 and emit the map entries. === OLAP traversal === Which movies are most central in the implicit 5-stars graph? == Gremlin graph traversal machine == Gremlin is a virtual machine composed of an instruction set as well as an execution engine. An analogy is drawn between Gremlin and Java. === Gremlin steps (instruction set) === The following traversal is a Gremlin traversal in the Gremlin-Java8 dialect. The Gremlin language (i.e. the fluent-style of expressing a graph traversal) can be represented in any host language that supports function composition and function nesting. Due to this simple requirement, there exists various Gremlin dialects including Gremlin-Groovy, Gremlin-Scala, Gremlin-Clojure, etc. The above Gremlin-Java8 traversal is ultimately compiled down to a step sequence called a traversal. A string representation of the traversal above provided below. The steps are the primitives of the Gremlin graph traversal machine. They are the parameterized instructions that the machine ultimately executes. The Gremlin instruction set is approximately 30 steps. These steps are sufficient to provide general purpose computing and what is typically required to express the common motifs of any graph traversal query. Given that Gremlin is a language, an instruction set, and a virtual machine, it is possible to design another traversal language that compiles to the Gremlin traversal machine (analogous to how Scala compiles to the JVM). For instance, the popular SPARQL graph pattern match language can be compiled to execute on the Gremlin machine. The following SPARQL query would compile to In Gremlin-Java8, the SPARQL query above would be represented as below and compile to the identical Gremlin step sequence (i.e. traversal). === Gremlin Machine (virtual machine) === The Gremlin graph traversal machine can execute on a single machine or across a multi-machine compute cluster. Execution agnosticism allows Gremlin to run over both graph databases (OLTP) and graph processors (OLAP).

    Read more →
  • Radial basis function network

    Radial basis function network

    In the field of mathematical modeling, a radial basis function network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters. Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control. They were first formulated in a 1988 paper by Broomhead and Lowe, both researchers at the Royal Signals and Radar Establishment. == Network architecture == Radial basis function (RBF) networks typically have three layers: an input layer, a hidden layer with a non-linear RBF activation function and a linear output layer. The input can be modeled as a vector of real numbers x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} . The output of the network is then a scalar function of the input vector, φ : R n → R {\displaystyle \varphi :\mathbb {R} ^{n}\to \mathbb {R} } , and is given by φ ( x ) = ∑ i = 1 N a i ρ ( | | x − c i | | ) {\displaystyle \varphi (\mathbf {x} )=\sum _{i=1}^{N}a_{i}\rho (||\mathbf {x} -\mathbf {c} _{i}||)} where N {\displaystyle N} is the number of neurons in the hidden layer, c i {\displaystyle \mathbf {c} _{i}} is the center vector for neuron i {\displaystyle i} , and a i {\displaystyle a_{i}} is the weight of neuron i {\displaystyle i} in the linear output neuron. Functions that depend only on the distance from a center vector are radially symmetric about that vector, hence the name radial basis function. In the basic form, all inputs are connected to each hidden neuron. The norm is typically taken to be the Euclidean distance (although the Mahalanobis distance appears to perform better with pattern recognition) and the radial basis function is commonly taken to be Gaussian ρ ( ‖ x − c i ‖ ) = exp ⁡ [ − β i ‖ x − c i ‖ 2 ] {\displaystyle \rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}=\exp \left[-\beta _{i}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert ^{2}\right]} . The Gaussian basis functions are local to the center vector in the sense that lim | | x | | → ∞ ρ ( ‖ x − c i ‖ ) = 0 {\displaystyle \lim _{||x||\to \infty }\rho (\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert )=0} i.e. changing parameters of one neuron has only a small effect for input values that are far away from the center of that neuron. Given certain mild conditions on the shape of the activation function, RBF networks are universal approximators on a compact subset of R n {\displaystyle \mathbb {R} ^{n}} . This means that an RBF network with enough hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary precision. The parameters a i {\displaystyle a_{i}} , c i {\displaystyle \mathbf {c} _{i}} , and β i {\displaystyle \beta _{i}} are determined in a manner that optimizes the fit between φ {\displaystyle \varphi } and the data. === Normalization === ==== Normalized architecture ==== In addition to the above unnormalized architecture, RBF networks can be normalized. In this case the mapping is φ ( x ) = d e f ∑ i = 1 N a i ρ ( ‖ x − c i ‖ ) ∑ i = 1 N ρ ( ‖ x − c i ‖ ) = ∑ i = 1 N a i u ( ‖ x − c i ‖ ) {\displaystyle \varphi (\mathbf {x} )\ {\stackrel {\mathrm {def} }{=}}\ {\frac {\sum _{i=1}^{N}a_{i}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{i=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}}=\sum _{i=1}^{N}a_{i}u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} where u ( ‖ x − c i ‖ ) = d e f ρ ( ‖ x − c i ‖ ) ∑ j = 1 N ρ ( ‖ x − c j ‖ ) {\displaystyle u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\ {\stackrel {\mathrm {def} }{=}}\ {\frac {\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{j=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{j}\right\Vert {\big )}}}} is known as a normalized radial basis function. ==== Theoretical motivation for normalization ==== There is theoretical justification for this architecture in the case of stochastic data flow. Assume a stochastic kernel approximation for the joint probability density P ( x ∧ y ) = 1 N ∑ i = 1 N ρ ( ‖ x − c i ‖ ) σ ( | y − e i | ) {\displaystyle P\left(\mathbf {x} \land y\right)={1 \over N}\sum _{i=1}^{N}\,\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\,\sigma {\big (}\left\vert y-e_{i}\right\vert {\big )}} where the weights c i {\displaystyle \mathbf {c} _{i}} and e i {\displaystyle e_{i}} are exemplars from the data and we require the kernels to be normalized ∫ ρ ( ‖ x − c i ‖ ) d n x = 1 {\displaystyle \int \rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\,d^{n}\mathbf {x} =1} and ∫ σ ( | y − e i | ) d y = 1 {\displaystyle \int \sigma {\big (}\left\vert y-e_{i}\right\vert {\big )}\,dy=1} . The probability densities in the input and output spaces are P ( x ) = ∫ P ( x ∧ y ) d y = 1 N ∑ i = 1 N ρ ( ‖ x − c i ‖ ) {\displaystyle P\left(\mathbf {x} \right)=\int P\left(\mathbf {x} \land y\right)\,dy={1 \over N}\sum _{i=1}^{N}\,\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} and The expectation of y given an input x {\displaystyle \mathbf {x} } is φ ( x ) = d e f E ( y ∣ x ) = ∫ y P ( y ∣ x ) d y {\displaystyle \varphi \left(\mathbf {x} \right)\ {\stackrel {\mathrm {def} }{=}}\ E\left(y\mid \mathbf {x} \right)=\int y\,P\left(y\mid \mathbf {x} \right)dy} where P ( y ∣ x ) {\displaystyle P\left(y\mid \mathbf {x} \right)} is the conditional probability of y given x {\displaystyle \mathbf {x} } . The conditional probability is related to the joint probability through Bayes' theorem P ( y ∣ x ) = P ( x ∧ y ) P ( x ) {\displaystyle P\left(y\mid \mathbf {x} \right)={\frac {P\left(\mathbf {x} \land y\right)}{P\left(\mathbf {x} \right)}}} which yields φ ( x ) = ∫ y P ( x ∧ y ) P ( x ) d y {\displaystyle \varphi \left(\mathbf {x} \right)=\int y\,{\frac {P\left(\mathbf {x} \land y\right)}{P\left(\mathbf {x} \right)}}\,dy} . This becomes φ ( x ) = ∑ i = 1 N e i ρ ( ‖ x − c i ‖ ) ∑ i = 1 N ρ ( ‖ x − c i ‖ ) = ∑ i = 1 N e i u ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)={\frac {\sum _{i=1}^{N}e_{i}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{i=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}}=\sum _{i=1}^{N}e_{i}u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} when the integrations are performed. === Local linear models === It is sometimes convenient to expand the architecture to include local linear models. In that case the architectures become, to first order, φ ( x ) = ∑ i = 1 N ( a i + b i ⋅ ( x − c i ) ) ρ ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{N}\left(a_{i}+\mathbf {b} _{i}\cdot \left(\mathbf {x} -\mathbf {c} _{i}\right)\right)\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} and φ ( x ) = ∑ i = 1 N ( a i + b i ⋅ ( x − c i ) ) u ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{N}\left(a_{i}+\mathbf {b} _{i}\cdot \left(\mathbf {x} -\mathbf {c} _{i}\right)\right)u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} in the unnormalized and normalized cases, respectively. Here b i {\displaystyle \mathbf {b} _{i}} are weights to be determined. Higher order linear terms are also possible. This result can be written φ ( x ) = ∑ i = 1 2 N ∑ j = 1 n e i j v i j ( x − c i ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{2N}\sum _{j=1}^{n}e_{ij}v_{ij}{\big (}\mathbf {x} -\mathbf {c} _{i}{\big )}} where e i j = { a i , if i ∈ [ 1 , N ] b i j , if i ∈ [ N + 1 , 2 N ] {\displaystyle e_{ij}={\begin{cases}a_{i},&{\mbox{if }}i\in [1,N]\\b_{ij},&{\mbox{if }}i\in [N+1,2N]\end{cases}}} and v i j ( x − c i ) = d e f { δ i j ρ ( ‖ x − c i ‖ ) , if i ∈ [ 1 , N ] ( x i j − c i j ) ρ ( ‖ x − c i ‖ ) , if i ∈ [ N + 1 , 2 N ] {\displaystyle v_{ij}{\big (}\mathbf {x} -\mathbf {c} _{i}{\big )}\ {\stackrel {\mathrm {def} }{=}}\ {\begin{cases}\delta _{ij}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )},&{\mbox{if }}i\in [1,N]\\\left(x_{ij}-c_{ij}\right)\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )},&{\mbox{if }}i\in [N+1,2N]\end{cases}}} in the unnormalized case and in the normalized case. Here δ i j {\displaystyle \delta _{ij}} is a Kronecker delta function defined as δ i j = { 1 , if i = j 0 , if i ≠ j {\displaystyle \delta _{ij}={\begin{cases}1,&{\mbox{if }}i=j\\0,&{\mbox{if }}i\neq j\end{cases}}} . == Training == RBF networks are typically trained from pairs of input and target values x ( t ) , y ( t ) {\displaystyle \mathbf {x} (t),y(t)} , t = 1 , … , T {\displaystyle t=1,\dots ,T} by a two-step algorithm. In the first step, the center vectors c i {\displaystyle \mathbf {c} _{i}} of the RBF functions in the hidden layer

    Read more →
  • Neural field

    Neural field

    In machine learning, a neural field (also known as implicit neural representation, neural implicit, or coordinate-based neural network), is a mathematical field that is fully or partially parametrized by a neural network. Initially developed to tackle visual computing tasks, such as rendering or reconstruction (e.g., neural radiance fields), neural fields emerged as a promising strategy to deal with a wider range of problems, including surrogate modelling of partial differential equations, such as in physics-informed neural networks. Differently from traditional machine learning algorithms, such as feed-forward neural networks, convolutional neural networks, or transformers, neural fields do not work with discrete data (e.g. sequences, images, tokens), but map continuous inputs (e.g., spatial coordinates, time) to continuous outputs (i.e., scalars, vectors, etc.). This makes neural fields not only discretization independent, but also easily differentiable. Moreover, dealing with continuous data allows for a significant reduction in space complexity, which translates to a much more lightweight network. == Formulation and training == According to the universal approximation theorem, provided adequate learning, sufficient number of hidden units, and the presence of a deterministic relationship between the input and the output, a neural network can approximate any function to any degree of accuracy. Hence, in mathematical terms, given a field y = Φ ( x ) {\textstyle {\boldsymbol {y}}=\Phi ({\boldsymbol {x}})} , with x ∈ R n {\displaystyle {\boldsymbol {x}}\in \mathbb {R} ^{n}} and y ∈ R m {\displaystyle {\boldsymbol {y}}\in \mathbb {R} ^{m}} , a neural field Ψ θ {\displaystyle \Psi _{\theta }} , with parameters θ {\displaystyle {\boldsymbol {\theta }}} , is such that: Ψ θ ( x ) = y ^ ≈ y {\displaystyle \Psi _{\theta }({\boldsymbol {x}})={\hat {\boldsymbol {y}}}\approx {\boldsymbol {y}}} === Training === For supervised tasks, given N {\displaystyle N} examples in the training dataset (i.e., ( x i , y i ) ∈ D t r a i n , i = 1 , … , N {\displaystyle ({\boldsymbol {x_{i}}},{\boldsymbol {y_{i}}})\in {\mathcal {D_{train}}},i=1,\dots ,N} ), the neural field parameters can be learned by minimizing a loss function L {\displaystyle {\mathcal {L}}} (e.g., mean squared error). The parameters θ ~ {\displaystyle {\tilde {\theta }}} that satisfy the optimization problem are found as: θ ~ = argmin θ 1 N ∑ ( x i , y i ) ∈ D t r a i n L ( Ψ θ ( x i ) , y i ) {\displaystyle {\tilde {\boldsymbol {\theta }}}={\underset {\boldsymbol {\theta }}{\text{argmin}}}\;{\frac {1}{N}}\sum _{({\boldsymbol {x_{i}}},{\boldsymbol {y_{i}}})\in {\mathcal {D_{train}}}}{\mathcal {L}}(\Psi _{\theta }({\boldsymbol {x}}_{i}),{\boldsymbol {y}}_{i})} Notably, it is not necessary to know the analytical expression of Φ {\displaystyle \Phi } , for the previously reported training procedure only requires input-output pairs. Indeed, a neural field is able to offer a continuous and differentiable surrogate of the true field, even from purely experimental data. Moreover, neural fields can be used in unsupervised settings, with training objectives that depend on the specific task. For example, physics-informed neural networks may be trained on just the residual. === Spectral bias === As for any artificial neural network, neural fields may be characterized by a spectral bias (i.e., the tendency to preferably learn the low frequency content of a field), possibly leading to a poor representation of the ground truth. In order to overcome this limitation, several strategies have been developed. For example, SIREN uses sinusoidal activations, while the Fourier-features approach embeds the input through sines and cosines. == Conditional neural fields == In many real-world cases, however, learning a single field is not enough. For example, when reconstructing 3D vehicle shapes from Lidar data, it is desirable to have a machine learning model that can work with arbitrary shapes (e.g., a car, a bicycle, a truck, etc.). The solution is to include additional parameters, the latent variables (or latent code) z ∈ R d {\displaystyle {\boldsymbol {z}}\in \mathbb {R} ^{d}} , to vary the field and adapt it to diverse tasks. === Latent code production === When dealing with conditional neural fields, the first design choice is represented by the way in which the latent code is produced. Specifically, two main strategies can be identified: Encoder: the latent code is the output of a second neural network, acting as an encoder. During training, the loss function is the objective used to learn the parameters of both the neural field and the encoder. Auto-decoding: each training example has its own latent code, jointly trained with the neural field parameters. When the model has to process new examples (i.e., not originally present in the training dataset), a small optimization problem is solved, keeping the network parameters fixed and only learning the new latent variables. Since the latter strategy requires additional optimization steps at inference time, it sacrifices speed, but keeps the overall model smaller. Moreover, despite being simpler to implement, an encoder may harm the generalization capabilities of the model. For example, when dealing with a physical scalar field f : R 2 → R {\displaystyle f:\mathbb {R} ^{2}\rightarrow \mathbb {R} } (e.g., the pressure of a 2D fluid), an auto-decoder-based conditional neural field can map a single point to the corresponding value of the field, following a learned latent code z {\displaystyle {\boldsymbol {z}}} . However, if the latent variables were produced by an encoder, it would require access to the entire set of points and corresponding values (e.g. as a regular grid or a mesh graph), leading to a less robust model. === Global and local conditioning === In a neural field with global conditioning, the latent code does not depend on the input and, hence, it offers a global representation (e.g., the overall shape of a vehicle). However, depending on the task, it may be more useful to divide the domain of x {\displaystyle {\boldsymbol {x}}} in several subdomains, and learn different latent codes for each of them (e.g., splitting a large and complex scene in sub-scenes for a more efficient rendering). This is called local conditioning. === Conditioning strategies === There are several strategies to include the conditioning information in the neural field. In the general mathematical framework, conditioning the neural field with the latent variables is equivalent to mapping them to a subset θ ∗ {\displaystyle {\boldsymbol {\theta }}^{}} of the neural field parameters: θ ∗ = Γ ( z ) {\displaystyle {\boldsymbol {\theta }}^{}=\Gamma ({\boldsymbol {z}})} In practice, notable strategies are: Concatenation: the neural field receives, as input, the concatenation of the original input x {\displaystyle {\boldsymbol {x}}} with the latent codes z {\displaystyle {\boldsymbol {z}}} . For feed-forward neural networks, this is equivalent to setting θ ∗ {\displaystyle {\boldsymbol {\theta }}^{}} as the bias of the first layer and Γ ( z ) {\displaystyle \Gamma ({\boldsymbol {z}})} as an affine transformation. Hypernetworks: a hypernetwork is a neural network that outputs the parameters of another neural network. Specifically, it consists of approximating Γ ( z ) {\displaystyle \Gamma ({\boldsymbol {z}})} with a neural network Γ ^ γ ( z ) {\displaystyle {\hat {\Gamma }}_{\gamma }({\boldsymbol {z}})} , where γ {\displaystyle {\boldsymbol {\gamma }}} are the trainable parameters of the hypernetwork. This approach is the most general, as it allows to learn the optimal mapping from latent codes to neural field parameters. However, hypernetworks are associated to larger computational and memory complexity, due to the large number of trainable parameters. Hence, leaner approaches have been developed. For example, in the Feature-wise Linear Modulation (FiLM), the hypernetwork only produces scale and bias coefficients for the neural field layers. === Meta-learning === Instead of relying on the latent code to adapt the neural field to a specific task, it is also possible to exploit gradient-based meta-learning. In this case, the neural field is seen as the specialization of an underlying meta-neural-field, whose parameters are modified to fit the specific task, through a few steps of gradient descent. An extension of this meta-learning framework is the CAVIA algorithm, that splits the trainable parameters in context-specific and shared groups, improving parallelization and interpretability, while reducing meta-overfitting. This strategy is similar to the auto-decoding conditional neural field, but the training procedure is substantially different. == Applications == Thanks to the possibility of efficiently modelling diverse mathematical fields with neural networks, neural fields have been applied to a wide range of problems: 3D scene reconstruction: neural fields can be used to model t

    Read more →
  • Multilayer perceptron

    Multilayer perceptron

    In deep learning, a multilayer perceptron (MLP) is a kind of modern feedforward neural network consisting of fully connected neurons with nonlinear activation functions, organized in layers, notable for being able to distinguish data that is not linearly separable. Modern neural networks are trained using backpropagation and are colloquially referred to as "vanilla" networks. MLPs grew out of an effort to improve on single-layer perceptrons, which could only be applied to linearly separable data. A perceptron traditionally used a Heaviside step function as its nonlinear activation function. However, the backpropagation algorithm requires that modern MLPs use continuous activation functions such as sigmoid or ReLU. Multilayer perceptrons form the basis of deep learning, and are applicable across a vast set of diverse domains. == Timeline == In 1943, Warren McCulloch and Walter Pitts proposed the binary artificial neuron as a logical model of biological neural networks. In 1958, Frank Rosenblatt proposed the multilayered perceptron model, consisting of an input layer, a hidden layer with randomized weights that did not learn, and an output layer with learnable connections. In 1962, Rosenblatt published many variants and experiments on perceptrons in his book Principles of Neurodynamics, including up to 2 trainable layers by "back-propagating errors". However, it was not the backpropagation algorithm, and he did not have a general method for training multiple layers. In 1965, Alexey Grigorevich Ivakhnenko and Valentin Lapa published Group Method of Data Handling. It was one of the first deep learning methods, used to train an eight-layer neural net in 1971. In 1967, Shun'ichi Amari reported the first multilayered neural network trained by stochastic gradient descent, was able to classify non-linearily separable pattern classes. Amari's student Saito conducted the computer experiments, using a five-layered feedforward network with two learning layers. Backpropagation was independently developed multiple times in early 1970s. The earliest published instance was Seppo Linnainmaa's master thesis (1970). Paul Werbos developed it independently in 1971, but had difficulty publishing it until 1982. In 1986, David E. Rumelhart et al. popularized backpropagation. In 2003, interest in backpropagation networks returned due to the successes of deep learning being applied to language modelling by Yoshua Bengio with co-authors. In 2021, a very simple NN architecture combining two deep MLPs with skip connections and layer normalizations was designed and called MLP-Mixer; its realizations featuring 19 to 431 millions of parameters were shown to be comparable to vision transformers of similar size on ImageNet and similar image classification tasks. == Mathematical foundations == === Activation function === If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a two-layer input-output model. In MLPs some neurons use a nonlinear activation function that was developed to model the frequency of action potentials, or firing, of biological neurons. The two historically common activation functions are both sigmoids, and are described by y ( v i ) = tanh ⁡ ( v i ) and y ( v i ) = ( 1 + e − v i ) − 1 {\displaystyle y(v_{i})=\tanh(v_{i})~~{\textrm {and}}~~y(v_{i})=(1+e^{-v_{i}})^{-1}} . The first is a hyperbolic tangent that ranges from −1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1. Here y i {\displaystyle y_{i}} is the output of the i {\displaystyle i} th node (neuron) and v i {\displaystyle v_{i}} is the weighted sum of the input connections. Alternative activation functions have been proposed, including the rectifier and softplus functions. More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models). In recent developments of deep learning the rectified linear unit (ReLU) is more frequently used as one of the possible ways to overcome the numerical problems related to the sigmoids. === Layers === The MLP consists of three or more layers (an input and an output layer with one or more hidden layers) of nonlinearly-activating nodes. Since MLPs are fully connected, each node in one layer connects with a certain weight w i j {\displaystyle w_{ij}} to every node in the following layer. === Learning === Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron. We can represent the degree of error in an output node j {\displaystyle j} in the n {\displaystyle n} th data point (training example) by e j ( n ) = d j ( n ) − y j ( n ) {\displaystyle e_{j}(n)=d_{j}(n)-y_{j}(n)} , where d j ( n ) {\displaystyle d_{j}(n)} is the desired target value for n {\displaystyle n} th data point at node j {\displaystyle j} , and y j ( n ) {\displaystyle y_{j}(n)} is the value produced by the perceptron at node j {\displaystyle j} when the n {\displaystyle n} th data point is given as an input. The node weights can then be adjusted based on corrections that minimize the error in the entire output for the n {\displaystyle n} th data point, given by E ( n ) = 1 2 ∑ output node j e j 2 ( n ) {\displaystyle {\mathcal {E}}(n)={\frac {1}{2}}\sum _{{\text{output node }}j}e_{j}^{2}(n)} . Using gradient descent, the change in each weight w i j {\displaystyle w_{ij}} is Δ w j i ( n ) = − η ∂ E ( n ) ∂ v j ( n ) y i ( n ) {\displaystyle \Delta w_{ji}(n)=-\eta {\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}y_{i}(n)} where y i ( n ) {\displaystyle y_{i}(n)} is the output of the previous neuron i {\displaystyle i} , and η {\displaystyle \eta } is the learning rate, which is selected to ensure that the weights quickly converge to a response, without oscillations. In the previous expression, ∂ E ( n ) ∂ v j ( n ) {\displaystyle {\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}} denotes the partial derivate of the error E ( n ) {\displaystyle {\mathcal {E}}(n)} according to the weighted sum v j ( n ) {\displaystyle v_{j}(n)} of the input connections of neuron i {\displaystyle i} . The derivative to be calculated depends on the induced local field v j {\displaystyle v_{j}} , which itself varies. It is easy to prove that for an output node this derivative can be simplified to − ∂ E ( n ) ∂ v j ( n ) = e j ( n ) ϕ ′ ( v j ( n ) ) {\displaystyle -{\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}=e_{j}(n)\phi ^{\prime }(v_{j}(n))} where ϕ ′ {\displaystyle \phi ^{\prime }} is the derivative of the activation function described above, which itself does not vary. The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is − ∂ E ( n ) ∂ v j ( n ) = ϕ ′ ( v j ( n ) ) ∑ k − ∂ E ( n ) ∂ v k ( n ) w k j ( n ) {\displaystyle -{\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}=\phi ^{\prime }(v_{j}(n))\sum _{k}-{\frac {\partial {\mathcal {E}}(n)}{\partial v_{k}(n)}}w_{kj}(n)} . This depends on the change in weights of the k {\displaystyle k} th nodes, which represent the output layer. So to change the hidden layer weights, the output layer weights change according to the derivative of the activation function, and so this algorithm represents a backpropagation of the activation function.

    Read more →
  • Cartesian genetic programming

    Cartesian genetic programming

    Cartesian genetic programming is a form of genetic programming that uses a graph representation to encode computer programs. It grew from a method of evolving digital circuits developed by Julian F. Miller and Peter Thomson in 1997. The term ‘Cartesian genetic programming’ first appeared in 1999 and was proposed as a general form of genetic programming in 2000. It is called ‘Cartesian’ because it represents a program using a two-dimensional grid of nodes. Miller's keynote explains how CGP works. He edited a book entitled Cartesian Genetic Programming, published in 2011 by Springer. The open source project dCGP implements a differentiable version of CGP developed at the European Space Agency by Dario Izzo, Francesco Biscani and Alessio Mereta able to approach symbolic regression tasks, to find solution to differential equations, find prime integrals of dynamical systems, represent variable topology artificial neural networks and more.

    Read more →