Q301- Connectionism
I. Introduction
A. Review of models
1. What do models do? They try to simulate a very complex system.
2. The system responds in a certain way to a certain input: when we see a light we push a button (assuming sufficient motivation).
3. The model tries to predict when we press the button in response to the light.
4. Why study models?
a. System is too complex to study directly
b. Instead we develop a simplified model of the system
c. Study that instead.
5. So a model takes some input and based on that input responds in a certain way.
B. Previous models:
1. Previous models provided a more general description of what happens during processing:
a. e.g. Atkinson and Shiffrin: detection, decision, response
b. ACT*: Execution of productions based on a match made between contents of working memory and a rule in production memory.
C. Connectionism gives more detailed explanation of how information from input is transformed to make a response (the output).
II. Intro to parallel distributed processes
A. Networks differ from other models in that the information is spread out across multiple pathways
1. Transformed along the way.
2. Take input in, transform it according to the rules of the network
3. Output pattern is said to represent some abstraction
4. For example, you might show a network a picture of a dog. It might convert that image into something it understands (a computer picture) and then transform that picture until it outputs some pattern that corresponds to the concept of "dog".
III. In-class demo of connectionist network: The Wedge
This is a metaphor for connectionism. It is not exact, because in connectionism we deal with numbers, not verbal information.
A. Play classic concentration
B. People in front row can see only under one square of game
C. They try to guess what puzzle is, or just pass along any information they can get
1. e.g. I see an "eye" or "it looks like part of a d"
2. They give this information to one person in the second row.
3. Person in the second row takes this information and makes their own guess as to what the puzzle says: "is it Last Tango in Paris?"
4. I know the answer, and I tell the person in the second row. If you are that person, once you find out how you did, you figure out who was most responsible for giving you good information: you compare what each front-row person said with whether it was right, and then adjust how much you are going to listen to them in the future.
5. E.g. if person at the end guessed "Tango" then you realize that they are pretty smart or are looking at a valuable section of the board. Thus you pay more attention to them in the future.
6. We continue this, with the same input pattern (game board) after erasing everyone's memory for the previous time. However, we don't erase how much you are going to listen to each person.
7. Eventually you realize that persons 1, 4 and 7 are smart and are looking at valuable sections of the board and you listen only to them and not to the others.
8. This describes how an input pattern can be distributed across multiple nodes, and how learning can take place.
9. It doesn't explain how we represent the input and output patterns.
IV. Architecture of a network
A. Four types of components (draw fig 6.3 top):

1. Input units: the initial representation of the dog.
2. Output units: the part of the network that is going to produce the pattern that corresponds to the "dog" concept.
3. Connections: Things that connect input units to output units. Each has a weight associated with it that describes how much the output nodes use information from the associated input unit.
a. Similar to exam weighting: Your final grade doesn't use much information from exam 1, because it is only 20% of your grade.
4. Activation values
a. If on input units, the value of the input at each input node
b. If on output units, the value of the output at each output node (unit).
B. How they work (draw fig 6.3 bottom)

1. Input value: value between 0 and 1. When we have lots of input units these together will represent an input pattern
2. Connection weight. Value that we weight the activation from the input node by. These can change, and when they do it represents learning.
3. Output value: number that will represent the response of the system (arbitrarily defined by the experimenter).
4. If input is 0.8 and weight is 0.3, then output is 0.8 x 0.3 = 0.24. An input of 0.8 means something, and an output of 0.24 may or may not mean something. Don't worry right now about how we assign meanings to these numbers (it's arbitrary anyway).
V. Simple networks: How information flows through them
A. Two-input network (fig 6.4)

B. Output calculated from sum of input values weighted by connection weights.

C. Equivalent representation: lines (fig 6.4 c)

D. Multiple output nodes (fig 6.5 b)
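The weighted-sum rule in B, and its extension to several output nodes in D, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `forward` and the two-input weights below are made up for the example and are not taken from the figures.

```python
def forward(inputs, weights):
    """Return one activation per output unit.

    weights[o][i] is the connection weight from input unit i to output unit o.
    Each output is the sum of the input values weighted by its connections.
    """
    return [sum(a * w for a, w in zip(inputs, row)) for row in weights]


# One input, one output: the 0.8 x 0.3 example from section IV.
single = forward([0.8], [[0.3]])

# Two inputs feeding two output units (weights invented for illustration).
outputs = forward([0.8, 0.5], [[0.3, 0.2],
                               [0.1, 0.4]])
print(single, outputs)
```

Each row of `weights` belongs to one output unit, so adding an output node just means adding a row.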


VI. Example of simple network:
A. Simple xor problem (dartnet) (use "tom" versions of files)
B. Truth table from logic:
C. XOR logic operator: responds "0" if the two inputs are the same and "1" if they are different
Input 1 | Input 2 | Output value
0 | 0 | 0
1 | 0 | 1
0 | 1 | 1
1 | 1 | 0
D. Can we get a network to learn how to do this operator?
E. All we tell it is what output values are associated with which input value sets
F. This is not cheating, because we aren't telling the network how to transform the input values to achieve the correct output value
G. Just tell it what the correct answer is.
H. Do network in Dartnet (Clear activations first, show graph window first)
0 is gray, 1 is black
I. Test network with different values.
J. Try the OR problem (actually easier): any activation at all in the input should produce an output of 1.
K. What have we done? We established a problem that we wanted the network to learn, gave it some structure to learn it with, told it what the right answer was in each case.
a. It went through and used a (to-be-described) learning rule to adjust the weights so that given each input the math would work out to give the correct output.
b. This continued for the four different patterns until the model had learned all four
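To make "the math would work out to give the correct output" concrete, here is a small sketch of a network whose weights already solve XOR. Several things here are assumptions for illustration only: the weights are picked by hand rather than learned, the network uses two hidden threshold units, and nothing is implied about the weights or architecture Dartnet actually ends up with.

```python
def step(x, threshold):
    """Threshold unit: fires (1) when its summed input exceeds the threshold."""
    return 1 if x > threshold else 0


def xor_net(in1, in2):
    # Hidden unit 1 behaves like OR: fires if at least one input is on.
    h_or = step(1.0 * in1 + 1.0 * in2, 0.5)
    # Hidden unit 2 behaves like AND: fires only if both inputs are on.
    h_and = step(1.0 * in1 + 1.0 * in2, 1.5)
    # Output unit: excited by the OR unit, inhibited by the AND unit.
    return step(1.0 * h_or - 1.0 * h_and, 0.5)


truth_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```

A learning rule has to discover some set of weights with this property on its own; we only tell it the right answer for each of the four input patterns.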
VII. What does this do for us?
A. We know that the brain works via distributed representations:
1. It has to: individual neurons are too slow to do rapid processing on their own
2. Parallel distributed representations are resilient to damage
3. Can deal with partial information
VIII. Example 2: Little Red Riding Hood
Feature | Grandma | Woodcutter | Wolf
Friendly | 1 | 1 | 1
Big Eyes | 0 | 1 | 1
Big Teeth | 0 | 0 | 1
Wrinkles | 1 | 1 | 0
Tall | 0 | 1 | 0
Big Nails | 0 | 0 | 1
Laughs | 1 | 1 | 0
A. Train on all 3
B. Reset
C. Train on no-wood
D. Test on Wood- says is grandma
E. Train on wood- don't reset first
F. Grandma is forgotten
G. Do for young grandma- still does OK with partial info.
H. Video from CNN
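Steps A through F can be roughly imitated with a very small program. This is a sketch under loud assumptions: it uses a single layer of linear units trained with the delta rule from the Learning section (the in-class software may well use something different), one output unit per character, a learning parameter of 0.1, and 500 passes through the training patterns. The feature values come from the table above.

```python
FEATURES = ["friendly", "big eyes", "big teeth", "wrinkles",
            "tall", "big nails", "laughs"]
PATTERNS = {
    "grandma":    [1, 0, 0, 1, 0, 0, 1],
    "woodcutter": [1, 1, 0, 1, 1, 0, 1],
    "wolf":       [1, 1, 1, 0, 0, 1, 0],
}
NAMES = list(PATTERNS)  # one output unit per character (an assumption)


def respond(weights, inputs):
    """Weighted sum of the inputs for each output unit."""
    return [sum(a * w for a, w in zip(inputs, row)) for row in weights]


def train(weights, names, lr=0.1, epochs=500):
    """Delta-rule training on the named patterns; changes weights in place."""
    for _ in range(epochs):
        for name in names:
            inputs = PATTERNS[name]
            outputs = respond(weights, inputs)
            for o, out_name in enumerate(NAMES):
                error = (1.0 if out_name == name else 0.0) - outputs[o]
                for i, a in enumerate(inputs):
                    weights[o][i] += a * lr * error
    return weights


def guess(weights, name):
    """Name of the output unit that is most active for this character."""
    outputs = respond(weights, PATTERNS[name])
    return NAMES[outputs.index(max(outputs))]


def fresh_weights():
    return [[0.0] * len(FEATURES) for _ in NAMES]  # "reset"


# A. Train on all three characters.
all_three = train(fresh_weights(), NAMES)
guesses_all = [guess(all_three, n) for n in NAMES]

# B-D. Reset, train without the woodcutter, then test on the woodcutter.
weights = train(fresh_weights(), ["grandma", "wolf"])
first_guess = guess(weights, "woodcutter")

# E-F. Train on the woodcutter alone without resetting, then test on grandma.
train(weights, ["woodcutter"])
later_guess = guess(weights, "grandma")
print(guesses_all, first_guess, later_guess)
```

In this sketch the untrained woodcutter is called grandma because he shares most of her features, and training on the woodcutter alone then pulls grandma's response toward the woodcutter unit, which is one simple way to see how new learning can overwrite old.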
IIX. Physiological examples
A. Backprop example from neural book
IX. Learning (skim 178-188)
A. Delta rule learning
consider simple network:

desired:
1 0
1. The network learns by getting feedback from an all-knowing teacher (Tom)
2. This makes it not really biologically plausible
3. Each time the network processes an input pattern, it produces an output pattern. It compares this output pattern to the correct pattern. The difference between the two is the error for output unit o:
error_o = target output - actual output
4. The weights are changed by an amount that is the product of three numbers:
a. the activation of the input unit: a_i
b. a learning parameter (a constant value of about 0.5): lr
c. the error at the output unit: error_o
5. This gives the change in the weight from input unit i to output unit o:
CW_io = a_i * lr * error_o
6. The new values for the weights are computed by:
New W_io = Old W_io + CW_io
7. This continues through all patterns until the error_o values get close to zero (which implies no change in the weights, by the equation in 5 above).
8. This is the delta rule. Backprop (backpropagation) extends the same error-driven idea to networks with hidden units.
9. Go through example on Excel spreadsheet
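The three-factor update in steps 4 through 6 can be traced for a single connection. This is a minimal sketch with assumed numbers: an input activation of 0.8, a starting weight of 0.3, a target output of 1.0, and the learning parameter of 0.5 mentioned above. It is not the spreadsheet example.

```python
def delta_update(weight, a_i, target, lr=0.5):
    """One delta-rule step for a single connection."""
    actual = a_i * weight          # output of the unit
    error = target - actual        # error = target output - actual output
    change = a_i * lr * error      # weight change = activation * lr * error
    return weight + change, error  # new weight = old weight + change


weight = 0.3
errors = []
for _ in range(20):  # present the same pattern 20 times
    weight, error = delta_update(weight, a_i=0.8, target=1.0)
    errors.append(error)
print(round(errors[0], 3), round(errors[-1], 6), round(weight, 3))
```

The error shrinks on every presentation, and as it approaches zero the weight changes shrink with it, which is why learning stops once the outputs are correct.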
B. Hebbian learning
1. Problems with delta-rule learning: we don't have a teacher in our heads
2. Instead use a self-directed learning rule
3. Based on a physiological property: neurons that fire together, wire together.
4. So if an input unit and an output unit are both highly active, the network increases the connection strength (the weight) between the two units.
5. The increase is in proportion to the product of the joint activation
a. so if input unit i has activation 0.5 and output unit o has activation 0.5, then increase the weight between them by:
CW_io = a_i * a_o * lr
where a_i and a_o are the input and output activations and lr is the learning parameter again (controls how fast we learn)
b. Change the weight between the input and output units by:
New W_io = Old W_io + CW_io
6. Problems with Hebbian learning: it has trouble learning patterns that are similar to each other (correlated input patterns). It gets confused when trying to decide which is which.
7. Not very powerful: the range of problems it can learn is limited.
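The Hebbian update in step 5 is short enough to write out directly. A minimal sketch, assuming a learning parameter of 0.5 and a starting weight of zero (both chosen for illustration):

```python
def hebbian_update(weight, a_i, a_o, lr=0.5):
    """Strengthen a connection in proportion to the joint activation."""
    change = a_i * a_o * lr   # weight change = input act. * output act. * lr
    return weight + change    # new weight = old weight + change


# Both units moderately active (0.5 each): the weight grows a little.
w_both = hebbian_update(0.0, a_i=0.5, a_o=0.5)
# Input active but output silent: no change.
w_silent = hebbian_update(0.0, a_i=0.5, a_o=0.0)
print(w_both, w_silent)
```

Unlike the delta rule, no target output appears anywhere in the function, so no teacher is needed.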
X. Word Superiority effect: pages 120-125.
A. Effect: ask subjects to say letter that appears above a bar
1. either show letter alone or letter as part of a word
2. letter alone takes longer to recognize. Why?
3. It can't just be that the letter is flanked by other letters (e.g. more letters help you attend to a given location): there is no speedup if the letter is in a non-word.
4. Other letters in a word place constraints on what the letter can be
e.g. wor_ implies that the last letter is something like d, k, or m, not a or b, etc.
5. However, if you do the experiment like this, the person may be able to guess the letter just by seeing the other letters. We could get the word superiority effect even if they didn't see the last letter!
6. Control for this with a two-alternative forced choice between two valid letters (e.g. k or d);
the person has to choose the correct one.
7. We've eliminated the effects of using the letters to help guess the last letter, but we still get the word superiority effect
a. person must be getting some information from the letter, but also must be using information from the other letters.
B. McClelland & Rumelhart's Neural network model
1. Three levels

2. Inhibition within a level: when a node starts becoming active, it tends to suppress the other nodes. The rich get richer.
3. E.g. if one letter, like "k", becomes more dominant, it tends to suppress the alternatives, like "d".
4. The letter level also receives feedback from the word level, both excitatory and inhibitory.

5. Let's say the word presented is "trap" and the person then sees tr_p and has to choose between "i" and "a". The "a" letter node starts becoming active, as do the "t" "r" and "p" letter nodes. However, the "i" letter node doesn't become very active.
6. Because the "t" "r" "a" and "p" letter nodes are becoming active, the "trap" node becomes active as well. As it becomes active it does two things:
a. It excites the "t" "r" "a" and "p" letter nodes with what's called top down or feedback activation.
b. Other words are also becoming active at the word node level. For example, "trip" is also partially active because "t" "r" and "p" letter nodes are activating it. However, the "trap" node is more active and thus it begins to inhibit the "trip" node. Winner take all.
c. Because the "trap" node is the most active node at the word level, and because it helps activate the "t" "r" "a" and "p" letter nodes, activation on these letter nodes increases quickly: they are getting activation from both the feature level (bottom up) and the word level (top down).
d. to perform the letter identification task, we just look at the activation at the letter node level, so whichever has the most activation (the "i" node or the "a" node) is how we respond.
7. How does this work when only a single letter is presented?
a. still get bottom up activation from feature nodes to letter nodes.
b. however, word nodes are never activated, and so we don't get any top-down assistance from the word node level
c. Since the activation is only in one direction (bottom up) we don't accrue activation in the "a" node as fast as we would in the case where "a" is presented in the word "trap"
d. This predicts the word superiority effect
8. The word superiority effect is predicted by a model that assumes bottom-up activation from the feature level and top-down activation from the word level.
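The contrast between steps 6 and 7 can be shown with a toy simulation. This is a heavily simplified sketch, not McClelland and Rumelhart's model: it tracks only the "a" letter node and the "trap" word node, leaves out inhibition and the competing "trip" node, treats "t", "r" and "p" as already fully active, and uses gain values (0.1 and 0.2) and a recognition threshold (0.9) that are invented for illustration.

```python
def steps_to_recognize(in_word, threshold=0.9, max_steps=100):
    """Count update cycles until the "a" letter node reaches threshold."""
    letter_a = 0.0   # activation of the "a" letter node
    word_trap = 0.0  # activation of the "trap" word node
    for step in range(1, max_steps + 1):
        bottom_up = 0.1              # feature level -> letter level
        top_down = 0.2 * word_trap   # word level -> letter level (feedback)
        new_letter = min(1.0, letter_a + bottom_up + top_down)
        if in_word:
            # "t", "r", "p" (assumed fully active, summarized as 1.0)
            # together with "a" excite the "trap" word node.
            word_trap = min(1.0, word_trap + 0.2 * (1.0 + letter_a))
        letter_a = new_letter
        if letter_a >= threshold:
            return step
    return max_steps


steps_in_word = steps_to_recognize(in_word=True)
steps_alone = steps_to_recognize(in_word=False)
print(steps_in_word, steps_alone)
```

With the letter alone, the word node never becomes active, so the only source of activation is bottom up and the letter node takes more cycles to reach threshold.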
XI. Applications of Neural Networks
A. Have become an important buzzword: e.g. Terminator 2
B. Predicting takeovers
1. A company's stock goes up when it gets taken over
2. Lots of financial information about a company
a. cash, future earnings predictions
b. P/E ratios
c. huge databases of this stuff for all kinds of companies.
3. Some of these companies get taken over. It would be nice to own this stock before this happens
4. How can we predict which companies get taken over?
5. Develop a huge network: need an input node for each piece of financial information we can get: cash on hand, stock price, etc.
6. Have a bunch of hidden units- just so the network has the ability to learn complex relationships between different indicators
7. Have one output unit that is active if the company was/will be taken over
8. Train network on previous data- include all companies, and set the desired output to zero if the company wasn't taken over, and 1 if it was (and stock went up).
9. Could train on one half of sample, test on second half
10. Doesn't have to be 100% accurate to give you an edge in the market.
C. Determining cancer causing agents
1. Scientists often go on "fishing expeditions" looking for cancer-causing agents
2. Collect lots of data about what people eat and drink, how much they exercise, etc.
3. Have lots of different cancer types that are recorded.
4. Use diet and exercise data as input, cancer rates as output
5. Train network on lots of people
6. See if network can predict cancer in patients in the future
7. Examine weights to determine the relationship between variables
8. E.g. red wine may make up for eating a fat-filled diet (not true, actually)
D. Satellite photos: detecting tanks
1. Way too much information from satellites for people to view
2. Train computers to look for tanks, ships etc.
3. Apocryphal story:
a. Trained network on photos of tanks
b. Also trained network on photos without tanks
c. Tested network on new pictures
d. It was a dismal failure. Why?
e. Tank pictures taken on sunny day
f. Non-tank pictures taken on cloudy day.
g. Easiest thing for the network to learn is that if it's sunny, there are tanks
E. A tight fit between connectionism and neurophysiology: Hinton's Lesioned Network model of Acquired Dyslexia
"A major attraction of connectionist models is that they perform computations by means of processing units that behave on broadly similar principles to those used in the brain. Their appeal would be strengthened if it could be shown that analogous operations on the elements of the two systems lead to similar consequences for the behavior of the overall systems. An obvious candidate operation is the effect of lesions. We take a relatively simple connectionist model and show that the lesioned model has properties that are similar to those already described in certain neurological patients."
1. Train a network to extract the features of letters, form letters and words, and then pronounce them
2. Look at two forms of dyslexia in people
a. Surface dyslexia: confusion at visual level
b. When presented with "Walk" people might say "Talk"
c. Also get Deep dyslexia: confusion at semantic level
d. when presented with "walk" people might say "stroll" or something similar
3. Dyslexia is an umbrella term for many different disabilities, but the two that are described above are thought to result from a lesion in the information-processing pathway
4. To study this, first program and train a neural network to read words (like the word-superiority effect model, only one that can talk).
5. Once it is working, go in and "lesion" parts of the network by randomizing some of the connection weights.
6. Now look at the errors produced by various lesions.
7. Do get visual confusions (surface dyslexia) for lesions in one location
8. Get semantic confusions (deep dyslexia) for lesions in another location
9. Lesions in a third location produce both surface and deep errors
10. All three patterns are similar to those seen in humans
11. Interestingly, the lesion responsible for visual confusion is actually quite late in the process
a. You would think that it would be as a result of lesions quite early on
b. Corresponds to locations seen in humans
c. Helps researchers understand why those lesions have those effects
d. We can study the model a lot more easily than we can study humans (we know all the weights).
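The "lesion" operation in step 5 (randomizing some of the connection weights) might look like the sketch below. Everything specific here is an assumption for illustration: the function name `lesion`, the choice to draw replacement weights uniformly from -1 to 1, and the placeholder 4 x 4 weight matrix, which is not a trained reading network.

```python
import random


def lesion(weights, fraction, rng):
    """Return a copy of the weight matrix with a fraction of weights randomized."""
    damaged = [row[:] for row in weights]  # leave the intact network untouched
    positions = [(r, c) for r, row in enumerate(weights) for c in range(len(row))]
    n_lesioned = int(len(positions) * fraction)
    for r, c in rng.sample(positions, n_lesioned):
        damaged[r][c] = rng.uniform(-1.0, 1.0)  # replace with a random weight
    return damaged


rng = random.Random(0)  # fixed seed so the same "lesion" can be repeated
intact = [[5.0] * 4 for _ in range(4)]  # placeholder weights, not a trained network
damaged = lesion(intact, fraction=0.5, rng=rng)
changed = sum(d != w
              for d_row, w_row in zip(damaged, intact)
              for d, w in zip(d_row, w_row))
print(changed)
```

Because the intact weights are kept, the same network can be lesioned in different locations and the resulting error patterns compared, which is much harder to do with patients.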
XII. Summary of connectionism
A. Looked at some simple connectionist models
1. Describe exactly how information is represented and transformed
2. Distributed, which is biologically plausible
3. Seems like an intuitive model of actual neurons.
B. What they are good at:
1. recognizing patterns and forming abstract representations of those patterns
2. recognizing patterns in the presence of noise
3. recognizing partial patterns
4. distributes the representation and so can deal with "lesions"
C. Problems
1. may not be biologically plausible (much of human learning is self-directed)
2. Take a long time to learn large sets of patterns
3. researchers are designing specific models for each task; the brain cannot be as specialized.