At 8:00 the speaker asks how you would make cameras more like the human vision system. I think this is already being done. Saccades are similar to the practice of stitching images together to form a larger image that's independent of the eye's focal area. So if you wanted to make a camera similar to the eye, you would build in an image-stitching ability and leave the camera always on, always stitching from microsecond to microsecond.

One interesting result is that when you move the camera (or, in a human, the viewer's eye) you can determine distance, and in fact it becomes very easy to map the border between an object and its background: from one time frame to the next, a near object's border reveals only slightly more information as you see it from a slightly different angle, but the far-away object behind it changes dramatically as new features are revealed that were hidden by the closer foreground object. You could further learn to identify objects in 3D space by the nature of the irregularities that occur when stitching while moving the point of view.

In my opinion, trying to identify a physical object such as a toaster in a kitchen, and deciding where the toaster starts and which pixel is part of the kitchen, is almost impossible from static images without first developing a 3D model of the toaster based on a stitched, moving point of view. So if you're going to train a visual processing system to recognize objects, you should first train it to stitch images together. Then train it to recognize the irregularities in stitching that occur as a result of moving the point of view, so it can identify objects as discrete in 3D space and isolate them as concepts from the background image. Once you have a visual processor well trained at recognizing that the toaster is physically separate from the kitchen background or the blender behind it, it becomes easier to decide whether the object is a toaster.

In practice people use these techniques for recognizing things daily, and people are prone to move their point of view (head) around to see the other sides of a small object such as a stick: seeing it from multiple angles while moving the eyes to a new location and point of view helps them understand what is part of the stick in question and what is part of the branch behind it. Detecting movement caused by a change in viewpoint is key to recognizing objects, and detecting movement when the viewpoint does not change also has a very noticeable effect on recognition. It's no accident that people trying to camouflage themselves remain motionless: lack of movement makes an object much harder to recognize.
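Here's a rough sketch of how that border-from-parallax idea might look with an off-the-shelf optical-flow routine. This is only a sketch of the concept, assuming OpenCV and NumPy; the input file names are placeholders, not anything from the talk:

```python
# Minimal sketch: with a small camera translation, nearby pixels shift more
# than distant ones (motion parallax), and sharp jumps in that shift mark
# occlusion boundaries between a foreground object and its background.
import cv2
import numpy as np

frame_a = cv2.imread("view_left.png", cv2.IMREAD_GRAYSCALE)   # hypothetical inputs:
frame_b = cv2.imread("view_right.png", cv2.IMREAD_GRAYSCALE)  # same scene, camera moved slightly

# Dense optical flow between the two viewpoints (Farneback method).
flow = cv2.calcOpticalFlowFarneback(frame_a, frame_b, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Flow magnitude is a rough inverse-depth cue: big shift = near, small shift = far.
parallax = np.linalg.norm(flow, axis=2)

# Depth discontinuities (object vs. background) show up as steep gradients in
# the parallax map -- the "stitching irregularities" described above.
gy, gx = np.gradient(parallax)
boundary_strength = np.hypot(gx, gy)
occlusion_edges = boundary_strength > np.percentile(boundary_strength, 95)

cv2.imwrite("occlusion_edges.png", occlusion_edges.astype(np.uint8) * 255)
```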
@SteveWhetstone-UXDesigner · 9 years ago
Another interesting upshot of stitching multiple temporally displaced images is that you naturally build a sort of 3D model of the world. It's perspective-based and not so good for measuring things, but it's great for navigating in a fuzzy way and for understanding the physical location of things in relation to other things in the environment. In theory, if you want to map a 3D space such as your apartment, this can be accomplished quite easily by turning on your stitching (and stitching-irregularity-learning) camera and simply walking around the apartment with it. When you go around a corner, for example, you not only establish a vague distance to the corner and what the hallway looks like, you also build a representational map of the space.

Edges are places where predictions of what the eye sees next come into play. When you move the point of view while looking at a toaster, the edge pixels violate the prediction that applies to far-away objects: what your eye or camera perceives for things that are close is less predictable under head movement than what it perceives for things that are far away. If you're looking at a distant mountain, moving your head has almost no effect on the pixels reaching your eye in a stitched image of your environment. But if you're looking at a toaster 2 feet from your face, moving your head 2 feet to the left has a profound effect on the stitched image your eye is using to predict what it will perceive. So close-up items are more manipulable than far-away objects, and more interesting.

The brain has certain observed inherent traits, such as novelty seeking, and this is primarily an outcome of a feedback loop that optimizes our perception and attention to correspond with the movement of our eyes or body. Novelty in vision is defined as our visual perception not matching our expectations: literally, the pixels in the eye are not what was expected; instead of white, it's blue. So human physiology has some built-in solutions, or solutions learned at an early age, for handling predictions that don't match reality. The first rule seems to be to give violations of prediction increased attention and salience: if the pixels reaching the eye don't match the expectation, the eye focuses on and pays more attention to those pixels. Novelty seeking is an emergent property of the base goal of prediction. The study of composition in the arts supports this; Rudolf Arnheim describes these areas as focal points in books such as Art and Visual Perception: A Psychology of the Creative Eye and The Power of the Center: A Study of Composition in the Visual Arts.
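To put a rough number on the distance claim above, here's a back-of-the-envelope pinhole-camera calculation. It's only an illustration; the focal length in pixels is an assumed value:

```python
# For a sideways head/camera move of b metres, a point at depth Z shifts by
# roughly f * b / Z pixels (simple pinhole model).
f_px = 1000.0      # assumed focal length in pixels
baseline_m = 0.6   # move the viewpoint ~2 feet to the left

for label, depth_m in [("toaster at 2 ft", 0.6),
                       ("wall at 5 m", 5.0),
                       ("mountain at 5 km", 5000.0)]:
    shift_px = f_px * baseline_m / depth_m
    print(f"{label}: ~{shift_px:.2f} px of image shift")

# toaster at 2 ft:  ~1000 px  (the stitched prediction is completely violated)
# wall at 5 m:      ~120 px
# mountain at 5 km: ~0.12 px  (essentially no change, as described above)
```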
@SteveWhetstone-UXDesigner · 9 years ago
So how does the brain predict what pixels the eye will see when you move your head to see a toaster from a different angle? Well, it's very hard, obviously, and little babies can't recognize a toaster very well because they first have to learn to stitch eye saccades from a stationary point so they can predict what a pixel on the retina will see when they move their eye. After that, babies get very excited about moving around, because it's a way for them to better predict how their movement will affect what the pixels in the eye will be (green or red). Babies get excited by looking around corners, playing peek-a-boo, and hiding things behind other things; they spend a while learning to map 3D space before they learn object permanence, which is that objects exist even when you can't see them, such as when you hide the toaster behind something else. Along with object permanence, babies learn to appreciate things that move as novelty, so object permanence and independent movement are probably developed concurrently.

With object permanence, the question comes up: how do you know that what you saw before the curtain covered the object is the same as what you see after the curtain is raised? Predicting that your eye's pixels will see the same stitched representation of reality after the curtain is raised to reveal a toy requires some understanding and ability to distinguish one object from another. This is a hard thing for babies to learn and can take weeks or months' worth of hours of perceptual training.

So finally there's a gestalt learning principle that understands what a blender is. How did we get there? It starts with a lot of data and huge sparse data structures, and then most of them get pruned, which is observed in brain development. The data structures that are most effective at predicting the pixels the eye will perceive from moment to moment, or from one location of the eyes to another, are the ones that don't get pruned.
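Here's a toy sketch of that pruning idea, nothing like a real cortical model: an oversized pool of random predictors is scored on how well each one predicts the next observation, and the worst half is pruned. Everything here is illustrative NumPy, not Numenta's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_predictors, dim = 200, 16

# Each "predictor" is just a random linear guess at the next observation.
predictors = rng.normal(size=(n_predictors, dim, dim)) * 0.1
scores = np.zeros(n_predictors)

x = rng.normal(size=dim)
for _ in range(500):
    nxt = np.roll(x, 1) + 0.01 * rng.normal(size=dim)  # a toy world with simple structure
    preds = predictors @ x                              # each predictor's guess, shape (n_predictors, dim)
    scores -= np.linalg.norm(preds - nxt, axis=1)       # smaller error -> higher score
    x = nxt

# Prune the worst half: predictors that fail to anticipate the next "pixels"
# are removed; the ones that predict well survive.
keep = scores >= np.median(scores)
predictors = predictors[keep]
print(f"kept {keep.sum()} of {n_predictors} predictors")
```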
@SteveWhetstone-UXDesigner · 9 years ago
Eventually the visual system develops very high-level learning. In graphic design and marketing, for example, we have a rule that when you show a photo of a person, you pick the photo so that the person's face is looking toward whatever you want the viewer to look at immediately after they look at the face. Faces get a special place in the brain, as we know from brain anatomy, and that's because faces, as a high-level abstract concept, have high predictive ability for what the pixels of the eye are going to perceive next. On a more concrete and observable level, we study this in advertising: we always put the picture of a face looking toward the text, and we call this drawing the eye toward the text. If you look at good compositions in classic paintings, the faces and other elements are structured to plan the movement of the eye; they use a crude, non-mathematical model of how the eye moves in relation to the pixels currently presented to it, and that model could be used to verify our training of an intelligence. For faces, the rule is that when people see a face, they move their eyes to follow the line of sight of the person in the photo, to see what that person is looking at. Line of sight is a way of finding novelty, which is a way of finding the most predictive elements for what the pixels in the eye will perceive next.

It's interesting to note that dogs can also understand pointing with the hand or arm, but cats and most other animals only understand line of sight, or don't even use the line-of-sight principle at all. Almost all mammals understand that being stared at is somehow significant for predicting what they will see, hear, smell, or feel in the future, and they tend to stare back. In language we "feel" a stare and describe it as cold or hot, so being stared at is primarily connected to predicting how we will feel physically and is less a predictor of what we will see or hear. So there's a limit to how far you can go in understanding human emotions such as cold or warm, angry or gentle, benevolent or malevolent, by only attempting to predict visual pixels. At a guess, it would be hard to train an SDA to recognize emotion or "feel" emotion without some sort of tactile sensor. An AI without tactile sensation might have great difficulty understanding the concept of anger or pleasure.
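As a toy illustration of that line-of-sight rule, here's a sketch that biases a saliency map along an assumed gaze direction from an assumed face position. Both inputs are made up for the example; detecting the face and estimating its gaze are separate problems not handled here:

```python
import numpy as np

h, w = 480, 640
saliency = np.zeros((h, w))

face_xy = np.array([150.0, 200.0])   # assumed face centre (x, y)
gaze_dir = np.array([1.0, 0.2])      # assumed gaze direction (toward the text/product)
gaze_dir /= np.linalg.norm(gaze_dir)

ys, xs = np.mgrid[0:h, 0:w]
offsets = np.stack([xs - face_xy[0], ys - face_xy[1]], axis=-1)
dist = np.linalg.norm(offsets, axis=-1) + 1e-6

# Cosine between each pixel's offset and the gaze direction: 1 directly ahead
# of the face, -1 behind it. Only locations roughly along the line of sight
# get boosted, with a mild falloff for distance.
alignment = (offsets @ gaze_dir) / dist
saliency += np.clip(alignment, 0, None) / np.sqrt(dist)

y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
print(f"predicted next fixation: ({x}, {y})")
```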
@SteveWhetstone-UXDesigner · 9 years ago
+pineapple head I looked at your video uploads. You're doing much more technical work than I can follow, and I'm not sure I can give good advice or even follow what you're saying. I stopped doing advanced math theory and programming a long time ago, so I'm basing my comments on a rough heuristic understanding of how HTM and the Numenta system function. Mostly I'm good at conceiving pseudo-algorithms that mimic childhood brain development, like learning "object permanence" from simple rules and environmental experience.

At a guess, I'd say from looking at your setup that there's a difficulty if you don't have the AI in charge of moving the point of view. Most of what early learning consists of in the human visual system at birth is learning to focus, and making a connection between moving the eye or head to the left and how that affects the visual construct (the stitched image). The stitched image is evaluated on its ability to predict the pixels in the field of view when the eye rotates or moves to scan the environment. At this point in life the brain is really focused on just predicting what the individual rods and cones (pixels) of the eye will see when the eye rotates half a degree to the left. In order to get really good at predicting what's outside the visual field, the eye learns to focus on key points like the bend in a line, which signals a change in pattern. An intersection of two lines is also a focal point that is more useful (salient) for predicting the pixels around it. So the computer controlling the scrolling or the focus point of the camera should develop a preference for seeking out graphical focal points: places where lines converge, corners, the edges of objects, or movement. It develops this preference because those are the places where the predictive content of the pixels is best suited for predicting what the eye will see when it rotates left, right, up, or down.

Also, it's important to note that the eye, and all learning, seems attracted to things that are just at the edge of the current pattern-recognition ability. So focal points and attention will shift as the system learns, continually seeking novel or slightly unexpected changes that represent opportunities for improved prediction of what the eye will see. At first just a corner or a bend in a line is novel, but then any prediction that is 100% right all the time gets boring, so the eye looks for movement. Movement is tough because it could mean so many things and at first has very low predictive ability. At first movement just means that things can move; then, as learning progresses, movements are broken down into "point-of-view movements" and "object movements," with point-of-view movements being interesting only for what they reveal. Object movements can stay interesting forever, depending on the nature of the object moving (for example, a person entering the field of view with something that looks like a spoon of baby food can have early predictive ability even if the concepts of spoon and person aren't well defined in the learned representation).

Well, I could be a long way off from being helpful with the above guess, so I hope it helps, but please forgive my presumption if it's just clueless and not relevant.
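A minimal sketch of that "seek out focal points" preference, using a standard corner detector as a stand-in for the learned salience described above. Assumes OpenCV; the input file name and the camera-control call at the end are hypothetical:

```python
import cv2
import numpy as np

frame = cv2.imread("current_view.png", cv2.IMREAD_GRAYSCALE)  # placeholder input

# Harris corner response: high where lines bend or intersect, low on flat regions.
response = cv2.cornerHarris(np.float32(frame), blockSize=2, ksize=3, k=0.04)

# Under this heuristic, the most "interesting" pixel is the strongest corner.
y, x = np.unravel_index(np.argmax(response), response.shape)
print(f"next fixation target: ({x}, {y})")

# In a live system you would then re-centre the camera on (x, y), e.g.
# pan_tilt_unit.move_to(x, y)   # hypothetical controller API
```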
@SteveWhetstone-UXDesigner · 9 years ago
+pineapple head You're right. What I'm doing is just idle speculation, because I find it an interesting puzzle, and sometimes ideas click and seem to make sense and I want to share. I don't think I'll ever make money or benefit from this, so it's purely an entertainment conversation for me and doesn't get the investment of time that you've put in. Running a Numenta instance on my spare computer would take a more major investment of my time and still wouldn't be likely to pay off for my career focus, which is user experience design / graphic design / web design and development. I appreciate people like you who do the harder (and for me, less intrinsically rewarding) work of getting your hands dirty with the code and hardware and making it a major hobby in terms of time investment. I'm more of a very deep-thinking generalist who ties a lot of fields of study and interest together, but I don't dive too deep into committing my resources to general AI development. Most of the time I'm writing or blogging about the use of linguistic framing in race discussions or political campaign rhetoric, opining on user-centric design forums, or arguing about the merits of regulating Bitcoin, etc. So I'm more of a JOAT (jack of all trades) and an integrator of fields of knowledge.
@InnovumTechnology · 10 years ago
Jeff, have you considered that the basal ganglia might be doing the work you think L5 might be doing?
@magnuswootton6181 · 1 year ago
Very exciting talk! Very good speaker!
@johnragin3 · 1 year ago
At 10:25 you say you don't model spikes. So do you then believe that the electric fields generated by the brain are passive by-products? (This from Scientific American.)
@magnuswootton6181 · 1 year ago
It's probably not an exact simulation; they have to change it to get it to work at least half acceptably on the computer.