I’m going to give a very brief overview of what I mean when I say the word “AI.” When I say “artificial intelligence,” what am I talking about? Because it can mean a lot of things. I’m going to give some examples of AI in the world, and I’m going to give some very specific recommendations for things to look at where AI impinges upon the world of libraries and the information professions, publishing and elsewhere.
And then I’m going to go over some of my concerns and conclusions about where we should be thinking, what should our five-year, ten-year outlook be, what should we be worried about? So we’re going to start with just the state of artificial intelligence. When I say artificial intelligence, I’m using it very broadly to encompass things like machine learning and deep learning and neural nets and the entirety of the spectrum of what we technically mean when we say the words AI.
And I’m also talking about a sort of very narrow type of AI. For those of you that may be imagining Terminators and Skynet, that is not what I’m talking about. I’m talking about sort of weak AI rather than corresponding strong AI. Strong AI is usually understood sort of as general intelligence, something that is capable of reasoning like a human. That’s not what I’m talking about at all. I don’t think we’re there, I’m not sure we’ll ever be there.
What I’m talking about is what’s understood as sort of weak or applied AI, where we have a machine that is trained to do one thing very, very well, usually better than humans are capable of doing that particular thing, but if you ask it to do something else, it falls down and falls over. If you asked IBM’s chess-playing AI to play Go, it would not be able to do that, because those are two very separate sorts of endeavors.
And so when I talk about AI, I’m using sort of weak AI, something that is trained on a specific corpus with specific data to do a particular thing. One of the things that is different about AI, machine learning, and technology in general from some of the other areas that we talk about in information science on a daily basis is that AI and machine learning are heavily driven by advances in technology, and technology doesn’t change in a linear fashion, technology changes in an exponential fashion.
So as it gets faster and better, it gets faster and better. And the way that I like to frame this when I do talks is that, if you think about the moment we are in right now, this is as bad as AI will ever be. It will only ever get better from here. And it will only ever get cheaper, and it will only ever get easier to do. It will never get harder, it will never get worse. That’s what exponential means. And it only ever gets faster and faster. That curve is not, again, straight. It’s a hockey stick.
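To make the linear-versus-exponential contrast concrete, here is a minimal sketch. The starting value, the step size, and the doubling factor are illustrative assumptions, not measurements of any real technology.

```python
# Toy comparison: linear growth adds a fixed amount each period,
# exponential growth multiplies by a fixed factor each period.
# All numbers here are made up for illustration.

def linear_growth(start, step, periods):
    """Values after 0..periods steps when a constant is added each time."""
    return [start + step * t for t in range(periods + 1)]

def exponential_growth(start, factor, periods):
    """Values after 0..periods steps when the value is multiplied each time."""
    return [start * factor ** t for t in range(periods + 1)]

linear = linear_growth(start=1, step=1, periods=10)
exponential = exponential_growth(start=1, factor=2, periods=10)

for t, (lin, exp) in enumerate(zip(linear, exponential)):
    print(f"period {t:2d}: linear={lin:3d}  exponential={exp:5d}")
```

After ten periods the linear series has reached 11 while the doubling series has reached 1024, which is the hockey-stick shape described above.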
And so the changes that we have seen over the last five to seven years in machine learning, how it has gotten so much better, so quickly – well, buckle up, because it only gets faster. So what are some of the active, modern, current uses of AI that I think we should be aware of, thinking about, and taking advantage of every day? There’s a number of them, and pretty much everyone in here probably has an AI in their pocket, right? How many people in here have a smartphone in their pocket right now, or in your hand if you’re tweeting? Right, right, like almost everyone has a smartphone.
If you have a smartphone, a modern smartphone, it has a dedicated AI engine inside of it that is working every time you put it in your pocket to organize things like your photos. So this is an example of my photo library, where I searched for the word “beach.” It gave me these pictures, not because someone has applied metadata to this set of photos with the word “beach” in one of the fields, but because the AI has been trained on a corpus of millions and millions of photos to look for beach-like things, and when it looked at these pictures, they matched certain criteria for “beach-like” and it returned these to me.
So this is not the result of any sort of manual classification, this is all happening automatically, it’s all happening automatically every day, all the time, in your pockets. So this is the sort of AI that we are all living with every day that we may not think about as machine learning and AI.
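As a rough sketch of the idea, searching by what a model predicts rather than by human-applied metadata might look like the following. The file names, the label scores, and the `search_photos` helper are all hypothetical; on a real phone the scores would come from a trained image model, not a hand-written table.

```python
# Hypothetical label scores that an image classifier might assign to each photo.
# These values are invented for illustration; no real model produced them.
photo_scores = {
    "IMG_001.jpg": {"beach": 0.94, "dog": 0.03, "mountain": 0.01},
    "IMG_002.jpg": {"beach": 0.08, "dog": 0.88, "mountain": 0.02},
    "IMG_003.jpg": {"beach": 0.71, "dog": 0.40, "mountain": 0.05},
    "IMG_004.jpg": {"beach": 0.02, "dog": 0.01, "mountain": 0.97},
}

def search_photos(query, scores, threshold=0.5):
    """Return photos whose score for `query` meets the threshold, best match first."""
    matches = [(name, labels.get(query, 0.0)) for name, labels in scores.items()]
    matches = [(name, score) for name, score in matches if score >= threshold]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in matches]

beach_photos = search_photos("beach", photo_scores)
print(beach_photos)
```

Nobody typed the word “beach” into a metadata field here; the search only consults the scores, and the threshold of 0.5 is an arbitrary choice for the sketch.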
That’s a fairly trivial example of AI. But it becomes very, very important. Even something as trivial as image analysis becomes incredibly important when you think about things that are being done with AI in medical diagnoses. This is a quote from Geoffrey Hinton at the University of Toronto: “If you work as a radiologist you are like Wile E. Coyote in the cartoon. You’re already over the edge of the cliff and you just haven’t looked down. There’s no ground underneath, it’s completely obvious that in five years deep learning is going to do better than radiologists. It might be ten years.” So he wrote this, this is from a New Yorker article in 2017, two years ago, almost three years ago.
And it should come as no surprise that we are already seeing results where AI trained to recognize cancerous cells (melanomas, carcinomas, breast cancer, etc., anything that relies on imagery) is doing so at much higher rates than human experts. So how many of you are at universities? How many of your universities have radiology programs? Awesome. You should probably worry about that.
Again, this was March of 2019. There’s a lot of radical studies showing that AIs are better at image-based diagnoses. As a matter of fact, they are in general better at diagnoses of all types, once they are trained. This is a story, “AI equals human experts in medical diagnoses.” And this is general symptomology, not just epidemiology, not necessarily cancers.
There’s a lot of this sort of thing that is going on in the world now. Now, this is a happy, positive example; there are lots of really terrible, awful examples, and those may be the ones that often people think of when they think of AI. AI is only as good as the data that is used to train it. You can ask any question you want, but unless the data that you have has the answers wrapped up inside them, you aren’t going to get what you want.
And if the data is biased, then you’re going to get exactly the opposite of the sorts of answers that you want. And there are lots of examples in our modern world where AI is doing exactly this sort of thing. How many people in here knew that there were AI-based systems that judges in the U.S. use in order to determine sentencing guidelines for people? A number of you. What trained this AI? What data did they use? Historical sentencing data, which is absolutely fair and unbiased in every way, right? No, the historical sentencing data in the U.S. is awful.
It’s racist, it is a terrible set of things that you would use to actually determine the length of time someone should be incarcerated. And yet we are sort of willingly using historical data in order to determine modern sentencing rates for people that are in the situation in the U.S. We have examples of big companies making similar sorts of data mistakes. Two of my favorites: Google, very early in their Photos experiment – Google Photos is a lot better now than it used to be, but very early on, it was classifying African American individuals as gorillas. Why? Because they used bad data and did not train it appropriately, and because they did not look at the results of their own work. It was gross oversight for this to happen.
And my own personal favorite example: Microsoft determined that they were going to do an experiment in AI a few years back, this was about four years ago, five years ago. They created an AI that they wanted to learn how to act like a human, to behave like a human, talk like a human online, and they thought, “Well, we need people to feed it data to, like, teach it how to be human.” So they took this bot and they put it on Twitter, because that’s, again, a perfectly rational and reasonable place to teach something to be a human.
And so this bot, “Tay,” took about three hours to become a white supremacist, Neo-Nazi, just horrible, awful, evil thing, because it learned from the data that people gave it. And when people said horrible things to it, it incorporated that as part of its personality and it parroted those things back. Data matters. AI can also be used as a tool specifically for bad things.
We are all in this room certainly aware of the misinformation, disinformation campaigns that have been regularly going on on our internet for a few years now. AI can be used to enhance and incorporate evidence that is entirely false, and yet seems to support a point of view. Let’s do a test. So AIs at this point can create humans from whole cloth, they can create facial structures of humans, entirely false and fake ones, but they look real.
So a show of hands – one of these is an AI-generated face and one of these is a human. So how many people think the male is AI-generated? How many people think the female is AI-generated? You’re all wrong, they’re both AI-generated. Neither of them is real.
These are entirely fake people. As are these. They’re entirely fake people. When I started talking about AI several years ago, this sort of work to create the actual generation of a fake person took a lot of horsepower.
You had to have specialized rigs, you had to have access to high-powered servers, you had to have high bandwidth data, you had to really sort of have an infrastructure in place in order to do this sort of work.
Now it is a website. Literally a website. If you go to thispersondoesnotexist.com you can create entirely fictional people for use in your advertising. And it has become fake people as a service at this point. You can, through the website Generated Photos, download 100,000 fake people for use in your advertising, promotional material, websites, whatever you’re doing.
The ability to create fake people, the ability to create fake voices – Google and others have AIs that can disassemble and reassemble phonemes at this point so that you can, through just a few minutes of audio, make an individual say anything you want them to, so you can take my voice and make me say horrible, awful things.
You can do the same to a politician, all through the use of some very, very simple AI tools. And you can combine the two, visual and audio, in order to get something called a deep fake, where you have entirely fictional videos being produced with recognizable human faces on them that, you know, you could make Mark Zuckerberg do anything you’d like in a video these days if you want. So yay, what a brave new world. What does this mean, though, for our world? It’s only going to get better.
Yeah, it’s only going to get better. That’s right, that’s right. No fear, no fear, only going to get better. So what does this mean for our world? For libraries, for research, for scholarly publishing, for kind of all of this that we live inside? Well, there’s a lot going on right now. And I’m going to point out one specific tool, and then I’m going to rush through a whole bunch of other tools and try to paint a picture for you.
The one tool I wanted to point out in specific is, as far as I’m aware, the only one built by a librarian at a library for public consumption. This was built by a librarian named Andromeda Yelton when she was still at MIT. HAMLET stands for “How about Machine Learning Enhanced Theses?” This is an entirely machine learning driven discovery engine for electronic dissertations and theses at MIT. Uses no subject headings, uses no human-generated metadata, anything like that, this is entirely using machine learning driven semantic analysis of those digital goods.
And it is miraculous. It is a completely different beast than the sorts of discovery systems that we have had in the past. Quick example of how it is different: the first time I used this, the first time I had a chance to test it, the first thing I do with any system is trying to break it because that’s what you do with systems, is try to break them. So one of the things you can do with this is actually upload text of your own, so you could copy and paste a paragraph that you thought was particularly interesting, and it would find semantically similar works.
So I fed it the entire book Peter Pan by J.M. Barrie. I figured, “MIT theses are probably going to choke on this, they don’t have anything,” you know, what are they going to do? Give me rocketry for flying? I don’t know, like, what is the match going to be? So I gave it Peter Pan and just said, “Show me what’s similar.” And it gave me all of the creative writing theses from MIT, of which there aren’t many, because it’s MIT and the creative writing department is, you know, not awesome. Not a large corpus there.
But the machine learning system was smart enough to realize that Peter Pan is a work of fiction, and that other things in this corpus that most relate to it are other works of creativity, rather than rocketry or other engineering topics. Fascinating.
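To give a feel for “paste in some text and get back the most similar works,” here is a toy sketch. It is not how HAMLET is built: it compares simple word counts with cosine similarity, whereas a system like the one described would rely on learned representations of meaning. The corpus, the document names, and the query are all made up.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# A tiny, invented stand-in for a collection of theses.
corpus = {
    "rocket_thesis": "thrust nozzle rocket engine combustion thrust analysis",
    "fiction_thesis": "the boy and the pirate sailed to an island of lost children",
    "network_thesis": "packet routing latency network protocol analysis",
}

def most_similar(query, documents):
    """Rank document names by similarity to the query text, most similar first."""
    query_vector = word_counts(query)
    return sorted(
        documents,
        key=lambda name: cosine_similarity(query_vector, word_counts(documents[name])),
        reverse=True,
    )

query = "a boy who never grows up fights a pirate on an island with the lost children"
ranking = most_similar(query, corpus)
print(ranking[0])
```

Word overlap is a crude proxy; the point is only that the ranking comes from the text itself, with no subject headings or human-generated metadata involved.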
So you should take a look. The rest of the tools that I want to whiz through all have to do with a sort of disassembling of the research process – libraries, publishers, etc., learning management systems. We all sort of have in our head this research process that individuals undertake as a part of their learning – you find a topic, you search for that topic, you locate sources, you evaluate them, you make notes, you write the paper, you cite it, you proofread it, you edit it.
There’s a sort of process of scholarly creation that happens. And a while back, I sat down and thought, “How many of these are done by AI right now?” All of them. Every single one. So you have AIs like CARA AI, wizdom.ai, Diffeo AI, Iris.ai. All of these are search engines that you give a topic to and it does a search, an analysis, a retrieval, and a set of citations for you, all automatically.
No interface for you. AI takes care of it; all you do is give it your rough idea of the topic you want. All right, so now you’ve got your sources, you want to take notes and actually write the thing. Okay, well there’s an AI called Resoomer that will take a look at all of your papers and automatically summarize them for you, give you just the high points, tell you what the important pieces are, give you the thesis statement, kind of break it all apart for you.
Scholarcy does the same sort of thing, it creates flashcards from a document to give you just the key facts about what it was that you fed it. You’ve got AI Writer, which gives you, again, unique content, it takes topics and/or germs of stories that you give it and generates entirely AI-written papers. You’ve got EssayBot, which does the same sort of thing, which has my favorite tagline of any of these: “Finish your essay today! No plagiarism!” That’s my favorite, like, you have an AI writing for you but no plagiarism!
That’s the important part. Yeah, anyway. EssayBot, again, you feed it stuff, it collates the information, rips the information apart, rewrites it for you, gives you entirely original content based upon what you gave it. And then you have, again, Articoolo, which is my favorite example for a tech name that has never been part of a romance language, that does the same thing, you give it research and it gives you a paper out the other side.
Proofreading and editing. Probably somebody in here is using Grammarly, because it’s an incredibly popular service to do grammar check and proofreading and everything like that. It is all driven on the back end by a machine learning system that does English grammar and sentence structures. Writing Assistant, “the most powerful writing improvement software in the world,” also machine learning driven, doing corrections on your grammar.
So more or less every aspect, every single part of a traditional research project, you could outsource entirely to an AI. Now that would be horrible right now, right? Nobody expects that to come out the other end and be like a masterpiece of scholarship. But as I said, right now is the worst this will ever be. And while right now it is garbage to do that process beginning to end, in five years, it’s going to be better.
And in ten years, it might be good enough to fool us. And that’s something that we need to think about. The thing that I am anticipating and that I expect to come very, very shortly is a sort of end-to-end solution that someone is going to patch together. I did say earlier that I was interested in sort of the weak AI, just the individual little AI trained to do one thing. And that’s true. But if you chain those together, if you put a series of weak AIs all together that can take something from beginning to end and pass it from one to the other, it is more or less the same as a stronger AI system.
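The chaining idea can be sketched as a simple pipeline in which each stage does one narrow job and hands its result to the next. Every function below is a hypothetical placeholder that only manipulates strings; none of them calls a real search, summarization, or writing service.

```python
from functools import reduce

# Each function stands in for one narrow, single-purpose tool.
def find_sources(topic):
    """Pretend search step: returns made-up source titles for the topic."""
    return {"topic": topic, "sources": [f"Paper about {topic} #{i}" for i in (1, 2, 3)]}

def summarize(state):
    """Pretend summarization step: one note per source."""
    state = dict(state)
    state["notes"] = [f"Summary of {source}" for source in state["sources"]]
    return state

def draft(state):
    """Pretend writing step: stitches the notes into a draft."""
    state = dict(state)
    state["draft"] = " ".join(state["notes"])
    return state

def proofread(state):
    """Pretend proofreading step: tidies whitespace and adds a final period."""
    state = dict(state)
    state["draft"] = state["draft"].strip() + "."
    return state

def chain(*stages):
    """Compose stages so the output of each one feeds the next."""
    return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)

research_assistant = chain(find_sources, summarize, draft, proofread)
result = research_assistant("library discovery")
print(result["draft"])
```

No single stage knows anything about the others; the end-to-end behavior comes entirely from the order in which they are chained.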
And so these personal AI assistants are things that I think are coming very, very soon. AI is dependent on data. But if you capture a student very early in a learning management system, something like that, even as early as high school, then you’re going to have data on them – on their interests, on their research projects – as they move through college into graduate school and finally into a sort of research professorship or research institutions, you’re going to have a decade of data. And I say you – we might not, but someone will.
Libraries might not be able to do that, but somebody is. And that decade worth of research information is going to allow that individual to have a personal AI assistant that is far more in tune with what they do and what they want than any sort of reference librarian interaction might be. They’re going to have historical data that just is an incredible amount of stuff. It worries me that this sort of project is definitely coming and yet libraries haven’t thought about how to sort of get around it.
And it may not be one of our traditional vendors, it may not be one of the vendors here in the room. Amazon is certainly making a play for this sort of thing. These are the big boys that are playing in this space. Amazon has an Alexa skill for educational technology applications, including Coursera, Canvas, Blackboard, and all of the sort of learning management systems that you may have on your campus. And if this is integrated through something like Alexa, then Amazon is scarfing all that up. And they’re going to have longitudinal data that we could only dream of.
So conclusions. I don’t really have conclusions, I have concerns. I have things that I worry about. I worry about privacy implications. I worry about privacy implications because AI necessitates data, and if we are going to play in the pool that is AI, we have to have water. And that worries me because any data is dangerous data. I worry about the historical record of automation, moving funding from labor to capital.
Traditionally when things are automated, you lose individuals and you gain things, you gain capital. As AI increases and we offload responsibilities to it, I worry about the chance that we’re losing personnel and we’re gaining this outsourced thing. I worry about us repeating the mistakes of history, because in the same way that judicial systems rely on historical data, any sort of search retrieval discovery engines that we build will have some degree of historical data in them, and if we only move in with the data we have and aren’t very careful about it, we may simply be codifying old procedures and old biases.
I worry that AIs are in many ways black boxes, that we have data on the front end and answers on the back, and some of the stuff in the middle is real fiddly and un-understandable by us. When we start offloading that sort of decision-making out of a library and out to a vendor or out to a corporation like Amazon, I worry that that black box nature of it allows for fiddling in ways of information flows that don’t make me comfortable at all.
I worry about the externalization leading to ethical decisions being made in those spaces and not in our spaces, not in libraries, but in situations where the corporation is making what we would consider an ethical decision. That bothers me. I worry that the focus on these agents, on these AI agents, only increases the sort of filter bubble aspect of our informational resources. We’ve seen this filter bubble, especially in our political discourse, over the last several years.
If you only ever get the things you’re interested in, you only ever get the things you know. And that bubble only increases and strengthens. And so I worry that AI-driven discovery and/or personal assistants only drive that sort of filter bubble effect. I worry about incentives that are placed before vendors and corporations in a world where discovery and consumption are data-driven. We have a historical precedent in the world of news and journalism, and how that has been disincentivized in so many ways over the last ten years.
I worry that other pieces of the information ecosystem may follow. And in conclusion, Roy Amara was an American researcher and scientist. He said, “We tend to overestimate the effect of technology in the short run and underestimate it in the long.” It is entirely possible that I’m overestimating right now, but AI and the effects of these AI systems are not going to go away. We are going to feel them for a long time.
What I’m going to do is talk about AI, as I said, in the context of my institution, for reasons that I’ll explain. And really what I want to convey is a sense that AI exists across the academic landscape. It’s no longer rooted entirely in computer science, we find it in the fine arts and the humanities and elsewhere. And from a library perspective, we need to support research and education and present our core library services with all of that in mind.
So how I’m going to get there is talk a little bit about AI, frame that in the context of the work we’ve been doing the last couple of years around open science, say some words about AI as it has evolved at Carnegie Mellon, the education and research world that we find ourselves supporting, some of the library activity we’re seeing, and hopefully get through that in time to allow for a few questions.
When I dig around looking for images for presentations, I find I get two sorts for AI. One is the more algorithmic, mapping-type stuff.
When you try and do this for libraries you get things like this, or this. Slightly more ripped, maybe. But there’s always the sense that there’s some sort of great brain thing going on. And the more you dig around, the more you realize that it’s difficult to get AI images that don’t reflect the intelligence. It’s all about the brain and technology. And as I said, it’s impossible to get anything that isn’t really reflecting the sense of supreme intelligence.
I had to get one Brexit joke in. Thursday is going to be a big day. But that’s maybe the best outcome. As I said, I’m not going to talk about AI as it applies to libraries in the abstract or, indeed, about AI and higher education. Jason has brought us a great book. Joseph Aoun’s book on AI and higher education really is a fantastic read.
But I do want just to make some framing remarks about how I view AI. And I really like the way that Oracle has presented this, this evolution from artificial intelligence through to machine learning and on to deep learning as the more contemporary approach. I’d argue that machine learning in our space, driven by massive-scale computation of correlations supported by high-powered computers, is most relevant to us.
It’s about extracting minute but highly relevant correlations from massive datasets, not about computers trying to behave as humans do. I don’t have time to articulate an argument on this in any depth, but I hope we can agree that, properly done, however this plays out, it is good to have a different intelligence to ours working alongside us. We argue this in the diversity space, we argue this in the AI space, as well – diversifying the views that come together to help us form insights and make decisions.
In the next decade I think we’ll see a growing focus on how AI can help liberate us, just as the machine did in the Second Industrial Revolution. William Ross Ashby coined the term “amplifying intelligence,” sometimes referred to as “cognitive augmentation,” in his An Introduction to Cybernetics, published in 1956, and that was one of the earliest works about AI as a discipline.
Recently – I said I wasn’t going to talk about Elsevier, but here I go, there’s no getting away from it. Recently, Elsevier produced a very helpful overview of AI as a discipline and showed the huge growth in AI research over the past 60 years. But what it points to is the growth over the past five years of scholarly output in this field, fueled in part by a huge investment in research in AI in China, but also reflected in growth in Europe and in this country.
And it’s nice to really understand that AI is not a discipline in itself, but really is a collation of a variety of different fields. If we look at this growth, we see that the volume of papers over the last 20 years has ticked up, and as I said, the past five years really have illustrated that. But what we’ve seen is that machine learning, which was kind of in the middle of the pack 20 years ago, has emerged as the leading field from the research perspective, alongside neural networks. And that perhaps is indicative of where the academy is focusing its interests at this moment in time. I’ll come back to that shortly.
But at this point, I do want to make a few remarks about open science. And this is where I begin to talk about some of the stuff we’re doing at Carnegie Mellon. It’s perhaps helpful to frame this in the context of the OCLC University Futures report. We were ranked one of the ten most research-intensive in their typology of institutions. And I do that just to illustrate the volume of research activity on campus, not just amongst the faculty, but indeed amongst our undergraduate students, many of whom are publishing in top-tier journals during their bachelor’s degrees.
We’ve invested heavily in open science infrastructure in recent years. To me that really is about delivering scientific excellence at scale. Machines will play a key role in delivering a new generation of services for authors, reviewers, and editors built upon the principles of open science. And we’re already seeing a lot of interplay between the products of research made open and computational science. In essence, machine learning, the dominant component of AI research today, needs data.
And the more open that data are, the more activity the machine learning researchers can engage in. I realized we really had crossed the Rubicon with open science in our libraries at CMU when a couple of my colleagues created a LibGuide.
That’s the test – when they’ve got a LibGuide, it’s a real thing. And if you want to understand more about how we’re doing this, please do visit that site. Our institutional repository KiltHub has started to become an integral part of our support for AI activity. To demonstrate replicability and consistency, the algorithms, explanations, data training sets should be archived by the institution. And that is something that our repository architecture has been explicitly configured to support.
AI can only be as good as the ecosystem services provided by the data management that helps our researchers retrieve and interrogate data. An example of this from the neuroscience space at CMU was the BOLD5000 fMRI dataset that is being used by our AI researchers but had been generated by our neuroscience researchers as part of their own work, and you can read that news story online.
Much of our work has taken its inspiration from OCLC’s model of the evolving scholarly record, and I don’t have time to explain that in depth. But our service model on the left, and the recognition of the open science tools from our colleagues in Germany on the right, really illustrate this rethink about what a library is, and the recognition that gone are the days when researchers would come through our doors to use our collections.
Our collections have moved to the cloud, and we have had to build our services and the tools we deliver to researchers around their workflows, what Lorcan Dempsey describes as the “inside-out library.” Just going to let the photo op take place. So we’ve begun to dissolve many of our traditional services and think about how we can take our offerings into the lab, into the researcher’s office, into the classroom. Cliff mentioned the executive roundtable from this morning when he spoke a few moments ago, and many of the remarks that you’ll see documented in his report really reflect the culture that we are experiencing.
I spoke with my colleague Huajin Wang at this conference last year, and our detailed views on open science can be seen in that presentation. We’ve had a couple of open science symposia run by the University Libraries at CMU over the past couple of years. We just released the videos of this year’s symposium on YouTube on Friday; if you’re interested in finding out more about our work there and how it’s playing into the AI and machine learning space, do please have a look at that. Gary Price told me earlier he’d put this onto INFOdocket, so that’s another way to find the link to that.
So AI at Carnegie Mellon. Just to touch back on the Elsevier report for a moment: whilst some of the Chinese powerhouses really are dominating the world in terms of scholarly output, in the United States, the five institutions at the bottom, you can see that I’ve predictably called out Carnegie Mellon as the most productive in terms of scholarly articles. Not many people think about that, they think about places east and west, but Carnegie Mellon is the most productive in terms of scholarly articles.
And our engagement in that place can be tracked back to the 1950s. In 1955, three professors at Carnegie Mellon – Allen Newell, Herb Simon, who’s pictured there, and Cliff Shaw – wrote the first program deliberately engineered to mimic the problem-solving skills of human beings, and they are credited with having developed the first artificial intelligence program, even before the term AI had been coined.
Early computer users at Carnegie Mellon gravitated to questions of human and computer logic, what Herb Simon, who won the Nobel Prize for his work, termed “intelligences, artificial and natural,” and which he investigated through observations of students interacting with logic puzzles. Out of this grew the university’s reputation, leading to the formation of our machine learning department around about 20 years ago, and more recently, a strong investment in deep learning. I’m just mimicking the Oracle model here. If you’re interested in our deep learning work, there’s a great series of lectures on YouTube from those working in that space. As we think through the situation of having great depth in AI and related research, some of my colleagues have developed this model, the AI Stack, which is trying to show how we are deploying artificial intelligence research across campus and how strategic investment decisions are being made.
The idea here is that we are working in an environment where there is so much expertise that we don’t need to be skilled in every area. The intent is to focus on one area and call upon others for help. And we have individual departments focused on most of the themes called out in the stack, such as our machine learning department, our social and decision sciences department, human-computer interaction, and so on.
And that AI Stack appears around about the frontal lobe region, I think, temporal lobe region in this model of AI research at CMU, which broadens to complement the AI Stack with expertise in fields such as robotics and design. So a lot of AI activity at CMU. The interesting point is that it’s scattered in disciplines you wouldn’t necessarily think about. Our Center for Human Rights Science, for example, is working on AI and human rights.
Our department of philosophy is world-leading in the ethics of artificial intelligence. I didn’t have time to create a slide, but our creative writing program is top-notch, and I’m intrigued as to whether they’re using AI and I must pass that on. We have specialized interdisciplinary institutes such as the Block Center for Technology and Society, looking at analytics and ethics. Our business school is very focused on things like the business of health care and transforming that with machine learning, blockchain-type activities and so on. The College of Engineering, perhaps more expectedly, is working heavily in AI and how big data can transform engineering activities.
And our College of Fine Arts is heavily engaged in the design aspects of AI, but also how AI can lead to creating illustrations; instead of essays and dissertations, you can create artwork and maybe sell it on YouTube – not on YouTube, on eBay – using AI to do a lot of the hard work.
So research activity is spread across the institution, as is educational programming. This is just the first of multiple pages for our course listings, where I did a simple search for “AI” or “machine learning.” Hundreds of courses at the university in these fields. A typical course is this one from the College of Fine Arts on creativity and AI, from the University Libraries’ IDeATe program, our integration of technology and life, that human and machine autonomy aspect.
And dedicated degrees, a Master’s in Artificial Intelligence and Innovation; this year we launched our first Bachelor’s Degree in Artificial Intelligence. So you get a sense of the institutional environment. What about the role of the library in that space? We clearly can’t ignore AI, otherwise we would miss a large part of the institution’s activity. So for example, in the Dietrich College of Humanities and Social Sciences, their general education offering is in a series of grand challenges, things like climate change and, in this case, artificial intelligence and humanity.
And as a core part of teaching the gen ed programs, that is an opportunity for us to interact with early-stage students and help them understand some of the challenges that I’ll turn to in a moment. Next semester, a couple of University Libraries faculty will teach a new course, listed by the Department of Statistics and Data Science called “Discovering the Data Universe,” where they are talking about data collection, data management, formatting, visualization, storytelling, ethics, and so on.
And much of this is designed as a course, partly for non-specialist students to understand some of the basics, but also to prepare them for subsequent study if they wish to dip their toes into machine learning or statistics. We’re seeing a lot of humanities students, for example, wanting to explore how they can use these technologies in their majors.
They often haven’t come from a rich computational background in high school, and we’re helping them feel comfortable and confident in working with these activities. Like many libraries, we are offering Data Carpentry and Software Carpentry-type workshops; these are consistently sold out. We have a number of our faculty trained as instructors and they could be doing this full time. We are in discussions with our statistics colleagues about making the Carpentries a prerequisite for people who wish to declare stats and data science as their major.
In our special collections we are also beginning to dabble around some of these things, and we have a couple of Enigma machines, which have attracted a lot of attention. I use this to point to the needs we hear from researchers who are humanists who are moving into machine learning and artificial intelligence, and like the students, they often have come from humanistic backgrounds rather than computational ones. And they view our faculty as being trusted and accessible, they are safe people to come and tell that you’re terrified of numbers or you just don’t know how to understand the algorithms I showed at the beginning.
But more broadly, we are seen as reputable and reliable intermediaries between the humanists and the computer science specialists that they need to work with. We are the ones who can act as interpreters and explain to the computer scientists, “This papery thing is a book,” and to the humanists, “This squiggly thing is an algorithm.” What this points to, as a side note, is the importance for us of staying engaged with researchers, because if we are to build these relationships and foster these collaborations, we need to know who’s playing in different spaces.
And given that many researchers, particularly the ones who don’t know what a papery thing is, haven’t come through our doors in years, it is prompting us to reach out and build connections that traditionally have not been our natural space. One way in which we’re doing that is a new service called the Data ColLABorations, or DataCoLAB, where we are bringing together researchers who have generated data and want advice on how to share it and look after it and so on, and those who need data to do their own research.
And it’s almost like a Data-holics Anonymous where we bring them together and they can trade data and algorithms in a variety of ways. So this is something we started this semester that has been a big hit, and I’m sure that will continue. I’ve mentioned already some of our training programs. We received Mellon funding to offer a series of digital humanities literacy workshops and have very much focused there on issues around metadata, standards, analyzing large datasets.
One example of a project there is the Six Degrees of Francis Bacon project, which had NEH funding. And the idea here was to use machine learning techniques, graph inferences, and web development to reconstruct the social networks of early modern Britain from about 1500 to 1700, long before anybody had heard of Brexit, and to take the history of scholarship as produced initially in the Dictionary of National Biography and then build the networks out.
So that was an interesting project with a fair bit of library support. But what it then triggered was a recognition that these approaches allowed us to do things like bring a voice to the historically marginalized. What you have here on the right, or on my left, looking at the back of the screen, on the top is the network of everyone, predominantly men, and you just can’t see there the depth of individual identities and their networks.
But when our colleagues looked at the presence of women in these networks, they were almost invisible. And it led to an interesting study in its own right about the role and presence of women in London in those days, which in turn makes us recognize the importance of calling attention to things like biases and ethical concerns. And these are the sorts of things we are trying to encourage students to reflect upon.
Separately, the university has established a variety of research programs and engagement activities on ethics and AI. Our ethical principles play nicely into this. This is the British equivalent of the American Library Association. I just like their color scheme, but calling up their ethical principles quite strongly, they are very much in line with what’s seen in this country.
Earlier this year, the ARL issued a report on the ethics of artificial intelligence. I don’t have time to review that report, but it calls out many of the key issues. These are also evident in academic administration. We know that AI is being used, or people are talking about using it, in things like making decisions about college admissions, identifying students at risk, personalized learning. Some of these are good things, some of them are confronting, and we, I think, have a professional responsibility in our institutions to be the voice of conscience.
There are, I think, powerful upsides. Our Language Technologies Institute is working on making privacy policies more accessible by extracting from the 40-page clickthrough thing and presenting easy-to-digest summaries of what you are signing up for before you hit “Yes, I agree.” We’ve seen research on campus looking at flu tracking.
We’ve seen some fun stuff around poker, building upon chess and Go, and have beaten the world’s pros. Things that are challenges but have the power to improve our lives, such as autonomous vehicles. And frankly, some things that are more confronting around AI and military applications. In terms of library-specific activity, I promised Thomas that I would congratulate him publicly on releasing his report about an hour before we all came into this room. Please read it.
I had the pleasure of seeing some early drafts. It really is a great agenda for research in this space. If you look at Twitter, you’ll find it, it’s been tweeted everywhere.
But things that are interesting in the library space – earlier this year, Springer published its first machine-generated book. And I think this is a really confronting issue. It’s nice in some ways to see how an algorithm can ingest thousands of articles and spit out a 400-page literature review.
But does it become a bit like a Ponzi scheme where an algorithm reads a bunch of literature reviews written by algorithms and so on, and it becomes a bit, you know, is this the mortgage-backed securities of 2020-something? Jason gave you a great host of examples; I’ll just call out things like Yewno as examples in our space. The Chan Zuckerberg Initiative announced their funding of Meta recently, another good application.
What about a world in which we could test and validate hypotheses against the scholarly literature? I’ve got a research question, I’m going to express it in natural language and have the answer delivered to me from surfacing ScienceDirect and Wiley Online Library and PLOS and so on. Other interesting examples – using AI to analyze patents, both to streamline the patent application process in a world where we are all encouraging innovation and entrepreneurship, but also to leverage the vast quantities of technical information locked up in fairly dense literature.
Research we were involved with was computer analysis of the Teenie Harris Archive from the Carnegie Museums of Pittsburgh, and understanding some of the threats around facial recognition, but also some of the opportunities for creating metadata for vast corpora of old photographs. Automatic shortening of titles and other related things here. And some work in the fine arts space; my colleague Matt Lincoln has been looking at how he can read art auction catalogues from the 18th, 19th, and 20th centuries and figure out how the behavior of the fine art industry has shifted over time.
What is it that people are valuing at different points in history? Another interesting project that people are working on is looking at how we can identify the publishers of works that were published anonymously 500 or so years ago. By looking at the fonts as bits of data, and looking for similarities with broken or damaged fonts, you can find that books published anonymously for political reasons must have been printed in the same place as books published by identifiable publishers or authors, and you can begin to infer new insights in ways that were impossible before now.
We’ve also been working on, coming back to some of the more traditional library things, looking at our special collections. We have one of the first printings of Frankenstein; we leveraged AI and machine learning to understand some of the opportunities around that. In Cliff’s remarks at the beginning of this event, he talked about AI and its potential for data discovery and reuse. We hosted a three-day conference on that earlier this year, thanks to the National Science Foundation.
Lots of interesting insights from that. The papers are available in F1000. And my colleague Huajin Wang and I wrote an editorial across some formally published papers which was released last month by the ACM. I don’t have time to summarize all of the key themes, but very much focused on the power of reuse, thinking about incentives and standards. One of my takeaway remarks from that conference was, you may not have $70 million to do research, but you can access and build upon the data from somebody else’s $70 million worth of research.
And that really is at the heart of many of the opportunities that we see coming out of data sharing and the opportunities of machine learning to exploit that. I won’t belabor this, I just want to get to a couple of points. The National Institutes of Health currently is calling for feedback on its data management and sharing proposals.
These are due in by the 10th of January, please do take time to comment on those, because clearly we are at a tipping point. If we are to leverage responsibly – and it is much more challenging for the NIH than the NSF – the data generated by experimental research, then they need to understand what sort of expectations can be shared with the research community and their institutions.
We’ve seen a huge amount of data sharing in repositories and repository services, like Dryad and Zenodo, in recent times, but we need to recognize that shared is not the same as reusable. Again, Cliff made the point about FAIR data, and I think there are some really searching questions that we haven’t begun to explore in depth yet. And these will very much shape our agenda over the next while.