Skills Upgrades

BD2K Builds Training Resources

The field of biomedical data science is growing so fast that it threatens to leave some researchers behind.


“Some of these big data skills were not needed 5 to 10 years ago, and many of the tools that we now use were simply not available,” says Daniela Witten, PhD, associate professor of biostatistics and statistics at the University of Washington. These skills and tools are not part of the standard curricula at many universities.


To make a dent in this problem, the National Institutes of Health (NIH) has funded a number of Big Data to Knowledge (BD2K) training grants designed to develop a variety of training resources, including summer workshops; massive open online courses (MOOCs); and repositories for training materials, including materials at the bleeding edge of educational philosophy. The grants, which were made a year ago, are already making a difference.


“It was a good decision by the NIH to offer a smorgasbord of learning opportunities around the general topic of biomedical big data,” says Rommie Amaro, PhD, assistant professor of chemistry and biochemistry at the University of California, San Diego (UCSD).


Summer Boot Camps

Some researchers who want to gain new skills seek out the intensive learning experience provided by a summer training institute. Two BD2K-funded workshops launched this summer—one at the University of Washington and the other at the Mayo Clinic in Rochester, Minnesota.


Short courses of this type appeal to a broad spectrum of people, Witten says, from PhD students and post-docs to research scientists or faculty. “None of us is an expert in everything,” she says. “I’m only teaching one of these modules for a reason; for some of the others, I’m learning along with the other students.”


This year, the Summer Institute for Statistics of Big Data, run by Witten and her colleagues, consisted of five separate 2.5-day-long courses or modules covering how to access biomedical big data; data visualization; reproducible research; and both supervised and unsupervised methods for statistical machine learning.


“Whether you are a current student or were trained 20 years ago, chances are you don’t know this stuff,” Witten says. The institute has been incredibly popular. “We quickly ran into a room capacity problem,” Witten says. After 150 people enrolled, they had to turn people away for some of the modules. “We’re clearly filling an unmet need,” she says.


The BD2K grant paid for the instructors as well as tuition and scholarships for attendees, with a maximum of three scholarships (for three modules) going to one person. “We have some people staying for all five modules,” Witten says. “That’s a big time commitment.”


Witten sees a clear benefit to the in-person classroom experience afforded by her summer institute. “Being there, talking to other students with teaching assistants walking around—it’s really a hands-on computational experience,” she says.


The Mayo short course, called Big Data Coursework for Computational Medicine, was also in high demand, with 80 applicants for 20 spots. “There seems to be a lot of interest in spending summer vacation in a boot camp,” says Claudia Neuhauser, PhD, who directs the Institute of Informatics at the University of Minnesota Twin Cities. She co-leads the BD2K program with Jyotishman Pathak, PhD, professor of biomedical informatics at the Mayo Clinic.


By its nature, a weeklong intensive course covering six topics in six days can’t go very deep. “The workshop gives them pointers and literature references and exposure. That entrée then lets them dig in further,” Neuhauser says. “Many of the students are used to learning on their own.”


Students work together and learn from each other. “There’s almost always someone in the room who is experienced and someone who isn’t,” Neuhauser says. “The diversity of students means the questions are quite rich.”


Neuhauser and Witten agree that both in-person workshops and MOOCs are needed to address the training gaps of biomedical data science. But live workshops allow greater interactivity. “Being in a group for a whole week—talking about things and asking questions, even ones that go off topic—allows students to get what they want out of the workshop,” Neuhauser says. “And we can adjust how we teach.”


In-person trainings provide another key benefit: networking opportunities. Often, says Neuhauser, bioinformatics and health informatics researchers can be the lone quantitative people in their workplaces. “So talking to others can be very important,” Neuhauser says. “This kind of work doesn’t have a recipe book. There’s a lot at the arts level where you have to figure it out. Personal contact becomes important.”


Biomedical Big Data MOOCs

Several BD2K training grants are being used to launch new MOOCs for teaching biomedical data science. Like summer institutes, MOOCs in biomedical big data science serve a heterogeneous group of people who want to retool or get involved in a new area such as genomics, says Brian Caffo, PhD, professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health.


Caffo and his team developed the first data science specialization on Coursera, the popular education platform that partners with top universities and organizations worldwide to offer free courses online. A specialization is a program of study: a bundle of courses designed to be taken serially. Caffo is now using a BD2K training grant to launch two new Coursera specializations, one in genomics and one in neuroimaging. “Our specializations are longer than a summer institute but a tad shorter than a full-on master’s degree,” Caffo says.



For the new genomics program, which started in the summer of 2015, many of the students are people who want to work in the field or already work in the field and need genomics skills for their current jobs. “We’ve had some people say ‘our whole office is doing this.’ Or ‘I make all new employees do it,’” Caffo says.


One benefit of MOOCs: They are typically free, and students can choose their level of engagement. The genomics series can be completed in about six months if taken serially, Caffo says. But students can take modules simultaneously or out of order. And those in the data science specialization who choose to fork over a nominal fee ($50 or less) to get Coursera signature track verification also get another bonus—a project-based class available only to those who pay. And the MOOC completion rate for people who make this minor investment is quite high—on the order of 90 percent, Caffo says.


With his BD2K funding, Rafael Irizarry, PhD, professor of biostatistics at the Harvard School of Public Health and professor of biostatistics and computational biology at the Dana Farber Cancer Institute, revised a very dense data science for genomics MOOC he launched two years ago by dividing it into eight parts. To complete the series, a student takes the first four modules and one of the last four, which are case studies using specific types of data. “But a generalist in genomics might want to take all of them because someday they might face all of those types of data,” Irizarry says.


The course has proven quite successful, with the usual caveat: Many people sign up for MOOCs and don’t finish them. In some cases, perhaps they watch a few lectures and learn what they needed to know. For Irizarry’s MOOC, two or three thousand students completed the first of the eight modules (a very general statistics course for the life sciences) and about 300 completed the entire series of eight. Of these, Irizarry says, many are post-docs who want to be better able to do their jobs. Another subset, he says, is educators: “people tasked with teaching this kind of thing who take the class to help them prepare a class.”


Building a Better MOOC

Caffo and Irizarry are both serious about incorporating interactive learning into their MOOCs. So too is Pavel Pevzner, PhD, professor of computer science at the University of California, San Diego. He also received BD2K funding to launch several MOOCs. Indeed, Pevzner wants to change the nature of MOOCs from being massive and impersonal to being more like the experience of receiving one-on-one tutoring in a professor’s office. “There’s a need to address individual breakdowns in students’ learning,” Pevzner says. Large lecture courses don’t and can’t do that.


“We wanted to build a better MOOC,” Pevzner says. And he’s been at it for a while, having created a MOOC for bioinformatics algorithms several years ago. The keys to a better MOOC, he says, are short lectures (under ten minutes) and intelligent tutoring systems, such as Rosalind, which Pevzner created, or SWIRL, which Caffo uses in his MOOCs. Rosalind automatically assesses each student’s work on robust, “just-in-time” assignments: a sophisticated software system evaluates a submission at the exact moment when assessment would ease the transition to the next topic.


Similarly, SWIRL is an active learning tool for learning data analysis using the programming language R. “It prompts you to do things, and if you mess up it asks you to do it again,” Caffo says. SWIRL, which is free and open source, was developed by Nick Carchedi in 2013 while he was pursuing his master’s degree in biostatistics at Johns Hopkins. “It is now very mature,” Caffo says. “We’re focused on making content for it.”


Pevzner has seen professors at other universities use his bioinformatics algorithms MOOC in a flipped classroom: students watch the videos and do the lessons outside of class, then come to class to discuss and work through any questions or problems they are having. This is a sign that his approach has to some extent succeeded, Pevzner says. Indeed, he believes MOOCs of the future will turn into MAITs (Massive Adaptive Interactive Text). His paper outlining the concept of MAITs appeared in Communications of the Association for Computing Machinery (CACM) in September 2015.


Like Pevzner’s, Irizarry’s MOOC avoids multiple-choice assessments (widespread in the MOOC world) because, he says, they aren’t effective teaching tools. Instead, the BD2K-funded MOOCs he’s developing use fill-in-the-blank questions that offer the student multiple chances to get it right. “They have to download data, analyze it the way they think best, and tell us, for example, how many genes show statistically significant differences in cancer samples compared to controls,” Irizarry says. “There’s a correct answer (it might be a specific number, like 154). And many times they get it wrong. Then they go to discussion boards and talk about it.” With repeated effort to solve a problem, and the support of the people on the board, students often get the question right. And if they don’t, the correct approach is revealed, along with a follow-up question to ensure students really understand the material.
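The question format described above (a single numeric answer, several chances, and a reveal once the chances run out) can be sketched in a few lines of Python. This is only an illustration, not the software behind Irizarry’s courses: the class name, the attempt limit of three, and the feedback strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FillInBlankQuestion:
    """A numeric fill-in-the-blank question with a limited number of attempts."""
    prompt: str
    correct_answer: float
    max_attempts: int = 3      # hypothetical limit; the article does not give one
    tolerance: float = 0.0     # exact match by default, e.g. for a gene count
    attempts: List[float] = field(default_factory=list)

    def _matches(self, answer: float) -> bool:
        return abs(answer - self.correct_answer) <= self.tolerance

    def is_closed(self) -> bool:
        """True once the question is solved or the attempts are used up."""
        solved = any(self._matches(a) for a in self.attempts)
        return solved or len(self.attempts) >= self.max_attempts

    def submit(self, answer: float) -> str:
        """Record one attempt and return feedback for the student."""
        if self.is_closed():
            return "closed"
        self.attempts.append(answer)
        if self._matches(answer):
            return "correct"
        if len(self.attempts) >= self.max_attempts:
            # A real course would now show the worked approach and a follow-up question.
            return "revealed"
        return "try again"


question = FillInBlankQuestion(
    prompt="How many genes differ significantly between cancer and control samples?",
    correct_answer=154,
)
feedback = [question.submit(guess) for guess in (120, 160, 154)]
print(feedback)
```

Between a “try again” and the next submission, the student would typically revisit the analysis or consult the discussion board, which is where much of the learning described in the article takes place.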


“The discussion board can get pretty crazy,” Irizarry says. With thousands of students, a single question can generate several hundred posts. And while that might sound like a lot of work for the professors, there are usually students in the class who answer other students’ questions before the professors do. Once they’ve proven their reliability, Irizarry can tag these individuals as community TAs, alleviating the discussion board burden.

Michelle Dunn, PhD, senior advisor for data science training, diversity, and outreach in the Office of the Associate Director for Data Science at NIH, is enthusiastic about MOOCs such as Caffo’s data science specialization. “The fact that they can put out thousands of students per year through a mini-masters program can only help the rest of us have the quality people we need in order to get the job done,” she says.


Some MOOC graduates might become the programmers who are given direction about what algorithms to program. Others with advanced biology backgrounds have used MOOCs to obtain needed data science skills. “People with PhDs are self-learners and do well with MOOCs,” Dunn notes.


Courselets and Concept Inventories

Another BD2K grant recipient is applying the latest advances in educational psychology to bioinformatics education and making the results available online.


After more than 10 years teaching bioinformatics theory to computer scientists, physicists and life science students, Christopher Lee, PhD, professor of chemistry and biochemistry at the University of California, Los Angeles, felt discouraged. “After a quarter-long class, students were still not understanding basic things,” he says. In addition, about 50 percent of his students were dropping the class.


Then he learned about concept inventory studies from the field of education. About 20 years ago, researchers discovered that students of freshman physics (including bright Harvard freshmen) scored about 45 percent on a test of physics concepts before taking the class, and only about 50 to 55 percent immediately afterward. “It got people’s attention because of these pretty shocking results,” Lee says. “And this is universal.” The same phenomenon is seen in many fields. To Lee, this matched up with his frustration in teaching introductory bioinformatics.


To address the problem, Lee changed his teaching methods. He now presents a single concept and then, immediately after, poses a question designed to test understanding of that concept. Students then have a few minutes to think about how the concept applies to the question and write an answer on a web page on their laptop—just a few lines to capture their thinking. “We can then identify the underlying conceptual errors that we are seeing in all the students’ answers,” Lee says.


As an example, Lee says, students in his class should understand the concept of conditional and unconditional probability from prior coursework in statistics and probability. But, he says, “My experience is that their understanding is brittle and falls apart when they try to use it.” Shifting to concept-based instruction has proven helpful in bringing students up to speed.


“As soon as I started doing this, it was eye-opening,” Lee says. He could see what every student was thinking on every concept every single day. “I’d realize that one word has two meanings and half the class is off on a wrong tangent. It’s wrong and nobody’s going to repair it for them.”


After three years of teaching the introductory bioinformatics course this way, the attrition rate dropped from 50 percent to about 10 percent—without any detriment in overall test scores, Lee says. “So we’ve taken the lower half of the class (the ones who dropped) and put them up where the top half were.”


For his BD2K concept network grant, Lee is taking all that he’s learned from his work with concept inventories in his introductory bioinformatics course and putting it online as courselets that any teacher or student can use. A courselet can allow someone to understand a concept really well in a single sitting. It includes a brief explanation and definition followed by exercises that are broken into pieces: question, answer, and error models (common misconceptions), as well as resolutions for the various error models. The Courselets platform is still in the early stages (the user interface needs refinement and Lee needs to do some usability studies), but having it online allows others to dip a toe into Lee’s methods by trying out one or two concept exercises a week, either in the classroom or as homework. Lee’s team will also provide support for instructors who use the platform. “We have a lot of experience creating these concept tests,” he says, “and we are totally willing to work closely with people to do this.”
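The pieces of a courselet listed above (explanation, question, answer, error models, and resolutions) map naturally onto a small data structure. The Python sketch below is one possible way to model them, not the Courselets platform’s actual schema; every class and field name is hypothetical, and the sample exercise on conditional probability was written for this illustration rather than taken from Lee’s materials.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ErrorModel:
    """A common misconception paired with the explanation that resolves it."""
    misconception: str
    resolution: str


@dataclass
class ConceptExercise:
    """One question, its answer, and the misconceptions it tends to expose."""
    question: str
    answer: str
    error_models: List[ErrorModel] = field(default_factory=list)

    def resolution_for(self, misconception: str) -> Optional[str]:
        """Return the resolution for a named misconception, if one is recorded."""
        for model in self.error_models:
            if model.misconception == misconception:
                return model.resolution
        return None


@dataclass
class Courselet:
    """A single concept: a brief explanation followed by exercises."""
    concept: str
    explanation: str
    exercises: List[ConceptExercise] = field(default_factory=list)


# Illustrative content only.
exercise = ConceptExercise(
    question="A test is positive for 90% of people who have a condition. "
             "Do 90% of people who test positive have the condition?",
    answer="Not necessarily: P(positive | condition) differs from "
           "P(condition | positive), which also depends on how common the condition is.",
    error_models=[
        ErrorModel(
            misconception="confuses P(A|B) with P(B|A)",
            resolution="Write out both conditional probabilities and compare "
                       "what each one conditions on.",
        )
    ],
)
courselet = Courselet(
    concept="conditional probability",
    explanation="P(A|B) is the probability of A among the cases where B holds.",
    exercises=[exercise],
)
print(exercise.resolution_for("confuses P(A|B) with P(B|A)"))
```

Keeping the error models alongside each exercise means an instructor who spots a misconception in students’ written answers can immediately pull up the matching resolution.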


Lee also plans to cross-link Courselets with Rosalind, Pevzner’s interactive learning site. Eventually, he says, “If you are working on a Rosalind problem and you feel that you are missing a concept, you can jump over to Courselets.”

Repositories and Virtual Machines

Summer short-courses, MOOCs and Courselets will serve a vast constituency, but plenty of principal investigators just want to train the students in their lab or in a class of 15 to 20 people. These folks don’t necessarily need to launch a MOOC, Amaro says, but they could benefit from a resource of plug-and-play training materials.


Amaro and her co-PI, Ilkay Altintas, PhD, Director for the Center of Excellence in Workflows for Data Science at the San Diego Supercomputer Center (SDSC), UCSD, received a BD2K grant to build such a resource. Called the Biomedical Big Data Training Collaborative (BBDTC), it will serve as a sort of clearinghouse for training materials related to biomedical big data. It will allow instructors and students to create playlists and add them to an educational queue, designing a personalized, flexible, online learning experience, she says. “Instructors can easily create their own modular courses based on the content we serve and what they create, and deploy it to their students in a more flexible way,” Amaro says.


The site also provides virtual toolboxes: virtual machines that will contain all that a student would need to run hands-on exercises. “Instructors can create the environment the students will be working in,” Amaro says. “And we can package these toolboxes up and ship them out with the course materials in a way that scales.”


Amaro’s prototype site is now up and running, and she is eager for the BD2K Centers of Excellence and others to put their content there. “As we get content, we’re working on developing tags for the various kinds of training that get uploaded to the BBDTC so people can sort and search and find what they are looking for,” she says.


Choices, Choices, Choices

Online education is not for everyone. “In the end, most people will agree that face-to-face training is always the best,” Amaro says. “But there are so many people who we need to reach, it’s just not possible to train them all in-person.” Online resources allow training in a scalable way, which will be needed in order to close the gap that exists and that will continue to exist in the trained workforce, she says.


Caffo agrees that because the demand for trained people outstrips the supply, there’s plenty of room for all different sorts of solutions for training people. “More MOOCs, more institutes, more online degrees, more in-person degrees—all of those things are going to be necessary,” he says.  



