In this episode, Scott Swigart, SVP of the Technology Group at Shapiro+Raj, joins host TJ, VP of Product Marketing at Yellow.ai, to discuss the role of synthetic data in training AI models. Learn how businesses can leverage AI role-playing for market research and to gain valuable business insights.
TJ – 00:00:03:
Generative AI takes center stage. But is your enterprise still watching from the sidelines? Come on in, let’s fix that. This is Not Another Bot, the Generative AI show, where we unpack and help you understand the rapidly evolving space of conversational experiences and the technology behind it all. Here is your host, TJ.
TJ – 00:00:26:
Hello and welcome to Not Another Bot, the Generative AI show. I’m your host, TJ. Today we are privileged to have Scott Swigart, the Senior Vice President of the Technology Group at Shapiro+Raj, a leading market insights and strategy consulting firm. With over two decades of experience navigating the B2B technology sector, Scott stands at the forefront of understanding tech trends that are reshaping industries. His dedication to helping clients leverage the opportunities in a rapidly evolving tech landscape, especially around generative AI, cloud services, and cybersecurity, has positioned him as a pivotal figure for businesses aiming to harness the power of technology-driven insights. Welcome, Scott. We’re absolutely pumped to have you here.
Scott Swigart – 00:01:08:
Excited to be here. Thanks for having me on.
TJ – 00:01:11:
Indeed. All right, Scott, the way we get started is we definitely ask you a bit about your journey and how you traversed the path all the way here. So, with over two decades in the B2B technology sector, and now at the helm of the Technology Group at Shapiro+Raj, can you walk us through your journey and what drew you to this ever-evolving world of technology and data?
Scott Swigart – 00:01:33:
I got my start in technology pretty early on. When I was 12, my parents bought me one of the first personal computers you could buy. And I plowed my way through the programming manual for it and never looked back. And so then throughout my career, I worked as a developer, worked as an instructor training other developers on how to use the latest technology, and eventually wound my way into market research. At my heart, I’m a technologist. The reason why I’m doing what I’m doing now with market research is it lets me investigate emerging technologies, you know, the latest technologies, for our B2B technology clients, and so that’s what’s got me here. And generative AI is obviously the latest and greatest. Internally, we’re using it as part of the projects we do, and at the same time, as everybody has seen, every B2B tech company out there is adding generative AI features to their products, and the startup space is just exploding with new generative AI startups. It’s honestly the most exciting thing I’ve seen since the dawn of the internet.
TJ – 00:02:42:
Absolutely, same here. And I think the sort of use cases we are landing on to discuss with our potential customers is just very exciting. Also, it’s not, you know, a matter of taking all use cases and attaching them to generative AI. So, I think that education is equally happening, but to your point, it’s just something really massive that’s ongoing at the moment. So, Scott, no AI or generative AI is possible without the right data set, and certainly building a model relies on that core data, which is what makes possible the outcomes we expect from generative AI or the AI ecosystem. Synthetic data is becoming a buzzword that’s heavily used in generative AI scenarios, and many organizations are still grappling with its potential. What was your first encounter with synthetic data, and what piqued your interest in its capabilities?
Scott Swigart – 00:03:31:
It’s a good question. Generative AI is generating stuff, right? So, it’s generating data. And the question really is what kind of data is it generating? Is it generating answers to questions? Is it generating summaries? Well, for a lot of scenarios, you can actually have it generate data. So, if you think about software development and test, you might want test data to run through your application or things like that. Some of the open-source large language models are being trained or fine-tuned using data generated from the proprietary models. In the space that we’re in, our clients tend to be B2B marketers, and the synthetic data is a little bit different. In our case, it’s more about asking the AI to role-play somebody. So, let’s say that you’ve got a survey and you want 400 responses. There’s a lot of dangers with this because these things are happy to be hallucination factories, so there’s a lot of caveats around what I’m about to say, but you can have the AI role-play 400 different people taking the survey and see if those answers are useful. Or you could have messaging that you want to test, and you could have the AI role-play different technical decision-makers who would be viewing this messaging on a website or something like that. Do they like it? What do they like about it? So, you can have it role-play the kind of personas that would normally experience your messaging. Like I said, there are a lot of dangers and caveats around this, but there’s also a ton of promise because frankly, what we do with market research is expensive and it’s time-consuming. You know, it can be tens of thousands or hundreds of thousands of dollars and take six to 12 weeks. And so, the idea of being able to get some directional insights very quickly at a low cost is extremely intriguing to people, if it works, if it’s accurate, if it’s valid, if it’s not just hallucinations that wouldn’t hold up or mirror an actual human population.
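The role-play approach Scott describes could be sketched roughly as follows. This is a minimal illustration, not any real product’s API: `call_llm` is a placeholder for whatever model client you use, and the persona strings are invented examples.

```python
# Sketch: generate N synthetic survey responses by asking a model to
# role-play different personas. All names here are illustrative.

PERSONAS = [
    "a CISO at a Fortune 500 bank",
    "an IT director at a mid-market manufacturer",
    "a DevOps lead at a SaaS startup",
]

def build_roleplay_prompt(persona: str, question: str) -> str:
    """Frame a survey question as a role-play instruction for an LLM."""
    return (
        f"Role-play {persona}. Answer the survey question below in character, "
        f"in two to three sentences, reflecting that person's priorities and "
        f"skepticism.\n\nQuestion: {question}"
    )

def synthetic_survey(question: str, n_respondents: int, call_llm) -> list:
    """Collect n synthetic responses by cycling through the personas.
    `call_llm` is a stand-in for an actual model-client function that
    takes a prompt string and returns a completion string."""
    responses = []
    for i in range(n_respondents):
        persona = PERSONAS[i % len(PERSONAS)]
        responses.append(call_llm(build_roleplay_prompt(persona, question)))
    return responses
```

As Scott cautions, the output is only directional; nothing here validates the responses against real humans.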
TJ – 00:05:40:
So, in a nutshell, you’re saying that some of these AI-generated responses may capture the nuances, biases, and subconscious drivers. But keeping that in mind, and given how it may impact human decision-making, how do you control for the possibility of the AI providing the answers researchers want rather than genuine consumer insights?
Scott Swigart – 00:06:02:
Yeah, so that’s really the biggest issue. And what we found is there has to be a ground truth that comes from actual humans. So, whether it’s humans who were surveyed or humans who were interviewed, you’ve got to have a repository of data that’s very good that the AI can emulate. And the farther you ask it to extrapolate away from that core data set and what it covers, the more hallucination you’re going to get. But if you’ve done buyer persona interviews, for example, so you’re trying to understand, OK, there’s four different kinds of people who are involved in decisions for our product, and we’ve gone out and interviewed 20 of each of them, so we’ve got a repository of 80 interviews, now we can have the AI do more than just role-play. Instead of just saying, you know, pretend you’re a CISO for a Fortune 500 company, we can say, here’s a chief information security officer, look at this interview with them, and now, as though you were that person, respond to this messaging. And now you’re really grounded in what a person would say. They never saw the messaging, but the extrapolation makes a lot more sense. And it’s something that you, as a human, if you want to, can sort of validate. You can look at that interview, you can look at that messaging, and you can say, does this seem logically consistent? But importantly, it will also contain surprises because you’re exposing it to new stimuli that this person hadn’t seen. You’ll get what generative AI does best, which is that it’s able to mash things up. It’s able to mash up different kinds of datasets and sources: what it knows about the role from its training, what you’ve given it for context, what you’ve given it for something to react to. And it’s able to come back with things that you wouldn’t have just thought of if you had read the transcript and tried to guess what they would say about this messaging.
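Grounding the role-play in a real interview, as Scott describes, amounts to packing the actual transcript into the prompt before the new stimulus. A minimal sketch, with all of the wording purely illustrative:

```python
def build_grounded_prompt(role: str, interview_transcript: str,
                          stimulus: str) -> str:
    """Ground a role-play prompt in a real interview so the model reacts
    to new messaging *as* the interviewed person, rather than from a thin
    persona description alone."""
    return (
        f"Here is an interview with {role}:\n\n"
        f"{interview_transcript}\n\n"
        f"As though you were that person, respond to the following "
        f"messaging. Stay consistent with the opinions and priorities "
        f"expressed in the interview:\n\n"
        f"{stimulus}"
    )
```

The point of the structure is exactly what Scott notes: a human reviewer can read the same transcript and the same stimulus and check whether the model’s answer is logically consistent with both.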
TJ – 00:08:06:
Very interesting. One of the things you called out in the previous answer was hallucinations. In these scenarios, when you’re really working with synthetic data, the lines between real and synthetic data might blur, so transparency certainly becomes a critical parameter. Does hallucination show up a lot more in these scenarios? And if it does, how do we contain it? And how do you ensure that organizations are getting the best information possible? Like, is there a truth checker? Is there some sort of validation that we should be running, or that you guys are running at the moment?
Scott Swigart – 00:08:42:
Yeah. So first of all, you have to be completely transparent, right? So, our clients who would be buying market research would be completely informed of this. They would know this is what we’re doing. You would never pass this off as real human beings. This is framed as something that’s directional; it’s going to help expand your peripheral vision and help you think of things you didn’t think of before. The only real ground truth, the only way to test it, would be to go expose those 50 people to the same messages and see what they say. So that’s the real ground truth: you would go back to those people and get their real responses, or if you’re doing it in a quantitative setting, field it to an equal number of people and, same thing, get their actual responses. That’s the ground truth. In some ways you’re asking it to hallucinate, right? You’re saying, pretend you’re this person. So, you’re asking for a little bit of hallucination, but you’re asking it to be very grounded in things it knows. And the more you ask it to extrapolate, the more you’re presenting it with something to react to that is very different from anything the real person talked about in the interview. Or, if you say, hey, instead of pretending to be a person who works in an enterprise company, pretend to be somebody who works in a small business or mid-market, you’re going to get farther from what an actual human would be. So, you, as the user of this, know how much you’re stretching it, or you should know how much you’re stretching it. And the farther you pull that rubber band, the further you move from it making a very educated, valid guess to it just making stuff up.
TJ – 00:10:25:
Very interesting. Very, very interesting. Going back again to data and synthetic data in general as we continue the conversation, Scott, could you shed some light on some of the successful applications or projects in your experience where synthetic data has led to valuable insights or innovation? And in what scenarios might synthetic data be the primary choice over traditional research methodologies?
Scott Swigart – 00:10:51:
Yeah, so we’re in the experimental phase of this. So, we’re in the phase of having it generate synthetic data and having a real baseline to compare it to, because we want to figure out the biases. For example, we’re seeing that AIs tend to have a positivity bias, and this seems to be built into them. They’re these cheerful, happy assistants; you ask a question, and it’s like, oh, I’m happy to help you. So, if you’re exposing it to messaging that may appear on a website, it’s likely to react to it more positively than a jaded human being who’s bombarded with marketing might. So, we’re still in the figuring-it-out phase. Some of the things that we’re seeing are that marketers are really excited about this, because the notion of having infinite research at their fingertips, the notion of being able to test everything just as part of a workflow for very low cost, is extremely exciting, and it doesn’t have to be perfect. Again, if it makes them think of things that they wouldn’t have thought of, and it all seems very reasonable, it’s definitely better than nothing. And it may be 60 or 70% of the way there compared to using a real human audience. The real human audience has its own biases too. Market researchers, on the other hand, are a little bit revolted by this idea because it can look a little bit threatening to their position. If we can just ask a computer for the answers, what is my job? What is my role? And I have an answer to that that I can get into. I think it’s actually a pretty good thing for market researchers, and I can draw an analogy to photographers and Midjourney. So, an analogy that really comes to mind is, I think about Midjourney, and when Midjourney came out, and DALL·E and Stable Diffusion and all of those kinds of things, I thought, this is gonna obliterate the stock photography market.
You’re gonna be able to just generate so many images of whatever you want, whatever you can imagine and describe, that the need to go out and set up a photo shoot and lighting and get models and do all of that work to generate stock photography is gonna go down. But the thing that was really interesting about Midjourney that I noticed is, the artists and the photographers who dove into it were better at it than anybody else, because they knew different artist styles. They knew different art styles. They knew camera settings. They knew lighting setups. They had all of this terminology and all of this knowledge from their art and their craft that they had been practicing for all these years to guide this AI much better than me showing up and saying, I want an astronaut riding a horse, right? You know? So those people were able to bring all their domain knowledge into it and be up and running much quicker, and I view the exact same thing happening with synthetic data and market research. You’re gonna need people who know where it works and where it doesn’t work, where it is useful and where it’s dangerous, how to communicate to the stakeholders in the business, who don’t know a lot about market research anyway, and synthetic data on top of that, where they should and shouldn’t use this. And then when you use this, you wanna craft a research study just as well as you would have ever crafted a research study before. You want the thing that you’re reacting to to be just as well designed. You want the questions you’re asking to be just as well designed. You wanna analyze the data that comes out of it exactly the same way you would have analyzed it for human respondents. The only thing you’re really changing about the research process is you’re not going out and recruiting actual human beings, which is slow and expensive.
But all of that market research expertise that you have, you bring to this domain and it isn’t a threat, it’s a way to move forward and potentially serve a lot more stakeholders than you could have before.
TJ – 00:15:01:
Very interesting. And then, just continuing from that discussion, how do researchers ensure that AI role-playing doesn’t fall into repetitive or predictable patterns, thereby compromising the diversity of responses?
Scott Swigart – 00:15:15:
It really comes down to grounding it in those real respondents. And that’s what we see. If you’ve got a diversity of real respondents and you’re grounding it in that, you get a diversity of real responses. If you don’t, if you just give it something very brief, if you say, again, role-play an IT decision maker, role-play a chief information officer, but you don’t give it rich context, you don’t give it a whole transcript or things like that, then you are going to run into that repetition. You are going to end up, even though you tell it to play 50 people, with it converging on a lot of the same stuff. So, the thing that doesn’t make it magic and free is you’ve got to build that baseline the hard way. You have to go out and do those 50 interviews the hard way to have a baseline, and you’re going to have to refresh them. Those things are going to have an expiration date on them, because perceptions in the market are going to change over six months, over a year. So, it’s not free. There’s a pretty decent-sized upfront investment to get that corpus of data that’s going to serve as your ground truth for the AI to make extrapolations from.
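One way to read Scott’s point is that diversity comes from pairing each synthetic respondent with a different real transcript, rather than reusing one thin persona description 50 times. A hedged sketch of that loop; as before, `call_llm` is an assumed stand-in for your model client:

```python
def diverse_synthetic_panel(transcripts: list, stimulus: str,
                            call_llm) -> list:
    """Generate one synthetic response per real interview transcript,
    so the panel inherits the diversity of the underlying corpus instead
    of converging on a single generic persona."""
    responses = []
    for transcript in transcripts:
        prompt = (
            f"Here is an interview transcript:\n\n{transcript}\n\n"
            f"As though you were this person, react to the following:\n\n"
            f"{stimulus}"
        )
        responses.append(call_llm(prompt))
    return responses
```

The corpus itself still has to be collected the hard way, as Scott says, and refreshed as market perceptions shift.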
TJ – 00:16:27:
Very interesting. Now, something I learned back in my Microsoft days, and also at AWS, and we started discussing it as soon as we started the interview, is that outcomes are pretty much biased by how the models were trained and what sort of data they were trained on. I’ve definitely seen that if a model is trained on photographs tagged from a specific geography or region, and then you ask for something like, match me to this sort of celebrity, you start seeing outcomes driven by data from other regions. For me, I was expecting something back from APAC, or Asia Pacific, I should say, but I was consistently being mapped to somebody here or somewhere else, and certainly none of the traits I have were matching. Now, keeping that in mind, and keeping in mind the different cultural and regional landscapes, are there cultural nuances or biases that AI role-playing needs to be particularly attuned to when generating data for different regions?
Scott Swigart – 00:17:35:
I don’t have hard data on this; I’m going to answer with my gut, just knowing what ChatGPT and Claude, the more text-focused generative AIs, are trained on. It’s probably going to work better if you’re assuming you’re getting the opinions of a US audience. It just probably has less training data on other parts of the world. Now, again, you can counteract that somewhat by what you’re grounding it in, if you’re interviewing people globally, and especially if you’re having them talk about regional differences and how things might be different. But again, it’s that rubber band: when you ask it to respond to something it hasn’t seen before, how far are you stretching that rubber band into just hallucination factory territory? So, sticking with stuff that you assume is going to be more represented in its training data is likely going to be safer.
TJ – 00:18:32:
And now, keeping in mind that the last six months have seen rapid advancements in AI and technology — I mean, it was always there, so many things were happening, but with innovations like ChatGPT and some of the more advanced language understanding and large language models precisely — where do you envision the trajectory of generative AI and synthetic data heading over the next decade?
Scott Swigart – 00:18:58:
Yeah, it’s the worst it will ever be right now. So however good this is, however much mileage you can get from this, it’s gonna be better in three months, it’s gonna be better in six months, it’s gonna be better in nine months. And something, too, that I don’t think people appreciate is that people are expecting a linear ramp with this. But this is going to follow some kind of Moore’s law trajectory. So, I don’t know if the doubling rate is every two years or every six months or what, but let’s say it’s two years. In two years, it’s going to be twice as good. In another two years, it’s going to be twice as good again. So, we’re going to see this exponential increase. And the limitation I’m seeing is honestly not the technology. The limitation I’m seeing is the rate at which organizations can adopt it. Almost nobody in corporate market research is really even seriously looking into synthetic data, much less even simple things like: can we use generative AI to analyze transcripts from real human beings, to just be able to ask questions of 10 interviews and get stuff out? People are just at the baby steps with this technology. And I don’t know what that gap is going to look like as the capabilities of the technology accelerate, but the rate at which corporations can ingest change is linear at best. I don’t know exactly what happens there, but it’s definitely time for people to jump into the river and not wait for better technology, because everything you learn on this technology you just get to carry forward with you. It just makes you smarter on the next thing.
TJ – 00:20:41:
So rightly said and so nicely explained; I think that’s the key. A lot of people are still just thinking about it, and I know a lot of our customers who actually were able to adopt and get going with generative AI. We offer a conversational AI platform, so for us, it’s all about how you can build this advanced automation literally in minutes, rather than going through the whole buildup process: if you can be given everything more prompt-based, and you can define what the prompt looks like, and you’re now generating all of your workflows dynamically and getting human-like interactions, then definitely a lot of customers want to go and adopt. So, we have definitely seen some adoption, but to your point, a lot of them are still like, okay, what are my concerns? I think ethical concerns or security and privacy concerns are equally dominating the discussions. But to your point, so rightly said that they all should be thinking about how to quickly get into it and figure it out, and not spray and pray across all use cases, but at least pick some of them and see the impact.
Scott Swigart – 00:21:46:
I would say on the security concerns, this is going to be maybe an unpopular hot take: I think it’s overblown a little bit. I think it’s overblown if you’re willing to look into the policies of the AIs you’re using. So, if you just go into ChatGPT and start using it, it will use your prompts for training, but if you go into the settings, there’s a little history slider; if you turn history off, it won’t. So that’s all you have to do to be able to use it for more sensitive data. If you’re using it through the API, it doesn’t use your prompts as training data. If you use an AI like Claude, which is a fantastic AI, completely underrated, the amount of data you’re able to put into it is enormous. You can put basically 75,000 words into it in one shot, which is just huge. It does not use your prompts for its own training. And you’re even seeing, if you go into Bing Chat, it has a little protected icon now, and it says your personal and business conversations are protected; they’re not being used for training. All of the AI vendors know this. All of the AI vendors know this is the biggest barrier to use: this fear around, we put sensitive data into this, it knows it, our competitors could just come along and ask for it, and it would just serve it up happily as an answer. That’s going away. That’s turning into an outdated perception.
TJ – 00:23:20:
We should have a session with you on just this, Scott. This is so good. I mean, not many people have come and talked about this with the clarity you just spoke with, because I think responsibility is a thing, and most of the companies are following that principle. I think it’s about just going in and checking; these are very simple things you just explained, but so high-impact, because it looks like there’s this whole aura and hype around security, but the tools are actually letting you decide whether you really want to share this or not. So, I think very rightly said. Scott, as we wrap up our conversation on generative AI and synthetic data and the future of insights, what is the one message or thought you would leave listeners and enterprises with so that they can embrace the transformative wave of this technology in the coming years?
Scott Swigart – 00:24:08:
I think the tech bro thing to say would be get in the arena. You know, if you haven’t used this stuff, use it. I think of high-risk and low-risk use cases. So low-risk use cases are where you’re dealing with data that isn’t sensitive at all. Maybe you’re dealing with data that’s already on your website, it’s already public, and you want to get a sense for how it could be improved. So, putting that stuff into an AI and just saying, critique this. Using it in your personal life, just to get comfortable, just to get familiar with it, just to understand how to prompt it and get answers back. These are low-risk use cases where you’re not asking it for an answer that you can’t check; you’re just asking it to be an assistant and help you work on something, and you’re using it for data that’s completely nonsensitive. And share what you learn with your colleagues, try to be an internal thought leader on this stuff, set up an AI task force, volunteer to be part of it. But just show an interest, get in the game, be curious. I mean, I think my biggest advice to anybody in life, in any aspect, is be relentlessly curious; it will never serve you wrong, and this field is no different. And then the higher-risk use cases are things like asking it for something that you can’t easily check. Synthetic data falls into that category, right? If it’s gonna be used, and it’s gonna be used at scale, you can’t verify everything that comes out of it with a real human interview, and so you do have to be careful. You have to know the boundaries. You have to know how far you’re stretching the rubber band. You have to know how well it’s grounded in a real human being. You have to be able to communicate that to stakeholders, because there are companies out there right now, I’ve seen at least three, that are selling synthetic data for market research, and I just think stakeholders are gonna run to this. They’re gonna love it.
AI is very good at giving a confident wrong answer, and you don’t wanna be fighting a rearguard action against that. They’re going there, right? So, if you don’t get ahead of this, if you don’t learn about synthetic data, if you don’t learn about the do’s and don’ts, they’re going to be coming to you and you’re going to be flat-footed. Your gut is gonna say, I don’t like this, I don’t think this is right, but you’re not gonna have anything to stand on. And this is absolutely coming; this is coming like an asteroid heading towards the market research industry. So, learn this to stay in the game, to not get bulldozed, to be a partner to those stakeholders, and to be somebody that they can trust and who is credible on this, because they don’t want bad data either.
TJ – 00:26:52:
Absolutely. Well, some of the statements you just made could well become messaging statements, the way you explained the whole thing. So, we’ll definitely look back into this recording and learn a few more things. Well, Scott, that brings us to the end of this amazing conversation. I totally loved the way you took us through the journey of synthetic data, how we should not be taking a step back but rather a step forward into adopting this technology, especially generative AI, because it’s here to stay, and certainly your insights into market research in general. So totally appreciate it; it’s been an amazing discussion. On that note, we appreciate you being on our show. Thanks, Scott.
Scott Swigart – 00:27:32:
Yeah, you can’t shut me up about this stuff. Loved being invited and really appreciate it.
TJ – 00:27:37:
It’s a pleasure. We’ll definitely be looking into some more of these engagements with you. I think there are a lot more ways we can leverage your experience. I hope you’ll be okay to come to one of our events and speak on a panel. Amazing to have you, Scott. There’s some really good stuff you just explained, and I think these are the sort of things we need a lot of enterprises to listen to as well. So, thanks for that.
Scott Swigart – 00:27:57:
Be happy to. Thanks a lot.
TJ – 00:28:01:
How impactful was that episode? Not Another Bot, the Generative AI show, is brought to you by yellow.ai. To find out more about Yellow.ai and how you can transform your business with AI-powered automation, visit Y-E-L-L-O-W.ai. And make sure to search for the Generative AI show in Apple Podcasts, Spotify, and Google Podcasts, or anywhere else podcasts are found. Make sure to click subscribe so you don’t miss any future episodes. On behalf of the team here at yellow.ai, thank you for listening.