Summarify.net

This is How I Scrape 99% of Sites

ji8F8ppY8bs — Published on YouTube channel John Watson Rooney on September 15, 2024, 11:00 AM


Summary

This summary is generated by AI and may contain inaccuracies.

- In this video, Speaker A shows how he goes about scraping almost every site he comes up against, and introduces the proxy provider he uses and the sponsor of the video, ProxyScrape.
- It is important to understand the API, its endpoints and what is happening in them before starting to work with them.
- Plain curl and requests calls get blocked, so the video switches to curl_cffi requests and checks the response status code; the reasons this happens, and how to avoid it, will be covered in an upcoming video.
- Getting the data is the hardest part of web scraping; how you model it and what you do with it afterwards is entirely up to you.

Video Description

Check Out ProxyScrape here: https://proxyscrape.com/?ref=jhnwr

➡ JOIN MY MAILING LIST
https://johnwr.com

➡ COMMUNITY
https://discord.gg/C4J2uckpbR
https://www.patreon.com/johnwatsonrooney

➡ PROXIES
https://proxyscrape.com/?ref=jhnwr

➡ HOSTING (Digital Ocean)
https://m.do.co/c/c7c90f161ff6

If you are new, welcome. I'm John, a self-taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscribe for weekly content.

⚠ DISCLAIMER
Some/all of the links above are affiliate links. By clicking on these links I receive a small commission should you choose to purchase any services or items.

This video was sponsored by ProxyScrape.

Transcription

This video transcription is generated by AI and may contain inaccuracies.

Speaker A: A large part of the work I do in scraping is ecommerce data, competitor analysis, product analysis and all that. And I want to show you in this video how I go about scraping almost every single site that I come up against, especially ones like this. So I've covered this before, but you absolutely don't want to be trying to pull out links and scrape the HTML. That's just not going to work. If you look over my head here, I'll make it a bit bigger. Parsing HTML for this is just not going to work. What we want to do is find the backend API that this site uses to hydrate the front end, to basically populate this data. To find that, we want to open up our inspect tool, our dev tools here in Chrome, and go to Network. I'll try and make this a little bit bigger, and then we need to start interrogating the site. Now, the first thing I always do, pretty much, is just scroll around and see what pops up. I'm going to click on Fetch/XHR, and it's the responses that are JSON that we are going to be interested in. You can either move around, go to different categories, or clicking on a product will do just fine.

Speaker B: When you start to scale up projects like this one, you'll find that your requests start to get blocked, and that's where you need to start using high quality proxies. I want to share with you the proxy provider that I use and the sponsor of this video, ProxyScrape. ProxyScrape gives us access to high quality, secure, fast and ethically sourced proxies that cover residential, datacenter and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool to use, all with unlimited concurrent sessions, from countries all over the globe, enabling us to scrape quickly and efficiently. My go-to is either geo-targeted residential proxies based on the location of the website, or the mobile proxies, as these are the best options for passing anti-bot protection on sites, and with auto rotation or sticky sessions it's a good first step to avoid being blocked. For the project we're working on today, I'm going to use the sticky sessions with residential proxies, holding onto a single IP for about three minutes. It's still only one line of code to add to your project, and then we can let ProxyScrape handle the rest from there. And any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Let's get on with the video.
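
The snippet itself isn't shown on screen at this point, but the "one line of code" being described amounts to a proxies mapping on the request. A minimal sketch, with a made-up proxy URL, since the real credentials, host and port come from your ProxyScrape dashboard:

```python
import requests  # the same proxies= mapping also works with curl_cffi's requests

# Hypothetical sticky-session residential proxy URL; substitute your own
# username, password, host and port from the provider dashboard.
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies)
print(response.status_code)
```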

Speaker A: So let's go ahead and look at what we've got here. Right away I can see a load of images and a load of JSON data. The one that I'm interested in straight away says availability, and this has all the product's availability, basically the stock numbers and the SKUs etcetera for this item. That's pretty handy, that's very relevant. And the other one is right here, which is sort of the whole product data, everything that comes with it. So we can see we've got all the images and stuff like that, and there's pricing information in here, metadata. If I collapse these, we can see everything coming up, pricing information. So this is essentially the data that I want. Now, I've shown you all this before in other videos, and if this is new to you, then I'll cover everything you need to do to get started with this. But what I haven't done before is show you more of a full project, which is what I'm going to go through in a minute. The first thing I want to do, though, is understand the API and the endpoints and what's happening. So I'm just going to copy the request URL for this one, which is the product. Now we can see that this is essentially just their API, and by hitting it like this we do indeed get the JSON response for this data. What that means is we could effectively take a different product, for example, let's see if I can grab the code for this one, and just put it on the end here. And we're going to get that information. But how do we go about getting these product codes? Well, there's another way that we can do this, and I'm going to keep this one open. So now I've got the product link here, and I'm going to open the availability one as well, so we can have all three and have a look. Where is the availability here? So again, the availability one is basically very straightforward. I'm going to paste this in here and we get the availability; again, if I change the product code, it's going to give us the availability for that product. Now, to actually find the product IDs, how would you find them on the website? Well, you could either go to a category or you might want to search, and this is kind of where I tend to go to start with. So I might type something like boots into the search, again with this open on this side, and here we go, 431 results. This is how I would typically look to get this information. So if I come back over to the data here that I had, I need to scroll to the bottom. Somewhere around here we're going to find a request. I wish it wouldn't show me all of these. Actually, what I'm going to do is delete all this, since I had all the other ones, and search again just so it comes up at the top. Okay, so this is it loading up. You can see it's loading up all these products, and this is because these are the products that have come from the search. So this endpoint is actually slightly different; it's going to give you different bits of information. We will cover that. The one I'm looking for is the actual search one here. Search query. There we go, I found it. So what this is, is basically hitting the API endpoint with the search query that we gave it. And again, I can put this in here. I wish this would go away, I don't know what this is for. And here is the response.
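
To keep the three endpoints straight, here is a rough sketch of the URL shapes being tested in the browser. The host and paths are placeholders; the real ones are whatever your own Network tab shows:

```python
# Placeholder URL templates; substitute the host and paths from the Network tab.
BASE = "https://www.example-store.com/api"

def product_url(product_id: str) -> str:
    # full product record: images, pricing, metadata
    return f"{BASE}/products/{product_id}"

def availability_url(product_id: str) -> str:
    # stock numbers and SKUs for one product
    return f"{BASE}/products/{product_id}/availability"

def search_url(query: str) -> str:
    # search results for a query, e.g. "boots"
    return f"{BASE}/search?query={query}"
```
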
Now I'm going to collapse a lot of this information and get rid of it, because we're not that interested in it. What we are interested in, if I make this full screen and have a good look, is that we have a view size and a view set size. We have the count, which is 431, which was the whole of the search. We have the search term, and then we have the items at 48 per page, which was the view size. We also have the current set and, I believe, no, there should be another one: the start index. Here we go. So what we can actually do is start to see whether any of these parameters are available for us to manipulate. If I change the start index to ten, what happens? Okay, that wasn't the right one, so start index didn't work. I'm going to change it, and quite often it's just start. Okay, start is the start index. That's fine. To find that out, you could try and guess it like that, but what you could also do is come back here and manually go to the next page with the developer tools open, and you would see it there. So if we scroll down, somewhere along here, start is 48. We can see that there. So you can do everything that you would do on the page, just keep an eye on the actual Network tab and you'll see everything come through. Now that I know that the start parameter works, we can start to put together something we can use to search. We want to start on index zero, I guess, and then we can go through the items.
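
As a quick aside, the pagination arithmetic being described here (431 results served 48 per page, walked by bumping a start offset) boils down to the following. The parameter name and page size come from the walkthrough; the helper itself is just illustrative:

```python
def page_starts(total: int, page_size: int = 48) -> range:
    """Return the `start` offsets needed to cover `total` search results."""
    return range(0, total, page_size)

# 431 results at 48 per page -> offsets 0, 48, 96, ..., 384
print(list(page_starts(431)))
```
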
What we have in the actual items response, somewhere down here, is a lot of good information actually, and in some cases this is enough, but in a lot of cases you do want to go deeper into the product itself. We have a product ID. So this product is some kind of kids' Superstar boot, right? So now we come back to our products endpoint and put this in here. Here's the product. Straight away it's come back and given us all this information, and the part that I want to look at the most is the pricing information. It's got a discount, all this cool stuff right here. Then we can of course go to the availability one, put the product code in, and here's the availability; this one has some stock. So you can see that we're starting to work out how their API works. Now, this is not that difficult, especially if you've either worked with REST APIs before or built one before. But my best advice, as I said, is just to look through the website. What I want to do now is take this and turn it into something we can repeat within our code. So I'm going to get rid of this; at the moment I don't think I'm going to need it, and we can always come back to it. I've got my terminal open here in a new folder. Let's make this a bit bigger. I'm going to create a virtual environment like so, and I'm going to activate it. What I want to show you now is a couple of interesting things. First I'm going to use curl. I'm going to take this endpoint that we know works in our browser, we can see it works there, and I paste it here and we get denied. So this is a curl error, and it's basically akin to saying we can't get this data like this. Well, let's try it with requests. So let's import requests and we'll do: our response is equal to requests.get, and let's put the URL in there. You can see that we're having issues here; we're not able to stream the data for whatever reason. So I'm going to change the headers. Can I clear this up? We'll do it this way. We're going to change the headers. We'll say our headers are equal to, because you always want to send a good user agent, right? Let me just grab one, my user agent, this one will be fine, put that in here. Oh, I need to sanitize and paste, please. There we go. Cool. So now we'll import requests again and we'll do: our response is equal to requests.get, we'll grab our URL again and put it in there, and we'll say our headers are equal to the headers that we just created, which is the user agent. And response.status_code is 403. Now, this is because of TLS fingerprinting. I'm going to cover this in much more depth in a video coming up, so if you're interested in finding out really why this is happening, what you can do to avoid it, and how everything works underneath the hood, you'll want to subscribe for that video. But essentially, what we want to do is this. I'm going to come out of this just so I don't get any namespace issues. Actually, I don't need to. We'll do: from curl_cffi import requests as creq. curl_cffi is going to give us a more consistent fingerprint that looks like a real browser. So what I can do now is go up to here. We don't need this one, we just want this. And instead of using actual requests I'm going to use the curl_cffi requests, and I'll check the response status code. And I got a 403, because I forgot to do this: impersonate equals, and we can just put chrome in here; you don't have to put the version. Now if I do response.status_code, we've got our 200, and response.json() is all the data. So we basically just needed to get our fingerprint sorted to make the request. You'll notice I didn't need any cookies, I didn't need any headers, I didn't need anything other than what curl_cffi or other TLS fingerprint spoofers do. There's a few out there and, as I said, I'll cover that in the following video.
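
For reference, the fingerprint fix just described is only a couple of lines. A minimal sketch, with a placeholder URL standing in for the real product endpoint:

```python
from curl_cffi import requests

url = "https://www.example-store.com/api/products/ID12345"  # placeholder endpoint

# impersonate="chrome" gives the request a Chrome-like TLS/HTTP fingerprint,
# which is what turns the 403 into a 200 here; no extra cookies or headers needed.
response = requests.get(url, impersonate="chrome")
print(response.status_code)  # expect 200 once the fingerprint is accepted
data = response.json()       # the same JSON the browser receives
```
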
So now that I know that this is going to work, I'm going to go into my project; we need to activate the environment here. I'm going to use pip3 to install that curl_cffi library: pip3 install curl_cffi. I'm also going to use rich, I always use rich for printing, and we're going to use pydantic, because I want to get to a point where we have modeled the data a bit better. So I will install these; I think that should be enough for us in this instance. I'm going to touch main.py and we'll open it up here. Now, I've imported everything that we're going to need, and I'm going to look at modeling my data a little bit closer. I've done this already, but essentially I'm going to take the products endpoint and the search endpoint so we can get that information. I haven't done the availability one, but you can add that on nice and easy now that you know the endpoint. So we're going to model this information: I'm basically just going to take what I want from here and create a pydantic model with it. The first one is the search item, which is going to have the product ID, the model ID, price, sale price, the display name and the rating. That all comes from the search endpoint. Then the same thing with the search response, which means I can easily find out and manipulate the page and count, etc. Like this, we can see the search term, the count of total items for that search, and the start index, which I talked about earlier; the items are then a list of search items. Then I've modeled the item detail, which is the information that I was after before. I've basically just put the product description and the pricing information in as dictionaries rather than modeling them, because this data is quite dynamic; I found some products that don't have all of this information, so it was easier just to do it like this, and the same with the product description. So it's up to you, but basically what I'm saying is: model your data from here. Next I'm creating a new session. I created a function for this because initially I thought maybe I would want to expand on this project and then be able to import this new session function into a different file or a different part of the project. So all I'm doing is creating a session, using a requests session, and again this is curl_cffi, so we have the impersonate argument here, and I'm also pulling in my proxy. Now, I talked about sticky proxies earlier and that's what I'm going to be using here. It's not actually essential to do so with this specific site, but there are sites that will match your fingerprint or your request with the IP address, and if it starts to differ it starts to get flagged. That's a lot less common, though, so this should be fine. Now I'm going to write a function that's going to query the search API. We need our session, our query string and our start number, and I've just put an f-string into the URL here to do that. Then I'm basically just going to get the data. We want something in place to handle a bad response, so I've put raise_for_status, which is going to throw an exception if we get anything that isn't a 200 response, basically letting me know if we're starting to get blocked. I'm not too fond of this, I think there's probably a more elegant way of handling it, but this will work just fine for now. Then we take the response data and push it into our search response model. We're unpacking it, and I'm unpacking from the raw and item list keys, which is essentially this piece of information here: raw, then this one, then this one here, and then I unpack everything that fits into my models. Again, it's up to you how you model your data. Then I return the search, which is typed as the search response model. I'm going to do exactly the same for the detail API, which is very, very similar. We pass in the item, and this is why I like to use models with my data, because now I can clearly see in this function that it takes in the search item and we use the item's product ID to put into our URL, rather than just having whatever piece of data from a dictionary. I find this much, much easier to read. We raise_for_status again and do the same thing: push our response JSON into our item detail model and return that out. And here's our main function. We create a new session and put a search term in. So again, this is our session that we're giving it. The search query parameter, which I defined in the other function, is hoodie, and the start index I put as one; that should probably be zero, but you get the idea. Now I'm just going to loop through all of these and print out the name of the product as we go through.
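
Pulling all of that together, here is a condensed sketch of what a main.py along these lines might look like. The endpoint URLs, the proxy URL, the JSON key names ("raw", "itemList") and the model field names are assumptions based on the walkthrough, not the site's real schema, so treat this as a shape to copy rather than working code for any particular site:

```python
from curl_cffi import requests
from pydantic import BaseModel
from rich import print

PROXY = "http://user:pass@sticky.proxy.example.com:1234"  # hypothetical sticky proxy
BASE = "https://www.example-store.com/api"                # hypothetical API base


class SearchItem(BaseModel):
    # field names assumed to match the search endpoint's JSON keys
    productId: str
    modelId: str
    price: float
    salePrice: float
    displayName: str
    rating: float | None = None


class SearchResponse(BaseModel):
    searchTerm: str
    count: int
    startIndex: int
    items: list[SearchItem]


class ItemDetail(BaseModel):
    id: str
    name: str
    # kept as plain dicts because their structure varies from product to product
    pricing_information: dict
    product_description: dict


def new_session() -> requests.Session:
    # curl_cffi session that impersonates Chrome and routes through the sticky proxy
    # (recent curl_cffi versions accept impersonate/proxies in the constructor)
    return requests.Session(
        impersonate="chrome",
        proxies={"http": PROXY, "https": PROXY},
    )


def search(session: requests.Session, query: str, start: int = 0) -> SearchResponse:
    resp = session.get(f"{BASE}/search?query={query}&start={start}")
    resp.raise_for_status()  # raise early if we start getting blocked
    return SearchResponse(**resp.json()["raw"]["itemList"])


def get_item_detail(session: requests.Session, item: SearchItem) -> ItemDetail:
    resp = session.get(f"{BASE}/products/{item.productId}")
    resp.raise_for_status()
    return ItemDetail(**resp.json())


def main() -> None:
    session = new_session()
    results = search(session, "hoodie", start=0)
    for item in results.items:
        detail = get_item_detail(session, item)  # full record: pricing, description
        print(item.displayName)


if __name__ == "__main__":
    main()
```
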
So I've got it to this point here. I wanted to show you up to here because this is the main part of getting the data, which is absolutely the hardest part of web scraping, along with understanding how a site's backend APIs work so you can manipulate them slightly to get the information that you're after. Once you've got that data, it's entirely up to you what you do with it. You could collect more here; you'd probably want to do the availability, etcetera. So I'm going to save this, come over here and run python main.py, and we should hopefully start to see some of the product names coming through. So I've searched for hoodie, and this is the information that's coming back. I'm just looping through the products that were on that first search page, it was 48, and I'm querying their API as if I was a browser, like I showed you on this page here, and just pulling the data out. This is the absolute best and easiest way to get data from websites like this. Website owners and site designers will find it very, very difficult to protect their backend API in such a way that their front end can still access it; just by the nature of it, this happens a lot. Now, it's not always going to be as easy as this, but you'd be surprised how often it is. The only thing I will say is that if you're going to do this, you're going to be able to pull a lot of data quite quickly, so I would always say be considerate and don't hammer it. If you hammer it you're probably going to get blocked, and they'll find out anyway. Just pull the data that you need. It's all publicly available data; I'm not using any API keys here, I'm not using anything that I shouldn't, I'm just pulling it in the most convenient and easy fashion possible. So hopefully you've got the idea and you can mimic this now with your own projects, etcetera. If you've enjoyed this video I'd really appreciate a like, comment and subscribe, it makes a whole load of difference to me, it really does. Check out the Patreon, I always post stuff early on there, or consider joining the YouTube channel down below as well. There's another video right here which, if you watch now, will continue my watch time across YouTube and they will promote my channel more. Thanks, bye.