Preface

Recently I used a web crawler to grab roughly 10,000 of NASA’s Mars exploration pictures. Yes, that NASA, the one you see in the movies.

Yeah, no big deal, no big deal.

I was a little excited afterwards, so I wrote this article. It covers the following:

  1. Why am I crawling NASA pictures
  2. How I crawled NASA images (super detailed)
  3. What do I get
  4. What did I find out (super hot)

Why am I crawling NASA pictures

I’m over 35, and I don’t know when I might get laid off.

Every day I think about what I would do if I suddenly lost my job. One idea is to run a self-media channel and chat with people every day about, you know, historical mysteries and cosmic mysteries. That’s how I ended up looking at NASA.

NASA has a great library of articles, interviews, photos, and videos about space exploration missions.

How I crawled NASA images (super detailed)

NASA’s website is publicly accessible at

https://www.nasa.gov/

When you open it, the front page shows all kinds of content. There’s also a search box at the top right; type Mars into it.

A moment later you get all sorts of results about Mars, including Mars Exploration.

Click on it to reach a new page, then find Images. That takes you to the target page we’re going to crawl:

https://www.nasa.gov/mission_pages/mars/images/index.html

Scroll down and you’ll see a big button that says “MORE IMAGES.” Click it and you’ll notice something:

The extra content is not part of the initial page load. It is rendered asynchronously from API requests.

Press F12 to open the browser’s developer tools, repeat the previous steps, and watch the network requests. One request stands out.

This URL looks very important, so let’s look at the request address:

https://www.nasa.gov/api/2/ubernode/_search?size=24&from=24&sort=promo-date-time%3Adesc&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))&_source_include=promo-date-time%2Cmaster-image%2Cnid%2Ctitle%2Ctopics%2Cmissions%2Ccollections%2Cother-tags%2Cubernode-type%2Cprimary-tag%2Csecondary-tag%2Ccardfeed-title%2Ctype%2Ccollection-asset-link%2Clink-or-attachment%2Cpr-leader-sentence%2Cimage-feature-caption%2Cattachments%2Curi

Notice these two parameters:

size=24&from=24

size is the number of images returned per request, and from is the offset of the first result. By changing from, we can fetch the other pages.
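As a rough sketch of how these two parameters drive paging, the helpers below build a request URL for a given offset and list the offsets of the first few pages. The names buildSearchUrl and pageOffsets are made up for illustration, and the URL is shortened for readability (the _source_include list from the captured request is left out), so treat it as an assumption rather than a tested request.

```dart
// Sketch only: `buildSearchUrl` and `pageOffsets` are hypothetical helpers.

const int pageSize = 24; // matches size=24 in the captured request

/// Builds a (shortened) search URL for the page starting at [from].
/// The real request also carries a long _source_include parameter.
String buildSearchUrl(int from, {int size = pageSize}) {
  return 'https://www.nasa.gov/api/2/ubernode/_search'
      '?size=$size'
      '&from=$from'
      '&sort=promo-date-time%3Adesc'
      '&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))';
}

/// Offsets of the first [pages] pages: 0, 24, 48, ...
List<int> pageOffsets(int pages, {int size = pageSize}) =>
    List<int>.generate(pages, (i) => i * size);
```

For example, pageOffsets(3) gives [0, 24, 48], and buildSearchUrl(48) produces a URL containing size=24&from=48, which would be the third page.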

Let’s look at its return message again:

{
	"took": 3,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 659,
		"max_score": null,
		"hits": [{
			"_index": "nasa-public",
			"_type": "ubernode",
			"_id": "450040",
			"_score": null,
			"_source": {
				"image-feature-caption": "Mars 2020 rover underwent an eye exam after several cameras were installed on the rover. ",
				"topics": ["3140", "3152"],
				"nid": "450040",
				"title": "NASA 'Optometrists' Verify Mars 2020 Rover's 20/20 Vision",
				"type": "ubernode",
				"uri": "/image-feature/jpl/nasa-optometrists-verify-mars-2020-rovers-2020-vision",
				"collections": ["4525", "5246"],
				"link-or-attachment": "link",
				"missions": ["6336"],
				"primary-tag": "6336",
				"cardfeed-title": "NASA 'Optometrists' Verify Mars 2020 Rover's 20/20 Vision",
				"promo-date-time": "2019-08-05T17:49:00-04:00",
				"secondary-tag": "3140",
				"master-image": {
					"fid": "603128",
					"alt": "Engineers test cameras on the top of the Mars 2020 Rover's Mast and front Chassis.",
					"width": "1600",
					"id": "603128",
					"title": "Engineers test cameras on the top of the Mars 2020 Rover's Mast and front Chassis.",
					"uri": "public://thumbnails/image/pia23314-16.jpg",
					"height": "900"
				},
				"ubernode-type": "image"
			},
			"sort": [1565041740000]
		}, {
			"_index": "nasa-public",
			"_type": "ubernode",
			"_id": "433172",
			"_score": null,
			"_source": {
				"image-feature-caption": "NASA still hasn't heard from the Opportunity rover, but at least we can see it again.",
				"topics": ["3152"],
				"nid": "433172",
				"title": "Opportunity Emerges in a Dusty Picture",
				"type": "ubernode",
				"uri": "/image-feature/opportunity-emerges-in-a-dusty-picture",
				"collections": ["7628"],
				"link-or-attachment": "link",
				"missions": ["3639"],
				"primary-tag": "3152",
				"cardfeed-title": "Opportunity Emerges in a Dusty Picture",
				"promo-date-time": "2018-09-26T12:39:00-04:00",
				"secondary-tag": "7628",
				"master-image": {
					"fid": "584263",
					"alt": "NASA's Opportunity rover appears as a blip in the center of this square",
					"width": "1400",
					"id": "584263",
					"title": "NASA's Opportunity rover appears as a blip in the center of this square",
					"uri": "public://thumbnails/image/pia22549-16.jpg",
					"height": "788"
				},
				"ubernode-type": "image"
			},
			"sort": [1537979940000]
		}]
	}
}

The full JSON is too long, so I removed the repeated entries. The real hits array has 24 items, the same as the number of images displayed per page, so you can pretty much assume that the information on the page comes from this array.
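The sample response also reports hits.total (659 here), which suggests a way to work out how many requests are needed instead of guessing. The sketch below assumes total is a plain integer, as in the sample; readTotal and offsetsFor are hypothetical helper names, not part of the crawler in this article.

```dart
import 'dart:convert';

/// Reads hits.total from a response body.
/// Assumes the field is a plain integer, as in the sample response.
int readTotal(String body) {
  final decoded = jsonDecode(body) as Map<String, dynamic>;
  return decoded['hits']['total'] as int;
}

/// Offsets needed to cover [total] results at [size] results per request.
List<int> offsetsFor(int total, {int size = 24}) {
  final pages = (total + size - 1) ~/ size; // ceiling division
  return [for (var i = 0; i < pages; i++) i * size];
}
```

With total = 659 and size = 24 this comes to 28 requests, the last one starting at from=648.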

Comparing the response with the page shows that the master-image field holds the information we need: the image address, the image size, and the image title.
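To make the shape of that field concrete, here is a minimal sketch that pulls the master-image data out of a response body using only dart:convert. The MasterImage class and parseMasterImages function are illustrative names; the sketch assumes every hit has a _source object and that width and height arrive as strings, as they do in the sample.

```dart
import 'dart:convert';

/// The fields of `master-image` that matter for downloading.
class MasterImage {
  final String uri;
  final String title;
  final int? width; // null if the value is missing or not a number
  final int? height;
  const MasterImage(this.uri, this.title, this.width, this.height);
}

/// Returns one MasterImage per hit in the response body.
/// Hits without a `master-image` object are skipped.
List<MasterImage> parseMasterImages(String body) {
  final decoded = jsonDecode(body) as Map<String, dynamic>;
  final hits = decoded['hits']['hits'] as List<dynamic>;
  final images = <MasterImage>[];
  for (final hit in hits) {
    final image = hit['_source']['master-image'];
    if (image is! Map) continue;
    images.add(MasterImage(
      image['uri'].toString(),
      image['title'].toString(),
      int.tryParse(image['width'].toString()),
      int.tryParse(image['height'].toString()),
    ));
  }
  return images;
}
```

Run against the second sample hit, this would yield a uri of public://thumbnails/image/pia22549-16.jpg with width 1400 and height 788.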

Here’s the code. It works in three steps: assemble the request URL, fetch the content, and download the images.

I used Dart. Feel free to use whatever language you like.

import 'dart:convert';
import 'package:dio/dio.dart';

Future<void> main() async {
  // The number of images per page is fixed at 24
  for (int from = 0; from < 24 * 100; from = from + 24) {
    await getPage(from);
  }
}

// Get the information on each page and download it
Future<void> getPage(int from) async {
  String url = 'https://www.nasa.gov/api/2/ubernode/_search?size=24&from=' +
      from.toString() +
      '&sort=promo-date-time%3Adesc&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))&_source_include=promo-date-time%2Cmaster-image%2Cnid%2Ctitle%2Ctopics%2Cmissions%2Ccollections%2Cother-tags%2Cubernode-type%2Cprimary-tag%2Csecondary-tag%2Ccardfeed-title%2Ctype%2Ccollection-asset-link%2Clink-or-attachment%2Cpr-leader-sentence%2Cimage-feature-caption%2Cattachments%2Curi';

  // Get the content
  var res = await Dio().get(url);
  var map = jsonDecode(res.toString());

  // Download the images one by one. A plain for loop waits for each download
  // before getPage returns; forEach with an async callback would not.
  for (final element in map['hits']['hits'] as List<dynamic>) {
    Uri fileUri = Uri.parse(getUri(element));
    String savePath = getSavePath(element);

    await Dio().downloadUri(fileUri, savePath);
    print('Downloaded: ' + savePath);
  }
}

// Get the image download address
String getUri(dynamic element) {
  String uri = element['_source']['master-image']['uri'].toString();

  // Turn the internal public:// scheme into a real download URL
  uri = uri.replaceAll('public://',
      'https://www.nasa.gov/sites/default/files/styles/full_width_feature/public/');

  return uri;
}

// Process the information and return the image save address
String getSavePath(dynamic element) {
  String id = element['_id'];
  String fid = element['_source']['master-image']['fid'].toString();
  String title = element['_source']['master-image']['title'].toString();
  String uri = element['_source']['master-image']['uri'].toString();

  // File name: <id>_<fid>_<title>.<extension of the original image>
  String savePath =
      id + '_' + fid + '_' + title.trim() + '.' + uri.split('.').last;
  // Replace characters that are awkward in file names with spaces
  savePath = savePath.replaceAll('/', ' ');
  savePath = savePath.replaceAll('\\', ' ');
  savePath = savePath.replaceAll('"', ' ');
  savePath = 'images/' + savePath;

  return savePath;
}

The code above is quite simple; experienced readers should understand it at a glance.

Let’s run it.

Downloaded: images/470436_643588_This is the third color image taken by NASA's Ingenuity fly.jpg
Downloaded: images/470435_643587_This is the second color image taken by NASA's Ingenuity fly.jpg
Downloaded: images/468546_639327_This is the first high-resolution, color image to be sent back by the Hazard Cameras (Hazcams).jpg
Downloaded: images/458478_615132_Gullies on Mars.jpg
Downloaded: images/469416_641582_A field of sand dunes occupies this frosty 5-kilometer diameter crater in the high-latitudes of the northern plains of Mars..
Downloaded: images/458075_614251_Mars 2020 With Sample Tubes (Artist's Concept).jpg
Downloaded: images/470381_643473_cme.jpg
Downloaded: images/458813_615896_mars.jpg
Downloaded: images/467026_635309_Illustration of NASA's Perseverance Rover begins its descent through the Martian atmosphere
Downloaded: images/470438_643591_This black and white image was taken by NASA's Ingenuity helicopter during its third flight on April 25, 2021.jpg
Cliffs in Ancient Ice on Mars.jpg
Downloaded: images/463659_626874_Avalanche on Mars.jpg
Downloaded: images/470251_643164_This image from NASA's Perseverance Rover shows the Agency's Ingenuity Mars Helicopter right after it successfully completed a high-speed spin-up test..jpeg
Downloaded: images/468636_639726_Mars' Jezero Crater.jpg
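The file names in that output follow the pattern id_fid_title.extension. As a small worked example, the sketch below applies the same two transformations as the crawler (public:// to a download URL, and hit fields to a save path) as pure string functions, so they can be checked without touching the network. The names toDownloadUrl and toSavePath are made up here, and the single RegExp is a stand-in for the three replaceAll calls.

```dart
/// Maps a `public://` URI to a download URL, as the crawler above does.
String toDownloadUrl(String uri) => uri.replaceFirst(
    'public://',
    'https://www.nasa.gov/sites/default/files/styles/full_width_feature/public/');

/// Builds `images/<id>_<fid>_<title>.<ext>`, replacing slashes,
/// backslashes and double quotes with spaces.
String toSavePath(String id, String fid, String title, String uri) {
  final ext = uri.split('.').last; // e.g. "jpg"
  final name = '${id}_${fid}_${title.trim()}.$ext'
      .replaceAll(RegExp(r'[/\\"]'), ' ');
  return 'images/$name';
}
```

Feeding in the second sample hit (id 433172, fid 584263, uri public://thumbnails/image/pia22549-16.jpg) gives the download URL https://www.nasa.gov/sites/default/files/styles/full_width_feature/public/thumbnails/image/pia22549-16.jpg and the save path images/433172_584263_NASA's Opportunity rover appears as a blip in the center of this square.jpg.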

What do I get

These pictures

And these

I’ve got the pictures and I’ve got the captions. That’s enough material for a month, I think.

What did I find out

This is my favorite picture. Why is one so clear and the other so cloudy? A Martian rift generator?

Well, here’s the real secret:

NASA’s website has no anti-scraping protection. Give it a try…